Chapter 20 Sparsity
This is a simulation to illustrate some of the properties of Lasso-type estimators. There are two objectives in using these penalized regressions: model selection (identifying the “correct” sparsity) and prediction accuracy. These two objectives require different optimization approaches and are usually not compatible. In model selection, the objective is to shrink the dimension of the model down to the “true” sparsity. This is usually evaluated by checking whether the Oracle properties are satisfied. These asymptotic properties check (1) whether the model identified by the penalized regression converges to the “true” sparsity, and (2) whether the coefficients are consistent.
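In the notation used later in this chapter, with \(\mathcal{A}=\left\{j: \beta_{0, j} \neq 0\right\}\) denoting the set of truly non-zero coefficients, these two properties are commonly written as follows (a standard statement of the Oracle properties, sketched here for reference):
\[ P\left(\hat{\mathcal{A}}_n=\mathcal{A}\right) \rightarrow 1 \quad \text{and} \quad \sqrt{n}\left(\hat{\beta}_{\mathcal{A}}-\beta_{0, \mathcal{A}}\right) \stackrel{d}{\rightarrow} N\left(0, \Sigma^*\right), \]
where \(\Sigma^*\) is the covariance matrix we would obtain if the estimation used only the truly relevant variables.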
The literature suggests that Lasso is not an oracle estimator. Adaptive Lasso was developed (Zou 2006) to fill this gap.
Let’s specify a data generating process with a linear regression model:
\[ y_i=x_i^{\prime} \beta+u_i, ~~~~~i=1, \ldots, n \]
where \(\beta\) is \(p \times 1\). First, we consider the case where \(p<n\) then move to the case where \(p \geq n\). We define \(\beta=(1,1,0,0)^{\prime}\) and \(n=100\).
# This function generates the data
dgp <- function(N, Beta) {
  p = length(Beta)

  X <- matrix(rnorm(N * p), ncol = p)
  u <- matrix(rnorm(N), ncol = 1)
  dgm <- X %*% Beta
  y <- X %*% Beta + u

  return(list(y, X))
}
N = 100
Beta = c(1, 1, 0, 0)

set.seed(148)
Output <- dgp(N, Beta)
y <- Output[[1]]
X <- Output[[2]]
First, we apply lasso:
library(glmnet)

set.seed(432)
lasso <- glmnet(x = X, y = y, family = "gaussian")

beta_hat <- lasso$beta
S_matrix <- cbind(t(beta_hat), "lambda" = lasso$lambda)
S_matrix[c(1:8, 25:30, 55:60), ] # selected rows
## 20 x 5 sparse Matrix of class "dgCMatrix"
## V1 V2 V3 V4 lambda
## s0 . . . . 1.083220708
## s1 0.09439841 0.0283513 . . 0.986990366
## s2 0.17344129 0.1097255 . . 0.899308862
## s3 0.24546220 0.1838706 . . 0.819416741
## s4 0.31108496 0.2514289 . . 0.746622016
## s5 0.37087798 0.3129855 . . 0.680294174
## s6 0.42535915 0.3690736 . . 0.619858715
## s7 0.47500037 0.4201789 . . 0.564792175
## s24 0.87944075 0.8365481 . . 0.116150206
## s25 0.88874261 0.8461243 . . 0.105831742
## s26 0.89685610 0.8542117 -0.00686322 . 0.096429941
## s27 0.90418482 0.8614679 -0.01432988 . 0.087863371
## s28 0.91086250 0.8680794 -0.02113323 . 0.080057832
## s29 0.91694695 0.8741036 -0.02733218 . 0.072945714
## s54 0.98352129 0.9289175 -0.09282009 0.05192379 0.007126869
## s55 0.98423271 0.9294382 -0.09350608 0.05278151 0.006493738
## s56 0.98488092 0.9299126 -0.09413113 0.05356303 0.005916852
## s57 0.98547155 0.9303449 -0.09470066 0.05427512 0.005391215
## s58 0.98600972 0.9307388 -0.09521958 0.05492395 0.004912274
## s59 0.98650007 0.9310977 -0.09569241 0.05551515 0.004475881
Which set of beta_hat should we select? To answer this question we need to choose a value of lambda. To shrink the truly zero coefficients all the way to zero, \(\lambda_n\) has to be sufficiently large (asymptotically, \(\lambda_n \rightarrow \infty\)); but a large \(\lambda_n\) also introduces asymptotic bias into the non-zero coefficients.
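To see where \(\lambda\) enters, recall the penalized problem that glmnet solves for the Gaussian family (written here up to glmnet's standardization and scaling conventions):
\[ \hat{\beta}_{\text{lasso}}(\lambda)=\arg\min_{\beta} \frac{1}{2n} \sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda \sum_{j=1}^p\left|\beta_j\right| \]
Each column of lasso$beta above is the solution \(\hat{\beta}_{\text{lasso}}(\lambda)\) for one value of \(\lambda\) on the grid stored in lasso$lambda, which is why we need a criterion to pick a single column.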
In practice, choosing \(\lambda_n\) by \(\mathrm{BIC}\) (Bayesian Information Criterion) results in consistent model selection in the fixed \(p\) setting. That is, letting \(\mathcal{A}=\left\{j: \beta_{0, j} \neq 0\right\}\) denote the active set of relevant variables,
\[ P\left(\hat{\mathcal{A}}_{\lambda_{BIC}}=\mathcal{A}\right) \rightarrow 1 \]
To compute it, let \(S S E_\lambda\) be the sum of squared errors for a given value of \(\lambda\) and \(n z_\lambda\) the number of non-zero coefficients. The criterion is then
\[ B I C_\lambda=\log \left(S S E_\lambda\right)+\frac{\log (n)}{n} n z_\lambda \]
# Predict y_hat for each lambda (s)
y_hat = predict(lasso, newx = X)
dim(y_hat)
## [1] 100 60
# SSE for each lambda (s)
SSE <- c()
for (i in 1:ncol(y_hat)) {
  SSE_each <- sum((y_hat[, i] - y[, 1]) ^ 2)
  SSE <- c(SSE, SSE_each)
}

# BIC
nz <- colSums(beta_hat != 0) # number of non-zero coefficients for each lambda
BIC <- log(SSE) + (log(N) / N) * nz
BIC
## s0 s1 s2 s3 s4 s5 s6 s7
## 5.598919 5.595359 5.468287 5.348947 5.237755 5.135013 5.040883 4.955387
## s8 s9 s10 s11 s12 s13 s14 s15
## 4.878394 4.809638 4.748729 4.695181 4.648437 4.607898 4.572946 4.542971
## s16 s17 s18 s19 s20 s21 s22 s23
## 4.517383 4.495631 4.477205 4.461646 4.448541 4.437530 4.428295 4.420563
## s24 s25 s26 s27 s28 s29 s30 s31
## 4.414098 4.408698 4.448661 4.443309 4.438844 4.435121 4.432021 4.429439
## s32 s33 s34 s35 s36 s37 s38 s39
## 4.427290 4.425503 4.424017 4.468004 4.466218 4.464732 4.463498 4.462471
## s40 s41 s42 s43 s44 s45 s46 s47
## 4.461618 4.460910 4.460321 4.459832 4.459426 4.459088 4.458808 4.458575
## s48 s49 s50 s51 s52 s53 s54 s55
## 4.458382 4.458222 4.458088 4.457978 4.457886 4.457810 4.457746 4.457694
## s56 s57 s58 s59
## 4.457650 4.457614 4.457584 4.457559
And, we select the set of coefficients with the minimum BIC:
beta_lasso <- beta_hat[, which(BIC == min(BIC))]
beta_lasso
## V1 V2 V3 V4
## 0.8887426 0.8461243 0.0000000 0.0000000
This is the beta_hat that identifies the true sparsity. And, for the second Oracle property, we compute the \(\ell_2\) error:
l_2 <- sqrt(sum((beta_lasso - Beta) ^ 2))
l_2
## [1] 0.189884
Here we will create a simulation that reports the two Oracle properties for Lasso and Adaptive Lasso:
- True sparsity,
- \(\ell_2\) error.
Lasso
We first have a function, mcs(), that executes a simulation with all the steps shown before:
mcs <- function(mc, N, Beta) {
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))

  for (i in 1:mc) {
    set.seed(i)
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]

    set.seed(i)
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta # beta_hat is a matrix
    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ 2)
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]
    nonz_beta = length(Beta[Beta == 0])
    nonz_beta_hat = length(beta_lasso[beta_lasso == 0])

    mcmat[i, 1] <- sqrt(sum((beta_lasso - Beta) ^ 2))       # l_2 error
    mcmat[i, 2] <- ifelse(nonz_beta != nonz_beta_hat, 0, 1) # 1 if the number of zeros is correct
    mcmat[i, 3] <- sum(beta_lasso != 0)                     # number of selected variables
    beta_lasso_mat[i, ] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}
We are ready for simulation:
mc <- 500
N <- 1000
Beta <- matrix(c(1, 1, 0, 0), ncol = 1)

output <- mcs(mc, N, Beta) # see the function

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[, 2]) # how many times lasso finds the true sparsity
## [1] 400
This is the first property: lasso identifies the true sparsity in \(400/500 = 80\%\) of the cases. And the second property, the total \(\ell_2\) error across the 500 replications, is:
sum(MC_performance[, 1])
## [1] 29.41841
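As a quick follow-up (a sketch using the objects already created above, not part of the original simulation output), we could also tabulate how many variables lasso retains in each replication and report per-replication averages:

# Hypothetical diagnostics on the Monte Carlo output above
table(MC_performance[, 3]) # distribution of the number of selected variables
mean(MC_performance[, 1])  # average l_2 error per replication
colMeans(MC_betas)         # average coefficient estimates across replications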
Adaptive Lasso
This time we let our adaptive lasso use the lasso coefficients as penalty weights in glmnet(). Let’s have the same function with Adaptive Lasso for the simulation:
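Formally, the adaptive lasso (Zou 2006) replaces the common \(\ell_1\) penalty with coefficient-specific weights built from a first-stage estimator \(\tilde{\beta}\) (here, the BIC-selected lasso coefficients) with \(\gamma=1\):
\[ \hat{\beta}_{\text{alasso}}=\arg\min_{\beta} \frac{1}{2n} \sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda \sum_{j=1}^p \hat{w}_j\left|\beta_j\right|, \quad \hat{w}_j=\frac{1}{|\tilde{\beta}_j|^{\gamma}} \]
In glmnet() these weights enter through the penalty.factor argument; coefficients whose first-stage estimate is zero should receive an (effectively) infinite weight, which is why the code below caps their weights at \(10^{10}\).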
# Adaptive LASSO
mcsA <- function(mc, N, Beta) {
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))

  for (i in 1:mc) {
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]

    # First-stage lasso, with lambda chosen by BIC
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta

    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ 2)
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]

    # Penalty weights from the first-stage coefficients
    weights = abs(beta_lasso) ^ (-1)
    weights[beta_lasso == 0] = 10 ^ 10 # to handle inf's

    # Now Adaptive Lasso
    lasso <- glmnet(
      x = X,
      y = y,
      family = "gaussian",
      penalty.factor = weights
    )
    beta_hat <- lasso$beta

    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ 2)
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]
    nonz_beta = length(Beta[Beta == 0])
    nonz_beta_hat = length(beta_lasso[beta_lasso == 0])

    mcmat[i, 1] <- sqrt(sum((beta_lasso - Beta) ^ 2))       # l_2 error
    mcmat[i, 2] <- ifelse(nonz_beta != nonz_beta_hat, 0, 1) # 1 if the number of zeros is correct
    mcmat[i, 3] <- sum(beta_lasso != 0)                     # number of selected variables
    beta_lasso_mat[i, ] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}
Here are the results for adaptive lasso:
mc <- 500
N <- 1000
beta <- matrix(c(1, 1, 0, 0), ncol = 1)

output <- mcsA(mc, N, beta) # see the function

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[, 2])
## [1] 492
And the total \(\ell_2\) error is:
sum(MC_performance[,1])
## [1] 20.21311
The simulation results support the claim that Adaptive Lasso is an oracle estimator and a better choice when the goal is to recover the true sparsity.
We saw here a basic application of adaptive lasso, which has several variations in practice, such as Thresholded Lasso and Rigorous Lasso. Model selection with lasso has been an active research area. One of the well-known applications is the double-selection lasso linear regression method, which can be used for variable selection. Moreover, lasso-type applications are also used in time-series forecasting and in graphical network analysis for dimension reduction.
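Finally, note that the BIC-based choice of \(\lambda\) used throughout this chapter targets model selection. When the objective is prediction accuracy instead (the other goal mentioned at the beginning), \(\lambda\) is usually chosen by cross-validation. A minimal sketch with glmnet's built-in cross-validation, reusing the X and y generated at the beginning of the chapter (the seed is arbitrary and only for reproducibility):

# Cross-validated lambda, aimed at prediction rather than selection
library(glmnet)
set.seed(99)
cv_fit <- cv.glmnet(x = X, y = y, family = "gaussian", nfolds = 10)
cv_fit$lambda.min              # lambda minimizing the cross-validated error
coef(cv_fit, s = "lambda.min") # coefficients at that lambda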