Chapter 20 Sparsity
This is a simulation to illustrate some of the properties of Lasso-type estimators. There are two objectives in using these penalized regressions: model selection (identifying the "correct" sparsity) and prediction accuracy. These two objectives require different tuning strategies and are usually not compatible. In model selection, the objective is to shrink the dimension of the model down to the "true" sparsity. This is usually evaluated by checking whether the Oracle properties are satisfied. These asymptotic properties ask (1) whether the model identified by the penalized regression converges to the "true" sparsity, and (2) whether the coefficient estimates are consistent.
The literature shows that the Lasso is not an "Oracle" estimator. Adaptive Lasso was developed (Zou 2006) to fill this gap.
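For reference (standard notation, not specific to this chapter), the Lasso solves
\[ \hat{\beta}^{L}=\underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n}\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^{p}\left|\beta_j\right|, \]
while the Adaptive Lasso of Zou (2006) replaces the uniform penalty with data-driven weights \(\hat{w}_j=1 /|\tilde{\beta}_j|^{\gamma}\), \(\gamma>0\), built from a first-stage estimate \(\tilde{\beta}\) (OLS in Zou's paper; the Lasso itself in the simulation below):
\[ \hat{\beta}^{AL}=\underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n}\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^{p} \hat{w}_j\left|\beta_j\right| . \]
Large first-stage coefficients receive small weights (little shrinkage), while coefficients close to zero receive very large weights and are pushed to exactly zero.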
Let’s specify a data generating process with a linear regression model:
\[ y_i=x_i^{\prime} \beta_0+u_i, ~~~~~i=1, \ldots, n \]
where \(\beta_0\) is \(p \times 1\). First, we consider the case where \(p<n\) and then move to the case where \(p \geq n\). We define \(\beta_0=(1,1,0,0)^{\prime}\) and \(n=100\).
#This function generates the data
dgp <- function(N, Beta){
  p = length(Beta)

  X <- matrix(rnorm(N*p), ncol = p)   # regressors
  u <- matrix(rnorm(N), ncol = 1)     # error term
  dgm <- X%*%Beta                     # deterministic part (not used below)
  y <- X%*%Beta + u                   # outcome

  return(list(y, X))
}

N = 100
Beta = c(1,1,0,0)

set.seed(148)
Output <- dgp(N, Beta)
y <- Output[[1]]
X <- Output[[2]]
First, we apply Lasso:
library(glmnet)
set.seed(432)
lasso <- glmnet(x = X, y = y, family = "gaussian")

beta_hat <- lasso$beta
cbind(t(beta_hat), "lambda" = lasso$lambda)
## 60 x 5 sparse Matrix of class "dgCMatrix"
## V1 V2 V3 V4 lambda
## s0 . . . . 1.083220708
## s1 0.09439841 0.0283513 . . 0.986990366
## s2 0.17344129 0.1097255 . . 0.899308862
## s3 0.24546220 0.1838706 . . 0.819416741
## s4 0.31108496 0.2514289 . . 0.746622016
## s5 0.37087798 0.3129855 . . 0.680294174
## s6 0.42535915 0.3690736 . . 0.619858715
## s7 0.47500037 0.4201789 . . 0.564792175
## s8 0.52023159 0.4667442 . . 0.514617595
## s9 0.56144460 0.5091728 . . 0.468900386
## s10 0.59899636 0.5478321 . . 0.427244568
## s11 0.63321213 0.5830571 . . 0.389289338
## s12 0.66438826 0.6151527 . . 0.354705946
## s13 0.69279478 0.6443971 . . 0.323194848
## s14 0.71867775 0.6710435 . . 0.294483108
## s15 0.74226135 0.6953227 . . 0.268322040
## s16 0.76374985 0.7174449 . . 0.244485050
## s17 0.78332937 0.7376020 . . 0.222765672
## s18 0.80116950 0.7559683 . . 0.202975783
## s19 0.81742476 0.7727030 . . 0.184943974
## s20 0.83223594 0.7879510 . . 0.168514061
## s21 0.84573135 0.8018445 . . 0.153543737
## s22 0.85802785 0.8145036 . . 0.139903334
## s23 0.86923197 0.8260382 . . 0.127474708
## s24 0.87944075 0.8365481 . . 0.116150206
## s25 0.88874261 0.8461243 . . 0.105831742
## s26 0.89685610 0.8542117 -0.00686322 . 0.096429941
## s27 0.90418482 0.8614679 -0.01432988 . 0.087863371
## s28 0.91086250 0.8680794 -0.02113323 . 0.080057832
## s29 0.91694695 0.8741036 -0.02733218 . 0.072945714
## s30 0.92249088 0.8795926 -0.03298044 . 0.066465418
## s31 0.92754230 0.8845940 -0.03812692 . 0.060560814
## s32 0.93214497 0.8891511 -0.04281620 . 0.055180758
## s33 0.93633875 0.8933034 -0.04708890 . 0.050278651
## s34 0.94015996 0.8970867 -0.05098203 . 0.045812035
## s35 0.94463571 0.9004434 -0.05531593 0.005031468 0.041742220
## s36 0.94880161 0.9034936 -0.05933362 0.010054959 0.038033956
## s37 0.95259743 0.9062729 -0.06299446 0.014632189 0.034655124
## s38 0.95605603 0.9088053 -0.06633008 0.018802790 0.031576458
## s39 0.95920739 0.9111127 -0.06936937 0.022602886 0.028771293
## s40 0.96207878 0.9132152 -0.07213866 0.026065393 0.026215331
## s41 0.96468945 0.9151338 -0.07466124 0.029219211 0.023886433
## s42 0.96707383 0.9168790 -0.07696041 0.032093941 0.021764428
## s43 0.96924639 0.9184692 -0.07905533 0.034713288 0.019830936
## s44 0.97122595 0.9199181 -0.08096415 0.037099940 0.018069210
## s45 0.97302965 0.9212383 -0.08270339 0.039274568 0.016463992
## s46 0.97467312 0.9224412 -0.08428812 0.041256008 0.015001376
## s47 0.97617058 0.9235372 -0.08573207 0.043061423 0.013668695
## s48 0.97753502 0.9245359 -0.08704774 0.044706449 0.012454406
## s49 0.97877824 0.9254459 -0.08824654 0.046205336 0.011347991
## s50 0.97991101 0.9262750 -0.08933883 0.047571066 0.010339867
## s51 0.98094316 0.9270304 -0.09033409 0.048815468 0.009421301
## s52 0.98188361 0.9277188 -0.09124093 0.049949321 0.008584339
## s53 0.98274051 0.9283460 -0.09206721 0.050982445 0.007821730
## s54 0.98352129 0.9289175 -0.09282009 0.051923790 0.007126869
## s55 0.98423271 0.9294382 -0.09350608 0.052781508 0.006493738
## s56 0.98488092 0.9299126 -0.09413113 0.053563029 0.005916852
## s57 0.98547155 0.9303449 -0.09470066 0.054275122 0.005391215
## s58 0.98600972 0.9307388 -0.09521958 0.054923954 0.004912274
## s59 0.98650007 0.9310977 -0.09569241 0.055515146 0.004475881
Which beta_hat should we choose? To answer this question, we need to find the right \(\lambda\). We need \(\lambda_n \rightarrow \infty\) in order to shrink the truly zero coefficients to zero, which requires \(\lambda_n\) to be sufficiently large. On the other hand, we need \(\lambda_n / \sqrt{n} \rightarrow 0\) so that we do not shrink too much, because too much shrinkage would introduce asymptotic bias into the non-zero coefficients. Information-criterion-based model selection is very fast, but it relies on a proper estimate of the degrees of freedom, is derived for large samples (asymptotic results), and assumes that the model is correct, i.e., that the data are actually generated by this model. In practice, choosing \(\lambda_n\) by \(\mathrm{BIC}\) results in consistent model selection in the fixed \(p\) setting. Let \(\mathcal{A}=\left\{j: \beta_{0, j} \neq 0\right\}\) be the active set (the relevant variables); then
\[ P\left(\hat{\mathcal{A}}_{\lambda_{BIC}}=\mathcal{A}\right) \rightarrow 1 \]
We find \(\lambda\) by the minimum Bayesian Information Criterion (BIC). Let \(SSE_\lambda\) be the sum of squared errors for a given value of \(\lambda\) and \(nz_\lambda\) be the number of non-zero coefficients. Then,
\[ BIC_\lambda=\log \left(SSE_\lambda\right)+\frac{\log (n)}{n} nz_\lambda \]
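This form is the usual Gaussian BIC rescaled (a standard identity, added here for clarity): the textbook criterion \(n \log \left(SSE_\lambda / n\right)+\log (n)\, nz_\lambda\), divided by \(n\), equals
\[ \log \left(SSE_\lambda\right)-\log (n)+\frac{\log (n)}{n} nz_\lambda , \]
and since \(\log (n)\) does not depend on \(\lambda\), minimizing this expression over \(\lambda\) is equivalent to minimizing \(BIC_\lambda\) above.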
# Predict y_hat for each of the 60 lambdas
y_hat = predict(lasso, newx = X)
dim(y_hat)
## [1] 100 60
# SSE for each lambda
SSE <- c()
for (i in 1:ncol(y_hat)) {
  SSE_each <- sum((y_hat[,i]-y[,1])^(2))
  SSE <- c(SSE, SSE_each)
}
# BIC
nz <- colSums(beta_hat!=0) # Number of non-zero coefficients for each lambda
BIC <- log(SSE) + (log(N)/N)*nz # Now BIC
BIC
## s0 s1 s2 s3 s4 s5 s6 s7
## 5.598919 5.595359 5.468287 5.348947 5.237755 5.135013 5.040883 4.955387
## s8 s9 s10 s11 s12 s13 s14 s15
## 4.878394 4.809638 4.748729 4.695181 4.648437 4.607898 4.572946 4.542971
## s16 s17 s18 s19 s20 s21 s22 s23
## 4.517383 4.495631 4.477205 4.461646 4.448541 4.437530 4.428295 4.420563
## s24 s25 s26 s27 s28 s29 s30 s31
## 4.414098 4.408698 4.448661 4.443309 4.438844 4.435121 4.432021 4.429439
## s32 s33 s34 s35 s36 s37 s38 s39
## 4.427290 4.425503 4.424017 4.468004 4.466218 4.464732 4.463498 4.462471
## s40 s41 s42 s43 s44 s45 s46 s47
## 4.461618 4.460910 4.460321 4.459832 4.459426 4.459088 4.458808 4.458575
## s48 s49 s50 s51 s52 s53 s54 s55
## 4.458382 4.458222 4.458088 4.457978 4.457886 4.457810 4.457746 4.457694
## s56 s57 s58 s59
## 4.457650 4.457614 4.457584 4.457559
Let's select the \(\lambda\) that has the minimum BIC:
beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
beta_lasso
## V1 V2 V3 V4
## 0.8887426 0.8461243 0.0000000 0.0000000
This is the beta_hat that identifies the true sparsity. And here is the second Oracle property, the \(\ell_2\) error:
l_2 <- sqrt(sum((beta_lasso - Beta)^2))
l_2
## [1] 0.189884
Here, we will create a simulation that reports the two Oracle properties for Lasso and Adaptive Lasso:
- True sparsity,
- \(\ell_2\) error.
Lasso:
We first have a function, mcs(), that executes a simulation with all the steps shown before:
mcs <- function(mc, N, Beta){
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))

  for (i in 1:mc) {
    set.seed(i)
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]

    set.seed(i)
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta # beta_hat is a matrix
    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat!=0)
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
    nonz_beta = length(Beta[Beta==0])                 # number of true zeros
    nonz_beta_hat = length(beta_lasso[beta_lasso==0]) # number of estimated zeros

    mcmat[i,1] <- sqrt(sum((beta_lasso - Beta)^2))        # l_2 error
    mcmat[i,2] <- ifelse(nonz_beta!= nonz_beta_hat, 0, 1) # 1 if the zero counts match
    mcmat[i,3] <- sum(beta_lasso!=0)
    beta_lasso_mat[i,] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}
We are ready for simulation:
mc <- 500
N <- 1000
Beta <- matrix(c(1,1,0,0), nc = 1)

output <- mcs(mc, N, Beta) # see the function above

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[,2]) # how many times Lasso finds the true sparsity
## [1] 400
This is the first property: Lasso identifies the true sparsity in \(400/500 = 80\%\) of the cases. And the second property, the \(\ell_2\) error, summed over the simulation is:
sum(MC_performance[,1])
## [1] 29.41841
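To put these totals on a per-replication scale (a small addition using the objects already created above, not part of the original script), we can take means instead of sums:
# Per-replication summaries of the Lasso Monte Carlo results
mean(MC_performance[, 2]) # share of replications recovering the true sparsity
mean(MC_performance[, 1]) # average l_2 error per replication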
Adaptive Lasso
This time, we let our Adaptive Lasso use the Lasso coefficients as penalty weights in glmnet(). A small standalone sketch of the weighting step is shown below, followed by the same simulation function adapted for Adaptive Lasso.
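The sketch reuses X, y, and the BIC-selected beta_lasso from the single-sample example above; it only illustrates how the weights enter glmnet() through its penalty.factor argument, and alasso is just a label used for this illustration.
# First-stage Lasso coefficients -> adaptive penalty weights
weights <- abs(beta_lasso)^(-1)
weights[beta_lasso == 0] <- 10^10 # replace Inf so zeroed variables are heavily penalized

# Second-stage fit: each coefficient gets its own penalty weight
alasso <- glmnet(x = X, y = y, family = "gaussian", penalty.factor = weights)
The same two-stage weighting is embedded in the simulation function mcsA() below: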
# Adaptive LASSO
mcsA <- function(mc, N, Beta){
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))

  for (i in 1:mc) {
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]

    # First stage: regular Lasso, lambda chosen by BIC
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta
    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat!=0)
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]

    weights = abs(beta_lasso)^(-1)
    weights[beta_lasso==0] = 10^10 # to handle Inf's

    # Second stage: Adaptive Lasso with the first-stage weights
    lasso <- glmnet(x = X, y = y, family = "gaussian", penalty.factor = weights)
    beta_hat <- lasso$beta
    y_hat = predict(lasso, newx = X)

    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }

    nz <- colSums(beta_hat!=0)
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
    nonz_beta = length(Beta[Beta==0])                 # number of true zeros
    nonz_beta_hat = length(beta_lasso[beta_lasso==0]) # number of estimated zeros

    mcmat[i,1] <- sqrt(sum((beta_lasso - Beta)^2))        # l_2 error
    mcmat[i,2] <- ifelse(nonz_beta!= nonz_beta_hat, 0, 1) # 1 if the zero counts match
    mcmat[i,3] <- sum(beta_lasso!=0)
    beta_lasso_mat[i,] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}
Here are the results for Adaptive Lasso:
mc <- 500
N <- 1000
beta <- matrix(c(1,1,0,0), nc = 1)

output <- mcsA(mc, N, beta) # see the function above

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[,2])
## [1] 492
And the total \(\ell_2\) error:
sum(MC_performance[,1])
## [1] 20.21311
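Because the second run overwrites output and MC_performance, a side-by-side table would require saving the Lasso results under another name before re-running; a minimal sketch, assuming the earlier Lasso performance matrix had been stored as MC_performance_lasso (a hypothetical name):
# Hypothetical comparison; MC_performance_lasso is assumed to hold the plain-Lasso results
rbind(
  Lasso    = c(sparsity = mean(MC_performance_lasso[, 2]), total_l2 = sum(MC_performance_lasso[, 1])),
  Adaptive = c(sparsity = mean(MC_performance[, 2]),       total_l2 = sum(MC_performance[, 1]))
)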
The simulation results clearly show that Adaptive Lasso behaves as an Oracle estimator while the regular Lasso does not. For model selection, Adaptive Lasso is the better estimator.