Chapter 20 Sparsity

This is a simulation to illustrate some of the properties of Lasso-type estimators. There are two objectives in using these penalized regressions: model selection (identifying the “correct” sparsity) and prediction accuracy. These two objectives require different tuning of the penalty and are usually not compatible. In model selection, the objective is to shrink the dimension of the model down to the “true” sparsity. This is usually evaluated by checking whether the Oracle properties are satisfied. These asymptotic properties ask (1) whether the model identified by the penalized regression converges to the “true” sparsity, and (2) whether the coefficient estimates are consistent.

The literature shows that the Lasso is not an “Oracle” estimator; the Adaptive Lasso (Zou 2006) was developed to fill this gap.
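For reference, the Adaptive Lasso augments the Lasso penalty with coefficient-specific weights built from a first-stage estimator:

\[ \hat{\beta}^{AL}=\arg \min_\beta \sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^p \hat{w}_j\left|\beta_j\right|, \quad \hat{w}_j=\frac{1}{\left|\hat{\beta}_j^{\text{init}}\right|^\gamma}, ~\gamma>0 \]

where \(\hat{\beta}^{\text{init}}\) is a first-stage estimate (below we use the BIC-selected Lasso coefficients) and we set \(\gamma=1\) in the simulations.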

Let’s specify a data generating process with a linear regression model:

\[ y_i=x_i^{\prime} \beta_0+u_i, ~~~~~i=1, \ldots, n \]

where \(\beta_0\) is \(p \times 1\). First, we consider the case where \(p<n\), and then move to the case where \(p \geq n\). We define \(\beta_0=(1,1,0,0)^{\prime}\) and \(n=100\).

# This function generates the data
dgp <- function(N, Beta){
  p <- length(Beta)
  
  X <- matrix(rnorm(N*p), ncol = p)  # regressors
  u <- matrix(rnorm(N), ncol = 1)    # error term
  y <- X%*%Beta + u                  # outcome
  
  return(list(y, X))
}

N = 100
Beta = c(1,1,0,0)

set.seed(148)
Output <- dgp(N, Beta)
y <- Output[[1]]
X <- Output[[2]]

First, we apply the Lasso:

library(glmnet)

set.seed(432)
lasso <- glmnet(x = X, y = y, family = "gaussian")

beta_hat <- lasso$beta
cbind(t(beta_hat), "lambda" = lasso$lambda)
## 60 x 5 sparse Matrix of class "dgCMatrix"
##             V1        V2          V3          V4      lambda
## s0  .          .          .          .           1.083220708
## s1  0.09439841 0.0283513  .          .           0.986990366
## s2  0.17344129 0.1097255  .          .           0.899308862
## s3  0.24546220 0.1838706  .          .           0.819416741
## s4  0.31108496 0.2514289  .          .           0.746622016
## s5  0.37087798 0.3129855  .          .           0.680294174
## s6  0.42535915 0.3690736  .          .           0.619858715
## s7  0.47500037 0.4201789  .          .           0.564792175
## s8  0.52023159 0.4667442  .          .           0.514617595
## s9  0.56144460 0.5091728  .          .           0.468900386
## s10 0.59899636 0.5478321  .          .           0.427244568
## s11 0.63321213 0.5830571  .          .           0.389289338
## s12 0.66438826 0.6151527  .          .           0.354705946
## s13 0.69279478 0.6443971  .          .           0.323194848
## s14 0.71867775 0.6710435  .          .           0.294483108
## s15 0.74226135 0.6953227  .          .           0.268322040
## s16 0.76374985 0.7174449  .          .           0.244485050
## s17 0.78332937 0.7376020  .          .           0.222765672
## s18 0.80116950 0.7559683  .          .           0.202975783
## s19 0.81742476 0.7727030  .          .           0.184943974
## s20 0.83223594 0.7879510  .          .           0.168514061
## s21 0.84573135 0.8018445  .          .           0.153543737
## s22 0.85802785 0.8145036  .          .           0.139903334
## s23 0.86923197 0.8260382  .          .           0.127474708
## s24 0.87944075 0.8365481  .          .           0.116150206
## s25 0.88874261 0.8461243  .          .           0.105831742
## s26 0.89685610 0.8542117 -0.00686322 .           0.096429941
## s27 0.90418482 0.8614679 -0.01432988 .           0.087863371
## s28 0.91086250 0.8680794 -0.02113323 .           0.080057832
## s29 0.91694695 0.8741036 -0.02733218 .           0.072945714
## s30 0.92249088 0.8795926 -0.03298044 .           0.066465418
## s31 0.92754230 0.8845940 -0.03812692 .           0.060560814
## s32 0.93214497 0.8891511 -0.04281620 .           0.055180758
## s33 0.93633875 0.8933034 -0.04708890 .           0.050278651
## s34 0.94015996 0.8970867 -0.05098203 .           0.045812035
## s35 0.94463571 0.9004434 -0.05531593 0.005031468 0.041742220
## s36 0.94880161 0.9034936 -0.05933362 0.010054959 0.038033956
## s37 0.95259743 0.9062729 -0.06299446 0.014632189 0.034655124
## s38 0.95605603 0.9088053 -0.06633008 0.018802790 0.031576458
## s39 0.95920739 0.9111127 -0.06936937 0.022602886 0.028771293
## s40 0.96207878 0.9132152 -0.07213866 0.026065393 0.026215331
## s41 0.96468945 0.9151338 -0.07466124 0.029219211 0.023886433
## s42 0.96707383 0.9168790 -0.07696041 0.032093941 0.021764428
## s43 0.96924639 0.9184692 -0.07905533 0.034713288 0.019830936
## s44 0.97122595 0.9199181 -0.08096415 0.037099940 0.018069210
## s45 0.97302965 0.9212383 -0.08270339 0.039274568 0.016463992
## s46 0.97467312 0.9224412 -0.08428812 0.041256008 0.015001376
## s47 0.97617058 0.9235372 -0.08573207 0.043061423 0.013668695
## s48 0.97753502 0.9245359 -0.08704774 0.044706449 0.012454406
## s49 0.97877824 0.9254459 -0.08824654 0.046205336 0.011347991
## s50 0.97991101 0.9262750 -0.08933883 0.047571066 0.010339867
## s51 0.98094316 0.9270304 -0.09033409 0.048815468 0.009421301
## s52 0.98188361 0.9277188 -0.09124093 0.049949321 0.008584339
## s53 0.98274051 0.9283460 -0.09206721 0.050982445 0.007821730
## s54 0.98352129 0.9289175 -0.09282009 0.051923790 0.007126869
## s55 0.98423271 0.9294382 -0.09350608 0.052781508 0.006493738
## s56 0.98488092 0.9299126 -0.09413113 0.053563029 0.005916852
## s57 0.98547155 0.9303449 -0.09470066 0.054275122 0.005391215
## s58 0.98600972 0.9307388 -0.09521958 0.054923954 0.004912274
## s59 0.98650007 0.9310977 -0.09569241 0.055515146 0.004475881

Which beta_hat should we use? To answer this question, we need to choose \(\lambda\). We need \(\lambda_n \rightarrow \infty\) so that the truly zero coefficients are shrunk all the way to zero; this requires \(\lambda_n\) to be sufficiently large. On the other hand, we need \(\lambda_n / \sqrt{n} \rightarrow 0\) so that we do not shrink too much, which would introduce asymptotic bias in the non-zero coefficients. Information-criterion-based model selection is very fast, but it relies on a proper estimate of the degrees of freedom, is derived for large samples (asymptotic results), and assumes that the model is correct, i.e., that the data are actually generated by this model. In practice, choosing \(\lambda_n\) by \(\mathrm{BIC}\) results in consistent model selection in the fixed-\(p\) setting. Let \(\mathcal{A}=\left\{j: \beta_{0, j} \neq 0\right\}\) be the active set (the relevant variables); then

\[ P\left(\hat{\mathcal{A}}_{\lambda_{BIC}}=\mathcal{A}\right) \rightarrow 1 \]
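Before turning to the BIC, note that \(\lambda\) is also commonly chosen by K-fold cross-validation; a minimal sketch with cv.glmnet() (shown only for comparison, not the approach we follow here) would be:

# For comparison only: cross-validated choice of lambda
set.seed(432)
cv <- cv.glmnet(x = X, y = y, family = "gaussian", nfolds = 10)
cv$lambda.min                # lambda minimizing the cross-validated error
coef(cv, s = "lambda.min")   # coefficients at that lambda

Cross-validation targets prediction error rather than model-selection consistency, so we stick with the BIC below.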

We find \(\lambda\) by minimizing the Bayesian Information Criterion (BIC). Let \(SSE_\lambda\) be the sum of squared errors for a given value of \(\lambda\) and \(nz_\lambda\) the number of non-zero coefficients. Then,

\[ BIC_\lambda=\log \left(SSE_\lambda\right)+\frac{\log (n)}{n}\, nz_\lambda \]

# Predict yhat for each of the 60 lambdas (s0-s59)
y_hat = predict(lasso, newx = X)
dim(y_hat)
## [1] 100  60
# SSE for each lambda (s)
SSE <- c()
for (i in 1:ncol(y_hat)) {
  SSE_each <- sum((y_hat[,i]-y[,1])^(2))
  SSE <- c(SSE, SSE_each)
}

# BIC
nz <- colSums(beta_hat!=0) # Number of non-zero coefficients for each lambda
BIC <- log(SSE) + (log(N)/N)*nz #Now BIC
BIC
##       s0       s1       s2       s3       s4       s5       s6       s7 
## 5.598919 5.595359 5.468287 5.348947 5.237755 5.135013 5.040883 4.955387 
##       s8       s9      s10      s11      s12      s13      s14      s15 
## 4.878394 4.809638 4.748729 4.695181 4.648437 4.607898 4.572946 4.542971 
##      s16      s17      s18      s19      s20      s21      s22      s23 
## 4.517383 4.495631 4.477205 4.461646 4.448541 4.437530 4.428295 4.420563 
##      s24      s25      s26      s27      s28      s29      s30      s31 
## 4.414098 4.408698 4.448661 4.443309 4.438844 4.435121 4.432021 4.429439 
##      s32      s33      s34      s35      s36      s37      s38      s39 
## 4.427290 4.425503 4.424017 4.468004 4.466218 4.464732 4.463498 4.462471 
##      s40      s41      s42      s43      s44      s45      s46      s47 
## 4.461618 4.460910 4.460321 4.459832 4.459426 4.459088 4.458808 4.458575 
##      s48      s49      s50      s51      s52      s53      s54      s55 
## 4.458382 4.458222 4.458088 4.457978 4.457886 4.457810 4.457746 4.457694 
##      s56      s57      s58      s59 
## 4.457650 4.457614 4.457584 4.457559
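As an aside, the SSE loop above can be replaced by a single vectorized expression (a sketch that should reproduce the same numbers):

# Vectorized alternative to the SSE loop
SSE_vec <- colSums((y_hat - as.vector(y))^2)
all.equal(as.numeric(SSE_vec), SSE)   # should return TRUE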

Let’s select the \(\lambda\) that has the minimum BIC:

beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
beta_lasso
##        V1        V2        V3        V4 
## 0.8887426 0.8461243 0.0000000 0.0000000

This is the beta_hat that identifies the true sparsity. For the second Oracle property, we compute the \(\ell_2\) error:

l_2 <- sqrt(sum((beta_lasso - Beta)^2))
l_2
## [1] 0.189884
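To put this number in perspective, one could compare it with the \(\ell_2\) error of plain OLS on the same data (an illustrative sketch, not part of the Lasso procedure):

# Illustrative benchmark: OLS on all four regressors
beta_ols <- solve(t(X) %*% X, t(X) %*% y)
sqrt(sum((beta_ols - Beta)^2))   # l2 error of OLS, for comparison

OLS does not set the two irrelevant coefficients exactly to zero, so it cannot recover the true sparsity, but it provides a benchmark for the estimation error.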

Next, we create a simulation that reports the two Oracle properties for both Lasso and Adaptive Lasso:

  • True sparsity,
  • \(\ell_2\) error.

Lasso:
We first write a function, mcs(), that runs the simulation with all the steps shown before:

mcs <- function(mc, N, Beta){
  mcmat <- matrix(0, nrow = mc, ncol = 3)                     # columns: l2 error, correct sparsity (0/1), number of non-zeros
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta)) # selected coefficients in each replication
  
  for (i in 1:mc) {
    set.seed(i)
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]
    
    set.seed(i)
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta; # beta_hat is a matrix
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }
    
    nz <- colSums(beta_hat!=0) 
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
    nonz_beta = length(Beta[Beta==0])                 # number of truly zero coefficients
    nonz_beta_hat = length(beta_lasso[beta_lasso==0]) # number of estimated zero coefficients
    
    mcmat[i,1] <- sqrt(sum((beta_lasso - Beta)^2))        # l2 error
    mcmat[i,2] <- ifelse(nonz_beta!= nonz_beta_hat, 0, 1) # 1 if the true sparsity is recovered
    mcmat[i,3] <- sum(beta_lasso!=0)                      # number of selected variables
    beta_lasso_mat[i,] <- beta_lasso
  }
  return(list(mcmat,beta_lasso_mat))
}

We are ready for simulation:

mc <- 500
N <- 1000
Beta <- matrix(c(1,1,0,0), nc = 1)
output <- mcs(mc, N, Beta) #see the function

MC_betas = output[[2]]
MC_performance = output[[1]];
sum(MC_performance[,2]) #how many times lasso finds true sparsity
## [1] 400

This is the first property: Lasso identifies the true sparsity in \(400/500 = 80\%\) of the cases. The second property, the \(\ell_2\) error summed over the 500 replications, is:

sum(MC_performance[,1])
## [1] 29.41841

Adaptive Lasso:
This time we use the inverse absolute values of the Lasso coefficients as penalty weights in glmnet(). Let’s write the same simulation function with an Adaptive Lasso second stage:

# Adaptive LASSO
mcsA <- function(mc, N, Beta){
  
  mcmat <- matrix(0, nrow = mc, ncol = 3)                     # columns: l2 error, correct sparsity (0/1), number of non-zeros
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta)) # selected coefficients in each replication
  
  for (i in 1:mc) {
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]
    
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta;
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }
    
    nz <- colSums(beta_hat!=0)
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
    
    # Adaptive weights: inverse absolute first-stage (Lasso) coefficients
    weights = abs(beta_lasso)^(-1)
    weights[beta_lasso==0] = 10^10 # replace Inf's with a huge penalty so these variables are effectively dropped
    
    #Now Adaptive Lasso
    lasso <- glmnet(x = X, y = y, family = "gaussian", penalty.factor = weights)
    beta_hat <- lasso$beta;
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[,j]-y[,1])^(2))
      SSE <- c(SSE, SSE_each)
    }
    
    nz <- colSums(beta_hat!=0)
    BIC <- log(SSE) + (log(N)/N)*nz
    beta_lasso <- beta_hat[ ,which(BIC==min(BIC))]
    nonz_beta = length(Beta[Beta==0])
    nonz_beta_hat = length(beta_lasso[beta_lasso==0])
    
    mcmat[i,1] <- sqrt(sum((beta_lasso - Beta)^2))
    mcmat[i,2] <- ifelse(nonz_beta!= nonz_beta_hat, 0, 1)
    mcmat[i,3] <- sum(beta_lasso!=0)
    beta_lasso_mat[i,] <- beta_lasso
  }
  return(list(mcmat,beta_lasso_mat))
}

Here are the results for the Adaptive Lasso:

mc <- 500
N <- 1000
beta <- matrix(c(1,1,0,0), nc = 1)
output <- mcsA(mc, N, beta) #see the function

MC_betas = output[[2]]
MC_performance = output[[1]];
sum(MC_performance[,2])
## [1] 492

And the total \(\ell_2\) error:

sum(MC_performance[,1])
## [1] 20.21311

The simulation results clearly show that the Adaptive Lasso behaves like an Oracle estimator while the regular Lasso does not: it recovers the true sparsity far more often (492/500 versus 400/500) and has a smaller total \(\ell_2\) error. For model selection, the Adaptive Lasso is the better estimator.
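As a quick additional check, we can summarize the Monte Carlo output of the Adaptive Lasso run with the objects already in memory (a short sketch using the objects defined above):

colMeans(MC_betas)          # average estimated coefficients across replications
mean(MC_performance[,1])    # average l2 error per replication
mean(MC_performance[,3])    # average number of selected variables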