Chapter 20 Sparsity

This chapter presents a simulation that illustrates some of the properties of Lasso-type estimators. There are two objectives in using these penalized regressions: model selection (identifying the “correct” sparsity) and prediction accuracy. These two objectives require different tuning strategies and are usually not compatible. In model selection, the goal is to shrink the dimension of the model down to the “true” sparsity. Success is usually evaluated by checking whether the Oracle properties are satisfied. These asymptotic properties ask (1) whether the model identified by the penalized regression converges to the “true” sparsity, and (2) whether the coefficient estimates are consistent.

The literature suggests that Lasso is not an oracle estimator. Adaptive Lasso was developed (Zou 2006) to fill this gap.

Let’s specify a data generating process with a linear regression model:

\[ y_i=x_i^{\prime} \beta+u_i, ~~~~~i=1, \ldots, n \]

where \(\beta\) is \(p \times 1\). First, we consider the case where \(p<n\) then move to the case where \(p \geq n\). We define \(\beta=(1,1,0,0)^{\prime}\) and \(n=100\).
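For reference, the lasso solves the penalized least-squares problem

\[ \hat{\beta}^{lasso}=\underset{\beta}{\arg \min } \sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^p\left|\beta_j\right| \]

while the adaptive lasso of Zou (2006) attaches a data-driven weight to each coefficient,

\[ \hat{\beta}^{alasso}=\underset{\beta}{\arg \min } \sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^p \hat{w}_j\left|\beta_j\right|, \quad \hat{w}_j=1 /\left|\hat{\beta}_j\right|^\gamma, \]

where \(\hat{\beta}_j\) is an initial consistent estimate (e.g., OLS or a first-stage lasso) and \(\gamma>0\). Below we use \(\gamma=1\) with lasso as the first stage.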

# This function generates the data from the linear model y = X Beta + u
dgp <- function(N, Beta) {
  p <- length(Beta)
  
  X <- matrix(rnorm(N * p), ncol = p) # N x p design matrix
  u <- matrix(rnorm(N), ncol = 1)     # error term
  y <- X %*% Beta + u                 # outcome
  
  return(list(y, X))
}

N = 100
Beta = c(1, 1, 0, 0)

set.seed(148)
Output <- dgp(N, Beta)
y <- Output[[1]]
X <- Output[[2]]

First, we apply the lasso:

library(glmnet)

set.seed(432)
lasso <- glmnet(x = X, y = y, family = "gaussian")

beta_hat <- lasso$beta
S_matrix <- cbind(t(beta_hat), "lambda" = lasso$lambda)
S_matrix[c(1:8, 25:30, 55:60), ] # selected rows
## 20 x 5 sparse Matrix of class "dgCMatrix"
##             V1        V2          V3         V4      lambda
## s0  .          .          .          .          1.083220708
## s1  0.09439841 0.0283513  .          .          0.986990366
## s2  0.17344129 0.1097255  .          .          0.899308862
## s3  0.24546220 0.1838706  .          .          0.819416741
## s4  0.31108496 0.2514289  .          .          0.746622016
## s5  0.37087798 0.3129855  .          .          0.680294174
## s6  0.42535915 0.3690736  .          .          0.619858715
## s7  0.47500037 0.4201789  .          .          0.564792175
## s24 0.87944075 0.8365481  .          .          0.116150206
## s25 0.88874261 0.8461243  .          .          0.105831742
## s26 0.89685610 0.8542117 -0.00686322 .          0.096429941
## s27 0.90418482 0.8614679 -0.01432988 .          0.087863371
## s28 0.91086250 0.8680794 -0.02113323 .          0.080057832
## s29 0.91694695 0.8741036 -0.02733218 .          0.072945714
## s54 0.98352129 0.9289175 -0.09282009 0.05192379 0.007126869
## s55 0.98423271 0.9294382 -0.09350608 0.05278151 0.006493738
## s56 0.98488092 0.9299126 -0.09413113 0.05356303 0.005916852
## s57 0.98547155 0.9303449 -0.09470066 0.05427512 0.005391215
## s58 0.98600972 0.9307388 -0.09521958 0.05492395 0.004912274
## s59 0.98650007 0.9310977 -0.09569241 0.05551515 0.004475881
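Each row above is the estimated coefficient vector at one value of \(\lambda\). The whole path can also be inspected visually with glmnet's built-in plot method; a minimal sketch (output not shown):

# Coefficient paths against log(lambda); labels mark the variable indices
plot(lasso, xvar = "lambda", label = TRUE)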

Which set of beta_hat should we select? To answer this question we need to choose \(\lambda\). We need \(\lambda_n \rightarrow \infty\), i.e. a sufficiently large penalty, in order to shrink the truly zero coefficients exactly to zero; but a large penalty also introduces asymptotic bias into the estimates of the non-zero coefficients.

In practice, choosing \(\lambda_n\) by \(\mathrm{BIC}\) (Bayesian Information Criterion) results in consistent model selection in the fixed-\(p\) setting. That is, letting \(\mathcal{A}=\left\{j: \beta_{0, j} \neq 0\right\}\) denote the active set (the relevant variables),

\[ P\left(\hat{\mathcal{A}}_{\lambda_{BIC}}=\mathcal{A}\right) \rightarrow 1 \]

To implement this, let \(\mathrm{SSE}_\lambda\) be the sum of squared residuals for a given value of \(\lambda\) and \(nz_\lambda\) the number of non-zero coefficients. The criterion can then be written as

\[ \mathrm{BIC}_\lambda=\log \left(\mathrm{SSE}_\lambda\right)+\frac{\log (n)}{n} nz_\lambda \]

# Predict y_hat for each of the 60 values of lambda (s0 to s59)
y_hat <- predict(lasso, newx = X)
dim(y_hat)
## [1] 100  60
# SSE for each lambda (s)
SSE <- c()
for (i in 1:ncol(y_hat)) {
  SSE_each <- sum((y_hat[, i] - y[, 1]) ^ (2))
  SSE <- c(SSE, SSE_each)
}

# BIC
nz <- colSums(beta_hat != 0) # Number of non-zero coefficients for each lambda
BIC <- log(SSE) + (log(N) / N) * nz # BIC
BIC
##       s0       s1       s2       s3       s4       s5       s6       s7 
## 5.598919 5.595359 5.468287 5.348947 5.237755 5.135013 5.040883 4.955387 
##       s8       s9      s10      s11      s12      s13      s14      s15 
## 4.878394 4.809638 4.748729 4.695181 4.648437 4.607898 4.572946 4.542971 
##      s16      s17      s18      s19      s20      s21      s22      s23 
## 4.517383 4.495631 4.477205 4.461646 4.448541 4.437530 4.428295 4.420563 
##      s24      s25      s26      s27      s28      s29      s30      s31 
## 4.414098 4.408698 4.448661 4.443309 4.438844 4.435121 4.432021 4.429439 
##      s32      s33      s34      s35      s36      s37      s38      s39 
## 4.427290 4.425503 4.424017 4.468004 4.466218 4.464732 4.463498 4.462471 
##      s40      s41      s42      s43      s44      s45      s46      s47 
## 4.461618 4.460910 4.460321 4.459832 4.459426 4.459088 4.458808 4.458575 
##      s48      s49      s50      s51      s52      s53      s54      s55 
## 4.458382 4.458222 4.458088 4.457978 4.457886 4.457810 4.457746 4.457694 
##      s56      s57      s58      s59 
## 4.457650 4.457614 4.457584 4.457559
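To see where the criterion bottoms out, we can plot BIC against the (log) penalty level; a minimal sketch (output not shown):

# BIC along the lambda path; the minimum marks the BIC-selected model
plot(log(lasso$lambda), BIC, type = "l",
     xlab = "log(lambda)", ylab = "BIC")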

The selected model is the one with the minimum BIC:

beta_lasso <- beta_hat[, which(BIC == min(BIC))]
beta_lasso
##        V1        V2        V3        V4 
## 0.8887426 0.8461243 0.0000000 0.0000000

This is the beta_hat that identifies the true sparsity. For the second Oracle property, consistency of the coefficients, we compute the \(\ell_2\) error:

l_2 <- sqrt(sum((beta_lasso - Beta) ^ 2))
l_2
## [1] 0.189884

Here we will run a Monte Carlo simulation that reports the two Oracle properties for Lasso and Adaptive Lasso:

  • True sparsity,
  • \(\ell_2\) error.

Lasso

We first write a function, mcs(), that runs the simulation, repeating all the steps shown above in each replication:

mcs <- function(mc, N, Beta) {
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))
  
  for (i in 1:mc) {
    set.seed(i)
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]
    
    set.seed(i)
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta    # beta_hat is a matrix
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ (2))
      SSE <- c(SSE, SSE_each)
    }
    
    # BIC-selected coefficients
    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]
    
    # Number of zero coefficients in the true Beta and in the estimate
    n_zero_beta <- length(Beta[Beta == 0])
    n_zero_beta_hat <- length(beta_lasso[beta_lasso == 0])
    
    mcmat[i, 1] <- sqrt(sum((beta_lasso - Beta) ^ 2))            # l_2 error
    mcmat[i, 2] <- ifelse(n_zero_beta != n_zero_beta_hat, 0, 1)  # 1 if true sparsity is found
    mcmat[i, 3] <- sum(beta_lasso != 0)                          # number of selected variables
    beta_lasso_mat[i, ] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}

We are ready for simulation:

mc <- 500
N <- 1000
Beta <- matrix(c(1, 1, 0, 0), ncol = 1)
output <- mcs(mc, N, Beta) #see the function

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[, 2]) #how many times lasso finds true sparsity
## [1] 400

This is the first property: lasso identifies the true sparsity in \(400/500 = 80\%\) of the cases. For the second property, the total \(\ell_2\) error over all replications is:

sum(MC_performance[, 1])
## [1] 29.41841
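The matrix MC_betas, which we stored but have not used yet, holds the BIC-selected coefficients from each replication; averaging its columns is one way to gauge the shrinkage bias on the non-zero coefficients (output not shown here):

# Average estimated coefficients across the 500 replications
colMeans(MC_betas)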

Adaptive Lasso

This time the adaptive lasso uses the lasso coefficients to build its penalty weights: each coefficient is penalized in proportion to the inverse of the absolute value of its first-stage (BIC-selected) lasso estimate, supplied to glmnet() through the penalty.factor argument.
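To make the weighting step concrete, here is a minimal sketch, assuming X, y, and beta_lasso from the single-dataset example above are still in the workspace (w and alasso are illustrative names):

# First-stage weights: inverse absolute values of the BIC-selected lasso estimates
w <- abs(beta_lasso) ^ (-1)
w[beta_lasso == 0] <- 10 ^ 10  # zero estimates get an effectively infinite penalty

# Second-stage fit: penalty.factor rescales the penalty for each coefficient
alasso <- glmnet(x = X, y = y, family = "gaussian", penalty.factor = w)

The full simulation function below wraps this two-stage procedure; it mirrors mcs():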

# Adaptive LASSO
mcsA <- function(mc, N, Beta) {
  mcmat <- matrix(0, nrow = mc, ncol = 3)
  beta_lasso_mat <- matrix(0, nrow = mc, ncol = length(Beta))
  
  for (i in 1:mc) {
    data <- dgp(N, Beta)
    y <- data[[1]]
    X <- data[[2]]
    
    lasso <- glmnet(x = X, y = y, family = "gaussian")
    beta_hat <- lasso$beta
    
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ (2))
      SSE <- c(SSE, SSE_each)
    }
    
    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]
    
    # Adaptive lasso weights: inverse absolute values of the first-stage estimates
    weights <- abs(beta_lasso) ^ (-1)
    weights[beta_lasso == 0] <- 10 ^ 10 # zero estimates get an effectively infinite penalty
    
    # Second stage: adaptive lasso via glmnet's penalty.factor argument
    lasso <-
      glmnet(
        x = X,
        y = y,
        family = "gaussian",
        penalty.factor = weights
      )
    beta_hat <- lasso$beta
    
    y_hat = predict(lasso, newx = X)
    
    SSE <- c()
    for (j in 1:ncol(y_hat)) {
      SSE_each <- sum((y_hat[, j] - y[, 1]) ^ (2))
      SSE <- c(SSE, SSE_each)
    }
    
    nz <- colSums(beta_hat != 0)
    BIC <- log(SSE) + (log(N) / N) * nz
    beta_lasso <- beta_hat[, which(BIC == min(BIC))]
    # Number of zero coefficients in the true Beta and in the estimate
    n_zero_beta <- length(Beta[Beta == 0])
    n_zero_beta_hat <- length(beta_lasso[beta_lasso == 0])
    
    mcmat[i, 1] <- sqrt(sum((beta_lasso - Beta) ^ 2))            # l_2 error
    mcmat[i, 2] <- ifelse(n_zero_beta != n_zero_beta_hat, 0, 1)  # 1 if true sparsity is found
    mcmat[i, 3] <- sum(beta_lasso != 0)                          # number of selected variables
    beta_lasso_mat[i, ] <- beta_lasso
  }
  return(list(mcmat, beta_lasso_mat))
}

Here are the results for adaptive lasso:

mc <- 500
N <- 1000
Beta <- matrix(c(1, 1, 0, 0), ncol = 1)
output <- mcsA(mc, N, Beta) #see the function

MC_betas = output[[2]]
MC_performance = output[[1]]

sum(MC_performance[, 2])
## [1] 492

And the total \(\ell_2\) error:

sum(MC_performance[,1])
## [1] 20.21311

The simulation results support the theory: Adaptive Lasso, unlike Lasso, behaves like an Oracle estimator, recovering the true sparsity far more often and with a smaller \(\ell_2\) error, which makes it the better choice for sparsity applications.

We saw here a basic application of adaptive lasso, which has several variations in practice, such as Thresholded Lasso and Rigorous Lasso. Model selection with lasso has been an active research area. One well-known application is the double-selection lasso linear regression method, which can be used for variable selection. Moreover, lasso-type methods are also used in time-series forecasting and in graphical network analysis for dimension reduction.