Chapter 19 Adaptive Lasso

Adaptive lasso is a method for regularization and variable selection in regression analysis introduced by Zou (Zou 2006) in The Adaptive Lasso and Its Oracle Properties. In this paper, the author proposed a weighted \(\ell_{1}\) penalty in the objective function, with data-dependent weights computed from an initial consistent estimate of the coefficients, so that large coefficients are penalized less and small ones more. He showed that this adaptive penalization yields more reliable variable selection than the standard lasso, which applies the same \(\ell_{1}\) penalty to every coefficient.

Since its introduction, adaptive lasso has been widely used in a variety of applications in statistical modeling and machine learning. It has been applied to problems such as feature selection in genomic data, high-dimensional regression, and model selection with generalized linear models. Adaptive lasso is useful in situations where the predictors are correlated and there is a need to select a small subset of important variables to include in the model. It has been shown that the adaptive lasso is an oracle-efficient estimator (consistent in variable selection and asymptotically normal in coefficient estimation), while the plain lasso is not.

Consider the linear regression model:

\[ y_i=x_i^{\prime} \beta+\epsilon_i, ~~~~i=1, \ldots, n ~~~~\text{and} ~~~~\beta \text { is } (p \times 1) \] The adaptive Lasso estimates \(\beta\) by minimizing

\[ L(\beta)=\sum_{i=1}^n\left(y_i-x_i^{\prime} \beta\right)^2+\lambda_n \sum_{j=1}^p \frac{1}{w_j}\left|\beta_j\right| \]

where, typically, \(w_j=\left|\hat{\beta}_{OLS,j}\right|^{\gamma}\) or \(w_j=\left|\hat{\beta}_{Ridge,j}\right|^{\gamma}\), with \(\gamma\) a positive constant that adjusts the vector of adaptive weights; suggested values are 0.5, 1, and 2.

The weights in adaptive lasso (AL) provide prior “intelligence” about the variables: while the plain lasso penalizes all parameters equally, the adaptive lasso penalizes non-zero coefficients less than the zero ones. This is because the weights are obtained from a consistent initial estimator such as least squares. If the true \(\beta_j=0\), then \(\hat{\beta}_{OLS,j}\) is likely to be close to zero, leading to a very small \(w_j\) and hence a large penalty \(1/w_j\); truly zero coefficients are therefore penalized heavily. Calculating the weights makes adaptive lasso a two-step procedure: first fit OLS or ridge regression to obtain the weights, then solve the weighted lasso problem above.
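
For instance, the first step could use OLS-based weights. Below is a minimal sketch, assuming a numeric design matrix X (without an intercept column), a response vector y, and \(\gamma = 1\); the resulting vector is what would be supplied as penalty.factor to glmnet in the second step:

g <- 1
ols.fit  <- lm(y ~ X)             # step 1: preliminary OLS fit
beta.ols <- coef(ols.fit)[-1]     # drop the intercept
w.ols    <- 1 / abs(beta.ols)^g   # penalty factors 1/w_j for step 2

When the predictors are highly collinear or p is close to n, the OLS estimates become unstable (or undefined when p > n), which is why ridge-based weights are often preferred, as in the example below.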

Here is an example where we use ridge weights in adaptive lasso:

library(ISLR)
library(glmnet)

remove(list = ls())

data(Hitters)
df <- Hitters[complete.cases(Hitters$Salary), ]  # drop players with missing Salary
X  <- model.matrix(Salary ~ ., df)[, -1]         # design matrix without the intercept column
y  <- df$Salary

# Ridge weights with gamma = 1
g = 1
set.seed(1)
modelr <- cv.glmnet(X, y, alpha = 0)                      # ridge fit (alpha = 0)
coefr  <- as.matrix(coef(modelr, s = modelr$lambda.min))  # ridge coefficients at lambda.min
w.r    <- 1 / (abs(coefr[-1, ]))^g                        # penalty factors 1/w_j, intercept dropped

## Adaptive Lasso
set.seed(1)
alasso <- cv.glmnet(X, y, alpha=1, penalty.factor = w.r)

## Lasso
set.seed(1)
lasso <- cv.glmnet(X, y, alpha=1)

# Sparsity
cbind(LASSO = coef(lasso, s="lambda.1se"),
           ALASSO = coef(alasso, s="lambda.1se"))
## 20 x 2 sparse Matrix of class "dgCMatrix"
##                       s1          s1
## (Intercept) 127.95694754   -7.109481
## AtBat         .             .       
## Hits          1.42342566    2.054867
## HmRun         .             .       
## Runs          .             .       
## RBI           .             .       
## Walks         1.58214111    3.573120
## Years         .            31.573334
## CAtBat        .             .       
## CHits         .             .       
## CHmRun        .             .       
## CRuns         0.16027975    .       
## CRBI          0.33667715    .       
## CWalks        .             .       
## LeagueN       .            29.811080
## DivisionW    -8.06171262 -138.088953
## PutOuts       0.08393604    .       
## Assists       .             .       
## Errors        .             .       
## NewLeagueN    .             .
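
The nonzero coefficients can also be listed programmatically; a small sketch using the fits above:

b.lasso  <- as.matrix(coef(lasso,  s = "lambda.1se"))[-1, ]   # drop the intercept
b.alasso <- as.matrix(coef(alasso, s = "lambda.1se"))[-1, ]
names(b.lasso)[b.lasso != 0]      # variables kept by the lasso
names(b.alasso)[b.alasso != 0]    # variables kept by the adaptive lasso
intersect(names(b.lasso)[b.lasso != 0], names(b.alasso)[b.alasso != 0])  # common variables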

We can see the difference between the lasso and the adaptive lasso in this example: PutOuts, CRuns, and CRBI, which are picked by the lasso, are not selected by the adaptive lasso, whereas Years and LeagueN appear only in the adaptive lasso fit. There are only three features common to both methods: Hits, Walks, and DivisionW. To understand which model is better at recovering the true sparsity, we will run a simulation that illustrates some of the properties of the lasso and the adaptive lasso.
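
As a preview, here is a minimal sketch of such a simulation under assumptions of our own choosing: n = 100 observations, p = 10 predictors correlated through a common factor, and a true coefficient vector with only three nonzero entries. It records how often each method recovers exactly the true support over 100 replications:

set.seed(123)
n <- 100; p <- 10; nsim <- 100
beta.true <- c(3, 1.5, 0, 0, 2, rep(0, 5))       # only three nonzero coefficients
hits <- c(lasso = 0, alasso = 0)

for (s in 1:nsim) {
  X <- matrix(rnorm(n * p), n, p) + rnorm(n)     # common factor induces correlation
  y <- drop(X %*% beta.true + rnorm(n))
  # Step 1: ridge weights
  ridge <- cv.glmnet(X, y, alpha = 0)
  w <- 1 / abs(as.matrix(coef(ridge, s = ridge$lambda.min))[-1, ])
  # Step 2: lasso vs. adaptive lasso
  l  <- cv.glmnet(X, y, alpha = 1)
  al <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
  sel.l  <- as.matrix(coef(l,  s = "lambda.1se"))[-1, ] != 0
  sel.al <- as.matrix(coef(al, s = "lambda.1se"))[-1, ] != 0
  hits["lasso"]  <- hits["lasso"]  + all(sel.l  == (beta.true != 0))
  hits["alasso"] <- hits["alasso"] + all(sel.al == (beta.true != 0))
}
hits / nsim   # share of runs recovering exactly the true support

The exact numbers will vary with the design and noise level, but this kind of experiment illustrates the selection-consistency advantage of the adaptive lasso discussed above.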

References

Zou, Hui. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29. https://doi.org/10.1198/016214506000000735.