Chapter 18 Lasso

The penalty in ridge regression, \(\lambda \sum_{j} \beta_{j}^{2}\), shrinks all of the coefficients towards zero, but it never sets any of them exactly to zero. This can make the model hard to interpret when the number of variables is large. One of the key advantages of the lasso is that it can set the coefficients of some features to exactly zero, effectively eliminating those features from the model. By dropping unnecessary or redundant features, the lasso improves the interpretability and simplicity of the model, which is particularly useful when we have a large number of features and want to identify the most important ones for predicting the target variable.

The lasso (least absolute shrinkage and selection operator) is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients minimize the following quantity:

\[\begin{equation} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}+\lambda \sum_{j=1}^{p}\left|\beta_{j}\right|=\operatorname{RSS}+\lambda \sum_{j=1}^{p}\left|\beta_{j}\right| \tag{18.1} \end{equation}\]

The lasso also shrinks the coefficient estimates towards zero. However, the \(\ell_{1}\) penalty, the second term in Equation (18.1), forces some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large. Hence, the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression.
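
To see why the \(\ell_{1}\) penalty can produce exact zeros while the \(\ell_{2}\) penalty cannot, consider the simple special case discussed in ISLR: \(n=p\), the design matrix is the identity, and the intercept is omitted. The two penalized problems then have closed-form solutions:

\[\hat{\beta}_{j}^{R}=\frac{y_{j}}{1+\lambda}, \qquad \hat{\beta}_{j}^{L}=\begin{cases} y_{j}-\lambda / 2 & \text{if } y_{j}>\lambda / 2 \\ y_{j}+\lambda / 2 & \text{if } y_{j}<-\lambda / 2 \\ 0 & \text{if } \left|y_{j}\right| \leq \lambda / 2 \end{cases}\]

Ridge shrinks every least-squares estimate by the same proportion, whereas the lasso "soft-thresholds": any estimate smaller than \(\lambda / 2\) in absolute value is set exactly to zero.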

In general, one might expect the lasso to perform better in settings where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or zero. This property is known as "sparsity", because it results in a model with a relatively small number of non-zero coefficients. In some cases, the lasso can recover the true sparsity pattern in the data, identifying the small subset of features that is sufficient to accurately predict the target variable.

Now, we will apply the lasso to the Hitters data that we used in the last chapter. Again, we will follow a similar approach to compare ridge and the lasso, as in ISLR:

library(glmnet)
library(ISLR)
remove(list = ls())

data(Hitters)
# Keep only the observations with a non-missing Salary
df <- Hitters[complete.cases(Hitters$Salary), ]
# Design matrix (drop the intercept column) and response
X <- model.matrix(Salary~., df)[,-1]
y <- df$Salary

# Without a specific grid on lambda
set.seed(1)
train <- sample(1:nrow(X), nrow(X)*0.5)
test <- c(-train)
ytest <- y[test]

# Ridge
set.seed(1)
ridge.out <- cv.glmnet(X[train,], y[train], alpha = 0)
yhatR <- predict(ridge.out, s = "lambda.min", newx = X[test,])
mse_r <- mean((yhatR - ytest)^2)

# Lasso
set.seed(1)
lasso.out <- cv.glmnet(X[train,], y[train], alpha = 1)
yhatL <- predict(lasso.out, s = "lambda.min", newx = X[test,])
mse_l <- mean((yhatL - ytest)^2)

mse_r
## [1] 139863.2
mse_l
## [1] 143668.8
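
Although the two test MSPEs are close, the fitted models differ in sparsity. As a quick check (not part of the original script), we can count how many coefficients the cross-validated lasso keeps at lambda.min:

# Coefficients of the cross-validated lasso at lambda.min
coef_cv <- coef(lasso.out, s = "lambda.min")
# Number of non-zero coefficients, excluding the intercept
sum(coef_cv[-1, ] != 0)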

Now, we will define our own grid for \(\lambda\) and repeat the comparison:

# With a specific grid on lambda
grid = 10^seq(10, -2, length = 100)

set.seed(1)
train <- sample(1:nrow(X), nrow(X)*0.5)
test <- c(-train)
ytest <- y[test]

# Ridge
ridge.mod <- glmnet(X[train,], y[train], alpha = 0,
                    lambda = grid, thresh = 1e-12)
set.seed(1)
cv.outR <- cv.glmnet(X[train,], y[train], alpha = 0)
bestlamR <- cv.outR$lambda.min
yhatR <- predict(ridge.mod, s = bestlamR, newx = X[test,])
mse_R <- mean((yhatR - ytest)^2)

# Lasso
lasso.mod <- glmnet(X[train,], y[train], alpha = 1,
                    lambda = grid, thresh = 1e-12)
set.seed(1)
cv.outL <- cv.glmnet(X[train,], y[train], alpha = 1)
bestlamL <- cv.outL$lambda.min
yhatL <- predict(lasso.mod, s = bestlamL, newx = X[test,])
mse_L <- mean((yhatL - ytest)^2)

mse_R
## [1] 139856.6
mse_L
## [1] 143572.1
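
Before moving on, we can compare the \(\lambda\) values selected by cross-validation for the two penalties and inspect the lasso's cross-validation curve; this quick check is not part of the original comparison:

# Lambda values chosen by 10-fold cross-validation
bestlamR
bestlamL

# Cross-validation curve for the lasso (MSE against log(lambda))
plot(cv.outL)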

Now, we will apply our own algorithm: for each \(\lambda\) in the grid, we estimate the mean squared prediction error (MSPE) with 100 bootstrap loops, training on the unique bootstrap observations and testing on the out-of-bag observations. We then pick the \(\lambda\) with the smallest average MSPE:

grid = 10^seq(10, -2, length = 100)
MSPE <- c()
MMSPE <- c()

for (i in 1:length(grid)) {
  for (j in 1:100) {
    set.seed(j)
    # Unique bootstrap indices: in-sample observations form the training set
    ind <- unique(sample(nrow(df), nrow(df), replace = TRUE))

    train <- df[ind, ]
    xtrain <- model.matrix(Salary~., train)[,-1]
    ytrain <- df[ind, 19]  # column 19 is Salary

    # Out-of-bag observations form the test set
    test <- df[-ind, ]
    xtest <- model.matrix(Salary~., test)[,-1]
    ytest <- df[-ind, 19]

    model <- glmnet(xtrain, ytrain, alpha = 1,
                    lambda = grid[i], thresh = 1e-12)
    yhat <- predict(model, s = grid[i], newx = xtest)
    MSPE[j] <- mean((yhat - ytest)^2)
  }
  MMSPE[i] <- mean(MSPE)
}

min(MMSPE)
## [1] 119855.1
grid[which.min(MMSPE)]
## [1] 2.656088
plot(log(grid), MMSPE, type="o", col = "red", lwd = 3)
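
To make the selected \(\lambda\) easier to see, we could mark it on the plot (an optional addition to the script above):

# Mark the log(lambda) that minimizes the bootstrapped MSPE
abline(v = log(grid[which.min(MMSPE)]), col = "blue", lty = 2)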

What are the coefficients?

# Note that model is the lasso fit from the last bootstrap iteration above
coef_lasso <- coef(model, s = grid[which.min(MMSPE)])
coef_lasso
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  285.73172897
## AtBat         -1.26603002
## Hits           2.04074005
## HmRun          0.02355750
## Runs           2.77938363
## RBI            0.45867292
## Walks          3.44852914
## Years         -8.78839869
## CAtBat        -0.26343169
## CHits          1.29690477
## CHmRun         1.32913790
## CRuns          0.05007662
## CRBI          -0.05515544
## CWalks        -0.33624685
## LeagueN      132.06438132
## DivisionW   -119.26618910
## PutOuts        0.19772257
## Assists        0.64809649
## Errors        -6.97381705
## NewLeagueN   -54.62728800
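
Note that the object model above is the lasso fit from the last bootstrap iteration, estimated with a single \(\lambda\) (the last value in the grid). If we wanted the coefficients at the selected \(\lambda\) using all of the observations, one possibility (a sketch, not part of the original script) is to refit on the full design matrix:

# Refit the lasso on the full data over the whole grid
final <- glmnet(X, y, alpha = 1, lambda = grid, thresh = 1e-12)
# Coefficients at the lambda that minimized the bootstrapped MSPE
coef(final, s = grid[which.min(MMSPE)])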

We can also fit a classification model with LPM or logistic regression when the response is categorical. If there are two possible outcomes, we use the binomial family; if there are more than two, we use the multinomial family.
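
A minimal sketch of how this would look with glmnet, assuming a hypothetical binary outcome built from the Hitters data (whether Salary is above its median):

# Hypothetical binary outcome: 1 if Salary is above its median, 0 otherwise
y_bin <- as.integer(df$Salary > median(df$Salary))

# Lasso-penalized logistic regression with 10-fold cross-validation
set.seed(1)
cv_logit <- cv.glmnet(X, y_bin, alpha = 1,
                      family = "binomial", type.measure = "class")

# Predicted classes at lambda.min
yhat_class <- predict(cv_logit, newx = X, s = "lambda.min", type = "class")
table(yhat_class, y_bin)

For a response with more than two categories, family = "multinomial" would be used instead.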