Parametric models in prediction
In simple regression or classification problems, we cannot train a parametric model so that the fitted model minimizes the out-of-sample prediction error; standard estimators such as least squares minimize the in-sample error instead. We could (and did) fit parametric models manually by adding or removing predictors, their interactions, and polynomial terms. As we have seen in earlier chapters, dropping a variable from a regression, for example, can reduce the variance at the cost of a negligible increase in bias.
In fitting a predictive model, some of the variables used in a multiple regression may not be well associated with the response. Keeping those "irrelevant" variables often leads to unnecessary complexity in the resulting model. Regularization, or penalization, is an alternative, automated fitting procedure that removes irrelevant variables or shrinks the magnitude of their coefficients, which can yield better prediction accuracy and model interpretability by preventing overfitting.
Several types of regularization techniques can be used in parametric models. Each adds a different penalty term to the objective function and suits different situations, depending on the characteristics of the data and the desired properties of the model. Ridge and Lasso are two well-known benchmark techniques that reduce model complexity and help prevent the overfitting that arises from simple linear regression.
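For a linear model with coefficients \(\beta_j\), the two penalties correspond to the following objective functions (stated here for reference; these are the standard textbook definitions, not taken verbatim from this text):

\[ \widehat{\beta}^{\,\text{ridge}}=\underset{\beta}{\operatorname{argmin}}\left\{\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j\right)^2+\lambda\sum_{j=1}^p\beta_j^2\right\} \]

\[ \widehat{\beta}^{\,\text{lasso}}=\underset{\beta}{\operatorname{argmin}}\left\{\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j\right)^2+\lambda\sum_{j=1}^p\left|\beta_j\right|\right\} \]

The only difference is the norm of the penalty: the squared \(\ell_2\) norm shrinks coefficients smoothly toward zero, while the \(\ell_1\) norm can set some of them exactly to zero.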
The general principle of penalization can be expressed as
\[ \widehat{m}_\lambda(\boldsymbol{x})=\underset{m}{\operatorname{argmin}}\left\{\sum_{i=1}^n \underbrace{\mathcal{L}\left(y_i, m\left(\boldsymbol{x}_i\right)\right)}_{\text {loss function }}+\underbrace{\lambda\|m\|_{\ell_q}}_{\text {penalization }}\right\} \]
where the loss function \(\mathcal{L}\) may target the conditional mean, quantiles, or expectiles, and \(m\) could be a linear model, a logit, splines, a tree-based model, or a neural network. The penalty norm, \(\ell_q\), could be Lasso (\(\ell_1\)) or Ridge (\(\ell_2\)). Finally, \(\lambda\) regulates overfitting and can be determined by cross-validation or other methods. It puts a price on having a more flexible model (a small numerical sketch follows the list below):
- \(\lambda\rightarrow0\): the fit interpolates the data; low bias, high variance
- \(\lambda\rightarrow\infty\): the fit reduces to a simple linear model; high bias, low variance
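As a minimal sketch of this trade-off, the snippet below fits Ridge (\(\ell_2\)) and Lasso (\(\ell_1\)) regressions on simulated data over a grid of \(\lambda\) values (called `alpha` in scikit-learn) and compares their out-of-sample errors; the data-generating process and the particular values are illustrative assumptions, not something taken from the text.

```python
# Sketch: effect of the penalty weight lambda (scikit-learn's `alpha`)
# on Ridge and Lasso fits, using simulated data with a sparse truth.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]            # only the first 3 predictors are truly relevant
y = X @ beta + rng.normal(scale=1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for lam in [0.001, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=lam).fit(X_tr, y_tr)
    lasso = Lasso(alpha=lam).fit(X_tr, y_tr)
    print(f"lambda={lam:>7}: "
          f"Ridge test MSE={mean_squared_error(y_te, ridge.predict(X_te)):.2f} | "
          f"Lasso test MSE={mean_squared_error(y_te, lasso.predict(X_te)):.2f} | "
          f"Lasso nonzero coefs={np.sum(lasso.coef_ != 0)}")
```

Very small `alpha` values leave the fit close to unpenalized least squares (low bias, higher variance), while very large values shrink all coefficients toward zero (high bias, low variance); the Lasso additionally sets some coefficients exactly to zero.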
There are two fundamental goals in statistical learning: achieving high prediction accuracy and identifying relevant predictors. The second objective, variable selection, is particularly important when the underlying model is truly sparse. By their nature, penalized parametric models are not well-performing tools for prediction, but they provide important tools for model selection, especially when \(p>N\) and the true model is sparse. This section starts with the two major regularized-regression models, Ridge and Lasso, and then develops the idea of sparse statistical modelling with the Adaptive Lasso.
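To make the variable-selection point concrete, the sketch below uses scikit-learn's `LassoCV` to choose \(\lambda\) by cross-validation on simulated data with \(p>N\) in which only a handful of predictors truly matter; the simulated design and parameter values are assumptions made for this illustration.

```python
# Sketch: Lasso as a model-selection tool when p > N and the truth is sparse.
# (Simulated data; LassoCV chooses lambda by 5-fold cross-validation.)
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 80, 200                              # more predictors than observations (p > N)
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [4.0, -3.0, 2.0, -2.0, 1.5]      # only 5 predictors truly matter
y = X @ beta + rng.normal(scale=1.0, size=n)

lasso_cv = LassoCV(cv=5, random_state=1).fit(X, y)
selected = np.flatnonzero(lasso_cv.coef_)   # predictors kept by the Lasso

print("lambda chosen by CV:", lasso_cv.alpha_)
print("number of selected predictors:", selected.size)
print("first selected indices:", selected[:10])
```

With the cross-validated \(\lambda\), the Lasso typically retains a small subset of predictors that contains most of the truly relevant ones, although it may also keep a few irrelevant ones; refining this selection step is the motivation for the Adaptive Lasso developed later in this section.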
Although there are many sources on the subject, perhaps the most fundamental one is Statistical Learning with Sparsity by Hastie et al. (2015).