
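In statistics and machine learning, the lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both feature selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was introduced in 1996 by Robert Tibshirani, professor of statistics at Stanford University, building on Leo Breiman's nonnegative garrote (NNG).[1][2] The lasso was originally formulated for least squares models, and this simple case reveals several important properties of the estimator, such as its relationship to ridge regression (also known as Tikhonov regularization) and best subset selection, and the connection between lasso coefficient estimates and soft thresholding. It also reveals that, as in standard linear regression, the lasso coefficient estimates need not be unique when covariates are collinear.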




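Robert Tibshirani originally used the lasso to improve the prediction accuracy and interpretability of regression models. He altered the model fitting process so that only a subset of the covariates is selected for use in the final model, rather than all of them. The approach builds on Breiman's nonnegative garrote, which has a similar purpose but works somewhat differently.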






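Consider a sample consisting of $N$ cases, each of which consists of $p$ covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ be the covariate vector for the $i$-th case. Then the objective of the lasso is to solve

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^T \beta \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t. \;[1]$$

Here $\beta_0$ is the intercept, $\beta := (\beta_1, \ldots, \beta_p)^T$ is the coefficient vector, and $t$ is a prespecified free parameter that determines the degree of regularization. Letting $X$ be the covariate matrix, so that $X_{ij} = (x_i)_j$ and $x_i^T$ is the $i$-th row of $X$, the expression can be written more compactly as

$$\min_{\beta_0, \beta} \left\{ \frac{1}{N} \left\| y - \beta_0 1_N - X\beta \right\|_2^2 \right\} \quad \text{subject to} \quad \|\beta\|_1 \le t,$$

where $\|\beta\|_p = \left( \sum_{i=1}^{p} |\beta_i|^p \right)^{1/p}$ is the standard $\ell^p$ norm and $1_N$ is an $N$-dimensional vector of ones.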

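Denoting the mean of the covariate vectors $x_i$ by $\bar{x}$ and the mean of the outcomes $y_i$ by $\bar{y}$, the resulting estimate of the intercept is $\hat{\beta}_0 = \bar{y} - \bar{x}^T \beta$, so that

$$y_i - \hat{\beta}_0 - x_i^T \beta = y_i - (\bar{y} - \bar{x}^T \beta) - x_i^T \beta = (y_i - \bar{y}) - (x_i - \bar{x})^T \beta.$$

It is therefore standard to work with centered (zero-mean) variables. In addition, the covariates are typically standardized so that $\sum_{i=1}^{N} x_{ij}^2 = 1$, which makes the solution independent of the measurement scale.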
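The centering and scaling step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names `center_and_scale` and `recover_intercept` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def center_and_scale(X, y):
    """Center y and the columns of X, then scale each column to unit sum of squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar = X.mean(axis=0)                     # covariate means
    y_bar = y.mean()                           # outcome mean
    Xc = X - x_bar
    norms = np.sqrt((Xc ** 2).sum(axis=0))     # column norms after centering
    norms[norms == 0.0] = 1.0                  # leave constant columns unscaled
    return Xc / norms, y - y_bar, x_bar, y_bar, norms

def recover_intercept(beta_scaled, x_bar, y_bar, norms):
    """Map coefficients fitted on scaled data back to the original scale."""
    beta = beta_scaled / norms                 # undo the column scaling
    beta0 = y_bar - x_bar @ beta               # intercept: y_bar - x_bar^T beta
    return beta0, beta

# Hypothetical data whose columns live on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))
y = rng.normal(size=50)
Xs, yc, x_bar, y_bar, norms = center_and_scale(X, y)
print((Xs ** 2).sum(axis=0))                   # each column now has unit sum of squares
```

After fitting on `Xs` and `yc` without an intercept, `recover_intercept` returns the intercept and coefficients for the original, unscaled covariates.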





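With centered variables the intercept drops out, and the problem can be written as

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X\beta \right\|_2^2 \right\} \quad \text{subject to} \quad \|\beta\|_1 \le t,$$

or, in the so-called Lagrangian form,

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X\beta \right\|_2^2 + \lambda \|\beta\|_1 \right\},$$

where the exact relationship between $t$ and $\lambda$ depends on the data.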
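As an illustration of how a penalized objective of the form $\frac{1}{N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ might be minimized numerically, the sketch below uses cyclic coordinate descent. Coordinate descent is one common choice swapped in here for illustration, not a method prescribed by the text; the function names, the synthetic data, and the iteration count are assumptions. Each coordinate update soft-thresholds the partial correlation at $N\lambda/2$, which follows from setting the subgradient of this particular objective to zero.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Move z towards zero by alpha, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def lasso_objective(X, y, beta, lam):
    """(1/N) * ||y - X beta||_2^2 + lam * ||beta||_1."""
    r = y - X @ beta
    return (r @ r) / X.shape[0] + lam * np.abs(beta).sum()

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the objective above (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)              # squared column norms
    r = y.astype(float).copy()                 # residual y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                       # skip all-zero columns
            r += X[:, j] * beta[j]             # remove coordinate j from the fit
            rho = X[:, j] @ r                  # correlation with the partial residual
            beta[j] = soft_threshold(rho, n * lam / 2.0) / col_sq[j]
            r -= X[:, j] * beta[j]             # put the updated coordinate back
    return beta

# Hypothetical centered data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([2.0, 0.0, -1.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)
y -= y.mean()

lam = 0.1
beta_hat = lasso_cd(X, y, lam)
print(beta_hat, lasso_objective(X, y, beta_hat, lam))
```

For this objective, any $\lambda$ above $\frac{2}{N}\max_j |x_{(j)}^T y|$, where $x_{(j)}$ is the $j$-th column of $X$, drives every coefficient to zero, which gives a natural upper end for a grid of $\lambda$ values.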

Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

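Assuming first that the covariates are orthonormal, so that $(x_i \mid x_j) = \delta_{ij}$, where $x_i$ here denotes the $i$-th column of $X$, $(\cdot \mid \cdot)$ is the inner product and $\delta_{ij}$ is the Kronecker delta, or, equivalently, $X^T X = I$, then using subgradient methods it can be shown that the minimizer of the Lagrangian objective $\frac{1}{N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is

$$\hat{\beta}_j = S_{N\lambda/2}\left(\hat{\beta}_j^{\text{OLS}}\right), \qquad S_\alpha(z) = \operatorname{sign}(z)\,\max(|z| - \alpha,\, 0), \qquad \hat{\beta}^{\text{OLS}} = (X^T X)^{-1} X^T y = X^T y.$$

$S_\alpha$ is referred to as the soft thresholding operator, since it translates values towards zero (making them exactly zero if they are small enough) instead of setting smaller values to zero and leaving larger ones untouched, as the hard thresholding operator, often denoted $H_\alpha$, would.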
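The difference between the two thresholding operators is easy to see numerically. The following is a small sketch in which the function names and sample values are chosen for illustration only.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Translate every value towards zero by alpha; values within alpha of zero become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def hard_threshold(z, alpha):
    """Zero out values with magnitude at most alpha; leave the others untouched."""
    return np.where(np.abs(z) > alpha, z, 0.0)

z = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print(soft_threshold(z, 1.0))   # large entries are pulled in by 1.0
print(hard_threshold(z, 1.0))   # large entries are kept as they are
```

With a threshold of 1.0, the entries of magnitude 3.0 shrink to magnitude 2.0 under soft thresholding but keep their value under hard thresholding, while all smaller entries are set to zero by both operators.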

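This can be compared to ridge regression, where the objective is to minimize

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X\beta \right\|_2^2 + \lambda \|\beta\|_2^2 \right\},$$

yielding

$$\hat{\beta}_j = (1 + N\lambda)^{-1}\, \hat{\beta}_j^{\text{OLS}},$$

where $\hat{\beta}^{\text{OLS}}$ is the ordinary least squares estimate. So ridge regression shrinks all coefficients by a uniform factor of $(1 + N\lambda)^{-1}$ and does not set any coefficients to zero.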

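It can also be compared to regression with best subset selection, in which the goal is to minimize

$$\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{N} \left\| y - X\beta \right\|_2^2 + \lambda \|\beta\|_0 \right\},$$

where $\|\cdot\|_0$ is the "$\ell^0$ norm", which is defined as $\|z\|_0 = m$ if exactly $m$ components of $z$ are nonzero. In this case, it can be shown that

$$\hat{\beta}_j = H_{\sqrt{N\lambda}}\left(\hat{\beta}_j^{\text{OLS}}\right) = \hat{\beta}_j^{\text{OLS}}\, \mathrm{I}\left( \left|\hat{\beta}_j^{\text{OLS}}\right| > \sqrt{N\lambda} \right),$$

where $\hat{\beta}^{\text{OLS}}$ is the ordinary least squares estimate, $H_\alpha$ is the so-called hard thresholding function and $\mathrm{I}$ is an indicator function (it is 1 if its argument is true and 0 otherwise).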

Therefore, the lasso estimates share features of the estimates from both ridge and best subset selection regression: they shrink the magnitude of all the coefficients, like ridge regression, and also set some of them to zero, as in the best subset selection case. Additionally, while ridge regression scales all of the coefficients by a constant factor, lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it.
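The three estimators can be compared side by side on an orthonormal design. The sketch below assumes each method uses the squared loss $\frac{1}{N}\|y - X\beta\|_2^2$ plus its respective penalty ($\lambda\|\beta\|_1$, $\lambda\|\beta\|_2^2$, or $\lambda\|\beta\|_0$); under that assumption the lasso threshold is $N\lambda/2$, the ridge factor is $(1 + N\lambda)^{-1}$, and the hard threshold is $\sqrt{N\lambda}$. The data, the variable names, and the QR-based construction of an orthonormal $X$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 4, 0.05

# Orthonormal covariates: the Q factor of a random matrix satisfies X.T @ X = I.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = X @ np.array([1.5, -0.4, 0.0, 0.8]) + 0.1 * rng.normal(size=n)

b_ols = X.T @ y                                # OLS estimate when X.T @ X = I

lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - n * lam / 2.0, 0.0)   # soft threshold
ridge = b_ols / (1.0 + n * lam)                                           # uniform scaling
subset = np.where(np.abs(b_ols) > np.sqrt(n * lam), b_ols, 0.0)           # hard threshold

def loss(beta):
    """Squared loss (1/N) * ||y - X beta||_2^2 shared by the three objectives."""
    r = y - X @ beta
    return (r @ r) / n

def obj_lasso(beta):
    return loss(beta) + lam * np.abs(beta).sum()

def obj_ridge(beta):
    return loss(beta) + lam * (beta @ beta)

def obj_subset(beta):
    return loss(beta) + lam * np.count_nonzero(beta)

print("OLS   ", b_ols)
print("lasso ", lasso, obj_lasso(lasso))
print("ridge ", ridge, obj_ridge(ridge))
print("subset", subset, obj_subset(subset))
```

Perturbing any of the three closed-form estimates and re-evaluating the matching objective is a quick way to check that each one is a minimizer of its own problem.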

Correlated covariates

Returning to the general case, in which the different covariates may not be independent, a special case may be considered in which two of the covariates, say $j$ and $k$, are identical for each case, so that $x_{(j)} = x_{(k)}$, where $x_{(j),i} = x_{ij}$. Then the values of $\beta_j$ and $\beta_k$ that minimize the lasso objective function are not uniquely determined. In fact, if there is some solution $\hat{\beta}$ in which $\hat{\beta}_j \hat{\beta}_k \ge 0$, then for any $s \in [0, 1]$, replacing $\hat{\beta}_j$ by $s(\hat{\beta}_j + \hat{\beta}_k)$ and $\hat{\beta}_k$ by $(1 - s)(\hat{\beta}_j + \hat{\beta}_k)$, while keeping all the other $\hat{\beta}_i$ fixed, gives a new solution, so the lasso objective function then has a continuum of valid minimizers.[5] Several variants of the lasso, including the Elastic Net, have been designed to address this shortcoming; they are discussed below.
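The non-uniqueness argument can be checked numerically: with two identical columns, moving weight between the two coefficients changes neither the fitted values nor the $\ell^1$ norm, provided both coefficients have the same sign. The sketch below assumes the objective $\frac{1}{N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; the data and the starting vector `beta` are arbitrary illustrative choices, and `beta` is not claimed to be a minimizer.

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """(1/N) * ||y - X beta||_2^2 + lam * ||beta||_1."""
    r = y - X @ beta
    return (r @ r) / X.shape[0] + lam * np.abs(beta).sum()

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
w = rng.normal(size=n)
X = np.column_stack([x, x, w])                 # columns 0 and 1 are identical
y = 2.0 * x - w + 0.1 * rng.normal(size=n)

beta = np.array([0.7, 0.9, -0.8])              # beta[0] * beta[1] >= 0
total = beta[0] + beta[1]

values = []
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    b = beta.copy()
    b[0], b[1] = s * total, (1.0 - s) * total  # redistribute weight between the duplicates
    values.append(lasso_objective(X, y, b, lam=0.1))
print(values)                                  # the same value (up to rounding) for every s
```

Because the objective is constant along this whole segment, a minimizer lying on the segment implies that every point on it is a minimizer as well.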






