L2 penalty in action
To see how the L2 penalty works, we can use the same simulated linear regression problem we used for the L1 penalty. To fit a ridge regression model, we use the glmnet() function from the glmnet package. As mentioned previously, this function can fit either the L1 or the L2 penalty, and the alpha argument determines which one is applied: alpha = 1 fits the lasso, and alpha = 0 fits ridge regression. This time, we choose alpha = 0. Again, we evaluate a range of lambda values and tune this hyperparameter automatically using cross-validation, which is done with the cv.glmnet() function. We plot the resulting ridge regression object to see the cross-validated error across lambda values:
m.ridge.cv <- cv.glmnet(X[1:100, ], y[1:100], alpha = 0)
plot(m.ridge.cv)
The shape of the curve differs from the lasso in that the error appears to level off toward an asymptote at higher lambda values, but it is still clear that the cross-validated model error increases when the penalty gets too high. As with the lasso, the ridge regression model seems to do well with very low lambda values, which may indicate that the L2 penalty does not improve out-of-sample performance (generalizability) by much on this problem.
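To go from the plot to actual numbers, the selected penalties can be read directly off the cross-validation object. The following is a minimal sketch rather than the analysis above: X_demo, y_demo, and ridge_demo are hypothetical stand-ins, simulated here only so that the block runs on its own, and they are not the X and y used in this section. Note that coef() on a cv.glmnet object uses s = "lambda.1se" unless another value is requested.

```r
library(glmnet)

# Stand-in data (an assumption, not the data used in the text): four highly
# correlated predictors plus one independent predictor, with true
# coefficients 3 (intercept), 1, 1, 1, 1, and 0.
set.seed(1234)
n <- 100
z <- rnorm(n)
X_demo <- cbind(z + rnorm(n, sd = 0.1),
                z + rnorm(n, sd = 0.1),
                z + rnorm(n, sd = 0.1),
                z + rnorm(n, sd = 0.1),
                rnorm(n))
y_demo <- as.numeric(3 + X_demo %*% c(1, 1, 1, 1, 0) + rnorm(n, sd = 0.5))

# Cross-validated ridge regression (alpha = 0)
ridge_demo <- cv.glmnet(X_demo, y_demo, alpha = 0)

# Penalty with the lowest mean cross-validated error
ridge_demo$lambda.min
# Largest penalty whose error is within one standard error of that minimum
ridge_demo$lambda.1se

# Coefficients at each of the two penalties, side by side
cbind(lambda.min = coef(ridge_demo, s = "lambda.min")[, 1],
      lambda.1se = coef(ridge_demo, s = "lambda.1se")[, 1])
```

Comparing the two columns gives a sense of how much extra shrinkage the more conservative lambda.1se choice applies relative to lambda.min.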
Finally, we can compare the OLS coefficients with those from lasso and the ridge regression model:
> cbind(OLS = coef(m.ols), Lasso = coef(m.lasso.cv)[, 1], Ridge = coef(m.ridge.cv)[, 1])
               OLS Lasso   Ridge
(Intercept)  2.958  2.99  2.9919
X[1:100, ]1 -0.082  1.41  0.9488
X[1:100, ]2  2.239  0.71  0.9524
X[1:100, ]3  0.602  0.51  0.9323
X[1:100, ]4  1.235  1.17  0.9548
X[1:100, ]5 -0.041  0.00 -0.0023
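Although ridge regression does not shrink the coefficient for the fifth predictor to exactly 0, its estimate is much smaller in magnitude than the OLS estimate (-0.0023 versus -0.041). The remaining ridge estimates are all slightly shrunken, but quite close to their true values: 3 for the intercept and 1 for each of the first four predictors, with a true value of 0 for the fifth.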
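The reason ridge regression shrinks coefficients without setting them to exactly 0 can be illustrated with the closed-form ridge solution, (X'X + lambda I)^-1 X'y, computed on centered data. This is a base R sketch under stated assumptions: X_demo, y_demo, and ridge_coef() are hypothetical names introduced for illustration, the data are simulated here and are not the data from the text, and lambda in this formula is not on the same scale as the lambda reported by glmnet, which standardizes predictors and scales the loss differently.

```r
# Stand-in data (an assumption): two highly correlated predictors and one
# independent predictor, with true coefficients 3 (intercept), 1, 1, and 0.
set.seed(42)
n <- 100
z <- rnorm(n)
X_demo <- cbind(z + rnorm(n, sd = 0.1),
                z + rnorm(n, sd = 0.1),
                rnorm(n))
y_demo <- as.numeric(3 + X_demo %*% c(1, 1, 0) + rnorm(n, sd = 0.5))

# Center predictors and outcome so the intercept is left unpenalized
Xc <- scale(X_demo, center = TRUE, scale = FALSE)
yc <- y_demo - mean(y_demo)

# Closed-form ridge estimate for a given penalty; lambda = 0 reproduces OLS
ridge_coef <- function(lambda) {
  p <- ncol(Xc)
  drop(solve(crossprod(Xc) + lambda * diag(p), crossprod(Xc, yc)))
}

# One column of slope estimates per penalty value
lambdas <- c(0, 1, 10, 100, 1000)
betas <- sapply(lambdas, ridge_coef)
colnames(betas) <- paste0("lambda=", lambdas)
round(betas, 3)

# The L2 norm of the coefficient vector falls as the penalty grows
sqrt(colSums(betas^2))
```

As lambda grows, the coefficient vector shrinks toward zero as a whole, but no individual coefficient is forced to exactly zero, in contrast to the lasso.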