In this article we will understand why we need regularization, what regularization is, the two main types of regularization, L1 and L2, and the difference between them.
Prerequisites: machine learning basics, linear regression, bias and variance, and evaluating the performance of a machine learning model.
Why do we need regularization?
Suppose we want to predict a student's ACT score using only the student's GPA. This model fails to predict the ACT score for a range of students because it is too simple, and hence it has a high bias.
We now start to add more features that may influence a student's ACT score: attendance percentage, average grades in middle school and junior high, BMI, and average sleep duration. With all of these input features, the model starts to become too complex.
Our model has now learnt the noise in the training data along with the underlying data patterns. When a model fits the noise as well as the pattern, it has high variance and is overfitting.
An overfitted model performs well on training data but fails to generalize.
The goal of our machine learning algorithm is to learn the data patterns and ignore the noise in the data set.
How do we solve the problem of overfitting?
We can solve the problem of overfitting using:
- Regularization technique
- Cross Validation
- Dropout
What is Regularization?
Regularization is a technique that discourages model complexity by adding a penalty to the loss function. This helps to solve the overfitting problem.
Let's understand how penalizing the loss function helps simplify the model.
The loss function is the sum of squared differences between the actual value and the predicted value. For a linear regression with 4 input variables, the prediction is a weighted sum of those features.
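Written out (a sketch of the idea, using w for the feature weights, x_1 … x_4 for the input features, and n for the number of training examples):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$$

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2$$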
As the degree of the input features increases, the model becomes more complex and tries to fit every data point.
When we penalize the weights w_3 and w_4 and shrink them very close to zero, those terms become negligible, which helps simplify the model.
Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting.
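As a sketch, consider a hypothetical model that uses powers of a single feature x up to degree 4:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$

If the penalty drives w_3 and w_4 very close to zero, the cubic and quartic terms become negligible and the model effectively reduces to a much simpler quadratic curve.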
What if the input variables have an impact on the output?
To ensure we still take every input variable into account, we penalize all the weights and keep them small. This also makes the model simpler and less prone to overfitting.
Loss function with regularization term highlighted in red box
We have added a regularization term to the sum of squared differences between the actual and predicted values. The regularization term keeps the weights small, making the model simpler and helping it avoid overfitting.
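In general form it looks like this (a sketch, with the penalty written as a generic term R(w) that the L1 and L2 sections below make concrete):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \, R(w)$$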
λ is the penalty term or regularization parameter, which determines how heavily to penalize the weights.
When λ is zero, the regularization term vanishes and we are back to the original loss function.
when λ is zero
When λ is large, the weights are heavily penalized and become close to zero. This results in a very simple model that has high bias and underfits.
when λ is very large
So what is the right value for λ?
It is somewhere between 0 and a large value. We need to find an optimal value of λ so that the generalization error is small.
A simple approach would be to try different values of λ on a subsample of the data, understand the variability of the loss function, and then use the chosen value on the entire dataset.
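A minimal sketch of this idea with scikit-learn (the data here is synthetic and purely illustrative; scikit-learn calls the regularization parameter alpha):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our student data: 5 input features, one target score.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Hold out part of the data to estimate the generalization error.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_lambda, best_error = None, float("inf")
for lam in [0.001, 0.01, 0.1, 1, 10, 100]:  # candidate values of lambda
    model = Ridge(alpha=lam).fit(X_train, y_train)
    error = mean_squared_error(y_val, model.predict(X_val))
    if error < best_error:
        best_lambda, best_error = lam, error

print("lambda with the smallest validation error:", best_lambda)
```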
What are L1 and L2 regularization?
L1 regularization is also referred to as the L1 norm or Lasso.
With the L1 penalty we shrink some of the parameters all the way to zero, which leads to a sparse solution. In a sparse solution, the majority of the input features have zero weights and only a few features have non-zero weights.
When predicting the ACT score, not all input features have the same influence on the prediction. GPA has a much higher influence on the ACT score than the student's BMI. The L1 penalty will assign a zero weight to BMI, since it does not have a significant impact on the prediction, while GPA will keep a non-zero weight because it is very useful for predicting the ACT score.
L1 regularization therefore performs feature selection: it assigns insignificant input features a zero weight and useful features a non-zero weight.
L1 regularization
In L1 regularization we penalize the sum of the absolute values of the weights. The L1 regularization term is highlighted in the red box.
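Written out (a sketch, same notation as before):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \sum_{i} |w_i|$$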
Lasso produces a model that is simple, interpretable, and uses only a subset of the input features.
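A minimal sketch of this behaviour with scikit-learn (the data is synthetic and purely illustrative: only the first two of ten features actually influence the target, so Lasso is expected to drive most of the remaining weights exactly to zero):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the target depends only on the first two of ten features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients come out exactly zero -- a sparse solution
```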
L2 Regularization or Ridge Regularization
L2 Regularization
In L2 regularization, the regularization term is the sum of the squares of all the feature weights.
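Written out (a sketch, same notation as before):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \sum_{i} w_i^2$$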
L2 regularization forces the weights to be small but does not make them exactly zero, so it gives a non-sparse solution.
L2 is not robust to outliers, because squaring blows up the error contributions of the outliers, and the regularization term then tries to compensate by penalizing the weights.
Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.
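For contrast with the Lasso sketch above, the same synthetic data fitted with Ridge (again purely illustrative; the weights of the irrelevant features are shrunk towards zero but, unlike with Lasso, typically do not become exactly zero):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same synthetic setup as in the Lasso sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=0.1).fit(X, y)
print(ridge.coef_)  # weights are small but generally non-zero -- no sparsity
```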
Difference between L1 and L2 regularization
L1 Regularization
- Penalizes the sum of the absolute values of the weights.
- Gives a sparse solution.
- Can have multiple solutions.
- Has built-in feature selection.
- Is robust to outliers.
- Generates models that are simple and interpretable, but cannot learn complex patterns.
L2 Regularization
- Penalizes the sum of the squares of the weights.
- Gives a non-sparse solution.
- Has a unique solution.
- Has no feature selection.
- Is not robust to outliers.
- Gives better predictions when the output variable is a function of all the input features.
- Is able to learn complex data patterns.
We see that both L1 and L2 regularization have their own strengths and weaknesses.
Elastic net regularization is a combination of both L1 and L2 regularization.
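As a sketch, one common way to write the elastic net loss (the exact parameterization of the mix between the two penalties varies between libraries):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$

The aim is to keep the feature-selection behaviour of L1 while retaining the shrinkage behaviour of L2.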