In this article we will understand why we need regularization, what regularization is, the two main types of regularization, L1 and L2, and the difference between them.
Prerequisites: machine learning basics, linear regression, bias and variance, and evaluating the performance of a machine learning model.
Why do we need regularization?
Suppose we want to predict a student's ACT score using only the student's GPA. This model fails to predict the ACT score for a range of students because it is too simple, and hence it has a high bias.
We now start to add more features that may influence a student's ACT score: attendance percentage, average grades in middle school and junior high, BMI, and average sleep duration. With all of these input features, the model starts to become too complex.
Our model has now learnt the noise in the training data along with the underlying data patterns. When a model fits the noise as well as the pattern, it has high variance and is overfitting.
An overfitted model performs well on training data but fails to generalize.
The goal of our machine learning algorithm is to learn the data patterns and ignore the noise in the data set.
How do we solve the problem of overfitting?
We can solve the problem of overfitting using:
- Regularization technique
- Cross Validation
- Dropout
What is Regularization?
Regularization is a technique that discourages model complexity by adding a penalty to the loss function. This helps to solve the overfitting problem.
Let's understand how penalizing the loss function helps simplify the model.
The loss function is the sum of squared differences between the actual value and the predicted value. For a linear regression with 4 input variables, the prediction is a weighted sum of those features.
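Written out (a sketch of the idea, using w for the feature weights, x_1 … x_4 for the input features, and n for the number of training examples):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$$

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2$$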
As the degree of the input features increases, the model becomes more complex and tries to fit every data point.
When we penalize the weights w_3 and w_4 and shrink them very close to zero, those terms become negligible, which helps simplify the model.
Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting.
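As a sketch, consider a hypothetical model that uses powers of a single feature x up to degree 4:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$

If the penalty drives w_3 and w_4 very close to zero, the cubic and quartic terms become negligible and the model effectively reduces to a much simpler quadratic curve.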
What if the input variables have an impact on the output?
To ensure we still take every input variable into account, we penalize all the weights and keep them small. This also makes the model simpler and less prone to overfitting.
Loss function with regularization term highlighted in red box
We have added a regularization term to the sum of squared differences between the actual and predicted values. The regularization term keeps the weights small, making the model simpler and helping it avoid overfitting.
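In general form it looks like this (a sketch, with the penalty written as a generic term R(w) that the L1 and L2 sections below make concrete):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \, R(w)$$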
λ is the penalty term or regularization parameter, which determines how heavily to penalize the weights.
When λ is zero, the regularization term vanishes and we are back to the original loss function.
when λ is zero
When λ is large, the weights are heavily penalized and become close to zero. This results in a very simple model that has high bias and underfits.
when λ is very large
So what is the right value for λ?
It is somewhere between 0 and a large value. We need to find an optimal value of λ so that the generalization error is small.
A simple approach would be to try different values of λ on a subsample of the data, understand the variability of the loss function, and then use the chosen value on the entire dataset.
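A minimal sketch of this idea with scikit-learn (the data here is synthetic and purely illustrative; scikit-learn calls the regularization parameter alpha):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our student data: 5 input features, one target score.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Hold out part of the data to estimate the generalization error.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_lambda, best_error = None, float("inf")
for lam in [0.001, 0.01, 0.1, 1, 10, 100]:  # candidate values of lambda
    model = Ridge(alpha=lam).fit(X_train, y_train)
    error = mean_squared_error(y_val, model.predict(X_val))
    if error < best_error:
        best_lambda, best_error = lam, error

print("lambda with the smallest validation error:", best_lambda)
```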
What are L1 and L2 regularization?
L1 regularization is also referred to as the L1 norm or Lasso.
With the L1 penalty we shrink some of the parameters all the way to zero, which leads to a sparse solution. In a sparse solution, the majority of the input features have zero weights and only a few features have non-zero weights.
When predicting the ACT score, not all input features have the same influence on the prediction. GPA has a much higher influence on the ACT score than the student's BMI. The L1 penalty will assign a zero weight to BMI, since it does not have a significant impact on the prediction, while GPA will keep a non-zero weight because it is very useful for predicting the ACT score.
L1 regularization therefore performs feature selection: it assigns insignificant input features a zero weight and useful features a non-zero weight.
L1 regularization
In L1 regularization we penalize the sum of the absolute values of the weights. The L1 regularization term is highlighted in the red box.
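Written out (a sketch, same notation as before):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \sum_{i} |w_i|$$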
Lasso produces a model that is simple, interpretable, and uses only a subset of the input features.
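A minimal sketch of this behaviour with scikit-learn (the data is synthetic and purely illustrative: only the first two of ten features actually influence the target, so Lasso is expected to drive most of the remaining weights exactly to zero):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the target depends only on the first two of ten features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients come out exactly zero -- a sparse solution
```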
L2 Regularization or Ridge Regularization
L2 Regularization
In L2 regularization, the regularization term is the sum of the squares of all the feature weights.
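Written out (a sketch, same notation as before):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda \sum_{i} w_i^2$$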
L2 regularization forces the weights to be small but does not make them exactly zero, so it gives a non-sparse solution.
L2 is not robust to outliers, because squaring blows up the error contributions of the outliers, and the regularization term then tries to compensate by penalizing the weights.
Ridge regression performs better when all the input features influence the output and all the weights are of roughly equal size.
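For contrast with the Lasso sketch above, the same synthetic data fitted with Ridge (again purely illustrative; the weights of the irrelevant features are shrunk towards zero but, unlike with Lasso, typically do not become exactly zero):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same synthetic setup as in the Lasso sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=0.1).fit(X, y)
print(ridge.coef_)  # weights are small but generally non-zero -- no sparsity
```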
Difference between L1 and L2 regularization
L1 Regularization
- Penalizes the sum of the absolute values of the weights.
- Gives a sparse solution.
- Can have multiple solutions.
- Has built-in feature selection.
- Is robust to outliers.
- Generates models that are simple and interpretable, but cannot learn complex patterns.
L2 Regularization
- Penalizes the sum of the squares of the weights.
- Gives a non-sparse solution.
- Has a unique solution.
- Has no feature selection.
- Is not robust to outliers.
- Gives better predictions when the output variable is a function of all the input features.
- Is able to learn complex data patterns.
We see that both L1 and L2 regularization have their own strengths and weaknesses.
Elastic net regularization is a combination of both L1 and L2 regularization.
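As a sketch, one common way to write the elastic net loss (the exact parameterization of the mix between the two penalties varies between libraries):

$$\text{Loss} = \sum_{j=1}^{n} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$

The aim is to keep the feature-selection behaviour of L1 while retaining the shrinkage behaviour of L2.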