Bias & Variance in Machine Learning

Linear regression is a machine learning algorithm used to predict a quantitative target from independent variables that are modeled linearly: it fits a line, a plane, or a hyperplane through the data and reads the predictions off it. For now, let’s call this the best-fit line (for easier understanding). Points from the training data rarely lie exactly on the best-fit line, and that makes perfect sense because no data is perfect. That is why we fit a model and make predictions in the first place, rather than just plotting an arbitrary line.

The linear regression line cannot curve to pass through every training data point, so at times it is unable to capture the true relationship. This is called bias. In the linear regression equation itself, the intercept is the term usually referred to as the bias.

The target (y) has some values in the dataset, and the regression equation, which for two independent variables has the form y = Intercept + β1·x1 + β2·x2, calculates the predicted values for the same. If the “Intercept” itself is very high and comes close to the predicted y values, then the changes in y caused by the other two parts of the equation (the independent variables x1 and x2) would be small. This means that the amount of variance explained by x1 and x2 would be small, and that would eventually produce an underfitting model. An underfitting model has a low R-squared (the share of variance in the target that is explained by the independent variables).
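To make the link between a dominant intercept and a low R-squared concrete, here is a minimal sketch in plain Python. The data (y = x squared on a few symmetric points) and the helper names `fit_line` and `r_squared` are assumptions made for illustration, and a single independent variable is used instead of two to keep it short.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, preds):
    """Share of the variance in ys that the predictions explain."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data (assumed for illustration): the true relationship is curved, y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
preds = [intercept + slope * x for x in xs]
r2 = r_squared(ys, preds)
print("intercept:", intercept, "slope:", slope, "R-squared:", r2)
```

On this toy data the fitted line is flat: the intercept equals the mean of y, the slope is zero, and x explains none of the variance, which is the underfitting case described above.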

Underfitting can also be understood by thinking about how the best-fit line/plane is obtained in the first place. The best-fit line/plane captures the relationship between the target and the independent variables. If this relationship is captured to a very high extent, the bias is low, and vice versa.

Now that we understand what bias is, and how a high bias causes an underfitting model, it becomes clear that for a robust model, we need to remove this underfit.

If we create a curve that passes through every data point, and so reproduces the existing relationship between the independent variables and the dependent variable, there would be no bias in the model.

A model that has overfitted on the train data gives rise to a new phenomenon called “variance”. Time to consider a few models:

On calculating the errors on the training data (test data is not in the picture yet), we observe the following:

Now, let’s bring in the test data, and understand variance.

If the model has overfitted on the train data, it “understands” and “knows” that data to such an extent that it is likely to struggle with the test data and fail to capture the relationship when test data is given as input. In broader terms, there will be a large difference in fit between the train data and the test data (the train data shows a near-perfect fit, while the fit on the test data is poor). This difference in fit is referred to as “variance”, and it usually arises when the model has learned only the train data and struggles with any new input given to it.
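The train/test gap can be seen in a small, hedged sketch. A 1-nearest-neighbour lookup stands in here for a model that has memorized the train data; that choice, the toy data (y = 2x with alternating noise of +1 and -1), and all function names are assumptions for illustration, not the models from the article.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def memorizer_predict(x, train_xs, train_ys):
    """Return the target of the closest training point (1-nearest neighbour)."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]


# Toy data (assumed): y = 2x plus noise that alternates between +1 and -1.
train_xs = [float(i) for i in range(10)]
train_ys = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_xs)]
# Test points sit between the training points and carry their own noise.
test_xs = [x + 0.4 for x in train_xs]
test_ys = [2 * x + (-1 if i % 2 == 0 else 1) for i, x in enumerate(test_xs)]

intercept, slope = fit_line(train_xs, train_ys)
line_train = mse(train_ys, [intercept + slope * x for x in train_xs])
line_test = mse(test_ys, [intercept + slope * x for x in test_xs])

memo_train = mse(train_ys, [memorizer_predict(x, train_xs, train_ys) for x in train_xs])
memo_test = mse(test_ys, [memorizer_predict(x, train_xs, train_ys) for x in test_xs])

print("straight line  train MSE:", line_train, " test MSE:", line_test)
print("memorizer      train MSE:", memo_train, " test MSE:", memo_test)
```

The memorizer reproduces the train data exactly, so its train error is zero, yet it has also memorized the noise and does much worse on the test points. The straight line has a similar error on both sets, which is the smaller difference in fit.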

On validating the above models on test data, we notice this:

Now we understand that both bias and variance can cause problems in our prediction model. So, how do we go about solving this issue?

A couple of terms to understand before we proceed:

Coming back to the solution, we can do the following to try to build a trade-off between the bias and variance being caused:

Usually, a model is built on train data and tested on the same data, but there’s one more thing that people prefer: testing the model on a held-out part of the train data, which is called the validation data.

As mentioned, model validation is done on a part of the train data. So, if we keep choosing a new set of data points from the train data for validation in each iteration, and keep averaging the results obtained from these sets, we are doing cross-validation. This gives a more reliable picture of how the model behaves on the train data and a way to check whether it is overfitting.
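The procedure described above can be sketched in plain Python as follows. The helper names (`k_fold_splits`, `cross_val_mse`), the contiguous, unshuffled folds, the one-feature linear model, and the toy data are all assumptions for illustration; in practice the rows are often shuffled before splitting.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_val_mse(xs, ys, k):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(xs), k):
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


# Toy data (assumed): y = 3x + 1 with alternating noise of +0.5 and -0.5.
xs = list(range(10))
ys = [3 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

avg_validation_mse = cross_val_mse(xs, ys, k=5)
print("average validation MSE over 5 folds:", avg_validation_mse)
```

Every data point is used for validation exactly once, so the averaged score is less dependent on one lucky or unlucky split than a single hold-out set would be.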

Forward Chaining: With time-series data, K-Fold CV and Leave-One-Out CV can create a problem. Some years may show a pattern that other years don’t, so using random sets of data for cross-validation would not make sense, and existing trends could go unnoticed, which is not what we want. In this kind of case, a forward-chaining method is usually used instead: each fold’s train set is formed by adding the next consecutive year to the previous train set, and the model is validated on a test set that contains only the year immediately following the latest year in that train set.
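A minimal sketch of how forward-chaining folds could be generated is shown below. The function name `forward_chaining_splits` and the list of years are hypothetical, chosen only to illustrate the growing train set and the single following year used for testing.

```python
def forward_chaining_splits(periods):
    """Yield (train_periods, test_period) pairs in chronological order.

    Each fold trains on every period seen so far and tests on the next one.
    """
    for i in range(1, len(periods)):
        yield periods[:i], periods[i]


# Hypothetical years, already sorted chronologically.
years = [2015, 2016, 2017, 2018, 2019]

for train_years, test_year in forward_chaining_splits(years):
    print("train on", train_years, "-> test on", test_year)
```

Because the test year always comes after every train year, the model is never validated on data that precedes what it was trained on.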

Regularization is a technique that helps in reducing variance, usually at the cost of a small increase in bias, by penalizing the beta coefficients attached to our model’s independent variables.
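To show what penalizing a beta coefficient does, here is a hedged sketch of an L2 (ridge) penalty for a single independent variable. Ridge is picked here only as one common form of regularization; the function name `ridge_line`, the penalty values, and the toy data are assumptions for illustration. With centered data and an unpenalized intercept, the penalized slope has the closed form sxy / (sxx + lam).

```python
def ridge_line(xs, ys, lam):
    """One-feature ridge regression: returns (intercept, slope).

    The penalty lam applies to the slope only; lam = 0 gives ordinary least squares.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # a larger lam shrinks the slope toward zero
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (assumed): an exact line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slopes = {lam: ridge_line(xs, ys, lam)[1] for lam in (0, 10, 30)}
for lam, slope in slopes.items():
    print("penalty", lam, "-> slope", slope)
```

As the penalty grows, the coefficient shrinks toward zero, so the fitted line reacts less to any one training set (lower variance) while drifting away from the unpenalized fit (higher bias).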

I’ve written a whole article on “Feature Selection in Machine Learning”, where I have described Regularization and its types in much more depth. Feel free to check it out here:

There is no perfect model. It has to be improved by understanding its imperfections and using them in a positive manner. Once you are able to identify that bias or variance exists in your model, there are a ton of things you can do to change that. You may try feature selection and feature transformation as well. You may try removing some variables that contribute to overfitting. Based on what is possible at that moment, a decision can be made, and the model can be improved wherever there is room for it.

Thank you for reading! Happy learning!

Bias & Variance in Machine Learning was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.
