Machine Learning: How to Handle Class Imbalance

When building a classification model, you may get seemingly great results (a high accuracy score) only to realize that your model is assigning every observation to a single class. This is caused by class imbalance: a problem in machine learning where the number of observations in one class significantly outnumbers the number in another. To illustrate what class imbalance looks like, say you have a two-class dataset containing 50 diabetes patients and 5,000 non-diabetes patients. A model trained on this data will tend to classify every patient as a non-diabetes patient because it cannot pick up on the patterns that distinguish the diabetes patients. The model would achieve roughly 99% accuracy simply by labeling every patient as a non-diabetes patient, yet it would be useless for identifying the patients who actually have diabetes.
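To see the problem concretely, here is a minimal sketch (assuming scikit-learn is installed, with synthetic labels standing in for the diabetes example) of a model that predicts only the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 5,000 non-diabetes patients (class 0) and 50 diabetes patients (class 1).
y_true = np.array([0] * 5000 + [1] * 50)

# A "model" that predicts the majority class for every single patient.
y_pred = np.zeros_like(y_true)

# Roughly 0.99, even though the model catches zero diabetes cases.
print(accuracy_score(y_true, y_pred))
```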

Many datasets have an uneven number of instances in each class, and a small difference is usually acceptable. As a rule of thumb, if a two-class dataset is split more unevenly than 65% to 35%, it should be treated as imbalanced. If you are working with more than two classes and are unsure whether the dataset is imbalanced, you can always train your models without making any adjustments and check whether they perform appropriately or simply default to predicting certain classes.
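For a quick check of the class proportions, a sketch like the following works, assuming your labels are in a pandas Series (the variable name y is hypothetical):

```python
import pandas as pd

# Hypothetical labels: 5,000 majority-class and 50 minority-class observations.
y = pd.Series([0] * 5000 + [1] * 50)

# normalize=True reports proportions instead of raw counts.
print(y.value_counts(normalize=True))
# 0    0.990099
# 1    0.009901
```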

As discussed earlier, the accuracy score is not a good metric to use when there is class imbalance in your data. Metrics that are more informative in this situation include precision, recall, the F1-score, the confusion matrix, and ROC AUC, since they reveal how the model performs on each class rather than in aggregate. Beyond switching metrics, there are also several techniques for addressing the imbalance itself.
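As a sketch (assuming scikit-learn, and reusing the y_true and y_pred arrays from the accuracy example above), these metrics immediately expose the problem that the accuracy score hides:

```python
from sklearn.metrics import classification_report, confusion_matrix

# The confusion matrix shows that all 50 minority cases are misclassified.
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, and F1; zero_division=0 avoids a warning,
# since the model never predicts the minority class.
print(classification_report(y_true, y_pred, zero_division=0))
```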

The first option might be self-explanatory: by collecting more data, you may be able to create a more balanced dataset. If additional data is available that can help balance the classes, this is a simple and effective way to combat class imbalance.

You can also resample your dataset to make it more balanced, either by adding copies of instances from the minority class (over-sampling) or by deleting instances from the majority class (under-sampling). Both approaches are simple to implement, as sketched below. As a rule of thumb, use under-sampling when you have a very large dataset and over-sampling when your dataset is small. Either way, it doesn't hurt to try both methods and compare the results.
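Here is a minimal sketch of both approaches using sklearn.utils.resample (the DataFrame and column names are hypothetical):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 5,000 majority rows, 50 minority rows.
df = pd.DataFrame({"feature": range(5050),
                   "label": [0] * 5000 + [1] * 50})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Over-sampling: draw minority rows with replacement until the classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Under-sampling: keep only a random subset of the majority rows.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled["label"].value_counts())   # 5000 / 5000
print(undersampled["label"].value_counts())  # 50 / 50
```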

There are also algorithms that modify the samples themselves, such as SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples, and Tomek links, which clean up the class boundary.

SMOTE is an over-sampling method that creates synthetic samples of the minority class. It works by selecting a minority observation, finding its nearest minority-class neighbors, and generating new synthetic samples along the line segments connecting them.
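A minimal sketch, assuming the imbalanced-learn (imblearn) library is installed and using a synthetic dataset in place of real data:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic two-class dataset with roughly a 99/1 class split.
X, y = make_classification(n_samples=5050, weights=[0.99], random_state=42)

# fit_resample interpolates new minority samples between nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

print(sum(y == 1), "->", sum(y_res == 1))  # minority grows to match the majority
```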

Tomek links work by detecting pairs of observations from opposite classes that are each other's nearest neighbors and removing the majority instance of each pair. The goal of Tomek links is to sharpen the border between the minority and majority classes, making the minority region more distinct to the model.
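A sketch of the same idea with imbalanced-learn's TomekLinks, reusing the synthetic X and y from the SMOTE example above:

```python
from imblearn.under_sampling import TomekLinks

# Finds cross-class nearest-neighbor pairs and drops the majority member.
X_res, y_res = TomekLinks().fit_resample(X, y)

print(len(y), "->", len(y_res))  # only a few boundary samples are removed
```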
