Handling `missing` data?

These can appear as NaN, undefined, or null, or maybe as something else entirely.

But what should we do with them? Should we drop those rows, or should we replace these values? Let's see how to get rid of these missing values in a sensible manner.
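Before deciding, it helps to see how much is actually missing. Here is a minimal sketch on a small hypothetical DataFrame, counting the gaps per column and showing the "drop those rows" option:

```python
import pandas as pd
import numpy as np

# A hypothetical DataFrame with a few missing entries
df = pd.DataFrame({
    "age":  [25, np.nan, 34, 29],
    "city": ["NY", "LA", None, "SF"],
})

print(df.isna().sum())  # count missing values per column
print(df.dropna())      # the "drop those rows" option
```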

Now, some practitioners of data science may say to do 'nothing' with these values. Do not do this: most algorithms will throw an error when they encounter data with missing values.
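For example, most scikit-learn estimators reject NaN inputs outright. A small illustration, using made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [np.nan], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# This raises a ValueError complaining that the input contains NaN
LinearRegression().fit(X, y)
```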

So, what can we do to fill these missing values?
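The first option is to replace every missing value with the mean of its column. A minimal sketch, using a small hypothetical numeric DataFrame:

```python
import pandas as pd
import numpy as np

# A hypothetical DataFrame with missing numeric values
df = pd.DataFrame({
    "age":    [25, np.nan, 34, 29, np.nan],
    "salary": [50_000, 62_000, np.nan, 58_000, 61_000],
})

# Replace each NaN with the mean of its own column
df_filled = df.fillna(df.mean())
print(df_filled)
```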

This is a quick fix, but the mean is not a robust statistic, and it does not extend to categorical features. Also, if the data is skewed the mean is a misleading summary, and this approach does not take correlations between features into account.

This also affects the variance of the resulting dataset, so be careful: training ML algorithms on data whose gaps were filled with the mean can result in high bias.

Now let's consider a new DataFrame, one with categorical features.

We can replace the missing values with the mode, i.e. the most frequent value in each column. Again, this does not take correlations into account and can introduce bias by inadvertently assigning too many rows to one category.
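A minimal sketch of mode imputation, again on a hypothetical DataFrame:

```python
import pandas as pd
import numpy as np

# A hypothetical DataFrame with categorical features
df = pd.DataFrame({
    "color": ["red", np.nan, "blue", "red", np.nan],
    "size":  ["S", "M", np.nan, "M", "M"],
})

# df.mode() may return several rows when there are ties; take the first
df_filled = df.fillna(df.mode().iloc[0])
print(df_filled)
```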

But the approaches mentioned above work well primarily with small datasets. What if you have a big dataset?

Let’s take a DataFrame with a few more rows and columns.

Now let's build such a DataFrame and fill in the missing values using interpolation along each column.
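Here is a minimal sketch on a hypothetical two-column DataFrame (the polynomial strategy additionally requires SciPy to be installed):

```python
import pandas as pd
import numpy as np

# A hypothetical DataFrame with gaps scattered through each column
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, 20.0, np.nan, np.nan, 50.0, 55.0],
})

# Linear interpolation along each column (the default strategy)
print(df.interpolate(method="linear"))

# Polynomial interpolation needs an explicit order (SciPy required)
print(df.interpolate(method="polynomial", order=2))
```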

But what does interpolation do? It fits a function to your existing data and then uses that function to estimate the missing values.

There are different interpolation strategies available in Pandas, such as linear or polynomial. Remember, if you use the polynomial strategy, you need to specify the order of the polynomial.

Interpolation is more computationally expensive than the two methods presented above but, depending on the strategy used, it can give better results.

The three methods shown above work entirely within the Pandas framework. But if you have access to other libraries like scikit-learn, you can also use the KNNImputer. Let's consider using it with the DataFrame used in method 3.

KNNImputer works on the principle that similar points tend to have similar values. By checking the k-nearest neighbors of a sample with a missing value, we can impute that value based on the neighborhood, i.e. the closest 'k' points.
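A minimal sketch, reusing the same kind of hypothetical DataFrame as in method 3 (scikit-learn required):

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# The same hypothetical DataFrame used for interpolation above
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, 20.0, np.nan, np.nan, 50.0, 55.0],
})

# Each missing value is imputed from the mean of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)
```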

This is more versatile than the three methods mentioned above because it works for many kinds of data: continuous, discrete, and, once numerically encoded, categorical.

If you know of any other common ways to fill in the missing values in your datasets, please mention them in the comments.

I hope you find this article interesting and useful.
