The Data Daily

Dealing with sparse categorical variables in predictive modeling

One of the biggest challenges a data scientist must deal with is finding an efficient way to numerically encode qualitative features: only numerical representations of categorical variables can be used as input to predictive models. The best-known method is one-hot encoding, which works by creating dummy variables. Hence, if a qualitative column has n modalities, n columns are added to the dataset.
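A minimal sketch of dummy encoding with pandas (the column name and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"fuel_type": ["petrol", "diesel", "petrol", "electric"]})

# One-hot (dummy) encoding: one new column per modality.
dummies = pd.get_dummies(df["fuel_type"], prefix="fuel_type")
print(dummies)
# A column with n modalities becomes n dummy columns.
```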

Even if dummy encoding is, most of the time, an effective and flexible way to reach good performance, there are situations in which it is worth exploring other methods, such as frequency encoding and target encoding. The former replaces each modality with its frequency inside the training set, while the latter replaces it with the mean value of the target variable observed for that modality.
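Both encodings can be sketched in a few lines of pandas, assuming a hypothetical brand column and a numeric target:

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a target.
train = pd.DataFrame({
    "brand":  ["audi", "audi", "bmw", "fiat", "fiat", "fiat"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Frequency encoding: replace each modality with its relative frequency
# in the training set.
freq = train["brand"].value_counts(normalize=True)
train["brand_freq"] = train["brand"].map(freq)

# Target encoding: replace each modality with the mean of the target
# observed for that modality in the training set.
target_means = train.groupby("brand")["target"].mean()
train["brand_target"] = train["brand"].map(target_means)

print(train)
```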

The optimization of predictive models often involves an iterative process in which the minimum of a regularized error function is searched for. This is possible thanks to gradients. Sometimes, using a continuous feature space can make this process easier and yield more statistical power. Another popular method is label encoding: each category is assigned a value from 1 to N, where N is the number of categories of the feature.
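A minimal label-encoding sketch with pandas, assuming a hypothetical column (factorize returns codes starting at 0, so 1 is added to match the 1-to-N convention above):

```python
import pandas as pd

# Hypothetical categorical column.
s = pd.Series(["suv", "sedan", "suv", "coupe"])

# Label encoding: map each category to an integer code from 1 to N.
codes, categories = pd.factorize(s)
encoded = codes + 1

print(dict(zip(categories, range(1, len(categories) + 1))))
print(encoded.tolist())
```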

Dummy encoding is an effective technique to manage qualitative variables, but when the number of categories is huge it may result in far too many columns. An excess of sparsity in the data can lead to problems such as higher memory and computation costs and a greater risk of overfitting.

On the other hand, target and frequency encodings generate only a one-dimensional representation. Furthermore, neither method takes the ordinality of the data into consideration. Is there a way to arrange qualitative variables so that the similarity between elements is preserved?

The challenge is to encode the concatenated string of qualitative features in such a way that similar combinations of modalities receive similar encoding vectors. If that happens, we can say that they are meaningfully represented in a continuous space.
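A rough sketch of the concatenation step, assuming a hypothetical car dataset with a handful of qualitative columns:

```python
import pandas as pd

# Hypothetical car dataset: each row is a vehicle described by
# several qualitative features.
cars = pd.DataFrame({
    "make":         ["porsche", "porsche", "fiat"],
    "model":        ["carrera", "carrera", "panda"],
    "fuel_type":    ["petrol", "petrol", "petrol"],
    "transmission": ["manual", "automatic", "manual"],
})

# Concatenate the qualitative features of each row into a list of tokens,
# i.e. one "sentence" per vehicle, to be fed to a Word2Vec model.
sentences = cars.astype(str).values.tolist()
print(sentences[0])   # ['porsche', 'carrera', 'petrol', 'manual']
```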

A Word2Vec model allows us to assign to each word in a text a list of numerical coordinates. Let's see it in a one-dimensional space:

As you can see, similar combinations of tokens have similar vector coordinates. Word2vec is a simple neural network that is able to map words into a pre-defined number of dimensions, called embeddings. It works by learning to predict, for each word, the words most likely to appear in its neighbourhood. You can find a lot of articles out there which explain its mechanism very well:

The result is a set of coordinates associated with each word. These coordinates can be averaged, so that we get the mean value for each embedding dimension and the rows are placed into a new coordinate system:

In the table above, we can observe that the two Porsche Carreras are extremely close in the embeddings space. The only difference is that the first one is the S model, while the second one is not.
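A minimal sketch of this step, assuming gensim (version 4 or later) is used for the Word2Vec model; the token lists below are made up for illustration and are not the article's actual dataset:

```python
import numpy as np
from gensim.models import Word2Vec   # assumes gensim >= 4.0 is installed

# One "sentence" per row: the concatenated qualitative features of each
# vehicle (hypothetical values, as in the previous sketch).
sentences = [
    ["porsche", "carrera", "s", "petrol", "manual"],
    ["porsche", "carrera", "petrol", "manual"],
    ["fiat", "panda", "petrol", "manual"],
]

# Fit Word2Vec: every token gets a vector of `vector_size` coordinates.
model = Word2Vec(sentences, vector_size=5, window=4, min_count=1, sg=0, seed=42)

# Average the token vectors of each row to obtain a single set of
# coordinates per vehicle in the embeddings space.
row_coordinates = np.array([
    np.mean([model.wv[token] for token in row], axis=0)
    for row in sentences
])
print(row_coordinates.shape)   # (3, 5)
```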

Same vehicles will have the same encoding representation:

Since there are far too many variables involved in the analysis, it is almost impossible to provide an exhaustive visual representation, but a coloured scatterplot is useful to understand what the technique is doing:

As a result, semantically similar car models are expected to belong to the same context, hence you will find them closer in the embeddings space. We are visualizing only two out of five embedding dimensions, coloured by one of the five variables.

Instead, what happens when using dummy encoding on this example dataset? The complexity skyrockets! As a result of dummy encoding, we get 1,627 columns from the original 19!

However, to make the Word2Vec model work correctly, we need to define some parameters:

It represents the minimum frequency a word must have in order to be included in the vocabulary.

It is the maximum distance between the current word and its neighbouring words: tokens further away than this window are not considered related to the current word. A good value is the average number of tokens in each concatenated string.

It defines the number of embedding dimensions used to map the tokens. The average value of each dimension then defines the coordinates of the concatenated string.

It defines the training algorithm used; there are two alternatives: CBOW (Continuous Bag of Words) and Skip-gram. On the KNIME Hub, you can find a component which, given a set of input parameters, estimates the Word2Vec model using Python and automatically computes the average vector.
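Assuming gensim is the underlying library, the four parameters above roughly correspond to min_count, window, vector_size and sg; a hypothetical configuration could look like this:

```python
from gensim.models import Word2Vec   # assumes gensim >= 4.0

# Hypothetical corpus: one list of tokens per concatenated row.
sentences = [
    ["porsche", "carrera", "petrol", "manual"],
    ["fiat", "panda", "petrol", "manual"],
]

model = Word2Vec(
    sentences,
    min_count=1,     # minimum frequency for a token to enter the vocabulary
    window=4,        # maximum distance between the current token and its
                     # neighbours; roughly the average number of tokens per row
    vector_size=5,   # number of embedding dimensions assigned to each token
    sg=0,            # training algorithm: 0 = CBOW, 1 = Skip-gram
)
```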

Something really challenging is the choice of the number of embedding dimensions needed to correctly represent the data. An analysis of the variability can be useful to define the right cut-off: for example, through PCA we can inspect the cumulative variance of the embedding dimensions explained by the number of extracted components:
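A short sketch of this inspection with scikit-learn, using a random placeholder matrix in place of the real averaged embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder matrix of averaged embeddings: one row per vehicle,
# one column per embedding dimension (here 14, as in the example below).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 14))

# Inspect how much variability each additional principal component captures.
pca = PCA().fit(embeddings)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
for k, cv in enumerate(cumulative_variance, start=1):
    print(f"{k} components -> {cv:.2%} of the variance")
```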

Suppose that fourteen embeddings have been extracted. With four components we capture 90% of the variability. On the other hand, the graph becomes almost flat beyond seven components, hence we can conclude that extracting more than seven embeddings would not add any useful information. For the purposes of this article, I chose to extract five embeddings and two PCA components, which capture 90% of the variability.

Now it looks cleaner than before!

Finally, we can use the Support Vector Machine algorithm to classify vehicles into High Consumption and Low Consumption. The binary target variable has been created by splitting the CO2 emission at its median value: 246. The KNIME workflow used for the analysis can be found at the following link on the KNIME Hub:

The advantage of extracting only two components is that we can visualize how the target variable is distributed across the whole feature space:

We will find the separating vectors using a Gaussian kernel, which allows us to map the two-dimensional feature space into a higher-dimensional one, where the classes can be separated by a linear decision surface. The radial basis function K(x, x′) = exp(−‖x − x′‖² / (2σ²)) computes pairwise similarities between elements and projects them closer together in the new space the smaller their distance is:

We need to find the optimal values of gamma to define the most appropriate shape of the decision surface:

The parameter sigma corresponds to the standard deviation of the Gaussian kernel (with gamma = 1 / (2σ²)):

A low value of sigma makes the decision surface strongly dependent on the observations closest to the border.
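The following sketch reproduces this setup with scikit-learn on synthetic data (not the article's KNIME workflow): an RBF-kernel SVM whose gamma is tuned by cross-validation and evaluated with AUC:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Synthetic stand-in for the two principal components extracted from the
# embeddings, and a binary target built by splitting CO2 emission at its
# median (246 in the article).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
co2 = 246 + 40 * X[:, 0] + 10 * rng.normal(size=500)
y = (co2 > np.median(co2)).astype(int)   # 1 = High Consumption, 0 = Low Consumption

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2), with gamma = 1 / (2 * sigma^2).
# Large gamma (small sigma) -> surface strongly driven by points near the border;
# small gamma (large sigma) -> smoother decision surface.
grid = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    param_grid={"gamma": [0.01, 0.1, 1, 10], "C": [1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(f"best gamma: {grid.best_params_['gamma']}, test AUC: {auc:.2f}")
```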

Hence, with just two principal components we reach a satisfying result of 0.87 AUC. As we have seen, dummy encoding generates a huge number of columns. Indeed, looking at the frequency distribution of the unique combinations of the five embedded variables, most of them occur only once or twice. As a consequence, the model would have seen only a few training instances for each unique combination, affecting its ability to learn patterns and make good predictions.
