In my most recent default-scoring data science projects, I wanted an automatic tool that could warn me, especially during the development stage, when my model’s predictions were incoherent, whether because of a problem in the data processing or because the model simply had to be retrained.

The main difficulty was that a sharp increase or decrease in scores can be either incoherent or legitimate. In the legitimate case, the change in a score is the consequence of an actual and significant change in an observation’s data.

Making that distinction is complicated without knowing the distribution of the model’s outputs. Interpretability tools such as SHAP or LIME can help, but that means going through the predictions one by one.

An example of SHAP values computed on an XGBoost model trained on the data from the “Home Credit Default Risk” Kaggle competition. Image by author.

The key principle we’ll use here is the Central Limit Theorem, with which we’ll study normal or pseudo-normal distributions derived from our model’s predictions. Thanks to that, we’ll be able to deduce whether a prediction series is coherent or anomalous with a method that can be fully automated.

Let’s see our model as a dice roll. We don’t know if the dice is loaded or not, which means we do not know its probability distribution.

However, if we roll the dice 1000 times and average the results, we’ll get a value that is stable and won’t change much if we repeat the process. For instance, here’s the graph we get by doing 3000 iterations of 1000 rolls for an unloaded dice:

Here, even if we don’t know that the dice is unloaded, we know that the means of the rolls have a normal distribution, in which 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean.
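The experiment above can be reproduced in a few lines. This is a minimal sketch using NumPy, not the code that produced the original graph; the seed and variable names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

N_SERIES, N_ROLLS = 3000, 1000

# Each row is one series of 1000 rolls of a fair six-sided dice.
rolls = rng.integers(low=1, high=7, size=(N_SERIES, N_ROLLS))
series_means = rolls.mean(axis=1)

center = series_means.mean()
spread = series_means.std(ddof=1)

# Share of series means within 1, 2 and 3 standard deviations of the center.
coverage = {
    k: float(np.mean(np.abs(series_means - center) <= k * spread))
    for k in (1, 2, 3)
}
print(f"center={center:.3f}, spread={spread:.4f}, coverage={coverage}")
```

The three coverage values should land close to the 68%, 95% and 99.7% figures, even though a single roll is far from normally distributed.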

If the dice suddenly changes and becomes loaded, it will immediately be visible:

Here, the dice was loaded in favor of the 6: it had a 30% probability of rolling a 6, and 14% for each of the other values. The probability of getting a result like that if the dice were unchanged is extremely low, so we can safely assume that the dice has been altered.
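To put a number on "extremely low", one option is to express the loaded series' mean as a distance from the fair-dice reference, in standard deviations. The sketch below assumes the 30% / 14% weights given above; the seed and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed
faces = np.arange(1, 7)

# Reference distribution: means of 3000 series of 1000 fair rolls.
fair_means = rng.choice(faces, size=(3000, 1000)).mean(axis=1)

# One new series of 1000 rolls where the 6 has probability 0.30
# and each other face has probability 0.14.
loaded_p = [0.14, 0.14, 0.14, 0.14, 0.14, 0.30]
loaded_mean = rng.choice(faces, size=1000, p=loaded_p).mean()

# Distance of the new mean from the reference, in standard deviations.
z = (loaded_mean - fair_means.mean()) / fair_means.std(ddof=1)
print(f"loaded mean={loaded_mean:.3f}, z-score={z:.1f}")
```

The expected mean of this loaded dice is 3.9 against 3.5 for a fair one, which is several standard deviations away from the reference means.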

What I did with my model was the same. I took the average of the scores in each monthly batch of predictions previously made, and checked whether the newest predictions were consistent with them.

Even if some observations have legitimately had a significant increase or decrease in their score, the mean of all the scores in a batch should still follow an approximately normal distribution. And that’s what we can see:

If a series of predictions has a mean that lies more than two standard deviations away from the mean of the previous series’ means, we can treat it as abnormal and analyse it carefully.

This method can also be applied to the monthly predictions’ standard deviations.
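As a rough illustration of that check, here is a sketch of a helper that compares a new batch's statistic (mean or standard deviation) with the same statistic from past batches. The function name `flag_batch`, the two-standard-deviation default and the synthetic Beta-distributed scores are all assumptions made for the example, not the model or data from this article.

```python
import numpy as np

def flag_batch(past_stats, new_scores, stat=np.mean, n_std=2.0):
    """Return (z, is_anomalous) for a statistic of a new batch of scores.

    past_stats holds the same statistic computed on each previous batch.
    """
    past = np.asarray(past_stats, dtype=float)
    center, spread = past.mean(), past.std(ddof=1)
    z = (stat(new_scores) - center) / spread
    return float(z), bool(abs(z) > n_std)

# Synthetic history (hypothetical): 24 monthly batches of 5000 scores.
rng = np.random.default_rng(seed=2)
past_batches = [rng.beta(2, 8, size=5000) for _ in range(24)]
past_means = [b.mean() for b in past_batches]
past_stds = [b.std(ddof=1) for b in past_batches]

# A batch whose scores are all shifted upwards by 0.05.
shifted = np.clip(rng.beta(2, 8, size=5000) + 0.05, 0.0, 1.0)
z_shift, shift_flag = flag_batch(past_means, shifted)

# A batch with the same expected mean (0.2) but a wider spread.
wider = rng.beta(1, 4, size=5000)
z_wide, wide_flag = flag_batch(
    past_stds, wider, stat=lambda s: np.std(s, ddof=1)
)

print(f"shifted batch: z={z_shift:.1f}, anomalous={shift_flag}")
print(f"wider batch:   z={z_wide:.1f}, anomalous={wide_flag}")
```

The second call shows why the standard deviation is worth monitoring too: the wider batch is built to have the same expected mean as the history, so a check on the mean would be much less likely to catch it.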

Now what if our dice was loaded in favor of the 3? It would naturally shift the mean of a 1000-roll series towards 3, but that would not be a value far enough from the mean to be seen as anomalous:

Position of a die loaded in favor of the 3. Image by author.

Here, the anomaly can’t be seen in terms of the rolls’ mean, but it is visible if we count the number of times we rolled a 3 among the 1000 rolls:

Here again, the number of threes follows a normal distribution and the loaded dice’s result is visibly abnormal.
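The contrast between the two views can be worked out exactly, without simulation. The sketch below assumes the dice favors the 3 with the same weights as before (30% for the 3, 14% for each other face), which is my assumption since the exact weights are not stated for this case.

```python
import math

N_ROLLS = 1000
faces = range(1, 7)
fair_p = {f: 1 / 6 for f in faces}
# Assumed weights: 30% for the 3, 14% for each other face.
loaded_p = {f: (0.30 if f == 3 else 0.14) for f in faces}

def mean_of(p):
    return sum(f * q for f, q in p.items())

def var_of(p):
    m = mean_of(p)
    return sum(q * (f - m) ** 2 for f, q in p.items())

# View 1: mean of a 1000-roll series, measured against the fair dice.
se_mean = math.sqrt(var_of(fair_p) / N_ROLLS)
z_mean = (mean_of(loaded_p) - mean_of(fair_p)) / se_mean

# View 2: number of threes, which is Binomial(1000, 1/6) for a fair dice.
expected_threes = N_ROLLS * fair_p[3]
sd_threes = math.sqrt(N_ROLLS * fair_p[3] * (1 - fair_p[3]))
z_count = (N_ROLLS * loaded_p[3] - expected_threes) / sd_threes

print(f"z on the mean:  {z_mean:.2f}")   # about -1.48, inside two std
print(f"z on the count: {z_count:.2f}")  # about 11.31, far outside
```

Under these assumed weights the expected mean only moves from 3.5 to 3.42, while the expected number of threes jumps from about 167 to 300.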

In a model, we can apply that by counting the number of observations whose score has significantly increased or decreased. If half of the observations had an increase of 0.3 and the other half had a decrease of 0.3, the predictions’ mean would not move at all, but the count of large changes would reveal the anomaly.
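A small sketch of that count, using the half-up, half-down scenario described above. The helper name `count_large_moves` and the 0.2 threshold for a "significant" change are hypothetical choices for the example.

```python
import numpy as np

def count_large_moves(prev_scores, new_scores, threshold=0.2):
    """Count observations whose score moved by at least `threshold`."""
    prev = np.asarray(prev_scores, dtype=float)
    new = np.asarray(new_scores, dtype=float)
    return int(np.sum(np.abs(new - prev) >= threshold))

# 1000 observations that all scored 0.5 last month (made-up values).
prev = np.full(1000, 0.5)

# This month, half go up by 0.3 and half go down by 0.3.
new = prev.copy()
new[:500] += 0.3
new[500:] -= 0.3

print("previous mean:", prev.mean())
print("new mean:     ", new.mean())
print("large moves:  ", count_large_moves(prev, new))
```

The two means are identical up to floating-point rounding, while every observation is counted as a large move. In practice the monthly count can then be compared with the counts from previous months, in the same way as the means.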

Applying this to my model, I got the following graphs:

Here, I kept a copy of my model and didn’t train it at all for several months. We can see from both the histogram and the line plot that the model shows signs of degradation after June 2021. Previous data analysis suggested that the model only had to be re-trained every year, but this anomaly analysis suggests that this should rather happen after 4 months.

Anomalies can happen for multiple reasons, including:

- There is an anomaly in the data processing pipeline
- The model is unstable or has to be re-trained
- There is an external factor. For instance, if there is an economic crisis, most observations will have a sharp increase in their default score because their health has declined.

Anomalous does not necessarily mean incoherent.

In my case, the data processing pipeline wasn’t definitive yet and even if the model’s re-training frequency had already been inferred, I wanted to monitor it on current data. Therefore, that CLT-based anomaly detection was necessary mainly for the first two cases.

Originally posted here by Antoine Villatte. Reposted with permission.
