Tutorials
March 30, 2022
5 min read

How to test ML models #2 đŸ§± Categorical data drift

Testing drift of categorical feature distribution is essential in AI / ML, requiring specific metrics

Cars drifting
Jean-Marie John-Mathews, Ph.D.

Kullback-Leibler (KL) divergence, Population Stability Index (PSI), Chi-square test, and more: drift in categorical data can be tested with many methods in Machine Learning. How do you choose the right one?

Testing the drift of ML feature distributions is essential in AI. It lets you make sure your model is still valid by checking whether the data seen at inference time is close to the data used to train the model. Data drift measures the difference in statistical distribution between a reference dataset, often the training set, and a “current” dataset, such as the test set or a batch of examples collected at inference time.

In this post, we’ll give the definition, formula, and interpretation of the main categorical drift metrics between two datasets.

Notation: we denote by r_i the probability of category i in the categorical variable R (the reference feature) and by c_i the probability of category i in the categorical variable C (the current feature).
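For example, assuming the two features come as pandas Series of category labels (a hypothetical setup for illustration, not part of the original post), r_i and c_i can be estimated as empirical frequencies aligned on the union of categories:

```python
import pandas as pd

# Hypothetical example: reference (training) and current (serving) samples
reference = pd.Series(["cat", "dog", "dog", "bird", "cat", "dog"])
current = pd.Series(["cat", "cat", "dog", "cat", "cat", "bird"])

# Estimate r_i and c_i as empirical frequencies on the union of categories,
# filling categories absent from one sample with probability 0
categories = sorted(set(reference) | set(current))
r = reference.value_counts(normalize=True).reindex(categories, fill_value=0)  # r_i
c = current.value_counts(normalize=True).reindex(categories, fill_value=0)    # c_i

print(pd.DataFrame({"r_i": r, "c_i": c}))
```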

Kullback-Leibler divergence (relative entropy)

KL divergence is widely used to measure how the distribution c_i of a categorical feature C differs from the reference distribution r_i of another categorical feature R. It is defined as the expectation of the logarithmic difference between the probabilities c_i and r_i, where the expectation is taken with respect to the probabilities of C. Mathematically, the Kullback-Leibler divergence from C to R is given by:

$$D_{KL}(C \parallel R) = \sum_i c_i \log\left(\frac{c_i}{r_i}\right)$$

KL divergence has the big advantage of being interpretable: it is the expected excess surprise from modelling your data with R when it actually follows the distribution of C. But its main disadvantage is that it does not satisfy the properties of a distance metric. In particular, it is not symmetric: if you swap C and R, you will not necessarily get the same value. Imagine the distance from point A to point B being different from the distance from point B to point A. This could be problematic!
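As a quick illustration (a sketch added here, not from the original post), the formula above can be computed with NumPy from the aligned probability vectors c and r built in the notation example; a small epsilon is assumed to keep the logarithm finite when a probability is zero:

```python
import numpy as np

def kl_divergence(c, r, eps=1e-9):
    """D_KL(C || R) = sum_i c_i * log(c_i / r_i), with probabilities floored
    at eps so the logarithm stays finite for zero-frequency categories."""
    c = np.clip(np.asarray(c, dtype=float), eps, None)
    r = np.clip(np.asarray(r, dtype=float), eps, None)
    return float(np.sum(c * np.log(c / r)))

print(kl_divergence(c, r))  # divergence from C to R...
print(kl_divergence(r, c))  # ...is in general not equal to the divergence from R to C
```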

To address this issue, one can use a symmetrised version of the Kullback-Leibler divergence called the Jeffreys divergence, better known in Machine Learning as the Population Stability Index (PSI).

Population stability index (PSI)

PSI was originally developed in the Banking and Finance industries to test changes in the distribution of a risk score over time. It is defined as the sum of the KL divergence from C to R and the KL divergence from R to C, which gives the following formula:

$$PSI = D_{KL}(C \parallel R) + D_{KL}(R \parallel C) = \sum_i (c_i - r_i)\log\left(\frac{c_i}{r_i}\right)$$

As you might observe, PSI is symmetric: switching C and R gives the same result. But PSI has another big advantage. Since it has been used in the financial industry over the past decades, practitioners have proposed domain thresholds for making decisions with the PSI indicator (Siddiqi, 2005):

  • PSI < 0.1: no significant population change
  • 0.1 < PSI < 0.2: moderate population change
  • PSI > 0.2: significant population change

These thresholds are crucial to building testing methods in Machine Learning.

The Population Stability Index also has some disadvantages. Its value can be strongly influenced by the probabilities of individual categories. In particular, the PSI behaves unreliably when the frequency of a category approaches 0: as the formula shows, it diverges for categories with very little data. This happens, for example, when the test set contains a new category that is not present in the training set, which is quite frequent in day-to-day Machine Learning practice!

Here is a simple code implementation of the PSI value that bypasses this limitation by flooring the frequencies of rare categories.
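A minimal sketch of such a function, assuming the features come as pandas Series of category labels; the flooring value min_freq is an illustrative choice rather than a canonical one:

```python
import numpy as np
import pandas as pd

def psi(reference: pd.Series, current: pd.Series, min_freq: float = 1e-4) -> float:
    """Population Stability Index between two categorical samples.

    Frequencies are floored at `min_freq` so that categories that are rare in,
    or absent from, one of the samples do not make the PSI diverge.
    """
    categories = sorted(set(reference) | set(current))
    r = reference.value_counts(normalize=True).reindex(categories, fill_value=0)
    c = current.value_counts(normalize=True).reindex(categories, fill_value=0)
    r = np.clip(r.to_numpy(dtype=float), min_freq, None)
    c = np.clip(c.to_numpy(dtype=float), min_freq, None)
    # PSI = sum_i (c_i - r_i) * log(c_i / r_i)
    return float(np.sum((c - r) * np.log(c / r)))
```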

Using this function, you can easily write tests as follows:
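For instance, a pytest-style check against the Siddiqi thresholds quoted above might look like this (the psi helper is the sketch defined above; train_df, serving_df and the "country" column are hypothetical names):

```python
def test_psi_drift(reference: pd.Series, current: pd.Series, threshold: float = 0.2) -> None:
    """Fail when the PSI indicates a significant population change."""
    value = psi(reference, current)
    assert value < threshold, f"Categorical drift detected: PSI = {value:.3f} >= {threshold}"

# e.g. in a test suite:
# test_psi_drift(train_df["country"], serving_df["country"])
```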

Chi-square (goodness of fit)

Domain thresholds are great for implementing tests. But are they reliable? They might look a bit arbitrary! To overcome this issue, traditional statistical tests fix decision thresholds using probabilistic hypotheses.

Statistical theorems, such as the law of large numbers, give the approximate distribution of aggregated data. Using these laws, one can compute an aggregated statistical measure (a test statistic) for testing purposes: if this value is above a given threshold, we can reject, at a 95% confidence level, the hypothesis that C and R have the same distribution.

These tests, called goodness-of-fit tests, check whether a given sample C fits the distribution of a reference population R. The chi-square test is the most common goodness-of-fit test. It is defined by:

$$\chi^2 = \sum_i \frac{(c_i - r_i)^2}{r_i}$$

where c_i and r_i are now the numbers of observations (not the probabilities) of category i for the variables C and R.

Since this statistic approximately follows a chi-square distribution with k - 1 degrees of freedom (k being the number of categories), if it is above the 95% quantile of that distribution, one can reject the hypothesis that the features R and C come from the same distribution.

Below is a simple implementation of the chi-square test for measuring the drift of categorical features in Machine Learning.
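A minimal sketch using scipy.stats.chisquare (an assumption of this sketch, not necessarily the library used originally); the expected counts are obtained by rescaling the reference frequencies to the size of the current sample so that observed and expected totals match:

```python
import pandas as pd
from scipy.stats import chisquare

def chi_square_drift(reference: pd.Series, current: pd.Series):
    """Chi-square goodness-of-fit test of the current sample against the
    reference distribution. Returns the test statistic and its p-value."""
    categories = sorted(set(reference) | set(current))
    # Observed counts c_i in the current sample
    observed = current.value_counts().reindex(categories, fill_value=0).to_numpy(dtype=float)
    # Expected counts: reference frequencies rescaled to the current sample size
    expected = (
        reference.value_counts(normalize=True)
        .reindex(categories, fill_value=0)
        .to_numpy(dtype=float)
        * len(current)
    )
    # Note: a category absent from the reference gives an expected count of 0,
    # which makes the statistic blow up; see the validity conditions below.
    return chisquare(f_obs=observed, f_exp=expected)
```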

Using this implementation, you can easily write tests adapted for your ML models:
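And a drift test built on top of it, rejecting the hypothesis of identical distributions when the p-value falls below the conventional 5% significance level (train_df and serving_df are again hypothetical names):

```python
def test_chi_square_drift(reference: pd.Series, current: pd.Series, alpha: float = 0.05) -> None:
    """Fail when the chi-square test rejects 'same distribution' at level alpha."""
    statistic, p_value = chi_square_drift(reference, current)
    assert p_value >= alpha, (
        f"Categorical drift detected: chi2 = {statistic:.2f}, p-value = {p_value:.4f} < {alpha}"
    )

# e.g. in a test suite:
# test_chi_square_drift(train_df["country"], serving_df["country"])
```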

Unfortunately, the chi-square test also has its disadvantages: as with all statistical tests, it is only valid under certain conditions. Chi-square tests are invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the reference and current frequencies should be at least 5. Furthermore, Pearson recommends that the total number of samples be greater than 13 to avoid over-rejection by the test.

Conclusion

While different ways exist to test the drift of categorical features in Machine Learning, there is no single best method. One should choose a threshold based either on domain knowledge, like the PSI score, or on probabilistic theory, like the chi-square test. But each of these methods is only valid under some restrictive conditions. It depends on your data!

At Giskard, we provide open-source resources for tuning these ML testing methods and integrating them into your model.

Check out our GitHub!

References

  • Jeffreys, H. (1948). Theory of Probability. p. 158.
  • Taplin & Hunt. (2019). The Population Accuracy Index: A New Measure of Population Stability for Model Monitoring.
  • Siddiqi, N. (2012). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring (Vol. 3). John Wiley & Sons.
  • Pearson, K. (1900). “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling”. Philosophical Magazine, Series 5, 50, pp. 157–175.
