In this post, we focus on one of the most important biases: measurement 📏
Data is the result of measurements made either by a human or a machine, and noise is inherent to every measurement. When the noise is random, it can usually be averaged out by aggregating many measurement points.
Unfortunately, this technique often fails in real ML projects: the noise is not random with respect to the event we want to predict. Put differently, measurement bias arises when the measurement noise is correlated with the target variable.
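A tiny simulation makes this concrete. The numbers below (a 0.8 shift for the positive class, a hypothetical sensor reading) are illustrative assumptions, not from any real dataset: zero-mean random noise disappears under aggregation, while noise correlated with the target survives it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary target: 1000 samples, two classes
y = rng.integers(0, 2, size=1000)

# True underlying quantity we are trying to measure (e.g. a sensor reading)
true_signal = rng.normal(loc=10.0, scale=1.0, size=1000)

# Random noise: zero-mean and independent of y
random_noise = rng.normal(0.0, 0.5, size=1000)

# Biased noise: systematically shifted for the positive class,
# i.e. correlated with the target variable
biased_noise = rng.normal(0.0, 0.5, size=1000) + 0.8 * y

measured_random = true_signal + random_noise
measured_biased = true_signal + biased_noise

# Aggregation removes the random noise: the overall means agree
print(abs(measured_random.mean() - true_signal.mean()))  # close to 0

# ...but not the correlated noise: the gap between classes is inflated
# by roughly the 0.8 shift, no matter how many points we aggregate
gap = measured_biased[y == 1].mean() - measured_biased[y == 0].mean()
true_gap = true_signal[y == 1].mean() - true_signal[y == 0].mean()
print(gap - true_gap)  # roughly 0.8: the bias survives aggregation
```

The model trained on such data learns the measurement artifact along with the signal, which is exactly why simple averaging is not a fix.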
Here are some examples:
❌ In image recognition, the training data may be collected with a different type of camera than the one used in production.
❌ In #NLP, data labelling may be influenced by the annotators’ regional context, producing inconsistent annotations and, in turn, measurement bias.
Fortunately, physics, and especially metrology, gives us a method to detect measurement bias: calibration, the act of comparing measured values against standards of known accuracy.
There are several ways to apply calibration in Machine Learning:
✅ Always compare the outputs of different data collection processes. To do that, use monitoring tools to assess changes in data distributions.
✅ Provide best practices and clear guidelines for your data collection process.
At Giskard, we help AI professionals detect measurement biases by enriching the modeling process with new reference points.