November 2, 2021
2 min read

Where do biases in ML come from? #2 ❌ Exclusion

What happens when your AI / ML model is missing important variables? The risks of endogenous and exogenous exclusion bias.

Variables crossing
Jean-Marie John-Mathews, Ph.D.

In this second post on the sources of bias in ML, we focus on one of the most important kinds: exclusion bias. ⚠

What does research tell us about them?

Exclusion bias occurs when important variables are left out of the model. Also called omitted-variable bias, it is most common at the data preprocessing stage. As described by Mehrabi et al. (2019), it is a bias that arises in the process between data and algorithm. The omitted phenomenon can be endogenous or exogenous to the modeling process:

✅ Endogenous variables

Exclusion bias is endogenous when the excluded variable already has a causal relationship with the target variable at the time the model is built. It most often happens when data scientists delete data they believe to be unimportant but that is in fact valuable.

For example, imagine that you predict customer value using a dataset of past customer sales in America and Canada. 98% of the customers are from America, so you delete the location data, thinking it is irrelevant. As a result, your model cannot capture the fact that your Canadian customers spend twice as much. Excluding the location variable biases the model, which leads to a loss of performance, especially for Canadian customers.
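The customer value example can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the spend figures (100 and 200), the country codes, and the "model" (a simple mean predictor) are all hypothetical and chosen to mirror the 98% / 2% split and the "twice as much" gap described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: 98 American customers spending 100,
# 2 Canadian customers spending 200 (twice as much).
customers = [("US", 100.0)] * 98 + [("CA", 200.0)] * 2


def fit_without_location(rows):
    """Location was dropped, so the model can only predict the global mean."""
    overall = mean(spend for _, spend in rows)
    return lambda country: overall


def fit_with_location(rows):
    """Location is kept, so the model can learn one mean per country."""
    by_country = defaultdict(list)
    for country, spend in rows:
        by_country[country].append(spend)
    means = {c: mean(v) for c, v in by_country.items()}
    return lambda country: means[country]


def mean_abs_error(model, rows, country=None):
    """Mean absolute error, optionally restricted to one country."""
    subset = [(c, s) for c, s in rows if country is None or c == country]
    return mean(abs(model(c) - s) for c, s in subset)


blind = fit_without_location(customers)
aware = fit_with_location(customers)

for name, model in [("without location", blind), ("with location", aware)]:
    print(
        name,
        "| overall MAE:", round(mean_abs_error(model, customers), 2),
        "| Canada MAE:", round(mean_abs_error(model, customers, "CA"), 2),
    )
```

In this sketch the model without location looks almost fine overall, because Americans dominate the data, yet it is badly wrong for every Canadian customer. Aggregate metrics alone can therefore hide an exclusion bias.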

Read more about the Seven types of data bias in machine learning.

✅ Exogenous variables

Exclusion bias is exogenous when the causal relationship between the omitted variable and the target variable only appears after the model has been built.

For example (Mehrabi et al., 2019), imagine that you deploy a model to predict churn and you suddenly realize that the majority of users churn without the model raising any warning. Now imagine that the reason for the churn is the arrival of a strong new competitor that offers the same solution at half the price. This new fact has a causal relationship with churn, but the modeling process could not capture it. The model is therefore biased, and its performance drops.
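The churn scenario can be illustrated the same way. The sketch below is a toy simulation, not a real churn model: the single feature (monthly usage hours), the threshold classifier, and the assumed effect of the competitor (price-sensitive users churn regardless of usage) are all hypothetical choices made for illustration.

```python
# Hypothetical toy data: each user is (monthly_usage_hours, churned).
# Before the competitor exists, only low-usage users churn.
train = [(h, h < 5) for h in range(20)]


def fit_threshold(rows):
    """Pick the usage threshold that best separates churners in the training data."""
    best_t, best_acc = 0, -1.0
    for t in range(21):
        acc = sum((h < t) == churned for h, churned in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def recall(threshold, rows):
    """Share of actual churners that the model flagged in advance."""
    churners = [h for h, churned in rows if churned]
    return sum(h < threshold for h in churners) / len(churners)


def churn_after_competitor(hours, price_sensitive):
    """Assumed post-launch behavior: price-sensitive users leave whatever their usage."""
    return hours < 5 or price_sensitive


threshold = fit_threshold(train)

# Live data after the competitor appears. Price sensitivity (here: even hours,
# an arbitrary stand-in) played no role when the model was trained.
live = [(h, churn_after_competitor(h, price_sensitive=(h % 2 == 0))) for h in range(20)]

print("recall before competitor:", recall(threshold, train))
print("recall after competitor: ", round(recall(threshold, live), 2))
```

The model catches every churner in the training data, but once the competitor arrives it misses most of them, because the variable that now drives churn did not exist when the model was built. Monitoring recall on fresh data is one way such a drop could be noticed.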

At Giskard, we help AI professionals identify these exclusion biases by enriching the modeling process with new information. In an upcoming series of posts, we will dive deeper into other types of bias and how to address them.

Want to know more? ✋ Contact us at hello@giskard.ai

Bibliography
