September 27, 2023
10 min read

Guide to Model Evaluation: Eliminating Bias in Machine Learning Predictions

Explore our tutorial on model fairness to detect hidden biases in machine learning models. Understand the flaws of traditional evaluation metrics with the help of the Giskard library. Our guide, packed with examples and a step-by-step process, shows you how to tackle data sampling bias and master feature engineering for fairness. Learn to create domain-specific tests and debug your ML models, ensuring they are fair and reliable.

Eliminating bias in Machine Learning predictions
Josiah Adesola

Data is one of the most valuable assets in today's world, with a staggering 328.77 million terabytes generated daily, encompassing a diverse array of content, from videos and text to spoken words. This data, spanning personal and supplementary information about individuals, can unveil profound insights about a person's identity.

However, these data patterns often lack crucial context about our behaviors and interpersonal interactions as human beings. When fed into machine learning algorithms, they can perpetuate societal assumptions, generating predictions that raise legitimate concerns about privacy and fairness. These concerns extend to their impact on diverse groups of people across various facets of life, affecting things like how insurance costs are calculated, credit scores for different groups, and even health predictions.

Because of these concerns, it's important to make sure that machine learning models built with this data are fair and don't favor one group over another in ways that can cause long-term harm.

This article will:

  • Introduce unethical practices in machine learning.
  • Develop a model for salary predictions to introduce classical model evaluation.
  • Talk about the problems with traditional model evaluations that don't consider fairness.
  • Introduce Giskard as a tool to ensure machine learning models are fair in their predictions.

Now, let's dive in!  


☯️ Machine Learning bias and unethical practices: Ensuring AI Fairness

To understand the concept of data bias and its potential impact on individuals, let’s examine a scenario involving insurance payments.

Insurance provides financial security in unforeseen situations like accidents, but it has faced criticism globally for how insurers distribute premium payments across people in different geographical locations. When you pay premiums to an insurance company, it uses various factors to calculate coverage, including age, gender, location, and health status, and it increasingly relies on ML to scale the calculation of premiums and coverage.

Since the data that insurers use contains human bias, their models may encode assumptions through correlations between protected and proxy attributes (see Table 1) in the dataset, without understanding the context. This can lead to discrimination, potentially disadvantaging certain income or racial groups.

For instance, an insurance company may use ZIP codes as proxy variables to determine premium costs. However, these algorithms lack the understanding that ZIP codes can often be linked to socioeconomic factors like race and income.

Fairness implication: This potentially leads to unfair and discriminatory pricing practices that may violate anti-discrimination laws wherever these algorithms are deployed. It also reduces transparency and accountability, because it makes it difficult for people to understand why they are paying more than people in other ZIP codes.

This can happen in other domains such as banking, hiring and salary distribution, and education, and it might have already affected you in one way or another. ML models therefore need to be evaluated to understand how they make predictions, and then improved to make those predictions fairer.

🧪 Overview of classical Machine Learning model evaluation

Traditional model evaluation techniques tend to focus on assessing the overall predictive performance of a model without delving deeply into the fairness or potential biases associated with specific variables, including protected and proxy variables.

In many cases, classic model evaluation primarily emphasizes metrics such as accuracy, precision, recall, F1-score, and ROC AUC, among others, to gauge how well a model performs in making predictions. These metrics generally evaluate the model's overall effectiveness in terms of correctly classifying outcomes but may not thoroughly examine how the model treats different subgroups or the fairness of its predictions with respect to protected or proxy variables.

However, when some of these metrics are combined and analyzed alongside demographic or protected variables, they can provide a more comprehensive view of a model's behavior and fairness.

Table 1. Difference between protected and proxy variables by definition, examples, and usage

| | Protected Variables | Proxy Variables |
|---|---|---|
| Definition | Sensitive attributes, often related to privacy and fairness, such as race, gender, age, or disability status. | Non-sensitive attributes that indirectly correlate with protected variables and may unintentionally introduce bias. |
| Examples | Race or ethnicity; gender (sex); age; disability status | ZIP code (correlated with race and income); education level (correlated with age); job title (correlated with gender) |
| Usage | Used to assess and monitor fairness in machine learning models. Protected from direct use to prevent discrimination. | May inadvertently introduce bias if not considered during model development. Should be identified and addressed to ensure fairness. |

💰 Use Case: Evaluating a model trained on adult income data for salary prediction

To show how models might seem to perform well using standard evaluation methods but exhibit biases when fairness is taken into account, the adult income dataset from Kaggle is used.

This dataset is notorious for its inherent bias (just as in the insurance use case discussed earlier), particularly due to its imbalanced nature. It serves as an ideal example to underscore the critical significance of fairness, particularly in a sensitive domain like predicting salaries. Much like how insurance predictions can significantly influence people's lives, this particular use case provides a good basis for discussing fairness considerations.

It is also a popular dataset for building a binary classifier that predicts whether a person makes over $50,000 a year, given their demographic attributes.

With this background on the data and the aim of training the model, you can kick off by installing and importing relevant libraries. Note that Python is the primary programming language, and Google Colab Notebooks is the coding environment for this walkthrough.

📚 Install Giskard and import libraries to evaluate a model

If you do not have the giskard library and its dependencies installed, you can do that with the following:
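Giskard is published on PyPI, so `pip install giskard` (prefixed with `!` in a Colab cell) installs it. As a minimal sketch, the helper below reports which of the libraries used in this walkthrough are not yet importable. The function name `missing_packages` and the `REQUIRED` list are illustrative assumptions, not part of Giskard.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed for this walkthrough (scikit-learn is imported as "sklearn").
REQUIRED = ["giskard", "pandas", "sklearn"]

missing = missing_packages(REQUIRED)
if missing:
    print(f"Not installed yet: {missing}. Install them with pip before continuing.")
else:
    print("All required libraries are available.")
```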

What’s Giskard? Testing framework for model evaluation 

Giskard is a testing framework created to reduce the risks associated with biases, performance issues, and errors, improving the reliability of machine learning models. With the aid of this tool, you can find hidden flaws in your machine learning models like performance bias, unrobustness, data leakage, overconfidence, stochasticity, and unethical behavior. Giskard helps you automatically scan your models for vulnerabilities and offers concise descriptions of these risks if they are present.

Let’s import the relevant libraries including giskard:

Load and Preprocess the dataset

Download the CSV file programmatically and proceed to preprocessing in the Google Colab coding environment. 

The data used in this exercise was sourced from here. It contains some rows with missing values, poorly formatted column names, and certain columns that won't be necessary for our purposes.

You can take a look at information about the dataset by using the df.info() pandas command to determine:

  • Variables you won’t need
  • Categorical variables
  • Numerical variables

After specifying the variables you can start preprocessing and preparing data for training.
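The cleaning steps described above can be sketched with pandas as follows. This is an illustration under stated assumptions: the three sample rows are made up, the column names only imitate the Adult dataset's style, and "?" is assumed to be the marker for missing values. In the notebook you would pass the path or URL of the real CSV instead of the inline sample.

```python
import io

import pandas as pd

# Made-up sample in the style of the Adult dataset (not real records).
SAMPLE_CSV = """age, workclass, education.num, sex, hours-per-week, income
39, State-gov, 13, Male, 40, <=50K
50, ?, 13, Male, 13, <=50K
38, Private, 9, Female, 40, >50K
"""


def load_and_clean(source):
    """Read the CSV, normalize column names, and drop rows with missing values."""
    df = pd.read_csv(source, skipinitialspace=True, na_values="?")
    # Turn names like "education.num" or "hours-per-week" into snake_case.
    df.columns = [
        col.strip().lower().replace(".", "_").replace("-", "_") for col in df.columns
    ]
    return df.dropna().reset_index(drop=True)


df = load_and_clean(io.StringIO(SAMPLE_CSV))
df.info()

# Separate numerical columns from everything else (treated as categorical).
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = [col for col in df.columns if col not in numerical]
print("Numerical:", numerical)
print("Categorical:", categorical)
```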

Set constant variables you will need for splitting the data for simplicity and then split your data into training and testing. 
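A minimal sketch of such a split with scikit-learn's `train_test_split` is shown here. The constant names, the 20% test ratio, the seed, and the toy data are assumptions chosen for illustration, not the values used to produce the results in this article.

```python
from sklearn.model_selection import train_test_split

# Constants kept in one place so every split in the notebook is reproducible.
TEST_RATIO = 0.2   # assumed hold-out fraction
RANDOM_SEED = 42   # assumed seed

# Toy stand-in for the real features and target: one feature (age), binary label.
ages = list(range(20, 70))
features = [[age] for age in ages]
labels = [int(age > 45) for age in ages]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=TEST_RATIO, random_state=RANDOM_SEED
)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
```

Passing `stratify=labels` keeps the class ratio identical in both splits, but it balances only the target, not the demographic subgroups.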

This appears to be a typical method for splitting data into training and testing sets. However, it's important to recognize that this method might not necessarily enhance the model's performance on subgroups, especially when dealing with imbalanced data. Even when various techniques, such as oversampling, undersampling, or employing SMOTE (Synthetic Minority Over-sampling Technique), are utilized to address imbalanced data, there can still be lingering data sampling bias that traditional model evaluations may not detect.

Classic Model Evaluation Pitfall #1: Addressing data sampling bias

Data sampling bias occurs when the data collected doesn't accurately represent the entire population you want to make predictions about. For the "Adult" dataset, one might assume that this is a fair representation of society, but when you take a closer look, you might find that certain racial groups are underrepresented or overrepresented.

In other words, the data we've collected doesn't accurately reflect the diversity of the population. This is where sampling bias creeps in. Also, splitting the dataset might exacerbate this bias by creating an imbalance in subgroup representation, leading to unintended bias in your model's evaluation and predictions.
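One way to make this concrete is to compare each subgroup's share in the full dataset with its share in a split. The sketch below uses only the standard library; the helper names and the group labels `group_a` and `group_b` are hypothetical placeholders for the values of a protected column.

```python
from collections import Counter


def group_shares(values):
    """Return the fraction of rows that belongs to each group."""
    counts = Counter(values)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def share_gaps(full, subset):
    """Share of each group in `subset` minus its share in `full`."""
    full_shares = group_shares(full)
    subset_shares = group_shares(subset)
    return {
        group: subset_shares.get(group, 0.0) - share
        for group, share in full_shares.items()
    }


# Toy example: group_b is 20% of the data but absent from the test split.
full_column = ["group_a"] * 8 + ["group_b"] * 2
test_column = ["group_a"] * 3

for group, gap in share_gaps(full_column, test_column).items():
    print(f"{group}: {gap:+.2f}")
```

A gap far from zero for any group means the split represents that group differently from the full dataset, so metrics computed on the split say little about it.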

The consequences of such bias can be profound. Imagine a scenario where a model with this bias is used to determine eligibility for loans or job opportunities. It could disproportionately deny opportunities to certain racial groups, perpetuating social disparities.

Initiate bias mitigation: Wrap your Dataset with Giskard for ML model evaluation

Wrapping your dataset with Giskard is the first step in preparing to scan your model for performance issues. Datasets are a potential major source of bias for ML models, and part of bias mitigation is carefully selecting the features to train on.
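As a sketch of what the wrapping step can look like, assuming the Giskard 2.x Python API (`giskard.Dataset` with `df`, `target`, `name`, and `cat_columns`): the helper name `wrap_dataset`, the default dataset name, and the column names in the usage comment are illustrative assumptions.

```python
def wrap_dataset(df, target, cat_columns, name="Adult income test set"):
    """Wrap a pandas DataFrame so Giskard can scan it.

    `df` should hold the raw features (before one-hot encoding) plus the
    target column; `cat_columns` lists the categorical feature names.
    """
    # Imported here so the helper can be defined before giskard is installed.
    import giskard

    return giskard.Dataset(
        df=df,
        target=target,
        name=name,
        cat_columns=cat_columns,
    )


# Hypothetical usage with assumed column names:
# giskard_dataset = wrap_dataset(test_df, target="income", cat_columns=["sex", "race"])
```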

Classic Model Evaluation Pitfall #2: Difficulty in enabling efficient feature engineering based on fairness

Traditional model evaluations are blind when it comes to group disparities in predictions made for demographic groups. A model may achieve high accuracy while still treating certain groups unfairly. Accuracy alone also tells you little about the data the model relied on to reach that generalization.

When working with classical model evaluations, many ML practitioners check the dataset used for developing the model only when their model performance is low or too high. This leaves you with a blind spot and implicitly reduces the interpretability of the model. Measures you might take here include, but are not limited to, manually slicing the data, evaluating the specific subsets through the model, and writing logs before fixing the problems.

As we’ll see later, giskard automatically slices and tests the data slices by subgroups to give you valuable information needed for feature engineering. You can investigate perturbations that might be harming the quality and performance of your model on the subgroups you are interested in and this invariably helps you enhance the model when you make the right changes.

Train the ML model

Here, the categorical features are encoded with OneHotEncoder, and the encoder is combined with a RandomForestClassifier in a scikit-learn Pipeline, which is then trained on the training set.
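A minimal sketch of such a pipeline follows. The toy DataFrame, its column names, and the hyperparameters are assumptions for illustration, so the accuracy it prints is unrelated to the result reported in this article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(cat_columns, random_state=42):
    """One-hot encode the categorical columns, pass the rest through, then classify."""
    preprocess = ColumnTransformer(
        transformers=[
            ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_columns)
        ],
        remainder="passthrough",
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            ("classifier", RandomForestClassifier(random_state=random_state)),
        ]
    )


# Made-up rows standing in for the cleaned Adult data (1 means income >50K).
toy = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29, 60, 44, 23, 55, 36, 41],
        "sex": ["Male", "Female", "Male", "Male", "Female", "Female",
                "Male", "Female", "Male", "Male", "Female", "Male"],
        "workclass": ["Private", "Private", "State-gov", "Private", "Private", "State-gov",
                      "Private", "Private", "Private", "State-gov", "Private", "Private"],
        "income": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1],
    }
)
X = toy.drop(columns="income")
y = toy["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = build_pipeline(cat_columns=["sex", "workclass"]).fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Keeping the encoder inside the pipeline means the model accepts raw rows, which is convenient later when a tool needs to call it on unprocessed data slices.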


Output - Train ML model

A test accuracy of 0.82 might give confidence to most data scientists. They might be happy to move on and consider their model good enough to perform in production. However, looking at accuracy metrics alone may not unveil potential fairness, robustness, over/underconfidence, spurious correlation, or a whole host of other issues that could pop up when exposing the model to the real world.

Ensure AI fairness: Wrap and Scan your model for ML model evaluation

Just like the dataset, the model is wrapped with the giskard library to prepare it for the scan.
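A sketch of the model wrapping step, assuming the Giskard 2.x Python API (`giskard.Model` with `model`, `model_type`, `name`, `classification_labels`, and `feature_names`). The helper name `wrap_model` and the default model name are illustrative assumptions.

```python
def wrap_model(clf, feature_names, labels, name="Salary classifier"):
    """Wrap a fitted scikit-learn classifier (or pipeline) for Giskard.

    `labels` must follow the order of the probabilities the model outputs,
    for example `list(clf.classes_)` for a scikit-learn estimator.
    """
    # Imported here so the helper can be defined before giskard is installed.
    import giskard

    return giskard.Model(
        model=clf,
        model_type="classification",
        name=name,
        classification_labels=labels,
        feature_names=feature_names,
    )


# Hypothetical usage, assuming `clf` is the fitted pipeline and `X_test` its raw features:
# wrapped_model = wrap_model(clf, list(X_test.columns), list(clf.classes_))
```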


Result: Wrapped Test accuracy: 0.82

After wrapping the model, you can scan it to check for vulnerabilities.
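The scan itself is a single call. The sketch below assumes the Giskard 2.x API (`giskard.scan` and the report's `to_html` method); the helper name `scan_model` and the optional `report_path` argument are illustrative assumptions.

```python
def scan_model(wrapped_model, wrapped_dataset, report_path=None):
    """Run the Giskard scan and optionally save the report as an HTML file."""
    # Imported here so the helper can be defined before giskard is installed.
    import giskard

    results = giskard.scan(wrapped_model, wrapped_dataset)
    if report_path is not None:
        results.to_html(report_path)
    return results


# Hypothetical usage in a notebook cell:
# scan_results = scan_model(wrapped_model, giskard_dataset, report_path="scan_report.html")
# scan_results  # as the last expression of a cell, this renders the report inline
```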


Scan results for ML model evaluation

The giskard library produces a report to help understand the different vulnerabilities of our model. Here we notice that the main issues detected are performance biases and underconfidence.  

This report highlights the significance of evaluating machine learning model fairness by comparing classic metrics computed on subgroups with their global counterparts. It provides insights into whether the model's overall performance aligns with its performance across various demographic subgroups.

This is critical because it offers a proactive approach to addressing fairness concerns in machine learning models. Instead of data scientists needing to manually investigate their data to identify these issues, which can be time-consuming and error-prone, the giskard scan report streamlines the process. Data scientists and stakeholders can quickly pinpoint potential fairness concerns or irregularities tied to specific subgroups in the data and take fast, informed action, saving them the headache of sifting through piles of data and blindly trying out different methods to anticipate and mitigate a model’s performance biases.

Table 2. How metrics for a subgroup, combined with their global counterparts, can reveal fairness concerns

| Metric and Global Metric Combination | Description | What It Reveals |
|---|---|---|
| Accuracy and Global Accuracy | Subgroup Accuracy: measures the correctness of predictions within a single subgroup. Global Accuracy: provides an overall view of model performance across all data. | Highlights whether the model's correctness (the proportion of correctly predicted instances) on a subgroup aligns with its performance across all data. Disparities between subgroup accuracy and global accuracy may indicate fairness concerns related to specific demographic groups. |
| Precision and Global Precision | Subgroup Precision: quantifies the correctness of positive predictions within a subgroup. Global Precision: offers an overall view of model precision across all data. | Assesses whether the model's precision on a subgroup is consistent with its overall performance. Differences between subgroup precision and global precision suggest potential fairness issues where the model may perform differently for specific demographic groups. |
| Recall and Global Recall | Subgroup Recall: measures the model's ability to capture true positives among actual positives within a subgroup. Global Recall: calculates recall across the entire dataset, regardless of subgroups. | Highlights whether the model captures true positives consistently across different demographic subgroups. Disparities between subgroup recall and global recall indicate varying subgroup performance and potential fairness concerns. |
| F1-Score and Global F1-Score | Subgroup F1-Score: provides a balanced view of performance within a subgroup, considering both precision and recall. Global F1-Score: calculates the F1-Score across the entire dataset, without subgroup considerations. | Assesses whether the model's balanced performance extends to all demographic subgroups. Differences between the subgroup F1-Score and the global F1-Score reveal potential fairness concerns where the model may not achieve the same balance of precision and recall for all groups. |
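To illustrate the comparison in Table 2, the standard-library sketch below computes recall globally and per subgroup. The helper names, the `__global__` key, and the toy labels are assumptions made for illustration; this is not how Giskard computes its report.

```python
def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives (NaN if there are none)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    actual_pos = true_pos + false_neg
    return true_pos / actual_pos if actual_pos else float("nan")


def recall_by_group(y_true, y_pred, groups):
    """Recall on the whole dataset and on each subgroup in `groups`."""
    result = {"__global__": recall(y_true, y_pred)}
    for group in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == group]
        result[group] = recall([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result


# Toy labels: the model finds most positives for one group and none for the other.
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
groups = ["Husband"] * 4 + ["Unmarried"] * 4

for name, value in recall_by_group(y_true, y_pred, groups).items():
    print(f"{name}: {value:.2f}")
```

Here the global recall of 0.50 hides a recall of 0.75 for one group and 0.00 for the other, which is exactly the kind of disparity a single global metric cannot show.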

Classic Model Evaluation Pitfall #3: Lack of Fairness and Interpretability 

Traditional model evaluation relies on common metrics like accuracy, precision, recall, and F1-score to assess a model's performance. These are handy for telling us how well a model is doing in general, but they don't really explain whether the model is being fair to different groups of people. This is a problem because it means we might not realize when our models are being unfair, which can lead to unfair treatment based on things like race, gender, or age without us even knowing it.

Giskard empowers users with actionable insights for improving fairness. The vulnerabilities detected suggest potential interventions and adjustments to reduce disparities and promote equitable outcomes. In essence, Giskard bridges the gap in fairness interpretability by providing a transparent and actionable framework for assessing and addressing fairness concerns, ensuring that machine learning models adhere to ethical and equitable standards.

Generate a test suite from the Scan

The results generated from the scan can serve as building blocks to create a comprehensive test suite that incorporates domain-specific challenges and considerations, enhancing the overall testing process of your ML model.
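Assuming the Giskard 2.x API, where the scan report exposes `generate_test_suite` and a suite exposes `run`, this step can be sketched as follows. The helper name `suite_from_scan` and the default suite name are illustrative assumptions.

```python
def suite_from_scan(scan_results, name="Test suite generated by scan"):
    """Turn the issues found by a scan into a reusable test suite and run it once."""
    suite = scan_results.generate_test_suite(name)
    outcome = suite.run()
    return suite, outcome


# Hypothetical usage, assuming `scan_results` comes from giskard.scan(...):
# test_suite, first_run = suite_from_scan(scan_results)
```

Returning the suite along with its first result lets you keep adding tests to the same object later.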

Understanding Test Suites in ML model training and evaluation

Test suites are organized collections of reusable components designed to make the evaluation and validation processes for machine learning models more efficient and consistent.

They include various test cases, each tailored to assess specific aspects of a model's performance. The main goal of using such test suites is to improve the efficiency and consistency of testing. 

Additionally, they help you maintain consistent testing practices, enable model comparisons, and quickly identify any unexpected changes or issues in the behavior of your machine learning models.


Test suite in ML model evaluation

After you run your first test suite, the report shows what happens when the model's performance is assessed for individual groups. For example, for people whose relationship status is "Unmarried", the recall is quite low (0.2449) and falls below the expected threshold, so the test is marked as failed.

This suggests that the model may not be effectively capturing all relevant positive cases for this subgroup, indicating a potential area for improvement.

This report hints that you might want to tweak your data or try out a different model to make sure it meets fairness standards. 

Customize your suite by loading objects from the Giskard catalog

Test suites can be customized in the giskard library. Giskard’s catalog provides you with the capability to import various elements like:

  • Tests, encompassing types like metamorphic, performance, prediction, and data drift tests, as well as statistical assessments.
  • Slicing functions, which include detectors for attributes like toxicity, hate speech, and emotion.
  • Transformation functions, offering features such as generating typos, paraphrasing, and fine-tuning writing styles.

The code below adds an F1 test to the suite.
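A sketch of adding such a test, assuming the Giskard 2.x API (`giskard.testing.test_f1` and the suite's `add_test` and `run` methods). The helper name `add_f1_test` and the 0.7 threshold are illustrative assumptions, not the values behind the report shown in this article.

```python
def add_f1_test(suite, wrapped_model, wrapped_dataset, threshold=0.7):
    """Add a global F1 test to an existing suite and re-run the whole suite."""
    # Imported here so the helper can be defined before giskard is installed.
    import giskard

    f1_test = giskard.testing.test_f1(
        model=wrapped_model,
        dataset=wrapped_dataset,
        threshold=threshold,  # assumed minimum acceptable F1 score
    )
    suite.add_test(f1_test)
    return suite.run()


# Hypothetical usage:
# results = add_f1_test(test_suite, wrapped_model, giskard_dataset, threshold=0.7)
```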

The report below shows that our model failed this newly added test. 


Failed tests

Moving forward, the logical next step is to address and rectify the issues identified by the giskard scan to reduce model bias. To achieve this, you can start the Giskard Hub, a platform that allows you to debug your failing tests, thoroughly examine and assess your data and models for fairness, and compare different versions of your model so that you can choose the best one without guesswork.

✅ AI Safety with Giskard Hub: Enhance Machine Learning model evaluation and deployment

Giskard Hub serves as a collaborative platform designed for debugging failing tests, curating specialized tests tailored to your specific domain, facilitating model comparisons, and gathering domain expert feedback on your machine learning models. It plays a pivotal role in enhancing AI safety and accelerating the deployment process.

The Giskard Hub can be deployed through Hugging Face Spaces, on-premise, in the cloud, or through Giskard’s custom SaaS offering. In real-world scenarios where sensitive information must remain on-premise, the on-premise deployment is the natural choice, so that is what we use here.

1. Install giskard server and all its requirements on your local machine 

Ensure Docker and all other requirements are installed on your system. Check here for how to install them on Linux, macOS, or WSL2 (Windows Subsystem for Linux) on Windows.

2. Start the server on your local machine by inputting the following command on your terminal:

When it runs, it will provide you with a localhost address http://localhost:19000/.

If you are new to using giskard, you will need to set up an ngrok account which creates a layer of security when uploading objects from your Colab notebook to the Giskard server.

3. Request a free trial license 

Giskard Hub - request license
Giskard Hub - Launch

Launch Giskard! 

Giskard Hub - getting started

You are set to upload all your project assets from your Colab notebook.

4. Set up an ngrok account and generate your ngrok API token, then expose the giskard server to the internet so that you can upload objects to the Giskard Hub.


Giskard Hub - ngrok API token

5. Go to your Colab notebook and upload the first test suite you just ran with all its assets

Use the external server link (<ngrok_external_server_link>) to see the uploaded test suite on Giskard Hub.

Giskard Hub - Testing

6. While the other terminal runs the server, open a new terminal and execute this command on your local machine to start the ML worker. 

The ML worker allows the giskard server to execute your model and all its associated operations directly from the Python environment where you trained it. This prevents dependency headaches and makes sure you can run your test suites and model directly from the server.

7. Curate your tests, debug, and re-run tests

After running your initial test suite, add new tests, run them on newer versions of your models, and debug them by reviewing failing subgroups to identify and address the issues detected by Giskard.

Giskard Hub - List of failed tests

Play around with the Giskard platform to maintain a centralized Hub for all your tests and model experiments in the most comprehensive way possible.

Giskard Hub - debugging

Through this guide, you've learned to scan your model and develop comprehensive test suites using the Giskard Python library. You’ve also seen how the Giskard Hub can serve as a debugging and model evaluation companion when training new ML models. Giskard's tools simplify and automate many tasks that are often missed when creating ML models.

If you found this helpful, consider giving us a star on GitHub and becoming part of our Discord community. We appreciate your feedback and hope Giskard becomes an indispensable tool in your quest to create superior ML models.
