October 6, 2022

Why do Citibeats & Altaroad Test AI Models? The Business Value of Test-Driven Data Science

Why do great Data Scientists & ML Engineers love writing tests? Two customer case studies on improving model robustness and ensuring AI Ethics.

Dollar Planets Generated by OpenAI DALL·E
Alex Combessie

“Captain's Log, Stardate 2022.10. We have entered a spectacular star system in the Data sector on a most critical mission of AI research. We are studying the effects of Quality Tests on our AI-powered computer core.”

Why do great Data Scientists & Machine Learning (ML) Engineers love writing tests?

You have probably heard about the importance of tests for a long time. But the value of tests in the context of Artificial Intelligence (AI) can seem elusive. You may view testing AI models as difficult and tedious.

At Giskard, we understand this challenge. We have been building products using AI / ML for the last ten years. ML Testing is a new field that emerged in research three years ago. It's difficult to know where to start testing Machine Learning systems and why.

Great Data Scientists & ML Engineers test AI models for two reasons.
First, tests help to improve model performance & robustness.
Second, tests help eliminate biases in AI models.

In other words, tests help to avoid the risks of ethical scandals and model performance issues. Left unchecked, these risks can lead to catastrophic incidents for your AI team and your business. Tests also protect your AI workflow whenever you retrain your AI model or your data science team changes.

In the last few weeks, we have worked with our early adopters to implement a Test-Driven Data Science workflow so that they can realize the value of AI Quality. Here are two case studies explaining how they proceeded and what they have achieved.

⛑ Eliminate the risk of ethical biases with automated tests

Citibeats Home Page - At the forefront of AI Ethics

Citibeats uses AI to transform ethically sourced public opinions into real-time, data-driven actionable insights. Based in Barcelona, Spain, they serve major public institutions worldwide, such as the World Health Organization.

AI is central to their product offering, as it helps identify topics and trends in text data sources such as social media. Their Data Science team has developed Natural Language Processing (NLP) models to classify messages by themes, detect needs, perceptions, or narratives such as complaints, and compute social indicators.

We started to work together in April 2022. Their Head of Data, Arnault Gombert, was very interested in testing AI models. Arnault and his CTO, Abby Seneor, had identified Ethical AI as the core value proposition of Citibeats. But they did not have a way to test the ethics of their NLP models on sensitive categories.

This absence of tests introduced a risk of harming minorities in the long run and eventually losing contracts with public clients if an ethical bias appeared. For Citibeats, Ethics is a significant question of trust between their product, clients, and citizens.

At Giskard, we believe the best way to eliminate the risk of ethical biases is by using automated tests. These tests will allow you to test for bias before releasing your AI model into the wild, a.k.a. production.

After four months of deep collaboration with their team, we are happy to announce that the Citibeats team has created a complete test suite for their NLP models. It encompasses over 30 different tests to evaluate models on specific fairness and robustness metrics.

Citibeats AI Test dashboard on Giskard

To design these tests, we drew from the excellent research paper: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro et al., 2020).
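To make the idea concrete, here is a minimal sketch of one kind of behavioral test described in the CheckList paper: an invariance test, which checks that a prediction does not change when a sensitive term is swapped. The classifier, the swap table, and the sample messages are all hypothetical stand-ins; this is not the Citibeats test suite or the Giskard API.

```python
import re

# Hypothetical stand-in for a real NLP classifier: flags a message as a
# complaint when it contains one of a few keywords.
def toy_classifier(text: str) -> str:
    keywords = ("broken", "late", "terrible")
    return "complaint" if any(k in text.lower() for k in keywords) else "other"

# Illustrative swap table for one sensitive attribute (gendered terms).
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", flags=re.IGNORECASE)

def swap_gender_terms(text: str) -> str:
    """Replace each gendered word with its counterpart, keeping capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(repl, text)

def invariance_failure_rate(predict, texts, perturb) -> float:
    """Share of examples whose prediction changes after the perturbation."""
    failures = sum(predict(t) != predict(perturb(t)) for t in texts)
    return failures / len(texts)

samples = [
    "He said the bus was late again.",
    "The woman thanked the nurse for her help.",
    "His street light is broken.",
]
rate = invariance_failure_rate(toy_classifier, samples, swap_gender_terms)

# The test passes only if no prediction depends on the swapped terms.
assert rate == 0.0, f"prediction changed on {rate:.0%} of examples"
```

A real suite would replace the toy classifier with the production model and would typically tolerate a small, explicit failure threshold rather than requiring zero changes.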

We are fortunate to have Citibeats as our design partner. They are not only the first AI team to try our AI Test module but also the first ones to contribute new tests to our open-source repository. A special thank you to Alejandro Bonell de Pascual and Meryem Akrout for being the first contributors!

How to add your custom ML tests to Giskard's open-source project

This first contribution validates our open-source approach. Tests created by one user are valuable to the entire AI community. We aim to help our users find the relevant tests for their AI models as quickly as they can download pre-trained models from Hugging Face.

Soon, you will be able to use these custom NLP Ethics tests on your models. We have planned a joint communication with the Citibeats team to announce it in the coming weeks. It will include a technical deep dive into the testing methods we used. Stay tuned!

👷  Improve model robustness with domain expert feedback

Altaroad Home Page - Using AI to simplify industrial processes

Altaroad offers a new way of tracing material and waste flows transported by heavy goods vehicles by automating processes. It allows the traceability of flows on building and industrial sites to reduce transportation costs, improve safety and measure their carbon footprint. They serve major construction companies in France, such as Vinci and Grand Paris Express.

Altaroad uses AI to bring intelligence and automation to its data platform. Their Data Science team bridges the gap between physics and machine learning models to build robust AI models on sensor data.

We started to work together in May 2022 after an introduction by our data scientist, Princy Pappachan. Their Head of Data, Jean Milpied, was interested in our help to improve the robustness of their AI model.

At the time, they did not have an interface to get feedback on AI models from their users. They had identified that the model lacked performance, but there was no way to evaluate where precisely the model needed to be improved. Besides, it was essential to involve domain experts to evaluate AI models.

Altaroad used the Giskard AI Inspect module to build a collaborative feedback loop between their data science and industrial engineering teams. In just 30 minutes, they collected 72 pieces of feedback from their engineering stakeholder!

Altaroad AI Feedback Dashboard on Giskard

This feedback helped identify some errors in the data & AI model:

  • Wrong labels for some examples
  • Wrong model predictions in specific data ranges
  • False values for some features

Thanks to this feedback, they identified and prioritized eight actions to improve their AI model, from data labeling to feature engineering. We also discussed combining the AI model with expert rules in a "Composite AI" to improve robustness.

In their next iteration, they will develop a new AI model version considering the actions identified. Following the Test-Driven Development (TDD) methodology, they will first write tests based on the collected feedback to ensure that the next version fits these requirements.
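As an illustration of this test-first step, the sketch below turns one kind of feedback ("wrong model predictions in specific data ranges") into a test that the current model fails and a candidate model must pass. The data, the two models, the range bounds, and the error threshold are all invented for the example; they are not Altaroad's data or models.

```python
# Hypothetical records: (sensor_reading, true_value). Not real Altaroad data.
records = [(5.0, 10.0), (12.0, 24.0), (40.0, 80.0), (45.0, 90.0)]

def current_model(x: float) -> float:
    # Toy model that under-predicts when the reading is 30 or above.
    return 2.0 * x if x < 30 else 1.5 * x

def candidate_model(x: float) -> float:
    # Toy model for the next iteration.
    return 2.0 * x

def mae_in_range(predict, data, low: float, high: float) -> float:
    """Mean absolute error over examples whose reading is in [low, high)."""
    errors = [abs(predict(x) - y) for x, y in data if low <= x < high]
    if not errors:
        raise ValueError("no examples in the requested range")
    return sum(errors) / len(errors)

def passes_high_range_test(predict, max_mae: float = 1.0) -> bool:
    """Requirement derived from feedback: error stays low for readings in [30, 50)."""
    return mae_in_range(predict, records, 30.0, 50.0) <= max_mae

# Written first, the test fails on the current model and guards the next one.
print("current model passes:", passes_high_range_test(current_model))
print("candidate model passes:", passes_high_range_test(candidate_model))
```

Restricting the metric to the problematic range keeps good performance elsewhere from hiding the error, which a global average would allow.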

Giskard helps to implement such a Test-Driven Data Science (TDDS) workflow. To quote Kent Beck, the inventor of TDD and one of the 17 original signatories of the Agile Manifesto:

"Test-Driven Development helps you to pay attention to the right issues at the right time so you can make your designs cleaner, you can refine your designs as you learn."

⚡️ What’s Next: The Power of Data Slices

Our R&D team is busy working on our next significant feature: Data Slices!

Data slices will connect the business feedback collected during the AI Inspect phase to the AI Test suites. They will make writing tests much faster and more closely tied to the underlying business value.

The user story is that as a Data Scientist, you will be able to create and save specific data slices so that:

  • You can quickly inspect sensitive cases
  • You can turn feedback into valuable business tests

Data slices will also benefit business users of our platform, who can inspect the model on these specific slices and know which test is connected to which slice.

Here is a mock-up of what it will look like:

Mock-up of Giskard upcoming Data Slice feature

For instance, you can define specific data slices for men and women and compare the performance of the model (Precision, Recall, F1 Score, etc.) between the two slices. Because you can reuse these slices across datasets, evaluation & testing operations become much more efficient.
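The comparison described above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the upcoming Giskard feature: the rows, the column names, and the slice definitions are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical labelled predictions; "gender" stands in for any sensitive column.
rows: List[Row] = [
    {"gender": "female", "label": 1, "pred": 1},
    {"gender": "female", "label": 1, "pred": 0},
    {"gender": "female", "label": 0, "pred": 0},
    {"gender": "female", "label": 1, "pred": 1},
    {"gender": "male", "label": 1, "pred": 1},
    {"gender": "male", "label": 0, "pred": 1},
    {"gender": "male", "label": 1, "pred": 1},
    {"gender": "male", "label": 0, "pred": 0},
]

# A slice is a named, reusable filter that can be applied to any dataset
# with the same columns.
slices: Dict[str, Callable[[Row], bool]] = {
    "women": lambda r: r["gender"] == "female",
    "men": lambda r: r["gender"] == "male",
}

def precision_recall_f1(data: List[Row]) -> Dict[str, float]:
    """Binary classification metrics, with 0.0 when a denominator is empty."""
    tp = sum(1 for r in data if r["label"] == 1 and r["pred"] == 1)
    fp = sum(1 for r in data if r["label"] == 0 and r["pred"] == 1)
    fn = sum(1 for r in data if r["label"] == 1 and r["pred"] == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# One metrics report per slice.
report = {name: precision_recall_f1([r for r in rows if keep(r)])
          for name, keep in slices.items()}

# A gap between slices is the kind of signal a fairness test could threshold.
recall_gap = abs(report["women"]["recall"] - report["men"]["recall"])
for name, metrics in report.items():
    print(name, metrics)
print("recall gap:", recall_gap)
```

In this toy data both slices have the same F1 score while precision and recall differ, which shows why comparing several metrics per slice is more informative than a single aggregate.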

We are also working on our API to automate AI Tests in your CI/CD pipeline so that you can get a green Giskard “passing” badge on your Git repo if all your tests succeed.

These features will be rolling out with the next release in October/November. If you are excited about these upcoming features, all you need to do is:

👉 Drop us an email at hello@giskard.ai to be involved in the Beta

Thanks again to all 447 of you for subscribing to this newsletter. We are very interested in your feedback if you have any questions or ideas on how we can improve.
