February 15, 2023
8 min read

FOSDEM 2023: Presentation on CI/CD for ML and How to test ML models?

In this talk, we explain why testing ML models is an important and difficult problem. Then we explain, using concrete examples, how Giskard helps ML Engineers deploy their AI systems into production safely by (1) designing fairness & robustness tests and (2) integrating them in a CI/CD pipeline.

Giskard at FOSDEM 2023
Alex Combessie

🍿 Full video

You can find our slide deck below:

🇺🇸 English transcript

Hello everyone, can you hear me well? Thanks! Pretty large audience. If I may ask, a quick show of hands: who among you has some experience, just any level of experience, with Machine Learning?

Okay, cool, awesome! I'll be talking today about how to run testing on Machine Learning systems. There are different keywords: CI/CD, quality assurance.

A few words about us: I'm one of the founders of Giskard. We are building a collaborative and open-source software platform to precisely ensure the quality of AI models. And I'll be explaining in this presentation a bit about how it works.

In terms of agenda, I have prepared two sections on the Why: why a project on testing Machine Learning systems is needed and why I personally decided to work on that problem, then some of the risks and why classical software testing methods don't quite work on AI. Then I'll give some more concrete examples on two important quality criteria that you might want to test for in Machine Learning: one is robustness and the other is fairness. And if we have the time, it's just 30 minutes, I hope that we can do a quick demo of an example use case where we run the full CI/CD pipeline on a Machine Learning model.

Why? A personal story with memes

To start off easy, I put together a series of memes to explain my personal story of why I came to create a company and a project around this Machine Learning testing thing. About 10 years ago, I started in Machine Learning, statistics, data science and you know, you start using the scikit-learn API and you're like: yeah it's super easy, anybody can be a data scientist, you just “dot fit”, “dot predict” and that's it, you're a data scientist.

And probably if you're here today, you're like: "have you tested your model?", and sure: train_test_split().

The reality if you've deployed in production is quite different. If you've deployed to production, often you'll have this painful discovery where you have your product manager, business stakeholders, to whom you said: look, I worked really hard on the fine tuning and the grid search to get to 85% accuracy and you push your first version to production.

And things don't quite work out, you don't reproduce these good accuracy numbers.

This was me, I hope it's not you. One of my first experiences deploying Machine Learning in production was on a fraud detection system. Fraud is notoriously difficult as a use case for Machine Learning because what you're trying to detect doesn't quite want to be detected. There are people behind it, who have a vested interest in not having your Machine Learning system detect them. So often, in terms of performance, I ended up doing a lot of hotfixes in production. It's bad.

Five years ago this was my stance on Machine Learning in production: a very painful, grueling experience, where you just never know when you're going to have a complaint or when you're gonna be on call to solve something in production.

That's when I decided to buff up and switch roles to join a software engineering team. I was at Dataiku back then, so I moved internally from data science to the product team.

To summarize some of the things I learned there: as someone with a Machine Learning background but no real software engineering experience, I was told: you must learn the ways of the CI/CD, otherwise your project will not come to production. For context, at that time I was specifically in charge of creating an open-source layer to federate all the NLP and Computer Vision APIs that vendors in the cloud provide, and then to do the same for pre-trained NLP and Time Series models.

What was difficult in this context is: I was not even the one in charge of the models, and the models would be retrained and fine-tuned. As an engineer, giving guarantees about the properties of these systems is more difficult when there are some elements in the stack that you don't have control of.

This is a bit of a repeat of a previous meme and I really wanted to say: One does not simply ship an ML product without tests.

The challenge I had then, is that from an engineering management standpoint, I was told: no, it's easy: engineers all write their test cases, so you do Machine Learning, just write them all, just write all the test cases.

This was me back at square one, you're telling me I just need to write unit tests? Okay... But that will not really solve the issue.

That's the beginning of the quest that set me on the path to build something to close that gap between "I want to test my models, I need to test my models" and "how can I do that?". Because clearly, and I'll explain why, unit testing your model is really not enough.

Why? Risks & why classic testing fails

A different angle on the Why, I'll try to take a step back and talk about quality in general. I think in this track we all agree that quality matters. And if you look at AI, it's an industry and an engineering practice that is far younger than software engineering or civil engineering and it's just riddled with incidents.

I encourage you, if you don't know that resource already, to look at incidentdatabase.ai. It's an open-source database, a public collection of reports, mostly in the media, of AI models that have not worked properly. It's really great work that has been going on for about two and a half years, it's a global initiative and just in this time they collected more than 2,000 incidents.

Since these are public reports, think of it as the tip of the iceberg of course, there are a lot of incidents internal to companies that are not necessarily spoken out in the media. The incident database has a very interesting taxonomy of the different types of incidents. It's very multifaceted.

I took the liberty to simplify it into three big categories of incidents: one is ethics, the other is business and economic impact, and the third one is security. And if they happen at a global company scale, we're really talking about incidents which are very, very severe.

In ethics, you can have a sexist credit scoring algorithm that exposes the company to lawsuits, to brand image damages, etc. And these are notoriously hard to identify, because in a way Machine Learning is precisely about discrimination, so it's hard to tell a machine that is learning to discriminate not to discriminate against certain sensitive groups.

I'll speak on some methods that can be used precisely on this problem. In this case, Apple was working at the time with Goldman Sachs on deploying this algorithm, and probably some tests and safeguards were unfortunately skipped. So it was actually discovered on Twitter that, in a simple case, a male loan applicant would get a loan limit 10x that of his wife. That sparked a huge controversy, and probably exposed Apple to some lawsuits.

In another area that is not about sensitive features such as gender, there was a huge catastrophe a year and a half ago that happened to Zillow, a real estate company, where there was a small bias that was overestimating the prices of homes and they decided to put this algorithm live to buy and sell houses.

It turns out that this tiny bias, which was left unchecked, was exploited by real estate agents in the US and this literally created a loss of nearly half a billion dollars. Going back to testing, this could have been anticipated and avoided.

Now, on a cybersecurity spectrum, there's a lot of good research from cybersecurity labs showing that you can hack, for example a Computer Vision system in an autonomous driving context. So here, you put a special tape on the road and you can crash a Tesla.

We don't quite know if these types of vulnerabilities have been exploited in real life yet, but AI is becoming super ubiquitous and obviously there are some bad actors out there that might want to hack these systems. It introduces a new type of attack vector, so that's also something we need to care about.

So, both from practitioners of AI and from a regulatory standpoint, testing just makes sense.

Yann LeCun, Chief AI Scientist at Meta, was actually taking a stance at the beginning of last year on his Twitter, saying that if you want trust in the system, you need tests. He was also making a slight criticism of some of the explainability methods, because two years ago, if you followed that realm, people were saying you just need explainability and then your problems will go away.

Well, that's just part of the answer. Lastly, and this was covered in some of the talks this morning in the big auditorium, there's a growing regulatory requirement to put some checks and balances in place. The regulation also says that, specifically if your AI system is high-risk, you need to put quality measures in place.

And the definition of high-risk AI systems is pretty large: obviously you have anything related to infrastructure, like critical infrastructure, defence, etc. But you also have all AI systems that are involved in human resources and public service, and financial services because these are considered obviously critical components of society.

Now that we agree that it's an important problem, these are some of the challenges. Because if you've tried, if you've encountered some of these issues, you probably looked at some easy solutions, drawing analogies with what you might do in other contexts.

There are three points that make this problem of testing Machine Learning a bit special, meaning it's still a big work in progress: point one is that it is really not enough to check the data quality, to guarantee the quality of a Machine Learning system.

One of my co-founders, during his PhD, proved experimentally, you can run experiments, that you can have really clean data and a bad model. So you cannot just say it's an upstream problem, it's actually technically systems engineering. You have to take the data, the Machine Learning model and the user into context to analyze its properties.

Moreover, the errors of a Machine Learning system are often caused by pieces of data that did not exist when the model was created, they were clean but they did not exist.

Second point, it's pretty hard just to copy-paste some of the testing methods from software into AI. One, yes you can do some unit tests on Machine Learning models but they won't prove much. Because the principle is that it's a transactional system and things are moving quite a lot. So that’s a good baseline, if you have a Machine Learning system and you have some unit tests that's already step one. It's better to have that than to have nothing.

But you have to embrace the fact that there has got to be a large number of test cases. So you cannot just test on 3, 5 or 100 cases; even a thousand cases will not be enough. The models themselves are probabilistic, so you have to take into account statistical methods of testing.
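
To make the statistical point concrete, here is a minimal sketch of an accuracy test that accounts for sample size. It is not Giskard's implementation: the function names and the 80% threshold are assumptions chosen for illustration, and the Wilson score lower bound is only one of several reasonable ways to turn an observed accuracy into a statistically cautious one.

```python
import math


def wilson_lower_bound(correct: int, n: int, z: float = 1.645) -> float:
    """One-sided lower bound of the Wilson score interval for a proportion.

    z = 1.645 corresponds to roughly 95% one-sided confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = correct / n
    denominator = 1 + z**2 / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre - margin) / denominator


def accuracy_test(correct: int, n: int, threshold: float) -> bool:
    """Pass only if we are reasonably confident that accuracy >= threshold."""
    return wilson_lower_bound(correct, n) >= threshold


# Same observed accuracy (85%), very different amounts of evidence.
small_sample_passes = accuracy_test(correct=85, n=100, threshold=0.80)
large_sample_passes = accuracy_test(correct=8500, n=10_000, threshold=0.80)
print(small_sample_passes, large_sample_passes)
```

With these numbers, the same 85% observed accuracy fails the test on 100 cases and passes on 10,000, which is why a handful of test cases says little about a probabilistic model.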

Third point, and I think this is specific to AI: there have been software systems that were heavily dependent on data, but AI came with a large increase in the number of data inputs compared to traditional systems, so you very quickly run into combinatorial problems and it's practically impossible to generate all the combinations. A very simple example of that: how can you test an NLP system?
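
A rough back-of-the-envelope calculation, sketched here in Python, shows how fast the input space grows. The feature counts and the vocabulary size below are made-up numbers chosen only for illustration.

```python
import math


def input_space_size(cardinalities):
    """Number of distinct inputs when each feature takes a finite number of values."""
    return math.prod(cardinalities)


# Hypothetical tabular model: 20 categorical features with 10 possible values each.
tabular_inputs = input_space_size([10] * 20)

# Hypothetical NLP model: sentences of 10 words drawn from a 10,000-word vocabulary.
sentence_inputs = input_space_size([10_000] * 10)

print(f"{tabular_inputs:.1e} tabular inputs, {sentence_inputs:.1e} possible sentences")
```

Even at a billion test cases per second, enumerating either space is out of reach, so test cases have to be sampled or generated from templates rather than listed exhaustively.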

Lastly, tests on AI touch a lot of different points. If you want to have a complete test coverage, you really need to take into account multiple criteria. Performance of the system but also robustness to errors, fairness, privacy, security, reliability and also, that is becoming an increasingly important topic with Green AI: what is the carbon impact of this AI? Do you really need that many GPUs? Can you make your system a bit more energy efficient?

Today I'll focus, because I see we have 10 more minutes, on two aspects: robustness and ethics.

How? Testing the robustness of ML systems

I'll start with robustness. Who has read or heard about this paper, quick show of hands? Who has heard of behavioral testing, because that's not Machine Learning specific?

Ribeiro, three years ago, along with the other co-writers of this paper, did, I think, a fantastic job of showing how to adapt behavioral testing, which is a really good practice from software engineering, to the context of Machine Learning, and specifically wrote something for NLP models.

The main problem that this research paper aimed to solve was test case generation, because NLP (natural language processing) is by essence a problem where the input is just raw text, so you cannot really unit test it.

But what you can do is generate test cases that rely on mapping the input text, and changes to that input text, to expectations. I'll give three examples, from very simple to a bit more complex.

One is the principle of minimum functionality. For example, if you are building a sentiment prediction Machine Learning system, you could just have a test that says that if you have "extraordinary" in the sentence, the model should always predict that it's a positive message.
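
A minimum functionality test of this kind can be sketched in a few lines. Everything below is hypothetical: `predict_sentiment` is a toy keyword-based stand-in for a real model, and the templates and nouns are invented for illustration.

```python
import re

POSITIVE_WORDS = {"extraordinary", "good", "great"}


def predict_sentiment(text: str) -> str:
    """Toy stand-in for a real sentiment model (assumption, keyword-based)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return "positive" if any(t in POSITIVE_WORDS for t in tokens) else "negative"


def minimum_functionality_test(predict, sentences, expected, min_pass_rate=1.0):
    """Pass if the share of sentences predicted as `expected` reaches min_pass_rate."""
    passed = sum(predict(s) == expected for s in sentences)
    return passed / len(sentences) >= min_pass_rate


# Generate test cases from templates instead of writing each one by hand.
TEMPLATES = ["The {} was extraordinary.", "What an extraordinary {}!"]
NOUNS = ["flight", "service", "movie"]
cases = [template.format(noun) for template in TEMPLATES for noun in NOUNS]

mft_passes = minimum_functionality_test(predict_sentiment, cases, expected="positive")
print(f"{len(cases)} generated cases, test passes: {mft_passes}")
```

The `min_pass_rate` parameter is a design choice: a strict test requires every case to pass, while a statistical one tolerates a small failure rate.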

Now you will probably tell me, yeah but what about if the user has written it's "not extraordinary" or "absolutely not extraordinary"?

And that actually brings me to the concept of test templates. For NLP, and this is obviously language-specific, you probably need templates where you change the text by, for example, adding negations. Then you can test whether adding a negation moves the prediction in a certain direction: if a sentiment model has really understood the language, putting "not" in front of "extraordinary" or "good" should change its prediction, whereas swapping in synonyms should not affect it too much.
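
The negation template can be sketched as a directional expectation test: after the perturbation, the positive score should go down. The two scoring functions below are toy assumptions, one that handles negation and one that does not, so that both a passing and a failing outcome are visible.

```python
import re

POSITIVE_WORDS = {"extraordinary", "good", "great"}
NEGATORS = {"not", "never"}


def negation_aware_score(text: str) -> int:
    """Toy scorer: +1 per positive word, -1 if that word directly follows a negator."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE_WORDS:
            negated = i > 0 and tokens[i - 1] in NEGATORS
            score += -1 if negated else 1
    return score


def naive_score(text: str) -> int:
    """Toy scorer that ignores negation entirely."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in POSITIVE_WORDS for token in tokens)


def add_negation(sentence: str, adjective: str) -> str:
    """Template perturbation: put "not" in front of the adjective."""
    return sentence.replace(adjective, f"not {adjective}")


def directional_test(score_fn, sentences, adjective) -> bool:
    """Expect the positive score to strictly decrease once negation is added."""
    return all(score_fn(add_negation(s, adjective)) < score_fn(s) for s in sentences)


sentences = ["The service was extraordinary.", "What an extraordinary flight!"]
aware_passes = directional_test(negation_aware_score, sentences, "extraordinary")
naive_passes = directional_test(naive_score, sentences, "extraordinary")
print(aware_passes, naive_passes)
```

The negation-aware scorer passes and the naive one fails, which is exactly the kind of gap a template-based test is meant to surface.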

Either you want your system to move in a certain direction, or there are cases where you want the opposite behavior: you want robustness. That's called invariance. For instance, you want a system that is robust to typos, to just changing a location name, to putting in synonyms, etc.
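
An invariance test follows the same pattern, except that the expectation is "the prediction does not change". In this sketch the model is again a toy keyword-based assumption, and the two perturbations (swapping a location name, swapping two adjacent characters) are simplified versions of the ones mentioned above.

```python
import re

POSITIVE_WORDS = {"great", "good", "extraordinary"}


def predict_sentiment(text: str) -> str:
    """Toy stand-in for a real sentiment model (assumption, keyword-based)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return "positive" if any(t in POSITIVE_WORDS for t in tokens) else "negative"


def swap_location(text: str, old: str = "Paris", new: str = "Brussels") -> str:
    """Perturbation: replace one location name with another."""
    return text.replace(old, new)


def add_typo(text: str, position: int = 1) -> str:
    """Perturbation: swap the characters at `position` and `position + 1`."""
    if position + 1 >= len(text):
        return text
    chars = list(text)
    chars[position], chars[position + 1] = chars[position + 1], chars[position]
    return "".join(chars)


def invariance_rate(predict, sentences, perturb) -> float:
    """Share of sentences whose prediction is unchanged by the perturbation."""
    unchanged = sum(predict(perturb(s)) == predict(s) for s in sentences)
    return unchanged / len(sentences)


sentences = ["The flight to Paris was great.", "The hotel in Paris was terrible."]
location_rate = invariance_rate(predict_sentiment, sentences, swap_location)
typo_rate = invariance_rate(predict_sentiment, sentences, add_typo)
print(location_rate, typo_rate)
```

A test would then assert that the rate stays above a chosen threshold. Note that a typo landing on the keyword itself ("rgeat") flips this toy model's prediction, which is the kind of fragility the test is designed to catch.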

We've created this diagram to explain it, and it's a really thriving field in the research, there is a lot of research going on these days about testing Machine Learning systems and metamorphic testing is one of the leading methods to do that.

The principle, if I take an analogy, is very similar to the principle of backtesting an investment strategy, if you've worked in finance or if you have some friends who work there. You simulate different changes in the market conditions and you see how your strategy, your algorithm, behaves: what is the variance of that strategy?

This concept applies very well to Machine Learning. You need two things. One, you need to define a perturbation: as I was explaining earlier, in NLP a perturbation might be adding typos or adding negation. In another context, let's say in an industrial case, it might be about doubling the values of some sensors, or adding noise to an image. Two, pretty simply, you define an expectation, a test expectation in terms of a metamorphic relation between the output of a Machine Learning model and the distribution of the output after perturbation.

And once you have that, and if you have enough data, then you can do actual statistical tests, see if there's a difference in distribution, etc. I won't have too much time to dive into all the details of this, but we have written a technical guide on this topic and you have a QR code linking to it up there.
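
The two ingredients, a perturbation and a metamorphic relation checked statistically, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method from the technical guide: `predict_risk` is a hypothetical model that only looks at temperature, the sensor data is synthetic, and a paired sign-flip permutation test stands in for whichever statistical test you would actually choose.

```python
import random
from statistics import mean


def predict_risk(row: dict) -> float:
    """Hypothetical model: the score depends on temperature only."""
    return 0.02 * row["temperature"]


def double_temperature(row: dict) -> dict:
    """Perturbation: double the value of one sensor."""
    return {**row, "temperature": 2 * row["temperature"]}


def make_humidity_noise(seed: int = 0):
    """Perturbation factory: add Gaussian noise to a sensor the model ignores."""
    rng = random.Random(seed)
    return lambda row: {**row, "humidity": row["humidity"] + rng.gauss(0, 1)}


def sign_flip_pvalue(diffs, n_permutations: int = 1000, seed: int = 0) -> float:
    """Paired permutation test of the hypothesis that the mean difference is zero."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def invariance_holds(model, rows, perturb, alpha: float = 0.05) -> bool:
    """Metamorphic relation: the output should not shift after the perturbation."""
    diffs = [model(perturb(row)) - model(row) for row in rows]
    return sign_flip_pvalue(diffs) >= alpha


# Synthetic sensor readings, for illustration only.
rows = [{"temperature": 20 + i, "humidity": 40 + i} for i in range(50)]
noise_ok = invariance_holds(predict_risk, rows, make_humidity_noise())
doubling_ok = invariance_holds(predict_risk, rows, double_temperature)
print(noise_ok, doubling_ok)
```

The relation holds for humidity noise and is rejected for doubled temperature. Failing to reject is not a proof of invariance, so the sample size and the significance level `alpha` remain judgment calls.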

How? Testing the fairness of ML systems

Next, I'll talk a bit about a really tricky topic which is AI fairness. I want to emphasize that, at least our recommendation, is not to come at the problem of AI ethics with a closed mind or a top-down definition of this is an ethical system or no, this is an unethical system.

My co-founder did his PhD on precisely this topic and wrote a full paper on this, looking at the philosophical and sociological implications of this. The gist of it is that, yes, to a certain extent you can adopt a top-down approach to AI fairness, saying, for instance as an organization, we want to explicitly test fairness on three sensitive categories. You can say, well, we want to check for gender balance, we want to check for race balance. That is, if the country where you deploy Machine Learning allows you to collect this data, which is not always the case.

But the challenge with these approaches is that A. you might not have the data to measure this and B. you may miss things, because often when this exercise of defining the quality criteria for fairness and for balance is done, you only have a limited sample.

Taking a sociological analogy, it is really important to have this kind of top-down definition of AI ethics meet the reality on the ground, and to confront the actual users and the makers of the systems to get them to define ethics, rather than having a big organization, if I caricature a bit, that says: AI ethics, yeah, we wrote a charter about this, you read this, you sign, and then whoops, you're ethical.

Having said that, there are some good top-down metrics to adopt, that are a baseline and I'll explain one of them, which is disparate impact. Disparate impact is actually a metric from the human resources management industry from at least 40 years ago, it's not new.

Essentially, it's about setting a rule of 80 percent: you measure the probability of a positive outcome for a given protected population, and you say that you want the ratio of that probability to the probability of a positive outcome for the unprotected population to be above 80 percent. For instance, if you are building a model to predict the churn of customers, and you want to check whether your model is biased or not for age class, this formula allows you to really define this metric and write a concrete test case.
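
Written out, the metric is DI = P(positive outcome | protected) / P(positive outcome | unprotected), and the test passes when DI >= 0.8. Here is a minimal sketch of such a test case. The function names, the "senior" age class and the toy predictions are assumptions for illustration, with 1 standing for the positive outcome.

```python
def positive_rate(outcomes) -> float:
    """Share of positive outcomes in a group."""
    if not outcomes:
        raise ValueError("group is empty")
    return sum(outcomes) / len(outcomes)


def disparate_impact(predictions, groups, protected_group, positive_label=1) -> float:
    """Ratio of the positive-outcome rate: protected group over everyone else."""
    protected = [p == positive_label
                 for p, g in zip(predictions, groups) if g == protected_group]
    unprotected = [p == positive_label
                   for p, g in zip(predictions, groups) if g != protected_group]
    unprotected_rate = positive_rate(unprotected)
    if unprotected_rate == 0:
        raise ValueError("no positive outcome in the unprotected group")
    return positive_rate(protected) / unprotected_rate


def passes_80_percent_rule(predictions, groups, protected_group, threshold=0.8) -> bool:
    """Test case: disparate impact must be at least 80 percent."""
    return disparate_impact(predictions, groups, protected_group) >= threshold


# Toy data: 3 of 10 positive outcomes for "senior", 5 of 10 for "other".
groups = ["senior"] * 10 + ["other"] * 10
predictions = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5

ratio = disparate_impact(predictions, groups, protected_group="senior")
rule_ok = passes_80_percent_rule(predictions, groups, protected_group="senior")
print(f"disparate impact = {ratio:.2f}, passes 80% rule: {rule_ok}")
```

On this toy data the ratio is 0.3 / 0.5 = 0.6, so the test fails and flags the age class for closer inspection.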


I just have three minutes, so I'll highlight one of the features of our project, which is human feedback: really having an interface where users, and not only data scientists, can change the parameters, so there's a link to metamorphic testing, and can actually give human feedback to point out where the biases may be.

The benefit of this approach is that it allows for the community to precisely define what are the risks. Sadly, we won't have time to do a demo.

But this phase in our project, we call that the inspection phase. This is super important and again one of the things where it's different from traditional software testing: before you even test, you need to confront yourself with the data and the model. That's where we think explainability methods really shine, because they allow you to debug and to identify the zones of risk, and this is precisely what helps, once you have qualified feedback, to know where you should put your effort in tests.

So in a nutshell, what I'm saying for testing Machine Learning systems is: it's not a matter of creating hundreds of tests, of automating everything, but rather of having a good idea, from a fairness standpoint and from a performance standpoint, of what the 10, 15, maybe max 20 tests are that you want in your platform.

If you want to get started actually on it, this is our GitHub and if you have a Machine Learning System to test, we are interested in your feedback.