October 12, 2021
6 min read

8 reasons why you need Quality Testing for AI

Understand why Quality Assurance for AI is the need of the hour. Gain competitive advantage from your technological investments in ML systems.

Trust in AI systems
Alex Combessie

More and more Artificial Intelligence (AI) models are being deployed in various industries such as healthcare, marketing, human resources, or manufacturing. According to a recent global survey conducted by McKinsey, AI is being increasingly adopted in standard business processes, with nearly 25 percent year-over-year growth and with growing interest from the general public, business leaders, and governments.

While AI is becoming mainstream IT software, many failures have been reported over the past three years. There are now numerous databases that record AI incidents, such as the AI incident database, the AIAAIC repository, or the Awful AI database, to name just a few. These databases report the most serious AI failures, covering issues of discrimination, privacy, safety, and security. To address these new challenges, organizations are producing regulatory tools such as ethical charters, self-regulatory guidelines, or proposals for a regulation on AI.

The incidents revealed by these databases are only the tip of the iceberg, though. As suggested by the European regulatory four-level pyramid, most AI systems do not represent a high risk. The submerged part of the iceberg consists of issues with the technical and functional quality of AI systems, and it is the focus of this article. These quality challenges may be less documented in the mainstream press, but they are well covered in the scientific literature. To present them, we map the research concepts on AI quality onto the lifecycle of AI models: prototyping, deployment, and production.

📓 Prototyping

The prototyping stage focuses on creating a Proof of Concept (POC) to demonstrate the theoretical performance of the models. AI prototypes are often built by data scientists in isolated environments (Python notebooks), using test sets randomly extracted from historical datasets. Today it is well documented in the academic literature that most of this prototyping time is allocated to data preparation, often called feature engineering. This very time-consuming step allows data scientists to integrate business knowledge into the training dataset, which significantly increases the performance of the models.

Here are some of the challenges reported by research.

Costly access to business experts

Access to experts can be a crucial bottleneck for collecting high-quality labels and features. Business experts often have neither the time nor the will to devote themselves to the tedious tasks of feature engineering and data labeling. Moreover, experts rarely speak the same language as data scientists, which complicates the interaction.

Collecting the right data

As Lin and Ryaboy explain in their article, finding data sources and understanding their structure is a major task, which may prevent data scientists from even getting started on the actual application development. In large organizations where each business unit operates autonomously, it is often impossible to keep track of which dataset is stored by which entity, and in which form. Data scientists need strong support from business experts to be able to collect data.

Joining disparate datasets

As Madaio et al. report in their article, a lesser-known but important problem is data dispersion. There can be multiple relevant data sources, each with its own schema, its own conventions, and its own way of storing and accessing the data. Joining this information into a single dataset suitable for AI can be a complicated task in its own right. Data scientists need commitment from business experts to be able to apply these conventions and reconstruct the data.
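As a minimal illustration of data dispersion, the sketch below joins two sources that describe the same customers under different schemas and conventions. Everything in it is an assumption made up for the example: the field names (`customer_id`, `CustID`, `amount_eur`), the date and decimal formats, the sample records, and the helper functions `normalize_billing` and `left_join`.

```python
from datetime import datetime

# Hypothetical source A: CRM export with integer ids and ISO dates.
crm_rows = [
    {"customer_id": 1, "signup_date": "2021-03-04"},
    {"customer_id": 2, "signup_date": "2021-05-17"},
]

# Hypothetical source B: billing export with prefixed string ids,
# day/month/year dates, and decimal commas.
billing_rows = [
    {"CustID": "C-001", "last_invoice": "12/09/2021", "amount_eur": "120,50"},
    {"CustID": "C-003", "last_invoice": "01/10/2021", "amount_eur": "80,00"},
]

def normalize_billing(row):
    """Map one billing record onto the CRM conventions."""
    return {
        "customer_id": int(row["CustID"].split("-")[1]),
        "last_invoice": datetime.strptime(row["last_invoice"], "%d/%m/%Y")
        .date()
        .isoformat(),
        "amount_eur": float(row["amount_eur"].replace(",", ".")),
    }

def left_join(left, right, key):
    """Left join two lists of dicts on `key`; unmatched left rows are kept as-is."""
    index = {row[key]: row for row in right}
    return [{**row, **index.get(row[key], {})} for row in left]

normalized = [normalize_billing(row) for row in billing_rows]
joined = left_join(crm_rows, normalized, "customer_id")
for row in joined:
    print(row)
```

Each rule inside `normalize_billing` encodes a convention that only someone familiar with the billing system can confirm, which is where the business expert's input becomes necessary.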

As a result of these challenges, efficient interaction between business experts and data scientists is crucial at the prototyping stage. Today, this interaction often takes the form of time-consuming and unstructured meetings.

🚀 Deployment

The deployment stage consists of getting the model produced by the data scientist ready for production. This stage also presents many challenges. A recent study conducted by analysts at the International Data Corporation (IDC) found that a significant portion of attempted deployments fail. One of the main reasons is that IT teams have a very different software development culture from that of data scientists. IT teams usually follow DevOps guidelines, i.e., the techniques and tools required to successfully maintain and support existing production systems.

DevOps is based on the implementation of tests at different levels throughout the software development process. These tests verify the correctness of new software features while ensuring that previously developed features do not regress. Many academic papers show how difficult it is to apply DevOps to AI systems. Machine learning has even been described as non-testable software. Here are three main reasons cited in a recent scientific literature review on AI testing.

AI follows a data-driven programming paradigm

According to a recent article from Cambridge, unlike in regular software products where changes only happen in the code, AI systems change along three axes: the code, the model, and the data. The model’s behavior may evolve in response to the frequent provision of new data. As a consequence, testing criteria are highly dependent on data, which makes it hard to implement tests.
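To make the dependence on data concrete, here is a minimal sketch in which the code stays identical while a new data delivery changes the verdict of a quality test. The one-feature threshold "model", the helper names (`train_threshold`, `accuracy`, `passes_quality_gate`), the 0.9 accuracy bar, and all sample values are assumptions chosen purely for illustration.

```python
def train_threshold(samples):
    """Fit a one-feature classifier: threshold halfway between the class means."""
    positives = [x for x, y in samples if y == 1]
    negatives = [x for x, y in samples if y == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2

def accuracy(threshold, samples):
    """Share of samples where `x > threshold` agrees with the label."""
    correct = sum((x > threshold) == bool(y) for x, y in samples)
    return correct / len(samples)

def passes_quality_gate(train, test, min_accuracy=0.9):
    """The 'test' of the ML system: retrain, then check held-out accuracy."""
    return accuracy(train_threshold(train), test) >= min_accuracy

# Two hypothetical training deliveries and one fixed held-out set.
train_v1 = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
train_v2 = [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)]
test_set = [(1.5, 0), (2.5, 0), (4.5, 1), (8.5, 1)]

print(passes_quality_gate(train_v1, test_set))  # threshold 5.0, accuracy 0.75
print(passes_quality_gate(train_v2, test_set))  # threshold 3.0, accuracy 1.0
```

No line of code differs between the two runs, yet the first fails the gate and the second passes it. A conventional regression test, which assumes that behavior only changes when code changes, cannot capture this.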

AI is not easily broken down into small unit components

Some AI properties (e.g., accuracy or precision) only emerge as a combination of different components such as the training data, the learning program, and even the learning library. It is hard to break these properties into smaller components that can be tested in isolation.

AI errors are systemic and self-amplifying

AI is characterized by many feedback loops and interactions between AI components. The output of one model can be ingested into the training base of another model. As a result, AI errors can be difficult to localize and therefore difficult to measure and treat. Today, there are plenty of examples of systemic errors of AI such as echo chambers or filter bubbles in recommender systems.

These testing challenges are summarized in the article by Zhang et al., which contrasts traditional software testing with ML testing.

Differences between traditional software testing and ML testing.
Source: Zhang et al. (2020)

⚙️ Production

The last stage of a complete AI development process is production. This is when the AI system is exposed to real users. It is also when runtime issues may happen, requiring maintenance. Here again, several challenges are reported by the research literature:

Building trust with end-users

Data scientists often forget that their models are part of a product that interacts with end users. Well-known challenges such as appropriation and usability are key to increasing user engagement with, and trust in, the AI-powered product.

Research shows that trust should not be reduced to technical explainability tools. Explanations of AI decisions are context-dependent and do not always reveal the biases and errors of the model. As recommended in Soden’s article, it is crucial to make the development process transparent by taking end users’ voices into account and by investing in context-aware personalization of AI systems’ interfaces.

Beyond statistical monitoring

A particularly important problem that directly affects model quality is concept drift. Concept drift is understood as a change over time in the joint distribution p(X, y), where X is the model input and y is the target the model predicts. When concept drift goes undetected, it can severely degrade model performance. This phenomenon is often addressed by adding more and more monitoring tools that measure the divergence between probability distributions, using measures such as the Kullback–Leibler divergence.

However, these indicators often lack operational business sense and therefore are difficult to link with end-user KPIs. They are created a posteriori to monitor errors in models that are already built. For these reasons, a more integrated vision of monitoring is necessary to ensure good system quality and reliability.
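For reference, the kind of statistical indicator discussed above can be sketched in a few lines. This is a simplified illustration, not a recommended monitoring setup: it tracks only the distribution of a single input feature rather than the full joint distribution p(X, y), and the bin edges, the add-one smoothing, the sample values, and the function names `histogram` and `kl_divergence` are all assumptions made for the example.

```python
import math

def histogram(values, edges):
    """Smoothed bin frequencies of `values` over the bins defined by `edges`.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            # The last bin also includes its right edge.
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    # Add-one smoothing keeps every probability strictly positive,
    # so the logarithm and the ratio below are always defined.
    total = sum(counts) + len(counts)
    return [(c + 1) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; note it is asymmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]     # feature at training time
production = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 1.0]   # same feature in production

p_ref = histogram(reference, edges)
p_prod = histogram(production, edges)
drift_score = kl_divergence(p_prod, p_ref)
print(f"KL(production || reference) = {drift_score:.3f}")
```

The resulting score is a single number in nats. Nothing in it says whether an end-user KPI is affected, which illustrates the gap between statistical monitoring and operational business sense.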

Explainability and monitoring interfaces are often designed in a closed world, by data scientists for data scientists. To build trust and ensure the business validity of AI systems, it is essential to include non-technical people more structurally in the AI development process.


At Giskard, we are working on innovative solutions to rethink the quality of AI. We are on a journey to help data scientists, engineers & business experts work together to create better AI models. We are building a visual quality management platform for AI models, from prototype to production.

Quality Assurance is not just a matter of regulation. It is a requirement for all mature industries that want to gain a sustainable competitive advantage from their technological investments. AI is no exception.

With Giskard, we aim to bring AI into the era of maturity. We help professionals build quality-by-design systems to increase and secure the business value of their AI models.

To learn more about our solutions, contact us at hello@giskard.ai.
