🧭 Introduction
Social media listening is transforming how companies manage their customer relationships. Instead of running costly, manual surveys, companies can learn directly from what their customers say online. This is especially crucial for B2C companies, which need to react quickly in a world where people interact more and more online.

One of the core techniques behind social media listening is sentiment analysis. These models let companies measure, in real time, the polarity of vast amounts of social media content. The principle is to classify whether the opinion expressed in a given document is positive, negative, or neutral. Simple yet powerful, this method lets practitioners aggregate these three polarities to get the overall feeling of a whole corpus of documents.

While sentiment analysis models are increasingly deployed in industry, building robust ones remains a challenge for data scientists. Here are some examples of sentences that are hard to classify:
- Sarcastic sentences: “I'd really truly love going out in this weather!”
- Double negation: “I do not dislike cabin cruisers.”
- Ambivalent feeling: “I love my mobile, but would not recommend it to any of my colleagues.”
As the examples above show, analyzing words one by one is not sufficient, and can even be misleading, when trying to capture the overall feeling of a sentence.
Fortunately, today's large deep learning models can capture the context of a document. But as we will see, they are not completely robust: we need to test them before putting them into production. In this article, we will illustrate with code examples:
- How to practically implement a sentiment analysis model by fine-tuning a BERT model
- How to explain and inspect the model to find particular bugs and issues
- How to create tests to put the model into production in a robust way.
🏋🏽 Train the model using a pre-trained model from HuggingFace
To do so, we use a pre-trained DistilBERT model from HuggingFace and fine-tune it on the Twitter US Airline Sentiment dataset. Thanks to the wonderful work of HuggingFace, the fine-tuning process is straightforward.
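Here is a minimal fine-tuning sketch with the HuggingFace Trainer. The file name `Tweets.csv`, the column names (`text`, `airline_sentiment`), the label mapping, and the hyperparameters are assumptions made to keep the example self-contained; adapt them to your setup.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed layout of the Twitter US Airline Sentiment CSV: a "text" column and
# an "airline_sentiment" column with values negative / neutral / positive.
df = pd.read_csv("Tweets.csv")
label2id = {"negative": 0, "neutral": 1, "positive": 2}
df["label"] = df["airline_sentiment"].map(label2id)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

# Tokenize the tweets and split into train/test sets.
dataset = Dataset.from_pandas(df[["text", "label"]], preserve_index=False)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-distilbert",
                           num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```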
The final model that we want to deploy into production is the following:
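Conceptually, what we ship is just a function that maps raw tweets to three class probabilities. Below is a minimal sketch reusing the fine-tuned `model` and `tokenizer` from the training snippet above; the helper names `predict_proba` and `predict_label` are ours, not a library API, and the later snippets in this article build on them.

```python
import torch

labels = ["negative", "neutral", "positive"]

def predict_proba(texts):
    """Return an (n_texts, 3) NumPy array of class probabilities, reusing the
    fine-tuned `model` and `tokenizer` from the training sketch above."""
    inputs = tokenizer(texts, truncation=True, padding=True,
                       max_length=128, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

def predict_label(texts):
    """Return the most likely label for each text."""
    return [labels[i] for i in predict_proba(texts).argmax(axis=1)]

print(predict_label(["My flight was delayed for three hours with no explanation."]))
```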
🧐 Find edge cases and bugs of the model
Now that we have trained the model, we need some tests before putting it into production. This is crucial for building a robust CI/CD pipeline. The first step is to manually inspect the model to get ideas for edge cases we may need to test. The open-source project Giskard lets data scientists inspect and explain ML models to find buggy cases.

As seen in the screenshot above, we can explain incorrect or borderline classifications. In the example below, the model seems to get it right: the word "stuck" is highlighted in green as one of the words that contribute the most to the prediction (negative sentiment in this example).
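Giskard computes these word-level contributions for you in the interface. To make the idea concrete outside the UI, here is a rough leave-one-word-out sketch (an illustration of the concept, not Giskard's actual explanation method), reusing `predict_proba` from the snippet above:

```python
def word_contributions(text, target="negative"):
    """Occlusion-style explanation: drop each word in turn and measure how much
    the probability of `target` falls. A larger drop means a larger contribution."""
    idx = labels.index(target)
    base = predict_proba([text])[0, idx]
    words = text.split()
    scores = []
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + words[i + 1:])
        scores.append((word, float(base - predict_proba([masked])[0, idx])))
    return sorted(scores, key=lambda item: item[1], reverse=True)

print(word_contributions("Stuck on the tarmac for two hours with no updates"))
```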
By exploring the incorrect examples and generating explanations, we discovered that the model often misclassifies certain patterns. For example, sarcastic speech is one of the patterns the model finds hard to identify.

As the example above suggests, it would be interesting to analyze all the tweets that contain “wait” or “waiting” and check whether the model performs poorly on them. This kind of data slice is exactly what we need to create relevant tests grounded in business knowledge.
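A quick way to check such a slice is to filter the evaluation data and compare metrics. Here is a minimal sketch with pandas and scikit-learn, assuming a held-out dataframe `test_df` with the same `text` and `airline_sentiment` columns as before:

```python
from sklearn.metrics import f1_score

# Slice: tweets that mention "wait" or "waiting".
mask = test_df["text"].str.contains(r"\bwait(?:ing)?\b", case=False, regex=True)
slice_df = test_df[mask]

f1_all = f1_score(test_df["airline_sentiment"],
                  predict_label(test_df["text"].tolist()), average="macro")
f1_slice = f1_score(slice_df["airline_sentiment"],
                    predict_label(slice_df["text"].tolist()), average="macro")
print(f"macro F1 on all tweets:    {f1_all:.2f}")
print(f"macro F1 on 'wait' tweets: {f1_slice:.2f}")
```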
With Giskard, we can also easily play with the model to check whether its behaviour makes sense. For example, we may expect that replacing the word "great" with "appealing" should not change the prediction much: a good model should be invariant to synonyms.
Conversely, replacing the word "bad" with "good" should increase the positive sentiment of the tweet. Unfortunately, after some inspection, we see that this is not always the case...
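These what-if probes are easy to script as well. Below is a small sketch reusing `predict_proba`; the example tweets are made up, and the expected behaviours (invariance to a synonym, higher positive probability after replacing "bad" with "good") are assumptions about how a sensible model should react.

```python
POSITIVE = labels.index("positive")

def positive_proba(text):
    """Probability of the 'positive' class for a single tweet."""
    return float(predict_proba([text])[0, POSITIVE])

tweet = "The crew was great and boarding was fast"
print(positive_proba(tweet), positive_proba(tweet.replace("great", "appealing")))
# Expectation: roughly similar probabilities (invariance to a synonym).

tweet = "The food was bad and the seats were uncomfortable"
print(positive_proba(tweet), positive_proba(tweet.replace("bad", "good")))
# Expectation: the second probability should be higher than the first.
```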
Since, as data scientists, we may not have the domain knowledge to answer all the questions, we can invite other people to provide feedback on the model we created. Collecting domain feedback on the model is critical for a robust deployment and avoiding nasty surprises at production time.

In the screenshot above, a customer service manager provides feedback that tweets containing “copy and paste” are quite common and often misclassified. After a bit of investigation, the manager explains that people often complain about customer service replies that are just copy-pasted from existing templates. In that case, it makes sense for the data scientist to test the model's performance on tweets that contain “copy and paste”.
Let’s see now how we can create tests using this feedback!
⛑ Create tests to deploy your model into production
Now that we have collected some buggy examples, it's time to create tests to make sure our model can be put into production. Creating tests is crucial for at least two reasons:
- Comparison between two models: this lets the data scientist make sure there is no regression each time a new model is created (after a re-training step, for instance). Comparing test results between two models is essential to ensure that development is going in the right direction. On the DevOps side, the tests can be executed automatically when a new version of the model code is merged (CI) or when the model is deployed to the production environment (CD).
- Comparison between two datasets: the tests can also be executed at run time, each time we collect new data points. Running the same tests across different datasets is commonly called monitoring. Some tests, like drift tests, are great for monitoring because they let data scientists check that their models keep behaving correctly in a changing environment (see the sketch after this list).
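To make the monitoring idea concrete, here is a minimal prediction-drift sketch (plain Python, not a Giskard preset): compare the distribution of predicted labels on a reference sample with the distribution on freshly collected tweets, using a chi-squared test.

```python
import numpy as np
from scipy.stats import chi2_contingency

def prediction_drift(reference_texts, production_texts, alpha=0.05):
    """Flag drift when the predicted-label distribution changes significantly
    between a reference sample and newly collected production data."""
    ref = np.bincount([labels.index(l) for l in predict_label(reference_texts)], minlength=3)
    prod = np.bincount([labels.index(l) for l in predict_label(production_texts)], minlength=3)
    _, p_value, _, _ = chi2_contingency(np.stack([ref, prod]))
    return {"p_value": float(p_value), "drift_detected": bool(p_value < alpha)}

# Example usage with two lists of raw tweets (hypothetical variables):
# print(prediction_drift(reference_tweets, last_week_tweets))
```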
With Giskard, we can easily leverage the knowledge gained from the inspection step to create and execute tests. Here are three examples of tests that we can write based on what we learned about the model:
- We can create a small dataset of sarcastic tweets. We can then expect the model to classify at least half of them as negative.
- We may expect the positive sentiment probability to decrease after removing the word “thanks”. More precisely, we may expect the ratio of examples with a decreased probability to be over 50%. This is called a metamorphic decreasing test. For more details, you can refer to this article.
- As we saw in the previous section, we may expect the F1 score on tweets containing the “copy and paste” pattern to stay above 0.5. This is called a slice-based performance test, and such tests are essential for a fine-grained view of your model's performance (see the sketch after this list).
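The first and third tests in the list above can be sketched directly on top of the helpers defined earlier (plain Python here; Giskard ships presets for the same ideas, whose exact API depends on the version you install):

```python
from sklearn.metrics import f1_score

def test_sarcasm_mostly_negative(sarcastic_tweets, threshold=0.5):
    """At least `threshold` of a hand-collected set of sarcastic tweets
    should be classified as negative."""
    preds = predict_label(sarcastic_tweets)
    ratio = sum(p == "negative" for p in preds) / len(preds)
    assert ratio >= threshold, f"Only {ratio:.0%} of sarcastic tweets predicted negative"

def test_copy_paste_slice(test_df, threshold=0.5):
    """Slice-based performance test: macro F1 on tweets mentioning
    'copy and paste' must stay above the threshold."""
    slice_df = test_df[test_df["text"].str.contains("copy and paste", case=False)]
    score = f1_score(slice_df["airline_sentiment"],
                     predict_label(slice_df["text"].tolist()), average="macro")
    assert score >= threshold, f"F1 on the 'copy and paste' slice is too low: {score:.2f}"
```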
We can easily create a few dozen tests by generalizing the problematic patterns we identified during inspection. For more complex tests, the Giskard open-source community provides example test sets that we can easily adapt to our case study. Below is an example of such a metamorphic test, which checks that the model is invariant when we replace words with synonyms (Giskard ships code presets for this kind of test).
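Since the exact signature of Giskard's presets depends on the version you install, here is the same metamorphic-invariance idea written out in plain Python (a sketch of the concept, not the preset itself, with an illustrative synonym map):

```python
def test_metamorphic_invariance(texts, substitutions, min_invariant_ratio=0.9):
    """Metamorphic invariance test: replacing words with synonyms should not
    change the predicted label for most examples."""
    perturbed = list(texts)
    for word, synonym in substitutions.items():
        perturbed = [t.replace(word, synonym) for t in perturbed]
    before = predict_label(list(texts))
    after = predict_label(perturbed)
    ratio = sum(b == a for b, a in zip(before, after)) / len(before)
    assert ratio >= min_invariant_ratio, f"Label changed for {1 - ratio:.0%} of tweets"

test_metamorphic_invariance(test_df["text"].tolist(),
                            {"great": "appealing", "awesome": "appealing"})
```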
Here, for instance, is the test suite that can easily be created with Giskard.

Once you're satisfied with your model, you can use the API that Giskard exposes to execute the tests, either in the CI/CD pipeline or at run time for monitoring purposes. You can also log all these metrics in your favorite experiment tracking tool.
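For instance, after the test suite runs in CI or in a monitoring job, the metrics can be pushed to a tracker such as MLflow (a minimal sketch reusing the helpers above; the run name and metric name are assumptions):

```python
import mlflow
from sklearn.metrics import f1_score

# Log the 'copy and paste' slice metric so that every CI run or monitoring job
# leaves a trace in the experiment tracker.
with mlflow.start_run(run_name="sentiment-model-tests"):
    slice_df = test_df[test_df["text"].str.contains("copy and paste", case=False)]
    slice_f1 = f1_score(slice_df["airline_sentiment"],
                        predict_label(slice_df["text"].tolist()), average="macro")
    mlflow.log_metric("f1_copy_paste_slice", slice_f1)
```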
✅ Conclusion
In this article, we described how to make a sentiment analysis ML model ready for production. We saw that a robust ML model requires:
- qualitative insights to capture edge cases
- quantitative metrics to generalize to many contexts
We illustrated these ideas with code snippets and concrete examples using open-source tools such as HuggingFace and Giskard.
Hope that helps you!