🧭 Introduction
Social media listening is transforming how companies manage their customer relationships. Instead of running costly, manual surveys, companies can learn directly from what their customers say online. This is especially crucial for B2C companies, which need to react quickly in a world where people interact more and more online.

One of the core techniques behind social media listening is sentiment analysis. These models let companies measure, in real time, the polarity of vast amounts of social media content. The principle is to classify whether the opinion expressed in a given document is positive, negative, or neutral. Simple yet powerful, this method lets practitioners aggregate these three polarities to get the overall feeling of a whole corpus of documents.

While sentiment analysis models are increasingly deployed in industry, building robust ones remains a challenge for data scientists. Here are some examples of sentences that are hard to classify:
- Sarcastic sentences: “I'd really truly love going out in this weather!”
- Double negation: “I do not dislike cabin cruisers.”
- Ambivalent feeling: “I love my mobile, but would not recommend it to any of my colleagues.”
As the examples above show, analyzing words one by one is not sufficient, and can even be misleading, when trying to capture the overall feeling of a sentence.
Fortunately, today's large deep learning models can capture the context of a document. But as we will see, they are not completely robust: we need to test them before putting them into production. In this article, we will illustrate with code examples:
- How to practically implement a sentiment analysis model by fine-tuning a BERT model
- How to explain and inspect the model to find particular bugs and issues
- How to create tests to put the model into production in a robust way.
🏋🏽 Train the model using a pre-trained model from HuggingFace
To do so, we use a pre-trained DistilBERT model from HuggingFace and fine-tune it on the Twitter US Airline Sentiment dataset. Thanks to the wonderful work of HuggingFace, the fine-tuning process is straightforward.
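Here is a minimal fine-tuning sketch with the HuggingFace Trainer. The file name `Tweets.csv`, the column names (`text`, `airline_sentiment`), the label mapping, and the hyperparameters are assumptions made to keep the example self-contained; adapt them to your setup.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed layout of the Twitter US Airline Sentiment CSV: a "text" column and
# an "airline_sentiment" column with values negative / neutral / positive.
df = pd.read_csv("Tweets.csv")
label2id = {"negative": 0, "neutral": 1, "positive": 2}
df["label"] = df["airline_sentiment"].map(label2id)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

# Tokenize the tweets and split into train/test sets.
dataset = Dataset.from_pandas(df[["text", "label"]], preserve_index=False)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-distilbert",
                           num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```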
The final model that we want to deploy into production is the following:
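Conceptually, what we ship is just a function that maps raw tweets to three class probabilities. Below is a minimal sketch reusing the fine-tuned `model` and `tokenizer` from the training snippet above; the helper names `predict_proba` and `predict_label` are ours, not a library API, and the later snippets in this article build on them.

```python
import torch

labels = ["negative", "neutral", "positive"]

def predict_proba(texts):
    """Return an (n_texts, 3) NumPy array of class probabilities, reusing the
    fine-tuned `model` and `tokenizer` from the training sketch above."""
    inputs = tokenizer(texts, truncation=True, padding=True,
                       max_length=128, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

def predict_label(texts):
    """Return the most likely label for each text."""
    return [labels[i] for i in predict_proba(texts).argmax(axis=1)]

print(predict_label(["My flight was delayed for three hours with no explanation."]))
```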
🧐 Find edge cases and bugs of the model
Now that we have trained the model, we need some tests before putting it into production. This is crucial for building a robust CI/CD pipeline. The first step is to manually inspect the model to get ideas for edge cases we may need to test. The open-source project Giskard lets data scientists inspect and explain ML models to find buggy cases.

As seen in the screenshot above, we can explain incorrect or borderline classifications. In the example below, the model seems to get it right: the word "stuck" is highlighted in green as one of the words that contribute the most to the prediction (negative sentiment in this example).
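Giskard computes these word-level contributions for you in the interface. To make the idea concrete outside the UI, here is a rough leave-one-word-out sketch (an illustration of the concept, not Giskard's actual explanation method), reusing `predict_proba` from the snippet above:

```python
def word_contributions(text, target="negative"):
    """Occlusion-style explanation: drop each word in turn and measure how much
    the probability of `target` falls. A larger drop means a larger contribution."""
    idx = labels.index(target)
    base = predict_proba([text])[0, idx]
    words = text.split()
    scores = []
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + words[i + 1:])
        scores.append((word, float(base - predict_proba([masked])[0, idx])))
    return sorted(scores, key=lambda item: item[1], reverse=True)

print(word_contributions("Stuck on the tarmac for two hours with no updates"))
```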
By exploring the incorrect examples and generating explanations, we discovered that the model often misclassifies certain patterns. For example, sarcastic speech is one of the patterns the model finds hard to identify.

As the example above suggests, it would be interesting to analyze all the tweets that contain “wait” or “waiting” and check whether the model performs poorly on them. This kind of data slice is exactly what we need to create relevant tests grounded in business knowledge.
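A quick way to check such a slice is to filter the evaluation data and compare metrics. Here is a minimal sketch with pandas and scikit-learn, assuming a held-out dataframe `test_df` with the same `text` and `airline_sentiment` columns as before:

```python
from sklearn.metrics import f1_score

# Slice: tweets that mention "wait" or "waiting".
mask = test_df["text"].str.contains(r"\bwait(?:ing)?\b", case=False, regex=True)
slice_df = test_df[mask]

f1_all = f1_score(test_df["airline_sentiment"],
                  predict_label(test_df["text"].tolist()), average="macro")
f1_slice = f1_score(slice_df["airline_sentiment"],
                    predict_label(slice_df["text"].tolist()), average="macro")
print(f"macro F1 on all tweets:    {f1_all:.2f}")
print(f"macro F1 on 'wait' tweets: {f1_slice:.2f}")
```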
With Giskard, we can also easily play with the model to check whether its behaviour makes sense. For example, we may expect that replacing the word "great" with "appealing" should not change the prediction much: a good model should be invariant to synonyms.
Conversely, replacing the word "bad" with "good" should increase the positive sentiment of the tweet. Unfortunately, after some inspection, we see that this is not always the case...
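These what-if probes are easy to script as well. Below is a small sketch reusing `predict_proba`; the example tweets are made up, and the expected behaviours (invariance to a synonym, higher positive probability after replacing "bad" with "good") are assumptions about how a sensible model should react.

```python
POSITIVE = labels.index("positive")

def positive_proba(text):
    """Probability of the 'positive' class for a single tweet."""
    return float(predict_proba([text])[0, POSITIVE])

tweet = "The crew was great and boarding was fast"
print(positive_proba(tweet), positive_proba(tweet.replace("great", "appealing")))
# Expectation: roughly similar probabilities (invariance to a synonym).

tweet = "The food was bad and the seats were uncomfortable"
print(positive_proba(tweet), positive_proba(tweet.replace("bad", "good")))
# Expectation: the second probability should be higher than the first.
```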
Since, as data scientists, we may not have the domain knowledge to answer all the questions, we can invite other people to provide feedback on the model we created. Collecting domain feedback on the model is critical for a robust deployment and avoiding nasty surprises at production time.

In the screenshot above, a customer service manager provides feedback that tweets containing “copy and paste” are quite common and often misclassified. After a bit of investigation, the manager explains that people often complain about customer service replies that are just copy-pasted from existing templates. In that case, it makes sense for the data scientist to test the model's performance on tweets that contain “copy and paste”.
Let’s see now how we can create tests using this feedback!
⛑ Create tests to deploy your model into production
Now that we have collected some buggy examples, it's time to create tests to make sure our model can be put into production. Creating tests is crucial for at least two reasons:
- Comparison between two models: this lets the data scientist make sure there is no regression each time a new model is created (after a re-training step, for instance). Comparing test results between two models is essential to ensure that development is going in the right direction. On the DevOps side, the tests can be executed automatically when a new version of the model code is merged (CI) or when the model is deployed to the production environment (CD).
- Comparison between two datasets: the tests can also be executed at run time, each time we collect new data points. Running the same tests across different datasets is commonly called monitoring. Some tests, like drift tests, are great for monitoring because they let data scientists check that their models keep behaving correctly in a changing environment (see the sketch after this list).
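To make the monitoring idea concrete, here is a minimal prediction-drift sketch (plain Python, not a Giskard preset): compare the distribution of predicted labels on a reference sample with the distribution on freshly collected tweets, using a chi-squared test.

```python
import numpy as np
from scipy.stats import chi2_contingency

def prediction_drift(reference_texts, production_texts, alpha=0.05):
    """Flag drift when the predicted-label distribution changes significantly
    between a reference sample and newly collected production data."""
    ref = np.bincount([labels.index(l) for l in predict_label(reference_texts)], minlength=3)
    prod = np.bincount([labels.index(l) for l in predict_label(production_texts)], minlength=3)
    _, p_value, _, _ = chi2_contingency(np.stack([ref, prod]))
    return {"p_value": float(p_value), "drift_detected": bool(p_value < alpha)}

# Example usage with two lists of raw tweets (hypothetical variables):
# print(prediction_drift(reference_tweets, last_week_tweets))
```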
With Giskard, we can easily leverage the knowledge gained from the inspection step to create and execute tests. Here are three examples of tests that we can write based on what we learned about the model:
- We can create a small dataset of sarcastic tweets. We can then expect the model to classify at least half of them as negative.
- We may expect the positive sentiment probability to decrease after removing the word “thanks”. More precisely, we may expect the ratio of examples with a decreased probability to be over 50%. This is called a metamorphic decreasing test. For more details, you can refer to this article.
- As we saw in the previous section, we may expect the F1 score on tweets containing the “copy and paste” pattern to stay above 0.5. This is called a slice-based performance test, and such tests are essential for a fine-grained view of your model's performance (see the sketch after this list).
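The first and third tests in the list above can be sketched directly on top of the helpers defined earlier (plain Python here; Giskard ships presets for the same ideas, whose exact API depends on the version you install):

```python
from sklearn.metrics import f1_score

def test_sarcasm_mostly_negative(sarcastic_tweets, threshold=0.5):
    """At least `threshold` of a hand-collected set of sarcastic tweets
    should be classified as negative."""
    preds = predict_label(sarcastic_tweets)
    ratio = sum(p == "negative" for p in preds) / len(preds)
    assert ratio >= threshold, f"Only {ratio:.0%} of sarcastic tweets predicted negative"

def test_copy_paste_slice(test_df, threshold=0.5):
    """Slice-based performance test: macro F1 on tweets mentioning
    'copy and paste' must stay above the threshold."""
    slice_df = test_df[test_df["text"].str.contains("copy and paste", case=False)]
    score = f1_score(slice_df["airline_sentiment"],
                     predict_label(slice_df["text"].tolist()), average="macro")
    assert score >= threshold, f"F1 on the 'copy and paste' slice is too low: {score:.2f}"
```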
We can easily create a few dozen tests by generalizing the problematic patterns we identified during inspection. For more complex tests, the Giskard open-source community provides example test sets that we can easily adapt to our case study. Below is an example of such a metamorphic test, which checks that the model is invariant when we replace words with synonyms (Giskard ships code presets for this kind of test).
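Since the exact signature of Giskard's presets depends on the version you install, here is the same metamorphic-invariance idea written out in plain Python (a sketch of the concept, not the preset itself, with an illustrative synonym map):

```python
def test_metamorphic_invariance(texts, substitutions, min_invariant_ratio=0.9):
    """Metamorphic invariance test: replacing words with synonyms should not
    change the predicted label for most examples."""
    perturbed = list(texts)
    for word, synonym in substitutions.items():
        perturbed = [t.replace(word, synonym) for t in perturbed]
    before = predict_label(list(texts))
    after = predict_label(perturbed)
    ratio = sum(b == a for b, a in zip(before, after)) / len(before)
    assert ratio >= min_invariant_ratio, f"Label changed for {1 - ratio:.0%} of tweets"

test_metamorphic_invariance(test_df["text"].tolist(),
                            {"great": "appealing", "awesome": "appealing"})
```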
Here, for instance, is the test suite that can easily be created with Giskard.

Once you're satisfied with your model, you can use the API that Giskard exposes to execute the tests, either in the CI/CD pipeline or at run time for monitoring purposes. You can also log all these metrics in your favorite experiment tracking tool.
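For instance, after the test suite runs in CI or in a monitoring job, the metrics can be pushed to a tracker such as MLflow (a minimal sketch reusing the helpers above; the run name and metric name are assumptions):

```python
import mlflow
from sklearn.metrics import f1_score

# Log the 'copy and paste' slice metric so that every CI run or monitoring job
# leaves a trace in the experiment tracker.
with mlflow.start_run(run_name="sentiment-model-tests"):
    slice_df = test_df[test_df["text"].str.contains("copy and paste", case=False)]
    slice_f1 = f1_score(slice_df["airline_sentiment"],
                        predict_label(slice_df["text"].tolist()), average="macro")
    mlflow.log_metric("f1_copy_paste_slice", slice_f1)
```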
✅ Conclusion
In this article, we described how to make a sentiment analysis ML model ready for production. We saw that a robust ML model requires:
- qualitative insights to capture edge cases
- quantitative metrics to generalize to many contexts
We illustrated these ideas with code snippets and concrete examples using open-source tools such as HuggingFace and Giskard.
Hope that helps you!