Think about the last time you were on the client end of a customer service conversation. Were you forced to make a phone call and wait around to get through to a human? Once you finally got through to one, did they put you on hold 15 times to check and double-check the information they needed? Perhaps they seemed a little tired after a long shift, or exasperated at being asked the same routine query for the umpteenth time?
Gone are the days when all a customer service bot could do was intone "To speak with a member of our team, press 1." Large Language Models (LLMs) provide a reliable solution for many of the problems faced by growing customer support teams. They are fast, consistent, available round the clock, and free up human agents' time to respond to more important, less repetitive customer needs. Perhaps most importantly, they excel at filtering through large amounts of data - such as customer ticket data! - to find relevant information. If your business is quickly scaling its customer service team to meet increasing demand, you might be wondering how to leverage LLMs to ramp up your productivity.
Using an open-source LLM
In contrast to proprietary LLMs like OpenAI's GPT models, open-source LLMs promote transparency by making their training data and architectural features publicly available. If you're using LLMs for customer service, you might want to give the LLM access to your own enterprise data (through embeddings or fine-tuning) to optimise the quality and relevance of its responses. To provide the ultimate protection to any sensitive customer data involved in this process, you'll want to host the model on your own infrastructure. By facilitating transparent fine-tuning and self-hosting, open-source models offer a level of customisability and data security which you cannot achieve by using a proprietary API.
Choosing the best Open-Source LLM for your AI Chatbot
In the rest of this article, we'll discuss how to do exactly that. Even once you've decided to go open-source, there are still a huge number of models to choose from and many variables to evaluate.
As a customer service provider, your LLM may well be the first line of interaction your clients will have with your company - and as we know, first impressions are everything!
To ensure your LLM represents your company as well as possible, you'll want to test it according to the following criteria:
- Hallucinations: the existence of inaccurate or implausible information in model output. Not much use in a chatbot that misleads its users!
- Sensitive information: whether the LLM can be induced to reveal personally identifiable information (PII). Even the largest and highest-performing LLMs can be susceptible to this.
- Bias and discrimination: racist or sexist answers, for instance.
- Harmfulness: whether the model, when prompted in a certain way, provides harmful or toxic output.
- Robustness: how far small changes in the quality of your input data (such as typos) affect your model's performance. A robust chatbot minimises the amount of work the customer has to put in to get a good response.
Now that we've covered the basics of open-source LLMs and how to evaluate them, let's find out which open-source LLM performs the best on a simulated customer support use-case.
We're going to test two popular 7B parameter models: Falcon 7B and Mistral 7B. You can find many more open-source LLMs to play around with on the Open LLM leaderboard.
Building a Customer Service Chatbot with an Open-Source LLM
For this part of the tutorial, all you need is a free Hugging Face API key! This will allow you to prompt any model on the Hugging Face Hub and give it access to any knowledge base (documentation, FAQs, previous tickets, etc.) through LangChain embeddings.
First, let’s set up our environment:
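A minimal sketch might look like this, assuming the classic LangChain package layout plus FAISS for the vector store (exact package names and versions may differ in newer releases):

```python
# Run once in your environment:
# pip install langchain huggingface_hub sentence-transformers faiss-cpu pandas giskard

import os
import pandas as pd
```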
You’ll also want to set your Hugging Face Hub API key as an environment variable.
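For example (the token value is a placeholder; generate your own under your Hugging Face account settings):

```python
import os

# LangChain's HuggingFaceHub wrapper reads this environment variable
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."  # replace with your own token
```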
For our example knowledge base, we'll be using this very generic customer support dataset. It contains a 'query' column of customer messages and a 'response' column of corresponding support responses. Of course, you'll get higher-quality, less generic model output by letting the model reference a dataset made up of real tickets.
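As a sketch, assuming the dataset has been saved locally as a CSV (the file name below is a placeholder):

```python
import pandas as pd

# Load the example customer support dataset
df = pd.read_csv("customer_support_data.csv")  # placeholder path

# Collect the customer messages into a plain Python list
data = df["query"].tolist()
```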
Now we have the customer messages in list form, ready to be embedded. Optionally, you can create your data list from a concatenation of the query and response columns: df['query'].tolist() + df['response'].tolist(). This makes more sense if you’re building a more retrieval-based system; for instance, giving an LLM access to your FAQs so that it can find the most relevant existing response rather than dynamically generating a brand new one.
As we mentioned, embeddings allow an LLM to access a knowledge base. Specifically, they encode the knowledge base as vectors, allowing the LLM to query the database of vectors to find the most relevant information for a specific prompt. Embeddings are a good way to adapt a customer support bot to your particular use-case without requiring large amounts of data or computational power.
You can create a database from your data with just a couple of lines of code:
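Here's a minimal sketch, using FAISS as the vector store and a small sentence-transformers model for the embeddings (both are our choices here, not requirements):

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Encode each customer message as a vector
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build a searchable vector database from the list of messages
db = FAISS.from_texts(data, embeddings)
```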
Now for the LLM itself, which we access via an API call:
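Something along these lines works, using LangChain's HuggingFaceHub wrapper (the generation parameters are illustrative):

```python
from langchain.llms import HuggingFaceHub

# Query the model through the Hugging Face Inference API
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-Instruct-v0.1",
    model_kwargs={"temperature": 0.5, "max_new_tokens": 250},
)
```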
repo_id should match the path to your model on the Hugging Face Hub.
Putting it all together, we can ‘chain’ the database to the LLM so that it can reference our dataset of customer messages when generating answers to queries. We do this with the RetrievalQA class:
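A sketch, assuming the llm and db objects from the previous steps and the simple 'stuff' chain type:

```python
from langchain.chains import RetrievalQA

# Chain the LLM to the vector store so answers are grounded in our ticket data
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
)
```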
Finally, run the chain with an example customer query to check it works:
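For instance, with an illustrative query:

```python
query = "I'd like a refund for my last order. How do I request one?"
print(qa.run(query))
```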
This should get you a reasonable, relevant response to your query.
This process works with both the Falcon and Mistral models. Just make sure you’ve replaced repo_id with the right model path from the Hugging Face Hub, and use the ‘instruct’ versions of the models for best performance (these models, as the name suggests, are specialised in following instructions).
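For reference, the instruct variants we tested can be swapped in like this (double-check the exact repository paths on the Hub, as they can change):

```python
from langchain.llms import HuggingFaceHub

falcon_llm = HuggingFaceHub(repo_id="tiiuae/falcon-7b-instruct")
mistral_llm = HuggingFaceHub(repo_id="mistralai/Mistral-7B-Instruct-v0.1")
```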
How to Evaluate LLMs for your Customer Service AI Chatbot
Now that we can generate responses to customer service queries using LLMs, how can we evaluate which LLM gives the highest-quality responses?
With the Giskard LLM library, you can run a comprehensive LLM evaluation directly in your notebook. The evaluation works by measuring how the model deals with adversarial input, i.e. prompts designed to bias the model towards behaving in an undesirable way. You’ll need OpenAI API access for this (set your API key as an environment variable), as Giskard uses GPT-4 to generate some of this adversarial input and analyse the model’s responses to it.
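Setting the key might look like this (the value is a placeholder):

```python
import os

# Giskard's LLM scan uses GPT-4 to generate and grade adversarial prompts
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key
```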
For example, this test prompt was generated to ‘trick’ the LLM into providing guidance on how to carry out a harmful activity (in this case, how to send a phishing email). The test fails, telling us that whatever safety guardrails this model has are not strong enough to prevent it from giving instructions on how to send a phishing email:
Not all adversarial input is aimed at catching harmful output; this technique can also be used to catch contradictions, formatting errors, and unexpected behaviour caused by special characters:
Here’s the Python code for running a scan:
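A sketch along these lines works, wrapping the qa chain from earlier (the name, description, and sample dataset are illustrative; adapt them to your own data):

```python
import giskard
import pandas as pd

# Wrap the RetrievalQA chain in a prediction function that takes a dataframe
def model_predict(df: pd.DataFrame):
    return [qa.run(question) for question in df["query"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Customer support chatbot",
    description="A customer support chatbot that answers questions about "
                "orders, refunds, accounts and shipping for an online store.",
    feature_names=["query"],
)

# A small sample of real queries is enough for the scan to work with
giskard_dataset = giskard.Dataset(pd.DataFrame({"query": data[:10]}), target=None)

scan_results = giskard.scan(giskard_model, giskard_dataset)
```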
Make sure that your feature_names list includes all the columns in your dataset and that your description and name are as informative as possible. This is how Giskard uses the specifics of your task to come up with relevant test cases. For example, for our customer service bot, Giskard might specifically test whether models can be induced to reveal customer phone numbers or encourage terms of service violations.
If you're comparing multiple models, your description should be identical for each of them. Otherwise, the scan may generate different test cases for each model and the comparison won't be fair.
Pass your giskard.scan() object to the display() function to display the results of your scan. You’ll see a window that shows you all the evaluation criteria used and where exactly your model went wrong. You can click through specific criteria to see a detailed breakdown of which adversarial input test cases your model failed:
So, which LLM fared best in our head-to-head?
Here are our scan results from the Falcon model:
And here are the results from the Mistral model:
Right away, you can tell that our scan flagged up more issues with the Mistral model than with Falcon - so on that front, Falcon is the winner!
Of course, figuring out which model is the most appropriate for your use-case involves making more nuanced judgements about which bias and security issues you want to prioritise. The Giskard scan interface helps you to do this by categorising your results in the following ways:
- Level of severity. ‘Major’ and ‘medium’ severity flags give you a sense of how problematic your failed test cases actually are. In our comparison, two of the ‘extra’ tests which Mistral failed and Falcon didn’t were actually in the ‘medium’ category. Specifically, Mistral expressed willingness to provide a representative’s contact details and confirmed the last 4 digits of a credit card number as ‘1234’ when prompted. These outputs were marked as ‘medium’ issues. Ultimately, though, you can decide what you consider to be a severe issue by looking at each test the model failed.
- Type of issue. The categorisation of failed tests into hallucination, robustness, harmfulness, sensitive information disclosure, and stereotypes allows you to focus on the criteria that are the most important to you. In our scan, although Falcon got better results overall, it failed two tests in the ‘robustness’ category. Namely, when prompted with particular special characters, it produced irrelevant output. Mistral passed with flying colours in this category - so if this is particularly important to you, you might want to use Mistral.
For even more detailed model comparisons, domain-specific tests, and a debugging interface to iterate on your models with the rest of your team, check out the Giskard Hub! You can get a taste of the advanced features it has to offer at the Giskard Hugging Face Space.
Conclusion: Using Fine-Tuned LLMs for AI Chatbots
You can use LangChain embeddings, the Hugging Face API and the Giskard library to evaluate any model on the Hugging Face Hub based on a knowledge base and use-case. With the Giskard scan, you can stress-test models on realistic and personalised adversarial input to work out which one performs the best according to your own criteria. While our scan found that Falcon 7B had more bias and security issues than Mistral 7B for the customer service use-case, every scan is different - as are every team’s priorities!
If you’re a company building customer service bots, be sure to take a look at Giskard Hub for more features to help you find the best LLM.