Customer Support Quality Assurance
Implement a comprehensive QA system for customer support responses using Rating-based LLM evaluation, Exact Match rules, and Manual review
Live example
Try out this agent setup in the Latitude Playground.
Overview
This tutorial demonstrates how to build a quality assurance system for customer support responses using three specific Latitude evaluation types:
- LLM-as-Judge: Rating evaluation for helpfulness assessment
- Programmatic Rules with Exact Match for required information validation
- Human-in-the-Loop manual evaluation for customer satisfaction scoring
The Prompt
This is the prompt that will be used to generate customer support responses. It is a simple prompt that takes a customer query and generates a response. It doesn’t use a knowledge base or any additional information.
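For reference, a simple prompt along these lines might look roughly like the following in Latitude’s PromptL syntax. The model settings, the company name, and the parameter names (customer_name, ticket_number, customer_query) are illustrative assumptions rather than the exact prompt from the live example; the instructions about the ticket number and the closing sentence mirror what the evaluations later in this tutorial check for.

```
---
provider: OpenAI
model: gpt-4o-mini
temperature: 0.2
---

You are a friendly customer support agent for Acme Inc.
Answer the customer's question clearly and concisely.
Always include the support ticket number (format TCKT-1234) in your reply,
and always end with the sentence:
"Is there anything else I can help you with today?"

<user>
  Customer: {{ customer_name }}
  Ticket: {{ ticket_number }}
  Question: {{ customer_query }}
</user>
```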
In this example, the prompt is very simple, but you could also upload documents to OpenAI and use the file search tool in their Responses API. That gives you a knowledge base search over your documents, so your customer support responses can be grounded in actual documentation. However, this is out of the scope of this tutorial.
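If you do want to go that route, a minimal sketch with the OpenAI Responses API file search tool might look like the code below. The model name and vector store ID are placeholders, and nothing here is required for the rest of the tutorial.

```typescript
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Ask a question and let the model search a previously uploaded documentation store.
// 'vs_your_docs_store' is a placeholder for your own vector store ID.
const response = await openai.responses.create({
  model: 'gpt-4o-mini',
  input: 'How do I reset my password?',
  tools: [{ type: 'file_search', vector_store_ids: ['vs_your_docs_store'] }],
})

console.log(response.output_text)
```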
The Evaluations
To create new evaluations, go to the evaluations tab in the Latitude Playground and click on “Add Evaluation”.
Helpfulness Assessment (LLM-as-Judge)
This is how we configure an LLM-as-Judge evaluation to assess the helpfulness of customer support responses.
Configure the evaluation
This evaluation uses the AI-powered Rating metric to assess response quality, with a criterion such as “Assess how well the response follows the given instructions” and a 1-5 rating scale where 1 means “Not faithful, doesn’t follow the instructions” and 5 means “Very faithful, follows the instructions”.
Create an experiment from the evaluation
An Experiment is a way of running the prompt many times and checking, with this evaluation, whether each run passes the criteria. Before creating the experiment, we need to create a dataset, so click on “Generate dataset”.
Create the synthetic dataset
A synthetic dataset is generated by the system to test the evaluation, so we don’t need to collect real data first. It includes a column for each parameter in our prompt.
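The first few rows might look something like this. The column names assume hypothetical prompt parameters (customer_name, ticket_number, customer_query); your dataset will have one column per parameter your prompt actually declares.

```csv
customer_name,ticket_number,customer_query
Ada,TCKT-0042,"I was charged twice for my subscription this month."
Lin,TCKT-0137,"How do I reset my password?"
Sam,TCKT-0291,"My order arrived damaged. What are my options?"
```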
Run the experiment
Once we have the dataset, select it in the dataset selector and click “Run experiment”.
View experiment results
After running the experiment with 30 rows of the synthetic dataset you just created, you can see the results! The green counter shows the successful cases. Yellow represents results that failed the evaluation, and red means errors occurred during the experiment run.
Required Information Validation (Programmatic Rule - Exact Match)
The goal of this evaluation is to ensure every response contains mandatory elements like ticket numbers and proper closing statements. Let’s set it up.
Configure the evaluation
Create dataset with expected output
We need to create another dataset, but this time it must have an expected output column. You can reuse the same dataset and add a new column with the expected output. In this case, we want to ensure our prompt always responds with the sentence “Is there anything else I can help you with today?”
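Continuing the hypothetical dataset from before, the extra column could look like this (the column name here is just an example):

```csv
customer_name,ticket_number,customer_query,expected_output
Ada,TCKT-0042,"I was charged twice for my subscription this month.","Is there anything else I can help you with today?"
Lin,TCKT-0137,"How do I reset my password?","Is there anything else I can help you with today?"
```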
Contains Ticket Number (Programmatic Rule - Regular Expression)
Configure the evaluation
To configure this evaluation, we use a regular expression to ensure the customer support response contains a ticket number. In this case, we require that:
- The ticket number starts with TCKT-
- It is followed by 4 digits (\d{4})

The full pattern is therefore TCKT-\d{4}.
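As a quick sanity check, you can try the same pattern locally before saving the evaluation. This is just an illustration of how the expression behaves, not something Latitude requires.

```typescript
// The same pattern the evaluation uses: "TCKT-" followed by four digits.
const ticketPattern = /TCKT-\d{4}/

console.log(ticketPattern.test('Your ticket TCKT-0042 has been created.')) // true
console.log(ticketPattern.test('Your ticket #0042 has been created.'))     // false
```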
This is the shape of our ticket column in the dataset.
Now we’re ready to create this new evaluation.
Run the experiment
This step is the same as for the first evaluation: we create an experiment and review the results. In this case, we should see that the AI responded with the ticket number, because it’s part of our prompt. This is a basic check, but it ensures that future modifications to the prompt keep the ticket number.
Manual Evaluation (HITL - Human in the Loop)
Configure the evaluation
Customer satisfaction involves nuanced judgment about tone, cultural sensitivity, and domain-specific accuracy that automated systems might miss, making it perfect for human evaluation.
Annotate past conversations (logs)
The first way to enable human evaluators to review responses is to give them access to Latitude’s logs. Now that we’ve configured the HITL evaluation, when they click on a log in the right panel they can assign it a score from 1 to 5, as previously configured.
Annotate with the SDK
Another way to add manual evaluations is to use the Latitude SDK. You can see an example of how to do it here.
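As a rough sketch, annotating a log programmatically could look like the following. This assumes the TypeScript SDK and its evaluations.annotate method; the project ID, conversation UUID, and evaluation UUID are placeholders you would replace with your own values.

```typescript
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY!, { projectId: 123 })

// Attach a human score (1-5, as configured above) to an existing log.
// 'conversation-uuid' and 'evaluation-uuid' are placeholders.
await latitude.evaluations.annotate(
  'conversation-uuid',
  5,
  'evaluation-uuid',
  { reason: 'Clear, empathetic answer that fully resolved the issue.' },
)
```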
Minimum score
One thing we didn’t do when configuring the evaluation was set a minimum score required to pass. Let’s do it now: go to the manual evaluation’s detail view and click Settings in the top right of the screen.
Manual evaluation results
Now our human evaluator has scored the responses and we can see the results in the experiment.
In the image, we see an evaluation with a score of 1 shown in green; this was before we set the minimum score to 3. The next one didn’t pass and is shown in red.
Live Mode
We’ve done a lot of work so far. We set up four evaluations but only tested them against synthetic data. Now we want to test our evaluations against real customer interactions: this is what we call Live Mode.
Let’s set the Helpfulness Assessment evaluation to live mode. Go to the evaluation’s detail view, click Settings in the top right corner, and at the bottom, under Advanced configuration, you will find the Evaluate live logs toggle.
We did the same for the Contains Ticket Number programmatic rule evaluation.
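With live evaluation enabled, every log produced by real traffic is scored automatically. For example, a production call through the SDK, sketched below with a hypothetical prompt path and parameter names, creates a log that the live evaluations will pick up.

```typescript
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY!, { projectId: 123 })

// Running the prompt in production creates a log in Latitude.
// Evaluations with the "Evaluate live logs" toggle enabled will score it automatically.
const result = await latitude.prompts.run('customer-support/response', {
  parameters: {
    customer_name: 'Ada',
    customer_query: 'I was charged twice for my subscription this month.',
  },
})

console.log(result)
```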
Conclusion
By setting up a robust evaluation framework for customer support responses, we’ve learned how different types of automated and manual evaluations work together to ensure high-quality service. Automated LLM-based ratings help us assess response helpfulness at scale, while programmatic rules (like exact match and regular expressions) ensure that critical information, such as ticket numbers and required statements, is always included. Human-in-the-loop (manual) evaluations provide the nuanced judgment that only real people can offer, especially for customer satisfaction and tone. Testing our system with both synthetic and real data (live mode) gives us confidence that our evaluations are both reliable and effective. Ultimately, these evaluations help us catch issues early, improve our prompts, and consistently deliver accurate, customer-friendly support, leading to better customer satisfaction and operational excellence.
Resources
- LLM-as-Judge Evaluation — How to use LLMs to evaluate responses
- Programmatic Rule Evaluation — How to use programmatic rules to evaluate responses
- Human-in-the-Loop Evaluation — How to use human evaluators to evaluate responses
- Running Evaluations — How to run evaluations against synthetic and live data
- Datasets — How to create datasets for evaluations