Customer Support Quality Assurance
Implement a comprehensive QA system for customer support responses using Rating-based LLM evaluation, Exact Match rules, and Manual review
Live example
Try out this agent setup in the Latitude Playground.
Overview
This tutorial demonstrates how to build a quality assurance system for customer support responses using three specific Latitude evaluation types:
- LLM-as-Judge: a Rating evaluation that scores response helpfulness
- Programmatic Rules: an Exact Match evaluation that validates required information
- Human-in-the-Loop: a manual evaluation that captures a customer satisfaction score
The Prompt
This is the prompt used to generate customer support responses. It is a simple prompt that takes a customer query and produces an answer; it doesn’t use a knowledge base or any other context.
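If you want to experiment with the same idea outside the Playground, here is a minimal sketch using the OpenAI Python SDK. The model name, system instructions, and example query are placeholders for illustration, not the tutorial’s exact prompt.

```python
# Minimal sketch of the support prompt outside Latitude, using the OpenAI Python SDK.
# The model name, system instructions, and example query below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_support_query(customer_query: str) -> str:
    """Generate a customer support response from a single query, with no knowledge base."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful customer support agent. Answer clearly, stay polite, "
                    "and include the customer's ticket number in your reply when one is given."
                ),
            },
            {"role": "user", "content": customer_query},
        ],
    )
    return completion.choices[0].message.content


print(answer_support_query("My order with ticket #4821 arrived damaged. What should I do?"))
```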
In this example the prompt is deliberately simple, but you could also upload documents to OpenAI and use the file search tool in their Responses API. That gives you a knowledge base search over your documents, so your responses to customer support queries can be grounded in actual documentation. However, that is outside the scope of this tutorial.
The Evaluations
To create new evaluations, go to the evaluations tab in the Latitude Playground and click on “Add Evaluation”.
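Each evaluation is configured in the Playground UI, so no code is required. Still, for the LLM-as-Judge rating it can help to see what such a judge does conceptually. The Python sketch below is only an illustration of the idea, not Latitude’s internal implementation; the judge model, the 1 to 5 scale, and the JSON output format are assumptions.

```python
# Conceptual illustration of an LLM-as-Judge rating evaluation for helpfulness.
# Latitude runs this kind of evaluation for you; the judge model, scale, and
# output format here are assumptions made for the sketch.
import json

from openai import OpenAI

client = OpenAI()


def rate_helpfulness(customer_query: str, support_response: str) -> dict:
    """Ask a judge model to rate a support response on a 1-5 helpfulness scale."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a strict QA reviewer. Rate how helpful the support response is "
                    "for the customer's query on a scale from 1 (useless) to 5 (fully resolves "
                    'the issue). Reply as JSON: {"score": <1-5>, "reason": "..."}.'
                ),
            },
            {
                "role": "user",
                "content": f"Customer query:\n{customer_query}\n\nSupport response:\n{support_response}",
            },
        ],
    )
    return json.loads(completion.choices[0].message.content)
```

In Latitude you define the equivalent rating criteria directly in the evaluation form when you add a Rating evaluation.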
Live Mode
We’ve done a lot of work so far. We set up four evaluations but have only tested them against synthetic data. Now we want to run our evaluations against real customer interactions; this is what we call Live Mode.
Let’s set the Helpfulness Assessment evaluation to live mode. Go to the evaluation’s detail page, click Settings in the top right corner, and at the bottom, under Advanced configuration, you’ll find the Evaluate live logs toggle.
We did the same for the Contains Ticket Number programmatic rule evaluation.
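For reference, the kind of check a rule like Contains Ticket Number performs is easy to express in code. The sketch below is purely illustrative; the “#1234” ticket format and the required closing statement are assumptions for this example, not values from the tutorial.

```python
# Illustrative sketch of the checks behind exact-match and regex programmatic rules.
# The ticket format ("#" followed by digits) and the required closing statement
# are assumptions for this example.
import re

REQUIRED_STATEMENT = "Is there anything else I can help you with?"  # hypothetical required phrase


def contains_ticket_number(response: str, ticket_number: str) -> bool:
    """Exact match: the literal ticket number must appear in the response."""
    return ticket_number in response


def mentions_any_ticket_number(response: str) -> bool:
    """Regular expression: any '#<digits>' style ticket reference counts."""
    return re.search(r"#\d+", response) is not None


def contains_required_statement(response: str) -> bool:
    """Exact match on a required closing statement."""
    return REQUIRED_STATEMENT in response
```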
Conclusion
By setting up a robust evaluation framework for customer support responses, we’ve seen how different types of automated and manual evaluations work together to ensure high-quality service. Automated LLM-based ratings let us assess response helpfulness at scale, while programmatic rules (like exact match and regular expressions) ensure that critical details such as ticket numbers and required statements are always included. Human-in-the-loop (manual) evaluations provide the nuanced judgment that only real people can offer, especially for customer satisfaction and tone. Testing the system with both synthetic and real data (live mode) gives us confidence that our evaluations are both reliable and effective. Ultimately, these evaluations help us catch issues early, improve our AI prompts, and consistently deliver accurate, customer-friendly support, leading to better customer satisfaction and operational excellence.
Resources
- LLM-as-Judge Evaluation — How to use LLMs to evaluate responses
- Programmatic Rule Evaluation — How to use programmatic rules to evaluate responses
- Human-in-the-Loop Evaluation — How to use human evaluators to evaluate responses
- Running Evaluations — How to run evaluations against synthetic and live data
- Datasets — How to create datasets for evaluations