Customer Support Quality Assurance
Implement a comprehensive QA system for customer support responses using Rating-based LLM evaluation, Exact Match rules, and Manual review
Live example
Try out this agent setup in the Latitude Playground.
Overview
This tutorial demonstrates how to build a quality assurance system for customer support responses using three specific Latitude evaluation types:
- LLM-as-Judge: Rating evaluation for helpfulness assessment
- Programmatic Rules with Exact Match for required information validation
- Human-in-the-Loop manual evaluation for customer satisfaction scoring
The Prompt
This is the prompt that will be used to generate customer support responses. It is a simple prompt that takes a customer query and generates a response. It doesn’t use a knowledge base or any additional information.
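For reference, a simple prompt along these lines might look roughly like the following in Latitude’s PromptL syntax. The model settings, the company name, and the parameter names (customer_name, ticket_number, customer_query) are illustrative assumptions rather than the exact prompt from the live example; the instructions about the ticket number and the closing sentence mirror what the evaluations later in this tutorial check for.

```
---
provider: OpenAI
model: gpt-4o-mini
temperature: 0.2
---

You are a friendly customer support agent for Acme Inc.
Answer the customer's question clearly and concisely.
Always include the support ticket number (format TCKT-1234) in your reply,
and always end with the sentence:
"Is there anything else I can help you with today?"

<user>
  Customer: {{ customer_name }}
  Ticket: {{ ticket_number }}
  Question: {{ customer_query }}
</user>
```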
In this example, the prompt is very simple, but you could also upload documents to OpenAI and use the file search tool in their Responses API. That gives you a knowledge base search over your documents, so your customer support responses can be grounded in actual documentation. However, this is out of the scope of this tutorial.
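If you do want to go that route, a minimal sketch with the OpenAI Responses API file search tool might look like the code below. The model name and vector store ID are placeholders, and nothing here is required for the rest of the tutorial.

```typescript
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Ask a question and let the model search a previously uploaded documentation store.
// 'vs_your_docs_store' is a placeholder for your own vector store ID.
const response = await openai.responses.create({
  model: 'gpt-4o-mini',
  input: 'How do I reset my password?',
  tools: [{ type: 'file_search', vector_store_ids: ['vs_your_docs_store'] }],
})

console.log(response.output_text)
```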
The Evaluations
To create new evaluations, go to the evaluations tab in the Latitude Playground and click on “Add Evaluation”.
Helpfulness Assessment (LLM-as-Judge)
This is how we configure an LLM-as-Judge evaluation to assess the helpfulness of customer support responses.
Configure the evaluation
This evaluation uses the AI-powered Rating metric to assess response quality, with a criterion such as “Assess how well the response follows the given instructions” and a 1-5 rating scale where 1 means “Not faithful, doesn’t follow the instructions” and 5 means “Very faithful, follows the instructions”.
Create an experiment from the evaluation
An Experiment is a way of running the prompt many times and checking, with this evaluation, whether each run passes the criteria. Before creating the experiment, we need to create a dataset, so click on “Generate dataset”.
Create the synthetic dataset
A synthetic dataset is generated by the system to test the evaluation, so we don’t need to collect real data first. It includes a column for each parameter in our prompt.
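The first few rows might look something like this. The column names assume hypothetical prompt parameters (customer_name, ticket_number, customer_query); your dataset will have one column per parameter your prompt actually declares.

```csv
customer_name,ticket_number,customer_query
Ada,TCKT-0042,"I was charged twice for my subscription this month."
Lin,TCKT-0137,"How do I reset my password?"
Sam,TCKT-0291,"My order arrived damaged. What are my options?"
```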
Run the experiment
Once we have the dataset, select it in the dataset selector and click “Run experiment”.
View experiment results
After running the experiment with 30 rows of the synthetic dataset you just created, you can see the results! The green counter shows the successful cases. Yellow represents results that failed the evaluation, and red means errors occurred during the experiment run.
Required Information Validation (Programmatic Rule - Exact Match)
The goal of this evaluation is to ensure every response contains mandatory elements like ticket numbers and proper closing statements. Let’s set it up.
Configure the evaluation
Create dataset with expected output
We need to create another dataset, but this time it must have an expected output column. You can reuse the same dataset and add a new column with the expected output. In this case, we want to ensure our prompt always responds with the sentence “Is there anything else I can help you with today?”
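Continuing the hypothetical dataset from before, the extra column could look like this (the column name here is just an example):

```csv
customer_name,ticket_number,customer_query,expected_output
Ada,TCKT-0042,"I was charged twice for my subscription this month.","Is there anything else I can help you with today?"
Lin,TCKT-0137,"How do I reset my password?","Is there anything else I can help you with today?"
```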
Contains Ticket Number (Programmatic Rule - Regular Expression)
Configure the evaluation
To configure this evaluation, we use a regular expression to ensure the customer support response contains a ticket number. In this case, we require that:
- The ticket number starts with TCKT-
- It is followed by 4 digits (\d{4})

The full pattern is therefore TCKT-\d{4}.
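As a quick sanity check, you can try the same pattern locally before saving the evaluation. This is just an illustration of how the expression behaves, not something Latitude requires.

```typescript
// The same pattern the evaluation uses: "TCKT-" followed by four digits.
const ticketPattern = /TCKT-\d{4}/

console.log(ticketPattern.test('Your ticket TCKT-0042 has been created.')) // true
console.log(ticketPattern.test('Your ticket #0042 has been created.'))     // false
```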
This is the shape of our ticket column in the dataset.
Now we’re ready to create this new evaluation.
Run the experiment
This step is the same as for the first evaluation: we create an experiment and review the results. In this case, we should see that the AI responded with the ticket number, because it’s part of our prompt. This is a basic check, but it ensures that future modifications to the prompt keep the ticket number.
Manual Evaluation (HITL - Human in the Loop)
Configure the evaluation
Customer satisfaction involves nuanced judgment about tone, cultural sensitivity, and domain-specific accuracy that automated systems might miss, making it perfect for human evaluation.
Annotate past conversations (logs)
The first way to enable human evaluators to review responses is to give them access to Latitude’s logs. Now that we’ve configured the HITL evaluation, when they click on a log in the right panel they can assign it a score from 1 to 5, as previously configured.
Annotate with the SDK
Another way to add manual evaluations is to use the Latitude SDK. You can see an example of how to do it here.
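As a rough sketch, annotating a log programmatically could look like the following. This assumes the TypeScript SDK and its evaluations.annotate method; the project ID, conversation UUID, and evaluation UUID are placeholders you would replace with your own values.

```typescript
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY!, { projectId: 123 })

// Attach a human score (1-5, as configured above) to an existing log.
// 'conversation-uuid' and 'evaluation-uuid' are placeholders.
await latitude.evaluations.annotate(
  'conversation-uuid',
  5,
  'evaluation-uuid',
  { reason: 'Clear, empathetic answer that fully resolved the issue.' },
)
```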
Minimum score
One thing we didn’t do when configuring the evaluation was set a minimum score required to pass. Let’s do it now: go to the manual evaluation’s detail view and click Settings in the top right of the screen.
Manual evaluation results
Now our human evaluator has scored the responses and we can see the results in the experiment.
In the image, we see an evaluation with a score of 1 shown in green; this was before we set the minimum score to 3. The next one didn’t pass and is shown in red.
Live Mode
We’ve done a lot of work so far. We set up four evaluations but only tested them against synthetic data. Now we want to test our evaluations against real customer interactions: this is what we call Live Mode.
Let’s set the Helpfulness Assessment evaluation to live mode. Go to the evaluation’s detail view, click Settings in the top right corner, and at the bottom, under Advanced configuration, you will find the Evaluate live logs toggle.
We did the same for the Contains Ticket Number programmatic rule evaluation.
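With live evaluation enabled, every log produced by real traffic is scored automatically. For example, a production call through the SDK, sketched below with a hypothetical prompt path and parameter names, creates a log that the live evaluations will pick up.

```typescript
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY!, { projectId: 123 })

// Running the prompt in production creates a log in Latitude.
// Evaluations with the "Evaluate live logs" toggle enabled will score it automatically.
const result = await latitude.prompts.run('customer-support/response', {
  parameters: {
    customer_name: 'Ada',
    customer_query: 'I was charged twice for my subscription this month.',
  },
})

console.log(result)
```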
Conclusion
By setting up a robust evaluation framework for customer support responses, we’ve learned how different types of automated and manual evaluations work together to ensure high-quality service. Automated LLM-based ratings help us assess response helpfulness at scale, while programmatic rules (like exact match and regular expressions) ensure that critical information, such as ticket numbers and required statements, is always included. Human-in-the-loop (manual) evaluations provide the nuanced judgment that only real people can offer, especially for customer satisfaction and tone. Testing our system with both synthetic and real data (live mode) gives us confidence that our evaluations are both reliable and effective. Ultimately, these evaluations help us catch issues early, improve our prompts, and consistently deliver accurate, customer-friendly support, leading to better customer satisfaction and operational excellence.
Resources
- LLM-as-Judge Evaluation — How to use LLMs to evaluate responses
- Programmatic Rule Evaluation — How to use programmatic rules to evaluate responses
- Human-in-the-Loop Evaluation — How to use human evaluators to evaluate responses
- Running Evaluations — How to run evaluations against synthetic and live data
- Datasets — How to create datasets for evaluations