Live example

Try out this prompt and evaluation setup in the Latitude Playground.

Overview

This tutorial demonstrates how to build a quality assurance system for customer support responses using three specific Latitude evaluation types:

  • LLM-as-Judge: Rating evaluation for helpfulness assessment
  • Programmatic Rules: Exact Match evaluation for required information validation
  • Human-in-the-Loop: Manual evaluation for customer satisfaction scoring

The Prompt

This is the prompt used to generate customer support responses. It is a simple prompt that takes a customer query and generates a reply; it doesn't use a knowledge base or any additional context.

---
provider: OpenAI
model: gpt-4.1
---

You are a helpful customer support agent. Respond to the customer inquiry below with empathy and provide a clear solution.

Customer inquiry: {{customer_message}}
Customer tier: {{tier}}
Product: {{product_name}}

Requirements:
- Always include the ticket number: {{ticket_number}}
- Address the customer by name if provided
- Provide specific next steps
- End with "Is there anything else I can help you with today?"
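For reference, here are example values for the four parameters the prompt expects. The customer details, product, and ticket number below are made up purely for illustration.

# Hypothetical example values for the prompt parameters above.
parameters = {
    "customer_message": "Hi, I'm Dana. My SmartHome Hub won't connect to Wi-Fi after the latest update.",
    "tier": "premium",
    "product_name": "SmartHome Hub",
    "ticket_number": "TCK-48213",
}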

In this example the prompt is deliberately simple, but you could also upload documents to OpenAI and use the file search tool in their Responses API. That gives you a knowledge base search over your documents, so responses to customer support queries can be grounded in actual documentation. Setting this up is beyond the scope of this tutorial.
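If you do go that route, a minimal sketch of the call might look like the following. It assumes you have already uploaded your support documentation to a vector store; the vector store ID below is a placeholder, not a real resource.

from openai import OpenAI

client = OpenAI()

# Assumes documentation has already been uploaded to a vector store;
# "vs_support_docs" is a placeholder ID.
response = client.responses.create(
    model="gpt-4.1",
    input="How do I reset the SmartHome Hub after a failed firmware update?",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_support_docs"]}],
)

print(response.output_text)

The model can then base its answer on the retrieved passages instead of relying only on its general knowledge.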

The Evaluations

To create new evaluations, go to the evaluations tab in the Latitude Playground and click on “Add Evaluation”.
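As a mental model for what an LLM-as-Judge rating evaluation such as the Helpfulness Assessment does, here is a rough standalone sketch: a second model is asked to score each response on a 1 to 5 scale. This is only an illustration of the idea, not Latitude's implementation, and the scoring criteria in the prompt are assumptions.

import re
from openai import OpenAI

client = OpenAI()

def rate_helpfulness(customer_message: str, support_response: str) -> int:
    """Ask a judge model to rate a support response from 1 (unhelpful) to 5 (very helpful)."""
    judge_prompt = (
        "Rate the following customer support response for helpfulness on a scale of 1 to 5.\n"
        "Consider empathy, clarity of the solution, and concrete next steps.\n"
        f"Customer inquiry: {customer_message}\n"
        f"Support response: {support_response}\n"
        "Reply with only the number."
    )
    completion = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": judge_prompt}],
    )
    match = re.search(r"[1-5]", completion.choices[0].message.content)
    return int(match.group()) if match else 0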

Live Mode

We’ve done a lot of work so far: we set up four evaluations but have only tested them against synthetic data. Now we want to run them against real customer interactions, which is what we call Live Mode.

Let’s set the Helpfulness Assessment evaluation to live mode. Go to the evaluation’s detail page, click Settings in the top right corner, and at the bottom, under Advanced configuration, enable the Evaluate live logs toggle.

Do the same for the Contains Ticket Number programmatic rule evaluation.

Manual evaluations can’t be set to live mode because a human reviewer scores each response after the AI has already replied to the customer. The Required Information Validation evaluation isn’t suitable either, because it needs an expected output to compare against the AI response, and live customer traffic doesn’t come with one.
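To make that distinction concrete, here is a rough sketch of the two kinds of programmatic checks involved. The function names are made up, but they mirror what a regex-style rule versus an exact-match rule needs as input.

import re

def contains_ticket_number(response: str, ticket_number: str) -> bool:
    # Only needs the generated response and the ticket number parameter,
    # both of which are available in live production logs.
    return re.search(re.escape(ticket_number), response) is not None

def required_information_matches(response: str, expected_output: str) -> bool:
    # Needs a predefined expected output to compare against, which live
    # customer traffic does not have, so this check stays dataset-only.
    return response.strip() == expected_output.strip()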

Conclusion

By setting up a robust evaluation framework for customer support responses, we’ve seen how automated and manual evaluations work together to ensure high-quality service. Automated LLM-as-Judge ratings assess response helpfulness at scale, while programmatic rules such as exact match and regular expressions ensure that critical information like ticket numbers and required statements is always included. Human-in-the-loop (manual) evaluations provide the nuanced judgment that only real people can offer, especially for customer satisfaction and tone. Testing the system with both synthetic data and real traffic (live mode) gives us confidence that the evaluations are reliable and effective. Together, they help us catch issues early, improve our prompts, and consistently deliver accurate, customer-friendly support, which ultimately means better customer satisfaction and smoother operations.

Resources