A Test Set is a structured collection of test cases designed to evaluate specific functionalities, workflows, or scenarios in your project. Each test set can contain multiple test cases, and simulations/evaluations will analyze the aggregate results of all test cases within the set.

How to Generate a Test Set

Quick Start

  1. Enter your test scenario in the input box.
  2. (Optional) Add extra context:
    • Attach files (such as text, JSON, or markdown)
    • Choose an agent to evaluate
    • Pick a relevant category from those suggested
  3. (Optional) Add metadata:
    • Define metadata fields to extract from each test case.
    • Example: key: “ticket_number”, description: “X-###” will generate entries like “X-001” per test case
    • Example: key: “destination”, description: “enter a possible airport code the user is flying to” will generate entries like “SFO”
  4. Submit using the arrow button or by pressing Enter.
  5. Review and modify your test set in the test set editor.

Alternative Options

  • Upload from file: Use “Upload from file” to import CSV/Excel test cases
  • Manual mode: Use “Use manual creation mode” to create a blank test set and add cases yourself
Tips for better test cases:
  • Be specific in your scenario description
  • Attaching agent prompts or documentation helps generate more relevant tests
  • You can edit, add, or remove test cases after generation

Uploading from CSV/Excel

Import test cases in bulk by uploading a properly formatted CSV or Excel file.

Column Structure

input (string, required)
The test case input or prompt. The column header is case-insensitive, and this column must be present in your file.

expected_behaviors (string | array)
Expected behaviors for the test case. Parsing rules (applied during test-set ingest/validation):
  • JSON array: ["behavior1", "behavior2"] is parsed as an array of behaviors
  • Comma-separated string: "behavior1,behavior2" is split on commas into multiple behaviors
  • Single string: "behavior1" is treated as a single behavior

type (enum)
Test case type. Case-insensitive. Accepts:
  • SCENARIO
  • TRANSCRIPT

metadata (json)
A JSON object containing test case metadata.

agent_ids (string | array)
Agent IDs to associate with the test set. This field is test-set level and applies to all test cases:
  • JSON array of strings: ["agent-id-1", "agent-id-2"] is parsed as an array of agent ID strings
  • Comma-separated string: "agent-id-1,agent-id-2"
  • Values are trimmed, and empty values are filtered out
  • The first non-empty value found in the file is used (since agent_ids applies to the whole test set)

knowledge_base_entries (string | array)
Knowledge base entries to attach to test cases:
  • JSON array of objects: [{"id": "entry-id-1", "type": "web_url"}, {"id": "entry-id-2"}]; each object has a required id and an optional type
  • JSON array of strings: ["entry-id-1", "entry-id-2"], treated as entry IDs with the default type
  • Comma-separated string: "entry-id-1:web_url,entry-id-2,entry-id-3:pdf"; each value can be formatted as id:type or just id (which uses the default type)
  • Single string: "entry-id-1", treated as an entry ID with the default type
  • Accepted types:
    • web_url (default)
    • plain_text
    • json
    • zendesk
    • shelf
    • file

custom columns (any)
Any additional column headers are automatically treated as metadata fields.
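For illustration, a small CSV exercising several of these columns and parsing rules might look like this (all values are placeholders):

input,expected_behaviors,type,knowledge_base_entries,ticket_number
"Call to get a refund","asks for the order number,confirms the refund policy",scenario,"entry-id-1:web_url,entry-id-2",X-001
"Book a flight to SFO","[""identifies the destination"", ""offers alternatives""]",SCENARIO,,X-002

The first row's expected_behaviors uses the comma-separated form and the second a JSON array; type is matched case-insensitively; and ticket_number is not a reserved header, so it becomes a metadata field on each test case.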

File Requirements

Your file must meet the following criteria:
  • Accepted formats: .csv or .xlsx
  • Maximum file size: 10MB
  • First row: Must contain column headers (case-insensitive)
  • Empty rows: Automatically skipped during import
  • Validation: Rows with empty input values are filtered out
Ensure your file doesn’t exceed 10MB and contains at least one row with a valid input value.
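To make these rules concrete, here is a rough sketch of the same validation logic in Python. It illustrates the documented behavior only and is not Coval's actual implementation.

import csv
import os

MAX_BYTES = 10 * 1024 * 1024  # maximum file size: 10MB

def load_test_cases(path):
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 10MB limit")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header_row = next(reader, None)
        if header_row is None:
            raise ValueError("file is empty")
        # Column headers are matched case-insensitively
        headers = [h.strip().lower() for h in header_row]
        if "input" not in headers:
            raise ValueError("missing required 'input' column")
        rows = []
        for raw in reader:
            if not any(cell.strip() for cell in raw):
                continue  # empty rows are skipped automatically
            row = dict(zip(headers, raw))
            if not row.get("input", "").strip():
                continue  # rows with an empty input value are filtered out
            rows.append(row)
        if not rows:
            raise ValueError("need at least one row with a valid input value")
        return rows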

Understanding Test Cases

Test Case Input

Each test case uses one of three input types that determine how the simulated user behaves during a run:
Type        What it is                 Simulated user behavior
Scenario    High-level intent          Improvises freely toward the goal
Transcript  A reference conversation   Adapts as needed to match the flow
Script      Exact turns                Follows them precisely, word for word

1. Scenarios

Define specific tasks or behaviors for your simulated user. Use quotation marks for exact phrases you want them to say. Examples:
  • Simple task: “Call to get a refund”
  • Complex scenario: “First, ask for PTO from the 21st to the 22nd of March. After receiving a confirmation, ask to change to the 20th to 22nd. During the verification, share your email address as ‘emily [at] gmail [dot] com’. Then, proceed to correct yourself with ‘oh no - it’s actually emily [dot] marc [at] gmail [dot] com’.”
The more detailed your scenario, the more precisely our simulated user will follow it.

2. Transcript

Recreate specific conversations using the OpenAI transcript format. The simulated user will follow the user's part of the transcript as closely as possible. Format example:
[
  {
    "role": "assistant",
    "content": "Welcome to X Restaurant. How may I assist you today?"
  },
  { "role": "user", "content": "I would like to order some pizza." }
]

3. Audio Upload

Upload a pre-recorded audio file containing the persona's side of the conversation (right channel) to use during a voice simulation. Instead of the persona generating responses with an LLM and TTS, the uploaded audio plays back exactly as recorded, making the test fully deterministic.
Supported formats: .wav, .mp3 (max 200 MB, duration 5 seconds – 1 hour).
How it works:
  1. In the test set editor, select Audio as the input type and upload your audio file containing the persona’s speech (right channel only)
  2. The file is played back during simulation in place of LLM-generated persona speech
  3. The uploaded audio is automatically transcribed so persona turns still appear in the transcript
  4. After the audio finishes playing, the simulation waits 30 seconds for the agent to finish responding, then ends the call
Audio upload test cases are ideal for regression testing — record a specific caller interaction once, then replay it across agent updates to detect regressions in handling.

Ground Truth Transcript

To measure your agent's STT accuracy against a known-correct transcript of the uploaded audio, you can provide a ground truth transcript in two ways:
  • Via the UI: when uploading an audio file, the modal includes a ground truth transcript field where you can paste the transcript as plain text or upload a .txt or .json file.
  • Via metadata: add a ground_truth_transcript key to the test case metadata directly.
Either method enables the STT Word Error Rate (Audio Upload) metric, which compares your agent's speech-to-text output against this reference text. The ground truth can be plain text, labeled text with timestamps and role labels, or a JSON object with a messages array.
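As a sketch, the JSON form might look like the following; the exact schema here is an assumption, modeled on the messages-array description and the transcript format shown earlier:

{
  "messages": [
    { "role": "user", "content": "Hi, I'd like to check my account balance." },
    { "role": "assistant", "content": "Sure, may I have your account number?" }
  ]
}

Word error rate is conventionally computed as (substitutions + deletions + insertions) / (number of words in the reference), so lower values mean the agent's speech-to-text output is closer to the ground truth.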

4. Script

Define an ordered list of exact lines for the persona to deliver, turn by turn. The persona follows the script exactly rather than generating responses with an LLM — while still using the configured persona voice and background sounds. Example script turns:
  1. “Hi, I’d like to check my account balance.”
  2. “Yes, my account number is 12345.”
  3. “Thank you, goodbye.”
How it works:
  1. In the test set editor, select Script as the input type
  2. Add ordered turn texts in the script editor (each turn is one persona utterance)
  3. During simulation, the persona delivers each line in order instead of generating LLM responses
  4. A divergence detector monitors agent responses — if the agent diverges significantly from the expected flow, the simulation can end early with a SCRIPT_DIVERGED reason
  5. After the last scripted turn is delivered, the agent gets one final response before the simulation ends with a SCRIPT_COMPLETED reason
Script test cases give you deterministic persona speech output while still exercising the full voice pipeline (TTS, turn-taking, background noise). Use them when you need control over exactly what the persona says but still want realistic audio delivery.

5. Image Attachment

Attach a single image to a test case so the persona can share it during a WebSocket voice simulation. This is useful for flows like sending a receipt, damage photo, insurance card, or product image after the agent asks for visual proof or context.
Supported formats: .png, .jpg, .jpeg (max 2 MB, one image per test case).
Before using image attachments, make sure the test set is attached to a WebSocket voice agent; the image will only be sent when the test set is used with that agent type.
How it works:
  1. In the test set editor, open a test case and click Add Media.
  2. Upload a PNG or JPEG image and give it a short Name such as receipt_photo or broken_screen.
  3. Optionally add a Description telling the persona when the image should be sent.
  4. Attach the test set to a WebSocket voice agent with a media send template configured.
  5. Launch the run using that attached WebSocket voice agent.
  6. During the conversation, Coval can send the image when the agent asks for relevant visual information.
Image attachments augment a normal test case input rather than replacing it. You still define the scenario, transcript, script, or audio flow as usual, and the image is available as an additional artifact the persona can send when needed.
Image attachments currently work only when the test set is attached to a WebSocket voice agent and used in a voice simulation. Other simulator types do not send attached images.
Best practices:
  • Use short, stable names like receipt_photo or drivers_license_front.
  • Use the description to explain when to send the image, not just what the file contains.
  • Keep the image tightly scoped to the task so the agent receives only the evidence it needs.

Test-Case Specific Evaluation

Expected Behavior and Metadata let you use test-case-specific data to evaluate how the agent responds to a specific test case.

Test Case Expected Behavior

The expected behavior describes how your agent should respond to the user's requests. Examples:
  • “the agent should ask the user for their phone number”
  • “the agent should repeat the phone number back to the user”
Use the Composite Evaluation metric to evaluate whether the agent followed the expected behaviors. Configure it with From Test Case as the criteria source to automatically pull behaviors from each test case. With Percentage of Criteria Met reporting, the example above would return 0.5 if the agent asks for the phone number but does not repeat it back.
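In general, Percentage of Criteria Met is computed as (criteria met) / (total criteria). In the example above, the agent satisfied 1 of 2 expected behaviors, so the metric returns 1/2 = 0.5.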

Test Case Metadata

These fields store specific metadata about a test case. This is helpful when you want a metric to reference a specific aspect of the test case. You can input metadata as key/value pairs or as JSON. Example: imagine an airline help desk where a test case contains this metadata:
{
  "source": "LAX",
  "destination": "SFO"
}
Then you can write, for example, a binary Destination Identification Metric with the question: “Did the agent correctly identify the destination as: {{test_case.destination}}?”
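With the metadata above, the question renders as: “Did the agent correctly identify the destination as: SFO?”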
For comprehensive testing, create multiple types of test sets:
  • Regression Set: Contains “happy path” scenarios representing typical successful interactions
  • Adversarial Set: Contains edge cases and scenarios designed to test your agent’s limits and handling of unusual requests

Utilizing Agent Attributes

In your agents, you can set specific attributes associated with that agent, then embed those attributes in your scenarios and expected behaviors with the format {{agent.attribute_name}}.
Example: imagine one agent has the attribute location with the value “San Francisco”, and another agent has the value “London”. Embed the attribute like this:
  • Scenario: You are a user calling for travel recommendations in {{agent.location}}
  • Expected Behavior: The agent should only give travel recommendations in {{agent.location}}
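At run time the template is filled in per agent: the first agent's scenario renders as “You are a user calling for travel recommendations in San Francisco” and the second agent's as “You are a user calling for travel recommendations in London”, so a single test case covers both agents.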

Test Cases vs. Personas

  • Persona: Defines how to behave
    • Characteristics (friendly, angry)
    • Voice configuration
    • Can be assigned multiple test cases
  • Test Case: Defines what to do
    • Specific tasks or scenarios
    • Can be assigned to any persona