When you launch a run, you trigger a simulation and subsequent evaluation of that simulation. Coval supports different simulation approaches:
  • Text-based: For chat agents using text inputs and outputs
  • Voice-based: For voice agents with audio inputs and outputs

Setting Up an Evaluation

  1. Click “Launch Evaluation”
  2. Select a template or configure manually:
    • Choose a test set
    • Select an agent to test
    • Select a persona
    • Choose metrics to track
    • Set simulation parameters
    • (Optional) Add tags to label this run
(See “Templates” for more information)

Tagging Runs

You can add up to 20 tags to a run at launch time. Tags are useful for organizing and filtering runs, for example by environment, release version, or test type.
From the UI: A “Tags” card appears in the launch panel. Type a tag name and click + (or press Enter) to add it. Click the × on any tag chip to remove it.
Via the API: Pass tags in the metadata.tags field of the launch request:
{
  "agent_id": "...",
  "persona_id": "...",
  "test_set_id": "...",
  "metadata": {
    "tags": ["regression", "v2.1", "nightly"]
  }
}
Constraints: max 20 tags per run, each tag max 200 characters. After launch, you can filter runs by tag using the tag= filter expression (e.g., tag="regression").
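The constraints above can be enforced client-side before launching. Below is a minimal Python sketch that builds a request body matching the JSON shape shown above; the `validate_tags` and `build_launch_body` helpers are hypothetical illustrations, not part of a Coval SDK:

```python
def validate_tags(tags):
    """Check tags against the documented limits:
    at most 20 tags per run, each at most 200 characters."""
    if len(tags) > 20:
        raise ValueError(f"too many tags: {len(tags)} (max 20)")
    for tag in tags:
        if len(tag) > 200:
            raise ValueError(f"tag too long: {tag[:20]!r}... (max 200 chars)")
    return tags

def build_launch_body(agent_id, persona_id, test_set_id, tags):
    """Assemble a launch request body with tags under metadata.tags,
    mirroring the JSON example above."""
    return {
        "agent_id": agent_id,
        "persona_id": persona_id,
        "test_set_id": test_set_id,
        "metadata": {"tags": validate_tags(tags)},
    }
```

Validating locally surfaces limit violations before the request is sent, which keeps launch failures out of your CI logs.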

Scheduling Recurring Evaluations

  1. Enable the “Schedule Recurring” option
  2. Set frequency (hourly, daily, weekly)
  3. Configure start and end dates if applicable
  4. Set alert thresholds for specific metrics (in “Alerts”)
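To make the frequency and end-date options above concrete, here is a purely illustrative Python sketch of how upcoming run times for a recurring schedule could be computed (this is not Coval's scheduler; `next_runs` is a hypothetical helper):

```python
from datetime import datetime, timedelta

# Map the documented frequencies to intervals.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_runs(start, frequency, count=3, end=None):
    """Return up to `count` run times from `start`, stepping by the
    chosen frequency and stopping at the optional end date."""
    step = INTERVALS[frequency]
    runs, t = [], start
    while len(runs) < count and (end is None or t <= end):
        runs.append(t)
        t += step
    return runs
```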
Benefits of Recurring Evaluations:
  • Continuous monitoring of your agent’s performance
  • Early detection of regressions or issues
  • Ability to set alerts when specific metrics underperform
  • Historical performance tracking for trend analysis

Analyzing Evaluation Results

A simulation is a conversation between our agent and your voice or chat agent. You define the testing environment in test sets and Templates; metrics define the success or failure criteria for your tests.

Runs

A Run is an evaluation. A Run can consist of multiple conversations (e.g., if the test set consists of multiple scenarios/transcripts). On each run, you will see the following set of actions:
  • Resimulate: Re-run if something looks off, or to confirm the performance of a specific metric
  • Rerun metrics: If an LLM Judge metric doesn’t perform as expected, adjust it, then go back to the run and rerun that specific metric
  • Compare: Compare a run with any other run that was performed on the same test set
  • Human Review: Provide feedback on the run results and send it to the “Manual Review” for team members to collaborate on iterations
  • Share: Share an internal or public link to your run results, a great way to use simulations as part of your sales process!
Clicking on one call of this run will open your metric results in detail, allowing you to check your results in depth, detect where in the transcript your issues arise, and see detailed explanations for LLM Judge metrics. If OpenTelemetry traces are available for the simulation, an OTel Traces card appears in the metric grid showing span count and linking to the trace viewer.

Overview

The Overview tab lists all individual conversations. It helps you gauge your agent’s performance: create your own summary graphs and see aggregated performance over time.

Review

Use Coval’s Human-in-the-loop review capabilities to label runs for review.

Deterministic Simulation Modes

By default, the persona generates responses dynamically using an LLM. For cases where you need repeatable, deterministic persona behavior, Coval offers two additional test case input types:
  • Audio Upload: Upload a pre-recorded audio file (persona’s side of the conversation) that plays back exactly as recorded instead of generating persona speech. The audio is automatically transcribed so persona turns still appear in the transcript. After playback completes, the simulation waits a 30-second grace period for the agent to finish responding, then ends the call. You can optionally attach a ground truth transcript to each test case to enable the STT Word Error Rate (Audio Upload) metric, which measures your agent’s speech recognition accuracy against the known-correct transcript. See Test Sets — Audio Upload for setup details.
  • Scripted Turns: Define an ordered list of exact lines for the persona to deliver turn by turn. The persona still uses the configured voice and background sounds, but speaks the scripted text instead of LLM-generated responses. A built-in divergence detector monitors agent responses and can end the simulation early if the agent goes off-track. See Test Sets — Script for setup details.
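The STT Word Error Rate metric mentioned above is the standard WER calculation: word-level edit distance between the ground truth transcript and the agent's transcription, divided by the number of reference words. A self-contained Python sketch of that calculation (illustrative only; Coval's implementation may normalize punctuation or casing differently):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A lower score is better: 0.0 means a perfect transcription, and one substituted word in a three-word reference yields roughly 0.33.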

Simulation Time Limits

Each simulated conversation has a maximum duration:
  • Default timeout: 10 minutes
  • Maximum timeout: 15 minutes
A simulation ends when the conversation reaches a natural conclusion, the test objective is met, or the timeout is reached — whichever comes first.
If your agent requires longer conversations, contact support@coval.dev to discuss your use case. The hard maximum per simulation is 15 minutes.
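The timeout rules above reduce to a few lines of Python; this is an illustrative sketch, with the hypothetical `effective_timeout` helper standing in for the platform's internal logic:

```python
DEFAULT_TIMEOUT_MIN = 10  # applied when no timeout is requested
MAX_TIMEOUT_MIN = 15      # hard per-simulation maximum

def effective_timeout(requested_minutes=None):
    """Return the timeout a simulation would actually run with:
    the default when unset, otherwise clamped to the hard maximum."""
    if requested_minutes is None:
        return DEFAULT_TIMEOUT_MIN
    return min(requested_minutes, MAX_TIMEOUT_MIN)
```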

Best Practices for Your Evaluations

Testing Strategy:
  • Start with core functionality test cases
  • Expand to edge cases and failure scenarios
  • Include regression tests for fixed issues
  • Test across different user personas and scenarios
Continuous Improvement:
  • Regularly update test sets based on production data
  • Refine metrics as your understanding of agent performance evolves