- Text-based: For chat agents using text inputs and outputs
- Voice-based: For voice agents with audio inputs and outputs
Setting Up an Evaluation
- Click “Launch Evaluation”
- Select a template or configure manually:
- Choose a test set
- Select an agent to test
- Select a persona
- Choose metrics to track
- Set simulation parameters
- (Optional) Add tags to label this run
Tagging Runs
You can add up to 20 tags to a run at launch time. Tags are useful for organizing and filtering runs, for example by environment, release version, or test type.

From the UI: A “Tags” card appears in the launch panel. Type a tag name and click + (or press Enter) to add it. Click the × on any tag chip to remove it.

Via the API: Pass tags in the metadata.tags field of the launch request. You can later filter runs with a tag= filter expression (e.g., tag="regression").
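As a sketch, a helper can build the launch-request body with tags under metadata.tags and enforce the 20-tag limit. The endpoint and surrounding request shape are not shown here; only the metadata.tags field, the 20-tag limit, and the tag= filter expression come from the docs above.

```python
MAX_TAGS = 20  # documented per-run tag limit


def build_launch_payload(tags):
    """Return a launch-request body carrying tags in metadata.tags."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"A run accepts at most {MAX_TAGS} tags")
    return {"metadata": {"tags": list(tags)}}


payload = build_launch_payload(["regression", "v2.3"])
# Later, filter runs with a tag= expression, e.g.:
query = 'tag="regression"'
```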
Scheduling Recurring Evaluations
- Enable the “Schedule Recurring” option
- Set frequency (hourly, daily, weekly)
- Configure start and end dates if applicable
- Set alert thresholds for specific metrics (in “Alerts”)
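The scheduling options above can be sketched as a configuration object. The field names here are illustrative assumptions, not Coval's actual schema; only the frequency choices (hourly, daily, weekly), optional start/end dates, and per-metric alert thresholds come from the steps above.

```python
ALLOWED_FREQUENCIES = {"hourly", "daily", "weekly"}

# Hypothetical recurring-evaluation config mirroring the UI options.
schedule = {
    "frequency": "daily",            # one of: hourly, daily, weekly
    "start_date": "2025-01-01",      # optional
    "end_date": "2025-03-31",        # optional
    "alerts": [
        # Alert when a specific metric underperforms.
        {"metric": "task_completion", "threshold": 0.9, "direction": "below"},
    ],
}

assert schedule["frequency"] in ALLOWED_FREQUENCIES
```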
Benefits of Recurring Evaluations:
- Continuous monitoring of your agent’s performance
- Early detection of regressions or issues
- Ability to set alerts when specific metrics underperform
- Historical performance tracking for trend analysis
Analyzing Evaluation Results
A simulation is a simulated conversation between our agent and your voice or chat agent. You define how to test your agent within test sets and Templates. Metrics define the success or failure criteria for your tests.

Runs
A Run is a single evaluation. A Run can consist of multiple conversations (e.g., if the test set contains multiple scenarios or transcripts). On each run, you will see the following set of actions:
- Resimulate: Re-run if something looks off, or to confirm the performance of a specific metric
- Rerun metrics: An LLM Judge metric doesn’t perform as expected and you need to adjust it? Go back to the run and rerun that specific metric
- Compare: Compare a run with any other run that was performed on the same test set
- Human Review: Provide feedback on the run results and send it to the “Manual Review” for team members to collaborate on iterations
- Share: Share an internal or public link to your run results - a great way to use simulations as part of your sales process!


Overview
The Overview tab lists all individual conversations. It helps you assess your agent’s performance by creating your own summary graphs and viewing aggregated performance over time.

Review
Use Coval’s Human-in-the-loop review capabilities to label runs for review.

Deterministic Simulation Modes
By default, the persona generates responses dynamically using an LLM. For cases where you need repeatable, deterministic persona behavior, Coval offers two additional test case input types:
- Audio Upload: Upload a pre-recorded audio file (persona’s side of the conversation) that plays back exactly as recorded instead of generating persona speech. The audio is automatically transcribed so persona turns still appear in the transcript. After playback completes, the simulation waits a 30-second grace period for the agent to finish responding, then ends the call. You can optionally attach a ground truth transcript to each test case to enable the STT Word Error Rate (Audio Upload) metric, which measures your agent’s speech recognition accuracy against the known-correct transcript. See Test Sets — Audio Upload for setup details.
- Scripted Turns: Define an ordered list of exact lines for the persona to deliver turn by turn. The persona still uses the configured voice and background sounds, but speaks the scripted text instead of LLM-generated responses. A built-in divergence detector monitors agent responses and can end the simulation early if the agent goes off-track. See Test Sets — Script for setup details.
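The two deterministic input types can be illustrated as test-case definitions. The keys below are assumptions for the sketch, not Coval's actual schema; the concepts (a persona-side audio file with an optional ground truth transcript, and an ordered list of scripted persona lines) come from the descriptions above.

```python
# Hypothetical Audio Upload test case: the persona's side of the call
# plays back exactly as recorded; the optional ground truth transcript
# enables the STT Word Error Rate (Audio Upload) metric.
audio_case = {
    "input_type": "audio_upload",
    "audio_file": "persona_side.wav",
    "ground_truth_transcript": "Hi, I'd like to check my order status.",
}

# Hypothetical Scripted Turns test case: the persona speaks these exact
# lines in order, using the configured voice and background sounds.
scripted_case = {
    "input_type": "scripted_turns",
    "turns": [
        "Hi, I need to reset my password.",
        "My email is already on file.",
        "Yes, that works. Thanks!",
    ],
}
```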
Simulation Time Limits
Each simulated conversation has a maximum duration:

| Limit | Duration |
|---|---|
| Default timeout | 10 minutes |
| Maximum timeout | 15 minutes |
If your agent requires longer conversations, contact support@coval.dev to discuss your use case. The hard maximum per simulation is 15 minutes.
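The limits in the table above can be expressed as a simple clamp. The function name is illustrative; the 10-minute default and 15-minute hard maximum come from the table.

```python
DEFAULT_TIMEOUT_MIN = 10  # default per-simulation timeout (minutes)
MAX_TIMEOUT_MIN = 15      # hard maximum per simulation (minutes)


def resolve_timeout(requested_min=None):
    """Return the effective timeout: the default when unset,
    otherwise the requested value clamped to the hard maximum."""
    if requested_min is None:
        return DEFAULT_TIMEOUT_MIN
    return min(requested_min, MAX_TIMEOUT_MIN)
```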

