Testing Across Audio Qualities - Coval Documentation

Use audio-quality testing when you want to know whether a voice agent still succeeds under real-world conditions. Run the same agent, test set, and metrics across multiple audio-quality scenarios, then compare the results in a multi-run report.

This workflow is for voice simulations. Chat simulations do not exercise speech recognition, text-to-speech, audio timing, or background-noise handling.

The goal is not to produce a leaderboard. The goal is to find which real-world audio conditions change outcomes, then decide whether the next fix belongs in your agent prompt, tool handling, speech recognition setup, generated voice setup, tracing, or audio-scenario coverage.

Use An AI Agent

If you use Coval Agent Skills, an AI agent can help with both the workflow and the follow-up analysis. Use the run-audio-quality-testing skill to create the audio-scenario runs and multi-run report. After the report exists, use the analyze-audio-quality-report skill to turn the report into recommended agent fixes. To have an AI agent run this workflow for you, paste this prompt into your coding agent or local LLM:

Use the Coval `run-audio-quality-testing` skill:
https://github.com/coval-ai/coval-external-skills/tree/main/skills/runs/run-audio-quality-testing

I want to test this voice agent across audio-quality scenarios:
<paste Coval agent URL or agent ID>

Use this test set:
<paste Coval test set URL or test set ID>

Run the same sampled cases, metrics, and agent configuration across the built-in audio-quality personas: Standard Customer, Impatient Customer, Confused Customer, Interruptive Speaker, Super Fast Speaker, High Background Noise Speaker, and Low Volume Speaker. Create the runs, wait for them to finish, then create a multi-run report grouped by Persona so I can compare each audio-quality scenario against Standard Customer.

After the report exists, summarize the largest regressions and include the report URL plus representative simulation links.

1. Choose A Voice Agent

Pick one voice agent to test. For the cleanest comparison, keep the agent configuration fixed across all runs. For agents that emit traces, include trace-based timing metrics such as time to first byte or provider latency. If your agent is not sending traces yet, set up OpenTelemetry traces so Coval can measure agent-side timing and tool behavior alongside the recording. You can also have your coding agent help instrument traces using the Coval tracing skills.

2. Choose The Audio-Quality Personas

Select Standard Customer plus the built-in audio-quality personas:

Coval Persona	What It Tests
Standard Customer	Baseline clean-call behavior
Impatient Customer	Short answers and lower patience
Confused Customer	Clarification handling
Interruptive Speaker	Overlap and interruption handling
Super Fast Speaker	Fast speech
High Background Noise Speaker	Background noise robustness
Low Volume Speaker	Quiet speaker audio

Use the same test set for every audio-quality persona. If you subsample a test set, keep the same sampled cases across personas so differences come from the audio condition, not case selection.
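
One way to keep the sampled cases fixed is to draw the sample once, with a seed, and reuse that list for every persona. The sketch below is illustrative only: the case IDs and the `sample_cases` helper are hypothetical and are not part of Coval's API.

```python
import random


def sample_cases(case_ids, k, seed=0):
    """Pick k case IDs deterministically so the same sample can be reused."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order of the input.
    return sorted(rng.sample(sorted(case_ids), k))


# Hypothetical case IDs; real IDs come from your own test set.
all_cases = [f"case-{i:03d}" for i in range(1, 51)]

# Draw the sample once, then reuse this exact list for every persona.
sampled = sample_cases(all_cases, k=10, seed=42)
```

Because the seed and the sorted input fully determine the draw, rerunning the helper later reproduces the same ten cases, which makes a follow-up run after a fix comparable to the original.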

3. Select Metrics

Use metrics that separate task success from audio-path behavior:

Goal	Useful Metrics
Task outcome	Composite evaluation, task-completion LLM judges, or scenario-specific pass/fail metrics
Responsiveness	Latency, time to first audio, trace TTFB, or provider response-time metrics
Speech recognition	STT Word Error Rate for traced agents, or Transcription Error, which reports WER from recorded conversation audio and does not require agent traces
Generated voice quality	Voice Quality and Speech Artifact Anomaly for broad generated-speech quality; Clipping Artifact Detection, Dropout Artifact Detection, Codec Artifact Detection, Loop Detection, Phoneme Stretch, Syllable Rate, Timbre Drift, or Pause Analysis for specific failure modes
Conversation flow	Turn count, audio duration, early termination, abnormally short or long calls, interruption rate, silence, and turn-level timing metrics

Do not use Percent Audio Above 300Hz as a perceived audio-quality score. It measures pitch distribution, not listener-rated quality.
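
For readers unfamiliar with the speech-recognition metrics in the table above, word error rate is conventionally the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below shows that textbook definition. It is an assumption that Coval's metrics follow it closely; text normalization (punctuation, casing, numerals) may differ in the product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word ("a") and one substitution ("two" -> "you"):
# 2 errors over 5 reference words = 0.4
wer = word_error_rate("book a table for two", "book table for you")
```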

4. Launch The Runs

Launch voice simulations with:

- one agent
- one test set
- the built-in audio-quality personas listed above
- the same metrics for every audio-quality scenario

Coval creates separate runs for each selected persona. In this workflow, each persona represents one audio-quality scenario. This keeps each scenario comparable while still letting you analyze the set together.
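
If you plan the launch in a script before creating runs, the constraint is simply one run per persona with everything else held constant. The sketch below models that as plain data. `RunSpec`, `build_run_plan`, the IDs, and the metric names are hypothetical placeholders, not Coval API objects.

```python
from dataclasses import dataclass

PERSONAS = [
    "Standard Customer",
    "Impatient Customer",
    "Confused Customer",
    "Interruptive Speaker",
    "Super Fast Speaker",
    "High Background Noise Speaker",
    "Low Volume Speaker",
]


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical description of one run; not a Coval API type."""
    agent_id: str
    test_set_id: str
    persona: str
    metrics: tuple


def build_run_plan(agent_id, test_set_id, personas, metrics):
    """Return one RunSpec per persona, varying only the persona."""
    metrics = tuple(metrics)
    return [RunSpec(agent_id, test_set_id, p, metrics) for p in personas]


# Placeholder IDs and metric names, for illustration only.
plan = build_run_plan(
    "agent-123", "testset-456", PERSONAS, ["task_success", "latency"]
)
```

Keeping the persona as the only varying field makes it easy to verify, before launching, that no run accidentally uses a different agent configuration or metric set.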

5. Compare Audio-Quality Scenarios

After the runs finish:

- Open the runs list.
- Select the completed runs from the audio-quality persona set.
- Create a multi-run report.
- Set Compare by to Persona.
- Use the grouped view to compare aggregate scores and latency across audio-quality scenarios.

Look for regressions that appear only under specific audio conditions. For example, a high task-success baseline with worse results for the High Background Noise Speaker suggests an audio-path robustness issue rather than a general agent-quality issue. Also scan for UNKNOWN, missing, or unscored metric results. Under heavy audio stress, a judge may be unable to evaluate the conversation because the call ended early, the transcript is too sparse, or the interaction became too anomalous. Treat that as a signal to inspect the recording, not just as missing data.
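
If you export the grouped scores, the baseline comparison described above can be sketched as a per-metric delta against Standard Customer, with UNKNOWN or unscored values collected separately instead of being silently dropped. The data shape, metric names, and numbers below are invented for illustration and do not reflect a Coval export format or real results.

```python
BASELINE = "Standard Customer"


def compare_to_baseline(scores, baseline=BASELINE):
    """Return (deltas, unscored).

    scores maps persona -> {metric: float or None}; None stands in for an
    UNKNOWN, missing, or unscored result.
    """
    base = scores[baseline]
    deltas, unscored = {}, []
    for persona, metrics in scores.items():
        if persona == baseline:
            continue
        deltas[persona] = {}
        for metric, value in metrics.items():
            base_value = base.get(metric)
            if value is None or base_value is None:
                # Keep these visible: they are a cue to inspect recordings.
                unscored.append((persona, metric))
                continue
            deltas[persona][metric] = round(value - base_value, 4)
    return deltas, unscored


# Made-up example values, not real results.
scores = {
    "Standard Customer": {"task_success": 0.92, "latency_s": 1.1},
    "High Background Noise Speaker": {"task_success": 0.71, "latency_s": 1.4},
    "Low Volume Speaker": {"task_success": None, "latency_s": 1.2},
}
deltas, unscored = compare_to_baseline(scores)
```

In this made-up example the noise persona shows a task-success delta of -0.21, and the Low Volume Speaker task-success result lands in `unscored`, so it would be queued for a listening pass.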

6. Spot-Check Simulations

Do not stop at pass/fail metric columns. A scenario can pass binary task metrics while the recording shows a broken or materially different experience. Treat very short calls, very long calls, latency spikes, and UNKNOWN or unscored metrics as spot-check triggers.
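
The spot-check triggers above can be turned into a simple filter over exported simulation rows. The sketch below is a rough heuristic under stated assumptions: the field names are hypothetical, and the thresholds (half or double the median duration, double the median latency) are arbitrary starting points rather than Coval recommendations.

```python
from statistics import median


def spot_check_reasons(sim, median_duration_s, median_latency_s,
                       short_ratio=0.5, long_ratio=2.0, latency_ratio=2.0):
    """Return the reasons a simulation deserves a manual listen, if any."""
    reasons = []
    if sim["duration_s"] < short_ratio * median_duration_s:
        reasons.append("very short call")
    if sim["duration_s"] > long_ratio * median_duration_s:
        reasons.append("very long call")
    if sim["latency_s"] > latency_ratio * median_latency_s:
        reasons.append("latency spike")
    # None stands in for an UNKNOWN or unscored metric result.
    if any(value is None for value in sim["metrics"].values()):
        reasons.append("UNKNOWN or unscored metric")
    return reasons


# Made-up rows for illustration only.
sims = [
    {"id": "sim-1", "duration_s": 120, "latency_s": 1.0, "metrics": {"task_success": 1.0}},
    {"id": "sim-2", "duration_s": 130, "latency_s": 1.2, "metrics": {"task_success": 1.0}},
    {"id": "sim-3", "duration_s": 20, "latency_s": 1.1, "metrics": {"task_success": None}},
    {"id": "sim-4", "duration_s": 125, "latency_s": 4.0, "metrics": {"task_success": 1.0}},
]
med_duration = median(s["duration_s"] for s in sims)
med_latency = median(s["latency_s"] for s in sims)

flagged = {}
for s in sims:
    reasons = spot_check_reasons(s, med_duration, med_latency)
    if reasons:
        flagged[s["id"]] = reasons
```

Note that `sim-4` passes its task metric yet is still flagged for latency, which is the case the paragraph above warns about.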

Open representative completed simulations from each audio-quality scenario, especially the lowest-scoring and most surprising rows from the grouped report. Listen to the recording and read the transcript to confirm how your agent handled the audio condition.

Audio Condition	What To Review
Background noise	Whether your agent confirms important fields, recovers from misheard details, and avoids repeatedly asking for information the speaker already provided.
Fast speech	Whether your agent keeps up with compressed turns, asks for clarification when needed, and still reaches the required outcome.
Low volume	Whether quiet speaker audio causes missed details, extra confirmation loops, or incorrect task outcomes.
Interruptions	Whether the interruption rate increases over longer calls, and whether your agent recovers after overlap instead of losing state or talking past the caller.

If the listening pass affects a release decision, send representative simulations to Human Review. Use a review project to collect ground-truth labels for questions such as whether your agent captured the required information, recovered after interruptions, handled transcript errors, and completed the task. Use Collaborative mode when you want one shared answer per simulation, or Individual mode when you want independent reviewer agreement.
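
When two reviewers label the same simulations independently, agreement can be summarized with raw percent agreement and with Cohen's kappa, which corrects for agreement expected by chance. This is a generic statistics sketch, not a description of how Coval reports reviewer agreement. The labels are invented.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two reviewers gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two reviewers labeling the same items."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(a) | set(b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # Both reviewers used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Invented labels for "did the agent complete the task?"
reviewer_1 = ["yes", "yes", "no", "yes", "no", "no"]
reviewer_2 = ["yes", "no", "no", "yes", "no", "yes"]

agreement = percent_agreement(reviewer_1, reviewer_2)
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on four of six items, but with balanced labels half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement.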

7. Understand The Results

Set Compare by to Persona and use the grouped view so each row represents one audio-quality persona. Compare every persona against Standard Customer, then inspect the scenarios whose task success, latency, speech recognition, generated voice quality, or call shape changed the most. In your analysis, lead with the conclusions that explain what changed:

- the largest audio-quality scenario regressions compared with Standard Customer
- the affected task-success, latency, speech-recognition, or audio metrics
- any UNKNOWN, missing, or unscored metric results that point to anomalous conversations
- representative simulation links for the most important regressions and one healthy baseline
- Human Review results or reviewer agreement, if you used manual labels
- the recommended next step from your report analysis, such as prompt changes, tool handling fixes, STT/TTS adjustments, trace setup, or expanded audio-scenario coverage

To have an AI agent produce this analysis from the report, use the analyze-audio-quality-report skill:

Use the Coval `analyze-audio-quality-report` skill.

I completed the Testing Across Audio Qualities workflow and created this multi-run report:
<paste Coval report URL, report export, or run IDs>

Analyze the report by audio-quality scenario against Standard Customer. If the report is grouped by Persona in Coval, interpret those personas as the audio-quality scenarios being tested. Use metric deltas, UNKNOWN or unscored metrics, call-shape changes such as turn count and audio duration, representative simulations, recordings, transcripts, traces if present, and Human Review labels if present.

Tell me:
- which audio-quality scenarios regressed most or became anomalous, and on which metrics
- what likely failed in my agent
- the recommended next fix, such as prompt changes, tool handling fixes, STT/TTS adjustments, trace setup, or expanded audio-scenario coverage
- which audio-quality scenarios, test cases, and metrics I should rerun after the fix to confirm improvement

Keep agent-side fixes separate from Coval metric or test setup changes. Do not guess from the aggregate table alone; inspect representative simulations when they are available.
