Built-in Metrics
Coval provides a comprehensive suite of built-in metrics to evaluate your AI agent’s performance across multiple dimensions. These metrics are ready to use out of the box and cover audio quality, conversation flow, response timing, and more.

Audio Quality
Audio metrics evaluate the quality and characteristics of speech output from your agents. These are essential for voice-based applications and provide comprehensive analysis of audio fidelity, conversation flow, and speech characteristics.

All metrics in this section require audio input to function properly. They will not work with text-only transcripts.
Background Noise
Purpose: Measurement of audio clarity and background noise. What it measures: The signal-to-noise ratio (SNR), i.e. the ratio between speech signal strength and background noise. When to use: Audio quality assessment, identifying poor recording conditions. How it works: Compares the strength of speech signals against background noise by analyzing speech and silent (room tone) segments separately. The metric calculates the ratio between signal and noise power levels, providing both overall and segment-by-segment quality assessments. How to interpret:
- SNR above 20 dB indicates excellent audio clarity.
- SNR between 10-20 dB is acceptable for most applications.
- SNR below 10 dB may significantly impact speech recognition and comprehension.
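The power-ratio computation described above can be sketched in a few lines. This is a minimal illustration, not Coval's implementation; it assumes the speech/room-tone segmentation has already been done and is supplied as a boolean mask:

```python
import numpy as np

def snr_db(samples: np.ndarray, speech_mask: np.ndarray) -> float:
    """Estimate SNR in dB from a mono signal and a per-sample boolean
    mask (True = speech, False = room tone)."""
    speech = samples[speech_mask]
    noise = samples[~speech_mask]
    if len(speech) == 0 or len(noise) == 0:
        raise ValueError("need both speech and silence segments")
    signal_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    return 10 * np.log10(signal_power / noise_power)

# Synthetic check: a clear tone over quiet noise lands well above 20 dB.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
noise = 0.005 * rng.standard_normal(2 * sr)
samples = noise.copy()
samples[:sr] += 0.5 * np.sin(2 * np.pi * 220 * t)  # first second carries speech
mask = np.zeros(len(samples), dtype=bool)
mask[:sr] = True
snr = snr_db(samples, mask)
```

In practice the mask would come from a voice activity detector rather than being known in advance.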
Natural Non-robotic Tone Detection
Purpose: Analysis of audio frequency characteristics, highlighting the overtones and harmonics that give human voices their natural timbre. What it measures: Frequency distribution, dominant frequencies, and natural voice characteristics. When to use: This metric helps detect synthetic or robotic-sounding speech that could reduce the naturalness and effectiveness of agent interactions:
- When evaluating speech synthesis quality
- When testing new voice models or providers
- When optimizing voice parameters for naturalness
- When troubleshooting user complaints about robotic voices
- During audio analysis, speaker identification, and quality assessment
How to interpret:
- > 40%: Very natural / expressive. Found in natural human speech; rich in harmonics and overtones.
- 30-40%: Acceptably natural. May be synthetic but sounds close to human-like.
- 20-30%: Slightly robotic. Often found in lower-quality TTS systems.
- < 20%: Robotic / flat. Indicates an overly monotone or artificial tone.
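One illustrative way to quantify harmonic richness is the fraction of spectral energy sitting at integer multiples of the dominant frequency. The sketch below is an assumption about how such a ratio could be computed, not Coval's exact algorithm:

```python
import numpy as np

def harmonic_energy_ratio(frame: np.ndarray, sr: int, n_harmonics: int = 10) -> float:
    """Fraction of spectral energy concentrated at integer multiples of the
    dominant frequency. Harmonically rich signals (natural voices) score
    high; flat or noisy spectra score low."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    f0 = freqs[np.argmax(spectrum[1:]) + 1]  # dominant frequency (skip DC)
    bin_width = freqs[1] - freqs[0]
    harmonic_power = 0.0
    for k in range(1, n_harmonics + 1):
        idx = int(round(k * f0 / bin_width))
        if idx >= len(spectrum):
            break
        # Sum a small neighborhood around each expected harmonic bin.
        harmonic_power += spectrum[max(idx - 1, 0):idx + 2].sum()
    return harmonic_power / spectrum.sum()

sr = 16000
t = np.arange(sr) / sr
# A 200 Hz tone with decaying harmonics vs. white noise.
voiced = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))
noisy = np.random.default_rng(1).standard_normal(sr)
```

A voiced frame scores far higher than noise under this measure; production systems typically refine the fundamental-frequency estimate with a dedicated pitch tracker.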
Music Detection
Purpose: Detects music segments in audio recordings. What it measures: Count of music segments detected in the conversation, with timeline data showing when each music segment occurs. When to use:
- Detecting hold music or queue music during voice calls
- Identifying unwanted background music in recordings
- Measuring music duration and frequency in customer service scenarios
- Debugging audio quality issues related to music interference
How to interpret:
- The value equals the count of distinct music segments detected
- Higher counts indicate more frequent or longer music interruptions
- Timeline subvalues show exact timestamps for each music segment
- Useful for identifying when customers are placed on hold with music
Conversation Length
These metrics analyze the content and flow of conversations to ensure effective communication.

Audio Duration
Purpose: Measures the duration of the audio file. What it measures: Duration of the full conversation in seconds.

Turn Count
Purpose: Counts how many turns were taken in a conversation. What it measures: The number of turns, where each turn is a change between speakers.

Words Per Message
Purpose: The average number of words per agent message. What it measures: Average number of words per agent message in a conversation.

Instruction Following
These metrics measure how well the agent follows predefined behaviors.

Workflow Verification
Verifies whether conversations follow expected workflow patterns and business logic.

Resolution
Evaluates the end of the conversation.

End Reason
Purpose: The reason that the conversation ended. When to use: Helps identify patterns in call completion.
This currently only works for simulations run on Coval. Support for live calls is a work in progress.
| Value | Description |
|---|---|
| COMPLETED | The conversation reached a natural conclusion with a successful resolution |
| MAX_TURNS | The conversation reached the maximum allowed number of turns |
| MAX_DURATION | The conversation exceeded the maximum allowed duration |
| USER_HANGUP | The user ended the conversation (voice calls only) |
| AGENT_HANGUP | The agent ended the conversation (voice calls only) |
| IDLE_TIMEOUT | The conversation timed out due to inactivity (chat/SMS simulations) |
| ERROR | An error occurred during the simulation |
| UNKNOWN | The end reason could not be determined |
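For pattern analysis across a batch of simulations, end reasons can be tallied with a counter. The result records below are hypothetical; only the end-reason values come from the table above:

```python
from collections import Counter

# Hypothetical per-simulation results; the shape of real Coval result
# objects may differ.
results = [
    {"id": 1, "end_reason": "COMPLETED"},
    {"id": 2, "end_reason": "USER_HANGUP"},
    {"id": 3, "end_reason": "COMPLETED"},
    {"id": 4, "end_reason": "MAX_TURNS"},
]
reason_counts = Counter(r["end_reason"] for r in results)
completion_rate = reason_counts["COMPLETED"] / len(results)
```

A rising share of MAX_TURNS or IDLE_TIMEOUT outcomes over time is usually the first sign of a regressed conversation flow.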
Responsiveness
Critical metrics to identify whether the agent is responding correctly.

Agent Fails To Respond
Purpose: Evaluate continuity and identify moments when the agent ignores or misses a user query. What it measures: Long silence from the agent between two consecutive user turns, and whether and when the agent eventually responds after the second user turn. When to use: Identifying moments when the agent ignored or missed a user query. How it works: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks whether the agent resumes speaking afterwards.

Agent Needs Reprompting
Purpose: Identifies when agents become unresponsive but will respond after user repetition.
This metric helps identify edge cases where the agent’s response mechanism may be failing intermittently.
How to interpret:
- Each silence gap and eventual response are collectively considered one event.
- More events = worse responsiveness.
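The gap detection behind these two metrics can be sketched from speaker-labeled turn timestamps. This is a simplified illustration (it flags the gaps but omits the eventual-response tracking); the turn-tuple format is an assumption, not Coval's internal representation:

```python
def missed_response_events(turns, min_gap=3.0):
    """turns: list of (speaker, start_s, end_s) ordered by start time.
    Flags silence gaps of at least `min_gap` seconds between two
    consecutive user turns with no agent speech in between."""
    events = []
    for a, b in zip(turns, turns[1:]):
        if a[0] == "user" and b[0] == "user":
            gap = b[1] - a[2]
            if gap >= min_gap:
                events.append({"after_turn_end": a[2], "gap_s": gap})
    return events

turns = [
    ("user", 0.0, 2.0),
    ("user", 6.5, 8.0),   # 4.5 s of silence before the user repeats themselves
    ("agent", 8.5, 10.0),
]
events = missed_response_events(turns)
```

Each flagged gap counts as one event, so a longer event list means worse responsiveness.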
Agent Repeats Itself
Purpose: Identifies instances where the agent says the same sentence or asks the same question multiple times. When to use: Evaluating naturalness and word choice, identifying diverse language.

Timing & Latency
Ensure timely agent interactions.

Interruption Rate
Purpose: The rate (interruptions per minute) at which the user is interrupted by the assistant. What it measures: An interruption is defined as any time the user is speaking and the assistant starts speaking before the user has finished. This does not include times when the user interrupts the assistant. When to use: Conversation flow analysis, identifying communication issues, training data for interruption handling. How to interpret:
- High interruption frequency may indicate communication issues.
- Interruption patterns can help identify conversation flow problems.
- Useful for training agents to handle interruptions gracefully.
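The definition above (assistant starts while the user is still speaking) can be sketched directly from timed turn spans; a minimal illustration with assumed speaker labels:

```python
def interruption_rate(turns, duration_minutes):
    """turns: list of (speaker, start_s, end_s). Counts assistant turns
    that begin while a user turn is still in progress, per minute."""
    user_spans = [(s, e) for spk, s, e in turns if spk == "user"]
    interruptions = 0
    for spk, start, _ in turns:
        if spk == "assistant" and any(s < start < e for s, e in user_spans):
            interruptions += 1
    return interruptions / duration_minutes

turns = [
    ("user", 0.0, 5.0),
    ("assistant", 4.0, 7.0),   # starts before the user finishes: interruption
    ("user", 8.0, 10.0),
    ("assistant", 10.5, 12.0), # waits for the user to finish: not counted
]
rate = interruption_rate(turns, duration_minutes=2.0)
```

Note that user-initiated interruptions are deliberately excluded, matching the metric definition.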
Latency
Purpose: Measurement of delays between user input and agent response, in milliseconds (ms). What it measures: Time between user input and agent response, silence durations. When to use: Performance evaluation, identifying slow response times, conversation flow analysis. How it works: Analyzes the audio signal using Voice Activity Detection (VAD) to identify speaker transitions and measure the time delay between when a user finishes speaking and when the agent begins responding. The metric tracks these response times throughout the conversation to identify patterns and potential issues. How to interpret:
- Target latencies under 500 ms for real-time conversations.
- Target latencies under 2 seconds for complex query responses.
- Higher latencies may indicate performance issues or processing bottlenecks.
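Once VAD has produced speaker-labeled segments, the per-exchange delay measurement reduces to a simple pass over adjacent turns. A minimal sketch (the turn tuples and labels are assumptions for illustration):

```python
def response_latencies_ms(turns):
    """turns: list of (speaker, start_s, end_s) from VAD, ordered by start.
    Returns the delay (ms) between each user turn ending and the next
    agent turn beginning."""
    latencies = []
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] == "user" and cur[0] == "agent":
            latencies.append((cur[1] - prev[2]) * 1000)
    return latencies

turns = [
    ("user", 0.0, 2.0),
    ("agent", 2.4, 5.0),   # 400 ms response: within the 500 ms target
    ("user", 6.0, 7.5),
    ("agent", 9.2, 11.0),  # 1700 ms: only acceptable for complex queries
]
lat = response_latencies_ms(turns)
```

Tracking the full list (rather than only the mean) makes it easier to spot isolated slow turns caused by tool calls or retrieval.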
Time To First Audio
Purpose: Detect audio start latency and responsiveness. What it measures: Time delay between simulation start and the first audible sound in the audio recording. When to use: Evaluating system or agent response latency before any speech begins. How it works: Detects the first audio frame that has RMS energy above a certain threshold and returns the timestamp of this frame. How to interpret:
- < 1000 ms: Fast audio start; considered responsive.
- 1–3 seconds: Acceptable delay.
- > 3000 ms: Noticeable lag; may indicate issues in agent response, recording delay, or user hesitation.
- -1 ms: No audio detected; likely a technical failure or silent recording.
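The frame-by-frame RMS threshold check described above can be sketched as follows. The frame length and threshold values are illustrative assumptions:

```python
import numpy as np

def time_to_first_audio_ms(samples, sr, frame_ms=20, rms_threshold=0.01):
    """Return the timestamp (ms) of the first non-overlapping frame whose
    RMS energy exceeds the threshold, or -1 if the recording stays silent."""
    frame_len = int(sr * frame_ms / 1000)
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > rms_threshold:
            return i / sr * 1000
    return -1

sr = 16000
silence = np.zeros(sr // 2)  # 500 ms of leading silence
tone = 0.2 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
samples = np.concatenate([silence, tone])
```

On this synthetic recording the first audible frame sits at the 500 ms mark, comfortably in the "fast audio start" band.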
Speech Tempo
Purpose: Identifies the rate of phonemes (perceptually distinct units of speech sound) and high-speed speech periods. What it measures: The rate of phonemes per second (PPS) in audio output. When to use: Speech quality assessment; useful for identifying the average tempo. How it works: Measures the number of phonemes per interval over the duration of a speech segment. How to interpret:
- Above 20 PPS is too fast and will be hard to follow.
- Between 15-20 PPS is fast but could be comprehensible.
- 10-15 PPS is the target range: neither too fast nor too slow.
- Below 10 PPS is too slow.
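The thresholds above translate directly into a small classifier, assuming a phoneme count is already available for the segment (real pipelines derive it from a phonemizer or forced alignment):

```python
def tempo_label(phonemes: int, duration_s: float) -> tuple[float, str]:
    """Classify speech tempo from a phoneme count over a speech segment,
    using the PPS bands described above."""
    pps = phonemes / duration_s
    if pps > 20:
        label = "too fast"
    elif pps >= 15:
        label = "fast but comprehensible"
    elif pps >= 10:
        label = "target"
    else:
        label = "too slow"
    return pps, label
```

For example, 120 phonemes over a 10-second segment lands at 12 PPS, inside the target band.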
Pause Analysis
Purpose: Measures how frequently the agent pauses mid-speech and how long those pauses are. What it measures: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. When to use:
- Identifying unnatural or excessive hesitations in agent speech
- Detecting processing delays that manifest as in-speech pauses
- Evaluating speech fluency across different configurations
How to interpret:
- Lower values indicate more fluent speech.
- The detail view shows each pause with its timestamp and duration.
- Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.
Trace Metrics
These metrics use OpenTelemetry (OTel) trace data to measure the performance of individual components in your voice agent pipeline. They provide granular visibility into LLM, TTS, and STT service latencies, token consumption, and tool usage.

All metrics in this section require your agent to send OpenTelemetry traces to Coval.
See the OpenTelemetry Traces guide for setup instructions.
If traces are not configured, these metrics will report an error at execution time.
LLM Time to First Byte
Purpose: Measure LLM responsiveness by tracking how quickly the first token is returned. What it measures: Average time (in seconds) from when the LLM request is sent to when the first token is received, across all turns in the conversation. When to use: Identifying slow LLM providers, comparing model latencies, optimizing prompt length for faster responses.

TTS Time to First Byte
Purpose: Measure TTS responsiveness by tracking how quickly the first audio byte is produced. What it measures: Average time (in seconds) from when text is sent to the TTS service to when the first audio byte is returned, across all turns. When to use: Evaluating TTS provider performance, identifying bottlenecks in the audio generation pipeline.

STT Time to First Byte
Purpose: Measure STT responsiveness by tracking how quickly the first transcription result is returned. What it measures: Average time (in seconds) from when audio is sent to the STT service to when the first transcription result is received, across all turns. When to use: Evaluating STT provider performance, diagnosing why the agent is slow to start processing user input.

LLM Token Usage
Purpose: Track the total token consumption of LLM calls during a conversation. What it measures: Sum of input tokens and output tokens consumed across all LLM calls in the conversation. When to use: Cost monitoring, identifying conversations that consume excessive tokens, comparing prompt strategies for efficiency. How to interpret:
- Token counts vary by model and use case. Track this metric over time to establish baselines for your specific agent.
- Sudden spikes may indicate prompt injection, runaway tool loops, or excessively long conversations.
- Use in combination with turn count to compute average tokens per turn.
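Combining token totals with turn count is a one-liner once the span data is extracted. The records below are hypothetical; the attribute names are illustrative, not Coval's exact span schema:

```python
# Hypothetical per-turn LLM span records extracted from OTel traces.
llm_spans = [
    {"input_tokens": 820, "output_tokens": 45},
    {"input_tokens": 910, "output_tokens": 60},
    {"input_tokens": 1005, "output_tokens": 38},
]
total_tokens = sum(s["input_tokens"] + s["output_tokens"] for s in llm_spans)
tokens_per_turn = total_tokens / len(llm_spans)
```

A steadily growing input-token count per turn is normal (the conversation history grows); a sudden jump in output tokens is the more useful anomaly signal.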
Tool Call Count
Purpose: Count the total number of tool calls made during a conversation. What it measures: Number of tool call invocations detected in OTel trace spans. When to use: Verifying that the agent is using tools as expected, identifying conversations with excessive or insufficient tool usage.

STT Word Error Rate
Purpose: Measure the accuracy of your agent’s Speech-to-Text by comparing it against Coval’s own transcription of the same conversation. What it measures: Word Error Rate (WER) between your agent’s STT output and Coval’s reference transcript of the caller’s speech. The reference (ground truth) is Coval’s transcription of the persona’s speech, generated automatically from each simulation. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans.
When to use: Evaluating your STT provider’s accuracy, comparing STT providers (e.g., Deepgram vs Whisper vs Google), diagnosing why your agent misunderstands users, or tracking STT quality over time.
This metric requires your agent to emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup instructions. If your STT provider exposes utterance confidence, we also recommend sending stt.confidence on the same spans for debugging and provider-quality analysis. No manual ground truth or test data is needed — the reference transcript is generated automatically.

STT Word Error Rate (Audio Upload)
Purpose: Measure STT accuracy against a known-correct transcript that you provide, rather than Coval’s auto-generated reference. What it measures: Word Error Rate (WER) between your agent’s STT output and a ground truth transcript you supply in the test case metadata. The reference (ground truth) comes from the ground_truth_transcript field in your test case metadata. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans — the same as the standard STT Word Error Rate metric.
When to use: When you have audio upload test cases with pre-recorded audio where you know exactly what was said. This lets you measure how accurately your agent’s speech recognition transcribes a known recording — useful for regression testing STT quality against a canonical script.
This metric requires two things:
- Your agent must emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup.
- Your test case must include a ground_truth_transcript key in its metadata containing the reference transcript. See Audio Upload — Ground Truth Transcript for details.
As with the standard metric, we recommend sending stt.confidence on each stt span so low-confidence turns are easier to inspect alongside the WER result.
The ground truth transcript can be supplied in several formats:
| Format | Example |
|---|---|
| Plain text | "Hi, I'd like to check my account balance" |
| Labeled text with timestamps | "[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance" |
| JSON with messages array | Persona turns are extracted automatically — see snippet below |
For labeled transcripts (PERSONA:, AGENT:), only persona/user lines are used — agent lines are filtered out automatically.
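The persona-line filtering can be sketched with a small regular expression over the labeled format shown in the table. This is an illustration of the idea, not Coval's parser:

```python
import re

def persona_lines(labeled_transcript: str) -> list[str]:
    """Keep only persona utterances from a labeled transcript, dropping
    agent lines and optional [start - end] timestamp prefixes."""
    kept = []
    for line in labeled_transcript.splitlines():
        m = re.match(r"(?:\[[^\]]*\]\s*)?PERSONA:\s*(.*)", line.strip())
        if m:
            kept.append(m.group(1))
    return kept

raw = """[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance
[27.0s - 30.1s] AGENT: Sure, can I get your account number?"""
```

Only the persona utterance survives; the AGENT line never reaches the WER comparison.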
How to interpret: Same as STT Word Error Rate — a lower WER means better STT accuracy. Because the reference transcript is your own known-correct text (not Coval’s transcription), this metric isolates your STT provider’s accuracy without any variance from the reference side. For the WER formula and interpretation thresholds, see Transcription Error.
Transcription Accuracy
Transcription Error
Purpose: Evaluate the accuracy of agent messages through Word Error Rate (WER), which measures the percentage of errors in a transcript. What it measures:
WER = (S + D + I) / N
where:
- S = substitutions
- D = deletions
- I = insertions
- N = total number of agent words in the reference transcript
How to interpret:
- WER < 0.10: Great. This indicates excellent, high-quality audio.
- WER 0.10-0.30: Acceptable for most conversational agents and for situations with background noise.
- WER > 0.30: May significantly impact understanding of the audio.
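WER can be computed with a word-level edit (Levenshtein) distance between the reference and hypothesis word sequences, which jointly counts the S, D, and I terms. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("account" -> "count") and one deletion ("please")
# over a 5-word reference: WER = 2/5 = 0.4.
wer = word_error_rate("check my account balance please",
                      "check my count balance")
```

Production implementations typically also normalize punctuation and number formats before comparing, which this sketch omits.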
User Patterns
Audio Sentiment
Purpose: Detect the vocal tone of each audio segment. What it measures: Emotional tone for each audio segment, for both parties. When to use: Understanding the general tone of the conversation and the trend of audio sentiment across the conversation. How it works: Classifies audio sentiment per speaking segment based purely on the audio tone, not the spoken content. How to interpret: Check the frequency of certain emotional tones.

Transcript Sentiment Analysis
Purpose: Analyzes the transcript for rude, polite, encouraging, and professional sentiments, identifying the sentiment with the highest overall score. What it measures: A score for each emotional tone in the agent’s messages. When to use: Understanding the general tone of the agent and how it could be interpreted. How it works: Classifies sentiment per message based on the transcript text rather than the audio tone. How to interpret: Higher scores in each sentiment indicate stronger sentiment detected.

Best Practices for Using Built-in Metrics
Start with Core Metrics
Begin with essential metrics like response time, resolution success, and
audio quality before adding specialized ones.
Set Baselines
Establish baseline measurements before making changes to track improvement
over time.
Combine Metrics
Use multiple metrics together for comprehensive evaluation rather than
relying on single indicators.
Regular Review
Schedule regular metric reviews to identify trends and areas needing
attention.
Metric Selection Guide
Choose metrics based on your use case:

Voice Assistants
- Audio Quality
- Speech Tempo
- Background Noise
- Natural Non-robotic Tone Detection
- Volume/Pitch Misalignment
- Latency
- Interruption Detection
- Trace Metrics (requires OTel traces)
- LLM Time to First Byte
- TTS Time to First Byte
- STT Time to First Byte
Customer Service Bots
- Composite Evaluation
- Resolution Time Efficiency
- End Resolution
- Audio Sentiment
Task Automation Agents
- Workflow Verification
- Composite Evaluation
- Words Per Minute
- LLM Token Usage
- Tool Call Count
General Conversational AI
- Agent Response Times
- Interruption Rate
- Agent Repeats Itself
- Transcript Sentiment Analysis
- End Reason
Remember that not all metrics are suitable for every scenario. Audio metrics
require actual audio input, while comparison metrics need reference data to
function properly.

