
Built-in Metrics

Coval provides a comprehensive suite of built-in metrics to evaluate your AI agent’s performance across multiple dimensions. These metrics are ready to use out of the box and cover audio quality, conversation flow, response timing, and more.

Audio Quality

Audio metrics evaluate the quality and characteristics of speech output from your agents. These are essential for voice-based applications and provide comprehensive analysis of audio fidelity, conversation flow, and speech characteristics.
All metrics in this section require audio input to function properly. They will not work with text-only transcripts.

Background Noise

Purpose: Measurement of audio clarity and background noise. What it measures: Ratio between speech signal strength and background noise, the signal-to-noise ratio (SNR). When to use: Audio quality assessment, identifying poor recording conditions. How it works: Compares the strength of speech signals against background noise by analyzing speech and silent (room tone) segments separately. The metric calculates the ratio between signal and noise power levels, providing both overall and segment-by-segment quality assessments. How to interpret:
  • SNR above 20 dB indicates excellent audio clarity.
  • SNR between 10-20 dB is acceptable for most applications.
  • SNR below 10 dB may significantly impact speech recognition and comprehension.
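
The signal/noise power comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Coval's implementation: the `snr_db` name is hypothetical, and the caller supplies already-separated speech and room-tone samples, since the actual segmentation step is internal to the metric.

```python
import math

def snr_db(speech_samples, noise_samples):
    """SNR in dB from pre-separated speech and room-tone samples.

    Hypothetical sketch: Coval's speech/silence segmentation is internal;
    here the two sample lists are supplied directly by the caller.
    """
    def power(samples):
        # Mean squared amplitude approximates signal power.
        return sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(power(speech_samples) / power(noise_samples))

# Speech at 100x the noise power corresponds to 20 dB:
print(round(snr_db([1.0] * 100, [0.1] * 100)))  # 20
```

With the thresholds above, this toy example would land at the bottom edge of the "excellent" band.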

Natural Non-robotic Tone Detection

Purpose: Analysis of audio frequency characteristics highlighting overtones and harmonics which give human voices their natural timbre. What it measures: Frequency distribution, dominant frequencies, and natural voice characteristics. When to use: This metric helps detect synthetic or robotic-sounding speech that could reduce the naturalness and effectiveness of agent interactions:
  • When evaluating speech synthesis quality
  • When testing new voice models or providers
  • When optimizing voice parameters for naturalness
  • When troubleshooting user complaints about robotic voices
  • Audio analysis, speaker identification, quality assessment
How it works: Analyzes the frequency distribution of the audio signal and measures the percentage of pitched content above 300Hz. How to interpret:
  • > 40%: Very natural / expressive - Found in natural human speech; rich in harmonics and overtones.
  • 30-40%: Acceptably natural - May be synthetic but sounds close to human-like
  • 20-30%: Slightly robotic - Often found in lower-quality TTS systems
  • < 20%: Robotic / flat - Indicates overly monotone or artificial tone
Need a different frequency threshold?
The Audio Frequency Filter custom metric lets you set any Hz value instead of the fixed 300Hz and reports the percentage of voiced segments above or below that threshold. Create one from the metric editor and configure frequency_threshold.
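
The pitched-content percentage could be approximated as below, assuming an upstream pitch tracker has already produced a per-frame pitch estimate. The `pct_above_threshold` name and the `None`-for-unvoiced convention are assumptions for illustration, not part of Coval's API.

```python
def pct_above_threshold(frame_pitches_hz, threshold_hz=300.0):
    """Percentage of voiced frames whose pitch exceeds the threshold.

    Hypothetical sketch: a real pitch tracker (autocorrelation, YIN, etc.)
    is assumed to have labeled each frame; unvoiced frames are None.
    """
    voiced = [p for p in frame_pitches_hz if p is not None]
    if not voiced:
        return 0.0
    return 100.0 * sum(p > threshold_hz for p in voiced) / len(voiced)

# 3 of 4 voiced frames above 300 Hz:
print(pct_above_threshold([220.0, 350.0, None, 410.0, 500.0]))  # 75.0
```

A 75% result would fall in the "very natural / expressive" band of the interpretation table above.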

Music Detection

Purpose: Detects music segments in audio recordings. What it measures: Count of music segments detected in the conversation, with timeline data showing when each music segment occurs. When to use:
  • Detecting hold music or queue music during voice calls
  • Identifying unwanted background music in recordings
  • Measuring music duration and frequency in customer service scenarios
  • Debugging audio quality issues related to music interference
How it works: Analyzes audio to identify non-speech segments and classify them as music. Returns a count of music segments and timeline entries with start/end offsets and duration for each detected music segment. How to interpret:
  • Value equals the count of distinct music segments detected
  • Higher counts indicate more frequent or longer music interruptions
  • Timeline subvalues show exact timestamps for each music segment
  • Useful for identifying when customers are placed on hold with music

Conversation Length

These metrics analyze the content and flow of conversations to ensure effective communication.

Audio Duration

Purpose: Measures how long the audio recording lasts. What it measures: Duration of the full conversation in seconds.

Turn Count

Purpose: Counts how many turns were taken in a conversation. What it measures: Each turn is a change between speakers.

Words Per Message

Purpose: The average number of words per agent message. What it measures: Average number of words per message in a conversation.
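
Both conversation-length metrics are simple to sketch over a message list. The `role`/`content` shape mirrors the JSON messages format shown later in this page; the function names are illustrative, not Coval identifiers.

```python
def turn_count(messages):
    """Count turns, where a turn begins on each change of speaker (sketch)."""
    turns = 0
    prev_role = None
    for msg in messages:
        if msg["role"] != prev_role:
            turns += 1
            prev_role = msg["role"]
    return turns

def words_per_agent_message(messages):
    """Average word count across assistant messages (sketch)."""
    agent = [m["content"] for m in messages if m["role"] == "assistant"]
    return sum(len(c.split()) for c in agent) / len(agent)

convo = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello there how can I help"},
    {"role": "user", "content": "Check my balance"},
    {"role": "assistant", "content": "Sure"},
]
print(turn_count(convo))               # 4
print(words_per_agent_message(convo))  # 3.5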

Instruction Following

These metrics measure how well the agent follows predefined behaviors.

Workflow Verification

Verifies if conversations follow expected workflow patterns and business logic.

Resolution

These metrics evaluate how the conversation ended.

End Reason

Purpose: The reason that the conversation ended. When to use: Help identify patterns in call completion.
This currently only works for simulations run on Coval. Support for live calls is a work in progress.
Possible values:
  • COMPLETED: The conversation reached a natural conclusion with a successful resolution
  • MAX_TURNS: The conversation reached the maximum allowed number of turns
  • MAX_DURATION: The conversation exceeded the maximum allowed duration
  • USER_HANGUP: The user ended the conversation (voice calls only)
  • AGENT_HANGUP: The agent ended the conversation (voice calls only)
  • IDLE_TIMEOUT: The conversation timed out due to inactivity (chat/SMS simulations)
  • ERROR: An error occurred during the simulation
  • UNKNOWN: The end reason could not be determined
Want to define what counts as a successful end reason?
The Successful End Reason custom metric returns YES/NO based on whether the end reason matches your configured success criteria. Select one or more end reasons as your success conditions.
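
The YES/NO logic of that custom metric can be sketched as a simple set-membership check. The success set shown is only an example configuration, and the function name is hypothetical.

```python
# Example configuration: which end reasons count as success (an assumption).
SUCCESS_REASONS = {"COMPLETED", "USER_HANGUP"}

def successful_end_reason(end_reason, success_set=SUCCESS_REASONS):
    """YES/NO check mirroring the Successful End Reason custom metric (sketch)."""
    return "YES" if end_reason in success_set else "NO"

print(successful_end_reason("COMPLETED"))  # YES
print(successful_end_reason("ERROR"))      # NO
```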

Responsiveness

Critical metrics that identify whether the agent is responding correctly.

Agent Fails To Respond

Purpose: Evaluate continuity and identify moments when the agent ignores or misses a user query.
Any occurrence of this metric indicates a critical failure requiring immediate investigation.
What it measures: Long silence from the agent between two consecutive user turns, and whether and when the agent eventually responds after the second user turn. When to use: Identifying moments when the agent ignores or misses a user query. How it works: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks whether the agent resumes speaking after this.
Need a different silence threshold?
The Agent Fails to Respond Delay custom metric lets you configure max_silence_duration_seconds instead of the fixed 5-second default.
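
The gap detection described above might look roughly like the following, assuming turns arrive as time-ordered `(speaker, start_s, end_s)` tuples. That representation and the function name are assumptions for illustration.

```python
def agent_response_failures(turns, max_silence=3.0):
    """Flag silence gaps >= max_silence between consecutive user turns.

    Sketch only: turns is a time-ordered list of (speaker, start_s, end_s)
    tuples; two back-to-back user turns imply no agent speech in between.
    """
    failures = []
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] == "user" and cur[0] == "user":
            gap = cur[1] - prev[2]
            if gap >= max_silence:
                failures.append((prev[2], cur[1]))  # (silence start, end)
    return failures

turns = [("user", 0.0, 2.0), ("user", 6.5, 8.0), ("agent", 8.5, 10.0)]
print(agent_response_failures(turns))  # [(2.0, 6.5)]
```

In this toy conversation the user speaks twice with a 4.5-second silence between, so one failure event is reported.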

Agent Needs Reprompting

Purpose: Identifies when agents become unresponsive but will respond after user repetition.
This metric helps identify edge cases where the agent’s response mechanism may be failing intermittently.
What it measures: Long silence from the agent between two consecutive user turns, only if the agent responds after the second user turn. When to use: Evaluating naturalness and continuity. Identifying moments when the agent ignores or misses a user query. How it works: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks if the agent resumes speaking after this. How to interpret:
  • Each silence gap and eventual response are collectively considered one event.
  • More events = worse responsiveness.
Need a different silence gap threshold?
The Agent Reprompting Delay custom metric lets you configure min_silence_gap_seconds instead of the fixed 2-second default.

Agent Repeats Itself

Purpose: Identifies instances where the agent says the same sentence or asks the same question multiple times. When to use: Evaluating naturalness and word choice, identifying diverse language.

Timing & Latency

Ensure timely agent interactions.

Interruption Rate

Purpose: The rate (interruptions per minute) that the user is interrupted by the assistant. What it measures: An interruption is defined as any time the user is speaking and the assistant starts speaking before the user has finished speaking. This does not include times that the user interrupts the assistant. When to use: Conversation flow analysis, identifying communication issues, training data for interruption handling. How to interpret:
  • High interruption frequency may indicate communication issues.
  • Interruption patterns can help identify conversation flow problems.
  • Useful for training agents to handle interruptions gracefully.
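
Given speaker-labeled segments, the rate can be sketched by counting agent segments that begin while a user segment is still in progress. The `(start_s, end_s)` tuple shape and function name are assumptions, not Coval's internals.

```python
def interruption_rate(user_segs, agent_segs, total_minutes):
    """Interruptions per minute: agent starts speaking mid-user-segment.

    Sketch only; segments are (start_s, end_s) tuples per speaker.
    """
    count = 0
    for a_start, _ in agent_segs:
        # The agent interrupts if its start falls inside a user segment.
        if any(u_start < a_start < u_end for u_start, u_end in user_segs):
            count += 1
    return count / total_minutes

user = [(0.0, 5.0), (10.0, 15.0)]
agent = [(4.0, 9.0), (16.0, 20.0)]
print(interruption_rate(user, agent, 2.0))  # 0.5
```

Note the asymmetry called out above: only agent-onto-user overlaps count, matching the metric definition.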

Latency

Purpose: Measurement of delays between user and agent response time in milliseconds (ms). What it measures: Time between user input and agent response, silence durations. When to use: Performance evaluation, identifying slow response times, conversation flow analysis. How it works: Analyzes the audio signal using Voice Activity Detection (VAD) to identify speaker transitions and measure the time delay between when a user finishes speaking and when the agent begins responding. The metric tracks these response times throughout the conversation to identify patterns and potential issues. How to interpret:
  • Target latencies under 500ms for real-time conversations.
  • Target latencies under 2 seconds for complex query responses.
  • Higher latencies may indicate performance issues or processing bottlenecks.
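
Once VAD has produced speaker-labeled turns, the per-response latency reduces to measuring the gap from each user turn's end to the next agent turn's start. The tuple representation and function name below are illustrative assumptions.

```python
def response_latencies_ms(turns):
    """Latency from each user turn end to the next agent turn start, in ms.

    Sketch of the VAD-based measurement; turns is a time-ordered list of
    (speaker, start_s, end_s) tuples, a shape assumed for illustration.
    """
    latencies = []
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] == "user" and cur[0] == "agent":
            latencies.append((cur[1] - prev[2]) * 1000.0)
    return latencies

turns = [("user", 0.0, 2.0), ("agent", 2.4, 5.0),
         ("user", 5.5, 7.0), ("agent", 8.2, 9.0)]
print([round(l) for l in response_latencies_ms(turns)])  # [400, 1200]
```

Against the targets above, the first response is comfortably real-time while the second sits in the complex-query band.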

Time To First Audio

Purpose: Detect audio start latency and responsiveness. What it measures: Time delay between simulation start and the first audible sound in the audio recording. When to use: Evaluating system or agent response latency before any speech begins. How it works: Detects the first audio frame that has RMS energy above a certain threshold and returns the timestamp of this frame. How to interpret:
  • < 1000 ms: Fast audio start; considered responsive.
  • 1–3 seconds: Acceptable delay.
  • > 3 seconds: Noticeable lag; may indicate issues in agent response, recording delay, or user hesitation.
  • -1 ms: No audio detected; likely a technical failure or silent recording.
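
The RMS-threshold scan could look like the sketch below. The frame size and energy threshold are placeholder assumptions, not Coval's actual parameters; the -1 sentinel matches the "no audio detected" case above.

```python
import math

def time_to_first_audio_ms(frames, frame_ms=20.0, rms_threshold=0.01):
    """Timestamp (ms) of the first frame whose RMS energy exceeds a threshold.

    Sketch: frames is a list of sample lists; frame_ms and rms_threshold
    are assumed values. Returns -1 when no frame qualifies.
    """
    for i, frame in enumerate(frames):
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms > rms_threshold:
            return i * frame_ms
    return -1

silence = [[0.0] * 160] * 10  # ten silent 20 ms frames
speech = [[0.2] * 160] * 5    # then audible frames
print(time_to_first_audio_ms(silence + speech))  # 200.0
print(time_to_first_audio_ms(silence))           # -1
```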

Speech Tempo

Purpose: Identifies the rate of phonemes (perceptually distinct units of speech sound) and high-speed speech periods. What it measures: The rate of phonemes per second (PPS) in audio output. When to use: Speech quality assessment. Useful for identifying the average tempo. How it works: Measures the number of phonemes per interval over the duration of each speech segment. How to interpret:
  • Above 20 PPS is too fast and will be hard to follow.
  • Between 15-20 PPS is fast but could be comprehensible.
  • Between 10-15 PPS is the target: neither too fast nor too slow.
  • Below 10 PPS is too slow.

Pause Analysis

Purpose: Measures how frequently the agent pauses mid-speech and how long those pauses are. What it measures: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. When to use:
  • Identifying unnatural or excessive hesitations in agent speech
  • Detecting processing delays that manifest as in-speech pauses
  • Evaluating speech fluency across different configurations
How it works: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded. How to interpret:
  • Lower values indicate more fluent speech.
  • The detail view shows each pause with its timestamp and duration.
  • Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.
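
Within a single agent turn, pause extraction reduces to measuring the gaps between consecutive speaking segments. The segment tuples, function name, and return shape below are assumptions for illustration.

```python
def pause_stats(agent_segments):
    """Pauses between consecutive agent segments within one turn (sketch).

    agent_segments: time-ordered (start_s, end_s) tuples for a single turn.
    Returns (pause_count, total_pause_s, average_pause_s).
    """
    gaps = [b[0] - a[1] for a, b in zip(agent_segments, agent_segments[1:])]
    pauses = [g for g in gaps if g > 0]  # ignore back-to-back segments
    if not pauses:
        return (0, 0.0, 0.0)
    return (len(pauses), sum(pauses), sum(pauses) / len(pauses))

segs = [(0.0, 1.5), (2.0, 4.0), (4.0, 5.0), (6.0, 7.0)]
print(pause_stats(segs))  # (2, 1.5, 0.75)
```

As noted above, inter-turn gaps and persona pauses would be excluded before segments reach this step.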

Trace Metrics

These metrics use OpenTelemetry (OTel) trace data to measure the performance of individual components in your voice agent pipeline. They provide granular visibility into LLM, TTS, and STT service latencies, token consumption, and tool usage.
All metrics in this section require your agent to send OpenTelemetry traces to Coval. See the OpenTelemetry Traces guide for setup instructions. If traces are not configured, these metrics will report an error at execution time.

LLM Time to First Byte

Purpose: Measure LLM responsiveness by tracking how quickly the first token is returned. What it measures: Average time (in seconds) from when the LLM request is sent to when the first token is received, across all turns in the conversation. When to use: Identifying slow LLM providers, comparing model latencies, optimizing prompt length for faster responses.

TTS Time to First Byte

Purpose: Measure TTS responsiveness by tracking how quickly the first audio byte is produced. What it measures: Average time (in seconds) from when text is sent to the TTS service to when the first audio byte is returned, across all turns. When to use: Evaluating TTS provider performance, identifying bottlenecks in the audio generation pipeline.

STT Time to First Byte

Purpose: Measure STT responsiveness by tracking how quickly the first transcription result is returned. What it measures: Average time (in seconds) from when audio is sent to the STT service to when the first transcription result is received, across all turns. When to use: Evaluating STT provider performance, diagnosing why the agent is slow to start processing user input.

LLM Token Usage

Purpose: Track the total token consumption of LLM calls during a conversation. What it measures: Sum of input tokens and output tokens consumed across all LLM calls in the conversation. When to use: Cost monitoring, identifying conversations that consume excessive tokens, comparing prompt strategies for efficiency. How to interpret:
  • Token counts vary by model and use case. Track this metric over time to establish baselines for your specific agent.
  • Sudden spikes may indicate prompt injection, runaway tool loops, or excessively long conversations.
  • Use in combination with turn count to compute average tokens per turn.

Tool Call Count

Purpose: Count the total number of tool calls made during a conversation. What it measures: Number of tool call invocations detected in OTel trace spans. When to use: Verifying that the agent is using tools as expected, identifying conversations with excessive or insufficient tool usage.

STT Word Error Rate

Purpose: Measure the accuracy of your agent’s Speech-to-Text by comparing it against Coval’s own transcription of the same conversation. What it measures: Word Error Rate (WER) between your agent’s STT output and Coval’s reference transcript of the caller’s speech. The reference (ground truth) is Coval’s transcription of the persona’s speech, generated automatically from each simulation. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans. When to use: Evaluating your STT provider’s accuracy, comparing STT providers (e.g., Deepgram vs Whisper vs Google), diagnosing why your agent misunderstands users, or tracking STT quality over time.
This metric requires your agent to emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup instructions. If your STT provider exposes utterance confidence, we also recommend sending stt.confidence on the same spans for debugging and provider-quality analysis. No manual ground truth or test data is needed — the reference transcript is generated automatically.
How to interpret: A lower WER means your STT is more accurately capturing what the caller said. Compare this metric across runs to track STT quality over time, or across different STT providers to find the best fit for your use case. Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally. For the WER formula and interpretation thresholds, see Transcription Error.

STT Word Error Rate (Audio Upload)

Purpose: Measure STT accuracy against a known-correct transcript that you provide, rather than Coval’s auto-generated reference. What it measures: Word Error Rate (WER) between your agent’s STT output and a ground truth transcript you supply in the test case metadata. The reference (ground truth) comes from the ground_truth_transcript field in your test case metadata. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans — the same as the standard STT Word Error Rate metric. When to use: When you have audio upload test cases with pre-recorded audio where you know exactly what was said. This lets you measure how accurately your agent’s speech recognition transcribes a known recording — useful for regression testing STT quality against a canonical script.
This metric requires two things:
  1. Your agent must emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup.
  2. Your test case must include a ground_truth_transcript key in its metadata containing the reference transcript. See Audio Upload — Ground Truth Transcript for details.
If your STT provider exposes utterance confidence, we also recommend attaching stt.confidence to each stt span so low-confidence turns are easier to inspect alongside the WER result.
Accepted ground truth formats:
  • Plain text: "Hi, I'd like to check my account balance"
  • Labeled text with timestamps: "[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance"
  • JSON with messages array: persona turns are extracted automatically; see the snippet below
{
  "messages": [
    { "role": "user", "content": "Hi, I'd like to check my account balance" },
    { "role": "assistant", "content": "Sure, I can help with that." }
  ]
}
When the ground truth contains role labels (e.g. PERSONA:, AGENT:), only persona/user lines are used — agent lines are filtered out automatically. How to interpret: Same as STT Word Error Rate — a lower WER means better STT accuracy. Because the reference transcript is your own known-correct text (not Coval’s transcription), this metric isolates your STT provider’s accuracy without any variance from the reference side. For the WER formula and interpretation thresholds, see Transcription Error.
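
The role-label filtering described above could be approximated as follows. The exact label handling (timestamp stripping, treating unlabeled lines as persona speech) is an assumption for illustration, not Coval's parsing code.

```python
def persona_lines(ground_truth):
    """Keep only persona/user lines from a labeled ground-truth transcript.

    Sketch of the filtering described above; label conventions assumed.
    """
    kept = []
    for line in ground_truth.splitlines():
        text = line.split("]", 1)[-1].strip()  # drop optional [..s - ..s] prefix
        if text.startswith("PERSONA:"):
            kept.append(text[len("PERSONA:"):].strip())
        elif not text.startswith("AGENT:") and text:
            kept.append(text)  # plain unlabeled text treated as persona speech
    return " ".join(kept)

gt = "[15.4s - 26.8s] PERSONA: Hi there\n[27.0s - 30.0s] AGENT: Hello"
print(persona_lines(gt))  # Hi there
```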
Custom Trace Metrics — In addition to these built-in trace metrics, you can create your own custom trace metrics to measure any OTel span attribute emitted by your agent using the Create Metric button within the Metrics section of the Coval UI.

Transcription Accuracy

Transcription Error

Purpose: Evaluate the accuracy of agent messages through Word Error Rate (WER), which measures the percentage of errors in a transcript. What it measures:
WER = (S + D + I) / N
Where:
  • S = substitutions
  • D = deletions
  • I = insertions
  • N = total number of agent words in the reference transcript
How to interpret:
  • WER < 0.10: Excellent; indicates high-quality audio and accurate transcription.
  • WER 0.10 - 0.30: Acceptable for most conversational agents and situations with background noise.
  • WER > 0.30: May significantly impact understanding of the audio.
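
The formula can be computed with a standard word-level edit distance, where the minimal S + D + I count is the Levenshtein distance between the two word sequences. This is a textbook implementation sketch, not Coval's internal code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance (sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("account" -> "count") over 4 reference words:
print(word_error_rate("check my account balance", "check my count balance"))  # 0.25
```

The example lands in the 0.10-0.30 "acceptable" band above.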
See Coval Benchmarks for real-world WER performance data across different transcription providers and audio configurations.

User Patterns

Audio Sentiment

Purpose: Detect vocal tone of each audio segment. What it measures: Emotional tone for each audio segment for both parties. When to use: General tone of the conversation and trend of audio sentiment across the conversation. How it works: Classifies audio sentiment per speaking segment based purely on the audio tone and not the spoken content. How to interpret: Check frequency of certain emotional tones.
Want to set a pass/fail threshold on sentiment?
The Preferred Audio Sentiment custom metric lets you select which sentiments count as success, choose which speaker to evaluate (agent or persona), and set a minimum percentage of segments that must match.

Transcript Sentiment Analysis

Purpose: Analyzes the transcript for rude, polite, encouraging, and professional sentiments, identifying the sentiment with the highest overall score. What it measures: Score for each sentiment category across the agent's messages. When to use: General tone of the agent and how it could be interpreted. How it works: Scores each sentiment category from the transcript text of the agent's messages rather than the audio tone. How to interpret: Higher scores in each sentiment indicate stronger sentiment detected.

Best Practices for Using Built-in Metrics

Start with Core Metrics

Begin with essential metrics like response time, resolution success, and audio quality before adding specialized ones.

Set Baselines

Establish baseline measurements before making changes to track improvement over time.

Combine Metrics

Use multiple metrics together for comprehensive evaluation rather than relying on single indicators.

Regular Review

Schedule regular metric reviews to identify trends and areas needing attention.

Metric Selection Guide

Choose metrics based on your use case:

Voice Assistants

  • Audio Quality
    • Speech Tempo
    • Background Noise
    • Natural Non-robotic Tone Detection
    • Volume/Pitch Misalignment
  • Latency
  • Interruption Detection
  • Trace Metrics (requires OTel traces)
    • LLM Time to First Byte
    • TTS Time to First Byte
    • STT Time to First Byte

Customer Service Bots

  • Composite Evaluation
  • Resolution Time Efficiency
  • End Resolution
  • Audio Sentiment

Task Automation Agents

  • Workflow Verification
  • Composite Evaluation
  • Words Per Message
  • LLM Token Usage
  • Tool Call Count

General Conversational AI

  • Agent Response Times
  • Interruption Rate
  • Agent Repeats Itself
  • Transcript Sentiment Analysis
  • End Reason
Remember that not all metrics are suitable for every scenario. Audio metrics require actual audio input, while comparison metrics need reference data to function properly.