Built-in Metrics

Coval provides a comprehensive suite of built-in metrics to evaluate your AI agent’s performance across multiple dimensions. These metrics are ready to use out of the box and cover audio quality, conversation flow, response timing, and more.

Audio Quality

Audio metrics evaluate the quality and characteristics of speech output from your agents. These are essential for voice-based applications and provide comprehensive analysis of audio fidelity, conversation flow, and speech characteristics.
All metrics in this section require audio input to function properly. They will not work with text-only transcripts.

Background Noise

Purpose: Measurement of audio clarity and background noise. What it measures: Ratio between speech signal strength and background noise, i.e. the signal-to-noise ratio (SNR). When to use: Audio quality assessment, identifying poor recording conditions. How it works: Compares the strength of speech signals against background noise by analyzing speech and silent (room tone) segments separately. The metric calculates the ratio between signal and noise power levels, providing both overall and segment-by-segment quality assessments. How to interpret:
  • SNR above 20 dB indicates excellent audio clarity.
  • SNR between 10 and 20 dB is acceptable for most applications.
  • SNR below 10 dB may significantly impact speech recognition and comprehension.
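To build intuition for the signal-vs-noise comparison described above, here is a minimal sketch in Python. This is illustrative, not Coval's implementation; it assumes you already have a speech/silence mask, e.g. from a VAD.

import numpy as np

def snr_db(samples: np.ndarray, speech_mask: np.ndarray) -> float:
    """Estimate SNR by comparing speech power against room-tone power.

    samples:     mono audio as a float array
    speech_mask: boolean array, True where a VAD marked speech
    """
    signal_power = np.mean(samples[speech_mask] ** 2)
    noise_power = np.mean(samples[~speech_mask] ** 2)
    return 10.0 * np.log10(signal_power / max(noise_power, 1e-12))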

Natural Non-robotic Tone Detection

Purpose: Analysis of audio frequency characteristics highlighting overtones and harmonics which give human voices their natural timbre. What it measures: Frequency distribution, dominant frequencies, and natural voice characteristics. When to use: This metric helps detect synthetic or robotic-sounding speech that could reduce the naturalness and effectiveness of agent interactions:
  • When evaluating speech synthesis quality
  • When testing new voice models or providers
  • When optimizing voice parameters for naturalness
  • When troubleshooting user complaints about robotic voices
  • General audio analysis, speaker identification, and quality assessment
How it works: Analyzes the frequency distribution of the audio signal and measures the percentage of pitched content above 300Hz. How to interpret:
  • > 40%: Very natural / expressive - Found in natural human speech; rich in harmonics and overtones.
  • 30-40%: Acceptably natural - May be synthetic but sounds close to human-like
  • 20-30%: Slightly robotic - Often found in lower-quality TTS systems
  • < 20%: Robotic / flat - Indicates overly monotone or artificial tone
Need a different frequency threshold?
The Audio Frequency Filter custom metric lets you set any Hz value instead of the fixed 300Hz and reports the percentage of voiced segments above or below that threshold. Create one from the metric editor and configure frequency_threshold.
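To build intuition for what such a frequency threshold measures, here is a rough sketch that computes the share of spectral energy above a cutoff as a harmonic-richness proxy. This is one plausible reading of the analysis, not Coval's exact algorithm; it assumes librosa is available.

import numpy as np
import librosa

def pct_energy_above(path: str, threshold_hz: float = 300.0) -> float:
    """Share of spectral energy above threshold_hz (a rough harmonic-richness proxy)."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y, n_fft=2048)) ** 2           # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    total = S.sum()
    above = S[freqs > threshold_hz, :].sum()
    return 100.0 * above / max(total, 1e-12)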

Music Detection

Purpose: Detects music segments in audio recordings. What it measures: Count of music segments detected in the conversation, with timeline data showing when each music segment occurs. When to use:
  • Detecting hold music or queue music during voice calls
  • Identifying unwanted background music in recordings
  • Measuring music duration and frequency in customer service scenarios
  • Debugging audio quality issues related to music interference
How it works: Analyzes audio to identify non-speech segments and classify them as music. Returns a count of music segments and timeline entries with start/end offsets and duration for each detected music segment. How to interpret:
  • Value equals the count of distinct music segments detected
  • Higher counts indicate more frequent or longer music interruptions
  • Timeline subvalues show exact timestamps for each music segment
  • Useful for identifying when customers are placed on hold with music

Audio Artifact Detection

These metrics detect synthesis and recording artifacts in agent audio output. Where the Audio Quality metrics above measure signal-level properties (noise, frequency, duration), artifact detection looks for discrete failure modes — clipping, signal dropouts, TTS loops, unnatural phoneme stretching, rhythm irregularities, voice identity drift, and anomalous pauses — that degrade perceived naturalness or indicate a broken TTS pipeline.
All metrics in this section require audio input. They are designed for voice agents using text-to-speech synthesis and will surface TTS-specific artifacts that general audio quality metrics do not detect.

Speech Artifact Score

Purpose: Composite quality score aggregating all artifact analyzers into a single 0–1 signal. What it measures: A weighted average of per-analyzer severity scores, inverted so that 1.0 = fully clean and 0.0 = severe artifact presence. Displayed with a quality tier badge in the UI.
Score    Tier
≥ 0.8    Good
≥ 0.6    Fair
≥ 0.4    Poor
< 0.4    Severe
When to use: Use this as your primary artifact health signal when you want a single number to track across runs or set an alert threshold on. Drill into the per-analyzer breakdown to identify which failure mode is driving the score. How it works: Runs all 8 analyzers and collects a 0–1 severity from each via a sigmoid mapping calibrated for telephony audio. Severities are combined via a weighted average; if an analyzer fails or is skipped (e.g., insufficient audio length), its weight is redistributed across the remaining analyzers so the score stays meaningful. How to interpret:
  • The per-analyzer breakdown is split into three sections:
    • Quality — Voice Quality shows a quality score (1.0 = natural, 0.0 = degraded)
    • Severity — Clipping, Dropout, Timbre Drift show severity (0.0 = clean, 1.0 = severe)
    • Diagnostic Measurements — Syllable Rate, Loop Detection, Phoneme Stretch, Pause Detection show severity scores
  • Waveform regions are color-coded by source: purple for signal artifacts (clipping, dropout), blue for speech anomalies (loop, phoneme stretch, syllable rate, timbre drift), teal for voice quality issues.
  • Opening a dropdown in the anomaly list filters the waveform to show only that analyzer’s regions. Opening multiple shows the union. Closing all restores the full view.
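The weight-redistribution step described under "How it works" can be pictured with a short sketch. This is illustrative only, not Coval's code; analyzer names and weights are placeholders.

def composite_score(severities: dict, weights: dict):
    """Weighted severity average; weights of skipped analyzers are redistributed.

    severities: analyzer name -> severity in [0, 1], or None if skipped/failed
    weights:    analyzer name -> base weight (weights sum to 1.0)
    Returns a quality score in [0, 1] where 1.0 means fully clean.
    """
    active = {name: s for name, s in severities.items() if s is not None}
    if not active:
        return None
    active_weight = sum(weights[name] for name in active)   # implicit redistribution
    severity = sum(weights[name] / active_weight * s for name, s in active.items())
    return 1.0 - severity

For example, composite_score({"clipping": 0.1, "dropout": None}, {"clipping": 0.5, "dropout": 0.5}) redistributes dropout's weight onto clipping and returns 0.9.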

Signal Artifacts


Clipping

Purpose: Detect digital clipping caused by audio samples exceeding the signal ceiling. What it measures: The fraction of total samples inside clipping runs (0–1). Samples with absolute amplitude above 0.95 that persist for at least 5 consecutive samples are counted. When to use: When agents sound distorted or “crackly,” or when you suspect TTS output levels are misconfigured. Common in pipelines where gain staging is not controlled. How it works: Scans the raw (non-normalized) audio for consecutive samples above the amplitude threshold (0.95). A minimum run length of 5 samples filters out isolated spikes. Per-event severity scales linearly with how far the peak exceeds the threshold. How to interpret:
  • Any clipping fraction above ~0.01 (1%) typically produces audible distortion.
  • High clipping severity combined with low dropout severity points to a gain staging issue upstream of your TTS provider.
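A minimal version of the run-length scan described above, using the quoted thresholds. Illustrative only; the per-event severity scaling is omitted.

import numpy as np

def clipping_fraction(x: np.ndarray, ceiling: float = 0.95, min_run: int = 5) -> float:
    """Fraction of samples inside clipping runs: |x| > ceiling for >= min_run samples."""
    hot = (np.abs(x) > ceiling).astype(np.int8)
    edges = np.flatnonzero(np.diff(np.concatenate(([0], hot, [0]))))
    clipped = 0
    for start, end in zip(edges[::2], edges[1::2]):        # (run start, run end) pairs
        if end - start >= min_run:
            clipped += end - start
    return clipped / len(x)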

Dropout

Purpose: Detect brief signal interruptions where audio unexpectedly drops to near-silence. What it measures: The density of dropout events per minute of audio. A dropout is a 50–200 ms window where the signal falls ≥25 dB below the local baseline with a steep onset edge (≥20 dB within 2 ms). When to use: When callers report choppy audio or cut-out moments, or when you see unexplained gaps in waveform visualizations. Also useful for detecting network-induced packet loss on telephony integrations. How it works: Computes a rolling local energy baseline and flags windows where RMS drops sharply. The steep-edge requirement distinguishes dropouts from natural pauses. Events are filtered to the 50–200 ms duration range to avoid conflating dropouts with silence or pauses. How to interpret:
  • Isolated events (< 0.5/min) are often benign; rates above 1–2/min produce noticeably choppy audio.
  • Short, steep dropouts that cluster in time suggest a network or buffering issue; distributed dropouts suggest a TTS rendering problem.
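A simplified frame-based version of this detection, for intuition. It uses a rolling median as the local baseline and omits the steep-onset edge test Coval applies; frame and window sizes are assumptions.

import numpy as np

def dropout_events(x: np.ndarray, sr: int, drop_db: float = 25.0,
                   min_ms: int = 50, max_ms: int = 200):
    """Flag windows whose RMS falls drop_db below a rolling baseline."""
    hop = int(sr * 0.010)                                  # 10 ms frames
    n = len(x) // hop * hop
    frames = x[:n].reshape(-1, hop)
    rms_db = 20.0 * np.log10(np.sqrt((frames ** 2).mean(axis=1)) + 1e-10)
    k = 101                                                # ~1 s rolling median window
    pad = np.pad(rms_db, k // 2, mode="edge")
    baseline = np.array([np.median(pad[i:i + k]) for i in range(len(rms_db))])
    low = rms_db < baseline - drop_db
    events, start = [], None
    for i, flag in enumerate(np.append(low, False)):       # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            dur_ms = (i - start) * 10
            if min_ms <= dur_ms <= max_ms:                 # keep only dropout-length runs
                events.append((start * 0.010, dur_ms))     # (start time s, duration ms)
            start = None
    return events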

Codec

Purpose: Detect codec compression artifacts introduced by audio encoding and decoding. What it measures: Mean MAD z-score across detected codec artifact events. Higher values indicate more anomalous spectral features (MFCC cosine distance, spectral flatness, spectral flux) relative to the call’s own median. Per-event timeline severities are normalized to 0–1 by capping at a z-score of 10. When to use: When audio has a “muddy,” over-compressed, or digitally degraded quality that isn’t explained by clipping or dropout. Common when audio passes through low-bitrate codecs (e.g., G.711 on telephony) or multiple encode/decode cycles. How it works: Extracts codec-sensitive spectral features and applies MAD-based outlier detection across segments. Segments with feature values that deviate significantly from the call median are flagged. The raw z-score is normalized to 0–1 by capping at 10, preventing extreme outliers from distorting comparisons.
Codec is excluded from the Speech Artifact Score composite because its raw z-scores can be very large, which would dominate the weighted average even after normalization. It is still surfaced in the per-analyzer breakdown and waveform regions.
How to interpret:
  • High codec severity with low clipping and dropout suggests the artifact is encoding-induced, not a gain or signal issue.
  • Telephony integrations running G.711 or similar narrow-band codecs will have a naturally elevated baseline.

Speech Anomalies


Loop Detection

Purpose: Detect repeated audio segments caused by TTS synthesis looping on a phrase or fragment. What it measures: Count of distinct loop patterns detected. Each loop must last at least 1.0 second, repeat at least 3 times, and have pitch correlation above 0.95 (confirming TTS origin rather than natural repetition). When to use: When agents occasionally repeat themselves verbatim in a way that does not match the conversation script, or when TTS audio sounds like it stuttered and replayed a segment. How it works: Computes MFCC (Mel-frequency cepstral coefficient) fingerprints across the audio and detects recurrence patches where fingerprints match closely. Pitch correlation above 0.95 between candidate segments is required as a confirmation gate, distinguishing true TTS loops from natural lexical repetition. How to interpret:
  • Any loop event above the minimum duration threshold (1.0 s) is worth investigating — true loops are almost never intentional.
  • Low MFCC similarity at or near the threshold indicates borderline matches.

Phoneme Stretch

Purpose: Detect unnaturally sustained phonemes where a single sound is held far longer than normal speech cadence. What it measures: Total duration of all phoneme stretch events in seconds. When to use: When TTS audio sounds like it is “freezing” on a syllable or vowel, which can occur when synthesis models encounter unusual input (long numbers, rare proper nouns, edge cases in SSML). How it works: Applies a quad-gate: a region must simultaneously satisfy voiced-segment detection, low pitch jitter (below 2% — TTS vocoders produce unnaturally stable pitch), stable fundamental frequency (F0 std < 15 Hz), and MFCC stasis (minimal spectral movement) to be flagged. All four conditions must hold to filter out natural expressive lengthening and prosodic emphasis. How to interpret:
  • Events under 0.5 s may be expressive prosody; events above 1.5 s are almost certainly synthesis artifacts.

Syllable Rate

Purpose: Detect rhythm irregularities indicating unnaturally mechanical or erratic speech pacing. What it measures: Syllable rate in syllables per second, along with rhythm diagnostics (nPVI, inter-syllable CV, and a composite irregularity score). The headline value is the raw syllable rate; the aggregate uses the rhythm irregularity score (0–1) for severity calculation. When to use: When agent speech sounds robotic or rushed, or when you want to validate that a new TTS model produces natural prosodic rhythm before deploying it. How it works: Measures variability between successive syllable durations (nPVI) and the coefficient of variation of inter-syllable gaps alongside absolute syllable rate. Scores are penalized when the rate falls outside the expected natural range (2.5–6.0 syl/s) or when rhythmic variability is abnormally low (monotone) or abnormally high (erratic). How to interpret:
  • A low irregularity score near 0 does not mean the speech sounds natural — it could indicate perfectly uniform (robotic) cadence. Examine both score and rate together.
  • High irregularity in combination with a rate above 6.0 syl/s suggests rushed synthesis; below 2.5 syl/s suggests sluggish or over-paused synthesis.
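For reference, nPVI and the inter-syllable coefficient of variation can be computed as follows. This is a sketch; syllable boundaries and gaps are assumed to come from an upstream segmenter.

import numpy as np

def npvi(durations) -> float:
    """Normalized Pairwise Variability Index over successive syllable durations."""
    d = np.asarray(durations, dtype=float)
    pairwise = np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2.0)
    return 100.0 * pairwise.mean()

def inter_syllable_cv(gaps) -> float:
    """Coefficient of variation (std/mean) of gaps between syllables."""
    g = np.asarray(gaps, dtype=float)
    return float(g.std() / g.mean())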

Timbre Drift

Purpose: Detect mid-call changes in voice identity — shifts in speaker timbre or pitch that make the agent sound like a different person. What it measures: Maximum cosine distance of speaker embeddings from an anchor reference (0–1). Higher values indicate greater drift from the original voice identity. F0 drift is tracked as an additional signal. When to use: When callers report that the agent’s voice changed during a call, or when you suspect your TTS provider is switching voice models or applying inconsistent conditioning mid-session. How it works: Extracts ECAPA-TDNN speaker embeddings at regular intervals. Two references are maintained: a session anchor (first 15 seconds) and a rolling 30-second window centroid. Drift is flagged when either the anchor distance or rolling centroid distance exceeds the threshold, or when F0 deviates more than 30% from the anchor mean. Both embedding drift and pitch shift contribute to severity, with a multiplier when both signals fire simultaneously. How to interpret:
  • Anchor drift catches slow degradation (voice getting “tired”); rolling centroid drift catches sudden voice swaps.
  • Small drift values (< 0.2) are within normal TTS session variation; values above 0.3–0.4 are perceivable to most listeners.
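The anchor-distance part of this analysis can be sketched as below. Illustrative only; it assumes you already have per-interval speaker embeddings (e.g. from an ECAPA-TDNN model) extracted at a fixed step, and it omits the rolling-centroid and F0 checks.

import numpy as np

def max_anchor_drift(embeddings, anchor_secs: float = 15.0,
                     step_secs: float = 1.0) -> float:
    """Max cosine distance of any embedding from the session anchor.

    embeddings: array of shape (n_frames, dim), one embedding per step_secs of audio
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)       # unit-normalize rows
    n_anchor = max(1, int(anchor_secs / step_secs))
    anchor = E[:n_anchor].mean(axis=0)                     # session anchor centroid
    anchor = anchor / np.linalg.norm(anchor)
    return float((1.0 - E @ anchor).max())                 # cosine distance = 1 - similarity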

Voice Quality

Purpose: Measure naturalness and vocal health of the agent’s synthesized speech across four acoustic dimensions. What it measures: A composite naturalness score 0–1 (higher = more natural) combining four sub-metrics:
Sub-metric                                  Weight    What it captures
CPPS (Cepstral Peak Prominence Smoothed)    40%       Voice clarity and breathiness — how well the harmonic structure stands above noise
Jitter                                      20%       Cycle-to-cycle pitch period stability — irregularity sounds like roughness or hoarseness
Shimmer                                     20%       Cycle-to-cycle amplitude stability — irregularity sounds like tremor or unsteadiness
F0 variability                              20%       Pitch expressiveness — monotone speech with low F0 range scores poorly
When to use: When onboarding a new TTS provider or voice model, when monitoring for voice degradation over time, or when callers report the agent sounding robotic, breathy, or unsteady. How it works: Requires both raw and LUFS-normalized audio. CPPS is computed on the LUFS-normalized variant, which stabilizes the cepstral envelope for accurate breathiness measurement. Jitter and shimmer are computed on the raw variant using a 3-second sliding window with 1-second hops, so short-lived instabilities are captured without being diluted by longer clean segments. F0 variability is derived from pitch statistics across the call — both standard deviation and range must exceed thresholds for a healthy score. How to interpret:
Score    Tier
≥ 0.8    Excellent
≥ 0.6    Good
≥ 0.4    Fair
< 0.4    Poor
  • CPPS carries the most weight (40%) and is the most sensitive indicator of breathiness and vocal fry. A CPPS below ~6 dB strongly suggests synthesis quality issues.
  • Jitter and shimmer thresholds are drawn from clinical speech pathology research on healthy vs. disordered voices. Each sub-metric scores 1.0 in the normal range, degrades linearly through the warning zone, and hits 0.0 at the instability ceiling:
    Sub-metric    Normal (score 1.0)    Warning zone    Instability (score 0.0)
    Jitter        < 1.04%               1.04–2.0%       ≥ 2.0%
    Shimmer       < 3.81%               3.81–6.0%       ≥ 6.0%
    Well-tuned TTS systems should stay well under the normal floors. Values approaching the instability ceiling typically sound noticeably rough or unsteady to listeners.
  • F0 variability scores poorly when F0 standard deviation is below 15 Hz or pitch range is below 30 Hz — both must exceed their thresholds for a full score.
  • High jitter or shimmer without CPPS degradation often reflects expressive prosody rather than a synthesis defect. Look at all four sub-scores in the breakdown before drawing conclusions.
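The linear decay through the warning zone maps directly onto a small scoring helper. A sketch using the thresholds above; the CPPS and F0 sub-scores are assumed to be computed elsewhere.

def stability_score(value_pct: float, normal_floor: float, ceiling: float) -> float:
    """1.0 below the normal floor, linear decay through the warning zone, 0.0 at the ceiling."""
    if value_pct < normal_floor:
        return 1.0
    if value_pct >= ceiling:
        return 0.0
    return 1.0 - (value_pct - normal_floor) / (ceiling - normal_floor)

def voice_quality(cpps_score: float, jitter_pct: float,
                  shimmer_pct: float, f0_score: float) -> float:
    """Weighted composite: 40% CPPS, 20% jitter, 20% shimmer, 20% F0 variability."""
    jitter = stability_score(jitter_pct, 1.04, 2.0)
    shimmer = stability_score(shimmer_pct, 3.81, 6.0)
    return 0.4 * cpps_score + 0.2 * jitter + 0.2 * shimmer + 0.2 * f0_score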

Pause Detection

Purpose: Detect anomalous pauses that are outliers relative to the agent’s own baseline pause distribution. What it measures: Count of anomalous pause events, identified using MAD (Median Absolute Deviation) z-score thresholding against the distribution of all pauses in the call. The headline value is the count of outlier pauses. When to use: When callers report the agent “hanging” or going silent unexpectedly, or when you want to separate natural hesitation patterns from processing-delay artifacts. How it works: Identifies all silent gaps in agent speech and computes a MAD-based z-score for each, making the detector robust to non-Gaussian pause distributions. Pauses with a z-score above the threshold (3.0) are flagged as anomalous. This approach adapts to each call’s natural rhythm rather than applying a fixed silence threshold. How to interpret:
  • The detector flags pauses that are statistical outliers within the call — a single very long pause in an otherwise fluent call will score higher than the same pause duration in a call with consistently slow pacing.
  • MAD z-score 3.0 roughly corresponds to pauses more than 3x the typical duration for that agent in that call.
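MAD-based z-scoring is standard robust statistics; a minimal version of the outlier test looks like this (illustrative, not Coval's code).

import numpy as np

def anomalous_pauses(pause_secs, z_threshold: float = 3.0):
    """Return (duration, z) for pauses whose MAD z-score exceeds the threshold."""
    p = np.asarray(pause_secs, dtype=float)
    med = np.median(p)
    mad = np.median(np.abs(p - med))
    if mad == 0:
        return []                                  # all pauses identical; no outliers
    z = 0.6745 * (p - med) / mad                   # 0.6745 makes MAD comparable to a std dev
    return [(float(d), float(s)) for d, s in zip(p, z) if s > z_threshold]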

Conversation Length

These metrics analyze the content and flow of conversations to ensure effective communication.

Audio Duration

Purpose: Measures the length of the call recording. What it measures: Duration of the full conversation in seconds.

Turn Count

Purpose: Counts how many turns were taken in a conversation. What it measures: Number of speaker turns; a new turn begins each time the speaker changes.

Words Per Message

Purpose: The average number of words per agent message. What it measures: Average number of words per message in a conversation.

Speaking Time Percentage

Purpose: Measures the percentage of total audio duration occupied by a selected speaker or silence. What it measures: The fraction of the call spent by the configured role, expressed as a percentage (0-100%). Configurable via the role parameter:
  • Agent — percentage of the call where the agent is speaking.
  • Persona — percentage where the customer/persona is speaking.
  • Silence — percentage of dead air (no speech or music).
  • Music — percentage of hold music or background music.
When to use: Analyzing call composition — e.g., checking if the agent dominates the conversation, identifying excessive hold time, or measuring dead air. How to interpret: The four roles are complementary: agent + persona + silence + music = 100%. Create multiple instances with different roles to get a full breakdown. Audio regions are highlighted on the waveform timeline for each detected segment.
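Given speaker-labeled segments, the breakdown is a straightforward aggregation, as in this sketch. The segment structure and role labels are assumptions, not Coval's internal format.

from collections import defaultdict

def speaking_time_pct(segments, total_secs: float) -> dict:
    """segments: (role, start_s, end_s) tuples covering the call without overlap.
    Returns a percentage per role; roles sum to 100 when segments tile the call."""
    totals = defaultdict(float)
    for role, start, end in segments:
        totals[role] += end - start
    return {role: 100.0 * t / total_secs for role, t in totals.items()}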

Instruction Following

These metrics measure how well the agent follows predefined behaviors.

Workflow Verification

Verifies if conversations follow expected workflow patterns and business logic.

Resolution

Evaluate the end of the conversation.

End Reason

Purpose: The reason that the conversation ended. When to use: Help identify patterns in call completion.
This currently only works for simulations run on Coval. Support for live calls is a work in progress.
Possible values:
Value           Description
COMPLETED       The conversation reached a natural conclusion with a successful resolution
MAX_TURNS       The conversation reached the maximum allowed number of turns
MAX_DURATION    The conversation exceeded the maximum allowed duration
USER_HANGUP     The user ended the conversation (voice calls only)
AGENT_HANGUP    The agent ended the conversation (voice calls only)
IDLE_TIMEOUT    The conversation timed out due to inactivity (chat/SMS simulations)
ERROR           An error occurred during the simulation
UNKNOWN         The end reason could not be determined
Want to define what counts as a successful end reason?
The Successful End Reason custom metric returns YES/NO based on whether the end reason matches your configured success criteria. Select one or more end reasons as your success conditions.

Responsiveness

Critical metrics for identifying whether the agent is responding correctly.

Agent Fails To Respond

Purpose: Evaluate continuity and identify moments when the agent ignores or misses a user query.
Any occurrence of this metric indicates a critical failure requiring immediate investigation.
What it measures: Long silence from the agent between two consecutive user turns, and whether and when the agent eventually responds after the second user turn. When to use: Identifying moments when the agent ignores or misses a user query. How it works: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks whether the agent resumes speaking afterward.
Need a different silence threshold?
The Agent Fails to Respond Delay custom metric lets you configure max_silence_duration_seconds instead of the fixed 5-second default.

Agent Needs Reprompting

Purpose: Identifies when agents become unresponsive but will respond after user repetition.
This metric helps identify edge cases where the agent’s response mechanism may be failing intermittently.
What it measures: Long silence from the agent between two consecutive user turns, only if the agent responds after the second user turn. When to use: Evaluating naturalness and continuity. Identifying moments when the agent ignores or misses a user query. How it works: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks if the agent resumes speaking after this. How to interpret:
  • Each silence gap and eventual response are collectively considered one event.
  • More events = worse responsiveness.
Need a different silence gap threshold?
The Agent Reprompting Delay custom metric lets you configure min_silence_gap_seconds instead of the fixed 2-second default.

Agent Repeats Itself

Purpose: Identifies instances where the agent says the same sentence or asks the same question multiple times. When to use: Evaluating naturalness and word choice, identifying diverse language.

Timing & Latency

Ensure timely agent interactions.

Interruption Rate

Purpose: The rate (interruptions per minute) that the user is interrupted by the assistant. What it measures: An interruption is defined as any time the user is speaking and the assistant starts speaking before the user has finished speaking. This does not include times that the user interrupts the assistant. When to use: Conversation flow analysis, identifying communication issues, training data for interruption handling. How to interpret:
  • High interruption frequency may indicate communication issues.
  • Interruption patterns can help identify conversation flow problems.
  • Useful for training agents to handle interruptions gracefully.
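With speaker-labeled segments, the counting rule under the definition above reads like this. A sketch; the segment format is an assumption.

def interruptions_per_minute(segments, call_secs: float) -> float:
    """Count assistant turns that start while a user segment is still in progress.

    segments: chronological (speaker, start_s, end_s) tuples
    """
    count = 0
    for prev, cur in zip(segments, segments[1:]):
        if prev[0] == "user" and cur[0] == "assistant" and cur[1] < prev[2]:
            count += 1                          # assistant started before user finished
    return count / (call_secs / 60.0)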

Latency

Purpose: Measurement of delays between user and agent response time in milliseconds (ms). What it measures: Time between user input and agent response, silence durations. When to use: Performance evaluation, identifying slow response times, conversation flow analysis. How it works: Analyzes the audio signal using Voice Activity Detection (VAD) to identify speaker transitions and measure the time delay between when a user finishes speaking and when the agent begins responding. The metric tracks these response times throughout the conversation to identify patterns and potential issues. How to interpret:
  • Target latencies under 500ms for real-time conversations.
  • Target latencies under 2 seconds for complex query responses.
  • Higher latencies may indicate performance issues or processing bottlenecks.
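Once VAD has produced speaker-labeled segments, per-turn latency is just the gap between a user's end and the agent's next start, as in this illustrative sketch.

def response_latencies_ms(segments):
    """segments: chronological (speaker, start_s, end_s) tuples.
    Returns one latency per user -> agent handoff."""
    latencies = []
    for prev, cur in zip(segments, segments[1:]):
        if prev[0] == "user" and cur[0] == "agent":
            latencies.append((cur[1] - prev[2]) * 1000.0)
    return latencies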

Time To First Audio

Purpose: Detect audio start latency and responsiveness. What it measures: Time delay between simulation start and the first audible sound in the audio recording. When to use: Evaluating system or agent response latency before any speech begins. How it works: Detects the first audio frame that has RMS energy above a certain threshold and returns the timestamp of this frame. How to interpret:
  • < 1000 ms: Fast audio start; considered responsive.
  • 1000–3000 ms: Acceptable delay.
  • > 3000 ms: Noticeable lag; may indicate issues in agent response, recording delay, or user hesitation.
  • -1: No audio detected; likely a technical failure or silent recording.
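The frame-energy scan can be sketched like this. The frame size and RMS floor are placeholders; Coval's actual threshold is not documented here.

import numpy as np

def time_to_first_audio_ms(x: np.ndarray, sr: int,
                           frame_ms: int = 20, floor_db: float = -40.0) -> float:
    """Timestamp (ms) of the first frame whose RMS exceeds floor_db; -1 if none found."""
    hop = int(sr * frame_ms / 1000)
    for i in range(0, len(x) - hop + 1, hop):
        rms = np.sqrt(np.mean(x[i:i + hop] ** 2))
        if 20.0 * np.log10(rms + 1e-12) > floor_db:
            return i / sr * 1000.0
    return -1.0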

Speech Tempo

Purpose: Identifies the rate of phonemes (perceptually distinct units of speech sound) and high-speed speech periods. What it measures: The rate of phonemes per second (PPS) in audio output. When to use: Speech quality assessment; useful for identifying the average tempo. How it works: Measures the number of phonemes per interval over the duration of a speech segment. How to interpret:
  • Above 20 PPS: too fast; likely hard to follow.
  • 15–20 PPS: fast but still comprehensible.
  • 10–15 PPS: the target range; neither too fast nor too slow.
  • Below 10 PPS: too slow.

Pause Analysis

Purpose: Measures how frequently the agent pauses mid-speech and how long those pauses are. What it measures: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. When to use:
  • Identifying unnatural or excessive hesitations in agent speech
  • Detecting processing delays that manifest as in-speech pauses
  • Evaluating speech fluency across different configurations
How it works: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded. How to interpret:
  • Lower values indicate more fluent speech.
  • The detail view shows each pause with its timestamp and duration.
  • Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.

Trace Metrics

These metrics use OpenTelemetry (OTel) trace data to measure the performance of individual components in your voice agent pipeline. They provide granular visibility into LLM, TTS, and STT service latencies, token consumption, and tool usage.
All metrics in this section require your agent to send OpenTelemetry traces to Coval. See the OpenTelemetry Traces guide for setup instructions. If traces are not configured, these metrics will report an error at execution time.

LLM Time to First Byte

Purpose: Measure LLM responsiveness by tracking how quickly the first token is returned. What it measures: Average time (in seconds) from when the LLM request is sent to when the first token is received, across all turns in the conversation. When to use: Identifying slow LLM providers, comparing model latencies, optimizing prompt length for faster responses.

TTS Time to First Byte

Purpose: Measure TTS responsiveness by tracking how quickly the first audio byte is produced. What it measures: Average time (in seconds) from when text is sent to the TTS service to when the first audio byte is returned, across all turns. When to use: Evaluating TTS provider performance, identifying bottlenecks in the audio generation pipeline.

STT Time to First Byte

Purpose: Measure STT responsiveness by tracking how quickly the first transcription result is returned. What it measures: Average time (in seconds) from when audio is sent to the STT service to when the first transcription result is received, across all turns. When to use: Evaluating STT provider performance, diagnosing why the agent is slow to start processing user input.

LLM Token Usage

Purpose: Track the total token consumption of LLM calls during a conversation. What it measures: Sum of input tokens and output tokens consumed across all LLM calls in the conversation. When to use: Cost monitoring, identifying conversations that consume excessive tokens, comparing prompt strategies for efficiency. How to interpret:
  • Token counts vary by model and use case. Track this metric over time to establish baselines for your specific agent.
  • Sudden spikes may indicate prompt injection, runaway tool loops, or excessively long conversations.
  • Use in combination with turn count to compute average tokens per turn.
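As the last bullet notes, combining this metric with Turn Count gives a per-turn baseline:

def avg_tokens_per_turn(input_tokens: int, output_tokens: int, turn_count: int) -> float:
    """Combine LLM Token Usage with Turn Count to establish a per-turn baseline."""
    return (input_tokens + output_tokens) / max(turn_count, 1)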

Tool Call Count

Purpose: Count the total number of tool calls made during a conversation. What it measures: Number of tool call invocations detected in OTel trace spans. When to use: Verifying that the agent is using tools as expected, identifying conversations with excessive or insufficient tool usage.

STT Word Error Rate

Purpose: Measure the accuracy of your agent’s Speech-to-Text by comparing it against Coval’s own transcription of the same conversation. What it measures: Word Error Rate (WER) between your agent’s STT output and Coval’s reference transcript of the caller’s speech. The reference (ground truth) is Coval’s transcription of the persona’s speech, generated automatically from each simulation. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans. When to use: Evaluating your STT provider’s accuracy, comparing STT providers (e.g., Deepgram vs Whisper vs Google), diagnosing why your agent misunderstands users, or tracking STT quality over time.
This metric requires your agent to emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup instructions. If your STT provider exposes utterance confidence, we also recommend sending stt.confidence on the same spans for debugging and provider-quality analysis. No manual ground truth or test data is needed — the reference transcript is generated automatically.
How to interpret: A lower WER means your STT is more accurately capturing what the caller said. Compare this metric across runs to track STT quality over time, or across different STT providers to find the best fit for your use case. Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally. For the WER formula and interpretation thresholds, see Transcription Error.

STT Word Error Rate (Audio Upload)

Purpose: Measure STT accuracy against a known-correct transcript that you provide, rather than Coval’s auto-generated reference. What it measures: Word Error Rate (WER) between your agent’s STT output and a ground truth transcript you supply in the test case metadata. The reference (ground truth) comes from the ground_truth_transcript field in your test case metadata. The hypothesis (what you’re testing) is your agent’s own STT output, read from the transcript attribute on OTel stt spans — the same as the standard STT Word Error Rate metric. When to use: When you have audio upload test cases with pre-recorded audio where you know exactly what was said. This lets you measure how accurately your agent’s speech recognition transcribes a known recording — useful for regression testing STT quality against a canonical script.
This metric requires two things:
  1. Your agent must emit OTel traces with the transcript attribute on each stt span. See Instrumenting STT Spans for setup.
  2. Your test case must include a ground_truth_transcript key in its metadata containing the reference transcript. See Audio Upload — Ground Truth Transcript for details.
If your STT provider exposes utterance confidence, we also recommend attaching stt.confidence to each stt span so low-confidence turns are easier to inspect alongside the WER result.
Accepted ground truth formats:
Format                          Example
Plain text                      "Hi, I'd like to check my account balance"
Labeled text with timestamps    "[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance"
JSON with messages array        Persona turns are extracted automatically — see snippet below
{
  "messages": [
    { "role": "user", "content": "Hi, I'd like to check my account balance" },
    { "role": "assistant", "content": "Sure, I can help with that." }
  ]
}
When the ground truth contains role labels (e.g. PERSONA:, AGENT:), only persona/user lines are used — agent lines are filtered out automatically. How to interpret: Same as STT Word Error Rate — a lower WER means better STT accuracy. Because the reference transcript is your own known-correct text (not Coval’s transcription), this metric isolates your STT provider’s accuracy without any variance from the reference side. For the WER formula and interpretation thresholds, see Transcription Error.
Custom Trace Metrics — In addition to these built-in trace metrics, you can create your own custom trace metrics to measure any OTel span attribute emitted by your agent using the Create Metric button within the Metrics section of the Coval UI.

Transcription Accuracy

Transcription Error

Purpose: Evaluate transcription accuracy through Word Error Rate (WER) — the percentage of words in the existing transcript that differ from a reference transcription Coval generates from the call audio. What it measures: WER = (S + D + I) / N, where:
  • S = substitutions
  • D = deletions
  • I = insertions
  • N = total number of words in the reference transcript
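This formula corresponds to a Levenshtein alignment over word tokens. A minimal reference implementation, for illustration; Coval's scorer also applies the confidence and similarity filters described below.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein edit distance over word tokens."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                    # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                    # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + cost)      # substitution or match
    return d[len(r)][len(h)] / max(len(r), 1)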
When to use:
  • Detecting transcription quality regressions across runs or audio configurations
  • Comparing speech-to-text providers for the agent or persona side
  • Reporting WER for the agent, the caller, or both sides of a conversation
  • Measuring transcription accuracy on uploaded conversations
How it works: Coval generates an independent reference transcription of the call audio and compares it to the existing transcript word by word. Each missing, inserted, or substituted word becomes an error and is surfaced as a word-level highlight in the transcript view (substitutions in yellow, deletions in red, insertions in blue). The metric automatically detects which speaker channel is which, so it works whether the audio was recorded by Coval or uploaded as a conversation. Configuration (via metric metadata):
Parameter                    Default    Description
role                         "agent"    Whose turns to score. One of "agent", "persona", or "both" to measure agent and caller together.
min_reference_confidence     0.8        Drop missing- or wrong-word errors where the reference recognizer’s confidence is below this value (0.0–1.0). Filters out errors that are likely just reference uncertainty. Leave blank to disable.
min_substitute_similarity    0.8        Drop substitution errors where the original and replacement words are at least this similar (0.0–1.0). Filters out spelling variants like “color”/“colour”. Leave blank to disable.
The headline WER reflects errors after filtering, so the displayed number always matches the highlighted errors in the transcript view. How to interpret:
  • WER < 0.10: Excellent — clean audio with high transcription accuracy.
  • WER 0.10 – 0.30: Acceptable for most conversational agents and situations with background noise.
  • WER > 0.30: May significantly impact understanding of the audio.
See Coval Benchmarks for real-world WER performance data across different transcription providers and audio configurations.

User Patterns

Audio Sentiment

Purpose: Detect the vocal tone of each audio segment. What it measures: Emotional tone of each audio segment for both parties. When to use: Assessing the general tone of the conversation and the trend of audio sentiment across it. How it works: Classifies audio sentiment per speaking segment based purely on the audio tone, not the spoken content. How to interpret: Check the frequency of particular emotional tones.
Want to set a pass/fail threshold on sentiment?
The Preferred Audio Sentiment custom metric lets you select which sentiments count as success, choose which speaker to evaluate (agent or persona), and set a minimum percentage of segments that must match.

Transcript Sentiment Analysis

Purpose: Analyzes the transcript for rude, polite, encouraging, and professional sentiments, identifying the sentiment with the highest overall score. What it measures: A score for each emotional tone across the agent’s messages. When to use: Assessing the general tone of the agent and how it could be interpreted. How it works: Scores each agent message in the transcript against the four sentiment categories based on the spoken content rather than audio tone. How to interpret: Higher scores in each sentiment indicate stronger sentiment detected.

Best Practices for Using Built-in Metrics

Start with Core Metrics

Begin with essential metrics like response time, resolution success, and audio quality before adding specialized ones.

Set Baselines

Establish baseline measurements before making changes to track improvement over time.

Combine Metrics

Use multiple metrics together for comprehensive evaluation rather than relying on single indicators.

Regular Review

Schedule regular metric reviews to identify trends and areas needing attention.

Metric Selection Guide

Choose metrics based on your use case:

Voice Assistants

  • Audio Quality
    • Speech Tempo
    • Background Noise
    • Natural Non-robotic Tone Detection
    • Volume/Pitch Misalignment
  • Latency
  • Interruption Rate
  • Trace Metrics (requires OTel traces)
    • LLM Time to First Byte
    • TTS Time to First Byte
    • STT Time to First Byte

Customer Service Bots

  • Composite Evaluation
  • Resolution Time Efficiency
  • End Resolution
  • Audio Sentiment

Task Automation Agents

  • Workflow Verification
  • Composite Evaluation
  • Words Per Message
  • LLM Token Usage
  • Tool Call Count

General Conversational AI

  • Agent Response Times
  • Interruption Rate
  • Agent Repeats Itself
  • Transcript Sentiment Analysis
  • End Reason
Remember that not all metrics are suitable for every scenario. Audio metrics require actual audio input, while comparison metrics need reference data to function properly.