Custom Metrics - Coval Documentation

This guide provides instruction for creating high-performing custom prompting metrics in Coval’s evaluation platform. Each metric type benefits from various prompting strategies to achieve reliable, deterministic results. For help writting prompts for the custom metrics Coval offers an optimize metric button to improve clarity and confidence.

Core Principles for All Metrics

1. Specificity Over Generality

Define exact evaluation criteria rather than subjective assessments
Use concrete, measurable behaviors instead of abstract concepts
Provide clear boundary conditions for edge cases

2. Role Consistency

Always refer to the AI agent as “the assistant”
Use “the user” or “the customer” for human participants
Maintain consistent terminology throughout your prompts

3. Deterministic Design

Structure prompts to minimize LLM variance across evaluations
Provide explicit decision trees when possible
Define what constitutes partial vs. complete success

Binary LLM Judge Metrics

Purpose: Yes/No evaluations with high accuracy and consistency

Prompt Structure Template

[CONTEXT SETTING]
Given the transcript, [SPECIFIC QUESTION]?

Return YES if:
• [Explicit criterion 1]
• [Explicit criterion 2]
• [Explicit criterion 3]

Return NO if:
• [Explicit disqualifying condition 1]
• [Explicit disqualifying condition 2]
• [Edge case handling]

[CLARIFICATIONS FOR EDGE CASES]

Important: When using OR conditions, make it explicitly clear that the metric should return YES/NO if any of the conditions are met. Use “ANY of the following” language to remove ambiguity.

“Return x if ANY of the following apply:”
“[Condition] OR [Condition] OR [Condition] … “

Example 1: Issue Resolution Detection

Given the transcript, did the assistant successfully resolve the user's primary issue or concern?

Return YES if ANY of the following apply:
• The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you")
• OR the assistant provides a complete solution and the user accepts it without further objection
• OR the user indicates satisfaction with the outcome before ending the conversation
• OR the assistant completes a requested action and the user acknowledges success
• OR the user's question was fully answered and they don't ask follow-up questions about the same issue
• OR the assistant provides complete, actionable guidance and the user indicates understanding
• OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries)

Return NO if ANY of the following apply:
• The user states their issue remains unresolved
• OR the conversation ends without addressing the user's main concern
• OR the user expresses frustration or dissatisfaction with the proposed solution
• OR the assistant escalates or transfers the issue without providing any resolution attempt
• OR the user has to repeat their problem multiple times without progress
• OR the assistant admits they cannot help or solve the user's problem
• OR the user asks the same question again after receiving an answer

Example 2: Compliance Verification

Given the transcript, did the assistant properly collect all required verification information before processing the request?

Return YES if:
• The assistant gathered account number, full name, and security question answer
• All three verification elements were confirmed before proceeding
• The assistant explicitly stated verification was complete

Return NO if:
• Any of the three required elements (account number, name, security answer) were skipped
• The assistant proceeded with the request before completing verification
• Verification was attempted but failed, yet the assistant continued anyway

If the user refuses to provide verification, return NO regardless of the reason.

Tips and tricks

Be Objective

Recommended: Objective: “Did the assistant acknowledge the user’s concern within their first two responses?”Avoid: Too subjective: “Did the assistant provide good customer service?”

Single Focus

Recommended: Singular observation:

Create separate metrics for seperate obervations such as resolution and professionalism.

Avoid: Multiple criteria:

“Did the assistant resolve the issue and maintain professionalism?”

Clear Logic

Recommended: Use of clear logical operators

Use AND/OR operators, ANY/ALL.

Avoid: Evaluation logic that contradicts the stated rules

Metrics return incorrect results when the evaluation system checks for things that shouldn’t trigger failures.
Such as requiring disclosure when no transfer occurred, or flagging live conversations as voicemails.

Before (Poor Metric Example):

Based on the transcript, did the customer service agent ask about
the customer's preferred contact method, current service plan, or billing preferences?

Return YES if:
All three preference items were specifically inquired about.

Return NO if:
One or more items were not asked.

Why this fails: The metric has an “OR” condition in the question but requires “AND” logic in the evaluation, creating confusion about whether one or all conditions must be met.

After (Improved Metric Example):

Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, or billing preferences?

Return YES if:
• The agent asked about preferred contact method, current service plan, AND billing preferences
• This can be in a single question (e.g., "What's your preferred contact method, current plan, and billing preference?") OR separate questions for each item

Return NO ONLY if:
• The agent failed to ask about one or more of these three specific items: contact method, service plan, or billing preferences
• Note: Focus on what the AGENT asked, not on what the customer mentioned in their response

Examples of acceptable questions:
• "How would you like us to contact you, what's your current plan, and how do you prefer to handle billing?"
• Three separate questions covering each preference
• Any variation that covers all three customer preference areas

Key improvements: Clear AND/OR operators, explicit examples, and evaluation logic that matches the stated conditions.

Categorical LLM Judge Metrics

Purpose: Classification into predefined, mutually exclusive custom categories.

Prompt Structure Template

Classify [SPECIFIC ASPECT] based on the conversation content.

Decision Logic:
• If [condition], classify as [CATEGORY_NAME]
• If [condition], classify as [CATEGORY_NAME]
• If [condition], classify as [CATEGORY_NAME]

Return only the exact category name.

Note: Configure the category options and their definitions in the Coval UI category menu. The categories and their descriptions are set through the platform interface, not in the prompt text.

Example 1: Call Intent Classification

Classify the primary reason for this conversation based on the user's needs and requests.

Decision Logic:
• If user mentions technical problems, errors, or "not working", classify as TECHNICAL_SUPPORT
• If user mentions money, charges, bills, or payments, classify as BILLING_INQUIRY
• If user wants to change account details or settings, classify as ACCOUNT_MANAGEMENT
• If user asks general questions without specific issues, classify as GENERAL_INFORMATION
• If user expresses dissatisfaction and requests escalation, classify as COMPLAINT_ESCALATION

Return only the exact category name.

Example 2: Conversation Outcome Classification

Classify the final outcome of this conversation based on how it concluded.

Decision Logic:
• If user explicitly confirms resolution or satisfaction, classify as RESOLVED_SUCCESSFULLY
• If solution provided but requires user action outside conversation, classify as PARTIALLY_RESOLVED
• If conversation transferred to human agent, classify as ESCALATED_TO_HUMAN
• If user ends conversation frustrated or without resolution, classify as UNRESOLVED_ABANDONED
• If user asked questions and received answers without specific problems, classify as INFORMATION_PROVIDED

Return only the exact category name.

Numerical LLM Judge Metrics

Purpose: Score-based evaluations with consistent integer scaling.

Prompt Structure Template

Rate [SPECIFIC ASPECT] based on the following criteria:

Evaluation Criteria:
• [Criterion 1 with behavioral indicators]
• [Criterion 2 with behavioral indicators]
• [Criterion 3 with behavioral indicators]

Scoring Guidelines:
Low scores: [Behavioral indicators for poor performance]
High scores: [Behavioral indicators for excellent performance]

Return only the numerical score.

Note: Configure the Min and Max score values as shown, not in the prompt text. The scoring scale (e.g., 1-5, 1-10) is set through the platform interface.

Example 1: Empathy Assessment

Rate the assistant's empathy level based on the following criteria:

Evaluation Criteria:
• Acknowledgment of user emotions and concerns
• Use of appropriate empathetic language and tone indicators
• Validation of user feelings before moving to solutions
• Adaptation of communication style to user's emotional state

Scoring Guidelines:
Low scores: No empathy shown, dismissive responses, purely transactional
High scores: Clear empathetic responses, validates feelings, shows genuine concern

Return only the numerical score.

Example 2: Technical Accuracy Scoring

Rate the technical accuracy of the assistant's information based on the following criteria:

Evaluation Criteria:
• Factual correctness of all technical statements
• Completeness of technical explanations
• Appropriate level of technical detail for the context
• Identification and correction of any technical misconceptions

Scoring Guidelines:
Low scores: Major technical errors that could cause problems
High scores: Expert-level accuracy with comprehensive, precise details

Return only the numerical score.

Multimodal LLM Judge Metrics

Purpose: Include audio-specific evaluations that text analysis cannot capture. Multimodal LLM Judge metrics analyze the audio along with the transcript text. This allows you to evaluate qualities like vocal tone, speech clarity, pacing, and emotional expression that are impossible to assess from text alone.

The format of the Multimodal LLM judge metrics are the same as the LLM judge metrics. Coval will handle the audio processing automatically, Your prompt should focus on what you want to evaluate, not how to process the audio.

What Audio Metrics Can Detect

Audio LLM Judge metrics excel at evaluating:

Category	Examples
Speech Quality	Clarity, articulation, pronunciation, stuttering
Vocal Characteristics	Tone, pitch, volume consistency, speaking pace
Emotional Expression	Enthusiasm, frustration, sarcasm, empathy in voice
Professional Demeanor	Courtesy, patience, confidence, nervousness
Speaker Identification	Distinguishing between speakers, detecting interruptions

Prompt Structure Template

[SPECIFIC AUDIO QUESTION]

Audio Analysis Criteria:
• [Acoustic feature 1]
• [Vocal characteristic 2]
• [Speech pattern 3]

Return YES if:
• [Audio-specific condition 1]
• [Audio-specific condition 2]

Return NO if:
• [Audio-specific disqualifier 1]
• [Audio-specific disqualifier 2]

Note: [Clarification about evaluation scope]

Writing Effective Audio Prompts: Be specific about which speaker to evaluate (assistant, user, or both) and what acoustic qualities matter. Vague prompts like “Did it sound good?” produce inconsistent results.

Transcript Scope for Audio Metrics

Audio LLM Judge metrics support Transcript Scope, allowing you to evaluate only specific portions of the audio. When you apply filters (such as agent-only or last N turns), the system automatically extracts and evaluates only the corresponding audio segments. This is particularly useful for:

Evaluating agent speech quality without user audio
Focusing on closing statements or greetings
Reducing token costs on long recordings

Best Practices for Audio Metrics

Focus on Audio-Only Qualities

Only use Audio LLM Judge for evaluations that require hearing the audio. If something can be determined from the transcript alone (like whether specific words were said), use a standard LLM Judge metric instead - it’s faster and more cost-effective.Use Audio metrics for: Tone of voice, speaking pace, pronunciation clarity, emotional expression, volume issuesUse Text metrics for: Word choice, script compliance, information accuracy

Specify the Speaker Role

Always clarify whose audio you’re evaluating:

“Did the assistant speak clearly…”
“Did the user sound frustrated…”
“Was there crosstalk between both speakers…”

This prevents ambiguity when multiple voices are present.

Define Concrete Audio Criteria

Replace subjective terms with specific, observable audio qualities:

Avoid	Use Instead
”Good tone"	"Calm, even-paced tone without audible frustration"
"Clear speech"	"Words pronounced distinctly without mumbling or slurring"
"Professional"	"Business-appropriate volume and pace, no sighing or dismissive inflections”

Include Reasoning Guidance

For complex evaluations, ask the model to consider specific aspects before making a determination. This improves accuracy:

Before making your determination, consider:
What is the overall vocal tone throughout the call?
Are there any moments where the tone shifts notably?
How would a customer likely perceive this tone?

Example 1: Speech Clarity Assessment

Did the assistant speak clearly and at an appropriate pace throughout the conversation?

Audio Analysis Criteria:
• Pronunciation clarity and articulation
• Speaking pace (not too fast or slow for comprehension)
• Volume consistency and audibility
• Absence of mumbling, slurring, or rushed speech

Return YES if:
• All words are clearly pronounced and easily understood
• Speaking pace allows for comfortable comprehension
• Volume remains consistent and audible throughout
• No instances of unclear or garbled speech

Return NO if:
• Words are frequently mumbled, slurred, or unclear
• Speaking pace is too fast or slow for easy comprehension
• Volume fluctuations make parts difficult to hear
• Any portions of speech are unintelligible due to clarity issues

Note: Focus only on the assistant's speech clarity, not content quality.

Example 2: Professional Tone Detection

Did the assistant maintain a professional vocal tone throughout the conversation?

Audio Analysis Criteria:
• Tone consistency and appropriateness for business context
• Absence of inappropriate emotional expressions (anger, frustration, sarcasm)
• Professional demeanor in vocal inflection and manner
• Respectful and courteous vocal presentation

Return YES if:
• Vocal tone remains professional and business-appropriate throughout
• No instances of unprofessional vocal expressions or attitudes
• Tone conveys respect and courtesy consistently
• Emotional responses, if any, are appropriate to the context

Return NO if:
• Vocal tone becomes unprofessional, dismissive, or inappropriate
• Clear instances of anger, frustration, or sarcasm in voice
• Tone suggests disrespect or lack of courtesy
• Emotional vocal responses inappropriate for professional context

Note: Evaluate vocal tone and manner, not the words spoken.

Example 3: Empathy Detection

Did the assistant demonstrate vocal empathy when the user expressed frustration or concern?

Audio Analysis Criteria:
• Softening of tone when user expresses negative emotions
• Appropriate pacing adjustments (slowing down to show care)
• Warm, understanding vocal quality rather than robotic or dismissive
• Verbal acknowledgments delivered with genuine-sounding concern

Return YES if:
• Assistant's tone audibly softens or warms in response to user distress
• Pacing adjusts appropriately to show the assistant is listening
• Voice conveys genuine concern rather than scripted responses
• No rushing through empathetic statements

Return NO if:
• Assistant maintains the same tone regardless of user's emotional state
• Empathetic words are delivered in a flat, robotic, or rushed manner
• Assistant sounds impatient or dismissive when user is upset
• No vocal adaptation to the user's emotional needs

Note: Evaluate the vocal delivery of empathy, not just whether empathetic words were used.

Example 4: Speaker Diarization Quality

Can the two speakers (assistant and user) be clearly distinguished throughout the recording?

Audio Analysis Criteria:
• Distinct vocal characteristics between speakers
• Clear turn-taking without excessive overlap
• Ability to attribute each utterance to the correct speaker
• Audio quality sufficient for speaker identification

Return YES if:
• Each speaker has distinguishable vocal qualities
• Turn-taking is clear with minimal confusing overlaps
• All significant utterances can be attributed to a specific speaker
• No extended portions where speaker identity is unclear

Return NO if:
• Speakers sound too similar to reliably distinguish
• Frequent overlapping speech makes attribution difficult
• Significant portions have unclear speaker identity
• Audio quality issues (echo, distortion) prevent speaker identification

Note: This metric evaluates audio clarity for speaker identification, not conversation quality.

Common Pitfalls to Avoid

Don’t mix audio and text evaluations in a single Audio metric. If you need to check both “Did they sound professional?” AND “Did they say the required disclaimer?”, create two separate metrics - an Audio metric for tone and a Text metric for the disclaimer.

Pitfall	Problem	Solution
Evaluating transcript content	Audio metrics can’t reliably assess word choice	Use standard LLM Judge for text content
Vague audio criteria	”Good voice” is subjective and inconsistent	Define specific qualities: pace, clarity, tone
Missing speaker specification	Unclear whose voice to evaluate	Always specify: assistant, user, or both
Combining unrelated qualities	”Clear AND professional AND empathetic” is too broad	Create separate metrics for each quality

Transcript Scope

Purpose: Focus metric evaluation on specific portions of a conversation rather than the entire transcript. Transcript Scope allows you to filter which messages the LLM evaluates, reducing noise and improving accuracy for targeted assessments. This feature is available for all LLM Judge metrics (Binary, Numerical, Categorical) and Audio LLM Judge metrics.

When to Use Transcript Scope

Use Case	Filter Configuration
Evaluate only agent responses	Role filter: `agent`
Check the closing of a conversation	Range filter: Last 3 turns
Assess user sentiment only	Role filter: `user`
Focus on recent context	Range filter: Last N messages

Configuration Options

Transcript Scope Toggle:

Full (default) - Evaluate the entire transcript
Custom - Apply filters to focus on specific messages

Available Filters:

Role Filter

Limit evaluation to messages from specific speakers:

Agent - Only evaluate assistant/agent messages
User - Only evaluate user/customer messages
Both - Evaluate messages from selected roles

This is useful when you want to assess agent behavior without user input affecting the evaluation, or vice versa.

Range Filter

Limit evaluation to a specific portion of the conversation:

Last N turns - Evaluate only the final N message exchanges
First N turns - Evaluate only the opening N message exchanges

This is useful for evaluating specific phases of a conversation, such as greetings, closings, or resolution attempts.

Transcript Scope for Audio Metrics

When using Transcript Scope with Audio LLM Judge metrics, the system automatically:

Filters the transcript to the selected messages
Uses message timestamps to extract the corresponding audio segments
Merges adjacent audio segments (within 0.5 seconds) to avoid artifacts
Sends only the filtered audio to the LLM for evaluation

This enables focused audio evaluations while reducing processing time and token costs. Example: To evaluate only the agent’s speech quality in the last 3 turns:

Enable Custom transcript scope
Add a Role filter for agent
Add a Range filter for Last 3 turns

The metric will only analyze the agent’s audio from the final 3 exchanges, ignoring user speech and earlier portions of the call.

Benefits

More accurate evaluations - Remove noise from irrelevant messages
Lower costs - Process less content per evaluation
Faster execution - Smaller context means quicker LLM responses
Targeted insights - Focus on the exact conversation segments that matter

Combine multiple filters for precise control. For example, use both a Role filter (agent only) and a Range filter (last 5 turns) to evaluate just the agent’s closing performance.

Composite Evaluation

Purpose: Evaluates a transcript against custom criteria and returns an aggregated score. It assesses each criterion and reports how many passed. When to use: Use Composite Evaluation when you need to check whether a conversation meets several requirements at once.

Use cases

Did the agent greet the customer, verify their identity, and offer a resolution?
Did the response cover all required talking points?
Did the conversation follow each step of a compliance checklist?

Implementation

Criterion Source - Choose where your criteria come from:

From Test Case - Pulls criteria automatically from each test case’s Expected Behaviors field. This is useful when different test cases have different criteria.
Static Criteria - Define a fixed list of criteria directly on the metric. Every transcript is evaluated against the same set.

Custom Evaluation Prompt (optional) - Provide additional instructions to guide how each criterion is evaluated. This lets you tailor the evaluation context without editing the criteria. Additional Options:

Knowledge Base - Enable to give the evaluator access to your knowledge base for more informed assessments.
LLM Model - Select which model performs the evaluation.
Transcript Scope - Limit evaluation to specific portions of the transcript. See Transcript Scope for configuration details.

Results: Each run produces:

An overall score count and percentage of how many passed criteria.
A breakdown showing which criteria passed or failed with reasoning.
A summary explaining the overall evaluation.

Understanding Result Types

Each criterion is evaluated independently and returns one of three results:

Result	Meaning
MET	Clear evidence in the transcript that the criterion was satisfied
NOT_MET	Evidence that contradicts or fails to satisfy the criterion
UNKNOWN	Insufficient information to determine

Getting UNKNOWN usually means your criterion is too vague or your evaluation prompt lacks context. The evaluator cannot find sufficient evidence in the transcript to make a determination.

Writing Effective Custom Evaluation Prompts

The Custom Evaluation Prompt field controls how the evaluator interprets each criterion. A well-written prompt provides context that helps the evaluator understand your domain and make accurate determinations. Default Behavior: Without a custom prompt, the evaluator uses semantic matching to determine if each criterion was met. This works well for straightforward criteria but may return UNKNOWN for domain-specific expectations. When to Use a Custom Prompt:

Your criteria reference domain-specific terminology
You need the evaluator to understand your agent’s role or capabilities
You want to define what counts as “meeting” a criterion in your context

Custom Prompt Examples

Healthcare Scheduling Agent:

You are evaluating a healthcare scheduling assistant. The agent helps patients
book, reschedule, and cancel appointments. It has access to provider availability
and patient records.

When evaluating criteria:
- "Confirms appointment" means the agent stated the date, time, and provider name
- "Verifies patient identity" means the agent asked for date of birth or member ID
- "Offers alternatives" means the agent suggested at least one other available time slot

Banking Support Agent:

You are evaluating a banking support assistant. The agent handles account inquiries,
transaction disputes, and card services.

When evaluating criteria:
- Account verification requires confirming at least 2 identity factors
- "Explains fees" means stating the specific dollar amount and when it applies
- Security disclosures must mention fraud protection and reporting procedures

Prompt Structure Guidelines

State the agent’s role - What does the agent do? What information does it have access to?
Define ambiguous terms - What does “confirms” or “explains” mean in your context?
Set evaluation standards - What level of detail counts as meeting a criterion?

Poor prompt:

Evaluate if the agent did a good job.

Effective prompt:

You are evaluating a restaurant reservation assistant. The agent books tables,
manages waitlists, and answers questions about menu and hours.

A criterion is MET when the agent provides the specific information requested.
Partial or vague responses should be marked NOT_MET. If the conversation does
not address the topic at all, mark as UNKNOWN.

Writing Effective Criteria

The most common cause of inaccurate results is vague criteria. The evaluator uses semantic understanding, so equivalent meanings count as matches. However, it cannot infer intent from ambiguous statements. The Specificity Formula Good criteria follow this pattern: [Actor] + [Specific Action] + [Specific Information/Outcome]

Vague vs Specific Examples

Scenario	Vague (Likely UNKNOWN)	Specific (Reliable)
Appointment booking	”Agent schedules the appointment"	"Agent confirms the appointment date, time, and provider name”
Account inquiry	”Agent explains the fees"	"Agent states the monthly fee amount and when it is charged”
Password reset	”Agent helps with password"	"Agent sends a password reset link to the registered email address”
Escalation	”Agent offers to escalate"	"Agent offers to transfer to a specialist when unable to resolve the issue”

Why Vague Criteria Fail

Consider the criterion: “Agent explains the account options”This fails because:

“Account options” could mean account types, features, fees, or upgrades
The evaluator cannot determine which aspect you intended
Even if the agent discussed accounts, there’s no way to verify the specific expectation was met

Rewritten: “Agent explains the difference between checking and savings accounts, including minimum balance requirements”Now the evaluator can look for specific information about account types and balance requirements.

Balancing Specificity for Shared Test Sets

When sharing criteria between voice and chat test sets:

Focus on WHAT should happen, not HOW
- Avoid: “Agent says ‘I understand your concern’”
- Use: “Agent acknowledges the customer’s concern before proceeding”
Use outcome-based criteria
- Avoid: “Agent reads the cancellation policy”
- Use: “Agent confirms the customer understands the cancellation deadline”
Avoid modality-specific language
- Avoid: “Agent clicks the submit button”
- Use: “Agent completes the reservation request”

Using Agent Evaluation Context

Adding your agent’s system prompt or context significantly improves evaluation accuracy. The evaluator performs better when it understands what your agent is supposed to do. Navigate to Agent Settings > Evaluation Context and add:

What the agent does and what information it has access to
Key policies or procedures it should follow
How it should handle common scenarios

Example Agent Context:

This is a healthcare scheduling assistant that helps patients with:
- Booking new appointments with available providers
- Rescheduling existing appointments (requires 24-hour notice)
- Canceling appointments
- Answering questions about office locations and hours

The agent should always:
- Verify patient identity before making changes
- Confirm appointment details before finalizing
- Offer alternative times when the requested slot is unavailable

Troubleshooting UNKNOWN Results

If you’re getting UNKNOWN results:

Improve your custom prompt - Add domain context and define what “meeting” a criterion means in your use case
Check criterion specificity - Is the criterion concrete enough to verify against the transcript?
Add agent evaluation context - Does the evaluator understand what the agent is supposed to do?
Review the transcript - Is the expected information actually present in the conversation?
Split compound criteria - Break “Agent explains X and confirms Y” into two separate criteria

Tool Call Metrics

Purpose: Evaluate whether AI agent tool calls (functions) were executed correctly

Prompt Structure Template

Given the conversation transcript, [SPECIFIC TOOL CALL EVALUATION QUESTION]?

Return YES if:
• [Tool call execution criterion 1]
• [Tool call execution criterion 2]
• [Tool call execution criterion 3]

Return NO if:
• [Tool call failure condition 1]
• [Tool call failure condition 2]
• [Edge case for incorrect usage]

[CLARIFICATIONS FOR TOOL CALL CONTEXT]

Example 1: Function Call Accuracy

Given the conversation transcript, did the assistant correctly execute the search function with the appropriate parameters?

Return YES if:
• The search function was called when the user requested information lookup
• All required parameters (query, filters) were properly populated
• The function call syntax and format were correct
• The assistant used the search results appropriately in their response

Return NO if:
• The search function was called unnecessarily or at wrong times
• Required parameters were missing or incorrectly formatted
• The function call failed due to syntax errors
• The assistant ignored or misused the function results

Note: Focus on the technical execution of the tool call, not the quality of the response content.

Example 2: API Integration Validation

Given the conversation transcript, did the assistant properly use the customer lookup API when handling account inquiries?

Return YES if:
• The API was called only when customer account information was needed
• Customer identifier (email, phone, or account number) was correctly passed as parameter
• The assistant handled API response data appropriately
• Proper error handling was demonstrated if API call failed

Return NO if:
• The API was called without sufficient customer identification
• Wrong parameters were passed to the lookup function
• The assistant proceeded without waiting for API response
• API errors were not handled gracefully

Note: Evaluate the technical integration, not the customer service quality.

API State Matcher

Purpose: Evaluate the assistant by validating real-world system outcomes via an external API.

Implementation

Add the URL of the API endpoint
Select GET for simple lookups or POST if the API requires a body.
Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. Template Patterns: {{expected_output.balance}}, { "status": "success" }, completed
Match path (optional): A dot-notation path used to extract a specific field from the API response.
Timeout (optional): Maximum wait time for the API response before marking the metric as failed.
Headers (optional): Custom HTTP headers sent with the request.

Use Cases

Verify the agent produced the correct structured output.
Validate mocked API responses in simulations.
Check tool-call results in real services.

How it works

An HTTP request is sent to the specified API endpoint.
The response body is inspected (optionally at a specific JSON path).
The extracted value is compared against your Expected Body.
Returns 1 if the response body matches the expected value, otherwise returns 0.

Match Expected Simulaton Wrapper

Purpose: Evaluates an assitant by comparing data captured during the simulation against an expected value.

Instead of calling an external API like API State Matcher, this metric inspects simulation wrapper observations (pre- or post-simulation) and verifies that the recorded response matches expectations.

Implementation

Select observations (pre- or post-simulation)
Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. Template Patterns: {{expected_output.balance}}, { "status": "success" }, completed

Use cases

Verify the agent produced the correct structured output.
Validate mocked API responses in simulations.
Test tool-call results without affecting real services.

How it works

Using the wrapper observations (for example, API pre-simulation or post-simulation payloads).
This metric reads the selected observation.
Extracts a value using a match path and compares the result to the expected body.
Returns 1 if the response body matches the expected value, otherwise returns 0.

Metadata Field Metric

Purpose: Reports the value of a run’s metadata field, retrieved from the custom metadata using a specified key. The value may be a number, text, or boolean.

Implementation

Select the metadata field type: string, float, or boolean.
Input the metadata field key.

Warning: Works only if you send metadata as part of your transcripts to evaluate with Coval, this metric will take the specific metadata field’s value and output that result as a metric result.

Use Cases

Track custom business metrics (e.g. customer satisfaction scores, call type).
Monitor agent performance indicators passed through metadata.
Extract conversation context data for analysis.
Aggregate custom KPIs from your application.
Track boolean flags (e.g. escalation occurred, customer authenticated, issue resolved).

How It Works

The metric returns the exact value stored in the specified metadata field.
Automatically aggregates values across multiple conversations.
Direct field value extraction with no LLM processing required.
Supports numeric, text, and boolean metadata values.
Boolean values are output as float (0.0 for false, 1.0 for true) for proper metric aggregation.

Transcript Regex Match Metrics

Purpose: Pattern detection for exact phrase matching, compliance validation, and format verification.

Implementation

Configure the Regex Pattern field (required) and optional fields below. No text prompt is required for this metric type.

Configuration Fields

Field	Required	Default	Description
Regex Pattern	Yes	—	Regular expression pattern to match against the transcript
Role	No	All messages	Filter by speaker role: `AGENT`, `PERSONA`, `TOOL`, `SYSTEM`, or `MUSIC`
Match Mode	No	`presence`	`presence` returns 1 if pattern is found; `absence` returns 1 if pattern is NOT found
Position	No	`any`	`any` checks all messages, `first` checks only the first message, `last` checks only the last message (of the filtered role)
Case Insensitive	No	`false`	When enabled, pattern matching ignores case

Pattern Design Guidelines

Use word boundaries (\b) for exact word matching.
Enable Case Insensitive matching instead of using inline (?i) flags for clarity.
Use Position filtering instead of complex anchoring when you only care about the first or last message.
Use Absence mode for compliance rules (“agent must not say X”) instead of trying to negate patterns in regex.

Test patterns thoroughly before deployment.

Use Case Examples

Example 1: Greeting Detection

Goal: Detect if the agent uses a proper greeting phrase. Regex Pattern: \b(hello|hi|good morning|good afternoon|good evening)\b Role: AGENT Case Insensitive: Enabled Returns: 1 if greeting found, 0 if no greeting detected.

Example 2: Required Disclosure in First Message

Goal: Verify the agent states a required disclosure at the start of the conversation. Regex Pattern: this call may be recorded Role: AGENT Position: first Case Insensitive: Enabled Returns: 1 if disclosure is in the first agent message, 0 if missing.

Example 3: Prohibited Language (Compliance)

Goal: Ensure the agent never makes unauthorized promises. Regex Pattern: \b(guarantee|promise|100%|definitely)\b Role: AGENT Match Mode: absence Case Insensitive: Enabled Returns: 1 if the agent did NOT use prohibited language (pass), 0 if prohibited language was found (fail).

Example 4: Closing Statement in Last Message

Goal: Verify the agent ends the conversation with a proper closing. Regex Pattern: (goodbye|have a (great|nice|lovely) day|thank you for calling) Role: AGENT Position: last Case Insensitive: Enabled Returns: 1 if closing statement found in last agent message, 0 if missing.

Example 5: Phone Number Format Validation

Goal: Detect when the user provides a phone number in standard US format. Regex Pattern: \b\d{3}[-.]?\d{3}[-.]?\d{4}\b Role: PERSONA Returns: 1 if valid format detected, 0 if invalid or missing.

How It Works

The metric filters transcript messages by Role (if specified). If no role is set, all messages are checked.
The Position filter is applied: first keeps only the first matching message, last keeps only the last.
The Regex Pattern is matched against the filtered messages, with Case Insensitive applied if enabled.
The Match Mode determines the result:
- presence: returns 1 if the pattern was found, 0 if not.
- absence: returns 1 if the pattern was NOT found, 0 if it was.
Direct pattern matching — no LLM required, fast and deterministic.

Words Per Message (Threshold)

Purpose: Validates that all agent messages meet a configurable word count requirement. What it measures: Whether every agent message satisfies a word count condition — for example, “all messages must have fewer than 50 words” or “all messages must have at least 5 words.” When to use:

Enforcing response length guidelines (e.g., keeping answers concise)
Detecting unexpectedly short or empty responses
Validating that the agent doesn’t produce overly verbose replies

How it works: Counts words in each agent message and checks whether all messages satisfy the configured operator and threshold. Returns YES only if every message passes; NO if any message fails. How to interpret:

YES = all agent messages meet the word count condition.
NO = at least one message violated the condition. The detail view identifies which messages failed and their word counts.

Customized Audio Metrics

Custom Pause Analysis

Purpose: Measures how frequently the agent pauses mid-speech and how long those pauses are. What it measures: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. When to use:

Identifying unnatural or excessive hesitations in agent speech
Detecting processing delays that manifest as in-speech pauses
Evaluating speech fluency across different configurations

How it works: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded. How to interpret:

Lower values indicate more fluent speech.
The detail view shows each pause with its timestamp and duration.
Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.

Volume Variance

Purpose: Measures how consistently the agent maintains volume throughout the conversation. What it measures: Standard deviation of audio volume (in dB) across agent speech — lower values indicate more consistent volume. When to use:

Identifying erratic loudness changes in agent speech
Ensuring consistent audio quality across a call
Comparing voice model configurations for volume stability

How it works: Divides agent speech into fixed-length intervals and measures the volume of each. Intervals are flagged as too loud or too soft based on absolute dBFS thresholds. The primary score is the standard deviation across all intervals. To adjust sensitivity there are different thresholds available:

Preset	Loud threshold	Soft threshold
`strict`	above -3 dBFS	below -30 dBFS
`normal` (default)	above -6 dBFS	below -35 dBFS
`lenient`	above -9 dBFS	below -40 dBFS

You can also override thresholds individually with loud_threshold_db, soft_threshold_db, or interval_seconds. How to interpret:

Lower standard deviation = more consistent volume.
The detail view shows only the problematic intervals (too loud or too soft) with their timestamps and dB values.

Abrupt Pitch Changes

Purpose: Detects sudden, jittery transitions in pitch that can make speech sound unnatural. What it measures: Distinct segments where pitch changes abruptly between frames, reported as events per minute. When to use:

Detecting unnatural speech characteristics in synthesized voices
Identifying voice models with unstable or jittery pitch
Comparing voice configurations for smoothness

How it works: Compares pitch values frame-by-frame, flags frames where the change exceeds a threshold, and groups consecutive flagged frames into segments. Configuration (via metric metadata):

Parameter	Default	Description
`significant_changes_threshold_hz`	`200.0`	Minimum pitch change in Hz to consider a transition abrupt

How to interpret:

Lower values indicate smoother, more natural pitch transitions.
Higher values suggest jittery or unstable pitch.

Volume/Pitch Misalignment

Purpose: Detects moments where pitch and volume move in opposite directions, which can indicate unnatural prosody in synthesized speech. What it measures: Frames where the pitch is rising while volume is falling (or vice versa), scored by severity relative to the clip’s own baseline. When to use: Identifying unnatural-sounding speech output — for example, a voice that gets louder while its pitch drops unexpectedly, or vice versa. Useful for:

Evaluating text-to-speech engine quality
Detecting prosody issues that may sound “off” to listeners
Comparing voice model configurations

How it works: Analyzes frame-by-frame pitch and volume changes across the audio. Frames where the two signals diverge in opposite directions are flagged. Each event receives a severity score based on how unusual the divergence is relative to the rest of the clip (using z-scored magnitudes), making the metric robust across different speakers and recording conditions. Configuration (via metric metadata):

Parameter	Default	Description
`min_volume_change_for_pitch_misalignment`	`7`	Minimum intensity change (dB) required to flag a misalignment event

How to interpret: Severity scores are relative to the clip, not absolute. A higher score means both pitch and volume were moving unusually for this speaker in this recording.

Low severity (~0 – 1): Both signals are near their mean change magnitude — nothing unusual relative to the speaker’s baseline.
Medium severity (~1 – 2): One or both signals are about 1 standard deviation above their clip mean.
High severity (~2–6+): Both signals are 2+ standard deviations above their clip mean — a genuinely unusual frame.

Because severity is z-score based, values are comparable across different speakers and recording conditions.

Non-Expressive Pauses

Purpose: Identifies pauses in speech that lack preparatory pitch movement, which can make the agent sound flat or monotone. What it measures: Pauses above a minimum duration where pitch shows little variation in the frames immediately before the pause, reported as events per minute. When to use:

Evaluating whether a voice sounds expressive and natural
Detecting monotone delivery in synthesized speech
Comparing voice configurations for expressiveness

How it works: Detects pauses above a minimum duration threshold, then examines the pitch trajectory in the frames immediately preceding each pause. Pauses with minimal pitch variation beforehand are flagged as non-expressive. Configuration (via metric metadata):

Parameter	Default	Description
`min_pause_duration_seconds`	`0.6`	Minimum silence duration (s) to qualify as a pause
`pre_pause_window`	`5`	Number of 10ms frames to inspect before each pause for pitch movement

How to interpret:

Lower values indicate more expressive delivery — pitch varies naturally before pauses.
Higher values suggest a flat or robotic cadence where pauses arrive without natural pitch cues.

Vocal Fry

Purpose: Detects vocal fry — a low, creaky speech quality, typically occurring at the end of phrases. What it measures: Total time spent in vocal fry (in seconds), with additional detail on percentage of affected speech and longest continuous fry segment. When to use:

Evaluating whether a voice has creaky or rough-sounding artifacts
Monitoring vocal quality across different voice configurations
Identifying voices where fry affects listener experience

How it works: Identifies frames with simultaneously low pitch, high acoustic roughness, and irregular vocal cord vibration. Consecutive flagged frames are grouped into fry segments. Configuration (via metric metadata):

Parameter	Default	Description
`sample_rate_seconds`	`0.01`	Analysis frame rate in seconds
`pitch_floor`	`60`	Minimum pitch frequency (Hz) for detection
`pitch_ceiling`	`400`	Maximum pitch frequency (Hz) for detection
`low_pitch_threshold_multiplier`	`0.6`	Fraction of speaker’s median pitch below which a frame is considered low-pitched
`jitter_threshold_multiplier`	`2.0`	Multiple of baseline jitter above which a frame is flagged
`harmonics_to_noise_ratio_threshold_offset_db`	`-10.0`	dB offset below baseline HNR that marks a frame as noisy
`harmonics_to_noise_ratio_minimum_pitch`	`60`	Minimum pitch for HNR calculation (Hz)
`harmonics_to_noise_ratio_silence_threshold`	`0.1`	Amplitude threshold below which frames are treated as silent
`harmonics_to_noise_ratio_periods_per_window`	`1.0`	Analysis window size in pitch periods for HNR
`baseline_calculation_multiplier`	`0.8`	Fraction of median pitch used to define the “clear voice” baseline for HNR and jitter
`min_fry_segment_seconds`	`0.05`	Minimum duration (s) for a fry segment to be counted

How to interpret:

Total time in vocal fry (seconds). Lower is better.
Occasional brief fry is common in natural speech; sustained or frequent fry may reduce perceived quality.

Spectrogram Pitch Analysis

Purpose: Evaluates whether audio contains natural upper-frequency content, which is a key indicator of voice naturalness. Synthetic or bandwidth-limited audio often lacks energy in higher frequency ranges. What it measures: The fraction of upper-frequency spectrogram bins that have energy above a noise floor, averaged across analysis windows. Returns 1.0 (pass) or 0.0 (fail) based on whether the average fill ratio meets the naturalness threshold. When to use:

Detecting bandwidth-limited or muffled synthesized speech
Comparing voice model configurations for spectral richness
Identifying voices that lack harmonic upper-frequency energy

How it works: Splits the audio into fixed-length windows and computes a frequency spectrum for each. The fraction of bins in the upper frequency region that exceed the noise floor is measured per window. If the average fill ratio across all windows meets the naturalness threshold, the metric passes. Configuration (via metric metadata):

Parameter	Default	Description
`naturalness_threshold`	`0.10`	Minimum average fill ratio (0.0–1.0) to pass
`upper_region_percentage`	`0.25`	Fraction of the frequency range treated as the upper region
`noise_floor_db`	`-15.0`	dB level above which a bin counts as filled
`segment_length_seconds`	`2.0`	Duration of each analysis window

How to interpret:

1.0 = pass — average upper-frequency fill ratio meets the naturalness threshold.
0.0 = fail — audio lacks sufficient upper-frequency energy.
The detail view shows the fill ratio per window across the recording timeline.

Using Trace Context in LLM Judge Metrics

Purpose: Give an LLM Judge or Composite Evaluation metric visibility into what your agent actually did — not just what it said — by including OpenTelemetry span data alongside the transcript. When Include Traces is enabled on a custom transcript scope, the judge automatically receives a TRACE CONTEXT: block appended to its prompt. This block summarizes the OTel spans from the conversation: span names, timing windows, and key attributes like tool call names and function arguments. For Composite Evaluation metrics (Expected Behaviors), the same TRACE CONTEXT: block is appended to every per-criterion prompt — each criterion is evaluated against both the transcript and the trace spans inside the configured scope.

Walkthrough

When to Enable Include Traces

Trace context is most valuable when the behavior you want to evaluate isn’t visible in the transcript alone:

Use Case	Why Traces Help
Verify the agent used the right tools in the right order	Tool call spans show what functions were invoked and with what arguments
Catch hallucinations — agent claimed to do something it didn’t	Trace spans show whether the action actually occurred
Evaluate retrieval quality	Retrieval spans show what data was fetched before the agent responded
Assess error handling	Error spans reveal failures the agent may have silently recovered from

How to Enable

Open or create a supported metric — LLM Judge (Binary, Numerical, Categorical, or Audio) or Composite Evaluation.
Set Transcript Scope to Custom.
In the custom scope configuration panel, toggle Include Traces on.

The trace context is appended automatically — no changes to your judge prompt are required, though you can reference it explicitly for better results.

Requirements

Your agent must emit OpenTelemetry traces to Coval. See the OpenTelemetry Traces guide for setup.
The simulation must have produced trace data. If no trace data is available, the toggle has no effect and the prompt is sent without a trace context block.

Writing Prompts That Leverage Trace Context

When writing prompts for metrics with trace context enabled, reference the trace data explicitly. The judge sees a TRACE CONTEXT: block appended after the transcript — you can instruct it to reason about both sources.

Example: Verify Tool Usage

Given the transcript and trace context, did the assistant call the `lookup_account` function before providing account balance information?

Return YES if:
• The TRACE CONTEXT shows a tool call to `lookup_account` (or equivalent) occurring before the agent stated the balance
• The transcript confirms the agent provided balance details

Return NO if:
• The agent mentioned account balance information but no `lookup_account` tool call appears in the TRACE CONTEXT
• The tool call appears AFTER the agent has already stated the balance (out of order)
• The TRACE CONTEXT shows a failed or missing tool call for this operation

Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone.

Example: Catch Hallucination

Given the transcript and trace context, did the assistant accurately report what actions it took?

Return YES if:
• All actions the assistant claims to have performed appear in the TRACE CONTEXT as actual tool or function calls

Return NO if:
• The assistant stated it performed an action (e.g., "I've updated your address") but no corresponding tool call appears in the TRACE CONTEXT
• The TRACE CONTEXT shows an error or missing call for an action the assistant claimed was successful

Note: Minor phrasing differences between the transcript and trace data are acceptable — evaluate intent.

Add “Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone” to your prompt. This makes the metric degrade gracefully on simulations where traces weren’t captured.

Utilizing Attributes

You can embed dynamic values from agents, test cases, and simulations into your metric prompts using template variables. This allows you to create context-aware metrics that adapt to specific agent configurations or test case requirements. For comprehensive documentation on using attributes, including nested paths, array indexing, dynamic keys, and complete examples, see Attributes.

Advanced Prompting Techniques

1. Chain of Thought for Complex Evaluations

Before making your final determination, consider:
1. What was the user's primary goal?
2. What actions did the assistant take?
3. What was the final outcome?
4. Did the outcome match the user's goal?

Based on this analysis, did the assistant successfully resolve the user's issue?

2. Few-Shot Examples for Edge Cases

Examples of what constitutes resolution:
• User: "That fixed it, thanks!" → YES
• User: "I'll try that and call back if needed" → YES
• User: "This is too complicated, forget it" → NO
• User hangs up without confirmation → NO

Given the transcript, did the assistant successfully resolve the user's issue?

3. Hierarchical Decision Making

First, determine if the assistant attempted to address the user's concern:
- If no attempt was made → Return NO
- If attempt was made → Continue to step 2

Second, evaluate if the attempt was successful:
- If user confirmed satisfaction → Return YES
- If user remained unsatisfied → Return NO
- If outcome unclear → Return NO (err on conservative side)

Using Agent Attributes and Test Case Attributes

You can make your metric prompts more dynamic and context-aware by referencing agent attributes and test case attributes. This allows you to create metrics that evaluate agent performance against specific agent configurations or test case requirements.

Agent Attributes

Agent attributes are custom properties you define for each agent configuration. How to use agent attributes in metric prompts: Insert {{agent.attribute_name}} anywhere in your metric prompt. The system will automatically replace this placeholder with the actual attribute value from the agent being evaluated. Example 1: Business Hours Verification

Given the transcript, did the assistant provide the correct opening hours?

The correct opening hours are: {{agent.opening_hours}}

Return YES if:
• The assistant stated the opening hours as {{agent.opening_hours}}
• The assistant provided opening hours that match exactly (e.g., "9 AM to 5 PM" matches "9:00am-5:00pm")

Return NO if:
• The assistant provided different opening hours than {{agent.opening_hours}}
• The assistant claimed not to know the opening hours
• The assistant provided incorrect or conflicting information

Test Case Attributes

For a test case with attributes like:

{
  "source": "LAX",
  "destination": "SFO",
  "ticket_class": "business"
}

You could create a metric prompt:

Given the transcript, did the assistant correctly process the flight booking request?

The booking details are:
- Source: {{test_case.source}}
- Destination: {{test_case.destination}}
- Ticket Class: {{test_case.ticket_class}}

Return YES if:
• The assistant confirmed all three details correctly (source, destination, and ticket class)
• The assistant used the exact values: {{test_case.source}}, {{test_case.destination}}, and {{test_case.ticket_class}}

Return NO if:
• Any of the three details were incorrect or missing
• The assistant confused source and destination
• The assistant used a different ticket class than {{test_case.ticket_class}}

Combining Agent and Test Case Attributes

You can use both agent attributes and test case attributes in the same metric prompt to create comprehensive evaluations:

Knowledge Base Metrics

Coval allows you to connect a knowledge base (KB) to your agent and create LLM Judge metrics that use your knowledge base as context. This enables you to track accuracy on specific articles, knowledge bases, or different flows mentioned in your documentation.

Knowledge bases are configured on the agent, not the metric. See Knowledge Base for how to add entries and supported source types.

Use cases for KB metrics:

Verify agents answer questions using approved knowledge base content.
Track accuracy across different documentation sources.
Ensure compliance with specific information in FAQs, policies, or procedures.
Monitor whether agents provide consistent responses based on authoritative sources.

Pro Tip: KB metrics are particularly valuable for customer service agents, healthcare bots, or any application where accuracy against documented information is critical.

Setting Up Your Knowledge Base

Step 1: Navigate to Agent Configuration

Go to your Agent setup page
Select the agent you want to connect to a knowledge base
Scroll down to the Knowledge Base section

Step 2: Add Knowledge Base Entries Coval supports multiple knowledge base formats

Click “Add Knowledge Base Entry”
Select your file type
Upload your file (Coval will automatically parse it)
Add a descriptive name (e.g., “Hotel FAQ”, “Product Documentation”)
Optionally add tags for organization
Click “Upload”

All uploaded entries will appear in your knowledge base list, associated with the selected agent.

Creating Knowledge Base Metrics

Step 1: Create a New Metric

Navigate to the Metrics section
Click “Create New Metric”
Select Binary LLM Judge as the metric type
Name your metric (e.g., “FAQ Knowledge Base Accuracy”)

Step 2: Write Your LLM Judge Prompt

Structure your prompt to evaluate whether the agent used knowledge base information correctly: Example Prompt Structure:

Given the transcript, did the assistant answer the user's initial question accurately using information from the Hotel FAQ knowledge base?

Return YES if:
- The assistant provided specific FAQ details that are factually correct (exact addresses, dollar amounts, precise policies, named amenities)
- Core facts match the FAQ even if paraphrased (e.g., "4:00pm check-in" can be stated as "check-in at 4 PM")
- The response directly addresses the user's initial question with accurate FAQ information

Return NO if:
- The assistant provided information that contradicts the FAQ (e.g., claiming there is a pool when FAQ states there is no pool)
- The assistant gave generic responses without specific FAQ details
- The assistant fabricated information not contained in the FAQ
- The assistant claims lack of information when the FAQ contains the answer
- The initial question remains unanswered despite FAQ coverage
- The assistant provided factually incorrect information, even if detailed and specific

Return Unknown if:
- The user's question is not covered in the FAQ

**Critical: Prioritize factual accuracy over response detail. A detailed but incorrect answer must return NO.**

Best Practice: Be specific about what constitutes accurate vs. inaccurate responses based on your knowledge base. Include edge cases where the KB might not have complete information.

Step 3: Enable Knowledge Base Context Critical step: At the bottom of the metric configuration:

Locate the Knowledge Base toggle (initially disabled)
Enable the Knowledge Base option
The system will automatically include your knowledge base as context when evaluating

Critital: If you don’t enable the Knowledge Base toggle, the metric will evaluate without KB context and may produce inaccurate results.

Step 4: Save Your Metric

Review your prompt and settings
Click “Create Metric”
Your KB metric is now ready to use in simulations and conversations

Using Knowledge Base Metrics in Evaluations

In Simulations

Create or select a test set with scenarios that should use KB information
Launch a simulation (or use a template)
Select your KB accuracy metric in the metrics list
Run the simulation

In Conversations

Set your KB metric as a Default Metric to run on all incoming transcripts
Create Metric Rules to apply KB metrics conditionally
Monitor results in real-time to catch KB accuracy issues in production

Best Practices for Knowledge Base Metrics

Writing Effective Prompts

Do:

Be explicit about what information should come from the KB.
Define clear conditions for YES and NO responses.
Account for situations where the KB doesn’t have complete information.
Consider partial accuracy vs. complete inaccuracy.

Don’t:

Make assumptions about what the LLM knows without KB context.
Create overly complex evaluation criteria.

Knowledge Base Organization

Recommended structure:

Use clear, descriptive names for each KB entry.
Add tags to categorize different types of information.
Keep individual KB files focused on specific topics.
Update KB entries regularly to reflect current information.

Metric Validation and Testing

1. Metric Improvement Process

Use Coval’s “Improve Metric” feature with test transcripts.
Iterate on prompts to reduce variance.
Test edge cases and ambiguous scenarios.
Aim for >90% consistency across similar evaluations.

2. Common Issues and Solutions

Issue	Solution
Inconsistent scoring	Add more specific criteria and examples
Edge case failures	Include explicit handling for boundary conditions
LLM hallucination	Use more structured prompts with clear constraints
Low correlation	Ensure metric measures what you intend to measure

3. Performance Optimization

Keep prompts under 2,000 characters when possible.
Use regex metrics for simple pattern detection.
Combine related evaluations into single metrics when logical.
Test with diverse conversation types and lengths.

Best Practices Summary For Creating Metric Prompts

Use specific, measurable criteria.
Provide clear positive and negative examples.
Test extensively with real conversation data.
Maintain consistent terminology and structure.
Include edge case handling.

This systematic approach to metric creation will ensure reliable, actionable insights from your Coval evaluations.

Introduction

Configuration

Observability

Step-by-Step Guides

Use Cases

Connect & Collaborate

Documentation Index

​Core Principles for All Metrics

​1. Specificity Over Generality

​2. Role Consistency

​3. Deterministic Design

​Binary LLM Judge Metrics

​Prompt Structure Template

​Example 1: Issue Resolution Detection

​Example 2: Compliance Verification

​Tips and tricks

​Before (Poor Metric Example):

​After (Improved Metric Example):

​Categorical LLM Judge Metrics

​Prompt Structure Template

​Example 1: Call Intent Classification

​Example 2: Conversation Outcome Classification

​Numerical LLM Judge Metrics

​Prompt Structure Template

​Example 1: Empathy Assessment

​Example 2: Technical Accuracy Scoring

​Multimodal LLM Judge Metrics

​What Audio Metrics Can Detect

​Prompt Structure Template

​Transcript Scope for Audio Metrics

​Best Practices for Audio Metrics

​Example 1: Speech Clarity Assessment

​Example 2: Professional Tone Detection

​Example 3: Empathy Detection

​Example 4: Speaker Diarization Quality

​Common Pitfalls to Avoid

​Transcript Scope

​When to Use Transcript Scope

​Configuration Options

​Transcript Scope for Audio Metrics

​Benefits

​Composite Evaluation

​Use cases

​Implementation

​Understanding Result Types

​Writing Effective Custom Evaluation Prompts

​Writing Effective Criteria

​Using Agent Evaluation Context

​Troubleshooting UNKNOWN Results

​Tool Call Metrics

​Prompt Structure Template

​Example 1: Function Call Accuracy

​Example 2: API Integration Validation

​API State Matcher

​Implementation

​Use Cases

​How it works

​Match Expected Simulaton Wrapper

​Implementation

​Use cases

​How it works

​Metadata Field Metric

​Implementation

​Use Cases

​How It Works

​Transcript Regex Match Metrics

​Implementation

​Configuration Fields

​Pattern Design Guidelines

​Use Case Examples

​Example 1: Greeting Detection

​Example 2: Required Disclosure in First Message

​Example 3: Prohibited Language (Compliance)

​Example 4: Closing Statement in Last Message

​Example 5: Phone Number Format Validation

​How It Works

​Words Per Message (Threshold)

​Customized Audio Metrics

​Custom Pause Analysis

​Volume Variance

Core Principles for All Metrics

1. Specificity Over Generality

2. Role Consistency

3. Deterministic Design

Binary LLM Judge Metrics

Prompt Structure Template

Example 1: Issue Resolution Detection

Example 2: Compliance Verification

Tips and tricks

Before (Poor Metric Example):

After (Improved Metric Example):

Categorical LLM Judge Metrics

Prompt Structure Template

Example 1: Call Intent Classification

Example 2: Conversation Outcome Classification

Numerical LLM Judge Metrics

Prompt Structure Template

Example 1: Empathy Assessment

Example 2: Technical Accuracy Scoring

Multimodal LLM Judge Metrics

What Audio Metrics Can Detect

Prompt Structure Template

Transcript Scope for Audio Metrics

Best Practices for Audio Metrics

Example 1: Speech Clarity Assessment

Example 2: Professional Tone Detection

Example 3: Empathy Detection

Example 4: Speaker Diarization Quality

Common Pitfalls to Avoid

Transcript Scope

When to Use Transcript Scope

Configuration Options

Transcript Scope for Audio Metrics

Benefits

Composite Evaluation

Use cases

Implementation

Understanding Result Types

Writing Effective Custom Evaluation Prompts

Writing Effective Criteria

Using Agent Evaluation Context

Troubleshooting UNKNOWN Results

Tool Call Metrics

Prompt Structure Template

Example 1: Function Call Accuracy

Example 2: API Integration Validation

API State Matcher

Implementation

Use Cases

How it works

Match Expected Simulaton Wrapper

Implementation

Use cases

How it works

Metadata Field Metric

Implementation

Use Cases

How It Works

Transcript Regex Match Metrics

Implementation

Configuration Fields

Pattern Design Guidelines

Use Case Examples

Example 1: Greeting Detection

Example 2: Required Disclosure in First Message

Example 3: Prohibited Language (Compliance)

Example 4: Closing Statement in Last Message

Example 5: Phone Number Format Validation

How It Works

Words Per Message (Threshold)

Customized Audio Metrics

Custom Pause Analysis

Volume Variance

Abrupt Pitch Changes

Volume/Pitch Misalignment

Non-Expressive Pauses

Vocal Fry

Spectrogram Pitch Analysis

Using Trace Context in LLM Judge Metrics

Walkthrough