This guide provides instructions for creating high-performing custom prompting metrics in Coval’s evaluation platform. Each metric type benefits from different prompting strategies to achieve reliable, deterministic results. For help writing prompts for custom metrics, Coval offers an Optimize Metric button that improves clarity and confidence.

Core Principles for All Metrics

1. Specificity Over Generality

  • Define exact evaluation criteria rather than subjective assessments
  • Use concrete, measurable behaviors instead of abstract concepts
  • Provide clear boundary conditions for edge cases

2. Role Consistency

  • Always refer to the AI agent as “the assistant”
  • Use “the user” or “the customer” for human participants
  • Maintain consistent terminology throughout your prompts

3. Deterministic Design

  • Structure prompts to minimize LLM variance across evaluations
  • Provide explicit decision trees when possible
  • Define what constitutes partial vs. complete success

Binary LLM Judge Metrics

Purpose: Yes/No evaluations with high accuracy and consistency

Prompt Structure Template

[CONTEXT SETTING]
Given the transcript, [SPECIFIC QUESTION]?

Return YES if:
• [Explicit criterion 1]
• [Explicit criterion 2]
• [Explicit criterion 3]

Return NO if:
• [Explicit disqualifying condition 1]
• [Explicit disqualifying condition 2]
• [Edge case handling]

[CLARIFICATIONS FOR EDGE CASES]
Important: When using OR conditions, make it explicitly clear that the metric should return YES/NO if any of the conditions are met. Use “ANY of the following” language to remove ambiguity.
  • “Return x if ANY of the following apply:”
  • “[Condition] OR [Condition] OR [Condition] … “

Example 1: Issue Resolution Detection

Given the transcript, did the assistant successfully resolve the user's primary issue or concern?

Return YES if ANY of the following apply:
• The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you")
• OR the assistant provides a complete solution and the user accepts it without further objection
• OR the user indicates satisfaction with the outcome before ending the conversation
• OR the assistant completes a requested action and the user acknowledges success
• OR the user's question was fully answered and they don't ask follow-up questions about the same issue
• OR the assistant provides complete, actionable guidance and the user indicates understanding
• OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries)

Return NO if ANY of the following apply:
• The user states their issue remains unresolved
• OR the conversation ends without addressing the user's main concern
• OR the user expresses frustration or dissatisfaction with the proposed solution
• OR the assistant escalates or transfers the issue without providing any resolution attempt
• OR the user has to repeat their problem multiple times without progress
• OR the assistant admits they cannot help or solve the user's problem
• OR the user asks the same question again after receiving an answer

Example 2: Compliance Verification

Given the transcript, did the assistant properly collect all required verification information before processing the request?

Return YES if:
• The assistant gathered account number, full name, and security question answer
• All three verification elements were confirmed before proceeding
• The assistant explicitly stated verification was complete

Return NO if:
• Any of the three required elements (account number, name, security answer) were skipped
• The assistant proceeded with the request before completing verification
• Verification was attempted but failed, yet the assistant continued anyway

If the user refuses to provide verification, return NO regardless of the reason.

Tips and tricks

Recommended: Objective: “Did the assistant acknowledge the user’s concern within their first two responses?”
Avoid: Too subjective: “Did the assistant provide good customer service?”
Recommended: Singular observation:
  • Create separate metrics for separate observations, such as resolution and professionalism.
Avoid: Multiple criteria:
  • “Did the assistant resolve the issue and maintain professionalism?”
Recommended: Use of clear logical operators
  • Use AND/OR operators, ANY/ALL.
Avoid: Evaluation logic that contradicts the stated rules
  • Metrics return incorrect results when the evaluation system checks for things that shouldn’t trigger failures.
  • Such as requiring disclosure when no transfer occurred, or flagging live conversations as voicemails.

Before (Poor Metric Example):

Based on the transcript, did the customer service agent ask about
the customer's preferred contact method, current service plan, or billing preferences?

Return YES if:
All three preference items were specifically inquired about.

Return NO if:
One or more items were not asked.
Why this fails: The metric has an “OR” condition in the question but requires “AND” logic in the evaluation, creating confusion about whether one or all conditions must be met.

After (Improved Metric Example):

Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, or billing preferences?

Return YES if:
• The agent asked about preferred contact method, current service plan, AND billing preferences
• This can be in a single question (e.g., "What's your preferred contact method, current plan, and billing preference?") OR separate questions for each item

Return NO ONLY if:
• The agent failed to ask about one or more of these three specific items: contact method, service plan, or billing preferences
• Note: Focus on what the AGENT asked, not on what the customer mentioned in their response

Examples of acceptable questions:
• "How would you like us to contact you, what's your current plan, and how do you prefer to handle billing?"
• Three separate questions covering each preference
• Any variation that covers all three customer preference areas
Key improvements: Clear AND/OR operators, explicit examples, and evaluation logic that matches the stated conditions.
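The corrected AND logic above can be sketched as plain boolean code. This is a hypothetical illustration of the evaluation rule, not Coval’s implementation; the function and parameter names are invented:

```python
# Hypothetical sketch of the improved metric's logic: YES requires ALL three
# items to have been asked about (AND logic), matching the stated conditions.
def asked_all_three(asked_contact: bool, asked_plan: bool, asked_billing: bool) -> str:
    if asked_contact and asked_plan and asked_billing:
        return "YES"
    return "NO"  # NO if one or more items were not asked

print(asked_all_three(True, True, True))   # YES
print(asked_all_three(True, False, True))  # NO
```

Writing the rule out this way makes it easy to spot question/evaluation mismatches like the “Before” example, where the question used OR but the rule required AND.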

Categorical LLM Judge Metrics

Purpose: Classification into predefined, mutually exclusive custom categories.

Prompt Structure Template

Classify [SPECIFIC ASPECT] based on the conversation content.

Decision Logic:
• If [condition], classify as [CATEGORY_NAME]
• If [condition], classify as [CATEGORY_NAME]
• If [condition], classify as [CATEGORY_NAME]

Return only the exact category name.
Note: Configure the category options and their definitions in the Coval UI category menu. The categories and their descriptions are set through the platform interface, not in the prompt text.

Example 1: Call Intent Classification

Classify the primary reason for this conversation based on the user's needs and requests.

Decision Logic:
• If user mentions technical problems, errors, or "not working", classify as TECHNICAL_SUPPORT
• If user mentions money, charges, bills, or payments, classify as BILLING_INQUIRY
• If user wants to change account details or settings, classify as ACCOUNT_MANAGEMENT
• If user asks general questions without specific issues, classify as GENERAL_INFORMATION
• If user expresses dissatisfaction and requests escalation, classify as COMPLAINT_ESCALATION

Return only the exact category name.

Example 2: Conversation Outcome Classification

Classify the final outcome of this conversation based on how it concluded.

Decision Logic:
• If user explicitly confirms resolution or satisfaction, classify as RESOLVED_SUCCESSFULLY
• If solution provided but requires user action outside conversation, classify as PARTIALLY_RESOLVED
• If conversation transferred to human agent, classify as ESCALATED_TO_HUMAN
• If user ends conversation frustrated or without resolution, classify as UNRESOLVED_ABANDONED
• If user asked questions and received answers without specific problems, classify as INFORMATION_PROVIDED

Return only the exact category name.
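Because the metric must return exactly one configured category name, it can help to validate the model’s raw output downstream. A minimal sketch, using the category names from the examples above (the validation step itself is an assumption, not part of Coval’s platform):

```python
# Hypothetical post-processing: confirm the judge returned exactly one of the
# configured category names. The set below mirrors Example 1 above.
VALID_CATEGORIES = {
    "TECHNICAL_SUPPORT",
    "BILLING_INQUIRY",
    "ACCOUNT_MANAGEMENT",
    "GENERAL_INFORMATION",
    "COMPLAINT_ESCALATION",
}

def normalize_category(raw_output: str) -> str:
    # Trim whitespace the model may add around the category name.
    category = raw_output.strip()
    if category not in VALID_CATEGORIES:
        raise ValueError(f"Unexpected category: {category!r}")
    return category

print(normalize_category("  BILLING_INQUIRY\n"))  # BILLING_INQUIRY
```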

Numerical LLM Judge Metrics

Purpose: Score-based evaluations with consistent integer scaling.

Prompt Structure Template

Rate [SPECIFIC ASPECT] based on the following criteria:

Evaluation Criteria:
• [Criterion 1 with behavioral indicators]
• [Criterion 2 with behavioral indicators]
• [Criterion 3 with behavioral indicators]

Scoring Guidelines:
Low scores: [Behavioral indicators for poor performance]
High scores: [Behavioral indicators for excellent performance]

Return only the numerical score.
Note: Configure the Min and Max score values in the Coval UI, not in the prompt text. The scoring scale (e.g., 1-5, 1-10) is set through the platform interface.

Example 1: Empathy Assessment

Rate the assistant's empathy level based on the following criteria:

Evaluation Criteria:
• Acknowledgment of user emotions and concerns
• Use of appropriate empathetic language and tone indicators
• Validation of user feelings before moving to solutions
• Adaptation of communication style to user's emotional state

Scoring Guidelines:
Low scores: No empathy shown, dismissive responses, purely transactional
High scores: Clear empathetic responses, validates feelings, shows genuine concern

Return only the numerical score.

Example 2: Technical Accuracy Scoring

Rate the technical accuracy of the assistant's information based on the following criteria:

Evaluation Criteria:
• Factual correctness of all technical statements
• Completeness of technical explanations
• Appropriate level of technical detail for the context
• Identification and correction of any technical misconceptions

Scoring Guidelines:
Low scores: Major technical errors that could cause problems
High scores: Expert-level accuracy with comprehensive, precise details

Return only the numerical score.
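Since the scale is configured in the UI rather than the prompt, a downstream check that the returned score actually falls inside the configured range can catch malformed outputs. A minimal sketch, assuming a 1-5 scale (the helper and defaults are illustrative, not Coval’s API):

```python
def parse_score(raw_output: str, min_score: int = 1, max_score: int = 5) -> int:
    # The 1-5 scale here is an assumption; in Coval the scale is set in the UI.
    score = int(raw_output.strip())
    if not (min_score <= score <= max_score):
        raise ValueError(f"Score {score} outside [{min_score}, {max_score}]")
    return score

print(parse_score(" 4 "))  # 4
```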

Multimodal LLM Judge Metrics

Purpose: Include audio-specific evaluations that text analysis cannot capture. Multimodal LLM Judge metrics analyze the audio along with the transcript text. This allows you to evaluate qualities like vocal tone, speech clarity, pacing, and emotional expression that are impossible to assess from text alone.
The format of Multimodal LLM Judge metrics is the same as the LLM Judge metrics. Coval handles the audio processing automatically; your prompt should focus on what you want to evaluate, not how to process the audio.

What Audio Metrics Can Detect

Audio LLM Judge metrics excel at evaluating:
  • Speech Quality: Clarity, articulation, pronunciation, stuttering
  • Vocal Characteristics: Tone, pitch, volume consistency, speaking pace
  • Emotional Expression: Enthusiasm, frustration, sarcasm, empathy in voice
  • Professional Demeanor: Courtesy, patience, confidence, nervousness
  • Speaker Identification: Distinguishing between speakers, detecting interruptions

Prompt Structure Template

[SPECIFIC AUDIO QUESTION]

Audio Analysis Criteria:
• [Acoustic feature 1]
• [Vocal characteristic 2]
• [Speech pattern 3]

Return YES if:
• [Audio-specific condition 1]
• [Audio-specific condition 2]

Return NO if:
• [Audio-specific disqualifier 1]
• [Audio-specific disqualifier 2]

Note: [Clarification about evaluation scope]
Writing Effective Audio Prompts: Be specific about which speaker to evaluate (assistant, user, or both) and what acoustic qualities matter. Vague prompts like “Did it sound good?” produce inconsistent results.

Transcript Scope for Audio Metrics

Audio LLM Judge metrics support Transcript Scope, allowing you to evaluate only specific portions of the audio. When you apply filters (such as agent-only or last N turns), the system automatically extracts and evaluates only the corresponding audio segments. This is particularly useful for:
  • Evaluating agent speech quality without user audio
  • Focusing on closing statements or greetings
  • Reducing token costs on long recordings

Best Practices for Audio Metrics

Only use Audio LLM Judge for evaluations that require hearing the audio. If something can be determined from the transcript alone (like whether specific words were said), use a standard LLM Judge metric instead; it’s faster and more cost-effective.
Use Audio metrics for: Tone of voice, speaking pace, pronunciation clarity, emotional expression, volume issues
Use Text metrics for: Word choice, script compliance, information accuracy
Always clarify whose audio you’re evaluating:
  • “Did the assistant speak clearly…”
  • “Did the user sound frustrated…”
  • “Was there crosstalk between both speakers…”
This prevents ambiguity when multiple voices are present.
Replace subjective terms with specific, observable audio qualities:
  • Avoid “Good tone”; use “Calm, even-paced tone without audible frustration”
  • Avoid “Clear speech”; use “Words pronounced distinctly without mumbling or slurring”
  • Avoid “Professional”; use “Business-appropriate volume and pace, no sighing or dismissive inflections”
For complex evaluations, ask the model to consider specific aspects before making a determination. This improves accuracy:
Before making your determination, consider:
1. What is the overall vocal tone throughout the call?
2. Are there any moments where the tone shifts notably?
3. How would a customer likely perceive this tone?

Example 1: Speech Clarity Assessment

Did the assistant speak clearly and at an appropriate pace throughout the conversation?

Audio Analysis Criteria:
• Pronunciation clarity and articulation
• Speaking pace (not too fast or slow for comprehension)
• Volume consistency and audibility
• Absence of mumbling, slurring, or rushed speech

Return YES if:
• All words are clearly pronounced and easily understood
• Speaking pace allows for comfortable comprehension
• Volume remains consistent and audible throughout
• No instances of unclear or garbled speech

Return NO if:
• Words are frequently mumbled, slurred, or unclear
• Speaking pace is too fast or slow for easy comprehension
• Volume fluctuations make parts difficult to hear
• Any portions of speech are unintelligible due to clarity issues

Note: Focus only on the assistant's speech clarity, not content quality.

Example 2: Professional Tone Detection

Did the assistant maintain a professional vocal tone throughout the conversation?

Audio Analysis Criteria:
• Tone consistency and appropriateness for business context
• Absence of inappropriate emotional expressions (anger, frustration, sarcasm)
• Professional demeanor in vocal inflection and manner
• Respectful and courteous vocal presentation

Return YES if:
• Vocal tone remains professional and business-appropriate throughout
• No instances of unprofessional vocal expressions or attitudes
• Tone conveys respect and courtesy consistently
• Emotional responses, if any, are appropriate to the context

Return NO if:
• Vocal tone becomes unprofessional, dismissive, or inappropriate
• Clear instances of anger, frustration, or sarcasm in voice
• Tone suggests disrespect or lack of courtesy
• Emotional vocal responses inappropriate for professional context

Note: Evaluate vocal tone and manner, not the words spoken.

Example 3: Empathy Detection

Did the assistant demonstrate vocal empathy when the user expressed frustration or concern?

Audio Analysis Criteria:
• Softening of tone when user expresses negative emotions
• Appropriate pacing adjustments (slowing down to show care)
• Warm, understanding vocal quality rather than robotic or dismissive
• Verbal acknowledgments delivered with genuine-sounding concern

Return YES if:
• Assistant's tone audibly softens or warms in response to user distress
• Pacing adjusts appropriately to show the assistant is listening
• Voice conveys genuine concern rather than scripted responses
• No rushing through empathetic statements

Return NO if:
• Assistant maintains the same tone regardless of user's emotional state
• Empathetic words are delivered in a flat, robotic, or rushed manner
• Assistant sounds impatient or dismissive when user is upset
• No vocal adaptation to the user's emotional needs

Note: Evaluate the vocal delivery of empathy, not just whether empathetic words were used.

Example 4: Speaker Diarization Quality

Can the two speakers (assistant and user) be clearly distinguished throughout the recording?

Audio Analysis Criteria:
• Distinct vocal characteristics between speakers
• Clear turn-taking without excessive overlap
• Ability to attribute each utterance to the correct speaker
• Audio quality sufficient for speaker identification

Return YES if:
• Each speaker has distinguishable vocal qualities
• Turn-taking is clear with minimal confusing overlaps
• All significant utterances can be attributed to a specific speaker
• No extended portions where speaker identity is unclear

Return NO if:
• Speakers sound too similar to reliably distinguish
• Frequent overlapping speech makes attribution difficult
• Significant portions have unclear speaker identity
• Audio quality issues (echo, distortion) prevent speaker identification

Note: This metric evaluates audio clarity for speaker identification, not conversation quality.

Common Pitfalls to Avoid

Don’t mix audio and text evaluations in a single Audio metric. If you need to check both “Did they sound professional?” AND “Did they say the required disclaimer?”, create two separate metrics - an Audio metric for tone and a Text metric for the disclaimer.
  • Evaluating transcript content: Audio metrics can’t reliably assess word choice. Use a standard LLM Judge for text content.
  • Vague audio criteria: “Good voice” is subjective and inconsistent. Define specific qualities: pace, clarity, tone.
  • Missing speaker specification: Unclear whose voice to evaluate. Always specify: assistant, user, or both.
  • Combining unrelated qualities: “Clear AND professional AND empathetic” is too broad. Create separate metrics for each quality.

Transcript Scope

Purpose: Focus metric evaluation on specific portions of a conversation rather than the entire transcript. Transcript Scope allows you to filter which messages the LLM evaluates, reducing noise and improving accuracy for targeted assessments. This feature is available for all LLM Judge metrics (Binary, Numerical, Categorical) and Audio LLM Judge metrics.

When to Use Transcript Scope

  • Evaluate only agent responses: Role filter set to agent
  • Check the closing of a conversation: Range filter set to Last 3 turns
  • Assess user sentiment only: Role filter set to user
  • Focus on recent context: Range filter set to Last N messages

Configuration Options

Transcript Scope Toggle:
  • Full (default) - Evaluate the entire transcript
  • Custom - Apply filters to focus on specific messages
Available Filters:
Role filter - Limit evaluation to messages from specific speakers:
  • Agent - Only evaluate assistant/agent messages
  • User - Only evaluate user/customer messages
  • Both - Evaluate messages from both roles
This is useful when you want to assess agent behavior without user input affecting the evaluation, or vice versa.
Range filter - Limit evaluation to a specific portion of the conversation:
  • Last N turns - Evaluate only the final N message exchanges
  • First N turns - Evaluate only the opening N message exchanges
This is useful for evaluating specific phases of a conversation, such as greetings, closings, or resolution attempts.

Transcript Scope for Audio Metrics

When using Transcript Scope with Audio LLM Judge metrics, the system automatically:
  1. Filters the transcript to the selected messages
  2. Uses message timestamps to extract the corresponding audio segments
  3. Merges adjacent audio segments (within 0.5 seconds) to avoid artifacts
  4. Sends only the filtered audio to the LLM for evaluation
This enables focused audio evaluations while reducing processing time and token costs. Example: To evaluate only the agent’s speech quality in the last 3 turns:
  • Enable Custom transcript scope
  • Add a Role filter for agent
  • Add a Range filter for Last 3 turns
The metric will only analyze the agent’s audio from the final 3 exchanges, ignoring user speech and earlier portions of the call.
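The filtering and segment-merging steps above can be sketched in Python. The message and segment shapes here are invented for illustration; Coval’s internal data structures may differ, but the 0.5-second merge threshold matches the behavior described above:

```python
# Hypothetical sketch of transcript scoping and audio segment merging.
def scope_messages(messages, role=None, last_n_turns=None):
    # Keep only messages from the selected role, then take the last N.
    scoped = [m for m in messages if role is None or m["role"] == role]
    if last_n_turns is not None:
        scoped = scoped[-last_n_turns:]
    return scoped

def merge_segments(segments, gap=0.5):
    # Merge adjacent (start, end) audio segments within `gap` seconds
    # to avoid artifacts at segment boundaries.
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

msgs = [
    {"role": "agent", "text": "Hi"},
    {"role": "user", "text": "Hello"},
    {"role": "agent", "text": "How can I help?"},
    {"role": "agent", "text": "Bye"},
]
print(scope_messages(msgs, role="agent", last_n_turns=2))
print(merge_segments([(0.0, 1.2), (1.5, 3.0), (5.0, 6.0)]))  # [(0.0, 3.0), (5.0, 6.0)]
```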

Benefits

  • More accurate evaluations - Remove noise from irrelevant messages
  • Lower costs - Process less content per evaluation
  • Faster execution - Smaller context means quicker LLM responses
  • Targeted insights - Focus on the exact conversation segments that matter
Combine multiple filters for precise control. For example, use both a Role filter (agent only) and a Range filter (last 5 turns) to evaluate just the agent’s closing performance.

Composite Evaluation

Purpose: Evaluates a transcript against custom criteria and returns an aggregated score. It assesses each criterion and reports how many passed.
When to use: Use Composite Evaluation when you need to check whether a conversation meets several requirements at once.

Use cases

  • Did the agent greet the customer, verify their identity, and offer a resolution?
  • Did the response cover all required talking points?
  • Did the conversation follow each step of a compliance checklist?

Implementation

Criterion Source - Choose where your criteria come from:
  • From Test Case - Pulls criteria automatically from each test case’s Expected Behaviors field. This is useful when different test cases have different criteria.
  • Static Criteria - Define a fixed list of criteria directly on the metric. Every transcript is evaluated against the same set.
Custom Evaluation Prompt (optional) - Provide additional instructions to guide how each criterion is evaluated. This lets you tailor the evaluation context without editing the criteria. Additional Options:
  • Knowledge Base - Enable to give the evaluator access to your knowledge base for more informed assessments.
  • LLM Model - Select which model performs the evaluation.
  • Transcript Scope - Limit evaluation to specific portions of the transcript. See Transcript Scope for configuration details.
Results: Each run produces:
  • An overall score: the count and percentage of criteria that passed.
  • A breakdown showing which criteria passed or failed with reasoning.
  • A summary explaining the overall evaluation.
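The aggregation described above can be sketched as follows. The three result labels come from the docs; the exact aggregation and rounding details are assumptions for illustration:

```python
# Hypothetical sketch: aggregate per-criterion results into an overall score.
def aggregate(results):
    passed = sum(1 for r in results.values() if r == "MET")
    total = len(results)
    return {
        "passed": passed,
        "total": total,
        "percentage": round(100 * passed / total, 1) if total else 0.0,
    }

results = {
    "Agent greets the customer": "MET",
    "Agent verifies identity": "NOT_MET",
    "Agent offers a resolution": "MET",
    "Agent states office hours": "UNKNOWN",
}
print(aggregate(results))  # {'passed': 2, 'total': 4, 'percentage': 50.0}
```

Note that UNKNOWN counts against the pass rate here; treating it separately from NOT_MET in reporting is a design choice.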

Understanding Result Types

Each criterion is evaluated independently and returns one of three results:
  • MET: Clear evidence in the transcript that the criterion was satisfied
  • NOT_MET: Evidence that contradicts or fails to satisfy the criterion
  • UNKNOWN: Insufficient information to make a determination
Getting UNKNOWN usually means your criterion is too vague or your evaluation prompt lacks context. The evaluator cannot find sufficient evidence in the transcript to make a determination.

Writing Effective Custom Evaluation Prompts

The Custom Evaluation Prompt field controls how the evaluator interprets each criterion. A well-written prompt provides context that helps the evaluator understand your domain and make accurate determinations.
Default Behavior: Without a custom prompt, the evaluator uses semantic matching to determine if each criterion was met. This works well for straightforward criteria but may return UNKNOWN for domain-specific expectations.
When to Use a Custom Prompt:
  • Your criteria reference domain-specific terminology
  • You need the evaluator to understand your agent’s role or capabilities
  • You want to define what counts as “meeting” a criterion in your context
Healthcare Scheduling Agent:
You are evaluating a healthcare scheduling assistant. The agent helps patients
book, reschedule, and cancel appointments. It has access to provider availability
and patient records.

When evaluating criteria:
- "Confirms appointment" means the agent stated the date, time, and provider name
- "Verifies patient identity" means the agent asked for date of birth or member ID
- "Offers alternatives" means the agent suggested at least one other available time slot
Banking Support Agent:
You are evaluating a banking support assistant. The agent handles account inquiries,
transaction disputes, and card services.

When evaluating criteria:
- Account verification requires confirming at least 2 identity factors
- "Explains fees" means stating the specific dollar amount and when it applies
- Security disclosures must mention fraud protection and reporting procedures
A strong custom prompt should:
  1. State the agent’s role - What does the agent do? What information does it have access to?
  2. Define ambiguous terms - What does “confirms” or “explains” mean in your context?
  3. Set evaluation standards - What level of detail counts as meeting a criterion?
Poor prompt:
Evaluate if the agent did a good job.
Effective prompt:
You are evaluating a restaurant reservation assistant. The agent books tables,
manages waitlists, and answers questions about menu and hours.

A criterion is MET when the agent provides the specific information requested.
Partial or vague responses should be marked NOT_MET. If the conversation does
not address the topic at all, mark as UNKNOWN.

Writing Effective Criteria

The most common cause of inaccurate results is vague criteria. The evaluator uses semantic understanding, so equivalent meanings count as matches. However, it cannot infer intent from ambiguous statements.
The Specificity Formula: Good criteria follow this pattern: [Actor] + [Specific Action] + [Specific Information/Outcome]
  • Appointment booking: Instead of the vague “Agent schedules the appointment” (likely UNKNOWN), use “Agent confirms the appointment date, time, and provider name”
  • Account inquiry: Instead of “Agent explains the fees”, use “Agent states the monthly fee amount and when it is charged”
  • Password reset: Instead of “Agent helps with password”, use “Agent sends a password reset link to the registered email address”
  • Escalation: Instead of “Agent offers to escalate”, use “Agent offers to transfer to a specialist when unable to resolve the issue”
Consider the criterion: “Agent explains the account options”
This fails because:
  • “Account options” could mean account types, features, fees, or upgrades
  • The evaluator cannot determine which aspect you intended
  • Even if the agent discussed accounts, there’s no way to verify the specific expectation was met
Rewritten: “Agent explains the difference between checking and savings accounts, including minimum balance requirements”
Now the evaluator can look for specific information about account types and balance requirements.
When sharing criteria between voice and chat test sets:
  1. Focus on WHAT should happen, not HOW
    • Avoid: “Agent says ‘I understand your concern’”
    • Use: “Agent acknowledges the customer’s concern before proceeding”
  2. Use outcome-based criteria
    • Avoid: “Agent reads the cancellation policy”
    • Use: “Agent confirms the customer understands the cancellation deadline”
  3. Avoid modality-specific language
    • Avoid: “Agent clicks the submit button”
    • Use: “Agent completes the reservation request”

Using Agent Evaluation Context

Adding your agent’s system prompt or context significantly improves evaluation accuracy. The evaluator performs better when it understands what your agent is supposed to do. Navigate to Agent Settings > Evaluation Context and add:
  • What the agent does and what information it has access to
  • Key policies or procedures it should follow
  • How it should handle common scenarios
Example Agent Context:
This is a healthcare scheduling assistant that helps patients with:
- Booking new appointments with available providers
- Rescheduling existing appointments (requires 24-hour notice)
- Canceling appointments
- Answering questions about office locations and hours

The agent should always:
- Verify patient identity before making changes
- Confirm appointment details before finalizing
- Offer alternative times when the requested slot is unavailable

Troubleshooting UNKNOWN Results

If you’re getting UNKNOWN results:
  1. Improve your custom prompt - Add domain context and define what “meeting” a criterion means in your use case
  2. Check criterion specificity - Is the criterion concrete enough to verify against the transcript?
  3. Add agent evaluation context - Does the evaluator understand what the agent is supposed to do?
  4. Review the transcript - Is the expected information actually present in the conversation?
  5. Split compound criteria - Break “Agent explains X and confirms Y” into two separate criteria

Tool Call Metrics

Purpose: Evaluate whether AI agent tool calls (functions) were executed correctly

Prompt Structure Template

Given the conversation transcript, [SPECIFIC TOOL CALL EVALUATION QUESTION]?

Return YES if:
• [Tool call execution criterion 1]
• [Tool call execution criterion 2]
• [Tool call execution criterion 3]

Return NO if:
• [Tool call failure condition 1]
• [Tool call failure condition 2]
• [Edge case for incorrect usage]

[CLARIFICATIONS FOR TOOL CALL CONTEXT]

Example 1: Function Call Accuracy

Given the conversation transcript, did the assistant correctly execute the search function with the appropriate parameters?

Return YES if:
• The search function was called when the user requested information lookup
• All required parameters (query, filters) were properly populated
• The function call syntax and format were correct
• The assistant used the search results appropriately in their response

Return NO if:
• The search function was called unnecessarily or at wrong times
• Required parameters were missing or incorrectly formatted
• The function call failed due to syntax errors
• The assistant ignored or misused the function results

Note: Focus on the technical execution of the tool call, not the quality of the response content.

Example 2: API Integration Validation

Given the conversation transcript, did the assistant properly use the customer lookup API when handling account inquiries?

Return YES if:
• The API was called only when customer account information was needed
• Customer identifier (email, phone, or account number) was correctly passed as parameter
• The assistant handled API response data appropriately
• Proper error handling was demonstrated if API call failed

Return NO if:
• The API was called without sufficient customer identification
• Wrong parameters were passed to the lookup function
• The assistant proceeded without waiting for API response
• API errors were not handled gracefully

Note: Evaluate the technical integration, not the customer service quality.

API State Matcher

Purpose: Evaluate the assistant by validating real-world system outcomes via an external API.

Implementation

  • Add the URL of the API endpoint
  • Select GET for simple lookups or POST if the API requires a body.
  • Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. Template Patterns: {{expected_output.balance}}, { "status": "success" }, completed
  • Match path (optional): A dot-notation path used to extract a specific field from the API response.
  • Timeout (optional): Maximum wait time for the API response before marking the metric as failed.
  • Headers (optional): Custom HTTP headers sent with the request.

Use Cases

  • Verify the agent produced the correct structured output.
  • Validate mocked API responses in simulations.
  • Check tool-call results in real services.

How it works

  • An HTTP request is sent to the specified API endpoint.
  • The response body is inspected (optionally at a specific JSON path).
  • The extracted value is compared against your Expected Body.
  • Returns 1 if the response body matches the expected value, otherwise returns 0.
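The match-path extraction and comparison steps can be sketched as follows. This is an illustrative approximation of the behavior described above, not Coval’s implementation; function names are invented:

```python
import json

def extract(body, match_path=None):
    # Walk a dot-notation path like "data.status" into the response body.
    value = body
    if match_path:
        for key in match_path.split("."):
            value = value[key]
    return value

def api_state_match(response_text, expected, match_path=None):
    # Compare the (optionally extracted) value against the expected body.
    body = json.loads(response_text)
    return 1 if extract(body, match_path) == expected else 0

resp = '{"data": {"status": "completed", "balance": 42}}'
print(api_state_match(resp, "completed", "data.status"))  # 1
print(api_state_match(resp, 100, "data.balance"))         # 0
```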

Match Expected Simulation Wrapper

Purpose: Evaluates an assistant by comparing data captured during the simulation against an expected value.
Instead of calling an external API like the API State Matcher, this metric inspects simulation wrapper observations (pre- or post-simulation) and verifies that the recorded response matches expectations.

Implementation

  • Select observations (pre- or post-simulation)
  • Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. Template Patterns: {{expected_output.balance}}, { "status": "success" }, completed

Use cases

  • Verify the agent produced the correct structured output.
  • Validate mocked API responses in simulations.
  • Test tool-call results without affecting real services.

How it works

  • This metric reads the selected wrapper observation (for example, an API pre-simulation or post-simulation payload).
  • It extracts a value using the match path and compares the result to the Expected Body.
  • Returns 1 if the extracted value matches the expected value, otherwise returns 0.

Metadata Field Metric

Purpose: Reports the value of a run’s metadata field, retrieved from the custom metadata using a specified key. The value may be a number, text, or boolean.

Implementation

  1. Select the metadata field type: string, float, or boolean.
  2. Input the metadata field key.
Warning: This metric works only if you send metadata as part of the transcripts you evaluate with Coval. It reads the specified metadata field’s value and outputs it directly as the metric result.

Use Cases

  • Track custom business metrics (e.g. customer satisfaction scores, call type).
  • Monitor agent performance indicators passed through metadata.
  • Extract conversation context data for analysis.
  • Aggregate custom KPIs from your application.
  • Track boolean flags (e.g. escalation occurred, customer authenticated, issue resolved).

How It Works

  • The metric returns the exact value stored in the specified metadata field.
  • Automatically aggregates values across multiple conversations.
  • Direct field value extraction with no LLM processing required.
  • Supports numeric, text, and boolean metadata values.
  • Boolean values are output as float (0.0 for false, 1.0 for true) for proper metric aggregation.
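The extraction and boolean-to-float conversion described above can be sketched as follows. The function name and signature are illustrative assumptions, not Coval's API.

```python
def metadata_metric(transcript_metadata, key, field_type):
    """Return a metadata field's value as a metric result.
    Booleans are emitted as floats (0.0 / 1.0) so they aggregate like any numeric metric."""
    value = transcript_metadata[key]
    if field_type == "boolean":
        return 1.0 if value else 0.0
    if field_type == "float":
        return float(value)
    return str(value)  # string fields pass through unchanged

print(metadata_metric({"escalated": True}, "escalated", "boolean"))  # 1.0
print(metadata_metric({"csat": "4.5"}, "csat", "float"))             # 4.5
```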

Transcript Regex Match Metrics

Purpose: Pattern detection for exact phrase matching, compliance validation, and format verification.

Implementation

Configure the Regex Pattern field (required) and optional fields below. No text prompt is required for this metric type.

Configuration Fields

  • Regex Pattern (required): Regular expression pattern to match against the transcript.
  • Role (optional; default: all messages): Filter by speaker role: AGENT, PERSONA, TOOL, SYSTEM, or MUSIC.
  • Match Mode (optional; default: presence): presence returns 1 if the pattern is found; absence returns 1 if the pattern is NOT found.
  • Position (optional; default: any): any checks all messages, first checks only the first message, last checks only the last message (of the filtered role).
  • Case Insensitive (optional; default: false): When enabled, pattern matching ignores case.

Pattern Design Guidelines

  • Use word boundaries (\b) for exact word matching.
  • Enable Case Insensitive matching instead of using inline (?i) flags for clarity.
  • Use Position filtering instead of complex anchoring when you only care about the first or last message.
  • Use Absence mode for compliance rules (“agent must not say X”) instead of trying to negate patterns in regex.
Test patterns thoroughly before deployment.

Use Case Examples

Example 1: Greeting Detection

Goal: Detect if the agent uses a proper greeting phrase.
Regex Pattern: \b(hello|hi|good morning|good afternoon|good evening)\b
Role: AGENT
Case Insensitive: Enabled
Returns: 1 if greeting found, 0 if no greeting detected.

Example 2: Required Disclosure in First Message

Goal: Verify the agent states a required disclosure at the start of the conversation.
Regex Pattern: this call may be recorded
Role: AGENT
Position: first
Case Insensitive: Enabled
Returns: 1 if disclosure is in the first agent message, 0 if missing.

Example 3: Prohibited Language (Compliance)

Goal: Ensure the agent never makes unauthorized promises.
Regex Pattern: \b(guarantee|promise|100%|definitely)\b
Role: AGENT
Match Mode: absence
Case Insensitive: Enabled
Returns: 1 if the agent did NOT use prohibited language (pass), 0 if prohibited language was found (fail).

Example 4: Closing Statement in Last Message

Goal: Verify the agent ends the conversation with a proper closing.
Regex Pattern: (goodbye|have a (great|nice|lovely) day|thank you for calling)
Role: AGENT
Position: last
Case Insensitive: Enabled
Returns: 1 if closing statement found in last agent message, 0 if missing.

Example 5: Phone Number Format Validation

Goal: Detect when the user provides a phone number in standard US format.
Regex Pattern: \b\d{3}[-.]?\d{3}[-.]?\d{4}\b
Role: PERSONA
Returns: 1 if valid format detected, 0 if invalid or missing.

How It Works

  1. The metric filters transcript messages by Role (if specified). If no role is set, all messages are checked.
  2. The Position filter is applied: first keeps only the first matching message, last keeps only the last.
  3. The Regex Pattern is matched against the filtered messages, with Case Insensitive applied if enabled.
  4. The Match Mode determines the result:
    • presence: returns 1 if the pattern was found, 0 if not.
    • absence: returns 1 if the pattern was NOT found, 0 if it was.
  5. Direct pattern matching — no LLM required, fast and deterministic.
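The five-step pipeline above can be expressed as a short sketch. This is an illustrative approximation of the behavior, not Coval's implementation; the message format and function name are assumptions.

```python
import re

def regex_metric(messages, pattern, role=None, match_mode="presence",
                 position="any", case_insensitive=False):
    """Sketch of the regex metric pipeline: filter by role, apply the position
    filter, match the pattern, then invert the result for absence mode."""
    if role:
        messages = [m for m in messages if m["role"] == role]
    if position == "first":
        messages = messages[:1]
    elif position == "last":
        messages = messages[-1:]
    flags = re.IGNORECASE if case_insensitive else 0
    found = any(re.search(pattern, m["text"], flags) for m in messages)
    return int(found) if match_mode == "presence" else int(not found)

transcript = [
    {"role": "AGENT", "text": "Hello! This call may be recorded."},
    {"role": "PERSONA", "text": "Hi, my number is 555-123-4567."},
    {"role": "AGENT", "text": "Have a great day!"},
]
print(regex_metric(transcript, r"this call may be recorded",
                   role="AGENT", position="first", case_insensitive=True))  # 1
print(regex_metric(transcript, r"\b(guarantee|promise)\b",
                   role="AGENT", match_mode="absence"))                     # 1 (pass)
```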

Words Per Message (Threshold)

Purpose: Validates that all agent messages meet a configurable word count requirement.
What it measures: Whether every agent message satisfies a word count condition — for example, “all messages must have fewer than 50 words” or “all messages must have at least 5 words.”
When to use:
  • Enforcing response length guidelines (e.g., keeping answers concise)
  • Detecting unexpectedly short or empty responses
  • Validating that the agent doesn’t produce overly verbose replies
How it works: Counts words in each agent message and checks whether all messages satisfy the configured operator and threshold. Returns YES only if every message passes; NO if any message fails.
How to interpret:
  • YES = all agent messages meet the word count condition.
  • NO = at least one message violated the condition. The detail view identifies which messages failed and their word counts.
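The all-messages-must-pass check can be sketched as below. The function name, operator strings, and return shape are illustrative assumptions.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def words_per_message(agent_messages, op, threshold):
    """Return (passed, failures). The metric passes only if EVERY agent
    message satisfies the word-count condition; failures lists (index, count)
    so a detail view can identify which messages violated it."""
    failures = [(i, len(m.split())) for i, m in enumerate(agent_messages)
                if not OPS[op](len(m.split()), threshold)]
    return (len(failures) == 0, failures)

# "All messages must have at least 2 words" fails on the one-word reply.
ok, failed = words_per_message(["Sure, I can help.", "Done!"], ">=", 2)
print(ok, failed)  # False [(1, 1)]
```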

Customized Audio Metrics

Custom Pause Analysis

Purpose: Measures how frequently the agent pauses mid-speech and how long those pauses are.
What it measures: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration.
When to use:
  • Identifying unnatural or excessive hesitations in agent speech
  • Detecting processing delays that manifest as in-speech pauses
  • Evaluating speech fluency across different configurations
How it works: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded.
How to interpret:
  • Lower values indicate more fluent speech.
  • The detail view shows each pause with its timestamp and duration.
  • Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.
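The gap-between-segments logic can be sketched as follows, assuming speaking segments within one agent turn are given as (start, end) times in seconds. Names and the exact output fields are hypothetical.

```python
def pause_stats(segments, turn_duration_s):
    """segments: sorted (start, end) speaking spans within ONE agent turn.
    Gaps between consecutive segments count as in-speech pauses."""
    pauses = [(a_end, b_start)
              for (_, a_end), (b_start, _) in zip(segments, segments[1:])
              if b_start > a_end]
    durations = [end - start for start, end in pauses]
    total = sum(durations)
    return {
        "pauses_per_minute": len(pauses) / (turn_duration_s / 60),
        "total_pause_s": total,
        "avg_pause_s": total / len(pauses) if pauses else 0.0,
    }

# One 30-second turn with two mid-speech gaps (0.4 s and 0.8 s).
print(pause_stats([(0.0, 4.0), (4.4, 10.0), (10.8, 30.0)], 30.0))
```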

Volume Variance

Purpose: Measures how consistently the agent maintains volume throughout the conversation.
What it measures: Standard deviation of audio volume (in dB) across agent speech — lower values indicate more consistent volume.
When to use:
  • Identifying erratic loudness changes in agent speech
  • Ensuring consistent audio quality across a call
  • Comparing voice model configurations for volume stability
How it works: Divides agent speech into fixed-length intervals and measures the volume of each. Intervals are flagged as too loud or too soft based on absolute dBFS thresholds. The primary score is the standard deviation across all intervals. To adjust sensitivity, choose one of the following threshold presets:
  • strict: loud above -3 dBFS, soft below -30 dBFS
  • normal (default): loud above -6 dBFS, soft below -35 dBFS
  • lenient: loud above -9 dBFS, soft below -40 dBFS
You can also override thresholds individually with loud_threshold_db, soft_threshold_db, or interval_seconds.
How to interpret:
  • Lower standard deviation = more consistent volume.
  • The detail view shows only the problematic intervals (too loud or too soft) with their timestamps and dB values.
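Given per-interval levels in dBFS, the scoring and flagging step can be sketched as below (using the normal preset's thresholds as defaults). The function name and return shape are illustrative assumptions.

```python
from statistics import pstdev

def volume_variance(interval_dbfs, loud_db=-6.0, soft_db=-35.0):
    """interval_dbfs: mean level (dBFS) of each fixed-length agent-speech interval.
    Returns the primary score (standard deviation across intervals) plus the
    flagged problem intervals, as (index, level, reason) tuples."""
    flagged = [(i, db, "loud" if db > loud_db else "soft")
               for i, db in enumerate(interval_dbfs)
               if db > loud_db or db < soft_db]
    return pstdev(interval_dbfs), flagged

score, flags = volume_variance([-18.0, -20.0, -4.0, -19.0, -38.0])
print(round(score, 2), flags)  # flags interval 2 as loud and interval 4 as soft
```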

Abrupt Pitch Changes

Purpose: Detects sudden, jittery transitions in pitch that can make speech sound unnatural.
What it measures: Distinct segments where pitch changes abruptly between frames, reported as events per minute.
When to use:
  • Detecting unnatural speech characteristics in synthesized voices
  • Identifying voice models with unstable or jittery pitch
  • Comparing voice configurations for smoothness
How it works: Compares pitch values frame-by-frame, flags frames where the change exceeds a threshold, and groups consecutive flagged frames into segments.
Configuration (via metric metadata):
  • significant_changes_threshold_hz (default: 200.0): Minimum pitch change in Hz to consider a transition abrupt.
How to interpret:
  • Lower values indicate smoother, more natural pitch transitions.
  • Higher values suggest jittery or unstable pitch.
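The flag-then-group step can be sketched as below, given per-frame pitch values in Hz. The function name is hypothetical; only the grouping logic is illustrated.

```python
def abrupt_pitch_events(pitch_hz, duration_s, threshold_hz=200.0):
    """Flag frame-to-frame pitch jumps at or above the threshold, group runs of
    consecutive flagged frames into single events, and report events per minute."""
    flagged = [abs(b - a) >= threshold_hz for a, b in zip(pitch_hz, pitch_hz[1:])]
    # A new event starts wherever a flagged frame follows an unflagged one.
    events = sum(1 for i, f in enumerate(flagged) if f and (i == 0 or not flagged[i - 1]))
    return events / (duration_s / 60)

# Two abrupt jumps (125→380 Hz and 390→150 Hz) in a 12-second clip.
frames = [120, 125, 380, 385, 390, 150, 155, 160]
print(abrupt_pitch_events(frames, 12.0))  # 10.0 events per minute
```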

Volume/Pitch Misalignment

Purpose: Detects moments where pitch and volume move in opposite directions, which can indicate unnatural prosody in synthesized speech.
What it measures: Frames where the pitch is rising while volume is falling (or vice versa), scored by severity relative to the clip’s own baseline.
When to use: Identifying unnatural-sounding speech output — for example, a voice that gets louder while its pitch drops unexpectedly, or vice versa. Useful for:
  • Evaluating text-to-speech engine quality
  • Detecting prosody issues that may sound “off” to listeners
  • Comparing voice model configurations
How it works: Analyzes frame-by-frame pitch and volume changes across the audio. Frames where the two signals diverge in opposite directions are flagged. Each event receives a severity score based on how unusual the divergence is relative to the rest of the clip (using z-scored magnitudes), making the metric robust across different speakers and recording conditions.
Configuration (via metric metadata):
  • min_volume_change_for_pitch_misalignment (default: 7): Minimum intensity change (dB) required to flag a misalignment event.
How to interpret: Severity scores are relative to the clip, not absolute. A higher score means both pitch and volume were moving unusually for this speaker in this recording.
  • Low severity (~0 – 1): Both signals are near their mean change magnitude — nothing unusual relative to the speaker’s baseline.
  • Medium severity (~1 – 2): One or both signals are about 1 standard deviation above their clip mean.
  • High severity (~2–6+): Both signals are 2+ standard deviations above their clip mean — a genuinely unusual frame.
Because severity is z-score based, values are comparable across different speakers and recording conditions.
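The z-scored severity idea can be sketched as below, given per-frame pitch and volume deltas. This is an illustrative approximation (names and the exact severity formula are assumptions), but it shows why scores are relative to the clip's own baseline.

```python
from statistics import mean, pstdev

def misalignment_events(pitch_delta, volume_delta_db, min_volume_change_db=7.0):
    """Flag frames where pitch and volume move in opposite directions; score
    severity as the sum of z-scored change magnitudes relative to this clip."""
    p_abs = [abs(p) for p in pitch_delta]
    v_abs = [abs(v) for v in volume_delta_db]
    p_mu, p_sd = mean(p_abs), pstdev(p_abs) or 1.0  # guard against zero variance
    v_mu, v_sd = mean(v_abs), pstdev(v_abs) or 1.0
    events = []
    for i, (p, v) in enumerate(zip(pitch_delta, volume_delta_db)):
        if p * v < 0 and abs(v) >= min_volume_change_db:  # opposite directions
            severity = (abs(p) - p_mu) / p_sd + (abs(v) - v_mu) / v_sd
            events.append((i, round(severity, 2)))
    return events

# Frame 2: pitch falls sharply while volume jumps 9 dB — flagged with high severity.
print(misalignment_events([1, 2, -40, 1, 2, 1], [0.5, 0.5, 9.0, 0.5, 0.5, 0.5]))
```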

Non-Expressive Pauses

Purpose: Identifies pauses in speech that lack preparatory pitch movement, which can make the agent sound flat or monotone.
What it measures: Pauses above a minimum duration where pitch shows little variation in the frames immediately before the pause, reported as events per minute.
When to use:
  • Evaluating whether a voice sounds expressive and natural
  • Detecting monotone delivery in synthesized speech
  • Comparing voice configurations for expressiveness
How it works: Detects pauses above a minimum duration threshold, then examines the pitch trajectory in the frames immediately preceding each pause. Pauses with minimal pitch variation beforehand are flagged as non-expressive.
Configuration (via metric metadata):
  • min_pause_duration_seconds (default: 0.6): Minimum silence duration (s) to qualify as a pause.
  • pre_pause_window (default: 5): Number of 10 ms frames to inspect before each pause for pitch movement.
How to interpret:
  • Lower values indicate more expressive delivery — pitch varies naturally before pauses.
  • Higher values suggest a flat or robotic cadence where pauses arrive without natural pitch cues.

Vocal Fry

Purpose: Detects vocal fry — a low, creaky speech quality, typically occurring at the end of phrases.
What it measures: Total time spent in vocal fry (in seconds), with additional detail on percentage of affected speech and longest continuous fry segment.
When to use:
  • Evaluating whether a voice has creaky or rough-sounding artifacts
  • Monitoring vocal quality across different voice configurations
  • Identifying voices where fry affects listener experience
How it works: Identifies frames with simultaneously low pitch, high acoustic roughness, and irregular vocal cord vibration. Consecutive flagged frames are grouped into fry segments.
Configuration (via metric metadata):
  • sample_rate_seconds (default: 0.01): Analysis frame rate in seconds.
  • pitch_floor (default: 60): Minimum pitch frequency (Hz) for detection.
  • pitch_ceiling (default: 400): Maximum pitch frequency (Hz) for detection.
  • low_pitch_threshold_multiplier (default: 0.6): Fraction of the speaker’s median pitch below which a frame is considered low-pitched.
  • jitter_threshold_multiplier (default: 2.0): Multiple of baseline jitter above which a frame is flagged.
  • harmonics_to_noise_ratio_threshold_offset_db (default: -10.0): dB offset below baseline HNR that marks a frame as noisy.
  • harmonics_to_noise_ratio_minimum_pitch (default: 60): Minimum pitch for HNR calculation (Hz).
  • harmonics_to_noise_ratio_silence_threshold (default: 0.1): Amplitude threshold below which frames are treated as silent.
  • harmonics_to_noise_ratio_periods_per_window (default: 1.0): Analysis window size in pitch periods for HNR.
  • baseline_calculation_multiplier (default: 0.8): Fraction of median pitch used to define the “clear voice” baseline for HNR and jitter.
  • min_fry_segment_seconds (default: 0.05): Minimum duration (s) for a fry segment to be counted.
How to interpret:
  • Total time in vocal fry (seconds). Lower is better.
  • Occasional brief fry is common in natural speech; sustained or frequent fry may reduce perceived quality.

Spectrogram Pitch Analysis

Purpose: Evaluates whether audio contains natural upper-frequency content, which is a key indicator of voice naturalness. Synthetic or bandwidth-limited audio often lacks energy in higher frequency ranges.
What it measures: The fraction of upper-frequency spectrogram bins that have energy above a noise floor, averaged across analysis windows. Returns 1.0 (pass) or 0.0 (fail) based on whether the average fill ratio meets the naturalness threshold.
When to use:
  • Detecting bandwidth-limited or muffled synthesized speech
  • Comparing voice model configurations for spectral richness
  • Identifying voices that lack harmonic upper-frequency energy
How it works: Splits the audio into fixed-length windows and computes a frequency spectrum for each. The fraction of bins in the upper frequency region that exceed the noise floor is measured per window. If the average fill ratio across all windows meets the naturalness threshold, the metric passes.
Configuration (via metric metadata):
  • naturalness_threshold (default: 0.10): Minimum average fill ratio (0.0–1.0) to pass.
  • upper_region_percentage (default: 0.25): Fraction of the frequency range treated as the upper region.
  • noise_floor_db (default: -15.0): dB level above which a bin counts as filled.
  • segment_length_seconds (default: 2.0): Duration of each analysis window.
How to interpret:
  • 1.0 = pass — average upper-frequency fill ratio meets the naturalness threshold.
  • 0.0 = fail — audio lacks sufficient upper-frequency energy.
  • The detail view shows the fill ratio per window across the recording timeline.
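The fill-ratio computation can be sketched as below, assuming the spectrum has already been computed so each window is a list of per-bin dB levels (lowest to highest frequency). The function name and input format are illustrative assumptions.

```python
def spectrogram_pitch_score(window_bins_db, naturalness_threshold=0.10,
                            upper_region_percentage=0.25, noise_floor_db=-15.0):
    """window_bins_db: per analysis window, the dB level of each frequency bin
    ordered from lowest to highest. Pass/fail on the average upper-region fill."""
    ratios = []
    for bins in window_bins_db:
        upper = bins[int(len(bins) * (1 - upper_region_percentage)):]
        ratios.append(sum(1 for db in upper if db > noise_floor_db) / len(upper))
    avg = sum(ratios) / len(ratios)
    return 1.0 if avg >= naturalness_threshold else 0.0

# Two windows of 8 bins each; the upper region is the top 2 bins per window.
rich = [[-5, -5, -5, -5, -5, -5, -10, -12], [-5, -5, -5, -5, -5, -5, -10, -20]]
muffled = [[-5, -5, -5, -5, -5, -5, -40, -40]] * 2
print(spectrogram_pitch_score(rich), spectrogram_pitch_score(muffled))  # 1.0 0.0
```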

Using Trace Context in LLM Judge Metrics

Purpose: Give an LLM Judge metric visibility into what your agent actually did — not just what it said — by including OpenTelemetry span data alongside the transcript. When Include Traces is enabled on a custom transcript scope, the judge automatically receives a TRACE CONTEXT: block appended to its prompt. This block summarizes the OTel spans from the conversation: span names, timing windows, and key attributes like tool call names and function arguments.

Walkthrough

When to Enable Include Traces

Trace context is most valuable when the behavior you want to evaluate isn’t visible in the transcript alone:
  • Verify the agent used the right tools in the right order: tool call spans show what functions were invoked and with what arguments.
  • Catch hallucinations (the agent claimed to do something it didn’t): trace spans show whether the action actually occurred.
  • Evaluate retrieval quality: retrieval spans show what data was fetched before the agent responded.
  • Assess error handling: error spans reveal failures the agent may have silently recovered from.

How to Enable

  1. Open or create an LLM Judge metric (Binary, Numerical, Categorical, or Audio).
  2. Set Transcript Scope to Custom.
  3. In the custom scope configuration panel, toggle Include Traces on.
The trace context is appended automatically — no changes to your judge prompt are required, though you can reference it explicitly for better results.

Requirements

  • Your agent must emit OpenTelemetry traces to Coval. See the OpenTelemetry Traces guide for setup.
  • The simulation must have produced trace data. If no trace data is available, the toggle has no effect and the prompt is sent without a trace context block.

Writing Prompts That Leverage Trace Context

When writing prompts for metrics with trace context enabled, reference the trace data explicitly. The judge sees a TRACE CONTEXT: block appended after the transcript — you can instruct it to reason about both sources.

Example: Verify Tool Usage

Given the transcript and trace context, did the assistant call the `lookup_account` function before providing account balance information?

Return YES if:
• The TRACE CONTEXT shows a tool call to `lookup_account` (or equivalent) occurring before the agent stated the balance
• The transcript confirms the agent provided balance details

Return NO if:
• The agent mentioned account balance information but no `lookup_account` tool call appears in the TRACE CONTEXT
• The tool call appears AFTER the agent has already stated the balance (out of order)
• The TRACE CONTEXT shows a failed or missing tool call for this operation

Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone.

Example: Catch Hallucination

Given the transcript and trace context, did the assistant accurately report what actions it took?

Return YES if:
• All actions the assistant claims to have performed appear in the TRACE CONTEXT as actual tool or function calls

Return NO if:
• The assistant stated it performed an action (e.g., "I've updated your address") but no corresponding tool call appears in the TRACE CONTEXT
• The TRACE CONTEXT shows an error or missing call for an action the assistant claimed was successful

Note: Minor phrasing differences between the transcript and trace data are acceptable — evaluate intent.
Add “Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone” to your prompt. This makes the metric degrade gracefully on simulations where traces weren’t captured.

Utilizing Attributes

You can embed dynamic values from agents, test cases, and simulations into your metric prompts using template variables. This allows you to create context-aware metrics that adapt to specific agent configurations or test case requirements. For comprehensive documentation on using attributes, including nested paths, array indexing, dynamic keys, and complete examples, see Attributes.
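The substitution behavior can be sketched as a simple template renderer over a nested context. This is an illustrative approximation of how placeholders resolve, not Coval's implementation.

```python
import re

def render_prompt(template, context):
    """Replace {{path.to.value}} placeholders with values from a nested context dict."""
    def resolve(match):
        value = context
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", resolve, template)

context = {"agent": {"opening_hours": "9 AM to 5 PM"},
           "test_case": {"destination": "SFO"}}
print(render_prompt("Hours: {{agent.opening_hours}}, flying to {{test_case.destination}}",
                    context))
# Hours: 9 AM to 5 PM, flying to SFO
```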

Advanced Prompting Techniques

1. Chain of Thought for Complex Evaluations

Before making your final determination, consider:
1. What was the user's primary goal?
2. What actions did the assistant take?
3. What was the final outcome?
4. Did the outcome match the user's goal?

Based on this analysis, did the assistant successfully resolve the user's issue?

2. Few-Shot Examples for Edge Cases

Examples of what constitutes resolution:
• User: "That fixed it, thanks!" → YES
• User: "I'll try that and call back if needed" → YES
• User: "This is too complicated, forget it" → NO
• User hangs up without confirmation → NO

Given the transcript, did the assistant successfully resolve the user's issue?

3. Hierarchical Decision Making

First, determine if the assistant attempted to address the user's concern:
- If no attempt was made → Return NO
- If attempt was made → Continue to step 2

Second, evaluate if the attempt was successful:
- If user confirmed satisfaction → Return YES
- If user remained unsatisfied → Return NO
- If outcome unclear → Return NO (err on conservative side)

Using Agent Attributes and Test Case Attributes

You can make your metric prompts more dynamic and context-aware by referencing agent attributes and test case attributes. This allows you to create metrics that evaluate agent performance against specific agent configurations or test case requirements.

Agent Attributes

Agent attributes are custom properties you define for each agent configuration.
How to use agent attributes in metric prompts: Insert {{agent.attribute_name}} anywhere in your metric prompt. The system will automatically replace this placeholder with the actual attribute value from the agent being evaluated.
Example 1: Business Hours Verification
Given the transcript, did the assistant provide the correct opening hours?

The correct opening hours are: {{agent.opening_hours}}

Return YES if:
• The assistant stated the opening hours as {{agent.opening_hours}}
• The assistant provided opening hours that match exactly (e.g., "9 AM to 5 PM" matches "9:00am-5:00pm")

Return NO if:
• The assistant provided different opening hours than {{agent.opening_hours}}
• The assistant claimed not to know the opening hours
• The assistant provided incorrect or conflicting information

Test Case Attributes

For a test case with attributes like:
{
  "source": "LAX",
  "destination": "SFO",
  "ticket_class": "business"
}
You could create a metric prompt:
Given the transcript, did the assistant correctly process the flight booking request?

The booking details are:
- Source: {{test_case.source}}
- Destination: {{test_case.destination}}
- Ticket Class: {{test_case.ticket_class}}

Return YES if:
• The assistant confirmed all three details correctly (source, destination, and ticket class)
• The assistant used the exact values: {{test_case.source}}, {{test_case.destination}}, and {{test_case.ticket_class}}

Return NO if:
• Any of the three details were incorrect or missing
• The assistant confused source and destination
• The assistant used a different ticket class than {{test_case.ticket_class}}

Combining Agent and Test Case Attributes

You can use both agent attributes and test case attributes in the same metric prompt to create comprehensive evaluations:

Knowledge Base Metrics

Coval allows you to connect a knowledge base (KB) to your agent and create LLM Judge metrics that use your knowledge base as context. This enables you to track accuracy on specific articles, knowledge bases, or different flows mentioned in your documentation. Use cases for KB metrics:
  • Verify agents answer questions using approved knowledge base content.
  • Track accuracy across different documentation sources.
  • Ensure compliance with specific information in FAQs, policies, or procedures.
  • Monitor whether agents provide consistent responses based on authoritative sources.
Pro Tip: KB metrics are particularly valuable for customer service agents, healthcare bots, or any application where accuracy against documented information is critical.

Setting Up Your Knowledge Base

Step 1: Navigate to Agent Configuration
  1. Go to your Agent setup page
  2. Select the agent you want to connect to a knowledge base
  3. Scroll down to the Knowledge Base section
Step 2: Add Knowledge Base Entries

Coval supports multiple knowledge base formats:
  1. Click “Add Knowledge Base Entry”
  2. Select your file type
  3. Upload your file (Coval will automatically parse it)
  4. Add a descriptive name (e.g., “Hotel FAQ”, “Product Documentation”)
  5. Optionally add tags for organization
  6. Click “Upload”
All uploaded entries will appear in your knowledge base list, associated with the selected agent.

Creating Knowledge Base Metrics

Step 1: Create a New Metric
  1. Navigate to the Metrics section
  2. Click “Create New Metric”
  3. Select Binary LLM Judge as the metric type
  4. Name your metric (e.g., “FAQ Knowledge Base Accuracy”)

Step 2: Write Your LLM Judge Prompt

Structure your prompt to evaluate whether the agent used knowledge base information correctly.
Example Prompt Structure:
Given the transcript, did the assistant answer the user's initial question accurately using information from the Hotel FAQ knowledge base?

Return YES if:
- The assistant provided specific FAQ details that are factually correct (exact addresses, dollar amounts, precise policies, named amenities)
- Core facts match the FAQ even if paraphrased (e.g., "4:00pm check-in" can be stated as "check-in at 4 PM")
- The response directly addresses the user's initial question with accurate FAQ information

Return NO if:
- The assistant provided information that contradicts the FAQ (e.g., claiming there is a pool when FAQ states there is no pool)
- The assistant gave generic responses without specific FAQ details
- The assistant fabricated information not contained in the FAQ
- The assistant claims lack of information when the FAQ contains the answer
- The initial question remains unanswered despite FAQ coverage
- The assistant provided factually incorrect information, even if detailed and specific

Return Unknown if:
- The user's question is not covered in the FAQ

**Critical: Prioritize factual accuracy over response detail. A detailed but incorrect answer must return NO.**
Best Practice: Be specific about what constitutes accurate vs. inaccurate responses based on your knowledge base. Include edge cases where the KB might not have complete information.
Step 3: Enable Knowledge Base Context

Critical step: At the bottom of the metric configuration:
  1. Locate the Knowledge Base toggle (initially disabled)
  2. Enable the Knowledge Base option
  3. The system will automatically include your knowledge base as context when evaluating
Critical: If you don’t enable the Knowledge Base toggle, the metric will evaluate without KB context and may produce inaccurate results.

Step 4: Save Your Metric

  1. Review your prompt and settings
  2. Click “Create Metric”
  3. Your KB metric is now ready to use in simulations and monitoring

Using Knowledge Base Metrics in Evaluations

In Simulations
  1. Create or select a test set with scenarios that should use KB information
  2. Launch a simulation (or use a template)
  3. Select your KB accuracy metric in the metrics list
  4. Run the simulation
In Monitoring
  1. Set your KB metric as a Default Metric to run on all incoming transcripts
  2. Create Metric Rules to apply KB metrics conditionally
  3. Monitor results in real-time to catch KB accuracy issues in production

Best Practices for Knowledge Base Metrics

Writing Effective Prompts

Do:
  • Be explicit about what information should come from the KB.
  • Define clear conditions for YES and NO responses.
  • Account for situations where the KB doesn’t have complete information.
  • Consider partial accuracy vs. complete inaccuracy.
Don’t:
  • Make assumptions about what the LLM knows without KB context.
  • Create overly complex evaluation criteria.

Knowledge Base Organization

Recommended structure:
  • Use clear, descriptive names for each KB entry.
  • Add tags to categorize different types of information.
  • Keep individual KB files focused on specific topics.
  • Update KB entries regularly to reflect current information.

Metric Validation and Testing

1. Metric Improvement Process

  • Use Coval’s “Improve Metric” feature with test transcripts.
  • Iterate on prompts to reduce variance.
  • Test edge cases and ambiguous scenarios.
  • Aim for >90% consistency across similar evaluations.

2. Common Issues and Solutions

  • Inconsistent scoring: Add more specific criteria and examples.
  • Edge case failures: Include explicit handling for boundary conditions.
  • LLM hallucination: Use more structured prompts with clear constraints.
  • Low correlation: Ensure the metric measures what you intend to measure.

3. Performance Optimization

  • Keep prompts under 2,000 characters when possible.
  • Use regex metrics for simple pattern detection.
  • Combine related evaluations into single metrics when logical.
  • Test with diverse conversation types and lengths.

Best Practices Summary For Creating Metric Prompts

  • Use specific, measurable criteria.
  • Provide clear positive and negative examples.
  • Test extensively with real conversation data.
  • Maintain consistent terminology and structure.
  • Include edge case handling.
This systematic approach to metric creation will ensure reliable, actionable insights from your Coval evaluations.