What is a metric?
Metrics give you quantitative insights into your agent’s performance, allowing you to see red flags early and understand overall trends. Each metric assesses your agent in a different way. Audio metrics use recordings, either simulated or live, to detect interruptions, measure phonemes per second, assess latency, and more. LLM Judge metrics provide answers to specific questions you have about your transcripts, allowing you to dial in on your unique specifications. LLM Judge metrics can optionally include Trace Context: when enabled, the judge automatically receives a summary of the agent’s OpenTelemetry spans alongside the transcript, enabling evaluation of tool usage, execution order, and behavior that isn’t visible in the transcript alone. Other offerings include Sentiment Analysis, Regex Matching, and many more.

While Coval provides built-in metrics (latency, accuracy, tool-call effectiveness, instruction compliance), you can create custom metrics tailored to your specific needs. All out-of-the-box metrics are marked as “Built-in” in your Metrics list. These metrics can be applied to Simulated Conversations as well as Live-Monitoring Conversations.

Recommended Metrics
If you want to use built-in metrics as a starting point, these are the ones we usually recommend:
- Conversational LLM Judge (Binary) Metrics:
- Composite Evaluation
- Agent Repeats Itself
- End Reason
- Audio Metrics:
- Latency
- Interruption Rate
- Speech Tempo
- Natural Non-Robotic Tone Detection
- Volume/Pitch Misalignment
- Other:
- Workflow Verification:
- You can generate a workflow in the Agent creation flow; this metric re-traces the workflow in the transcript and detects off-path behavior.
Advanced Metrics:
For when you want to evaluate specific parts of the conversation:
- Binary Tool Call metrics:
- Check if your tool calls (functions) have been performed correctly
- Audio LLM Judge:
- Ask the LLM Judge a question and, instead of evaluating the transcript, it evaluates the audio (e.g. “Did the assistant stutter?”)
- Categorical metrics:
- Define a set of categories to classify the topics of your conversations (good for exploratory call analysis)
- Transcript Regex Match:
- A metric that performs regex pattern matching on conversation transcripts. Returns 1 for a match and 0 for no match. You can filter by speaker role, check only the first or last message, require that a pattern is absent (for compliance rules like “agent must not say X”), and enable case-insensitive matching. Ideal for exact phrase detection, compliance checks, and format validation without needing LLM calls.
- Numerical LLM Judge:
- A metric that uses an LLM judge to evaluate a prompt and output a numerical score.
- Tool Call Latency:
- Used to measure the latency of tool calls.
- Metadata Field Metric (Conversations-only):
- If you send metadata as part of your transcripts to evaluate with Coval, this metric will take the specific metadata field’s value and output that result as a metric result. Supports string, float, and boolean field types.
- Custom Trace Metric:
- Extract a specific numerical value from your agent’s OpenTelemetry spans and aggregate it (average, median, p90, max, min) across all matching spans in a simulation. Use this to track custom latency signals, confidence scores, tool call durations, or any other numerical attribute your agent emits. See the Custom Trace Metrics guide for details.
- Custom: If you have your own metrics that you want to upload to the Coval platform to run next to our built-in metrics, let us know.
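To make the Transcript Regex Match options described above concrete, here is a minimal sketch of how such a metric could behave. The function name, message shape, and parameter names are illustrative assumptions for this sketch, not Coval’s actual implementation:

```python
import re

def transcript_regex_metric(transcript, pattern, *, role=None, scope="all",
                            require_absent=False, case_insensitive=False):
    """Return 1 if the regex rule passes on the transcript, else 0.

    transcript: list of {"role": str, "content": str} messages (assumed shape).
    scope: "all", "first", or "last" message, applied after role filtering.
    require_absent: pass only if the pattern matches nowhere, for compliance
    rules like "agent must not say X".
    """
    flags = re.IGNORECASE if case_insensitive else 0
    messages = [m for m in transcript if role is None or m["role"] == role]
    if scope == "first":
        messages = messages[:1]
    elif scope == "last":
        messages = messages[-1:]
    matched = any(re.search(pattern, m["content"], flags) for m in messages)
    # Pass when presence/absence of a match agrees with the rule.
    return int(matched != require_absent)
```

For example, a compliance rule could assert that the assistant never says “guarantee” by setting `require_absent=True` with `role="assistant"`.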
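The Metadata Field Metric described above essentially passes a field of your transcript metadata through as a metric result. A minimal sketch of that behavior, with an assumed dict-shaped metadata payload and a hypothetical function name:

```python
def metadata_field_metric(metadata, field):
    """Surface one metadata field's value directly as a metric result.

    Only the supported field types (string, float, boolean) are passed
    through; anything else yields no result.
    """
    value = metadata.get(field)
    if isinstance(value, (str, float, bool)):
        return value
    return None
```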
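The Custom Trace Metric described above collects one numerical attribute from matching spans and reduces it with an aggregation such as average, median, p90, max, or min. A sketch of that aggregation step, assuming spans are dicts with an `attributes` map (the shape and function name are illustrative, not Coval’s API):

```python
import math
import statistics

def aggregate_span_attribute(spans, attribute, method="average"):
    """Collect a numeric attribute from spans and aggregate it."""
    values = [s["attributes"][attribute] for s in spans
              if attribute in s.get("attributes", {})]
    if not values:
        return None  # no matching spans emitted this attribute
    if method == "average":
        return statistics.mean(values)
    if method == "median":
        return statistics.median(values)
    if method == "p90":
        # Nearest-rank 90th percentile.
        ordered = sorted(values)
        idx = max(0, math.ceil(0.9 * len(ordered)) - 1)
        return ordered[idx]
    if method == "max":
        return max(values)
    if method == "min":
        return min(values)
    raise ValueError(f"unknown aggregation: {method}")
```

This is the kind of reduction you would use to track, say, a custom `tool.duration_ms` attribute across all tool-call spans in a simulation.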
Guide to Creating Binary LLM Judge Metrics
When creating metrics that use an LLM to evaluate performance:
- Be precise in your descriptions
- Always refer to the agent as “the assistant” for clarity
- Provide clear guidance on evaluation criteria
Example: “Avoid Unresponsiveness” Metric:

Given the transcript, did the assistant maintain responsiveness by acknowledging all user inputs and avoiding behaviors that make the user question whether the assistant is still present?

Return YES if:
- The assistant responds promptly and appropriately to all user inputs
- There are no long silences, skipped questions, or ignored user messages
- The user does not need to ask “Are you still there?” or similar prompts
- If the assistant is uncertain or processing, it states that clearly (e.g., “Let me check that for you”)

Return NO if:
- The assistant fails to respond to a user input
- The user asks “Are you still there?” or expresses concern about being ignored
- The assistant gets stuck or goes silent without explanation
Improve your Metrics
To refine a metric, open it from the metrics list and click “Improve Metric.” Select a test set; it must be a transcript (tip: copy/paste a simulated transcript into a new set). You can then iterate on the metric’s formulation and see how often it returns YES vs. NO. This helps reduce noise and non-determinism in LLM-judge metrics.

Custom Metrics
Need custom metrics tailored to your needs? Contact us, and we’ll create them for you.

