# Coval > Coval is the reliability infrastructure for conversational AI agents. Evaluate voice and text AI agents by running simulated conversations and measuring performance with metrics. Supports inbound voice, outbound voice, chat, SMS, WebSocket, Pipecat, and LiveKit agent types. ## Getting Started - [Welcome](https://docs.coval.dev/getting-started/welcome): What Coval does and who it's built for - [Quick Start](https://docs.coval.dev/getting-started/quick-start): Set up your first evaluation in 5 minutes - [GitHub Actions](https://docs.coval.dev/getting-started/github-actions-tutorial): CI/CD integration for automated evaluations on every PR ## Concepts - [Agents](https://docs.coval.dev/concepts/agents/overview): Connect voice and chat AI agents to Coval for evaluation - [Mutations](https://docs.coval.dev/concepts/agents/mutations): Test agent configuration variants side-by-side (A/B testing) - [Attributes](https://docs.coval.dev/concepts/attributes/overview): Tag and filter resources with custom attributes - [Personas](https://docs.coval.dev/concepts/personas/overview): Configure simulated callers with voice, accent, behavior, and background noise - [Test Sets](https://docs.coval.dev/concepts/test-sets/overview): Define scenarios, transcripts, scripts, and expected behaviors for evaluation - [Metrics](https://docs.coval.dev/concepts/metrics/overview): Measure agent performance with LLM judges, audio analysis, regex, and tool call checks - [Built-in Metrics](https://docs.coval.dev/concepts/metrics/built-in-metrics): Pre-built metrics for latency, interruptions, sentiment, call resolution, and audio quality - [Metric Prompting](https://docs.coval.dev/concepts/metrics/prompting): Writing effective LLM judge prompts and expected behaviors - [Metric Chaining](https://docs.coval.dev/concepts/metrics/MetricChaining): Chain metrics for complex multi-step evaluation logic - [Human Review](https://docs.coval.dev/concepts/metrics/human-review/human-review): Human-in-the-loop review for metric calibration - [Templates](https://docs.coval.dev/concepts/templates/overview): Reusable evaluation configs bundling agent, test set, persona, and metrics - [Simulations](https://docs.coval.dev/concepts/simulations/overview): Launch and analyze simulated conversations between your agent and personas - [Multi-Run Analysis](https://docs.coval.dev/concepts/simulations/multi-run-analysis): Compare results across multiple evaluation runs - [OpenTelemetry Traces](https://docs.coval.dev/concepts/simulations/traces/opentelemetry): Correlate evaluation results with production traces ## Agent Connections - [Inbound Voice](https://docs.coval.dev/concepts/agents/connections/inbound-voice): Test agents that receive incoming phone calls - [Outbound Voice](https://docs.coval.dev/concepts/agents/connections/outbound-voice): Test agents that make outbound calls - [Chat (OpenAI Endpoint)](https://docs.coval.dev/concepts/agents/connections/openai-endpoint): Connect OpenAI-compatible chat APIs - [Chat WebSocket](https://docs.coval.dev/concepts/agents/connections/chat-websocket): Text chat over persistent WebSocket connections - [Pipecat](https://docs.coval.dev/concepts/agents/connections/pipecat): Integrate with Pipecat Cloud agents - [LiveKit](https://docs.coval.dev/concepts/agents/connections/livekit): Real-time communication platform integration - [WebSocket](https://docs.coval.dev/concepts/agents/connections/websocket): Generic WebSocket agent connection ## Observability - 
[Dashboards](https://docs.coval.dev/concepts/dashboard/overview): Visualize performance trends across evaluation runs - [Monitoring](https://docs.coval.dev/concepts/monitoring/overview): Evaluate production conversations with live monitoring - [Improving Metrics with Human Review](https://docs.coval.dev/guides/improving-metrics-with-human-review): Calibrate metrics using human feedback loops ## Guides - [Inbound Voice Simulations](https://docs.coval.dev/guides/simulations/inbound-voice): Step-by-step guide for inbound voice evaluations - [Chat Simulations](https://docs.coval.dev/guides/simulations/chat): Step-by-step guide for chat agent evaluations - [SMS Simulations](https://docs.coval.dev/guides/simulations/sms): Step-by-step guide for SMS agent evaluations - [Outbound Voice](https://docs.coval.dev/guides/outbound-voice): Guide for outbound voice agent testing - [API Keys](https://docs.coval.dev/guides/api-keys): Managing API keys for programmatic access - [Scheduled Runs](https://docs.coval.dev/guides/scheduled-runs): Set up recurring automated evaluations - [Webhooks](https://docs.coval.dev/guides/webhooks): Receive real-time notifications for run events - [Observability](https://docs.coval.dev/guides/observability): OpenTelemetry traces and production monitoring setup - [Human Review API](https://docs.coval.dev/guides/human-review-api): Human-in-the-loop review workflows via API ## CLI - [Overview](https://docs.coval.dev/cli/overview): Command-line interface for evaluation, scripting, and CI/CD - [Installation](https://docs.coval.dev/cli/installation): Install via Homebrew, Cargo, or binary download - [Agents](https://docs.coval.dev/cli/agents): Create, list, update, delete agents - [Runs](https://docs.coval.dev/cli/runs): Launch evaluations, watch progress, view results - [Simulations](https://docs.coval.dev/cli/simulations): Inspect individual simulation results and download audio - [Test Sets](https://docs.coval.dev/cli/test-sets): Manage test set collections - [Test Cases](https://docs.coval.dev/cli/test-cases): Define evaluation inputs, expected outputs, and bulk import - [Personas](https://docs.coval.dev/cli/personas): Configure simulated callers with voice and behavior - [Metrics](https://docs.coval.dev/cli/metrics): Define scoring criteria (LLM judge, audio, regex, tool call) - [Mutations](https://docs.coval.dev/cli/mutations): Manage agent configuration variants - [Run Templates](https://docs.coval.dev/cli/run-templates): Save reusable evaluation configurations - [Scheduled Runs](https://docs.coval.dev/cli/scheduled-runs): Schedule recurring evaluations with cron expressions - [Dashboards](https://docs.coval.dev/cli/dashboards): Create dashboards and widgets from the CLI - [API Keys](https://docs.coval.dev/cli/api-keys): Manage API keys - [Human Review](https://docs.coval.dev/cli/human-review): Human review projects and annotations ## MCP Server - [Overview](https://docs.coval.dev/mcp/overview): Model Context Protocol server for LLM tool access - [Installation](https://docs.coval.dev/mcp/installation): Set up MCP server for Claude, Cursor, and other clients - [Tools](https://docs.coval.dev/mcp/tools): Available MCP tools reference - [Beginner's Guide](https://docs.coval.dev/mcp/beginners-guide): Getting started with MCP and Coval ## API Reference - [Introduction](https://docs.coval.dev/api-reference/v1/introduction): Authentication, base URL, pagination, filtering, and error codes - [OpenAPI Specs](https://api.coval.dev/v1/openapi): Machine-readable API specifications (15 
resource specs) ## Optional - [Use Cases](https://docs.coval.dev/use-cases/overview): Example evaluation scenarios by industry - [Leveraging Test Users](https://docs.coval.dev/use-cases/leveraging-test-users): Using test user data for better evaluations - [Hackathons](https://docs.coval.dev/collaborate/hackathons/overview): Community events and collaboration --- # Full Documentation --- ## Welcome to Coval Source: https://docs.coval.dev/getting-started/welcome Coval is the reliability infrastructure for conversational AI agents. ![](/images/overview-coval.jpg) ## Why Coval? Welcome to Coval - your enterprise-grade conversational agent testing and observability platform. We help **developers**, **QA teams**, and **enterprise operations teams** confidently test, evaluate, and monitor voice and chat agents—before and after they go live. Whether you’re building a new agent, debugging a regression, or comparing vendors, Coval gives you a full-stack platform to simulate real conversations, track production quality, and ship agents you trust. **Built for**: **Developers & QA** – Automate testing across hundreds of scenarios with realistic voice inputs, IVRs, and edge cases **Enterprise Agent Ops** – Monitor multiple vendors, standardize metrics, and scale performance evaluation across use cases **Teams shipping fast** – Integrate Coval into your CI/CD or agent workflows to catch issues early, simulate regressions, and ship with confidence ## **How Coval Provides Value** Coval accelerates AI agent development with automated testing for chat, voice, and other objective-oriented systems. Instead of calling your agent manually, simulate your end customers in realistic settings with background noise and accents, and test a variety of scenarios in automated tests. Coval provides the tools to automatically generate and execute a wide range of test scenarios, ensuring no aspect of your agent's behavior is overlooked. Run GitHub Actions on any change and simulate outcomes, or run automated evaluations on a schedule. We also integrate with partners across the voice stack to embed Coval into your workflows. Assure your customers that you deliver high-quality results at scale by running evals on production calls. Smart alerting notifies you of critical issues and anomalies so you can debug early and scale your agent fast. > ## Start building Follow the "Key Concepts" & "Guides" sections to set up your evals & monitoring. Alternatively, use our API to easily manage and integrate your Coval agents. - [API Reference](/api-reference/v1/introduction): Manage evaluation, monitoring, metrics and simulators with the Coval API. ## Keep in touch with us Join the Coval community to ask questions, discuss best practices, and share tips. - [Slack Community](https://forms.gle/frTn8eCHkcTfw67p8) - [LinkedIn](https://www.linkedin.com/company/covaldev/posts/?feedView=all) - [X (Twitter)](https://x.com/covaldev) --- ## Quick Start Source: https://docs.coval.dev/getting-started/quick-start Get started with Coval in 5 minutes - simulate and evaluate your AI agent Ready to start evaluating your AI agent? This quick start guide will get you running your first simulation in minutes. ## Prerequisites - Your AI agent (voice or chat) must be accessible via phone number or API endpoint - A Coval account ## 1. Connect Your Agent Connect your agent to Coval's simulation platform: 1. Go to **Agents** in your dashboard 2. Click **"Add New Agent"** 3.
Provide your agent's connection details: - **Phone number** (for voice agents) - **API endpoint URL** (for chat agents) 4. Configure additional settings like language preferences > **Tip:** Use **Inbound Voice** connection type for agents that receive calls (like customer support lines) ### **Here's a quickstart guide on connecting your Agent:** [Video: Quickstart guide on connecting your Agent](https://www.loom.com/embed/70bb69704bfc4157975e43824ef4248d?sid=95b0b768-bd68-47bf-8286-8b46a349d9f5) ## 2. (Optional) Create a Persona Define how your simulated users will behave: 1. Navigate to **Personas** 2. Click **"Create New Persona"** 3. Configure basic settings: - **User Type**: Customer, Patient, Employee, or Custom - **Voice & Language**: Select from available options - **Behavior**: Set interruption sensitivity and response patterns > **Info:** Coval offers a set of built-in Personas with different voice & background noise settings. ## 3. Create Your Test Set Tell Coval what your simulated users should do: 1. Go to **Test Sets** 2. Click **"Create New Test Set"** 3. Add test cases using **Scenarios**: - Simple: `"Call to get a refund"` - Complex: `"Ask for PTO from March 21-22, then change to March 20-22, provide email as emily@gmail.com"` > **Note:** Be specific in your scenarios - the more detail, the more accurately our simulated users will follow instructions. For more detail check our [Test Set Guide](https://docs.coval.dev/concepts/test-sets/overview). ## 4. Choose Your Metrics Select how to evaluate your agent's performance: **Recommended starter metrics:** - **Call Resolution Success** (LLM Judge) - **Latency** (Audio metric) - **Interruptions** (Audio metric) Navigate to **Metrics** to create custom metrics or use built-in options. ## 5. Create a Template Bundle everything together for easy reuse: 1. Go to **Templates** → **"Create New Template"** 2. Configure: - **Test Set**: Your scenarios - **Agent**: Your connected agent - **Persona**: Your simulated user behavior - **Metrics**: Your evaluation criteria - **Iterations**: Number of conversations to run (start with 1-3) - **Concurrency**: Parallel simulations (start with 1-2) ## 6. Launch Your First Evaluation 1. Click **"Launch Evaluation"** 2. Select **"Use Template"** and choose your template 3. Click **"Launch"** 4. Monitor progress in real-time ## 7. Analyze Results Once complete, review your evaluation: - **Overview**: Aggregated performance metrics - **Individual Conversations**: Detailed analysis of each simulation - **Transcript Review**: See exactly what was said and where issues occurred - **Metric Explanations**: Understand why each metric passed or failed ## Next Steps - [Advanced Metrics](/concepts/metrics/overview): Create custom LLM judge metrics for your specific use cases - [CI/CD Integration](/getting-started/github-actions-tutorial): Automate evaluations with GitHub Actions - [Continuous Monitoring](/concepts/monitoring/overview): Set up alerts and recurring evaluations ## Troubleshooting **Agent not connecting?** - Verify your phone number or endpoint URL - Check if your agent accepts inbound calls/requests - Ensure proper authentication if using API endpoints **Simulations not running as expected?** - Make test case scenarios more specific - Adjust persona settings for desired behavior - Check agent and persona compatibility **Need help?** Contact our support team or check our detailed guides for more advanced configuration options. 
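If your agent connects over an API endpoint, a quick local smoke test can rule out endpoint and authentication problems before you launch a run. The sketch below is illustrative only: it assumes an OpenAI-compatible chat endpoint with a bearer token, and the URL and key are placeholders.

```python
import requests  # third-party package: pip install requests

# Placeholders -- substitute your agent's endpoint and credentials.
CHAT_ENDPOINT = "https://api.yourdomain.com/v1/chat/completions"
AUTH_TOKEN = "sk-your-key-here"

payload = {"messages": [{"role": "user", "content": "Hello, are you there?"}]}

resp = requests.post(
    CHAT_ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    timeout=30,
)

# A 2xx status and a JSON body with an assistant message are the minimum
# Coval needs to exchange messages with your agent.
print(resp.status_code)
print(resp.json())
```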
--- ## Agents Source: https://docs.coval.dev/concepts/agents/overview Connect your voice & chat agents to Coval to launch simulated conversations Each agent configuration acts as a reusable connection profile that can be referenced across multiple simulations, evaluations, and monitoring sessions without requiring reconfiguration. ## How to Configure Agents ### Adding an Agent [Image: Connect Agent Demo] 1. Navigate to the Agents section in your dashboard 2. Click "Add New Agent" 3. Configure the connection parameters: - **Endpoint URL**: API endpoint for your agent service - **Phone Number**: For voice-based agents requiring telephony access - **Authentication**: API keys or authentication tokens as required 4. Set operational parameters: - **Language Preferences**: Primary and fallback language configurations - **Agent Behavior Prompts**: System prompts or behavioral guidelines - **Simulator Types**: Compatible simulation environments ### Attributes You can set specific attributes on each agent. For example, if you have multiple agents representing different restaurant reservation services, you could define attributes such as "opening_hours" and "menu_items". You can embed these agent attributes into [test case scenarios](/concepts/test-sets/overview#utilizing-agent-attributes) or [metric prompts](/concepts/metrics/prompting#using-agent-attributes-and-test-case-attributes) by inserting `{{agent.attribute_name}}`. In the example above, you could create a metric that asks: ```markdown Did the agent give the correct opening hours? Opening hours are `{{agent.opening_hours}}` ``` or you could use it in a test case: ```markdown Order two items from this list: {{agent.menu_items}} ``` ## Connect Your Agent - [Inbound Voice](/concepts/agents/connections/inbound-voice): Receive incoming phone calls for customer service scenarios - [Outbound Voice](/concepts/agents/connections/outbound-voice): Make calls to users for sales and scheduling - [OpenAI Endpoint](/concepts/agents/connections/openai-endpoint): Connect OpenAI-compatible chat APIs - [Chat WebSocket](/concepts/agents/connections/chat-websocket): Text chat over persistent WebSocket connections - [Pipecat Cloud](/concepts/agents/connections/pipecat): Integrate with Pipecat Cloud agents - [LiveKit](/concepts/agents/connections/livekit): Advanced real-time communication platform --- ## Agent Mutations Source: https://docs.coval.dev/concepts/agents/mutations Test agent configuration variants side-by-side in a single evaluation Agent mutations let you create configuration variants of an agent and run them alongside the base agent in a single evaluation. Instead of manually creating separate agents for each configuration you want to test, mutations override specific fields and produce side-by-side results so you can compare performance directly. ## How Mutations Work Every agent has a base configuration: model, system prompt, temperature, voice, API key, and other settings depending on the agent type. A mutation overrides one or more of these fields while leaving the rest unchanged. When you launch an evaluation with mutations selected, Coval runs the base agent **plus** each selected mutation as separate simulations. The results appear in the same run, grouped by variant, so you can compare metrics across configurations. > **Info:** The base agent always runs. Mutations are additional variants that run alongside it. ### Deep Merge Mutation overrides are deep-merged with the base agent configuration.
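Conceptually, this works like a recursive dictionary merge. The sketch below is only an illustration of that behavior (it is not Coval's implementation); the JSON example that follows shows the same merge declaratively.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects: merge key by key, preserving sibling keys.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars (and other values): the mutation value replaces the base value.
            merged[key] = value
    return merged

base = {"custom_sip_headers": {"X-Session-Id": "abc", "X-Routing": "prod"}}
mutation = {"custom_sip_headers": {"X-Session-Id": "override-123"}}

print(deep_merge(base, mutation))
# {'custom_sip_headers': {'X-Session-Id': 'override-123', 'X-Routing': 'prod'}}
```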
For nested objects (like custom headers), this means individual keys are overridden while sibling keys are preserved. ```json // Base agent metadata { "custom_sip_headers": { "X-Session-Id": "abc", "X-Routing": "prod" } } // Mutation override { "custom_sip_headers": { "X-Session-Id": "override-123" } } // Merged result at runtime { "custom_sip_headers": { "X-Session-Id": "override-123", "X-Routing": "prod" } } ``` For scalar values (strings, numbers), the mutation value replaces the base value entirely. ## Supported Agent Types Mutations are available for agent types that have overridable configuration fields: | Agent Type | Overridable Fields | |---|---| | **WebSocket** | Endpoint, initialization JSON, authorization header, custom headers | | **Inbound Voice (SIP)** | Custom SIP headers | ## Creating Mutations 1. Navigate to your agent's detail page 2. Scroll to the **Mutations** section 3. Click **Create** ### Mutation Form Each mutation requires: - **Name** -- A descriptive label (e.g., "High Temperature", "French Language", "GPT-4 Turbo") - **Description** -- Optional context about what this variant tests Then select which fields to override. Only fields defined for your agent type are available. For each selected field, enter the override value using the appropriate editor: - **Text input** for simple string or number fields - **Dropdown** for fields with predefined options (e.g., model selection, voice) - **JSON editor** for structured data fields (e.g., custom headers) > **Tip:** Only override the fields you want to test. Fewer overrides make results easier to interpret. ## Using Mutations in Evaluations When launching an evaluation or creating a template: 1. Select a single agent 2. The **Agent Mutations** selector appears if the agent has active mutations 3. Check the mutations you want to include The base agent is always included. Each selected mutation creates additional simulations in the same run. ### Viewing Results After the evaluation completes: - The **Mutation** column in the results table shows which variant produced each result - Use the **filter dropdown** to isolate results by mutation - On the **Dashboard**, use the "Agent Mutation" dimension to split charts by variant This lets you directly compare metrics like latency, accuracy, or custom scores across your base agent and each mutation. ## Common Use Cases **Model comparison** -- Create mutations for different models (GPT-4o vs GPT-4 Turbo vs Claude) to benchmark quality and latency. **Prompt tuning** -- Test system prompt variations to find the wording that produces the best metric scores. **Temperature sweeps** -- Run the same agent at different temperature values to find the optimal balance of creativity and consistency. **Voice testing** -- Compare different TTS voices or providers to find the best fit for your use case. **Header overrides** -- For SIP-based voice agents, test different routing headers or session metadata per variant. ## Managing Mutations ### Editing Click the edit icon on any mutation card to modify its name, description, or overrides. ### Deleting Click the delete icon to archive a mutation. If the mutation is referenced by active templates, you will see a warning listing those templates. You can cancel or force-delete, which causes those templates to skip the deleted mutation on future runs. > **Warning:** Archived mutations cannot be restored. Create a new mutation if you need the same configuration again. 
## API and CLI Mutations can also be managed programmatically: - [REST API](/api-reference/v1/mutations/mutations/list-mutations): Create, list, update, and delete mutations via the v1 API - [CLI](/cli/mutations): Manage mutations from the command line --- ## Attributes Source: https://docs.coval.dev/concepts/attributes/overview Use dynamic attributes from agents, test cases, and simulations in your metric prompts and test scenarios. You can embed dynamic values from agents, test cases, and simulations into your metric prompts and test scenarios using template variables. The system automatically replaces these placeholders with actual values during evaluation. ## Supported Sources The template system supports three main sources: - **`{{agent.*}}`** - References agent attributes - **`{{test_case.*}}`** - References test case attributes - **`{{test_case.expected_behaviors}}`** - References test case criteria (used by Composite Evaluation) ## Basic Usage The simplest form is accessing a top-level attribute: ``` {{agent.attribute_name}} {{test_case.attribute_name}} {{test_case.expected_behaviors}} //For criteria in your test case (used by Composite Evaluation) ``` **Example:** Imagine one agent has the attribute **location** with value "San Francisco", and another agent has value "London". ``` Scenario: You are a user calling for travel recommendations in {{agent.location}} Criterion: The agent should only give travel recommendations in {{agent.location}} ``` ## Nested Paths You can access nested attributes using dot notation: ``` {{agent.users.sam.flight_number}} → agent.attributes["users"]["sam"]["flight_number"] ``` **Example:** If your agent has a nested structure: ```json { "users": { "sam": { "flight_number": "UA123", "email": "sam@example.com" } } } ``` You can access it in your prompt: ``` Did the agent identify the flight as: {{agent.users.sam.flight_number}}. ``` ## Array Indexing Access specific elements in arrays using bracket notation: ``` {{test_case.test[0]}} → test_case.attributes["test"][0] ``` **Example:** If your test case has an array: ```json { "flight_options": ["United Airlines", "Delta", "American Airlines"] } ``` You can reference specific flights: ``` The first flight option is {{test_case.flight_options[0]}}. The assistant should mention {{test_case.flight_options[1]}} as an alternative. ``` ## Array Access Without Indexing Access entire arrays without indexing - they'll be returned as strings: ``` {{test_case.test}} → test_case.expected_output_json["test"] (entire array as string) {{agent.items}} → agent.attributes["items"] (entire array as string) ``` **Example:** ``` The available options are: {{test_case.flight_options}} ``` ## Dynamic Keys via Multi-Pass Resolution The system supports dynamic key resolution through multiple passes, allowing you to use one template variable to determine another: ``` {{agent.users.{{test_case.username}}.email}} ``` **How it works:** 1. First pass: Resolves `{{test_case.username}}` → "user1" 2. Second pass: Resolves `{{agent.users.user1.email}}` → "test@example.com" **Example:** If your test case specifies a username: ```json { "username": "user1" } ``` And your agent has user-specific data: ```json { "users": { "user1": { "email": "user1@example.com", "tier": "premium" } } } ``` You can use dynamic resolution: ``` The user {{test_case.username}} has email {{agent.users.{{test_case.username}}.email}} and is a {{agent.users.{{test_case.username}}.tier}} member. 
``` ## Complete Example Here's a comprehensive example combining multiple features: **Agent attributes:** ```json { "location": "San Francisco", "users": { "sam": { "tier": "premium", "perks": ["early_checkin", "room_upgrade"] } } } ``` **Test case:** ```json { "username": "sam", "requested_perks": ["early_checkin"] } ``` **Metric prompt:** ``` Given the transcript, did the assistant properly handle the reservation request? Hotel Location: {{agent.location}} Customer: {{test_case.username}} Customer Tier: {{agent.users.{{test_case.username}}.tier}} Available Perks: {{agent.users.{{test_case.username}}.perks}} Requested Perks: {{test_case.requested_perks}} Return YES if: • The assistant confirmed the reservation is for {{agent.location}} • The assistant recognized {{test_case.username}} as a {{agent.users.{{test_case.username}}.tier}} member • The assistant mentioned available perks: {{agent.users.{{test_case.username}}.perks}} • The assistant processed the requested perk: {{test_case.requested_perks[0]}} Return NO if: • The assistant provided incorrect location information • The assistant failed to recognize the customer's tier status • The assistant couldn't access the requested perk information ``` ## Use Cases ### In Metric Prompts Attributes are commonly used in metric prompts to create context-aware evaluations. See [Metric Prompting](/concepts/metrics/prompting) for examples of using attributes in evaluation metrics. ### In Test Scenarios You can embed agent attributes directly into test case scenarios and expected behaviors. This allows the same test set to work with different agents that have different attribute values. See [Test Sets](/concepts/test-sets/overview) for more information. ### In Criteria Use attributes in criteria definitions to create dynamic validation that adapts to the specific agent or test case being evaluated. These criteria are used by the **Composite Evaluation** metric. --- ## Inbound Voice Source: https://docs.coval.dev/concepts/agents/connections/inbound-voice Configure agents that receive incoming phone calls from users ## Overview Inbound Voice connections simulate users calling your agent through traditional telephony. ## Configuration Requirements ### Phone Number - **Field**: `phone_number` - **Type**: String (required) - **Format**: E.164 international format with country code, or a SIP address - **Examples**: `+12345678901`, `sip:agent@example.com` - **Validation**: Must be a valid phone number or SIP URI ### Wideband Audio When your agent is configured with a **SIP address**, a **Wideband Audio (16kHz)** toggle appears in the agent configuration page. Enabling this toggle switches the audio encoding from the default narrowband codec (G.711 mu-law at 8kHz) to wideband L16 PCM at 16kHz, providing higher quality audio. #### When to use wideband audio Wideband audio is beneficial when your agent uses a SIP address (e.g., `sip:agent@example.com`) and supports a wideband codec. In this configuration the audio is 16kHz end-to-end, giving a meaningful quality improvement. This toggle only appears for SIP connections because PSTN phone numbers are limited to narrowband audio (8kHz G.711) on the carrier leg. Even if wideband encoding were used on the Coval side, the PSTN leg would still be narrowband, so there is no actual quality benefit. #### How to enable it 1. In the agent configuration page, select a **SIP address** as your connection type 2. A **Wideband Audio (16kHz)** toggle will appear 3. 
Enable the toggle to use wideband audio > **Note:** When the wideband toggle is off (the default), standard narrowband PCMU encoding is used. Existing agents are unaffected unless you explicitly enable wideband audio. Note that some PSTN carriers may transcode audio, which can reduce quality regardless of the codec setting. ### Setup Instructions 1. Select "Inbound Voice" as the connection type 2. Enter your agent's phone number in E.164 format or SIP address 3. (Optional) If using a SIP address, enable the **Wideband Audio (16kHz)** toggle for higher quality audio 4. Save and test the configuration ## Technical Details **Call Flow** 1. Coval initiates call to configured phone number 2. Your agent answers and begins conversation 3. Audio processed in real-time 4. Session continues until completion or timeout ## Troubleshooting **Common Issues:** - **Invalid Phone Number Format**: Ensure E.164 format with country code - **Call Connection Failures**: Verify number is active and accessible - **Poor Audio Quality**: Check telephony provider settings. For SIP agents, consider enabling the **Wideband Audio (16kHz)** toggle - **Timeout Issues**: Adjust agent response timing configuration --- ## Inbound Voice Simulations Source: https://docs.coval.dev/guides/simulations/inbound-voice Simulate inbound calls to test your voice agent ## Overview To simulate inbound calls, add your agent's phone number or SIP address in an evaluation template. **Requirements**: Add your agent's phone number or SIP address in a Coval Template. ## How It Works When you launch an evaluation with an inbound voice agent: 1. Coval's simulated user calls your agent's phone number 2. The conversation follows the test set scenarios you've defined 3. The simulated user behaves according to the persona you've configured 4. Metrics are automatically evaluated after the call completes ## Identifying Simulation Calls Coval includes a custom SIP header in the outgoing call: ``` X-Coval-Simulation-Id: ``` This header lets you correlate incoming calls with their corresponding simulation runs in Coval. > **Note:** This header is carried inside the SIP signaling layer. Whether your agent can read it depends on how the call is routed — see below. ### When the Header Is Available **SIP trunking** — If your agent receives calls via a SIP trunk (e.g., Twilio SIP Trunking, Telnyx SIP Trunking, or your own SBC), the custom header is preserved end-to-end and your telephony provider will surface it in the inbound call event. **Regular phone numbers (PSTN)** — If your agent receives calls on a standard phone number, the call routes through the public telephone network. PSTN carriers strip non-standard SIP headers, so `X-Coval-Simulation-Id` will **not** be available to your application. You can still identify simulation calls by the calling phone number or by matching timestamps in the Coval dashboard. ### Retrieving the Header (SIP Trunking) How you access custom SIP headers depends on your telephony provider. **Twilio** Twilio exposes custom SIP headers as parameters prefixed with `SipHeader_` on incoming call webhooks. The simulation ID will be available as `SipHeader_X-Coval-Simulation-Id` in the request parameters sent to your webhook. See [Twilio's SIP documentation](https://www.twilio.com/docs/voice/api/sip-interface) for more details. **Telnyx** Telnyx includes custom SIP headers in the `sip_headers` array on the `call.initiated` webhook event. Look for the header with the name `X-Coval-Simulation-Id` in that array. 
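As an illustration, a webhook handler for either provider could pull the simulation ID out of the inbound call event roughly as follows. This is a sketch using Flask; the exact payload nesting, especially for Telnyx, is an assumption, so adjust it to your framework and actual webhook payloads.

```python
from flask import Flask, request  # pip install flask

app = Flask(__name__)

@app.route("/inbound-call", methods=["POST"])
def inbound_call():
    # Twilio: custom SIP headers arrive as webhook parameters prefixed with "SipHeader_".
    simulation_id = request.values.get("SipHeader_X-Coval-Simulation-Id")

    # Telnyx: custom SIP headers appear in the `sip_headers` array of the
    # call.initiated event. The nesting used here is assumed -- verify it
    # against your actual webhook payload.
    if simulation_id is None:
        event = request.get_json(silent=True) or {}
        sip_headers = event.get("data", {}).get("payload", {}).get("sip_headers", [])
        for header in sip_headers:
            if header.get("name") == "X-Coval-Simulation-Id":
                simulation_id = header.get("value")

    if simulation_id:
        print(f"This call belongs to Coval simulation {simulation_id}")
    return "", 200
```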
See [Telnyx's SIP documentation](https://developers.telnyx.com/docs/voice/sip-trunking) for more details. ## Firewall & IP Allowlist When Coval places a simulation call to your agent, the call arrives from specific IP addresses. If your telephony infrastructure has firewall rules, access control lists (ACLs), or security groups that restrict which IPs can send SIP traffic, you must allowlist the IPs below — otherwise simulation calls will be silently dropped before they reach your agent. > **Note:** This only applies if you receive calls on your own SIP infrastructure (SBC, IP-PBX, or SIP trunk endpoint) and restrict inbound traffic by IP address. If your agent receives calls on a regular phone number through a cloud provider like Twilio or Telnyx, calls arrive through your provider's normal flow and no allowlisting is needed. Two types of traffic need to be allowed: ### Signaling IPs (SIP) SIP signaling is how calls are initiated (the SIP INVITE that starts the call, plus responses). Allow these IPs on ports **UDP/TCP 5060** and **TLS 5061**. | Region | Primary | Secondary | |--------|---------|-----------| | US | 192.76.120.10 | 64.16.250.10 | | Europe | 185.246.41.140 | 185.246.41.141 | | Australia | 103.115.244.145 | 103.115.244.146 | | Canada | 192.76.120.31 | 64.16.250.13 | | Asia (Beta) | 103.115.244.158 | 103.115.244.159 | ### Media IP Subnets (RTP) RTP carries the actual audio once the call is connected. Allow these subnets on **UDP ports 16384–32768**. ``` 36.255.198.128/25 50.114.136.128/25 50.114.144.0/21 64.16.226.0/24 64.16.227.0/24 64.16.228.0/24 64.16.229.0/24 64.16.230.0/24 64.16.248.0/24 64.16.249.0/24 103.115.244.128/25 185.246.41.128/25 ``` > **Tip:** If simulation calls are failing with no audio or immediate hangups, a missing IP allowlist entry is a common cause. Verify that both the signaling IPs and media subnets are allowed in your firewall. ## Setup Steps 1. Navigate to your Template configuration 2. Add your agent's phone number or SIP address in the agent connection settings 3. Configure your test sets with scenarios you want to test 4. Select the personas for your simulated callers 5. Choose the metrics you want to evaluate 6. Launch your evaluation ## Best Practices - Use realistic phone numbers that your agent can receive calls on - Test with different personas to cover various customer types - Include both happy path and edge case scenarios in your test sets - Monitor latency and interruption metrics for voice quality --- ## Outbound Voice Source: https://docs.coval.dev/concepts/agents/connections/outbound-voice Configure agents that initiate phone calls to users ## Overview Outbound Voice connections simulate your agent making calls to users. 
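At a high level, Coval sends an HTTP POST to a trigger endpoint you host, and your system dials the number included in the payload. A minimal receiver might look like the sketch below; Flask is used for illustration, `place_outbound_call` is a hypothetical helper, and the field names follow the defaults documented in the sections that follow.

```python
from flask import Flask, request  # pip install flask

app = Flask(__name__)

def place_outbound_call(to: str, metadata: dict) -> None:
    """Hypothetical helper -- wire this into your own dialer or telephony stack."""
    print(f"Dialing {to} with metadata {metadata}")

@app.route("/webhook", methods=["POST"])
def trigger_call():
    payload = request.get_json(force=True)

    # By default Coval adds the destination number under "phone_number";
    # the key name is configurable (see Phone Number Key below).
    phone_number = payload["phone_number"]

    place_outbound_call(to=phone_number, metadata=payload)

    # Return a 2xx status so Coval knows the trigger was accepted.
    return {"status": "accepted"}, 200
```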
## Configuration Requirements ### Trigger Call Endpoint - **Field**: `trigger_call_endpoint` - **Type**: String (required) - **Format**: Valid HTTPS URL - **Purpose**: The webhook endpoint where outbound calls are initiated - **Example**: `https://your-endpoint.com/webhook` ### Trigger Call Headers - **Field**: `trigger_call_headers` - **Type**: String (optional) - **Format**: Valid JSON string - **Purpose**: HTTP headers to include in trigger requests - **Example**: `{"Content-Type": "application/json", "Authorization": "Bearer token123"}` ### Phone Number Key - **Field**: `phone_number_key` - **Type**: String (optional) - **Default**: `"phone_number"` - **Purpose**: Key used to identify the phone number in the trigger payload - **Example**: `"phone_number"`, `"recipient_phone"`, `"target_number"` ### Trigger Call Payload - **Field**: `trigger_call_payload` - **Type**: String (optional) - **Format**: Valid JSON string - **Purpose**: Additional data to send with the trigger request - **Example**: `{"sequence_code": "ABC123", "campaign_id": "summer2024"}` ## Setup Instructions 1. Configure webhook endpoint URL that can receive HTTP POST requests 2. Set authentication headers as JSON string 3. Define phone number key and payload structure 4. Test webhook response and call initiation ## Technical Details **Call Initiation Flow** 1. Coval sends HTTP POST request to trigger endpoint 2. Your system receives webhook and initiates call 3. Agent connects to specified phone number 4. Conversation proceeds with real-time monitoring **Webhook Payload Structure** ```json { "phone_number": "+12345678901", "sequence_code": "ABC123", "campaign_id": "summer2024" } ``` ## Troubleshooting **Common Issues:** - **Webhook Not Triggering**: Verify endpoint URL and accessibility - **Invalid JSON Format**: Validate headers and payload syntax - **Authentication Failures**: Check authorization headers and tokens - **Call Connection Issues**: Verify phone number format and availability --- ## Outbound Voice Simulations Source: https://docs.coval.dev/guides/outbound-voice Coval's Outbound Voice Simulation feature enables you to test your voice agents by having Coval simulate realistic customer interactions. Instead of your agent calling a test phone number, Coval triggers your system to initiate an outbound call to our simulated customer, creating a more realistic testing environment that mirrors your production call flows. ### **How It Works** 1. **You configure** an endpoint that Coval can call to trigger outbound calls 2. **Coval starts** a simulation and sends a request to your trigger endpoint 3. **Your system** receives the request and initiates an outbound call to our simulated customer 4. **The simulation runs** with realistic customer responses based on your test scenarios 5. **Coval provides** detailed transcripts, recordings, and analysis of the interaction --- ## **Getting Started** ### **Prerequisites** Before setting up outbound voice simulations, ensure you have: - An endpoint that can receive HTTP POST requests - The ability to initiate outbound calls from your phone system - API authentication mechanism (recommended) - Phone system capable of dialing phone numbers ### **Quick Setup** 1. **Prepare Your Trigger Endpoint** - Create an endpoint that accepts POST requests - Implement authentication (API key, bearer token, etc.) - Add logic to extract phone number and initiate calls 2. 
**Configure in Coval** - Navigate to your simulation settings - Select "Outbound Voice" as the simulation type - Enter your trigger endpoint URL - Configure headers and payload format - Test the configuration 3. **Run Your First Simulation** - Create a test scenario - Start the simulation - Monitor the call initiation and interaction ## **Configuration Details** ### **Trigger Call Endpoint** **Purpose**: The URL where Coval will send requests to trigger outbound calls from your system. This is a required field. **Requirements**: - Must be a valid HTTP/HTTPS URL - Should be publicly accessible or whitelisted for Coval's IP ranges - Must respond within 30 seconds - Should return 2xx status codes for successful requests **Example Configuration**: ``` https://api.yourcompany.com/triggers/voice-simulation ``` ### **Trigger Call Headers** Configure HTTP headers that Coval will include with every trigger request. This typically includes authentication and content type headers. This is a required field. **Format**: Valid JSON object ```json { "Content-Type": "application/json", "Authorization": "Bearer your-api-key-here", "X-Source": "coval-simulation" } ``` **Common Headers**: - `Authorization`: API keys, bearer tokens, or basic auth - `Content-Type`: Usually `application/json` - `X-API-Key`: Alternative authentication method - Custom headers for routing or identification ### **Trigger Call Payload** The base JSON payload that Coval will send to your endpoint. Coval automatically adds the phone number field to this payload. This is a required field. **Format**: Valid JSON object ```json { "campaign_id": "test-campaign-001", "priority": "high" } ``` **Note**: The `phone_number` field will be automatically added by Coval. ### **Phone Number Key** Customize the field name used for the phone number in the payload. This allows integration with systems that expect different field names. **Default**: `phone_number` **Common Alternatives**: - `phoneNumber` - `phone` - `number` - `phoneNumberToCall` - `destination` **Example**: If your system expects `destination`, configure this field as `destination`, and the payload will include: ```json { "campaign_id": "test-campaign-001", "destination": "+15551234567" } ``` --- ## **Advanced Features** ### **Multi-Language Support** Coval supports simulations in multiple languages with native voice models: - **English** (Default): Standard US English voice and responses - **Spanish**: Latin American Spanish with appropriate cultural context - **French**: Standard French with proper pronunciation and idioms - **German**: Standard German with accurate grammar and expressions **Configuration**: Language is typically configured at the organization level or specified in simulation parameters. ### **Custom Voice Models** Configure specific voice characteristics for your simulations: - **Voice Provider**: Choose from multiple voice synthesis providers - **Voice Model**: Select specific voice models (multilingual, turbo, etc.) 
- **Voice ID**: Use specific voice identities for consistent testing ### **Simulation Behavior** Control how the simulated customer behaves during calls: - **Response Style**: Natural, conversational interactions with appropriate emotional responses - **Conversation Flow**: Realistic pauses, interruptions, and speaking patterns - **Scenario Adherence**: Follows predefined customer scenarios and objectives - **Language Consistency**: Maintains language and cultural context throughout the call --- ## OpenAI Endpoint Source: https://docs.coval.dev/concepts/agents/connections/openai-endpoint Connect to OpenAI-compatible chat completions APIs for text-based interactions ## Overview OpenAI Endpoint connections integrate with any OpenAI-compatible chat completions API, enabling text-based conversational testing. ## Configuration Requirements ### Chat Endpoint - **Field**: `chat_endpoint` - **Type**: String (required) - **Purpose**: URL endpoint for OpenAI-compatible chat completions API - **Format**: Valid HTTPS URL - **Example**: `https://api.openai.com/v1/chat/completions` ### Authentication Token - **Field**: `auth_token` - **Type**: String (required) - **Purpose**: API key or authentication token for the endpoint - **Format**: String (maximum 4KB length) - **Security**: Sensitive field - stored encrypted - **Example**: `sk-proj-abc123def456...` ### Model - **Field**: `model` - **Type**: String (optional) - **Default**: `"gpt-4o"` - **Purpose**: Specify which OpenAI model to use for completions **Available Models:** - **GPT-5 Series**: `gpt-5`, `gpt-5-mini`, `gpt-5-nano`, `gpt-5-chat` - **GPT-4.1 Family**: `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano` - **Reasoning Models**: `o1`, `o1-mini`, `o1-pro`, `o3-mini`, `o3-pro`, `o3`, `o4-mini` - **Current Generation**: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo` - **Legacy Models**: `gpt-4`, `gpt-3.5-turbo` ### Max Tokens - **Field**: `max_tokens` - **Type**: Number (optional) - **Purpose**: Maximum number of tokens to generate in responses - **Range**: Positive integer ≤ 100,000 - **Example**: `1000` ### System Prompt - **Field**: `system_prompt` - **Type**: String (optional) - **Purpose**: System-level instructions defining agent behavior and constraints - **Use Cases**: Role definition, behavior guidelines, response formatting - **Example**: `"You are a helpful customer service assistant. Always be polite and provide accurate information."` ### Temperature - **Field**: `temperature` - **Type**: Number (optional) - **Default**: `1.0` - **Purpose**: Controls randomness in AI responses - **Range**: 0.0 (deterministic) to 2.0 (very random) - **Example**: `0.7` ## Setup Instructions 1. Enter the complete URL for your OpenAI-compatible API 2. Input valid API key from your provider 3. Select model and configure token limits/temperature 4. Define system prompt for agent behavior ## Troubleshooting **Common Issues:** - **Authentication Failures**: Verify API key validity and permissions - **Invalid Endpoint**: Confirm URL format and API compatibility - **Model Not Available**: Check model name and provider support - **Rate Limit Errors**: Monitor API usage and implement backoff strategies --- ## Chat Simulations Source: https://docs.coval.dev/guides/simulations/chat Simulate text-based conversations with your chat agent ## Overview If you have chat agents or want to simulate your voice agent conversations without calls, you can use Coval the same way it's used for voice simulations by generating text conversations. 
Create test sets, define metrics, and configure templates to evaluate your chat agents automatically. **Requirements**: Provide a custom text endpoint that Coval can connect to ## Quick Start **Minimum Required Configuration:** 1. **Chat Endpoint** - The URL where your agent receives messages 2. **Authorization Header** - Authentication credentials for your API That's it! All other fields are optional and depend on your specific API requirements. --- ## Core Configuration ### Chat Endpoint (Required) The primary URL where Coval sends conversation messages during simulation. **Format:** Full HTTPS URL ``` https://api.yourdomain.com/chat ``` **Requirements:** - Must use HTTPS (HTTP will be auto-upgraded) - Cannot use local/private IP addresses - Must return JSON responses **Example:** ``` https://api.yourdomain.com/v1/chat ``` --- ### Authorization Header (Required) Authentication credentials sent with every API request. **Common Formats:** #### Bearer Token ``` Bearer your-secret-token-here ``` #### API Key ``` API-Key your-api-key-here ``` #### Custom Authorization ``` Custom-Auth your-custom-format ``` **UI Tip:** Use the dropdown to select your auth type, then paste your token/key. The system automatically formats the header correctly. --- ## Standard Protocol The standard integration for chat uses an HTTP, JSON-based API endpoint that you provide. When running a simulation, Coval's simulator, acting as a user, will connect to the endpoint to get responses from your agent. We use OpenAI's chat completions format, although we also support receiving responses in the OpenAI responses format. Query strings are not allowed in URLs. --- ## Optional Configuration ### Initialization Endpoint Called once before the conversation starts to set up session state. **When to use:** - Your API requires session initialization - You need to obtain a session ID or auth token - You want to pre-configure conversation context **Format:** Full HTTPS URL ``` https://api.yourdomain.com/init ``` **Example Response:** ```json { "sessionId": "abc-123-def", "userId": "user-456", "conversationId": "conv-789" } ``` **How it works:** 1. Coval calls the initialization endpoint with your [initialization payload](#initialization-payload) 2. The response is captured and made available to subsequent chat requests via template variables like `{{sessionId}}` or `{{init_response.conversationId}}` in your [input template](#input-template) and [custom headers](#custom-headers) > **Note:** The initialization endpoint is called **before** any chat messages are sent. The [input template](#input-template) is only used for chat requests — it does not affect the initialization call. --- ### Initialization Payload JSON payload sent to the initialization endpoint. 
**Format:** Valid JSON ```json { "user_id": "test-user", "config": { "language": "en", "temperature": 0.7 } } ``` **Template Variables Available:** | Variable | Description | Example Value | |----------|-------------|---------------| | `{{simulation_output_id}}` | Unique ID for this simulation | `"sim-abc-123"` | | `{{persona.field}}` | Data from persona metadata | `{{persona.user_id}}` | **Example with Variables:** ```json { "session_id": "{{simulation_output_id}}", "user_context": { "user_id": "{{persona.user_id}}", "preferences": {} } } ``` **Persona Integration:** To use `{{persona.field}}` variables, add `initialization_parameters` to your Persona's metadata: ```json { "initialization_parameters": { "user_id": "customer-123", "account_type": "premium" } } ``` Then reference in payload: ```json { "user_id": "{{persona.user_id}}", "account_type": "{{persona.account_type}}" } ``` --- ### Custom Data Additional JSON data included in every chat request (for APIs using standard payload format). **Format:** Valid JSON ```json { "metadata": { "source": "coval-evaluation", "version": "1.0" }, "context": { "department": "sales" } } ``` **How it's sent:** ```json { "messages": [...], "customData": { "metadata": {...}, "context": {...} } } ``` **Note:** Only used when NOT using `input_template`. If you use `input_template`, reference custom data with `{{custom_data.field}}` instead. --- ### Custom Headers Additional HTTP headers sent with every chat request, with support for dynamic values from the initialization response. **Format:** Valid JSON object with string keys and values ```json { "X-Session-ID": "{{sessionId}}", "X-User-ID": "{{init_response.user.id}}", "X-Custom-Header": "static-value" } ``` **Template Variables Available:** | Variable | Description | Example | |----------|-------------|---------| | `{{sessionId}}` | Session ID from init response | Extracted from `init_response.sessionId` | | `{{simulation_output_id}}` | Unique simulation ID | Generated by Coval | | `{{init_response.path}}` | Any nested field from init response | `{{init_response.auth.token}}` | **Common Use Cases:** **Use Case 1: Session ID in Header** ```json { "X-Session-ID": "{{sessionId}}" } ``` **Use Case 2: Nested Auth Token** ```json { "X-Auth-Token": "{{init_response.auth.token}}" } ``` **Use Case 3: Mixed Static and Dynamic** ```json { "X-Session-ID": "{{sessionId}}", "X-API-Version": "v2", "X-Simulation-ID": "{{simulation_output_id}}" } ``` **Field Size Limit:** 16kB maximum --- ## Chat Messages Your chat endpoint should be an HTTPS URL that will respond to POST requests with a JSON body. If an Authorization token was provided, it will be included in the headers. **Initial Request from Coval:** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" } ] } ``` **Expected Response Format:** Standard format: ```json { "messages": [ { "role": "assistant", "content": "Response to the initial text" } ] } ``` Or in the newer Responses format: ```json { "messages": [ { "role": "assistant", "content": [ { "type": "text_output", "text": "Thanks for contacting us" } ] } ] } ``` --- ## Advanced Configuration ### Response Format Determines the format for tool call responses sent to your API. 
**Options:** #### Chat Completions (Default) Standard OpenAI format for tool responses: ```json { "role": "tool", "content": "result", "tool_call_id": "call_123" } ``` #### Responses API Alternative format for function call outputs: ```json { "type": "function_call_output", "call_id": "call_123", "output": "result" } ``` Configure the response format by adding `response_format` to your model configuration: ```json { "response_format": "responses_api", "chat_endpoint": "https://api.your-company.com/chat", "authorization_header": "Bearer your-api-key" } ``` Coval will respond with the entire chat history in the format specified: **Chat Completions (Default):** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" }, { "role": "assistant", "content": "Thanks for contacting us" }, { "role": "user", "content": "When will my order arrive?" } ] } ``` **Responses API:** ```json { "sessionId": "XXXX", "customData": {}, "messages": [ { "role": "user", "content": "Initial reach out text" }, { "role": "assistant", "content": [ { "type": "text_output", "text": "Thanks for contacting us" } ] }, { "role": "user", "content": "When will my order arrive?" } ] } ``` **When to use:** Only change this if your API explicitly requires the Responses API format for tool calls. Most APIs use the default Chat Completions format. --- ### Tool Calls You can include tool/function calls in the Responses format: ```json { "messages": [ { "type": "function_call", "name": "get_order_date", "arguments": "{\"shipment_id\": \"xx555\"}" }, { "role": "assistant", "content": [ { "type": "text_output", "text": "Your order should arrive next Tuesday" } ] } ] } ``` --- ### Payload Wrapper Wraps the entire payload in a specified field name. **When to use:** Your API requires all payloads nested under a specific key (e.g., `data`, `request`, `body`). **Example:** **Without wrapper:** ```json { "messages": [...], "customData": {...} } ``` **With wrapper set to `"data"`:** ```json { "data": { "messages": [...], "customData": {...} } } ``` **Common Values:** - `data` - `request` - `body` - `payload` --- ### Input Template Completely customize the JSON payload sent to your **chat endpoint** on each conversation turn. > **Note:** The input template is **not** used for the [initialization endpoint](#initialization-endpoint) — that call always uses the [initialization payload](#initialization-payload). The simulation flow is: > > 1. Coval calls your initialization endpoint with the initialization payload > 2. The init response is captured > 3. For each chat turn, Coval uses the input template to build the request to your chat endpoint — and you can reference fields from the init response (e.g. 
`{{init_response.conversation_id}}`) **When to use:** - Your API expects a non-standard payload format - You need to include specific fields from initialization response - You want fine-grained control over the request structure **Format:** JSON with template variable placeholders **Available Template Variables:** | Variable | Type | Description | |----------|------|-------------| | `{{messages}}` | Array | Full conversation history | | `{{latest_message}}` | String | Most recent user message content | | `{{sessionId}}` | String | Session ID (from init response or simulation ID) | | `{{simulation_output_id}}` | String | Unique simulation identifier | | `{{custom_data}}` | Object | The custom data object | | `{{custom_data.field}}` | Any | Specific field from custom data | | `{{any.nested.path}}` | Any | Extract any field from init response using dot notation | **Example Templates:** **Example 1: Simple Custom Format** ```json { "user_input": "{{latest_message}}", "session_id": "{{sessionId}}", "context": {{custom_data}} } ``` **Example 2: Nested Init Response Fields** ```json { "messages": {{messages}}, "user_id": "{{init_response.user.id}}", "conversation_id": "{{init_response.conversation.id}}", "api_key": "{{init_response.auth.api_key}}" } ``` **Example 3: String Substitution** ```json { "input": "User said: {{latest_message}}", "session": "{{sessionId}}", "metadata": { "source": "coval", "user": "{{custom_data.user_id}}" } } ``` **Note:** When using `input_template`, the `custom_data` field is ignored. Reference custom data using `{{custom_data}}` or `{{custom_data.field}}` in your template instead. > **Warning:** **Quoting rules for template variables:** > > - **Object/Array variables** (`{{messages}}`, `{{custom_data}}`) substitute to valid JSON literals — do **not** wrap them in quotes. > - **String variables** (`{{sessionId}}`, `{{latest_message}}`, `{{init_response.*}}`) substitute to plain text — you **must** wrap them in quotes. > > For example, `"conversation_id": {{init_response.conversation.id}}` produces invalid JSON because the substituted value is not quoted. Use `"conversation_id": "{{init_response.conversation.id}}"` instead. --- ### Response Message Path Tells Coval where to find the assistant's message in your API response using dot notation. **When to use:** Your API returns a non-standard response format. **Default Behavior (when not set):** Expects response in this format: ```json { "messages": [ { "role": "assistant", "content": "The response text" } ] } ``` **Custom Path Examples:** **Example 1: Direct Field** ``` Response Message Path: output_message ``` Extracts from: ```json { "output_message": "The assistant response", "metadata": {...} } ``` **Example 2: Nested Object** ``` Response Message Path: data.response.text ``` Extracts from: ```json { "data": { "response": { "text": "The assistant response" } } } ``` **Example 3: Array Index** ``` Response Message Path: choices.0.message.content ``` Extracts from: ```json { "choices": [ { "message": { "content": "The assistant response" } } ] } ``` **Path Notation Rules:** - Use `.` to navigate nested objects: `data.response.text` - Use numeric indices for arrays: `choices.0.message` - Combine for complex paths: `data.results.0.output.text` --- ### Strip Message Timestamps Removes `timestamp` fields from messages before sending to your API. **When to use:** Your API rejects requests containing timestamp fields. 
**Default:** Disabled (timestamps included) **Example:** **With timestamps (default):** ```json { "messages": [ { "role": "user", "content": "Hello", "timestamp": "2025-01-15T10:30:00Z" } ] } ``` **With stripping enabled:** ```json { "messages": [ { "role": "user", "content": "Hello" } ] } ``` **Common Error Pattern:** ```json { "message": ["messages.0.property timestamp should not exist"], "statusCode": 400 } ``` If you see this error, enable "Strip Message Timestamps". --- ## Ending the Chat You can end the conversation by setting "status" to "ended" in your response: ```json { "status": "ended", "messages": [...] } ``` --- ## Common Configuration Patterns ### Pattern 1: OpenAI-Compatible API ``` Chat Endpoint: https://api.yourdomain.com/chat Authorization: Bearer sk-your-key-here (All other fields: leave empty/default) ``` ### Pattern 2: API with Session Initialization ``` Chat Endpoint: https://api.yourdomain.com/chat Initialization Endpoint: https://api.yourdomain.com/init Authorization: API-Key your-key-here Initialization Payload: { "user_id": "{{persona.user_id}}", "session_id": "{{simulation_output_id}}" } ``` ### Pattern 3: Custom API Format with Template ``` Chat Endpoint: https://api.yourdomain.com/v1/message Authorization: Bearer your-token Input Template: { "user_input": "{{latest_message}}", "session_id": "{{sessionId}}", "conversation_history": {{messages}} } Response Message Path: data.response.text ``` ### Pattern 4: API with Payload Wrapper ``` Chat Endpoint: https://api.yourdomain.com/chat Authorization: Bearer your-token Payload Wrapper: data ``` ### Pattern 5: Complex Custom Format ``` Chat Endpoint: https://api.yourdomain.com/chat Initialization Endpoint: https://api.yourdomain.com/sessions/create Authorization: Bearer static-token Custom Headers: { "X-Session-ID": "{{sessionId}}", "X-User-Context": "{{init_response.user.id}}" } Input Template: { "messages": {{messages}}, "user_id": "{{init_response.user.id}}", "api_version": "v2" } Response Message Path: response.text Strip Message Timestamps: true ``` --- ## Troubleshooting ### Error: "Failed to run simulation due to an unexpected error" **Problem:** This generic error often indicates an issue with your agent configuration, most commonly an invalid input template. **Common causes:** - Unquoted string variables in your input template (e.g. `{{init_response.id}}` instead of `"{{init_response.id}}"`) - Malformed JSON in your input template, initialization payload, or custom data - Invalid field references in template variables **Solution:** Double-check your input template for valid JSON syntax. Make sure all string template variables are wrapped in quotes. See the [quoting rules](#input-template) in the Input Template section. 
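One way to catch quoting mistakes before launching a run is to render the template locally with sample values and confirm the result parses as JSON. The sketch below uses naive string substitution and made-up sample values purely as a local sanity check; it is not how Coval resolves templates.

```python
import json
import re

# Paste your input template here (this one mirrors Pattern 3 above).
template = """
{
  "user_input": "{{latest_message}}",
  "session_id": "{{sessionId}}",
  "conversation_history": {{messages}}
}
"""

# Sample values: string variables are substituted bare, so they must already be
# quoted in the template; objects/arrays are substituted as JSON literals.
samples = {
    "latest_message": "Hello",
    "sessionId": "sim-abc-123",
    "messages": json.dumps([{"role": "user", "content": "Hello"}]),
}

rendered = re.sub(
    r"\{\{\s*([\w.]+)\s*\}\}",
    lambda m: str(samples.get(m.group(1), "")),
    template,
)

try:
    json.loads(rendered)
    print("Template renders to valid JSON")
except json.JSONDecodeError as err:
    print(f"Invalid JSON after substitution: {err}")
```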
### Error: "Invalid JSON response from endpoint" **Problem:** Your API returned non-JSON response **Solution:** Ensure your endpoint returns `Content-Type: application/json` ### Error: "Could not extract message from path 'X' in response" **Problem:** Response message path doesn't match your API response structure **Solution:** Verify the path using dot notation matches your actual response structure ### Error: "messages.0.property timestamp should not exist" **Problem:** Your API rejects timestamp fields **Solution:** Enable "Strip Message Timestamps" ### Tool calls not showing in transcript **Problem:** Tool call extraction not configured **Solution:** Verify your API returns OpenAI-compatible format or contact support for custom tool call extraction configuration ### Session ID not working in headers **Problem:** Template variable not being substituted **Solution:** Verify initialization endpoint returns `sessionId` field, or check custom headers configuration --- ## Best Practices 1. **Start Simple:** Begin with just Chat Endpoint and Authorization, add complexity as needed 2. **Test Incrementally:** Add one advanced feature at a time and test 3. **Use Template Variables:** Leverage `{{sessionId}}` and init response fields to maintain session state 4. **Validate JSON:** Always validate JSON fields before saving 5. **Check API Logs:** Use your API server logs to debug payload/response format issues 6. **Document Custom Formats:** Keep notes on your API's expected format for future reference --- ## Chat WebSocket Source: https://docs.coval.dev/concepts/agents/connections/chat-websocket Connect to text chat agents over a persistent WebSocket connection ## Overview Chat WebSocket agents communicate via text messages over a persistent WebSocket connection. Unlike the standard [Chat (HTTP)](/guides/simulations/chat) integration which uses request-response, Chat WebSocket maintains a single connection for the entire conversation — ideal for agents built on platforms like Genesys, NICE, or custom WebSocket-based chat systems. **When to use Chat WebSocket instead of Chat (HTTP):** - Your agent communicates over WebSocket rather than HTTP POST - Your agent sends multiple messages in response to a single user message - Your platform requires a persistent connection for the conversation lifecycle ## Connection Modes Chat WebSocket supports two connection modes: ### Direct Mode (Default) Connect directly to a WebSocket endpoint. ``` wss://your-agent.example.com/ws/chat ``` ### HTTP-First Mode Call an HTTP endpoint first to create a session, then connect to the WebSocket URL returned in the response. Common with platforms that require session provisioning before establishing a WebSocket connection. **Flow:** 1. Coval sends an HTTP request to your setup endpoint 2. Your API returns a response containing the WebSocket URL 3. 
Coval connects to that WebSocket URL ## Configuration ### Direct Mode Fields | Field | Required | Description | |-------|----------|-------------| | WebSocket Endpoint | Yes | The `wss://` URL to connect to | | Initialization JSON | No | JSON payload sent immediately after connection | | Authorization Header | No | Auth value sent during the WebSocket handshake | | Custom Headers | No | Additional headers for the WebSocket handshake | ### HTTP-First Mode Fields | Field | Required | Description | |-------|----------|-------------| | HTTP Endpoint URL | Yes | The `https://` URL to call for session setup | | HTTP Method | No | Request method (default: POST) | | Request Body | No | JSON body for the HTTP request | | HTTP Headers | No | Headers for the HTTP request | | WebSocket URL Response Path | Yes | Dot-notation path to the WebSocket URL in the response | | Authorization Header | No | Auth value for the WebSocket connection (separate from HTTP headers) | | Initialization JSON | No | JSON payload sent after WebSocket connection | | Custom Headers | No | Additional headers for the WebSocket connection | ## Message Format ### Sending Messages (Coval to Agent) Messages are sent as JSON using a configurable template. The default template: ```json {"type": "message", "text": "{{message}}"} ``` The `{{message}}` placeholder is replaced with the actual message text. Customize the template to match your agent's expected format: ```json {"event": "chat", "body": "{{message}}"} ``` ### Receiving Messages (Agent to Coval) Coval extracts text from incoming WebSocket messages using configurable JSON paths: | Setting | Default | Description | |---------|---------|-------------| | Message type path | `type` | Path to the message type field | | Text message type values | `message` | Type value(s) that indicate a text message | | Message text path | `text` | Path to the actual message content | **Example:** For an agent that sends: ```json {"event": "reply", "data": {"content": "Hello!"}} ``` Configure: - Message type path: `event` - Text message type values: `reply` - Message text path: `data.content` ## Message Coalescing Many chat agents send multiple messages in quick succession (e.g., a greeting followed by a question). Coval batches these into a single response using a configurable quiet period. - **Default:** 2.0 seconds - **Set to 0:** Deliver each message immediately (no batching) - **Increase:** For agents that send messages with longer pauses between them ## Handshake Some WebSocket agents send a "ready" message before accepting conversation messages. Configure the handshake to wait for this signal: | Setting | Default | Description | |---------|---------|-------------| | Ready message type | *(empty — no wait)* | The message type value that signals readiness | | Handshake timeout | 30 seconds | How long to wait before timing out | **Example:** If your agent sends `{"type": "session_ready"}` when it's ready: - Set Ready message type to `session_ready` ## Direction Filtering If your agent echoes back your outbound messages (common with Genesys), configure direction filtering to skip those echoes: | Setting | Description | |---------|-------------| | Direction path | JSON path to the direction field (e.g., `direction`) | | Outbound direction value | The value indicating an agent-to-user message (e.g., `outbound`) | When configured, only messages matching the outbound direction value are processed. Messages without a direction field or with a different value are skipped. ## Setup Instructions 1. 
**Create the agent** — Navigate to [Agents](https://app.coval.dev/agents/create), select **Chat** as the agent type, then toggle to **WebSocket** protocol 2. **Choose connection mode** — Select Direct or HTTP-First depending on your platform 3. **Configure the endpoint** — Enter your `wss://` URL (Direct) or HTTP setup endpoint (HTTP-First) 4. **Set message format** — If your agent doesn't use the default `{"type": "message", "text": "..."}` format, customize the send template and receive paths under Advanced Configuration 5. **Test** — Create a test set with a single test case and launch a simulation to verify connectivity ## Common Patterns ### Pattern 1: Simple Direct Connection ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws ``` ### Pattern 2: Authenticated Direct Connection ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws Authorization Header: Bearer your-token-here Initialization JSON: {"action": "start_session", "channel": "web"} ``` ### Pattern 3: HTTP-First with Session Provisioning ``` Connection Mode: HTTP-First HTTP URL: https://api.example.com/v1/sessions HTTP Method: POST Request Body: {"channel": "web", "language": "en"} HTTP Headers: {"Authorization": "Bearer your-token"} WebSocket URL Response Path: data.websocket_url ``` ### Pattern 4: Custom Message Format ``` Connection Mode: Direct Endpoint: wss://chat.example.com/ws Send Template: {"event": "user_message", "payload": {"text": "{{message}}"}} Message Type Path: event Text Message Type Values: agent_message Message Text Path: payload.text ``` ## Troubleshooting ### Connection Failures **"Timeout connecting to WebSocket"** - Verify the `wss://` URL is correct and publicly accessible - Check that your server accepts WebSocket upgrade requests - Ensure firewall rules allow inbound WebSocket connections **"Failed to connect to WebSocket"** - Confirm the endpoint is running and healthy - Check authorization header format matches what your server expects - For HTTP-First: verify the HTTP setup endpoint returns a valid WebSocket URL ### No Messages Received - Check that your message type path and text message type values match what your agent actually sends - Verify the message text path points to the correct field - If using direction filtering, confirm the outbound direction value is correct - Try increasing the coalesce timeout if messages arrive after the batch window closes ### Handshake Timeout - Confirm your agent sends the expected ready message type - Check that the ready message is sent before the timeout (default 30s) - Verify the message type path resolves correctly on the ready message ### Messages Getting Dropped - If your agent echoes your messages back, configure direction filtering - Ensure `text_message_type_values` includes all message types your agent uses for text responses - Check agent logs for messages with unexpected type values ## Technical Requirements | Requirement | Details | |-------------|---------| | Protocol | `wss://` (TLS-encrypted WebSocket) | | Message format | JSON with configurable paths | | Accessibility | Must be publicly accessible from Coval servers | | Concurrency limit | 8 simultaneous simulations | --- ## SMS Simulations Source: https://docs.coval.dev/guides/simulations/sms Simulate SMS conversations with your agent ## Overview The SMS Simulator enables you to test and evaluate SMS-based AI agents by conducting automated text message conversations. 
It simulates a real customer interacting with your SMS agent, sending messages and receiving responses just as an actual user would. [Image: SMS Conversation] --- ## How It Works 1. **Test Case Delivery**: Each test case from your test set is sent as an SMS message to your configured phone number 2. **Agent Response**: Your SMS agent receives the message and responds 3. **Conversation Flow**: The simulator continues the conversation naturally, responding to your agent's messages as a realistic customer would 4. **Completion**: The conversation ends when the test scenario objective is achieved, the conversation reaches a natural conclusion, or the maximum simulation time (15 minutes) is reached 5. **Evaluation**: The complete message exchange is captured and evaluated against your configured metrics --- ## Setup ### 1. Create an SMS Agent 1. Navigate to **Agents** in your dashboard 2. Click **Create Agent** 3. Under the **Text** section, select **SMS** [Image: SMS Agent Selector] 4. Enter a **Display Name** for your agent (e.g., "Customer Support SMS Bot") 5. Enter your agent's **Phone Number** in E.164 format: - Format: `+[country code][number]` - Example: `+14155551234` (US number) - Example: `+442071234567` (UK number) [Image: SMS Agent Configuration] ### 2. Configure Your Agent (Optional) You can add additional configuration: - **System Prompt**: Context that does not affect the simulation itself but gives Coval better context when generating test sets, workflows, and metrics for this agent - **Attributes**: Custom metadata tags for organizing your agents ### 3. Run an Evaluation 1. Go to **Evaluations** and click **New Evaluation** 2. Select your SMS agent 3. Choose a **Test Set** containing the conversations you want to simulate 4. Select **Metrics** to evaluate (e.g., response quality, task completion) 5. Configure run settings: - **Iterations**: Number of times to run each test case (default: 10) - **Concurrency**: Parallel simulations (default: 5) 6. Click **Launch** --- ## Key Features - **SMS-Optimized**: Responses are kept short and concise to reflect real SMS communication patterns - **Full Transcripts**: Every message is captured with timestamps for detailed analysis --- ## Metrics Compatibility SMS simulations generate text transcripts that are compatible with all standard text-based metrics, including: - Response accuracy - Conversation completion rate - Goal achievement - Custom LLM-judged metrics --- ## Best Practices 1. **Phone Number Format**: Always use E.164 format for phone numbers (e.g., `+1` followed by 10 digits for US numbers) 2. **Use Test Numbers**: Use a test/staging number rather than production during initial testing 3. **Start with Low Concurrency**: Begin with low concurrency to avoid rate limiting on your SMS provider 4. **Design Realistic Scenarios**: Include varied test cases that cover different user intents and edge cases 5. **Keep Messages Concise**: Your agent's responses should be brief—SMS has character limitations and customers expect short messages 6.
**Response Time**: Your agent should respond within a reasonable timeframe for accurate simulation flow --- ## Limitations - Maximum simulation duration: 15 minutes per conversation - Agent must be accessible via standard SMS messaging --- ## WebSocket Source: https://docs.coval.dev/concepts/agents/connections/websocket Connect to WebSocket-based agents for real-time bidirectional communication ## Overview WebSocket connections enable real-time, bidirectional communication with your AI agents. This connection type is ideal for chat-based agents, real-time assistants, or any application that requires persistent, low-latency message exchange. When you connect a WebSocket agent, Coval establishes a secure connection to your endpoint and handles the full conversation flow—sending messages, receiving responses, and evaluating performance. ## Configuration Requirements ### WebSocket Endpoint - **Field**: `endpoint` - **Type**: String (required) - **Purpose**: The WebSocket URL that Coval connects to for simulations - **Format**: Must start with `wss://` (secure WebSocket) - **Example**: `wss://your-api.com/ws/chat` > **Warning:** Only secure WebSocket connections (`wss://`) are supported. Plain `ws://` endpoints will be rejected for security reasons. ### Initialization JSON - **Field**: `initialization_json` - **Type**: JSON object (optional) - **Purpose**: Initial payload sent to your WebSocket endpoint when the connection is established - **Format**: Valid JSON object - **Use Cases**: Session initialization, authentication handshakes, context setup **Example:** ```json { "action": "start_session", "session_type": "simulation", "metadata": { "source": "coval", "test_mode": true } } ``` > **Info:** The initialization JSON is sent immediately after the WebSocket connection is established. Use this to configure your agent's behavior or authenticate the session. ### Authorization Header - **Field**: `authorization_header` - **Type**: String (optional) - **Purpose**: Authentication value sent in the `Authorization` header during the WebSocket handshake - **Format**: Standard authorization header value - **Security**: Stored encrypted and handled securely **Common formats:** - Bearer token: `Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...` - API key: `X-API-Key your-api-key-here` - Basic auth: `Basic base64-encoded-credentials` ### Custom Headers - **Field**: `custom_headers` - **Type**: JSON object (optional) - **Purpose**: Additional HTTP headers sent during the WebSocket handshake - **Format**: Valid JSON object with string key-value pairs - **Use Cases**: Custom authentication, routing headers, client identification **Example:** ```json { "X-Client-ID": "coval-simulation", "X-API-Version": "2024-01", "X-Environment": "production" } ``` ## Setup Instructions 1. **Prepare your WebSocket endpoint** - Ensure your endpoint accepts secure WebSocket connections (`wss://`) - Configure your server to handle the authorization header if authentication is required - Set up message handling for the conversation flow 2. **Create the agent in Coval** - Navigate to [app.coval.dev/agents/create](https://app.coval.dev/coval/agents/create) - Select **WebSocket** as the connection type - Enter your WebSocket endpoint URL (must start with `wss://`) - Add authorization header if your endpoint requires authentication - Configure initialization JSON if you need to send a setup payload - Add any custom headers required by your endpoint 3. 
**Test the connection** - Create a simple test set with a few scenarios - Launch a simulation to verify end-to-end connectivity - Check that messages are being sent and received correctly ## How Simulations Work When you launch a simulation with a WebSocket agent, Coval: 1. **Establishes the connection** — Opens a secure WebSocket connection to your endpoint with the configured headers 2. **Sends initialization payload** — If configured, sends the initialization JSON immediately after connection 3. **Starts the conversation** — Sends the first message based on your test case configuration 4. **Exchanges messages** — Receives agent responses and sends follow-up messages according to the persona behavior 5. **Records the transcript** — Captures the full conversation for evaluation 6. **Runs metrics** — Evaluates the conversation against your configured metrics 7. **Closes the connection** — Cleanly terminates the WebSocket connection ## Message Format Coval sends and expects messages in JSON format with a `content` field containing the message text: **Outgoing message (Coval → Your Agent):** ```json { "content": "Hello, I need help with my order" } ``` **Expected response (Your Agent → Coval):** ```json { "content": "Hi! I'd be happy to help you with your order. Could you please provide your order number?" } ``` > **Note:** Your WebSocket endpoint should respond with JSON messages containing a `content` field. Coval extracts this field to build the conversation transcript. ## Technical Requirements ### Endpoint Requirements | Requirement | Details | |-------------|---------| | Protocol | `wss://` (TLS-encrypted WebSocket) | | Accessibility | Must be publicly accessible from Coval servers | | Response format | JSON with `content` field | | Connection timeout | Connections may be held open for the duration of the simulation | ### Running Locally If your WebSocket server is running locally, you'll need to expose it publicly for Coval to connect. Use a tunneling service like [ngrok](https://ngrok.com): ```bash ngrok http 8080 # Use the generated wss:// URL as your endpoint # Note: ngrok provides wss:// automatically for HTTPS tunnels ``` > **Warning:** Remember to update your agent configuration when your tunnel URL changes. Consider using ngrok's reserved domains for persistent URLs. 
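To sanity-check this message format end to end before connecting your real agent, here is a minimal sketch of a compatible WebSocket endpoint. It assumes FastAPI and uvicorn (any framework that speaks JSON over WebSocket works); the route path and the echo-style reply are illustrative placeholders, not part of Coval's requirements.

```python
# Minimal sketch of a Coval-compatible WebSocket agent (assumes FastAPI + uvicorn).
# Run with: uvicorn agent:app --port 8080, then expose it (e.g. via ngrok) and use
# the resulting host with a wss:// scheme as the agent endpoint.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Coval sends JSON messages with a "content" field. If you configured an
            # initialization JSON, it also arrives here as the first message.
            incoming = await websocket.receive_json()
            user_text = incoming.get("content", "")
            # Placeholder reply logic: swap in your agent's real response generation.
            await websocket.send_json({"content": f"You said: {user_text}"})
    except WebSocketDisconnect:
        # Coval closes the connection cleanly when the simulation ends.
        pass
```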
## Troubleshooting ### Connection Failures **"Invalid WebSocket URL" error** - Verify your endpoint starts with `wss://` (not `ws://`, `http://`, or `https://`) - Check that the URL is properly formatted with no trailing spaces **"Connection refused" error** - Ensure your WebSocket server is running and accessible - Check firewall rules allow inbound connections on the WebSocket port - Verify the endpoint URL is correct **"Authentication failed" error** - Confirm your authorization header value is correct - Check that the header format matches what your server expects - Verify the API key or token hasn't expired ### Message Handling Issues **"No response received" error** - Ensure your agent sends responses in JSON format with a `content` field - Check that your agent is processing and responding to incoming messages - Verify there are no errors in your agent's logs **"Invalid JSON in response" error** - Confirm your agent returns valid JSON - Check for proper encoding of special characters - Ensure the `content` field is a string ### Timeout Issues **Simulation timeouts** - Verify your agent responds within a reasonable time (< 30 seconds per message) - Check for any blocking operations in your agent's message handler - Monitor your agent's resource usage during simulations ## Best Practices 1. **Use persistent connections** — WebSocket agents should maintain the connection throughout the conversation 2. **Handle reconnection gracefully** — If your agent supports it, configure automatic reconnection 3. **Log initialization payloads** — Track the initialization JSON in your server logs for debugging 4. **Implement health checks** — Add a ping/pong mechanism to detect connection issues early 5. **Secure your endpoint** — Always use `wss://` and implement proper authentication --- ## Pipecat Cloud Source: https://docs.coval.dev/concepts/agents/connections/pipecat Connect your Pipecat Cloud agent to Coval to run Voice AI simulations ## Overview The Pipecat Cloud connection lets you run Coval simulations against agents hosted on [Pipecat Cloud](https://docs.pipecat.daily.co). Once connected, Coval can automatically call your Pipecat agent, simulate realistic conversations, and evaluate its performance — no manual testing required. ## Configuration Requirements ### Agent Name - **Field**: `agent_name` - **Type**: String (required) - **Purpose**: The name of the Pipecat Cloud agent that Coval will call during simulations - **Format**: Must exactly match the agent name in your Pipecat Cloud dashboard - **Example**: `"customer-service-agent"`, `"sales-assistant"` ### Pipecat API Key - **Field**: `pipecat_api_key` - **Type**: String (required) - **Purpose**: Authentication key that allows Coval to connect to your Pipecat Cloud agent - **Format**: Valid API key string from your Pipecat account - **Security**: Stored encrypted and handled securely - **Example**: `"pk_live_abc123def456..."` ### Custom Data - **Field**: `custom_data` - **Type**: String (optional) - **Purpose**: Additional context passed to your Pipecat agent at the start of each simulation - **Format**: Valid JSON string - **Use Cases**: Agent-specific parameters, session context, custom settings - **Example**: `{"department": "support", "language": "en", "priority": "high"}` ## Setup Instructions 1. **Deploy your agent to Pipecat Cloud** - Follow the [Pipecat Cloud Quickstart](https://docs.pipecat.daily.co/quickstart) to deploy your agent - Confirm your agent is listed in the Pipecat Cloud dashboard 2. 
**Connect it to Coval** - Go to [app.coval.dev/coval/agents/create](https://app.coval.dev/coval/agents/create) - Enter the exact agent name as it appears in Pipecat Cloud - Input your Pipecat API key - Add any required custom data in JSON format 3. **Run a simulation** - Create a test set with scenarios for your agent - Launch a simulation to verify the connection works end-to-end ## How Simulations Work When you launch a simulation, Coval: 1. Authenticates with Pipecat Cloud using your API key 2. Starts a session with your specified agent (along with any custom data) 3. Simulates a realistic voice conversation using your test set and persona configuration 4. Records the full interaction and runs your configured evaluation metrics ## Troubleshooting **Common Issues:** - **Authentication Failures**: Verify your Pipecat API key is valid and has the correct permissions - **Invalid Custom Data**: Ensure the JSON format is valid --- ## LiveKit Source: https://docs.coval.dev/concepts/agents/connections/livekit Connect to LiveKit real-time communication platform for advanced audio/video agents ## Overview LiveKit connection enables integration with agents built on LiveKit's real-time communication platform for audio, video, and data streaming. This connection type supports both LiveKit Cloud and self-hosted LiveKit deployments. ## Configuration Requirements ### Generate Token Endpoint - **Field**: `generate_token_endpoint` - **Type**: String (required) - **Purpose**: Endpoint for generating LiveKit access tokens - **Format**: Valid HTTPS URL - **Example**: `https://your-api.com/livekit/token` Coval sends a POST request to this endpoint with: ```json { "room_name": "uuid-generated-by-coval", "participant_name": "simulated_user" } ``` Your endpoint should return: ```json { "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", "serverUrl": "wss://your-livekit-server.com", "room_name": "uuid-generated-by-coval" } ``` ### LiveKit URL - **Field**: `livekit_url` - **Type**: String (required) - **Purpose**: LiveKit server WebSocket URL - **Format**: Valid WebSocket URL (wss://) - **Example**: `wss://your-livekit-server.com` > **Note:** If your token endpoint returns `serverUrl` or `server_url`, that value will override this configuration. ### Generate Token Headers - **Field**: `generate_token_headers` - **Type**: String (optional) - **Purpose**: HTTP headers for token generation requests - **Format**: Valid JSON string - **Example**: `{"Authorization": "Bearer your-api-key", "Content-Type": "application/json"}` ### Sandbox ID - **Field**: `sandbox_id` - **Type**: String (optional) - **Purpose**: LiveKit Cloud sandbox identifier for automatic agent dispatch - **When to use**: Only required when using LiveKit Cloud's managed sandbox feature - **When to skip**: Leave empty if self-hosting LiveKit or using your own agent dispatch system > **Info:** **Sandbox ID is optional for most users.** It's a LiveKit Cloud-specific feature for automatic agent dispatch. If you're running your own LiveKit server or managing agent dispatch yourself, you don't need this field. 
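For reference, here is a minimal sketch of a token generation endpoint that satisfies the request/response contract described under Generate Token Endpoint above. It assumes FastAPI and the `livekit-api` Python SDK; the route path, environment variable names, and the dispatch comment are illustrative, since agent dispatch depends on how you run your agent.

```python
# Minimal sketch of a LiveKit token endpoint for Coval simulations.
# Assumes FastAPI and the livekit-api Python SDK; env var names are illustrative.
import os

from fastapi import FastAPI
from livekit import api
from pydantic import BaseModel

app = FastAPI()

class TokenRequest(BaseModel):
    room_name: str          # UUID generated by Coval for each simulation
    participant_name: str   # "simulated_user"

@app.post("/livekit/token")
async def generate_token(req: TokenRequest):
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity(req.participant_name)
        .with_grants(api.VideoGrants(room_join=True, room=req.room_name))
        .to_jwt()
    )
    # Your own dispatch mechanism should also send your agent into req.room_name,
    # so it is already joining when Coval connects with this token.
    return {
        "token": token,
        "serverUrl": os.environ["LIVEKIT_URL"],  # e.g. wss://your-livekit-server.com
        "room_name": req.room_name,
    }
```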
### LiveKit Agent Name - **Field**: `livekit_agent_name` - **Type**: String (optional) - **Purpose**: Name identifier for the LiveKit agent - **Format**: String identifier - **Example**: `"voice-assistant"`, `"video-agent"` ### LiveKit Agent Metadata - **Field**: `livekit_agent_metadata` - **Type**: String (optional) - **Purpose**: Additional metadata for the LiveKit agent - **Format**: String or JSON string - **Example**: `"{'role': 'assistant', 'capabilities': ['voice', 'video']}"` ### Custom Payload Fields - **Field**: `token_request_payload` - **Type**: String (optional) - **Purpose**: Additional fields to include in the token request - **Format**: Valid JSON string - **Example**: `{"agent_variant": "sales", "language": "en"}` These fields are merged with `room_name` and `participant_name` when calling your token endpoint. ## Setup Instructions 1. Set up a token generation endpoint that accepts POST requests 2. Configure your endpoint to return tokens with the correct room permissions 3. Enter your LiveKit server WebSocket URL 4. (Optional) Add authentication headers if your endpoint requires them 5. Test the connection by launching a simulation ## Technical Details ### Token Generation Flow 1. Coval generates a unique room name (UUID) for each simulation 2. Coval sends a POST request to your token endpoint with the room name 3. Your endpoint generates a LiveKit JWT token with room access permissions 4. Your endpoint should also dispatch your agent to join the same room 5. Coval joins the room using the returned token 6. Coval waits for your agent to join (`on_first_participant_joined` event) 7. Conversation simulation begins ### Accepted Response Field Names Coval accepts multiple field name variations for flexibility: | Data | Accepted Field Names | |------|---------------------| | Token | `token`, `participantToken`, `accessToken`, `participant_token`, `access_token` | | Server URL | `serverUrl`, `server_url` | | Room Name | `roomName`, `room_name` | ## Troubleshooting ### Common Issues **Token Generation Failures** - Check endpoint accessibility and authentication - Verify your endpoint returns valid JSON with a `token` field - Ensure HTTPS is properly configured **WebSocket Connection Errors** - Verify LiveKit server URL starts with `wss://` - Check that `serverUrl` is included in your token response - Confirm your LiveKit server is running and accessible **Agent Not Joining Room** - Ensure your agent dispatch system receives the `room_name` from token requests - Verify your agent is running and connected to LiveKit - Check that the token grants access to the correct room **Simulation Timeouts** - Coval waits for your agent to join before starting - If your agent doesn't join, the simulation will timeout - Check your agent logs for connection errors **"No token found in response" Error** - Verify your response includes a recognized token field name - Check that the token value is a non-empty string - Ensure response Content-Type is `application/json` ### Running Components Locally Coval's servers need to reach your **token endpoint** and **LiveKit server** to run simulations. Here's what needs to be publicly accessible: | Component | Must be public? 
| Why | |-----------|-----------------|-----| | Token endpoint | Yes | Coval calls it to get access tokens | | LiveKit server | Yes | Coval connects via WebSocket | | Your LiveKit agent | No | Only needs outbound connection to LiveKit | **If running your token server locally:** Use a tunneling service like [ngrok](https://ngrok.com) to expose it: ```bash ngrok http 8888 # Use the generated https:// URL as your token endpoint ``` > **Note:** Your agent can run on your local machine without any tunneling—it just connects outbound to the LiveKit server like any other client. --- ## Personas Source: https://docs.coval.dev/concepts/personas/overview Generate realistic simulated personas that reflect your end-users. Personas define the characteristics of the simulated user interacting with your agent. Configure voice, accent, behavior, and more to match your real user base. ## Creating a Persona 1. Navigate to the Personas section 2. Click "Create New Persona" 3. Configure the persona settings ![Persona Configuration](/images/personas/persona-config-dec-25.png) ## Configuration Options ### Avatar Customize the persona's visual representation: - Select from various hair, eye, and lip styles - Regenerate avatar seed for a new base face ### Persona Label Display name for the persona (required). ### Persona Characteristics Define the persona's demographics, personality, and communication style (required). Use the expand button (Shift+E) for a full-screen editor. ### Voice Configuration | Setting | Description | |---------|-------------| | **Voice** | Select from available voices with gender options. Preview available for each voice. | | **Language & Accent** | Choose language and regional accent. Supported languages include English, Spanish, French, French (Canada), German, Italian, Japanese, Korean, Portuguese, and Russian. | | **Background Noise** | Add ambient noise to simulate real-world calling environments. Volume is adjustable with a slider. | **Available Voices** Voices fall into two categories based on realism and concurrency. Higher-realism voices sound more natural and expressive but have a concurrency limit of approximately 12 simultaneous connections. For high-volume simulation runs, use higher-concurrency voices to avoid bottlenecks. **Higher Concurrency (20 voices)** | Voice | Gender | |-------|--------| | Aria | Female | | Ashwin | Male | | Autumn | Female | | Brynn | Female | | Callum | Male | | Caspian | Male | | Corwin | Male | | Darrow | Male | | Delphine | Female | | Dorian | Male | | Elara | Female | | Kieran | Male | | Lysander | Male | | Marina | Female | | Naveen | Male | | Orion | Male | | Rowan | Male | | Skye | Female | | Soren | Male | | Vera | Female | **Higher Realism (7 voices) — Limited Concurrency** | Voice | Accent | |-------|--------| | Alejandro | Latin America | | Angela | American | | Erika | American | | Harry | American | | Mark | American | | Monika | American | | Raju | Indian | > **Warning:** **Concurrency limit:** Higher-realism voices support a maximum of approximately 12 simultaneous connections. If you are running a large volume of simulations, these voices can become a bottleneck. Use higher-concurrency voices for high-volume runs. 
**Available Background Sounds** | Sound | Description | |-------|-------------| | **Off** | No background noise (default) | | **Office** | Office ambience | | **Lounge** | People in a lounge | | **Crowd Talking** | Crowd conversation noise | | **Airport Boarding** | Airport boarding announcements | | **Bus Interior** | Inside a bus | | **Kids Playing** | Playground sounds | | **Doorbell** | Doorbell ringing | | **Train Arrival** | Train station arrival sounds | | **Portable AC** | Air conditioner hum | | **Skatepark** | Skatepark ambience | | **Small Dog Bark** | Small dog barking | | **Cafe** | Cafe ambience | | **Ferry Announcement** | Ferry and PA announcements | | **Heavy Rain** | Heavy rainfall | | **Moderate Wind** | Wind sounds | | **Newborn Baby Crying** | Baby crying | | **Office with Alarm** | Office with alarm going off | | **Street with Sirens** | Street traffic with sirens | | **Construction Work** | Construction site noise | ### Conversation Initiator | Option | Behavior | |--------|----------| | **Persona waits to speak** | Waits for the agent to speak first. | | **Persona speaks first** | Persona initiates the conversation. | ### Interruption Rate Controls how often the persona proactively interrupts the agent during a conversation. This simulates impatient or talkative callers who don't wait for the agent to finish speaking. | Option | Behavior | |--------|----------| | **None** | The persona never proactively interrupts the agent (default). | | **Low** | The persona occasionally interrupts (roughly every 90 seconds). | | **Medium** | The persona interrupts at moderate frequency (roughly every 45 seconds). | | **High** | The persona frequently interrupts (roughly every 30 seconds). | > **Info:** **Note on natural turn-taking:** Even with Interruption Rate set to None, you may observe occasional overlapping speech between the persona and agent. This is expected behavior caused by natural voice conversation turn-taking, where the speech-to-text engine detects a pause in the agent's speech and the persona begins responding before the agent has fully finished. This is distinct from proactive interruptions and reflects realistic phone conversation dynamics. > > To minimize this, add instructions in your persona prompt like: "Always wait for the agent to completely finish speaking before responding." > > See [Interruption Behavior](#interruption-behavior) below for more details. ### Multi-Language STT Enable multilingual speech recognition so the persona can accurately hear and respond to agents that speak multiple languages in the same conversation (e.g. "For English press one, Para español presione dos"). Found under **Advanced** in the persona configuration modal. | Setting | Description | |---------|-------------| | **Off** (default) | Speech recognition is set to the persona's configured language for best single-language accuracy. | | **On** | Speech recognition accepts all supported languages simultaneously. Supports English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. | > **Tip:** If your agent starts with a multi-language greeting or IVR menu, create a persona with Multi-Language STT enabled. You can clone an existing persona and toggle it on — this lets you test the same agent with both single-language and multi-language speech recognition. ### Silent Mode When enabled, the persona remains completely silent throughout the conversation. The persona will not respond to anything the agent says. 
This is useful for testing how your agent handles unresponsive callers, dead air, or scenarios where the caller has put their phone down. When silent mode is enabled, all other behavioral settings (background sound, interruption rate, conversation initiator) are automatically disabled. ### Caller Phone Number Configure phone number routing for voice simulations. See phone number mappings below. > **Info:** **Caller Phone Number** for Voice Simulations: > > Coval uses different phone numbers depending on the simulation type. Assign a specific phone number index to a persona if your workflow depends on phone number routing. > > ## Inbound Voice Simulations > > For **inbound** simulations (Coval calls your agent), assign up to 29 phone numbers to a persona. > > **View Available Inbound Phone Number Mappings** > > | Index | Phone Number | > |-------|--------------| > | 1 | +16504471573 | > | 2 | +16506400392 | > | 3 | +16506329775 | > | 4 | +16505360811 | > | 5 | +16505360576 | > | 6 | +15418450089 | > | 7 | +15412194880 | > | 8 | +14157181081 | > | 9 | +14157180538 | > | 10 | +14157180269 | > | 11 | +14153765034 | > | 12 | +14069058267 | > | 13 | +14066920094 | > | 14 | +14064159042 | > | 15 | +14063022479 | > | 16 | +14063022353 | > | 17 | +17182801764 | > | 18 | +17182858503 | > | 19 | +17182859858 | > | 20 | +17183051836 | > | 21 | +17187195385 | > | 22 | +17187195407 | > | 23 | +15342172296 | > | 24 | +15342172366 | > | 25 | +15342172371 | > | 26 | +15342172387 | > | 27 | +19855295712 | > | 28 | +19858539008 | > | 29 | +19858539188 | > > > > ## Outbound Voice Simulations > > For **outbound** simulations (your agent calls Coval's simulated user), select a phone number for the persona to receive calls on. > > **View Available Outbound Phone Number Mappings** > > | Index | Phone Number | > |-------|--------------| > | 1 | +14158734019 | > | 2 | +17199853850 | > | 3 | +17199853656 | > | 4 | +17196219208 | > | 5 | +17194630332 | > | 6 | +17194630202 | > | 7 | +17194630116 | > | 8 | +17194510465 | > | 9 | +16309315617 | > | 10 | +16309190593 | > | 11 | +16306014871 | > | 12 | +16305857118 | > | 13 | +16305526080 | > | 14 | +16305222063 | > | 15 | +16304468895 | > | 16 | +12624037199 | > | 17 | +12623988133 | > | 18 | +12622149045 | > > ## Advanced Configuration ### Emotional Voice Simulation Emotional tone in voice simulations is controlled through the **Persona Characteristics** prompt. These instructions guide the persona's dialogue generation, shaping word choice, sentence structure, and phrasing to convey emotion. The text-to-speech engine then speaks that text. > **Info:** **How emotion works in voice simulations:** The persona prompt controls what *text* the persona generates, not the voice itself. For example, instructing the persona to "be impatient" results in shorter sentences, more direct language, and frustrated phrasing. The TTS engine does not have direct emotion controls — it speaks whatever text the persona produces. Emotional impact comes from the words and sentence structure, not from changes in vocal tone or volume. #### Best Practices for Emotional Personas **Be specific and descriptive.** Instead of generic labels, describe the emotional behavior in terms of word choice and conversational patterns: ``` // Less effective You are an angry customer. // More effective You are extremely frustrated and losing patience. You use short, clipped sentences. 
When you have to repeat information you've already provided, you use phrases like "I already told you this" and "This is unacceptable." Your language becomes sharper and more aggressive as the conversation goes on if the agent cannot resolve your issue quickly. ``` **Use punctuation to signal emotion.** The text-to-speech engine interprets punctuation as speech cues: - Exclamation marks (`!`) convey urgency or emphasis - Commas create natural pauses and hesitation - Dashes (`-`) create brief breaks - Short sentences convey impatience or stress - Question marks with exclamation marks (`?!`) convey disbelief > **Warning:** Avoid using ellipses (`...`) for pauses. Some TTS engines read them aloud as "dot dot dot." Use commas or dashes instead to create natural pauses. **Include emotional progression.** Real callers escalate or de-escalate: ``` You start the call calmly but become increasingly frustrated if the agent asks you to repeat information or puts you on hold. If the agent resolves your issue, your tone should soften. If the agent is dismissive, you become more insistent and demand to speak to a supervisor. ``` #### Voice Selection for Emotional Scenarios Higher-realism voices generally produce better emotional expressiveness. If emotional nuance is important for your test scenarios, consider selecting a higher-realism voice for your persona. > **Warning:** Higher-realism voices have a concurrency limit of approximately 12 simultaneous connections. For high-volume simulation runs where emotional expressiveness is less critical, use higher-concurrency voices to avoid bottlenecks. #### Example Emotional Personas **Stressed Customer** ``` You are a customer under significant time pressure. You are calling during your lunch break and need this resolved quickly. Keep your responses very short and direct. You mention the time frequently and say things like "Can we speed this up?" and "I really don't have much time." If the agent asks unnecessary questions, respond with impatience: "Is that really necessary right now?" ``` **Impatient Elderly Caller** ``` You are an older adult who is not comfortable with technology. You are calling because you cannot figure out the website. You are somewhat impatient and repeat yourself when you feel you're not being understood. You become flustered when given too many steps at once and say things like "That's too many things at once" or "Can you just do it for me?" You occasionally go on brief tangents about how things used to be simpler. ``` **Upset but Polite** ``` You are disappointed with the service you received but remain polite throughout the call. You express frustration through pointed questions rather than harsh language. Use phrases like "I'm quite disappointed" and "I was really expecting better." You give the agent a fair chance to resolve the issue but make it clear that your patience has limits. ``` ### Filler Words and TTS Behavior When configuring personas to use filler words like "um", "uh", or "hmm", the way these words are written in the persona's speech directly affects how the text-to-speech engine pronounces them. Text-to-speech engines process text literally. Unusual spellings or excessive repeated letters can cause the engine to spell out letters individually, read punctuation marks aloud, or mispronounce unfamiliar character sequences. 
#### TTS-Friendly Filler Words Use these standard spellings, which are recognized by text-to-speech engines: | Use This | Avoid This | Why | |----------|-----------|-----| | `um` | `ummm`, `ummmm` | Extra letters may be spelled out | | `uh` | `uhhh`, `uhhhhh` | Extra letters may be spelled out | | `hmm` | `hmmmmm`, `hmmmmmm` | Extra letters may be spelled out | | `oh` | `ohhh`, `ohhhh` | Extra letters may be spelled out | | `ah` | `ahhh`, `ahhhh` | Extra letters may be spelled out | | `well,` | `well...` | Ellipses may be read as "dot dot dot" | | `so,` | `so...` | Ellipses may be read as "dot dot dot" | | `you know,` | `you know...` | Ellipses may be read as "dot dot dot" | #### Recommended Persona Prompt for Filler Words Include explicit TTS-friendly instructions in your persona prompt: ``` You use natural filler words in conversation. When hesitating, use only these words: "um", "uh", "hmm", "oh", "well". Write them as single short words. Use commas for pauses instead of ellipses or repeated letters. Example: "Um, I think the order number is, uh, let me check, it's 12345." ``` ### Conversation Triggers You may want the persona to remain silent until the agent says a specific word or phrase, such as waiting for a greeting before starting to speak. The persona's behavior is driven by the instructions in the persona prompt. You can instruct the persona to wait for specific phrases, but because the underlying language model is probabilistic, adherence is not 100% deterministic. #### Maximizing Trigger Reliability 1. **Set the Conversation Initiator** to "Persona waits to speak" so the agent always speaks first. 2. **Use strong, repeated language** in the persona prompt: ``` CRITICAL INSTRUCTION: You MUST remain completely silent until the agent says "How can I help you today?" Do not speak. Do not respond to any other greeting or introduction. Wait specifically for the phrase "How can I help you today?" before saying anything. Any other phrase like "How may I assist you?" or "What can I do for you?" should NOT trigger your response. Continue waiting silently. ``` 3. **Keep the trigger phrase simple and distinctive.** Shorter, more common phrases are easier for the persona to reliably detect. 4. **Include fallback behavior** for cases where the exact phrase doesn't appear: ``` If the agent does not say "How can I help you today?" within the first 30 seconds, you may begin speaking with your objective. This prevents the conversation from stalling entirely. ``` > **Warning:** Conversation triggers provide high but not perfect consistency. For mission-critical trigger behavior, consider running multiple simulations to account for natural variation. ### Interruption Behavior Voice simulations involve two distinct types of interruption behavior: #### Proactive Interruptions The **Interruption Rate** setting (None, Low, Medium, High) controls whether the persona deliberately interrupts the agent on a timer. When set to None, the persona never proactively talks over the agent. | Setting | Behavior | |---------|----------| | **None** | No proactive interruptions | | **Low** | Interrupts approximately every 90 seconds | | **Medium** | Interrupts approximately every 45 seconds | | **High** | Interrupts approximately every 30 seconds | #### Natural Turn-Taking Overlap Even with Interruption Rate set to None, you may observe the persona starting to speak while the agent is still talking. This is caused by **natural voice turn-taking**, not proactive interruptions. 
In real phone conversations, speakers rely on pauses, intonation changes, and context to determine when the other person has finished speaking. The simulation's speech-to-text engine detects pauses in the agent's speech and may interpret a brief pause as end-of-turn, causing the persona to begin responding before the agent has fully finished. This behavior is realistic and expected in voice simulation testing, as it mirrors how real callers sometimes talk over agents. #### Reducing Turn-Taking Overlap If you need the persona to be more patient and avoid any overlap: 1. **Add explicit waiting instructions** to the persona prompt: ``` Always wait for the agent to completely finish their thought before responding. Take a brief pause after the agent stops speaking before you begin your response. If you hear the agent start speaking again, stop immediately and let them finish. ``` 2. **Use longer, more deliberate speech patterns** in the persona characteristics to naturally slow the response: ``` You are a thoughtful, patient caller who considers the agent's words carefully before responding. You take a moment to think before speaking. ``` ## Personas vs. Test Sets Personas and test sets serve distinct purposes and work together in simulations. ### Personas: Define HOW to Behave Personas establish behavioral traits applied across multiple test sets: - "You are polite and friendly, respond in short sentences." - "You speak slowly with natural pauses like 'uhh' and 'umm'." - "You are impatient and frequently interrupt." ### Test Sets: Define WHAT to Do Test sets contain specific instructions for the conversation: - "Call to get a refund for order #12345" - "Ask for PTO from March 21st to 22nd" - "Inquire about account balance" ### Why Keep Them Separate? **Reusability**: Apply one persona to multiple test sets, or test one scenario with multiple personas. **Comparison Testing**: Run the same test set across different personas to evaluate agent handling of various user types. **Easier Maintenance**: Update behavioral traits in one place without affecting test scenarios. ## Best Practices **Recommended:** ``` Persona: "You are a friendly customer who speaks in short sentences." Test Set: "Call to cancel your subscription." ``` **Avoid mixing behavioral traits with task instructions:** ``` Test Set: "You are a rude customer who wants to cancel subscription and argues about fees." ``` ## Custom Persona Prompts Include in your custom persona prompt: - **DTMF/IVR handling**: Navigation instructions for phone menus - **Speech style**: Filler words, response patterns - **Information flow**: When to provide or withhold information - **Call ending triggers**: Conditions for hanging up ### Voice Persona Example ``` You are a customer calling support. - WAIT for all options before proceeding - Use dtmf tool to select menu options - Remain silent during IVR navigation - Natural responses with occasional pauses - Only respond when directly asked Hang up if transferred to a human agent. ``` ### Chat Persona Example ``` - Wait for the automated greeting before typing - Respond naturally to prompts - Natural chat language with occasional typos - Concise responses unless asked for details ``` ## Template Strategy 1. Create persona variations for different user types 2. Create focused test sets for specific workflows 3. Combine in templates for comprehensive testing 4. 
Analyze results across user personalities > **Tip:** Start with 2-3 core personas (Polite Customer, Impatient Customer, Technical Customer) and build test sets around common workflows. --- ## Test Sets Source: https://docs.coval.dev/concepts/test-sets/overview Tell our simulated users what to do, say, and how to behave. A **Test Set** is a structured collection of **test cases** designed to evaluate specific functionalities, workflows, or scenarios in your project. Each test set can contain multiple test cases, and simulations/evaluations will analyze the aggregate results of all test cases within the set. # How to Generate a Test Set ## Quick Start 1. **Enter your test scenario** in the input box. 2. **(Optional) Add extra context**: - Attach files (such as text, JSON, or markdown) - Choose an agent to evaluate - Pick a relevant category from those suggested 3. **(Optional) Add metadata**: - Define metadata fields to extract from each test case. - Example: key: "ticket_number", description: "X-###" will generate entries like "X-001" per test case - Example: key: "destination", description: "enter a possible airport code the user is flying to" will generate entries like "SFO" 4. **Submit** using the arrow button or by pressing Enter. 5. **Review and modify** your test set in the test set editor. ## Alternative Options - **Upload from file**: Use "Upload from file" to import CSV/Excel test cases - **Manual mode**: Use "Use manual creation mode" to create a blank test set and add cases yourself > **Tip:** **Tips for better test cases:** > - Be specific in your description for better test cases > - Attaching agent prompts or documentation helps generate more relevant tests > - You can edit, add, or remove test cases after generation ## Uploading from CSV/Excel Import test cases in bulk by uploading a properly formatted CSV or Excel file. ### Column Structure **Input column** (required): The test case input or prompt. This column is case-insensitive and must be present in your file. **Expected behaviors column**: Expected behaviors for the test case. Parsing rules (applied during test-set ingest/validation): - **JSON array**: `["behavior1", "behavior2"]` - parsed as an array of behaviors - **Comma-separated string**: `"behavior1,behavior2"` - split by comma into multiple behaviors - **Single string**: `"behavior1"` - treated as a single behavior string **Type column**: Test case type. Case-insensitive. Accepts: - `SCENARIO` - `TRANSCRIPT` **Metadata column**: JSON object containing test case metadata **Agent IDs column**: Agent IDs to associate with the test set. Test-set level (applies to all test cases): - **JSON array of strings**: `["agent-id-1", "agent-id-2"]` - parsed as an array of agent ID strings - **Comma-separated string**: `"agent-id-1,agent-id-2"` - Values are trimmed and empty values are filtered out - Uses the first non-empty value found in the file (since `agent_ids` applies to the whole test-set) **Knowledge base column**: Knowledge base entries to attach to test cases: - **JSON array of objects**: `[{"id": "entry-id-1", "type": "web_url"}, {"id": "entry-id-2"}]` - each object can have `id` (required) and `type` (optional) - **JSON array of strings**: `["entry-id-1", "entry-id-2"]` - treated as entry IDs with default type - **Comma-separated string**: `"entry-id-1:web_url,entry-id-2,entry-id-3:pdf"` - Each entry can be formatted as 'id:type' or just 'id' with a default type - **Single string**: `"entry-id-1"` - treated as an entry ID with a default type - Accepted `type` values: - `web_url` (default) - `plain_text` - `json` - `zendesk` - `shelf` - `file` **Additional columns**: Any additional column headers will automatically be treated as metadata fields ### File Requirements Your file must meet the following criteria: - Accepted formats: `.csv` or `.xlsx` - Maximum file size: 10MB - First row: Must contain column headers (case-insensitive) - Empty rows: Automatically skipped during import - Validation: Rows with empty input values are filtered out > **Warning:** Ensure your file doesn't exceed 10MB and contains at least one row with a valid `input` value. # Understanding Test Cases ## Test Case Input ### 1. Scenarios Define specific tasks or behaviors for your simulated user. Use quotation marks for exact phrases you want them to say. Examples: - Simple task: "Call to get a refund" - Complex scenario: "First, ask for PTO from the 21st to the 22nd of March. After receiving a confirmation, ask to change to the 20th to 22nd. During the verification, share your email address as 'emily [at] gmail [dot] com'. Then, proceed to correct yourself with 'oh no - it's actually emily [dot] marc [at] gmail [dot] com'." The more detailed your scenario, the more precisely our simulated user will follow it. ### 2. Transcript Recreate specific conversations using OpenAI transcript format. The agent will follow the user's part of the transcript as closely as possible. Format example: ```json [ { "role": "assistant", "content": "Welcome to X Restaurant. How may I assist you today?" }, { "role": "user", "content": "I would like to order some pizza." } ] ``` ### 3. Audio Upload Upload a pre-recorded audio file containing the persona's side of the conversation (right channel) to use during a voice simulation. Instead of the persona generating responses with an LLM and TTS, the uploaded audio plays back exactly as recorded — making the test fully deterministic. Supported formats: `.wav`, `.mp3` (max 200 MB, duration 5 seconds – 1 hour). **How it works:** 1. In the test set editor, select **Audio** as the input type and upload your audio file containing the persona's speech (right channel only) 2. The file is played back during simulation in place of LLM-generated persona speech 3. The uploaded audio is automatically transcribed so persona turns still appear in the transcript 4. After the audio finishes playing, the simulation waits 30 seconds for the agent to finish responding, then ends the call > **Tip:** Audio upload test cases are ideal for regression testing — record a specific caller interaction once, then replay it across agent updates to detect regressions in handling.
#### Ground Truth Transcript To measure your agent's STT accuracy against a known-correct transcript of the uploaded audio, you can provide a ground truth transcript in two ways: **Via the UI** — when uploading an audio file, the modal includes a ground truth transcript field where you can either paste the transcript as plain text or upload a `.txt` or `.json` file. **Via metadata** — add a `ground_truth_transcript` key to the test case metadata directly. Either method enables the [STT Word Error Rate (Audio Upload)](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) metric, which compares your agent's speech-to-text output against this reference text. The ground truth can be plain text, labeled text with timestamps and role labels, or a JSON object with a `messages` array. ### 4. Script Define an ordered list of exact lines for the persona to deliver, turn by turn. The persona follows the script exactly rather than generating responses with an LLM — while still using the configured persona voice and background sounds. Example script turns: 1. "Hi, I'd like to check my account balance." 2. "Yes, my account number is 12345." 3. "Thank you, goodbye." **How it works:** 1. In the test set editor, select **Script** as the input type 2. Add ordered turn texts in the script editor (each turn is one persona utterance) 3. During simulation, the persona delivers each line in order instead of generating LLM responses 4. A divergence detector monitors agent responses — if the agent diverges significantly from the expected flow, the simulation can end early with a `SCRIPT_DIVERGED` reason 5. After the last scripted turn is delivered, the agent gets one final response before the simulation ends with a `SCRIPT_COMPLETED` reason > **Tip:** Script test cases give you deterministic persona speech output while still exercising the full voice pipeline (TTS, turn-taking, background noise). Use them when you need control over exactly what the persona says but still want realistic audio delivery. ## Test-Case Specific Evaluation Expected Behavior and Metadata allow you to use test-case-specific data to evaluate how the agent responds to a specific test case. ## Test Case Expected Behavior The expected behavior dictates how your agent should respond to the user's requests. Examples: - "the agent should ask the user for their phone number" - "the agent should repeat the phone number back to the user" Use the **Composite Evaluation** metric to evaluate whether the agent followed the expected behaviors. Configure it with **From Test Case** as the criteria source to automatically pull behaviors from each test case. With **Percentage of Criteria Met** reporting, the example above would return 0.5 if the agent asks for the phone number but does not repeat it back. ## Test Case Metadata These fields can be used to store specific metadata about a test case. This is helpful when you want to create a metric that might reference a specific aspect of the test case. You can input metadata as key/value pairs or as JSON. Example: Imagine an airline help desk where the test case contains this metadata: ```json { "source": "LAX", "destination": "SFO" } ``` Then, you can write, for example, a binary Destination Identification Metric with the question: Did the agent correctly identify the destination as: `{{test_case.destination}}`?
## Recommended Test Set Types > **Info:** For comprehensive testing, create multiple types of test sets: > > - **Regression Set**: Contains "happy path" scenarios representing typical successful interactions > - **Adversarial Set**: Contains edge cases and scenarios designed to test your agent's limits and handling of unusual requests ## Utilizing Agent Attributes In your agents, you can set specific attributes associated with that agent. You can embed these agent attributes into your scenarios with this format: `{{agent.attribute_name}}` Example: Imagine one agent has the attribute **location** with a value "San Francisco", and another agent has the value "London". Embed those agent attributes in your scenarios and expected behaviors like this: Scenario: You are a user calling for travel recommendations in `{{agent.location}}` Expected Behavior: The agent should only give travel recommendations in `{{agent.location}}` ## Test Cases vs. Personas - **Persona**: Defines **how** to behave - Characteristics (friendly, angry) - Voice configuration - Can be assigned multiple test cases - **Test Case**: Defines **what** to do - Specific tasks or scenarios - Can be assigned to any persona --- ## Metrics Source: https://docs.coval.dev/concepts/metrics/overview Understand and analyze your AI agents' performance with Coval's comprehensive metrics ## What is a metric? Metrics give you quantitative insights into your agent's performance, allowing you to see red flags early and understand overall trends. Each metric assesses your agent in a different way. **Audio** metrics use recordings, either simulated or live, to detect interruptions, measure phonemes per second, assess latency, and more. **LLM Judge** metrics provide answers to specific questions you have about your transcripts, allowing you to dial in on your unique specifications. LLM Judge metrics can optionally include **Trace Context** — when enabled, the judge automatically receives a summary of the agent's OpenTelemetry spans alongside the transcript, enabling evaluation of tool usage, execution order, and behavior that isn't visible in the transcript alone. Other offerings include **Sentiment Analysis**, **Regex Matching**, and many more. While Coval provides built-in metrics (latency, accuracy, tool-call effectiveness, instruction compliance), you can create custom metrics tailored to your specific needs. All out-of-the-box metrics are marked as “Built-in” in your Metrics list. These metrics can be applied to Simulated Conversations as well as Live-Monitoring Conversations. ## Recommended Metrics These are the metrics we usually recommend starting with, if you want to use built-in metrics as a starting point: - **Conversational LLM Judge (Binary) Metrics:** - Composite Evaluation - Agent Repeats Itself - End Reason - **Audio-Metrics:** - Latency - Interruption Rate - Speech Tempo - Natural Non-Robotic Tone Detection - Volume/Pitch Misalignment - **Other:** - Workflow Verification: - You can generate a workflow in the Agent creation flow, this metric will re-trace the workflow in the transcript and detect off-path behavior. ## Advanced Metrics: For when you try to evaluate specific parts of the conversation: - **Binary Tool Call metrics**: - Check if your tool calls (functions) have been performed correctly - **Audio LLM Judge:** - Ask an LLM Judge question and, instead of evaluating the transcript, we'll evaluate the audio. (e.g. 
"Did the assistant stutter?") - **Categorical metrics**: - Define a set of categories/topics to filter topics of your conversations (good for exploratory call analysis) - **Transcript Regex Match:** - A metric that performs regex pattern matching on conversation transcripts. Returns 1 for a match and 0 for no match. You can filter by speaker role, check only the first or last message, require that a pattern is absent (for compliance rules like “agent must not say X”), and enable case-insensitive matching. Ideal for exact phrase detection, compliance checks, and format validation without needing LLM calls. - **Numerical LLM Judge:** - A metric that uses an LLM judge to evaluate a prompt and output a numerical score. - **Tool Call Latency:** - Used to measure the latency of tool calls. - **Metadata Field Metric (Monitoring-only):** - If you send metadata as part of your transcripts to evaluate with Coval, this metric will take the specific metadata field's value and output that result as a metric result. Supports string, float, and boolean field types. - **Custom Trace Metric:** - Extract a specific numerical value from your agent's OpenTelemetry spans and aggregate it (average, median, p90, max, min) across all matching spans in a simulation. Use this to track custom latency signals, confidence scores, tool call durations, or any other numerical attribute your agent emits. See the [Custom Trace Metrics guide](/concepts/metrics/custom-trace-metrics) for details. - **Custom**: If you have your own metrics that you want to upload to the Coval platform to run next to our built-in metrics, let us know. \_Note: This is just an excerpt of Coval’s built-in Metrics. More metrics can be found in the Metrics overview list on the platform. \_ ## Guide to Creating Binary LLM Judge Metrics When creating metrics that use an LLM to evaluate performance: - Be precise in your descriptions - Always refer to the agent as "the assistant" for clarity - Provide clear guidance on evaluation criteria > **Info:** **Example: "Avoid Unresponsiveness" Metric:** > > _Given the transcript, did the assistant maintain responsiveness by acknowledging all user inputs and avoiding behaviors that make the user question whether the assistant is still present?_ > > _Return YES if:_ > > _• The assistant responds promptly and appropriately to all user inputs_ > > _• There are no long silences, skipped questions, or ignored user messages_ > > _• The user does not need to ask "Are you still there?" or similar prompts_ > > _• If the assistant is uncertain or processing, it states that clearly (e.g., "Let me check that for you")_ > > _Return NO if:_ > > _• The assistant fails to respond to a user input_ > > _• The user asks "Are you still there?" or expresses concern about being ignored_ > > _• The assistant gets stuck or goes silent without explanation_ ### Improve your Metrics To refine a metric, open it from the metrics list and click “Improve Metric.” Select a test set (must be a transcript—tip: copy/paste a simulated transcript into a new set). You can then iterate on the metric’s formulation and see how often it returns YES vs. NO. This helps reduce noise and non-determinism in LLM-judge metrics. ### Custom Metrics > **Info:** Need custom metrics tailored to your needs? Contact us, and we’ll create them > for you. 
--- ## Built-in Metrics Source: https://docs.coval.dev/concepts/metrics/built-in-metrics Comprehensive guide to Coval's pre-built metrics for evaluating agent performance # Built-in Metrics Coval provides a comprehensive suite of built-in metrics to evaluate your AI agent's performance across multiple dimensions. These metrics are ready to use out of the box and cover audio quality, conversation flow, response timing, and more. ## Audio Quality Audio metrics evaluate the quality and characteristics of speech output from your agents. These are essential for voice-based applications and provide comprehensive analysis of audio fidelity, conversation flow, and speech characteristics. > **Note:** All metrics in this section require audio input to function properly. They > will not work with text-only transcripts. ### Background Noise **Purpose**: Measurement of audio clarity and background noise. **What it measures**: Ratio between speech signal strength and background noise, signal noise ratio (SNR). **When to use**: Audio quality assessment, identifying poor recording conditions. **How it works**: Compares the strength of speech signals against background noise by analyzing speech and silent (room tone) segments separately. The metric calculates the ratio between signal and noise power levels, providing both overall and segment-by-segment quality assessments. **How to interpret**: - SNR above 20 dB indicates excellent audio clarity. - SNR between 10-20 dB is acceptable for most applications. - SNR below 10 dB may significantly impact speech recognition and comprehension. ### Natural Non-robotic Tone Detection **Purpose**: Analysis of audio frequency characteristics highlighting overtones and harmonics which give human voices their natural timbre. **What it measures**: Frequency distribution, dominant frequencies, and natural voice characteristics. **When to use**: This metric helps detect synthetic or robotic-sounding speech that could reduce the naturalness and effectiveness of agent interactions: - When evaluating speech synthesis quality - When testing new voice models or providers - When optimizing voice parameters for naturalness - When troubleshooting user complaints about robotic voices - Audio analysis, speaker identification, quality assessment **How it works**: Analyzes the frequency distribution of the audio signal and measures the percentage of pitched content above 300Hz. **How to interpret**: - \> 40%: Very natural / expressive - Found in natural human speech; rich in harmonics and overtones. - 30-40%: Acceptably natural - May be synthetic but sounds close to human-like - 20-30%: Slightly robotic - Often found in lower-quality TTS systems - < 20%: Robotic / flat - Indicates overly monotone or artificial tone > **Tip:** Need a different frequency threshold? > The **Audio Frequency Filter** custom metric lets you set any Hz value instead of the fixed 300Hz and reports the percentage of voiced segments above or below that threshold. Create one from the metric editor and configure `frequency_threshold`. ### Music Detection **Purpose**: Detects music segments in audio recordings. **What it measures**: Count of music segments detected in the conversation, with timeline data showing when each music segment occurs. 
**When to use**:
- Detecting hold music or queue music during voice calls
- Identifying unwanted background music in recordings
- Measuring music duration and frequency in customer service scenarios
- Debugging audio quality issues related to music interference

**How it works**: Analyzes audio to identify non-speech segments and classify them as music. Returns a count of music segments and timeline entries with start/end offsets and duration for each detected music segment.

**How to interpret**:
- Value equals the count of distinct music segments detected
- Higher counts indicate more frequent or longer music interruptions
- Timeline subvalues show exact timestamps for each music segment
- Useful for identifying when customers are placed on hold with music

## Conversation Length

These metrics analyze the content and flow of conversations to ensure effective communication.

### Audio Duration

**Purpose**: This metric measures the duration of the audio file in seconds.

**What it measures**: Duration of the full conversation in seconds.

### Turn Count

**Purpose**: Counts how many turns were taken in a conversation.

**What it measures**: Each turn is a change between speakers.

### Words Per Message

**Purpose**: The average number of words per agent message.

**What it measures**: Average number of words per message in a conversation.

## Instruction Following

These metrics measure how well the agent follows predefined behaviors.

### Workflow Verification

Verifies if conversations follow expected workflow patterns and business logic.

## Resolution

Evaluate the end of the conversation.

### End Reason

**Purpose**: The reason that the conversation ended.

**When to use**: Helps identify patterns in call completion.

> **Note:** This currently only works for simulations run on Coval. Support for live calls is a work in progress.

**Possible values:**

| Value | Description |
|-------|-------------|
| `COMPLETED` | The conversation reached a natural conclusion with a successful resolution |
| `MAX_TURNS` | The conversation reached the maximum allowed number of turns |
| `MAX_DURATION` | The conversation exceeded the maximum allowed duration |
| `USER_HANGUP` | The user ended the conversation (voice calls only) |
| `AGENT_HANGUP` | The agent ended the conversation (voice calls only) |
| `IDLE_TIMEOUT` | The conversation timed out due to inactivity (chat/SMS simulations) |
| `ERROR` | An error occurred during the simulation |
| `UNKNOWN` | The end reason could not be determined |

> **Tip:** Want to define what counts as a successful end reason?
> The **Successful End Reason** custom metric returns YES/NO based on whether the end reason matches your configured success criteria. Select one or more end reasons as your success conditions.

## Responsiveness

Critical metrics to identify whether the agent is responding correctly.

### Agent Fails To Respond

**Purpose**: Evaluate continuity and identify moments when the agent ignores or misses a user query.

> **Warning:** Any occurrence of this metric indicates a critical failure requiring immediate investigation.

**What it measures**: Long silence from the agent between two consecutive user turns, and whether and when the agent eventually responds after the second user turn.

**When to use**: Identifying moments when the agent ignores or misses a user query.

**How it works**: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks whether the agent resumes speaking after this.

> **Tip:** Need a different silence threshold?
> The **Agent Fails to Respond Delay** custom metric lets you configure `max_silence_duration_seconds` instead of the fixed 5-second default.

### Agent Needs Reprompting

**Purpose**: Identifies when agents become unresponsive but will respond after user repetition.

> **Note:** This metric helps identify edge cases where the agent's response mechanism may be failing intermittently.

**What it measures**: Long silence from the agent between two consecutive user turns, only if the agent responds after the second user turn.

**When to use**: Evaluating naturalness and continuity. Identifying moments when the agent ignores or misses a user query.

**How it works**: Finds silence gaps ≥ 3 seconds between two consecutive user turns and checks if the agent resumes speaking after this.

**How to interpret**:
- Each silence gap and eventual response are collectively considered one event.
- More events = worse responsiveness.

> **Tip:** Need a different silence gap threshold?
> The **Agent Reprompting Delay** custom metric lets you configure `min_silence_gap_seconds` instead of the fixed 2-second default.

### Agent Repeats Itself

**Purpose**: Identifies instances where the agent says the same sentence or asks the same question multiple times.

**When to use**: Evaluating naturalness and word choice, identifying diverse language.

## Timing & Latency

Ensure timely agent interactions.

### Interruption Rate

**Purpose**: The rate (interruptions per minute) at which the user is interrupted by the assistant.

**What it measures**: An interruption is defined as any time the user is speaking and the assistant starts speaking before the user has finished speaking. This does not include times that the user interrupts the assistant.

**When to use**: Conversation flow analysis, identifying communication issues, training data for interruption handling.

**How to interpret**:
- High interruption frequency may indicate communication issues.
- Interruption patterns can help identify conversation flow problems.
- Useful for training agents to handle interruptions gracefully.

### Latency

**Purpose**: Measurement of delays between user input and agent response, in milliseconds (ms).

**What it measures**: Time between user input and agent response, silence durations.

**When to use**: Performance evaluation, identifying slow response times, conversation flow analysis.

**How it works**: Analyzes the audio signal using Voice Activity Detection (VAD) to identify speaker transitions and measure the time delay between when a user finishes speaking and when the agent begins responding. The metric tracks these response times throughout the conversation to identify patterns and potential issues.

**How to interpret**:
- Target latencies under 500ms for real-time conversations.
- Target latencies under 2 seconds for complex query responses.
- Higher latencies may indicate performance issues or processing bottlenecks.

### Time To First Audio

**Purpose**: Detect audio start latency and responsiveness.

**What it measures**: Time delay between simulation start and the first audible sound in the audio recording.

**When to use**: Evaluating system or agent response latency before any speech begins.

**How it works**: Detects the first audio frame that has RMS energy above a certain threshold and returns the timestamp of this frame.

**How to interpret**:
- \< 1000 ms: Fast audio start; considered responsive.
- 1–3 seconds: Acceptable delay.
- \> 3000 ms: Noticeable lag; may indicate issues in agent response, recording delay, or user hesitation.
- -1 ms: No audio detected; likely a technical failure or silent recording.

### Speech Tempo

**Purpose**: Identifies the rate of phonemes (a perceptually distinct unit of speech sound) and high-speed speech periods.

**What it measures**: The rate of phonemes per second (PPS) in audio output.

**When to use**: Speech quality assessment. Useful for identifying the average tempo.

**How it works**: Measures the number of phonemes per interval over the duration of the speech segment.

**How to interpret**:
- Above 20 PPS is too fast and will be hard to follow.
- Between 15-20 PPS is fast but could be comprehensible.
- 10-15 PPS is the target range: not too fast, not too slow.
- Below 10 PPS is too slow.

### Pause Analysis

**Purpose**: Measures how frequently the agent pauses mid-speech and how long those pauses are.

**What it measures**: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration.

**When to use**:
- Identifying unnatural or excessive hesitations in agent speech
- Detecting processing delays that manifest as in-speech pauses
- Evaluating speech fluency across different configurations

**How it works**: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded.

**How to interpret**:
- Lower values indicate more fluent speech.
- The detail view shows each pause with its timestamp and duration.
- Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts.

## Trace Metrics

These metrics use OpenTelemetry (OTel) trace data to measure the performance of individual components in your voice agent pipeline. They provide granular visibility into LLM, TTS, and STT service latencies, token consumption, and tool usage.

> **Note:** All metrics in this section require your agent to send OpenTelemetry traces to Coval.
> See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup instructions.
> If traces are not configured, these metrics will report an error at execution time.

### LLM Time to First Byte

**Purpose**: Measure LLM responsiveness by tracking how quickly the first token is returned.

**What it measures**: Average time (in seconds) from when the LLM request is sent to when the first token is received, across all turns in the conversation.

**When to use**: Identifying slow LLM providers, comparing model latencies, optimizing prompt length for faster responses.

### TTS Time to First Byte

**Purpose**: Measure TTS responsiveness by tracking how quickly the first audio byte is produced.

**What it measures**: Average time (in seconds) from when text is sent to the TTS service to when the first audio byte is returned, across all turns.

**When to use**: Evaluating TTS provider performance, identifying bottlenecks in the audio generation pipeline.

### STT Time to First Byte

**Purpose**: Measure STT responsiveness by tracking how quickly the first transcription result is returned.

**What it measures**: Average time (in seconds) from when audio is sent to the STT service to when the first transcription result is received, across all turns.

**When to use**: Evaluating STT provider performance, diagnosing why the agent is slow to start processing user input.

### LLM Token Usage

**Purpose**: Track the total token consumption of LLM calls during a conversation.

**What it measures**: Sum of input tokens and output tokens consumed across all LLM calls in the conversation.
**When to use**: Cost monitoring, identifying conversations that consume excessive tokens, comparing prompt strategies for efficiency. **How to interpret**: - Token counts vary by model and use case. Track this metric over time to establish baselines for your specific agent. - Sudden spikes may indicate prompt injection, runaway tool loops, or excessively long conversations. - Use in combination with turn count to compute average tokens per turn. ### Tool Call Count **Purpose**: Count the total number of tool calls made during a conversation. **What it measures**: Number of tool call invocations detected in OTel trace spans. **When to use**: Verifying that the agent is using tools as expected, identifying conversations with excessive or insufficient tool usage. ### STT Word Error Rate **Purpose**: Measure the accuracy of your agent's Speech-to-Text by comparing it against Coval's own transcription of the same conversation. **What it measures**: Word Error Rate (WER) between your agent's STT output and Coval's reference transcript of the caller's speech. The **reference** (ground truth) is Coval's transcription of the persona's speech, generated automatically from each simulation. The **hypothesis** (what you're testing) is your agent's own STT output, read from the `transcript` attribute on OTel `stt` spans. **When to use**: Evaluating your STT provider's accuracy, comparing STT providers (e.g., Deepgram vs Whisper vs Google), diagnosing why your agent misunderstands users, or tracking STT quality over time. > **Note:** This metric requires your agent to emit OTel traces with the `transcript` attribute on each `stt` span. See [Instrumenting STT Spans](/concepts/simulations/traces/opentelemetry#instrumenting-stt-spans) for setup instructions. If your STT provider exposes utterance confidence, we also recommend sending `stt.confidence` on the same spans for debugging and provider-quality analysis. No manual ground truth or test data is needed — the reference transcript is generated automatically. **How to interpret**: A lower WER means your STT is more accurately capturing what the caller said. Compare this metric across runs to track STT quality over time, or across different STT providers to find the best fit for your use case. Note that streaming (real-time) STT typically produces higher WER than batch transcription because it processes audio incrementally. For the WER formula and interpretation thresholds, see [Transcription Error](#transcription-error). ### STT Word Error Rate (Audio Upload) **Purpose**: Measure STT accuracy against a known-correct transcript that you provide, rather than Coval's auto-generated reference. **What it measures**: Word Error Rate (WER) between your agent's STT output and a ground truth transcript you supply in the test case metadata. The **reference** (ground truth) comes from the `ground_truth_transcript` field in your test case metadata. The **hypothesis** (what you're testing) is your agent's own STT output, read from the `transcript` attribute on OTel `stt` spans — the same as the standard [STT Word Error Rate](#stt-word-error-rate) metric. **When to use**: When you have [audio upload](/concepts/test-sets/overview#3-audio-upload) test cases with pre-recorded audio where you know exactly what was said. This lets you measure how accurately your agent's speech recognition transcribes a known recording — useful for regression testing STT quality against a canonical script. > **Note:** This metric requires two things: > 1. 
Your agent must emit OTel traces with the `transcript` attribute on each `stt` span. See [Instrumenting STT Spans](/concepts/simulations/traces/opentelemetry#instrumenting-stt-spans) for setup.
> 2. Your test case must include a `ground_truth_transcript` key in its metadata containing the reference transcript. See [Audio Upload — Ground Truth Transcript](/concepts/test-sets/overview#ground-truth-transcript) for details.
>
> If your STT provider exposes utterance confidence, we also recommend attaching `stt.confidence` to each `stt` span so low-confidence turns are easier to inspect alongside the WER result.

**Accepted ground truth formats:**

| Format | Example |
|--------|---------|
| Plain text | `"Hi, I'd like to check my account balance"` |
| Labeled text with timestamps | `"[15.4s - 26.8s] PERSONA: Hi, I'd like to check my account balance"` |
| JSON with `messages` array | Persona turns are extracted automatically — see snippet below |

```json
{
  "messages": [
    { "role": "user", "content": "Hi, I'd like to check my account balance" },
    { "role": "assistant", "content": "Sure, I can help with that." }
  ]
}
```

When the ground truth contains role labels (e.g. `PERSONA:`, `AGENT:`), only persona/user lines are used — agent lines are filtered out automatically.

**How to interpret**: Same as [STT Word Error Rate](#stt-word-error-rate) — a lower WER means better STT accuracy. Because the reference transcript is your own known-correct text (not Coval's transcription), this metric isolates your STT provider's accuracy without any variance from the reference side. For the WER formula and interpretation thresholds, see [Transcription Error](#transcription-error).

> **Tip:** **Custom Trace Metrics** — In addition to these built-in trace metrics, you can create your own custom trace metrics to measure any OTel span attribute emitted by your agent using the **Create Metric** button within the Metrics section of the Coval UI.

## Transcription Accuracy

### Transcription Error

**Purpose**: Evaluate the accuracy of agent messages through Word Error Rate (WER), which measures the percentage of errors in a transcript.

**What it measures**:

$$
WER = \frac{S + D + I}{N}
$$

Where:
- **S** = substitutions
- **D** = deletions
- **I** = insertions
- **N** = total number of agent words in the reference transcript

**How to interpret**:
- **WER < 0.10**: Excellent; indicates high-quality audio.
- **WER 0.10 - 0.30**: Acceptable for most conversational agents and situations with background noise.
- **WER > 0.30**: May significantly impact understanding of the audio.

> **Tip:** See [Coval Benchmarks](https://benchmarks.coval.ai) for real-world WER performance data across different transcription providers and audio configurations.

## User Patterns

### Audio Sentiment

**Purpose**: Detect vocal tone of each audio segment.

**What it measures**: Emotional tone for each audio segment for both parties.

**When to use**: General tone of the conversation and trend of audio sentiment across the conversation.

**How it works**: Classifies audio sentiment per speaking segment based purely on the audio tone and not the spoken content.

**How to interpret**: Check frequency of certain emotional tones.

> **Tip:** Want to set a pass/fail threshold on sentiment?
> The **Preferred Audio Sentiment** custom metric lets you select which sentiments count as success, choose which speaker to evaluate (agent or persona), and set a minimum percentage of segments that must match.
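To illustrate the kind of check the **Preferred Audio Sentiment** metric describes, the sketch below computes what percentage of a speaker's segments carry an acceptable sentiment and compares it to a minimum threshold. The segment shape, names, and pass rule are illustrative assumptions, not Coval's implementation.

```python
def preferred_sentiment_pass(
    segments: list[dict],   # e.g. {"speaker": "agent", "sentiment": "calm"}
    speaker: str,           # which side to evaluate: "agent" or "persona"
    preferred: set[str],    # sentiments that count as success
    min_match_pct: float,   # minimum percentage of segments that must match
) -> bool:
    evaluated = [s for s in segments if s["speaker"] == speaker]
    if not evaluated:
        return False  # assumption: no segments for the chosen speaker counts as a failure
    matched = sum(1 for s in evaluated if s["sentiment"] in preferred)
    return 100.0 * matched / len(evaluated) >= min_match_pct

segments = [
    {"speaker": "agent", "sentiment": "calm"},
    {"speaker": "agent", "sentiment": "frustrated"},
    {"speaker": "persona", "sentiment": "angry"},
    {"speaker": "agent", "sentiment": "calm"},
]
# 2 of the 3 agent segments are "calm" (about 67%), so a 60% threshold passes
print(preferred_sentiment_pass(segments, "agent", {"calm", "friendly"}, min_match_pct=60))
```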
### Transcript Sentiment Analysis

**Purpose**: Analyzes the transcript for rude, polite, encouraging, and professional sentiments, identifying the sentiment with the highest overall score.

**What it measures**: A score for each sentiment across the agent's messages.

**When to use**: General tone of the agent and how it could be interpreted.

**How it works**: Classifies sentiment per agent message based on the transcript text rather than the audio tone.

**How to interpret**: Higher scores in each sentiment indicate stronger sentiment detected.

## Best Practices for Using Built-in Metrics

Begin with essential metrics like response time, resolution success, and audio quality before adding specialized ones. Establish baseline measurements before making changes to track improvement over time. Use multiple metrics together for comprehensive evaluation rather than relying on single indicators. Schedule regular metric reviews to identify trends and areas needing attention.

## Metric Selection Guide

Choose metrics based on your use case:

### Voice Assistants
- Audio Quality
- Speech Tempo
- Background Noise
- Natural Non-robotic Tone Detection
- Volume/Pitch Misalignment
- Latency
- Interruption Detection
- Trace Metrics (requires OTel traces)
  - LLM Time to First Byte
  - TTS Time to First Byte
  - STT Time to First Byte

### Customer Service Bots
- Composite Evaluation
- Resolution Time Efficiency
- End Resolution
- Audio Sentiment

### Task Automation Agents
- Workflow Verification
- Composite Evaluation
- Words Per Minute
- LLM Token Usage
- Tool Call Count

### General Conversational AI
- Agent Response Times
- Interruption Rate
- Agent Repeats Itself
- Transcript Sentiment Analysis
- End Reason

> **Note:** Remember that not all metrics are suitable for every scenario. Audio metrics require actual audio input, while comparison metrics need reference data to function properly.

---

## Custom Metrics

Source: https://docs.coval.dev/concepts/metrics/prompting

This guide provides instructions for creating high-performing custom prompting metrics in Coval's evaluation platform. Each metric type benefits from various prompting strategies to achieve reliable, deterministic results.

For help writing prompts for custom metrics, Coval offers an `optimize metric` button to improve clarity and confidence.

## Core Principles for All Metrics

### 1. **Specificity Over Generality**
- Define exact evaluation criteria rather than subjective assessments
- Use concrete, measurable behaviors instead of abstract concepts
- Provide clear boundary conditions for edge cases

### 2. **Role Consistency**
- Always refer to the AI agent as "the assistant"
- Use "the user" or "the customer" for human participants
- Maintain consistent terminology throughout your prompts

### 3. **Deterministic Design**
- Structure prompts to minimize LLM variance across evaluations
- Provide explicit decision trees when possible
- Define what constitutes partial vs. complete success

---

## Binary LLM Judge Metrics

**Purpose**: Yes/No evaluations with high accuracy and consistency.

### Prompt Structure Template

```
[CONTEXT SETTING]
Given the transcript, [SPECIFIC QUESTION]?
Return YES if:
• [Explicit criterion 1]
• [Explicit criterion 2]
• [Explicit criterion 3]

Return NO if:
• [Explicit disqualifying condition 1]
• [Explicit disqualifying condition 2]
• [Edge case handling]

[CLARIFICATIONS FOR EDGE CASES]
```

> **Tip:** **Important**: When using OR conditions, make it explicitly clear that the metric should return `YES`/`NO` if **any** of the conditions are met. Use "ANY of the following" language to remove ambiguity.
> * "Return x if ANY of the following apply:"
> * "[Condition] OR [Condition] OR [Condition] ... "

### Example 1: Issue Resolution Detection

```
Given the transcript, did the assistant successfully resolve the user's primary issue or concern?

Return YES if ANY of the following apply:
• The user explicitly confirms their issue is resolved (e.g., "That worked," "Perfect, thank you")
• OR the assistant provides a complete solution and the user accepts it without further objection
• OR the user indicates satisfaction with the outcome before ending the conversation
• OR the assistant completes a requested action and the user acknowledges success
• OR the user's question was fully answered and they don't ask follow-up questions about the same issue
• OR the assistant provides complete, actionable guidance and the user indicates understanding
• OR no primary issue or concern was raised by the user (e.g., casual greetings, general inquiries)

Return NO if ANY of the following apply:
• The user states their issue remains unresolved
• OR the conversation ends without addressing the user's main concern
• OR the user expresses frustration or dissatisfaction with the proposed solution
• OR the assistant escalates or transfers the issue without providing any resolution attempt
• OR the user has to repeat their problem multiple times without progress
• OR the assistant admits they cannot help or solve the user's problem
• OR the user asks the same question again after receiving an answer
```

### Example 2: Compliance Verification

```
Given the transcript, did the assistant properly collect all required verification information before processing the request?

Return YES if:
• The assistant gathered account number, full name, and security question answer
• All three verification elements were confirmed before proceeding
• The assistant explicitly stated verification was complete

Return NO if:
• Any of the three required elements (account number, name, security answer) were skipped
• The assistant proceeded with the request before completing verification
• Verification was attempted but failed, yet the assistant continued anyway

If the user refuses to provide verification, return NO regardless of the reason.
```

### Tips and tricks

**Be Objective**

**Recommended:** **Objective**: "Did the assistant acknowledge the user's concern within their first two responses?"

**Avoid:** **Too subjective**: "Did the assistant provide good customer service?"

**Single Focus**

**Recommended:** **Singular observation**:
- Create separate metrics for separate observations such as resolution and professionalism.

**Avoid:** **Multiple criteria**:
- "Did the assistant resolve the issue and maintain professionalism?"

**Clear Logic**

**Recommended:** **Use of clear logical operators**
- Use AND/OR operators, ANY/ALL.

**Avoid:** **Evaluation logic that contradicts the stated rules**
- Metrics return incorrect results when the evaluation system checks for things that shouldn't trigger failures.
- Such as requiring disclosure when no transfer occurred, or flagging live conversations as voicemails. #### Before (Poor Metric Example): ``` Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, or billing preferences? Return YES if: All three preference items were specifically inquired about. Return NO if: One or more items were not asked. ``` **Why this fails**: The metric has an "OR" condition in the question but requires "AND" logic in the evaluation, creating confusion about whether one or all conditions must be met. #### After (Improved Metric Example): ``` Based on the transcript, did the customer service agent ask about the customer's preferred contact method, current service plan, or billing preferences? Return YES if: • The agent asked about preferred contact method, current service plan, AND billing preferences • This can be in a single question (e.g., "What's your preferred contact method, current plan, and billing preference?") OR separate questions for each item Return NO ONLY if: • The agent failed to ask about one or more of these three specific items: contact method, service plan, or billing preferences • Note: Focus on what the AGENT asked, not on what the customer mentioned in their response Examples of acceptable questions: • "How would you like us to contact you, what's your current plan, and how do you prefer to handle billing?" • Three separate questions covering each preference • Any variation that covers all three customer preference areas ``` > **Tip:** **Key improvements**: Clear AND/OR operators, explicit examples, and evaluation logic that matches the stated conditions. --- ## Categorical LLM Judge Metrics **Purpose**: Classification into predefined, mutually exclusive custom categories. ### Prompt Structure Template ``` Classify [SPECIFIC ASPECT] based on the conversation content. Decision Logic: • If [condition], classify as [CATEGORY_NAME] • If [condition], classify as [CATEGORY_NAME] • If [condition], classify as [CATEGORY_NAME] Return only the exact category name. ``` > **Note:** Note: Configure the category options and their definitions in the Coval UI category menu. The categories and their descriptions are set through the platform interface, not in the prompt text. ### Example 1: Call Intent Classification ``` Classify the primary reason for this conversation based on the user's needs and requests. Decision Logic: • If user mentions technical problems, errors, or "not working", classify as TECHNICAL_SUPPORT • If user mentions money, charges, bills, or payments, classify as BILLING_INQUIRY • If user wants to change account details or settings, classify as ACCOUNT_MANAGEMENT • If user asks general questions without specific issues, classify as GENERAL_INFORMATION • If user expresses dissatisfaction and requests escalation, classify as COMPLAINT_ESCALATION Return only the exact category name. ``` ### Example 2: Conversation Outcome Classification ``` Classify the final outcome of this conversation based on how it concluded. 
Decision Logic: • If user explicitly confirms resolution or satisfaction, classify as RESOLVED_SUCCESSFULLY • If solution provided but requires user action outside conversation, classify as PARTIALLY_RESOLVED • If conversation transferred to human agent, classify as ESCALATED_TO_HUMAN • If user ends conversation frustrated or without resolution, classify as UNRESOLVED_ABANDONED • If user asked questions and received answers without specific problems, classify as INFORMATION_PROVIDED Return only the exact category name. ```

---

## Numerical LLM Judge Metrics

**Purpose**: Score-based evaluations with consistent integer scaling.

### Prompt Structure Template

```
Rate [SPECIFIC ASPECT] based on the following criteria:

Evaluation Criteria:
• [Criterion 1 with behavioral indicators]
• [Criterion 2 with behavioral indicators]
• [Criterion 3 with behavioral indicators]

Scoring Guidelines:
Low scores: [Behavioral indicators for poor performance]
High scores: [Behavioral indicators for excellent performance]

Return only the numerical score.
```

**Note**: Configure the Min and Max score values in the platform interface, not in the prompt text. The scoring scale (e.g., 1-5, 1-10) is set through the platform interface.

### Example 1: Empathy Assessment

```
Rate the assistant's empathy level based on the following criteria:

Evaluation Criteria:
• Acknowledgment of user emotions and concerns
• Use of appropriate empathetic language and tone indicators
• Validation of user feelings before moving to solutions
• Adaptation of communication style to user's emotional state

Scoring Guidelines:
Low scores: No empathy shown, dismissive responses, purely transactional
High scores: Clear empathetic responses, validates feelings, shows genuine concern

Return only the numerical score.
```

### Example 2: Technical Accuracy Scoring

```
Rate the technical accuracy of the assistant's information based on the following criteria:

Evaluation Criteria:
• Factual correctness of all technical statements
• Completeness of technical explanations
• Appropriate level of technical detail for the context
• Identification and correction of any technical misconceptions

Scoring Guidelines:
Low scores: Major technical errors that could cause problems
High scores: Expert-level accuracy with comprehensive, precise details

Return only the numerical score.
```

---

## Multimodal LLM Judge Metrics

**Purpose**: Include audio-specific evaluations that text analysis cannot capture.

Multimodal LLM Judge metrics analyze the audio along with the transcript text. This allows you to evaluate qualities like vocal tone, speech clarity, pacing, and emotional expression that are impossible to assess from text alone.

> **Note:** The format of Multimodal LLM Judge metrics is the same as that of standard LLM Judge metrics.
> Coval handles the audio processing automatically. Your prompt should focus on **what you want to evaluate**, not how to process the audio.
### What Audio Metrics Can Detect Audio LLM Judge metrics excel at evaluating: | Category | Examples | | -------------------------- | -------------------------------------------------------- | | **Speech Quality** | Clarity, articulation, pronunciation, stuttering | | **Vocal Characteristics** | Tone, pitch, volume consistency, speaking pace | | **Emotional Expression** | Enthusiasm, frustration, sarcasm, empathy in voice | | **Professional Demeanor** | Courtesy, patience, confidence, nervousness | | **Speaker Identification** | Distinguishing between speakers, detecting interruptions | ### Prompt Structure Template ``` [SPECIFIC AUDIO QUESTION] Audio Analysis Criteria: • [Acoustic feature 1] • [Vocal characteristic 2] • [Speech pattern 3] Return YES if: • [Audio-specific condition 1] • [Audio-specific condition 2] Return NO if: • [Audio-specific disqualifier 1] • [Audio-specific disqualifier 2] Note: [Clarification about evaluation scope] ``` > **Tip:** **Writing Effective Audio Prompts**: Be specific about which speaker to > evaluate (assistant, user, or both) and what acoustic qualities matter. Vague > prompts like "Did it sound good?" produce inconsistent results. ### Transcript Scope for Audio Metrics Audio LLM Judge metrics support [Transcript Scope](#transcript-scope), allowing you to evaluate only specific portions of the audio. When you apply filters (such as agent-only or last N turns), the system automatically extracts and evaluates only the corresponding audio segments. This is particularly useful for: - Evaluating agent speech quality without user audio - Focusing on closing statements or greetings - Reducing token costs on long recordings ### Best Practices for Audio Metrics **Focus on Audio-Only Qualities** Only use Audio LLM Judge for evaluations that **require hearing the audio**. If something can be determined from the transcript alone (like whether specific words were said), use a standard LLM Judge metric instead - it's faster and more cost-effective. **Use Audio metrics for:** Tone of voice, speaking pace, pronunciation clarity, emotional expression, volume issues **Use Text metrics for:** Word choice, script compliance, information accuracy **Specify the Speaker Role** Always clarify whose audio you're evaluating: - "Did **the assistant** speak clearly..." - "Did **the user** sound frustrated..." - "Was there crosstalk between **both speakers**..." This prevents ambiguity when multiple voices are present. **Define Concrete Audio Criteria** Replace subjective terms with specific, observable audio qualities: | Avoid | Use Instead | |-------|-------------| | "Good tone" | "Calm, even-paced tone without audible frustration" | | "Clear speech" | "Words pronounced distinctly without mumbling or slurring" | | "Professional" | "Business-appropriate volume and pace, no sighing or dismissive inflections" | **Include Reasoning Guidance** For complex evaluations, ask the model to consider specific aspects before making a determination. This improves accuracy: ``` Before making your determination, consider: 1. What is the overall vocal tone throughout the call? 2. Are there any moments where the tone shifts notably? 3. How would a customer likely perceive this tone? ``` ### Example 1: Speech Clarity Assessment ``` Did the assistant speak clearly and at an appropriate pace throughout the conversation? 
Audio Analysis Criteria: • Pronunciation clarity and articulation • Speaking pace (not too fast or slow for comprehension) • Volume consistency and audibility • Absence of mumbling, slurring, or rushed speech Return YES if: • All words are clearly pronounced and easily understood • Speaking pace allows for comfortable comprehension • Volume remains consistent and audible throughout • No instances of unclear or garbled speech Return NO if: • Words are frequently mumbled, slurred, or unclear • Speaking pace is too fast or slow for easy comprehension • Volume fluctuations make parts difficult to hear • Any portions of speech are unintelligible due to clarity issues Note: Focus only on the assistant's speech clarity, not content quality. ``` ### Example 2: Professional Tone Detection ``` Did the assistant maintain a professional vocal tone throughout the conversation? Audio Analysis Criteria: • Tone consistency and appropriateness for business context • Absence of inappropriate emotional expressions (anger, frustration, sarcasm) • Professional demeanor in vocal inflection and manner • Respectful and courteous vocal presentation Return YES if: • Vocal tone remains professional and business-appropriate throughout • No instances of unprofessional vocal expressions or attitudes • Tone conveys respect and courtesy consistently • Emotional responses, if any, are appropriate to the context Return NO if: • Vocal tone becomes unprofessional, dismissive, or inappropriate • Clear instances of anger, frustration, or sarcasm in voice • Tone suggests disrespect or lack of courtesy • Emotional vocal responses inappropriate for professional context Note: Evaluate vocal tone and manner, not the words spoken. ``` ### Example 3: Empathy Detection ``` Did the assistant demonstrate vocal empathy when the user expressed frustration or concern? Audio Analysis Criteria: • Softening of tone when user expresses negative emotions • Appropriate pacing adjustments (slowing down to show care) • Warm, understanding vocal quality rather than robotic or dismissive • Verbal acknowledgments delivered with genuine-sounding concern Return YES if: • Assistant's tone audibly softens or warms in response to user distress • Pacing adjusts appropriately to show the assistant is listening • Voice conveys genuine concern rather than scripted responses • No rushing through empathetic statements Return NO if: • Assistant maintains the same tone regardless of user's emotional state • Empathetic words are delivered in a flat, robotic, or rushed manner • Assistant sounds impatient or dismissive when user is upset • No vocal adaptation to the user's emotional needs Note: Evaluate the vocal delivery of empathy, not just whether empathetic words were used. ``` ### Example 4: Speaker Diarization Quality ``` Can the two speakers (assistant and user) be clearly distinguished throughout the recording? 
Audio Analysis Criteria: • Distinct vocal characteristics between speakers • Clear turn-taking without excessive overlap • Ability to attribute each utterance to the correct speaker • Audio quality sufficient for speaker identification Return YES if: • Each speaker has distinguishable vocal qualities • Turn-taking is clear with minimal confusing overlaps • All significant utterances can be attributed to a specific speaker • No extended portions where speaker identity is unclear Return NO if: • Speakers sound too similar to reliably distinguish • Frequent overlapping speech makes attribution difficult • Significant portions have unclear speaker identity • Audio quality issues (echo, distortion) prevent speaker identification Note: This metric evaluates audio clarity for speaker identification, not conversation quality. ```

### Common Pitfalls to Avoid

> **Warning:** **Don't mix audio and text evaluations** in a single Audio metric. If you need to check both "Did they sound professional?" AND "Did they say the required disclaimer?", create two separate metrics - an Audio metric for tone and a Text metric for the disclaimer.

| Pitfall | Problem | Solution |
| ----------------------------- | ---------------------------------------------------- | ---------------------------------------------- |
| Evaluating transcript content | Audio metrics can't reliably assess word choice | Use standard LLM Judge for text content |
| Vague audio criteria | "Good voice" is subjective and inconsistent | Define specific qualities: pace, clarity, tone |
| Missing speaker specification | Unclear whose voice to evaluate | Always specify: assistant, user, or both |
| Combining unrelated qualities | "Clear AND professional AND empathetic" is too broad | Create separate metrics for each quality |

---

## Transcript Scope

**Purpose**: Focus metric evaluation on specific portions of a conversation rather than the entire transcript.

Transcript Scope allows you to filter which messages the LLM evaluates, reducing noise and improving accuracy for targeted assessments. This feature is available for all LLM Judge metrics (Binary, Numerical, Categorical) and Audio LLM Judge metrics.

### When to Use Transcript Scope

| Use Case | Filter Configuration |
|----------|---------------------|
| Evaluate only agent responses | Role filter: `agent` |
| Check the closing of a conversation | Range filter: Last 3 turns |
| Assess user sentiment only | Role filter: `user` |
| Focus on recent context | Range filter: Last N messages |

### Configuration Options

**Transcript Scope Toggle**:
- **Full** (default) - Evaluate the entire transcript
- **Custom** - Apply filters to focus on specific messages

[Image: Transcript Scope UI]

**Available Filters**:

**Role Filter**

Limit evaluation to messages from specific speakers:
- **Agent** - Only evaluate assistant/agent messages
- **User** - Only evaluate user/customer messages
- **Both** - Evaluate messages from selected roles

This is useful when you want to assess agent behavior without user input affecting the evaluation, or vice versa.

**Range Filter**

Limit evaluation to a specific portion of the conversation:
- **Last N turns** - Evaluate only the final N message exchanges
- **First N turns** - Evaluate only the opening N message exchanges

This is useful for evaluating specific phases of a conversation, such as greetings, closings, or resolution attempts.
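The filtering behavior described above is straightforward to picture. The hedged sketch below shows the selection logic for a role filter combined with a "last N" range filter over a list of messages; Coval applies these filters for you, so this is only illustrative, and the message shape and function name are assumptions.

```python
def scope_transcript(messages: list[dict], role: str | None = None, last_n: int | None = None) -> list[dict]:
    """Select the messages an LLM judge would see under a Custom transcript scope."""
    scoped = messages
    if role is not None:        # Role filter: keep one speaker only
        scoped = [m for m in scoped if m["role"] == role]
    if last_n is not None:      # Range filter: keep the final N entries
        scoped = scoped[-last_n:]
    return scoped

transcript = [
    {"role": "agent", "content": "Hello, how can I help?"},
    {"role": "user", "content": "I want to cancel my order."},
    {"role": "agent", "content": "I've cancelled it. Anything else?"},
    {"role": "user", "content": "No, thanks."},
    {"role": "agent", "content": "Have a great day!"},
]
print(scope_transcript(transcript, role="agent", last_n=2))  # the agent's last two messages
```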
### Transcript Scope for Audio Metrics When using Transcript Scope with Audio LLM Judge metrics, the system automatically: 1. Filters the transcript to the selected messages 2. Uses message timestamps to extract the corresponding audio segments 3. Merges adjacent audio segments (within 0.5 seconds) to avoid artifacts 4. Sends only the filtered audio to the LLM for evaluation This enables focused audio evaluations while reducing processing time and token costs. **Example**: To evaluate only the agent's speech quality in the last 3 turns: - Enable **Custom** transcript scope - Add a **Role filter** for `agent` - Add a **Range filter** for `Last 3 turns` The metric will only analyze the agent's audio from the final 3 exchanges, ignoring user speech and earlier portions of the call. ### Benefits - **More accurate evaluations** - Remove noise from irrelevant messages - **Lower costs** - Process less content per evaluation - **Faster execution** - Smaller context means quicker LLM responses - **Targeted insights** - Focus on the exact conversation segments that matter > **Tip:** Combine multiple filters for precise control. For example, use both a Role filter (agent only) and a Range filter (last 5 turns) to evaluate just the agent's closing performance. --- ## Composite Evaluation **Purpose**: Evaluates a transcript against custom criteria and returns an aggregated score. It assesses each criterion and reports how many passed. **When to use**: Use Composite Evaluation when you need to check whether a conversation meets several requirements at once. ### Use cases - Did the agent greet the customer, verify their identity, and offer a resolution? - Did the response cover all required talking points? - Did the conversation follow each step of a compliance checklist? ### **Implementation** **Criterion Source** - Choose where your criteria come from: - **From Test Case** - Pulls criteria automatically from each test case's Expected Behaviors field. This is useful when different test cases have different criteria. - **Static Criteria** - Define a fixed list of criteria directly on the metric. Every transcript is evaluated against the same set. **Custom Evaluation Prompt** (optional) - Provide additional instructions to guide how each criterion is evaluated. This lets you tailor the evaluation context without editing the criteria. **Additional Options**: - **Knowledge Base** - Enable to give the evaluator access to your knowledge base for more informed assessments. - **LLM Model** - Select which model performs the evaluation. - **Transcript Scope** - Limit evaluation to specific portions of the transcript. See [Transcript Scope](#transcript-scope) for configuration details. **Results**: Each run produces: - An overall score count and percentage of how many passed criteria. - A breakdown showing which criteria passed or failed with reasoning. - A summary explaining the overall evaluation. ### Understanding Result Types Each criterion is evaluated independently and returns one of three results: | Result | Meaning | |--------|---------| | **MET** | Clear evidence in the transcript that the criterion was satisfied | | **NOT_MET** | Evidence that contradicts or fails to satisfy the criterion | | **UNKNOWN** | Insufficient information to determine | > **Warning:** Getting **UNKNOWN** usually means your criterion is too vague or your evaluation prompt lacks context. The evaluator cannot find sufficient evidence in the transcript to make a determination. 
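To show how the per-criterion results roll up into the reported score, here is a hedged sketch of the aggregation: counting how many criteria came back MET and converting that to a percentage. Whether UNKNOWN counts against the score is an assumption in this sketch; treat it as illustrative rather than Coval's exact scoring rule.

```python
def composite_score(results: dict[str, str]) -> dict:
    """Aggregate per-criterion results (MET / NOT_MET / UNKNOWN) into an overall score."""
    passed = sum(1 for outcome in results.values() if outcome == "MET")
    total = len(results)
    return {
        "passed": passed,
        "total": total,
        # assumption: UNKNOWN criteria stay in the denominator and do not pass
        "percentage": round(100.0 * passed / total, 1) if total else 0.0,
    }

results = {
    "Agent greets the customer": "MET",
    "Agent verifies identity with date of birth or member ID": "MET",
    "Agent offers an alternative appointment time": "UNKNOWN",
    "Agent confirms the cancellation deadline": "NOT_MET",
}
print(composite_score(results))  # {'passed': 2, 'total': 4, 'percentage': 50.0}
```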
### Writing Effective Custom Evaluation Prompts The **Custom Evaluation Prompt** field controls how the evaluator interprets each criterion. A well-written prompt provides context that helps the evaluator understand your domain and make accurate determinations. **Default Behavior**: Without a custom prompt, the evaluator uses semantic matching to determine if each criterion was met. This works well for straightforward criteria but may return UNKNOWN for domain-specific expectations. **When to Use a Custom Prompt**: - Your criteria reference domain-specific terminology - You need the evaluator to understand your agent's role or capabilities - You want to define what counts as "meeting" a criterion in your context **Custom Prompt Examples** **Healthcare Scheduling Agent:** ``` You are evaluating a healthcare scheduling assistant. The agent helps patients book, reschedule, and cancel appointments. It has access to provider availability and patient records. When evaluating criteria: - "Confirms appointment" means the agent stated the date, time, and provider name - "Verifies patient identity" means the agent asked for date of birth or member ID - "Offers alternatives" means the agent suggested at least one other available time slot ``` **Banking Support Agent:** ``` You are evaluating a banking support assistant. The agent handles account inquiries, transaction disputes, and card services. When evaluating criteria: - Account verification requires confirming at least 2 identity factors - "Explains fees" means stating the specific dollar amount and when it applies - Security disclosures must mention fraud protection and reporting procedures ``` **Prompt Structure Guidelines** 1. **State the agent's role** - What does the agent do? What information does it have access to? 2. **Define ambiguous terms** - What does "confirms" or "explains" mean in your context? 3. **Set evaluation standards** - What level of detail counts as meeting a criterion? **Poor prompt:** ``` Evaluate if the agent did a good job. ``` **Effective prompt:** ``` You are evaluating a restaurant reservation assistant. The agent books tables, manages waitlists, and answers questions about menu and hours. A criterion is MET when the agent provides the specific information requested. Partial or vague responses should be marked NOT_MET. If the conversation does not address the topic at all, mark as UNKNOWN. ``` ### Writing Effective Criteria The most common cause of inaccurate results is vague criteria. The evaluator uses semantic understanding, so equivalent meanings count as matches. However, it cannot infer intent from ambiguous statements. 
**The Specificity Formula** Good criteria follow this pattern: **[Actor] + [Specific Action] + [Specific Information/Outcome]** **Vague vs Specific Examples** | Scenario | Vague (Likely UNKNOWN) | Specific (Reliable) | |----------|------------------------|---------------------| | Appointment booking | "Agent schedules the appointment" | "Agent confirms the appointment date, time, and provider name" | | Account inquiry | "Agent explains the fees" | "Agent states the monthly fee amount and when it is charged" | | Password reset | "Agent helps with password" | "Agent sends a password reset link to the registered email address" | | Escalation | "Agent offers to escalate" | "Agent offers to transfer to a specialist when unable to resolve the issue" | **Why Vague Criteria Fail** Consider the criterion: "Agent explains the account options" This fails because: - "Account options" could mean account types, features, fees, or upgrades - The evaluator cannot determine which aspect you intended - Even if the agent discussed accounts, there's no way to verify the specific expectation was met **Rewritten**: "Agent explains the difference between checking and savings accounts, including minimum balance requirements" Now the evaluator can look for specific information about account types and balance requirements. **Balancing Specificity for Shared Test Sets** When sharing criteria between voice and chat test sets: 1. **Focus on WHAT should happen, not HOW** - Avoid: "Agent says 'I understand your concern'" - Use: "Agent acknowledges the customer's concern before proceeding" 2. **Use outcome-based criteria** - Avoid: "Agent reads the cancellation policy" - Use: "Agent confirms the customer understands the cancellation deadline" 3. **Avoid modality-specific language** - Avoid: "Agent clicks the submit button" - Use: "Agent completes the reservation request" ### Using Agent Evaluation Context Adding your agent's system prompt or context significantly improves evaluation accuracy. The evaluator performs better when it understands what your agent is supposed to do. Navigate to **Agent Settings > Evaluation Context** and add: - What the agent does and what information it has access to - Key policies or procedures it should follow - How it should handle common scenarios **Example Agent Context:** ``` This is a healthcare scheduling assistant that helps patients with: - Booking new appointments with available providers - Rescheduling existing appointments (requires 24-hour notice) - Canceling appointments - Answering questions about office locations and hours The agent should always: - Verify patient identity before making changes - Confirm appointment details before finalizing - Offer alternative times when the requested slot is unavailable ``` ### Troubleshooting UNKNOWN Results If you're getting UNKNOWN results: 1. **Improve your custom prompt** - Add domain context and define what "meeting" a criterion means in your use case 2. **Check criterion specificity** - Is the criterion concrete enough to verify against the transcript? 3. **Add agent evaluation context** - Does the evaluator understand what the agent is supposed to do? 4. **Review the transcript** - Is the expected information actually present in the conversation? 5. 
**Split compound criteria** - Break "Agent explains X and confirms Y" into two separate criteria --- ## Tool Call Metrics **Purpose**: Evaluate whether AI agent tool calls (functions) were executed correctly ### Prompt Structure Template ``` Given the conversation transcript, [SPECIFIC TOOL CALL EVALUATION QUESTION]? Return YES if: • [Tool call execution criterion 1] • [Tool call execution criterion 2] • [Tool call execution criterion 3] Return NO if: • [Tool call failure condition 1] • [Tool call failure condition 2] • [Edge case for incorrect usage] [CLARIFICATIONS FOR TOOL CALL CONTEXT] ``` ### Example 1: Function Call Accuracy ``` Given the conversation transcript, did the assistant correctly execute the search function with the appropriate parameters? Return YES if: • The search function was called when the user requested information lookup • All required parameters (query, filters) were properly populated • The function call syntax and format were correct • The assistant used the search results appropriately in their response Return NO if: • The search function was called unnecessarily or at wrong times • Required parameters were missing or incorrectly formatted • The function call failed due to syntax errors • The assistant ignored or misused the function results Note: Focus on the technical execution of the tool call, not the quality of the response content. ``` ### Example 2: API Integration Validation ``` Given the conversation transcript, did the assistant properly use the customer lookup API when handling account inquiries? Return YES if: • The API was called only when customer account information was needed • Customer identifier (email, phone, or account number) was correctly passed as parameter • The assistant handled API response data appropriately • Proper error handling was demonstrated if API call failed Return NO if: • The API was called without sufficient customer identification • Wrong parameters were passed to the lookup function • The assistant proceeded without waiting for API response • API errors were not handled gracefully Note: Evaluate the technical integration, not the customer service quality. ``` --- ## API State Matcher **Purpose**: Evaluate the assistant by validating real-world system outcomes via an external API. ### Implementation - Add the URL of the API endpoint. - Select `GET` for simple lookups or `POST` if the API requires a body. - Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. **Template Patterns**: `{{expected_output.balance}}`, `{ "status": "success" }`, `completed` - Match path (optional): A dot-notation path used to extract a specific field from the API response. - Timeout (optional): Maximum wait time for the API response before marking the metric as failed. - Headers (optional): Custom HTTP headers sent with the request. ### Use Cases - Verify the agent produced the correct structured output. - Validate mocked API responses in simulations. - Check tool-call results in real services. ### How It Works - An HTTP request is sent to the specified API endpoint. - The response body is inspected (optionally at a specific JSON path). - The extracted value is compared against your Expected Body. - Returns 1 if the response body matches the expected value, otherwise returns 0.
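To make the pass/fail behavior concrete, the following Python sketch mirrors the steps above. It is a conceptual illustration only, not Coval's implementation; the endpoint URL, match path, and expected value are hypothetical placeholders.

```python
# Conceptual sketch of API State Matcher behavior (illustration only, not Coval's code).
import requests


def api_state_match(url, expected_body, method="GET", match_path=None, headers=None, timeout=10.0):
    """Return 1 if the (optionally extracted) response value equals expected_body, else 0."""
    try:
        response = requests.request(method, url, headers=headers, timeout=timeout)
        value = response.json()
        # Walk the optional dot-notation match path, e.g. "data.booking.status"
        if match_path:
            for key in match_path.split("."):
                value = value[key]
        return 1 if value == expected_body else 0
    except Exception:
        # Timeouts, network errors, or a missing field all fail the metric
        return 0


# Hypothetical usage: verify a booking actually landed in the downstream system
score = api_state_match(
    url="https://api.example.com/bookings/latest",  # hypothetical endpoint
    match_path="status",
    expected_body="confirmed",
)
```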
--- ## Match Expected Simulation Wrapper **Purpose**: Evaluates an assistant by comparing data captured during the simulation against an expected value. > **Info:** Instead of calling an external API like [API State Matcher](/concepts/metrics/prompting#api-state-matcher), this metric inspects simulation wrapper observations (pre- or post-simulation) > and verifies that the recorded response matches expectations. ### Implementation - Select observations (pre- or post-simulation). - Expected Body can be a full JSON object, a primitive value (string, number, boolean), or a template variable. **Template Patterns**: `{{expected_output.balance}}`, `{ "status": "success" }`, `completed` ### Use Cases - Verify the agent produced the correct structured output. - Validate mocked API responses in simulations. - Test tool-call results without affecting real services. ### How It Works - The metric reads the selected wrapper observation (for example, an API pre-simulation or post-simulation payload). - It extracts a value using the match path and compares the result to the Expected Body. - Returns 1 if the extracted value matches the expected value, otherwise returns 0. --- ## Metadata Field Metric **Purpose**: Reports the value of a run's metadata field, retrieved from the custom metadata using a specified key. The value may be a number, text, or boolean. ### Implementation 1. Select the metadata field type: **string**, **float**, or **boolean**. 2. Input the metadata field key. > **Warning:** This metric works only if you send metadata as part of the transcripts you evaluate with Coval; > it outputs the specified metadata field's value as the metric result. ### Use Cases - Track custom business metrics (e.g. customer satisfaction scores, call type). - Monitor agent performance indicators passed through metadata. - Extract conversation context data for analysis. - Aggregate custom KPIs from your application. - Track boolean flags (e.g. escalation occurred, customer authenticated, issue resolved). ### How It Works - The metric returns the exact value stored in the specified metadata field. - Automatically aggregates values across multiple conversations. - Direct field value extraction with no LLM processing required. - Supports numeric, text, and boolean metadata values. - Boolean values are output as float (0.0 for false, 1.0 for true) for proper metric aggregation. --- ## Transcript Regex Match Metrics **Purpose**: Pattern detection for exact phrase matching, compliance validation, and format verification. ### Implementation Configure the **Regex Pattern** field (required) and optional fields below. No text prompt is required for this metric type. ### Configuration Fields | Field | Required | Default | Description | |-------|----------|---------|-------------| | **Regex Pattern** | Yes | — | Regular expression pattern to match against the transcript | | **Role** | No | All messages | Filter by speaker role: `AGENT`, `PERSONA`, `TOOL`, `SYSTEM`, or `MUSIC` | | **Match Mode** | No | `presence` | `presence` returns 1 if pattern is found; `absence` returns 1 if pattern is NOT found | | **Position** | No | `any` | `any` checks all messages, `first` checks only the first message, `last` checks only the last message (of the filtered role) | | **Case Insensitive** | No | `false` | When enabled, pattern matching ignores case | ### Pattern Design Guidelines - Use word boundaries **(`\b`)** for exact word matching. - Enable **Case Insensitive** matching instead of using inline `(?i)` flags for clarity. - Use **Position** filtering instead of complex anchoring when you only care about the first or last message.
- Use **Absence** mode for compliance rules ("agent must not say X") instead of trying to negate patterns in regex. Test patterns thoroughly before deployment. ### Use Case Examples #### Example 1: Greeting Detection **Goal**: Detect if the agent uses a proper greeting phrase. **Regex Pattern**: `\b(hello|hi|good morning|good afternoon|good evening)\b` **Role**: `AGENT` **Case Insensitive**: Enabled **Returns**: 1 if greeting found, 0 if no greeting detected. #### Example 2: Required Disclosure in First Message **Goal**: Verify the agent states a required disclosure at the start of the conversation. **Regex Pattern**: `this call may be recorded` **Role**: `AGENT` **Position**: `first` **Case Insensitive**: Enabled **Returns**: 1 if disclosure is in the first agent message, 0 if missing. #### Example 3: Prohibited Language (Compliance) **Goal**: Ensure the agent never makes unauthorized promises. **Regex Pattern**: `\b(guarantee|promise|100%|definitely)\b` **Role**: `AGENT` **Match Mode**: `absence` **Case Insensitive**: Enabled **Returns**: 1 if the agent did NOT use prohibited language (pass), 0 if prohibited language was found (fail). #### Example 4: Closing Statement in Last Message **Goal**: Verify the agent ends the conversation with a proper closing. **Regex Pattern**: `(goodbye|have a (great|nice|lovely) day|thank you for calling)` **Role**: `AGENT` **Position**: `last` **Case Insensitive**: Enabled **Returns**: 1 if closing statement found in last agent message, 0 if missing. #### Example 5: Phone Number Format Validation **Goal**: Detect when the user provides a phone number in standard US format. **Regex Pattern**: `\b\d{3}[-.]?\d{3}[-.]?\d{4}\b` **Role**: `PERSONA` **Returns**: 1 if valid format detected, 0 if invalid or missing. ### How It Works 1. The metric filters transcript messages by **Role** (if specified). If no role is set, all messages are checked. 2. The **Position** filter is applied: `first` keeps only the first matching message, `last` keeps only the last. 3. The **Regex Pattern** is matched against the filtered messages, with **Case Insensitive** applied if enabled. 4. The **Match Mode** determines the result: - `presence`: returns 1 if the pattern was found, 0 if not. - `absence`: returns 1 if the pattern was NOT found, 0 if it was. 5. Direct pattern matching — no LLM required, fast and deterministic. --- ### Words Per Message (Threshold) **Purpose**: Validates that all agent messages meet a configurable word count requirement. **What it measures**: Whether every agent message satisfies a word count condition — for example, "all messages must have fewer than 50 words" or "all messages must have at least 5 words." **When to use**: - Enforcing response length guidelines (e.g., keeping answers concise) - Detecting unexpectedly short or empty responses - Validating that the agent doesn't produce overly verbose replies **How it works**: Counts words in each agent message and checks whether all messages satisfy the configured operator and threshold. Returns YES only if every message passes; NO if any message fails. **How to interpret**: - **YES** = all agent messages meet the word count condition. - **NO** = at least one message violated the condition. The detail view identifies which messages failed and their word counts. --- ## Customized Audio Metrics ### Custom Pause Analysis **Purpose**: Measures how frequently the agent pauses mid-speech and how long those pauses are. 
**What it measures**: Frequency of agent pauses within a turn (pauses per minute), along with total and average pause duration. **When to use**: - Identifying unnatural or excessive hesitations in agent speech - Detecting processing delays that manifest as in-speech pauses - Evaluating speech fluency across different configurations **How it works**: Identifies gaps between consecutive agent speaking segments within the same turn and measures their duration. Persona pauses and inter-turn gaps are excluded. **How to interpret**: - Lower values indicate more fluent speech. - The detail view shows each pause with its timestamp and duration. - Brief pauses are normal and often expressive; frequent longer pauses may indicate hesitation artifacts. ### Volume Variance **Purpose**: Measures how consistently the agent maintains volume throughout the conversation. **What it measures**: Standard deviation of audio volume (in dB) across agent speech — lower values indicate more consistent volume. **When to use**: - Identifying erratic loudness changes in agent speech - Ensuring consistent audio quality across a call - Comparing voice model configurations for volume stability **How it works**: Divides agent speech into fixed-length intervals and measures the volume of each. Intervals are flagged as too loud or too soft based on absolute dBFS thresholds. The primary score is the standard deviation across all intervals. To adjust sensitivity there are different thresholds available: | Preset | Loud threshold | Soft threshold | |--------|---------------|----------------| | `strict` | above -3 dBFS | below -30 dBFS | | `normal` (default) | above -6 dBFS | below -35 dBFS | | `lenient` | above -9 dBFS | below -40 dBFS | You can also override thresholds individually with `loud_threshold_db`, `soft_threshold_db`, or `interval_seconds`. **How to interpret**: - Lower standard deviation = more consistent volume. - The detail view shows only the problematic intervals (too loud or too soft) with their timestamps and dB values. ### Abrupt Pitch Changes **Purpose**: Detects sudden, jittery transitions in pitch that can make speech sound unnatural. **What it measures**: Distinct segments where pitch changes abruptly between frames, reported as events per minute. **When to use**: - Detecting unnatural speech characteristics in synthesized voices - Identifying voice models with unstable or jittery pitch - Comparing voice configurations for smoothness **How it works**: Compares pitch values frame-by-frame, flags frames where the change exceeds a threshold, and groups consecutive flagged frames into segments. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `significant_changes_threshold_hz` | `200.0` | Minimum pitch change in Hz to consider a transition abrupt | **How to interpret**: - Lower values indicate smoother, more natural pitch transitions. - Higher values suggest jittery or unstable pitch. ### Volume/Pitch Misalignment **Purpose**: Detects moments where pitch and volume move in opposite directions, which can indicate unnatural prosody in synthesized speech. **What it measures**: Frames where the pitch is rising while volume is falling (or vice versa), scored by severity relative to the clip's own baseline. **When to use**: Identifying unnatural-sounding speech output — for example, a voice that gets louder while its pitch drops unexpectedly, or vice versa. 
Useful for: - Evaluating text-to-speech engine quality - Detecting prosody issues that may sound "off" to listeners - Comparing voice model configurations **How it works**: Analyzes frame-by-frame pitch and volume changes across the audio. Frames where the two signals diverge in opposite directions are flagged. Each event receives a severity score based on how unusual the divergence is relative to the rest of the clip (using z-scored magnitudes), making the metric robust across different speakers and recording conditions. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `min_volume_change_for_pitch_misalignment` | `7` | Minimum intensity change (dB) required to flag a misalignment event | **How to interpret**: Severity scores are **relative to the clip**, not absolute. A higher score means both pitch and volume were moving unusually for this speaker in this recording. - **Low severity (~0 – 1)**: Both signals are near their mean change magnitude — nothing unusual relative to the speaker's baseline. - **Medium severity (~1 – 2)**: One or both signals are about 1 standard deviation above their clip mean. - **High severity (~2–6+)**: Both signals are 2+ standard deviations above their clip mean — a genuinely unusual frame. Because severity is z-score based, values are comparable across different speakers and recording conditions. ### Non-Expressive Pauses **Purpose**: Identifies pauses in speech that lack preparatory pitch movement, which can make the agent sound flat or monotone. **What it measures**: Pauses above a minimum duration where pitch shows little variation in the frames immediately before the pause, reported as events per minute. **When to use**: - Evaluating whether a voice sounds expressive and natural - Detecting monotone delivery in synthesized speech - Comparing voice configurations for expressiveness **How it works**: Detects pauses above a minimum duration threshold, then examines the pitch trajectory in the frames immediately preceding each pause. Pauses with minimal pitch variation beforehand are flagged as non-expressive. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `min_pause_duration_seconds` | `0.6` | Minimum silence duration (s) to qualify as a pause | | `pre_pause_window` | `5` | Number of 10ms frames to inspect before each pause for pitch movement | **How to interpret**: - Lower values indicate more expressive delivery — pitch varies naturally before pauses. - Higher values suggest a flat or robotic cadence where pauses arrive without natural pitch cues. ### Vocal Fry **Purpose**: Detects vocal fry — a low, creaky speech quality, typically occurring at the end of phrases. **What it measures**: Total time spent in vocal fry (in seconds), with additional detail on percentage of affected speech and longest continuous fry segment. **When to use**: - Evaluating whether a voice has creaky or rough-sounding artifacts - Monitoring vocal quality across different voice configurations - Identifying voices where fry affects listener experience **How it works**: Identifies frames with simultaneously low pitch, high acoustic roughness, and irregular vocal cord vibration. Consecutive flagged frames are grouped into fry segments. 
**Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `sample_rate_seconds` | `0.01` | Analysis frame rate in seconds | | `pitch_floor` | `60` | Minimum pitch frequency (Hz) for detection | | `pitch_ceiling` | `400` | Maximum pitch frequency (Hz) for detection | | `low_pitch_threshold_multiplier` | `0.6` | Fraction of speaker's median pitch below which a frame is considered low-pitched | | `jitter_threshold_multiplier` | `2.0` | Multiple of baseline jitter above which a frame is flagged | | `harmonics_to_noise_ratio_threshold_offset_db` | `-10.0` | dB offset below baseline HNR that marks a frame as noisy | | `harmonics_to_noise_ratio_minimum_pitch` | `60` | Minimum pitch for HNR calculation (Hz) | | `harmonics_to_noise_ratio_silence_threshold` | `0.1` | Amplitude threshold below which frames are treated as silent | | `harmonics_to_noise_ratio_periods_per_window` | `1.0` | Analysis window size in pitch periods for HNR | | `baseline_calculation_multiplier` | `0.8` | Fraction of median pitch used to define the "clear voice" baseline for HNR and jitter | | `min_fry_segment_seconds` | `0.05` | Minimum duration (s) for a fry segment to be counted | **How to interpret**: - Total time in vocal fry (seconds). Lower is better. - Occasional brief fry is common in natural speech; sustained or frequent fry may reduce perceived quality. ### Spectrogram Pitch Analysis **Purpose**: Evaluates whether audio contains natural upper-frequency content, which is a key indicator of voice naturalness. Synthetic or bandwidth-limited audio often lacks energy in higher frequency ranges. **What it measures**: The fraction of upper-frequency spectrogram bins that have energy above a noise floor, averaged across analysis windows. Returns **1.0 (pass)** or **0.0 (fail)** based on whether the average fill ratio meets the naturalness threshold. **When to use**: - Detecting bandwidth-limited or muffled synthesized speech - Comparing voice model configurations for spectral richness - Identifying voices that lack harmonic upper-frequency energy **How it works**: Splits the audio into fixed-length windows and computes a frequency spectrum for each. The fraction of bins in the upper frequency region that exceed the noise floor is measured per window. If the average fill ratio across all windows meets the naturalness threshold, the metric passes. **Configuration** (via metric metadata): | Parameter | Default | Description | |-----------|---------|-------------| | `naturalness_threshold` | `0.10` | Minimum average fill ratio (0.0–1.0) to pass | | `upper_region_percentage` | `0.25` | Fraction of the frequency range treated as the upper region | | `noise_floor_db` | `-15.0` | dB level above which a bin counts as filled | | `segment_length_seconds` | `2.0` | Duration of each analysis window | **How to interpret**: - **1.0** = pass — average upper-frequency fill ratio meets the naturalness threshold. - **0.0** = fail — audio lacks sufficient upper-frequency energy. - The detail view shows the fill ratio per window across the recording timeline. --- ## Using Trace Context in LLM Judge Metrics **Purpose**: Give an LLM Judge metric visibility into what your agent actually did — not just what it said — by including OpenTelemetry span data alongside the transcript. When **Include Traces** is enabled on a custom transcript scope, the judge automatically receives a `TRACE CONTEXT:` block appended to its prompt. 
This block summarizes the OTel spans from the conversation: span names, timing windows, and key attributes like tool call names and function arguments. ### Walkthrough [Video: Loom Video](https://www.loom.com/embed/17d8a2dcb55e46b49cde11c515acc658) ### When to Enable Include Traces Trace context is most valuable when the behavior you want to evaluate isn't visible in the transcript alone: | Use Case | Why Traces Help | |----------|----------------| | Verify the agent used the right tools in the right order | Tool call spans show what functions were invoked and with what arguments | | Catch hallucinations — agent claimed to do something it didn't | Trace spans show whether the action actually occurred | | Evaluate retrieval quality | Retrieval spans show what data was fetched before the agent responded | | Assess error handling | Error spans reveal failures the agent may have silently recovered from | ### How to Enable 1. Open or create an LLM Judge metric (Binary, Numerical, Categorical, or Audio). 2. Set **Transcript Scope** to **Custom**. 3. In the custom scope configuration panel, toggle **Include Traces** on. The trace context is appended automatically — no changes to your judge prompt are required, though you can reference it explicitly for better results. ### Requirements - Your agent must emit OpenTelemetry traces to Coval. See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup. - The simulation must have produced trace data. If no trace data is available, the toggle has no effect and the prompt is sent without a trace context block. ### Writing Prompts That Leverage Trace Context When writing prompts for metrics with trace context enabled, reference the trace data explicitly. The judge sees a `TRACE CONTEXT:` block appended after the transcript — you can instruct it to reason about both sources. #### Example: Verify Tool Usage ``` Given the transcript and trace context, did the assistant call the `lookup_account` function before providing account balance information? Return YES if: • The TRACE CONTEXT shows a tool call to `lookup_account` (or equivalent) occurring before the agent stated the balance • The transcript confirms the agent provided balance details Return NO if: • The agent mentioned account balance information but no `lookup_account` tool call appears in the TRACE CONTEXT • The tool call appears AFTER the agent has already stated the balance (out of order) • The TRACE CONTEXT shows a failed or missing tool call for this operation Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone. ``` #### Example: Catch Hallucination ``` Given the transcript and trace context, did the assistant accurately report what actions it took? Return YES if: • All actions the assistant claims to have performed appear in the TRACE CONTEXT as actual tool or function calls Return NO if: • The assistant stated it performed an action (e.g., "I've updated your address") but no corresponding tool call appears in the TRACE CONTEXT • The TRACE CONTEXT shows an error or missing call for an action the assistant claimed was successful Note: Minor phrasing differences between the transcript and trace data are acceptable — evaluate intent. ``` > **Tip:** Add "Note: If no TRACE CONTEXT is provided, evaluate based on transcript alone" to your prompt. This makes the metric degrade gracefully on simulations where traces weren't captured. 
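If your agent is instrumented in Python, a span like the one below is the kind of signal the judge will see in the `TRACE CONTEXT:` block. This is an illustrative sketch only: the span and attribute names are examples rather than a required schema, and the exporter configuration that actually sends traces to Coval is covered in the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry).

```python
# Illustrative sketch: emit a tool call span so trace context has data to summarize.
# Span and attribute names below are examples, not a required schema; exporter setup
# (sending spans to Coval) is configured separately per the OpenTelemetry Traces guide.
from opentelemetry import trace

tracer = trace.get_tracer("my-voice-agent")  # hypothetical instrumentation name


def lookup_account(account_id: str) -> dict:
    with tracer.start_as_current_span("llm_tool_call") as span:
        span.set_attribute("tool.name", "lookup_account")
        span.set_attribute("tool.arguments", f'{{"account_id": "{account_id}"}}')
        result = {"balance": 1234.56}  # placeholder for the real backend call
        span.set_attribute("tool.success", True)
        return result
```

With a span like this in place, the hallucination example above can check whether a claimed action (for example, "I've looked up your account") is backed by an actual `lookup_account` tool call in the trace.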
--- ## Utilizing Attributes You can embed dynamic values from agents, test cases, and simulations into your metric prompts using template variables. This allows you to create context-aware metrics that adapt to specific agent configurations or test case requirements. For comprehensive documentation on using attributes, including nested paths, array indexing, dynamic keys, and complete examples, see [Attributes](/concepts/attributes/overview). --- ## Advanced Prompting Techniques ### 1. **Chain of Thought for Complex Evaluations** ``` Before making your final determination, consider: 1. What was the user's primary goal? 2. What actions did the assistant take? 3. What was the final outcome? 4. Did the outcome match the user's goal? Based on this analysis, did the assistant successfully resolve the user's issue? ``` ### 2. **Few-Shot Examples for Edge Cases** ``` Examples of what constitutes resolution: • User: "That fixed it, thanks!" → YES • User: "I'll try that and call back if needed" → YES • User: "This is too complicated, forget it" → NO • User hangs up without confirmation → NO Given the transcript, did the assistant successfully resolve the user's issue? ``` ### 3. **Hierarchical Decision Making** ``` First, determine if the assistant attempted to address the user's concern: - If no attempt was made → Return NO - If attempt was made → Continue to step 2 Second, evaluate if the attempt was successful: - If user confirmed satisfaction → Return YES - If user remained unsatisfied → Return NO - If outcome unclear → Return NO (err on conservative side) ``` --- ## Using Agent Attributes and Test Case Attributes You can make your metric prompts more dynamic and context-aware by referencing agent attributes and test case attributes. This allows you to create metrics that evaluate agent performance against specific agent configurations or test case requirements. ### Agent Attributes Agent attributes are custom properties you define for each [agent configuration](/concepts/agents/overview#attributes). **How to use agent attributes in metric prompts:** Insert `{{agent.attribute_name}}` anywhere in your metric prompt. The system will automatically replace this placeholder with the actual attribute value from the agent being evaluated. **Example 1: Business Hours Verification** ``` Given the transcript, did the assistant provide the correct opening hours? The correct opening hours are: {{agent.opening_hours}} Return YES if: • The assistant stated the opening hours as {{agent.opening_hours}} • The assistant provided opening hours that match exactly (e.g., "9 AM to 5 PM" matches "9:00am-5:00pm") Return NO if: • The assistant provided different opening hours than {{agent.opening_hours}} • The assistant claimed not to know the opening hours • The assistant provided incorrect or conflicting information ``` ### Test Case Attributes For a test case with attributes like: ```json { "source": "LAX", "destination": "SFO", "ticket_class": "business" } ``` You could create a metric prompt: ``` Given the transcript, did the assistant correctly process the flight booking request? 
The booking details are: - Source: {{test_case.source}} - Destination: {{test_case.destination}} - Ticket Class: {{test_case.ticket_class}} Return YES if: • The assistant confirmed all three details correctly (source, destination, and ticket class) • The assistant used the exact values: {{test_case.source}}, {{test_case.destination}}, and {{test_case.ticket_class}} Return NO if: • Any of the three details were incorrect or missing • The assistant confused source and destination • The assistant used a different ticket class than {{test_case.ticket_class}} ``` ### Combining Agent and Test Case Attributes You can use both agent attributes and test case attributes in the same metric prompt to create comprehensive evaluations. --- ## Knowledge Base Metrics Coval allows you to connect a knowledge base (KB) to your agent and create LLM Judge metrics that use your knowledge base as context. This enables you to track accuracy on specific articles, knowledge bases, or different flows mentioned in your documentation. **Use cases for KB metrics:** - Verify agents answer questions using approved knowledge base content. - Track accuracy across different documentation sources. - Ensure compliance with specific information in FAQs, policies, or procedures. - Monitor whether agents provide consistent responses based on authoritative sources. > **Tip:** **Pro Tip:** KB metrics are particularly valuable for customer service agents, healthcare bots, or any application where accuracy against documented information is critical. ### Setting Up Your Knowledge Base **Step 1: Navigate to Agent Configuration** 1. Go to your **Agent** setup page 2. Select the agent you want to connect to a knowledge base 3. Scroll down to the **Knowledge Base** section **Step 2: Add Knowledge Base Entries** Coval supports multiple knowledge base formats: 1. Click "Add Knowledge Base Entry" 2. Select your file type 3. Upload your file (Coval will automatically parse it) 4. Add a descriptive name (e.g., "Hotel FAQ", "Product Documentation") 5. Optionally add tags for organization 6. Click "Upload" ![image.png](/images/image.png) All uploaded entries will appear in your knowledge base list, associated with the selected agent. ### Creating Knowledge Base Metrics **Step 1: Create a New Metric** 1. Navigate to the **Metrics** section 2. Click "Create New Metric" 3. Select **Binary LLM Judge** as the metric type 4. Name your metric (e.g., "FAQ Knowledge Base Accuracy") ### Step 2: Write Your LLM Judge Prompt Structure your prompt to evaluate whether the agent used knowledge base information correctly: **Example Prompt Structure:** ``` Given the transcript, did the assistant answer the user's initial question accurately using information from the Hotel FAQ knowledge base?
Return YES if: - The assistant provided specific FAQ details that are factually correct (exact addresses, dollar amounts, precise policies, named amenities) - Core facts match the FAQ even if paraphrased (e.g., "4:00pm check-in" can be stated as "check-in at 4 PM") - The response directly addresses the user's initial question with accurate FAQ information Return NO if: - The assistant provided information that contradicts the FAQ (e.g., claiming there is a pool when FAQ states there is no pool) - The assistant gave generic responses without specific FAQ details - The assistant fabricated information not contained in the FAQ - The assistant claims lack of information when the FAQ contains the answer - The initial question remains unanswered despite FAQ coverage - The assistant provided factually incorrect information, even if detailed and specific Return Unknown if: - The user's question is not covered in the FAQ **Critical: Prioritize factual accuracy over response detail. A detailed but incorrect answer must return NO.** ``` > **Tip:** **Best Practice:** Be specific about what constitutes accurate vs. inaccurate > responses based on your knowledge base. Include edge cases where the KB might > not have complete information. **Step 3: Enable Knowledge Base Context** **Critical step:** At the bottom of the metric configuration: 1. Locate the **Knowledge Base** toggle (initially disabled) 2. **Enable** the Knowledge Base option 3. The system will automatically include your knowledge base as context when evaluating > **Warning:** **Critical**: If you don't **enable the Knowledge Base toggle**, the metric will evaluate > without KB context and may produce inaccurate results. ### Step 4: Save Your Metric 1. Review your prompt and settings 2. Click "Create Metric" 3. Your KB metric is now ready to use in simulations and monitoring ### Using Knowledge Base Metrics in Evaluations **In Simulations** 1. Create or select a test set with scenarios that should use KB information 2. Launch a simulation (or use a template) 3. Select your KB accuracy metric in the metrics list 4. Run the simulation **In Monitoring** 1. Set your KB metric as a **Default Metric** to run on all incoming transcripts 2. Create **Metric Rules** to apply KB metrics conditionally 3. Monitor results in real-time to catch KB accuracy issues in production ## Best Practices for Knowledge Base Metrics ### Writing Effective Prompts **Do:** - Be explicit about what information should come from the KB. - Define clear conditions for YES and NO responses. - Account for situations where the KB doesn't have complete information. - Consider partial accuracy vs. complete inaccuracy. **Don't:** - Make assumptions about what the LLM knows without KB context. - Create overly complex evaluation criteria. ### Knowledge Base Organization **Recommended structure:** - Use clear, descriptive names for each KB entry. - Add tags to categorize different types of information. - Keep individual KB files focused on specific topics. - Update KB entries regularly to reflect current information. ## Metric Validation and Testing ### 1. **Metric Improvement Process** - Use Coval's "Improve Metric" feature with test transcripts. - Iterate on prompts to reduce variance. - Test edge cases and ambiguous scenarios. - Aim for >90% consistency across similar evaluations. ### 2.
**Common Issues and Solutions** | Issue | Solution | | -------------------- | -------------------------------------------------- | | Inconsistent scoring | Add more specific criteria and examples | | Edge case failures | Include explicit handling for boundary conditions | | LLM hallucination | Use more structured prompts with clear constraints | | Low correlation | Ensure metric measures what you intend to measure | ### 3. **Performance Optimization** - Keep prompts under 2,000 characters when possible. - Use regex metrics for simple pattern detection. - Combine related evaluations into single metrics when logical. - Test with diverse conversation types and lengths. ## Best Practices Summary For Creating Metric Prompts - Use specific, measurable criteria. - Provide clear positive and negative examples. - Test extensively with real conversation data. - Maintain consistent terminology and structure. - Include edge case handling. This systematic approach to metric creation will ensure reliable, actionable insights from your Coval evaluations. --- ## Custom Trace Metrics Source: https://docs.coval.dev/concepts/metrics/custom-trace-metrics Extract numerical values from OpenTelemetry spans to measure custom latency, performance, and behavior signals. ## Walkthrough [Video: Loom Video](https://www.loom.com/embed/54f0c0062ea045ceb65d8feecc9cbd92) ## Overview Custom Trace Metrics let you extract a specific numerical value from your agent's OpenTelemetry spans and aggregate it across all turns in a simulation. Use Custom Trace Metrics when you have a signal already captured in your traces — latency measurements, confidence scores, token counts, retry attempts — that you want to track and trend across runs. ## Prerequisites Your agent must be instrumented with OpenTelemetry and sending spans to Coval. See the [OpenTelemetry Traces guide](/concepts/simulations/traces/opentelemetry) for setup instructions. If traces are not present for a simulation, the metric will report an error at execution time. ## Configuration When creating a Custom Trace Metric, configure three fields: | Field | Description | |-------|-------------| | **Span Name** | The name of the OTel span to query (e.g. `llm`, `tts`, `stt`, `llm_tool_call`, or any custom span name you emit). | | **Metric Attribute** | The span attribute to extract the value from (e.g. `retrieval_latency_ms`, `confidence_score`, or another custom numeric attribute key). | | **Aggregation Method** | How to aggregate the extracted values across all matching spans in the simulation. | ### Aggregation Methods | Method | Description | |--------|-------------| | **Average** | Mean value across all matching spans. Best for typical-case latency or scores. | | **Median** | Median value across all matching spans. More robust to outliers than average. | | **p90** | 90th-percentile value. Best for understanding worst-case performance at scale. | | **Max** | Maximum value observed across all matching spans. Useful for worst-case detection. | | **Min** | Minimum value observed across all matching spans. | ### Span Names Any span name your agent emits can be queried. The following well-known span names map to Coval's built-in trace components: | Span Name | Component | |-----------|-----------| | `llm` | Language model invocations | | `tts` | Speech synthesis | | `stt` | Speech recognition | | `llm_tool_call` | Individual tool/function calls | | `turn` | A single conversation turn | Custom span names (e.g. 
`document_retrieval`, `database_lookup`) work as well — use whatever names your agent emits. ## How to Create **Step: Open the Metrics page** Navigate to the **Metrics** section in the Coval dashboard. **Step: Click Create Metric** Select **Custom Trace Metrics** from the metric type group. **Step: Configure the metric** Fill in **Span Name**, **Metric Attribute**, and **Aggregation Method** for your use case. **Step: Name and save** Give the metric a descriptive name and save. It is now available to add to any run. ## Use Cases ### Custom Latency Tracking Extract average document retrieval latency from your custom retrieval spans: | Field | Value | |-------|-------| | Span Name | `document_retrieval` | | Metric Attribute | `retrieval_latency_ms` | | Aggregation Method | Average | This gives you the average retrieval latency across all turns in the simulation. Compare it across runs to catch regressions after changes to your index, embeddings, or chunking strategy. ### p90 External API Latency Track tail latency for an external service your agent depends on: | Field | Value | |-------|-------| | Span Name | `weather_api` | | Metric Attribute | `duration_ms` | | Aggregation Method | p90 | Use p90 instead of average when you care about tail performance instead of typical performance, especially for services that can occasionally spike. ### Tool Call Duration Monitoring If your agent emits custom spans for specific tool calls with a duration attribute: | Field | Value | |-------|-------| | Span Name | `database_lookup` | | Metric Attribute | `duration_ms` | | Aggregation Method | Average | ### Confidence Score Extraction If your agent records a confidence score on each language model span: | Field | Value | |-------|-------| | Span Name | `llm` | | Metric Attribute | `confidence_score` | | Aggregation Method | Average | > **Tip:** Custom Trace Metrics complement built-in trace metrics like **LLM Time to First Byte** and **TTS Time to First Byte**. Use the built-in metrics for standard pipeline components and Custom Trace Metrics for signals specific to your agent's instrumentation. --- ## Metric Chaining Source: https://docs.coval.dev/concepts/metrics/MetricChaining Combine metrics into custom logic flows for more efficient and accurate evaluations Metric Chaining allows you to create conditional metric flows where a follow-up metric runs only when specific criteria are met by a trigger metric. This approach helps you avoid cramming multiple evaluation checks into a single metric while ensuring accuracy and efficiency in your evaluations. Instead of creating one complex metric that tries to handle multiple scenarios, you can break down your evaluation logic into separate, focused metrics that run conditionally based on the results of previous metrics. ## Benefits of Metric Chaining - **Improved Accuracy**: Each metric focuses on a specific aspect of the conversation - **Efficiency**: Only run metrics that are relevant to the specific conversation flow - **Clarity**: Separate concerns make metrics easier to understand and maintain - **Flexibility**: Create complex evaluation logic without overwhelming single metrics ## How Metric Chaining Works 1. **Trigger Metric**: The primary metric that runs first and determines whether additional metrics should execute 2. **Follow-up Metric**: The secondary metric that runs conditionally based on the trigger metric's result 3. 
**Criteria**: The condition that determines when the follow-up metric should run (e.g., "equal to 0", "greater than 0") ## TL;DR Walkthrough: [Video: Loom Video](https://www.loom.com/embed/743bcac22b124fc09fe7c49e5429ad87?sid=871e5013-0f7d-4102-8d7b-2755cb684424) ## Example Use Case: Appointment Setter Agent Consider an appointment setter agent with two distinct evaluation needs: 1. **Repeat Caller Handling**: Check if the agent correctly identifies repeat callers 2. **Patient Information Collection**: Check if the agent collects necessary information (first name, last name, phone number) Without metric chaining, you might be tempted to create one large metric covering both scenarios. With metric chaining, you can: - Use "Repeat Caller Handling" as your trigger metric - If it returns "No" (agent couldn't identify as repeat caller), then run "Patient Information Collection" - If it returns "Yes" (agent identified repeat caller), skip the information collection check ## Setting Up Metric Chaining ### Prerequisites Before creating a metric chain, ensure you have: 1. Created your trigger metric 2. Created your follow-up metric 3. Tested both metrics individually ### Creating a Metric Chain 1. Navigate to **Metric Chains** in your dashboard 2. Click **"Add a Metric Chain"** 3. Configure the chain: - **Status**: Set as Active or Inactive - **Trigger Metric**: Select your primary metric - **Follow-up Metric**: Select the metric to run conditionally - **Criteria**: Define when the follow-up metric should run - **Equal to 0**: Run follow-up when trigger returns "No" - **Greater than 0**: Run follow-up when trigger returns "Yes" - **Other conditions**: As needed for your use case 4. **Save** your metric chain ### Applying Metric Chains in Evaluations When launching an evaluation with metric chains: 1. In your evaluation setup, select **only the trigger metric** 2. The follow-up metric will automatically run based on your chain conditions 3. Do not manually select the follow-up metric - the chain will handle this > **Info:** Only select the trigger metric when launching evaluations. The chained metrics will run automatically based on your configured conditions. ## Example Results ### Scenario 1: Trigger Metric Returns "Yes" - **Trigger**: "Repeat Caller Handling" returns **Yes** - **Result**: Agent successfully identified repeat caller - **Chain Action**: Follow-up metric does NOT run - **Transcript Example**: "Perfect, John Doe, I see you..." ### Scenario 2: Trigger Metric Returns "No" - **Trigger**: "Repeat Caller Handling" returns **No** - **Result**: Agent could not identify repeat caller - **Chain Action**: "Patient Information Collection" runs automatically - **Transcript Example**: "I can't find your information in our system, I'm sorry..."
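If it helps to see the conditional logic spelled out, here is a minimal sketch of what a chain like the appointment setter example evaluates. It is purely conceptual (Coval executes chains for you); `evaluate_metric` is a stand-in for whatever judge actually scores each metric.

```python
# Conceptual sketch of metric chain logic (not Coval's implementation).
from typing import Callable, Dict


def run_metric_chain(
    transcript: str,
    evaluate_metric: Callable[[str, str], int],  # (metric_name, transcript) -> 1 (Yes) or 0 (No)
) -> Dict[str, int]:
    results = {}
    # 1. The trigger metric always runs.
    trigger = evaluate_metric("Repeat Caller Handling", transcript)
    results["Repeat Caller Handling"] = trigger
    # 2. Criteria "equal to 0": the follow-up runs only when the trigger returned "No".
    if trigger == 0:
        results["Patient Information Collection"] = evaluate_metric(
            "Patient Information Collection", transcript
        )
    return results
```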
## Best Practices - **Start Simple**: Begin with two-metric chains before creating more complex flows - **Test Individually**: Ensure each metric works correctly on its own before chaining - **Clear Logic**: Make sure your chain conditions align with your evaluation goals - **Document Chains**: Keep track of your metric chain logic for team collaboration ## Advanced Usage Metric chaining can be extended for more complex scenarios: - **Multi-step Chains**: Chain multiple metrics in sequence - **Different Conditions**: Use various threshold conditions for triggering - **Business Logic**: Implement complex business rules through chained evaluations > **Info:** **Need more complex metric chaining scenarios?** Contact our team to discuss advanced metric chain configurations for your specific use case. # Metric Chaining vs Workflow Verification: When to Use Each > Understanding the differences and choosing the right evaluation approach for your use case Both Metric Chaining and Workflow Verification help you evaluate conditional logic in your agent conversations, but they serve different purposes and work in distinct ways. This guide helps you choose the right approach for your specific evaluation needs. ## Overview Comparison | Feature | Metric Chaining | Workflow Verification | | :------------------- | :---------------------------------- | :--------------------------------------- | | **Purpose** | Custom conditional evaluation logic | Pre-defined workflow compliance checking | | **Setup Complexity** | Moderate (create multiple metrics) | Simple (uses existing agent workflow) | | **Flexibility** | High - any conditional logic | Limited to predefined workflows | | **Granularity** | Separate results for each condition | Single workflow compliance score | | **Efficiency** | Runs only relevant metrics | Evaluates entire workflow path | ## When to Use Metric Chaining Choose **Metric Chaining** when you need: ### **Custom Conditional Logic** - Complex "if-then" scenarios that don't follow a linear workflow - Multiple branching conditions based on conversation context - Business rules that vary based on user characteristics or responses ### **Granular Insights** - Separate scores for each evaluation step - Detailed breakdown of where conversations succeed or fail - Ability to analyze specific conditional branches independently ### **Efficiency Optimization** - Avoid running irrelevant evaluations - Save computation costs on large-scale monitoring - Focus evaluation resources on applicable scenarios ### **Example Use Cases** - **New vs. 
Returning Users**: "If user is new → check info collection, if returning → check account verification" - **Product-Specific Flows**: "If insurance inquiry → check coverage questions, if claims → check claim validation" - **Escalation Scenarios**: "If technical issue → check troubleshooting steps, if billing → check payment verification" ## When to Use Workflow Verification Choose **Workflow Verification** when you have: ### **Pre-Defined Linear Workflows** - Clear, sequential steps your agent should follow - Workflows already configured during agent creation - Standard operating procedures that rarely change ### **Overall Compliance Checking** - Need to verify agents follow established processes - Simple pass/fail evaluation for entire workflow - Regulatory or compliance requirements ### **Quick Setup Requirements** - Want immediate evaluation without creating custom metrics - Have straightforward, documented agent workflows - Need basic workflow adherence monitoring ### **Example Use Cases** - **Customer Service Flow**: "Greeting → Issue Identification → Resolution → Closure" - **Sales Process**: "Qualification → Needs Assessment → Presentation → Close" - **Support Tickets**: "Intake → Categorization → Assignment → Resolution" ## Detailed Example: Appointment Scheduling Agent Let's compare how each approach handles an appointment scheduling scenario: ### **Scenario**: Agent should collect different information based on appointment type **Requirements:** - New patient appointments: Collect name, phone, insurance - Follow-up appointments: Verify existing info, confirm time - Emergency appointments: Prioritize urgency, collect minimal info ### **Metric Chaining Approach** ``` Trigger Metric: "Appointment Type Identification" ├── If "New Patient" → Run "New Patient Info Collection" ├── If "Follow-up" → Run "Existing Patient Verification" └── If "Emergency" → Run "Emergency Prioritization Check" ``` **Benefits:** - Each appointment type gets targeted evaluation - Separate success rates for different flows - No wasted evaluations on irrelevant scenarios **Results Example:** - Appointment Type ID: 95% success - New Patient Info: 87% success (only for new patients) - Follow-up Verification: 92% success (only for follow-ups) ### **Workflow Verification Approach** ``` Predefined Workflow: 1. Identify appointment type 2. Collect appropriate information 3. Schedule appointment 4. Confirm details ``` **Benefits:** - Simple setup using existing agent workflow - Single compliance score for entire process - Easy to understand pass/fail results **Results Example:** - Overall Workflow Compliance: 89% success ## Implementation Guidance ### **Start With Workflow Verification If:** - Your agent has well-defined, linear workflows - You need quick evaluation setup - Simple compliance checking meets your needs - Your team prefers straightforward metrics --- ## Human Reviews Source: https://docs.coval.dev/concepts/metrics/human-review/human-review Learn how to perform human reviews of agent conversations and metrics in Coval ## Overview The human review workflow in Coval allows you to manually review conversations or runs to ensure quality and identify areas for improvement. This guide will walk you through the process of conducting reviews and providing actionable feedback. > **Info:** **Pro Tip:** Regular human reviews are essential for maintaining high-quality AI interactions and identifying areas for improvement in your agent's performance. 
[Video: Loom Video](https://www.loom.com/embed/460ff2faae254749af88bb32fc5c6c53) ## Getting Started with Human Review **Step: Select a Run or Conversation** ![Runs Page](/concepts/metrics/human-review/images/runs_page.png) 1. Navigate to either the Runs or Monitoring pages 2. Choose the specific run or conversation you want to review 3. Use keyboard shortcuts to navigate: - `j` and `k` to cycle through runs - `left` and `right` arrow keys to cycle through metrics **Step: Review the Content** ![Review Interface](/concepts/metrics/human-review/images/review_run.png) - Compare and review the agent's performance against the metrics - Determine the correct value based on your assessment - Update the metric if the automated value is incorrect - Add notes to the run to provide feedback - Notes can be dragged and positioned anywhere in the review interface **Step: Track Reviewed Content** ![Human Eval Page](/concepts/metrics/human-review/images/human_eval_page.png) - Reviewed or partially reviewed content automatically appears in the Human Eval page - View all your reviewed runs from both simulations and monitoring ## Supported Metric Types Not all metrics support human review — only those with a defined annotation mechanism can be labeled in the review interface. Metrics fall into four categories based on how reviewers interact with them. ### Direct Value Metrics Reviewers provide a single value for the entire conversation using buttons, a number input, or a dropdown. #### Binary (Pass/Fail) Reviewers select **Yes**, **No**, or **N/A** using on-screen buttons or keyboard shortcuts. - Applies to: binary AI judge metrics, audio binary judge, agent repeats itself #### Numerical Reviewers enter a number within a configured min/max range. - Applies to: numerical AI judge, audio numerical judge #### Categorical Reviewers select from a configured list of categories using a dropdown. - Applies to: categorical AI judge, audio categorical judge #### Transcript Sentiment Analysis Reviewers select a sentiment label (e.g. Rude, Polite, Encouraging, Professional) using category buttons. #### Composite Evaluation Reviewers assess each criterion individually using MET / NOT_MET / UNKNOWN toggles. --- ### Audio Region Metrics Reviewers mark or edit regions on an audio waveform timeline. These metrics require an audio recording to be present on the conversation. Includes: interruption rate, latency, abrupt pitch changes, volume/pitch misalignment, non-expressive pauses, vocal fry, music detection, time to first audio, volume variance, custom pause analysis, agent needs reprompting. --- ### Per-Segment Labeling Reviewers assign a label to each speaking segment in the conversation. - **Audio sentiment** — label each segment as Neutral, Angry, Happy, or Sad --- ### Per-Message Review Reviewers provide a value for each individual message in the transcript. - **Words per message** — count of words per assistant message ## Next Steps After reviewing runs, you can: 1. **Improve Your Agent** - Use the feedback to update prompts and capabilities - Run new simulations to test improvements 2. **Refine Your Metrics** - Test metric changes in simulations before deploying - Create new metrics or update existing ones based on review findings 3. **Assign More Reviews** - Delegate runs to team members for additional review - Track review progress in the Human Eval page > **Info:** **Continuous Improvement:** Use these insights to iteratively enhance both your agent and metrics, creating a feedback loop that drives better performance.
--- ## Templates Source: https://docs.coval.dev/concepts/templates/overview Launch evaluations quickly with pre-saved configurations Templates let you save evaluation configurations—including agent, test set, persona, and metrics—so you can launch simulations consistently with one click. You can also schedule recurring evaluations from any template. ## Creating a Template Navigate to **Templates** in the sidebar, then click **New Template**. ### Configuration Steps The template creation form walks you through each component: **1. Select Agent** Choose the voice or chat agent you want to test. Your agent connection settings (phone number, websocket URL, etc.) are preserved from your agent configuration. **2. Select Persona(s)** Choose how the simulated user should behave. You can select multiple personas—each persona will create a separate run, letting you compare performance across different user types. **3. Select Test Set** Pick the test cases that define the conversation scenarios. These determine what the simulated user will say and do during the evaluation. **4. Set Iterations** Define how many times each test case runs. With 2 test cases and 3 iterations, you'll get 6 total conversations. **5. Set Concurrency** Control how many simulations run in parallel. Higher concurrency speeds up evaluation but may hit rate limits on your agent infrastructure. **6. Select Metrics** Choose which metrics to evaluate. These can be built-in metrics (latency, interruptions) or custom metrics you've created. **7. Select Mutations (Optional)** If you've set up [agent mutations](/concepts/agents/overview#agent-mutations), select which variants to test. Each mutation creates a separate run comparing the base agent against the mutated version. **8. Save Template** Click **Create Template** to save. Your template now appears in the templates list. ## Launching from a Template From the Templates list: 1. Find your template and click **Run Now** 2. Review the pre-filled configuration 3. Click **Launch Evaluation** to start immediately Or from the **Launch Evaluation** page: 1. Select **Use Template** 2. Choose your saved template 3. Customize any settings for this specific run 4. Launch ## Scheduling Recurring Evaluations Templates can power scheduled, recurring evaluations—useful for continuous monitoring and regression detection. ### Creating a Scheduled Run 1. From the Templates list, click **Schedule** on your template 2. Configure the schedule: - **Name**: Identify this scheduled job - **Frequency**: Hourly, daily, or weekly - **Start/End dates**: Optional window for the schedule 3. Review the template configuration that will be used 4. Click **Create Schedule** The scheduled run inherits all template settings—agent, personas, test set, metrics, and mutations. Each time the schedule triggers, it launches a new evaluation with those exact parameters. ### Managing Schedules View all scheduled runs in the **Scheduled** tab: - **Active**: Schedules currently running on their cadence - **Paused**: Temporarily disabled schedules - **Completed**: Schedules that reached their end date Click any schedule to see its run history, success rate, and trend metrics over time. 
## Best Practices **Template Organization** - Create templates for each major workflow you test regularly - Name templates descriptively (e.g., "Disputes - Angry Customer Persona") - Use folders or naming conventions to group related templates **Scheduled Runs** - Start with daily schedules for active development - Use hourly only for high-traffic production monitoring - Set end dates for temporary testing periods **Mutation Testing** - Create templates with mutations to validate prompt changes - Compare base vs. mutated results before deploying changes ## Deprecated Features > **Warning:** The legacy "Scheduled Evaluations" feature has been removed. All recurring evaluations now use Templates with Scheduled Runs, which provides: > - Better visibility into configuration > - Consistent parameter inheritance > - Centralized management in the Templates section --- ## Simulations Source: https://docs.coval.dev/concepts/simulations/overview Simulate agent-user conversations and evaluate the results. When you launch a run, you trigger a simulation and subsequent evaluation of that simulation. Coval supports different simulation approaches: - **Text-based**: For chat agents using text inputs and outputs - **Voice-based**: For voice agents with audio inputs and outputs [Video: Loom Video](https://www.loom.com/embed/b47a9ab6f7a04c0baff2b8817882554f) ## **Setting Up an Evaluation** 1. Click "Launch Evaluation" 2. Select a template or configure manually: - Choose a test set - Select an agent to test - Select a persona - Choose metrics to track - Set simulation parameters - _(Optional)_ Add tags to label this run _(See "Templates" for more information)_ ## **Tagging Runs** You can add up to 20 tags to a run at launch time. Tags are useful for organizing and filtering runs — for example, by environment, release version, or test type. **From the UI:** A "Tags" card appears in the launch panel. Type a tag name and click **+** (or press Enter) to add it. Click the **×** on any tag chip to remove it. **Via the API:** Pass tags in the `metadata.tags` field of the launch request: ```json { "agent_id": "...", "persona_id": "...", "test_set_id": "...", "metadata": { "tags": ["regression", "v2.1", "nightly"] } } ``` Constraints: max 20 tags per run, each tag max 200 characters. After launch, you can filter runs by tag using the `tag=` filter expression (e.g., `tag="regression"`). ## **Scheduling Recurring Evaluations** 1. Enable the "Schedule Recurring" option 2. Set frequency (hourly, daily, weekly) 3. Configure start and end dates if applicable 4. Set alert thresholds for specific metrics (in "Alerts") > **Info:** **Benefits of Recurring Evaluations:** > - Continuous monitoring of your agent's performance > - Early detection of regressions or issues > - Ability to set alerts when specific metrics underperform > - Historical performance tracking for trend analysis # **Analyzing Evaluation Results** A simulation is a simulated conversation between Coval's simulated persona and your voice or chat agent. You can define the environment in which your agent is tested through test sets and Templates. Metrics define the success or failure criteria for your tests. [Video: Loom Video](https://www.loom.com/embed/2dd159a3a35f470ab67eb6f56e27321f?sid=b770fa47-f18f-423e-8ce0-3833e03be93a) ## Runs A Run is an evaluation. A Run can consist of multiple conversations (e.g., if the test set consists of multiple scenarios/transcripts).
On each run, you will see the following set of actions:

- Resimulate: Re-run if something looks off, or to confirm the performance of a specific metric
- Rerun metrics: If an LLM Judge metric doesn't perform as expected, adjust it, then go back to the run and rerun that specific metric
- Compare: Compare a run with any other run that was performed on the same test set
- Human Review: Provide feedback on the run results and send it to the "Manual Review" for team members to collaborate on iterations
- Share: Share an internal or public link to your run results - a great way to use simulations as part of your sales process!

![Docs Runresults Pn](/images/docs-runresults.png)

Clicking on one call of this run will open your metric results in detail, allowing you to check your results in depth, detect where in the transcript your issues arise, and see detailed explanations for LLM Judge metrics. If [OpenTelemetry traces](/concepts/simulations/traces/opentelemetry) are available for the simulation, an **OTel Traces** card appears in the metric grid showing span count and linking to the trace viewer.

![Docs Runresults2 Pn](/images/docs-runresults2.png)

## Overview

The Overview tab lists all individual conversations. It helps you get an overview of your agent's performance: create your own summary graphs and see aggregated performance over time.

## Review

Use Coval's Human-in-the-loop review capabilities to label runs for review.

## Deterministic Simulation Modes

By default, the persona generates responses dynamically using an LLM. For cases where you need repeatable, deterministic persona behavior, Coval offers two additional test case input types:

- **Audio Upload**: Upload a pre-recorded audio file (persona's side of the conversation) that plays back exactly as recorded instead of generating persona speech. The audio is automatically transcribed so persona turns still appear in the transcript. After playback completes, the simulation waits a 30-second grace period for the agent to finish responding, then ends the call. You can optionally attach a ground truth transcript to each test case to enable the [STT Word Error Rate (Audio Upload)](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) metric, which measures your agent's speech recognition accuracy against the known-correct transcript. See [Test Sets — Audio Upload](/concepts/test-sets/overview#3-audio-upload) for setup details.
- **Scripted Turns**: Define an ordered list of exact lines for the persona to deliver turn by turn. The persona still uses the configured voice and background sounds, but speaks the scripted text instead of LLM-generated responses. A built-in divergence detector monitors agent responses and can end the simulation early if the agent goes off-track. See [Test Sets — Script](/concepts/test-sets/overview#4-script) for setup details.

## Simulation Time Limits

Each simulated conversation has a maximum duration:

| Limit | Duration |
|---|---|
| **Default timeout** | 10 minutes |
| **Maximum timeout** | 15 minutes |

A simulation ends when the conversation reaches a natural conclusion, the test objective is met, or the timeout is reached — whichever comes first.

> **Info:** If your agent requires longer conversations, contact [support@coval.dev](mailto:support@coval.dev) to discuss your use case. The hard maximum per simulation is 15 minutes.
--- ## **Best Practices for your Evaluations:** > **Tip:** **Testing Strategy:** > - Start with core functionality test cases > - Expand to edge cases and failure scenarios > - Include regression tests for fixed issues > - Test across different user personas and scenarios > **Tip:** **Continuous Improvement:** > - Regularly update test sets based on production data > - Refine metrics as your understanding of agent performance evolves --- ## Multi-Run Analysis Source: https://docs.coval.dev/concepts/simulations/multi-run-analysis Compare and analyze multiple evaluation runs side-by-side in a single report. Multi-Run Analysis lets you bring multiple runs together into a single view so you can spot regressions, compare agent variants, and track metric trends across evaluations. Instead of flipping between individual run pages, you see all the data in one place—with color-coded grouping, aggregated statistics, and a shareable URL. ## When to use it Multi-Run Analysis is useful when you want to: - **Compare agent versions** — run the same test set against different agent builds and see which performs better across every metric - **Evaluate persona impact** — test the same agent against multiple personas and understand how user behavior affects outcomes ## Accessing Reports Navigate to **Reports** in the left sidebar ## Creating a Report There are two ways to start a new report. ### From the Runs list 1. Go to **Runs** in the sidebar. 2. Enable select mode and check the runs you want to analyze (up to 50 at a time). 3. Click **Multi-Run Report** — this opens the report builder pre-loaded with those run IDs in the URL (`?run_ids=...`). Alternatively, while the runs list has filters applied, click **Multi-Run Report with filters** to open a report that dynamically loads up to 50 runs matching the current filter state. The filter parameters are encoded in the URL, so the same link will resolve to the same runs when shared. ### From the Reports page Click **New Report** from the Reports page, then add run IDs manually or navigate there via the runs list flow above. ## Compare By The **Compare by** dropdown (top-right of the report) is the core analytical tool. It segments the data by a dimension you choose, then color-codes each segment so patterns are immediately visible in both the metric cards and the results table. | Option | What it segments by | |---|---| | **None** | No grouping — all rows shown together | | **Agent** | Groups rows by the agent that ran the simulation | | **Persona** | Groups rows by the persona used | | **Test case** | Groups rows by the specific test case input | | **Metadata** | Groups rows by a custom metadata key you specify | ## View Modes Once a Compare By dimension is selected, you can switch between two view modes: ### Row view Each simulation output appears as an individual row, color-coded by its Compare By group. This is the default. Use it when you want to inspect individual conversations or find outliers within a group. ### Grouped view Rows are collapsed into one row per group. Each group row shows aggregated metric scores for all simulations in that group. Use this when you want a high-level comparison across groups without the noise of individual results. 
The grouped view toolbar lets you toggle between five aggregation modes: | Mode | What it shows | |---|---| | **Average** | Mean score across all simulations in the group | | **Median** | Middle value — less sensitive to outliers than average | | **P95** | 95th percentile — useful for understanding worst-case performance | | **Min** | Lowest score in the group | | **Max** | Highest score in the group | Click a group row in the grouped view to expand it and see the individual simulation rows within that group. ## Filtering by Metric Clicking a metric card in the left pane filters the results table to show only the column for that metric, making it easier to focus on one score at a time. Click **All Metrics** in the breadcrumb to return to the full table. ## Saving a Report An unsaved report (opened from the runs list) shows a **Save Report** button in the header. Click it to save the current set of run IDs and view configuration (Compare By setting, view mode, and color overrides). After saving, the report gets a permanent ID and appears in the Reports list. On a saved report, the Save button is only active when the view configuration has changed from what was last saved. Click it to persist your latest configuration changes. To rename a saved report, click the pencil icon next to the report title and type a new name. Press Enter or click away to save. ## Sharing a Report Saved reports can be published for external sharing. 1. Open a saved report. 2. Click the **Share** button in the header. 3. Click **Publish shareable link** — this publishes all runs in the report and generates a public URL at `/shared/reports/{report_id}`. 4. Copy the link from the popover and share it. Anyone with the link can view the report without a Coval account. Published reports show a **Public** badge in the Reports list. To revoke access, open the Share popover and click **Unpublish all**. > **Info:** Reports copied from the Reports list actions menu show a warning if the report is still private — the link won't be accessible until the report is published. ## Deleting a Report From the Reports list, open the actions menu (three-dot icon) on any report row and select **Delete**. You'll be asked to confirm before the report is permanently removed. Deleting a report does not delete the underlying runs. --- ## OpenTelemetry Traces Source: https://docs.coval.dev/concepts/simulations/traces/opentelemetry Send traces from your agent to Coval using the OpenTelemetry SDK. > **Warning:** **Beta Feature** — Tracing with OpenTelemetry is currently in beta and under active development. Functionality and APIs may change as we continue to improve the experience. You can send traces from your agent to Coval using the [OpenTelemetry](https://opentelemetry.io/) SDK. This lets you capture detailed span data — such as tool calls, LLM invocations, and other operations — and export it directly to Coval for analysis alongside your simulation or monitoring results. Tracing works for both **simulations** (where Coval calls your agent) and **monitoring** (where you submit post-hoc call data). The setup differs only in how you identify the call — everything else (instrumentation, span naming, viewing) is the same. 
> **Tip:** **New to tracing?** If you're using Pipecat, LiveKit, or Vapi, the [Coval Wizard (Beta)](/concepts/simulations/traces/wizard) can instrument your agent automatically with one command: `npx @coval/wizard` ## **Prerequisites** - A Coval account with an API key ([manage your keys](https://app.coval.dev/settings)) - A simulation output ID (for simulations) or a conversation ID (for monitoring) - Python 3.8+ with the OpenTelemetry SDK installed Install the required packages: ```bash pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http ``` ## **Configuration** Configure the OpenTelemetry tracer provider to export spans to Coval's trace ingestion endpoint: ```python from opentelemetry.sdk import trace as trace_sdk from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.sdk.resources import SERVICE_NAME, Resource # Configure tracer resource = Resource.create({SERVICE_NAME: "my-agent"}) provider = trace_sdk.TracerProvider(resource=resource) exporter = OTLPSpanExporter( endpoint="https://api.coval.dev/v1/traces", headers={ "X-API-Key": "", "X-Simulation-Id": "", }, timeout=30, ) provider.add_span_processor(SimpleSpanProcessor(exporter)) tracer = provider.get_tracer("my-agent") ``` | Parameter | Description | |---|---| | `endpoint` | Coval's OTLP trace ingestion URL: `https://api.coval.dev/v1/traces` | | `X-API-Key` | Your Coval API key | | `X-Simulation-Id` | The **simulation output ID** for the individual call being traced. This is per-simulation-call, not the run ID. | | `timeout` | Export timeout in seconds. Must be set to `30` (see note below) | | `SERVICE_NAME` | A name identifying your agent service | > **Note:** The `timeout` parameter must be set to **30 seconds** to ensure spans are exported reliably. We are working on reducing this requirement in a future update. ## **Getting the Simulation Output ID** The `X-Simulation-Id` header must be set to the **simulation output ID** for the specific call you're tracing. The simulation output ID is a per-call identifier — different from the run ID. Here's how to obtain it at runtime. ### Inbound voice agents When Coval places an inbound call, it passes the simulation output ID as a SIP header: `X-Coval-Simulation-Id`. Read this header when the call arrives and use it to configure your OTLP exporter. ```python # Example: reading the simulation output ID from a SIP header # In your call.initiated webhook handler (Telnyx example): simulation_id = next( h["value"] for h in event["payload"]["sip_headers"] if h["name"] == "X-Coval-Simulation-Id" ) exporter = OTLPSpanExporter( endpoint="https://api.coval.dev/v1/traces", headers={ "X-API-Key": "", "X-Simulation-Id": simulation_id, }, timeout=30, ) ``` See the [Inbound Voice guide](/guides/simulations/inbound-voice) for provider-specific instructions on reading SIP headers (Twilio, Telnyx, etc.). ### Outbound voice agents Coval's outbound trigger POST can include the simulation output ID in the request payload. Add `simulation_output_id` to your `trigger_call_payload` configuration in your template, then read it when your webhook receives the trigger and use it to configure the exporter. > **Tip:** You can also find simulation output IDs in the Coval dashboard under any run's results, or via the Coval API. 
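For illustration, here is a minimal sketch of an outbound trigger webhook that reads the simulation output ID and configures the exporter for that call. It assumes a FastAPI app and a `/trigger` route, and that your `trigger_call_payload` places the ID at the top level under `simulation_output_id`; adjust the names to match your own configuration.

```python
from fastapi import FastAPI, Request
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

app = FastAPI()

@app.post("/trigger")  # hypothetical route configured as your outbound trigger URL
async def handle_trigger(request: Request):
    payload = await request.json()

    # Field name comes from your trigger_call_payload template configuration.
    simulation_id = payload.get("simulation_output_id")

    # Configure the OTLP exporter for this call using the per-call simulation output ID.
    resource = Resource.create({SERVICE_NAME: "my-agent"})
    provider = trace_sdk.TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(
        endpoint="https://api.coval.dev/v1/traces",
        headers={
            "X-API-Key": "",  # your Coval API key
            "X-Simulation-Id": simulation_id,
        },
        timeout=30,
    )
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("my-agent")

    # ... start the outbound call and instrument it with `tracer` ...
    return {"status": "ok"}
```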
## **Tracing for Monitoring Calls** For [monitoring](/concepts/monitoring/overview) (post-hoc call evaluation), there is no Coval-initiated call, so there is no simulation output ID available at call time. Instead, you use a **conversation ID** to associate traces with a monitoring conversation. The conversation ID is only available *after* the call ends and you submit the transcript to Coval — which means you can't configure the OTLP exporter up front. The solution is to buffer spans in memory during the call, then flush them once you have the ID. **Step: Buffer spans during the call** Use `InMemorySpanExporter` (included in `opentelemetry-sdk`) to hold spans locally during the call instead of exporting them in real time. ```python from opentelemetry.sdk import trace as trace_sdk from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter from opentelemetry.sdk.resources import SERVICE_NAME, Resource resource = Resource.create({SERVICE_NAME: "my-agent"}) in_memory_exporter = InMemorySpanExporter() provider = trace_sdk.TracerProvider(resource=resource) provider.add_span_processor(SimpleSpanProcessor(in_memory_exporter)) tracer = provider.get_tracer("my-agent") # Instrument your agent as normal — spans accumulate in memory with tracer.start_as_current_span("llm") as span: span.set_attribute("metrics.ttfb", 0.42) response = call_llm() ``` **Step: Submit the conversation after the call ends** Post the transcript (and optionally audio) to `POST /v1/conversations:submit`. The response contains the `conversation_id` you need for trace export. ```python import requests response = requests.post( "https://api.coval.dev/v1/conversations:submit", headers={ "x-api-key": "", "Content-Type": "application/json", }, json={ "transcript": [ {"role": "user", "content": "Hello", "start_time": 0.0, "end_time": 1.2}, {"role": "assistant", "content": "Hi! How can I help?", "start_time": 1.5, "end_time": 3.0}, ], }, ) conversation_id = response.json()["conversation"]["conversation_id"] ``` See [`POST /v1/conversations:submit`](/api-reference/v1/conversations/conversations/submit-conversation-for-evaluation) for the full request schema including optional audio, metadata, and metrics fields. **Step: Export the buffered spans** Create an OTLP exporter with `X-Conversation-Id` and flush the buffered spans to Coval. ```python from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter otlp_exporter = OTLPSpanExporter( endpoint="https://api.coval.dev/v1/traces", headers={ "X-API-Key": "", "X-Conversation-Id": conversation_id, }, timeout=30, ) finished_spans = in_memory_exporter.get_finished_spans() if finished_spans: otlp_exporter.export(list(finished_spans)) ``` | Parameter | Description | |---|---| | `X-Conversation-Id` | The `conversation_id` returned by `POST /v1/conversations:submit`. Use this **instead of** `X-Simulation-Id`. | > **Tip:** Traces can be sent immediately after submitting a conversation — no delay is needed. 
### Full monitoring example

```python
import requests

from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

COVAL_API_KEY = ""

# --- Call setup: buffer spans in memory ---
resource = Resource.create({SERVICE_NAME: "my-agent"})
in_memory_exporter = InMemorySpanExporter()
provider = trace_sdk.TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(in_memory_exporter))
tracer = provider.get_tracer("my-agent")

# --- During the call: instrument as normal ---
with tracer.start_as_current_span("llm") as span:
    span.set_attribute("metrics.ttfb", 0.42)
    response = call_llm()

with tracer.start_as_current_span("tts") as span:
    span.set_attribute("metrics.ttfb", 0.18)
    audio = synthesize_speech(response)

# --- After the call ends: submit transcript, then export spans ---
# `transcript` is the list of turn objects you assembled during the call
# (see the conversations:submit request schema above).
submit_response = requests.post(
    "https://api.coval.dev/v1/conversations:submit",
    headers={"x-api-key": COVAL_API_KEY, "Content-Type": "application/json"},
    json={"transcript": transcript},
)
conversation_id = submit_response.json()["conversation"]["conversation_id"]

otlp_exporter = OTLPSpanExporter(
    endpoint="https://api.coval.dev/v1/traces",
    headers={"X-API-Key": COVAL_API_KEY, "X-Conversation-Id": conversation_id},
    timeout=30,
)
finished_spans = in_memory_exporter.get_finished_spans()
if finished_spans:
    otlp_exporter.export(list(finished_spans))
```

### Uploading Traces via the Dashboard

You can also upload traces directly from the Coval dashboard without using the SDK. In the **Monitoring** page, click **Upload to Monitoring** and:

1. Add your audio file or transcript as usual
2. In the **Traces (Optional)** section, select your OTLP traces JSON file (must contain a `resourceSpans` array)
3. Click **Upload** — the conversation and traces are submitted together

This is useful for testing, debugging, or uploading historical traces that were captured separately.

## **Instrumenting Your Agent**

Once the tracer is configured, wrap operations in spans to capture trace data:

```python
# Use tracer in agent code
with tracer.start_as_current_span("tool_call") as span:
    span.set_attribute("tool.name", "search_database")
    result = call_tool()
```

You can nest spans to capture the full call hierarchy of your agent — for example, a parent span for the overall request and child spans for individual tool calls or LLM invocations.

> **Info:** **Shutdown** — Call `provider.shutdown()` when your agent exits. With `SimpleSpanProcessor`, spans are exported synchronously as each span ends (not buffered), so they are already in Coval before shutdown is called. Shutdown is still good practice for clean resource teardown.

```python
# Call on agent exit for clean resource teardown.
provider.shutdown()
```

## **Span Naming Conventions**

Coval's trace viewer applies semantic colors and labels to well-known span names. Using these names gives a richer experience in the UI.
| Span Name | Use For | Key Attributes |
|-----------|---------|----------------|
| `llm` | LLM invocations | `metrics.ttfb` (seconds), `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `llm.finish_reason` (`stop`, `tool_calls`, `length`, `content_filter`) |
| `tts` | Text-to-Speech | `metrics.ttfb` (seconds) |
| `stt` | Speech-to-Text | `metrics.ttfb` (seconds), `transcript` (transcribed text — required for [STT Word Error Rate](/concepts/metrics/built-in-metrics#stt-word-error-rate) and [Audio Upload variant](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload)), `stt.confidence` (ASR confidence 0.0–1.0) |
| `stt.provider.<provider>` | Per-provider STT attempt (child of `stt`) | `stt.providerName`, `stt.confidence`, `metrics.ttfb` |
| `vad` | Voice Activity Detection | — |
| `llm_tool_call` | Individual tool/function calls | `function.name`, `tool_call_id`, `function.arguments` |
| `turn` | A single conversation turn | — |
| `conversation` | Full conversation | — |
| `pipeline` | Processing pipeline | — |
| `transport` | Audio/network transport | — |

Any span name works — spans with names not listed above will still appear in the UI with auto-assigned colors. Use `service.name` in your `Resource` to group spans by service.

> **Info:** For complete working implementations, see the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents) on GitHub — Vapi, Pipecat, and LiveKit agents that emit the full span schema.

## **Instrumenting STT Spans**

To use the [STT Word Error Rate](/concepts/metrics/built-in-metrics#stt-word-error-rate) metric (or its [Audio Upload](/concepts/metrics/built-in-metrics#stt-word-error-rate-audio-upload) variant), your agent must emit `stt` spans with a `transcript` attribute containing the transcribed text. This is what allows Coval to compare your agent's STT output against a reference transcript. We also recommend attaching `stt.confidence` when your STT provider exposes a per-utterance confidence score.
Here is an example using the [Pipecat](https://github.com/pipecat-ai/pipecat) framework: ```python from opentelemetry import trace as otel_trace from pipecat.services.deepgram.stt import DeepgramSTTService from pipecat.utils.tracing.service_decorators import traced_stt def _read_path(value, *path): current = value for segment in path: if current is None: return None if isinstance(segment, int): if isinstance(current, (list, tuple)) and 0 <= segment < len(current): current = current[segment] else: return None continue if isinstance(current, dict): current = current.get(segment) else: current = getattr(current, segment, None) return current def extract_stt_confidence(result): confidence = _read_path(result, "channel", "alternatives", 0, "confidence") if confidence is None: return None normalized = float(confidence) if 0.0 <= normalized <= 1.0: return round(normalized, 4) return None class CovalDeepgramSTTService(DeepgramSTTService): """Adds stt.confidence to Pipecat's built-in traced `stt` spans.""" def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self._current_stt_confidence = None async def _on_message(self, *args, **kwargs): result = kwargs.get("result") is_final = bool(getattr(result, "is_final", False)) if result else False self._current_stt_confidence = extract_stt_confidence(result) if is_final else None try: await super()._on_message(*args, **kwargs) finally: if is_final: self._current_stt_confidence = None @traced_stt async def _handle_transcription(self, transcript, is_final, language=None): if is_final and self._current_stt_confidence is not None: span = otel_trace.get_current_span() if span.is_recording(): span.set_attribute("stt.confidence", self._current_stt_confidence) ``` Instantiate the subclass in your pipeline. With `PipelineTask(..., enable_tracing=True)`, Pipecat still emits the standard `stt` span, and the subclass adds `stt.confidence` onto that same span: ```python stt = CovalDeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY")) pipeline = Pipeline([ transport.input(), stt, context_aggregator.user(), llm, tts, transport.output(), ]) ``` For non-Pipecat agents, emit equivalent spans wherever your STT returns final transcriptions: ```python from opentelemetry import trace as otel_trace tracer = otel_trace.get_tracer("my-stt-instrumentation") with tracer.start_as_current_span("stt") as span: span.set_attribute("transcript", transcription_text) if confidence_score is not None: span.set_attribute("stt.confidence", confidence_score) ``` The span must be named `"stt"` and include the `transcript` attribute with the transcribed text. `stt.confidence` is optional, but when present it should be a 0.0-1.0 score for the final utterance. ## **Instrumenting LLM Spans** Include `llm.finish_reason` on `llm` spans so you can tell why the model stopped generating. This is especially useful when debugging responses that were silently cut off because `llm.finish_reason=length`. 
Here is a Pipecat example that enriches the built-in traced `llm` span: ```python from opentelemetry import trace as otel_trace from pipecat.services.openai.llm import OpenAILLMService def _read_path(value, *path): current = value for segment in path: if current is None: return None if isinstance(segment, int): if isinstance(current, (list, tuple)) and 0 <= segment < len(current): current = current[segment] else: return None continue if isinstance(current, dict): current = current.get(segment) else: current = getattr(current, segment, None) return current class _FinishReasonTrackingStream: def __init__(self, stream): self._stream = stream self._iter = stream.__aiter__() def __aiter__(self): return self async def __anext__(self): chunk = await self._iter.__anext__() finish_reason = _read_path(chunk, "choices", 0, "finish_reason") if finish_reason is not None: span = otel_trace.get_current_span() if span.is_recording(): span.set_attribute("llm.finish_reason", str(finish_reason)) return chunk async def aclose(self): if hasattr(self._iter, "aclose"): await self._iter.aclose() elif hasattr(self._stream, "aclose"): await self._stream.aclose() async def close(self): if hasattr(self._stream, "close"): await self._stream.close() else: await self.aclose() class CovalOpenAILLMService(OpenAILLMService): """Adds llm.finish_reason to Pipecat's built-in traced `llm` spans.""" async def get_chat_completions(self, params_from_context): stream = await super().get_chat_completions(params_from_context) return _FinishReasonTrackingStream(stream) ``` For non-Pipecat agents, set the attribute directly on your `llm` span after the provider response finishes: ```python with tracer.start_as_current_span("llm") as span: response = client.responses.create(...) if response.finish_reason: span.set_attribute("llm.finish_reason", response.finish_reason) ``` Common values include `stop`, `length`, `tool_calls`, and `content_filter`. ## **Provider Fallback Spans** Many voice agents use a provider fallback chain for STT — for example, Deepgram → Google → Azure. Without per-provider spans, a single `stt` span only shows the final result; there is no visibility into which provider served the call, how long each attempt took, or why a fallback triggered. The convention is to create one `stt.provider.` child span per provider attempt, nested inside the parent `stt` span: ``` stt ← parent span: final result └── stt.provider.deepgram ← attempt 1 (succeeded) ``` Or for a fallback: ``` stt ← parent span: final result ├── stt.provider.deepgram ← attempt 1 (failed, span status = ERROR) └── stt.provider.google ← attempt 2 (succeeded) ``` ### Span attributes | Attribute | Type | Description | |-----------|------|-------------| | `stt.providerName` | string | Provider name, e.g. 
`"deepgram"`, `"google"`, `"azure"` | | `stt.confidence` | float | ASR confidence score from this provider (0.0–1.0) | | `metrics.ttfb` | float | Time to first byte for this provider attempt (seconds) | ### Code example ```python from opentelemetry import trace as otel_trace tracer = otel_trace.get_tracer("my-stt-instrumentation") def transcribe_with_fallback(audio): providers = [("deepgram", deepgram_stt), ("google", google_stt)] final_transcript = None with tracer.start_as_current_span("stt") as stt_span: for provider_name, stt_fn in providers: attempt_start = time.time() with tracer.start_as_current_span(f"stt.provider.{provider_name}") as provider_span: provider_span.set_attribute("stt.providerName", provider_name) try: result = stt_fn(audio) ttfb = time.time() - attempt_start provider_span.set_attribute("metrics.ttfb", round(ttfb, 4)) confidence = getattr(result, "confidence", None) if confidence is not None: provider_span.set_attribute("stt.confidence", confidence) final_transcript = result.transcript break # success — stop trying fallbacks except Exception as e: provider_span.set_status(otel_trace.StatusCode.ERROR, str(e)) if final_transcript: stt_span.set_attribute("transcript", final_transcript) return final_transcript ``` ## **Viewing Traces in Coval** After a simulation completes or monitoring traces are received, an **OTel Traces** card automatically appears in the metric grid on the result page when trace data is available. The card shows the total span count and a **View Traces** button that navigates directly to the trace viewer. To view traces: open a run or monitoring result, click into a result, and click the OTel Traces card. You can also navigate directly via URL: ``` https://app.coval.dev//runs//results//traces ``` Traces appear within a few seconds of the simulation completing or being submitted. ### Trace viewer features The trace viewer has two visualization modes you can switch between using the toggle in the header: **Waterfall view** — Shows spans as horizontal bars on a timeline, nested by parent-child relationships. Use the collapse/expand controls to focus on specific parts of the call hierarchy. You can filter by span type using the color-coded legend pills in the header. **Flame graph view** — Shows all spans stacked by depth, giving a birds-eye view of where time is spent. Interactions include: - **Scroll** to pan the timeline left/right - **Ctrl/Cmd + scroll** to zoom in and out - **Drag-select** a region to zoom into that time range - **Double-click** a span to zoom to fit that span's duration - **Press F** to reset the view to fit the full trace - A **mini-map** above the flame graph shows the full trace with your current viewport highlighted — drag it to pan quickly In both views, clicking any span opens a **detail panel** on the right showing the span's attributes, timing, status, and parent chain. When no span is selected, the detail panel shows a **trace summary** with total spans, duration, span type breakdown with time percentages, slowest spans, and any error spans. ## **Transition Hotspots** Transition Hotspots give you a run-level view of how conversations flow through your agent's states — and where they fail. Rather than inspecting individual simulations one by one, you can see the full distribution of state-to-state transitions across an entire run at a glance. 
### Walkthrough [Video: Loom Video](https://www.loom.com/embed/22d81a41276340f4b7fb42609dc455bc) ### Accessing Transition Hotspots The **Hotspots** tab appears on the run results page when at least one simulation in the run has OTel trace data. Navigate to a run, then click the **Hotspots** tab. If the tab is not visible, the run does not contain any traced simulations. You can also access it directly via the `?view=hotspots` query parameter on the run results URL. ### Reading the Heatmap The Hotspots view displays a heatmap matrix where: - **Rows** represent the origin state of a transition (the "from" state) - **Columns** represent the destination state (the "to" state) - **Each cell** represents a pair of states — for example, "greeting → account_lookup" Toggle between two views using the buttons in the header: | View | Description | |------|-------------| | **Counts** | Each cell shows how many times that state-to-state transition occurred across all simulations in the run | | **Failure Rate** | Each cell shows the percentage of simulations that failed when hitting that transition | Darker cells indicate higher counts or higher failure rates, depending on the active view. ### Drilling Down Click any cell in the heatmap to open a detail panel showing: - The total count and failure count for that transition - **Exemplar simulations** — individual simulations that passed through that state transition, with direct links to review them Use exemplars to understand why a particular transition has a high failure rate: open a failing simulation and inspect the transcript and trace together. ### Top Hotspots Sidebar The **Top Hotspots** sidebar ranks state transitions by failure count, making it easy to find the most impactful problems without scanning the full matrix. The top-ranked transitions are the ones where the most simulations failed. ### Span Filters Use the **span type filters** to include or exclude specific span types from the transition analysis. Wrapper spans — such as `conversation`, `pipeline`, `transport`, and `session:*` spans — are automatically collapsed and filtered by default, so the heatmap focuses on the meaningful transitions within your agent's processing logic. > **Tip:** Start with the **Failure Rate** view to find which transitions are most problematic, then switch to **Counts** to understand the volume. A transition with a 100% failure rate but only 1 occurrence is less concerning than one with a 30% failure rate across 50 simulations. ## **Full Example** ```python from opentelemetry.sdk import trace as trace_sdk from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.sdk.resources import SERVICE_NAME, Resource # Configure tracer resource = Resource.create({SERVICE_NAME: "my-agent"}) provider = trace_sdk.TracerProvider(resource=resource) exporter = OTLPSpanExporter( endpoint="https://api.coval.dev/v1/traces", headers={ "X-API-Key": "", "X-Simulation-Id": "", }, timeout=30, ) provider.add_span_processor(SimpleSpanProcessor(exporter)) tracer = provider.get_tracer("my-agent") # Use tracer in agent code with tracer.start_as_current_span("tool_call") as span: span.set_attribute("tool.name", "search_database") result = call_tool() # Call on agent exit for clean resource teardown. provider.shutdown() ``` ## Using Span Attributes in Custom Metrics Any numeric span attribute your agent emits can be measured using a **Custom Trace Metric** (`METRIC_CUSTOM_TRACE`). 
This lets you track latency, token counts, or any other numeric value from your traces without writing custom evaluation code. To create a custom trace metric, specify: - **Span Name** — the `span_name` of the spans to aggregate (e.g. `llm`, `tts`, or any custom span you create) - **Metric Attribute** — the span attribute key containing the numeric value (e.g. `metrics.ttfb`, `token_count`) - **Aggregation Method** — how to aggregate across turns: `average`, `median`, `p90`, `max`, or `min` See [Create Metric](/api-reference/v1/metrics/metrics/create-metric) for the full API reference. --- ## Tracing Wizard Source: https://docs.coval.dev/concepts/simulations/traces/wizard Automatically add Coval OTel tracing to your Python voice agent with one command. > **Warning:** The Coval Wizard is in beta and under active development — more features coming soon! It uses an LLM to analyze and modify your code, so results may vary. Always review the proposed diff carefully before applying changes. The Coval Wizard ([`@coval/wizard`](https://www.npmjs.com/package/@coval/wizard)) reads your Python agent code, figures out exactly where to inject tracing, and writes the changes for you — including a diff preview, file backup, and connectivity validation. It works for [Pipecat](https://pipecat.ai), [LiveKit Agents](https://docs.livekit.io/agents/), [Vapi](https://vapi.ai), and generic Python agents. ## **Quick Start** Run this from your agent's project directory: ```bash npx @coval/wizard ``` The wizard will prompt you for your Coval API key if `COVAL_API_KEY` is not already set in your environment. ```bash # With API key pre-set COVAL_API_KEY=your-key npx @coval/wizard ``` ## **What It Does** **Step: Detects your project** Scans the directory for a Python project manifest (`pyproject.toml`, `requirements.txt`, `Pipfile`, or `setup.py`) and identifies your framework (Pipecat, LiveKit, Vapi, or generic Python). **Step: Analyzes your code** Sends your entry point file to an LLM along with framework-specific injection rules to determine the minimal changes needed. **Step: Shows a diff** Displays a colored diff of the proposed changes to your entry point and asks for confirmation before writing anything. **Step: Writes the files** Creates `coval_tracing.py` — a self-contained OpenTelemetry module — and modifies your entry point. Your original entry point is backed up to `.bak`. **Step: Validates connectivity** Sends a test span to `api.coval.dev` to confirm your API key is working and spans can reach Coval. If you don't have an API key yet, go to **Settings** in the Coval platform and click **Create Key**. 
## **Supported Frameworks** | Framework | Detection | What gets injected | |-----------|-----------|-------------------| | **Pipecat** | `pipecat-ai` in dependencies | `setup_coval_tracing()` before pipeline construction; simulation ID from `args.body` SIP headers; `enable_metrics=True, enable_tracing=True` on `PipelineTask` | | **LiveKit Agents** | `livekit-agents` in dependencies | `setup_coval_tracing()` before `AgentSession`; simulation ID from SIP participant attributes; `instrument_session(session)` after `await session.start()` | | **Vapi** | Vapi webhook patterns in `.py` files | `setup_coval_tracing()` at module level; simulation ID extracted from `assistantOverrides.variableValues["coval-simulation-id"]` in webhook handler | | **Generic Python** | Any Python project | `setup_coval_tracing()` at module level; `# TODO` comment marking where to call `set_simulation_id()` | ## **What It Sets Up** The wizard creates `coval_tracing.py` — a self-contained module you import in your agent. It provides three public functions: ```python from coval_tracing import setup_coval_tracing, set_simulation_id, instrument_session # Call once at startup (or at the start of each call session) setup_coval_tracing(service_name="my-agent") # Call when the Coval simulation ID arrives (e.g. from a SIP header) set_simulation_id(simulation_id) # LiveKit only — hooks session events to emit STT, LLM, and tool call spans instrument_session(session) ``` Spans emitted before `set_simulation_id()` is called are buffered and flushed automatically once the ID arrives — no spans are lost even if tracing initializes before the call connects. The module also includes convenience helpers (`create_llm_span`, `create_stt_span`, `create_tts_span`, `create_tool_call_span`) for manually wrapping operations in spans if needed. ## **What It Doesn't Set Up** > **Warning:** The wizard installs the tracing infrastructure but does **not** add advanced span attributes automatically — *yet*. For richer observability, you will need to add these manually after running the wizard: > > - `stt.confidence` — ASR confidence score per utterance (0.0–1.0). Requires hooking into your STT provider's result to extract the confidence value. > - `llm.finish_reason` — Why the LLM stopped generating (`stop`, `tool_calls`, `length`). Requires observing the LLM response before emitting the span. > - `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` — LLM token counts. Requires extracting token usage from your LLM response. > - `stt.provider.` sub-spans — Per-provider attempt spans for fallback chains. Requires wrapping each provider call individually. > > See the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents) for complete reference implementations that include all of these attributes. ## **Environment Variables** | Variable | Required | Description | |----------|----------|-------------| | `COVAL_API_KEY` | Yes | Your Coval API key. Prompted interactively if not set. | | `WIZARD_LLM_KEY` | No | LLM API key for direct use. | | `WIZARD_LLM_PROVIDER` | No | `anthropic`, `openai`, or `gemini` | | `WIZARD_LLM_MODEL` | No | Override the default model (e.g. `gpt-4o`, `gemini-2.5-flash`) | ## **Limitations** - **Python only** — The wizard requires a Python project manifest (`pyproject.toml`, `requirements.txt`, `Pipfile`, or `setup.py`) in the target directory. - **Single entry point** — Only one file is analyzed and modified. 
The entry point must be named `agent.py`, `main.py`, `bot.py`, `app.py`, or `server.py`, or be the sole `.py` file in the project root. Entry points larger than 50 KB are not supported.
- **LLM-generated code** — The wizard uses a language model to determine what to inject. Results are generally accurate for standard Pipecat, LiveKit, and Vapi patterns, but unusual project structures may require manual corrections. Always review the diff before confirming.
- **Generic Python wizard support is minimal** — For projects that aren't one of the three supported frameworks, the wizard adds `setup_coval_tracing()` and a `# TODO` comment. You will need to call `set_simulation_id()` manually and instrument spans yourself.
- **Validation is connectivity-only** — The validation step confirms that your API key can reach `api.coval.dev`. It does not verify that spans are correctly wired to your agent's call lifecycle.
- **No multi-file analysis** — The wizard reads your entry point and dependency manifest only. It does not analyze helper modules, shared utilities, or subpackages.

## **Next Steps**

After running the wizard:

1. Deploy your updated agent and run a simulation in Coval.
2. Open the result and look for the **OTel Traces** card — traces appear within a few seconds of the simulation completing.
3. To add richer span attributes (`stt.confidence`, `llm.finish_reason`, provider sub-spans), see the [OpenTelemetry guide](/concepts/simulations/traces/opentelemetry) and the [voice agent examples](https://github.com/coval-ai/coval-examples/tree/main/voice-agents).

---

## Dashboard

Source: https://docs.coval.dev/concepts/dashboard/overview

Monitor, analyze, and optimize your voice AI performance with real-time insights and drill-down capabilities.

# Analytics Dashboard for Voice AI

![Coval's Dashboard](/images/dashboard/overview.png)

Monitor, analyze, and optimize your voice AI performance with real-time insights and drill-down capabilities. Whether you're tracking simulation results, monitoring live system health, or analyzing conversation patterns, our dashboard provides the tools you need to make data-driven decisions.

## **Flexible Dashboard**

- Multiple Configurable Dashboards: Organize and switch between dashboards instantly to monitor different aspects of your voice AI system
- Intuitive Layout: Rearrange and resize widgets; layouts adapt automatically to any screen size and are saved automatically

## Widget Configuration with Live Preview

[Image: Widget Configuration]

- Choose from all available metrics on Coval's platform
- See your changes applied immediately in the live preview pane
- Filter widgets by specific agent IDs for focused monitoring
- Add multiple aggregation dimensions (Agent + Persona combinations)

## **Widget Library**

### **Bar Charts**

- Compare performance across categories with regular or stacked visualizations
- Perfect for analyzing success rates, error distributions, or agent performance comparisons
- Smart color-coding for binary metrics (Yes/No scenarios)

### **Line Charts**

- Track trends over time with multi-series support
- Built-in outlier detection to quickly spot anomalies
- Ideal for monitoring response times, call volumes, or quality scores

### **Review Management**

Monitor and manage your human review assignees directly from the dashboard. These widgets give you a centralized view of review progress and activity across all your projects.
[Video: Review Management Walkthrough](https://www.loom.com/embed/9b96bd12624e4624b860e56f0f5aa375) ### **Threshold / Target Zone Visualization** Display your org-level metric thresholds directly on dashboard charts to quickly assess whether performance is within acceptable bounds. - **Enable it**: Toggle **"Show target zone"** in the widget configuration panel for success rate bar or average line chart widget - **Line charts**: renders a shaded "fail zone" above or below the threshold line, making out-of-range periods immediately visible - **Bar charts**: renders a dashed reference line at the threshold value for at-a-glance comparison against each bar - The threshold value is pulled automatically from the custom threshold set in the metrics page ### **Question Monitoring** - Track unanswered questions and conversation gaps - Identify areas where your AI needs improvement - Monitor user satisfaction and engagement patterns ## **Filtering & Analysis** ### **Multi-Dimensional Aggregation** - **Agent-Level Analysis**: Compare performance across different AI agents - **Persona-Based Insights**: Analyze how different conversational personas perform - **Combined Analysis**: Mix and match agents and personas for comprehensive views - **Binary Metric Support**: Automatic Yes/No breakdowns for quality metrics ### **Intelligent Date Range Management** - **Per-Widget Control**: Set different time periods for each widget - **Global Overrides**: Apply date ranges across all widgets instantly - **Smart Presets**: Today, Yesterday, Last 7/30/90 days, Year-to-date - **Custom Ranges**: "Last N days/weeks/months" or specific date periods ### **Metadata Filtering** Filter your dashboard data using custom metadata fields attached to your simulations or live calls. This allows you to segment your analytics by any attributes you track—such as customer tier, campaign ID, region, or experiment variant. > **Note:** Metadata filters work with any key-value pairs you've included in your > simulation or monitoring data. Values are matched exactly, and multiple > filters are combined with AND logic. **How it works:** - **Key Selection**: Choose from existing metadata keys or enter a custom key name - **Value Filtering**: Select from suggested values or type your own custom value - **Search Support**: Quickly find values by typing—results filter as you type - **Multiple Filters**: Add several metadata filters to narrow down your analysis **Common Use Cases:** - Compare performance across different customer segments - Analyze A/B test results by experiment variant - Filter by deployment environment (staging vs. 
production) - Segment by geographic region or language ### **Powerful Filtering Options** - Filter by agent types, conversation attributes, and metadata fields - Real-time filter application without page refresh - Save and reuse common filter combinations ## **Deep Dive Analysis** ### **Focus Mode** - Click any data point to enter full-screen analysis mode - 50/50 split view: chart on left, detailed run data on right - Seamless transition back to overview dashboard ### **Run Details Investigation** - Drill down from aggregate charts to individual conversation data - See exactly which calls contributed to each data point - Investigate outliers and anomalies at the source level - Full context for root cause analysis ![Run Details Investigation](/images/dashboard/run-details.png) ### **Smart Data Bucketing** - Automatic time bucket optimization based on data density - Calendar-aware grouping (hours, days, weeks, months) - Timezone-aware calculations for accurate reporting --- > **Tip:** **Ready to transform your voice AI analytics?** Our dashboard platform > provides the insights and tools you need to monitor, analyze, and optimize > your voice AI performance with confidence. --- ## Human Review Source: https://docs.coval.dev/guides/improving-metrics-with-human-review Step-by-step guide to refine your metrics using human review ## Overview Coval's human review projects give you real feedback on the accuracy of your metrics. Create a project, label your simulations, and then use the metrics studio to improve metric performance. Note: Any conversation can be annotated in the results page, regardless of human review projects. > **Note:** Human review is supported for a subset of metric types. See [Supported Metric > Types](/concepts/metrics/human-review/human-review#supported-metric-types) for > the full list. ## Project Types When creating a human review project, you can choose between two modes using the **Collaborative** toggle. ### Collaborative Projects In **Collaborative** mode, all reviewers share a single queue and work toward a unified set of labels. **How it works:** - Each metric-simulation pair has one shared annotation — only one review score is recorded per pair - Reviewers can see each other's existing annotations as pre-fill when they open a conversation - A 10-second polling lock prevents two reviewers from annotating the same row at the same time — if a metric is locked, another reviewer is actively annotating it - An assignment is marked complete as soon as _any_ reviewer submits an annotation **Best for:** Building a ground-truth dataset, dividing labeling work across a team without duplication, or any scenario where a single authoritative label per conversation is the goal. ### Individual Projects (Default) In **Individual** mode, each reviewer has their own private queue and annotations. Reviewers cannot see each other's work. **Best for:** Measuring inter-annotator agreement, collecting multiple independent labels for the same conversation, or comparing perspectives across reviewers. > **Tip:** Use **Collaborative** mode when you want one ground-truth label per > conversation. Use **Individual** mode when you want to measure consistency or > collect multiple independent perspectives. ## Step-by-Step Workflow **Step: Create a human review project** Human review projects help you manage assignees, simulations, and metrics that you are accurately looking to track. ![Create a project](/images/human-review/project.png) 1. 
Navigate to the [projects tab of the **Human Review** page](https://app.coval.dev/appointmentdemo/review/projects)
2. Choose which metrics you would like to label
3. Assign labelers to the project

> **Tip:** Set auto-add rules so that conversations matching a certain condition (or all conversations) are automatically sent for review.

**Step: Add Conversations**

Add conversations to label. You can do this in the runs page, monitoring, or on a single simulation.

![Add runs from the runs page](/images/human-review/add-runs.png)

**Step: Open your assignments**

Navigate to the [Human Review page](https://app.coval.dev/review) in the Coval Dashboard. The **Assignments** tab shows all pending annotations assigned to you. Click on an assignment to open the review interface with the conversation transcript (and audio player for voice simulations) alongside the metrics to evaluate.

**Step: Label conversations**

Read transcripts, listen to the audio, and provide your ground-truth assessment for each metric:

- **Binary metrics**: Select Yes, No, or N/A
- **Numerical metrics**: Enter a value within the configured range
- **Categorical metrics**: Choose from the dropdown
- **Audio region metrics**: Mark or edit regions on the waveform timeline
- **Composite metrics**: Toggle MET / NOT_MET / UNKNOWN for each criterion

Optionally add notes to explain your labeling decision. Notes can be positioned anywhere in the review interface and are visible to project collaborators.

Use keyboard shortcuts `j` and `k` to cycle through assignments, and arrow keys to navigate between metrics.

![Review Interface](/images/human-review/labeling.gif)

**Step: Check progress**

Switch to the [Projects tab](https://app.coval.dev/review/projects) to see overall project completion, per-assignee progress, and annotation statistics.

**Step: Improve your Metric**

![Metric Details](/images/human-review/testing.gif)

1. Navigate to your metric in the Metrics tab.
2. Check the agreement score shown on your metric to decide whether it needs improvement.
3. Draft a new version of your metric in the prompt box.
4. Click "Test Metric" to open the testing panel.
5. View the results of your experiment.

---

## Live Monitoring

Source: https://docs.coval.dev/concepts/monitoring/overview

Live Monitoring lets you run evaluations on your live calls.

## Understanding Coval Monitoring

By pushing your post-call transcripts to Coval (transcript only, or transcript plus audio), you can run the same metrics you use for simulations on your production conversations. The goal is to not only test your agent pre-production, but also to observe and evaluate how your agent behaves in production.

### Audio File Requirements

When uploading audio files for monitoring:

- **Stereo required**: Audio must be stereo (2 channels). Mono audio is not supported.
- **Channel mapping**: Channel 0 (left) = Agent, Channel 1 (right) = User

If your provider records the agent and user as separate mono files, see the sketch at the end of this section for combining them into the required layout.

> **Info:** **Features specific to Monitoring:**
>
> - **Default Metrics**: Define your set of default metrics to run on all incoming transcripts
> - **Metric Rules**: Add metrics conditionally based on results or metadata keys
> - **Add to Test Sets**: Convert production issues into regression tests
> - **[OpenTelemetry Traces](/concepts/simulations/traces/opentelemetry#tracing-for-monitoring-calls)**: Send trace data from your agent alongside conversation submissions for detailed performance analysis — via the API, the OpenTelemetry SDK, or directly from the Upload to Monitoring dialog

> **Note:** You can re-run metrics for up to 100 calls at a time if you change metric formulations.
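If your telephony provider gives you separate mono recordings for the agent and user legs, you need to merge them into the required stereo layout before uploading. A minimal sketch, assuming both recordings share a sample rate and using the numpy and soundfile libraries (neither is part of Coval's tooling; file names are placeholders):

```python
import numpy as np
import soundfile as sf

# Hypothetical per-leg recordings from your telephony provider.
agent, sr = sf.read("agent_leg.wav")
user, sr_user = sf.read("user_leg.wav")
assert sr == sr_user, "resample first if the recordings use different sample rates"

# Downmix to mono if a leg was recorded with more than one channel.
if agent.ndim > 1:
    agent = agent.mean(axis=1)
if user.ndim > 1:
    user = user.mean(axis=1)

# Pad the shorter leg with silence so both channels have the same length.
n = max(len(agent), len(user))
agent = np.pad(agent, (0, n - len(agent)))
user = np.pad(user, (0, n - len(user)))

# Channel 0 (left) = Agent, channel 1 (right) = User, per Coval's requirement.
stereo = np.column_stack([agent, user])
sf.write("call_stereo.wav", stereo, sr)
```

The resulting `call_stereo.wav` can then be uploaded through the Upload to Monitoring dialog or attached to a `conversations:submit` request.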
## Alerts Set up custom alerts to be notified of performance issues, goal discrepancies, or any anomalies in real-time. This allows for proactive issue resolution and ensures smooth operations as your agents scale. Alerts can be set for both simulations and monitoring. ![Docs Alerts Pn](/images/docs-alerts.png) --- ## API Keys Source: https://docs.coval.dev/guides/api-keys Create and manage API keys to authenticate with the Coval API API keys are used to authenticate requests to the Coval REST API and CLI. You can create multiple keys per organization, each with its own permissions and lifecycle. ## Creating an API Key **Step: Open Settings** Navigate to **Settings** in the Coval dashboard sidebar, then select the **API Keys** tab. **Step: Click Create Key** Click the **Create Key** button in the top right corner. **Step: Configure your key** Fill in the following fields: | Field | Required | Description | |-------|----------|-------------| | **Name** | Yes | A descriptive name for the key (e.g., "CI/CD Pipeline", "Local Development") | | **Description** | No | Optional notes about the key's purpose | | **Key Type** | Yes | **Service** for automated systems, **User** for individual access | | **Permissions** | Yes | **Full Access** or scoped to specific resources | **Step: Save your key** Click **Create Key**. Your API key will be displayed once. > **Warning:** Copy the key immediately. You will not be able to view the full key again after closing the dialog. ## Using Your API Key Include the key in the `X-API-Key` header with every request: ```bash curl https://api.coval.dev/v1/agents \ -H "X-API-Key: your_api_key" ``` Or set it as an environment variable for the CLI: ```bash export COVAL_API_KEY=your_api_key coval agents list ``` ## Permissions By default, keys are created with **Full Access** to all resources. You can restrict a key to specific scopes using the permissions picker. Available permission scopes: | Resource | Scopes | |----------|--------| | Runs | `runs:read`, `runs:write` | | Agents | `agents:read`, `agents:write` | | Conversations | `conversations:read`, `conversations:submit`, `conversations:delete` | | Metrics | `metrics:read`, `metrics:write`, `metrics:delete` | | Test Sets | `test-sets:read`, `test-sets:write` | | Test Cases | `test-cases:read`, `test-cases:write` | | Personas | `personas:read`, `personas:write`, `personas:delete` | | Simulations | `simulations:read`, `simulations:write` | | Traces | `traces:read`, `traces:write` | | Dashboards | `dashboards:read`, `dashboards:write`, `dashboards:delete` | | Scheduled Runs | `scheduled-runs:read`, `scheduled-runs:write`, `scheduled-runs:delete` | | Run Templates | `run-templates:read`, `run-templates:write`, `run-templates:delete` | | API Keys | `api-keys:read`, `api-keys:write`, `api-keys:delete` | > **Tip:** Use the **preset buttons** to quickly configure common permission sets like **Read Only**, **Run Evaluations**, or **Upload Conversations**. ## Managing Key Status Each key has a lifecycle status that controls whether it can authenticate requests. | Status | Description | |--------|-------------| | **Active** | The key is working and can authenticate requests | | **Suspended** | Temporarily disabled. Can be reactivated | | **Revoked** | Permanently disabled. Cannot be reactivated | ### Suspending a Key To temporarily disable a key, click the **actions menu** (three dots) on the key row and select **Suspend**. Suspended keys can be reactivated at any time. 
### Revoking a Key To permanently disable a key, select **Revoke** from the actions menu. You must provide a reason. Revoked keys cannot be reactivated. > **Warning:** Revoking a key is permanent. Any systems using the key will immediately lose access. ### Deleting a Key To remove a key entirely, select **Delete** from the actions menu. This permanently removes the key record from your organization. ## Filtering Keys Use the status tabs above the table to filter keys by their current status: - **All** — Show all keys - **Active** — Only active keys - **Suspended** — Only suspended keys - **Revoked** — Only revoked keys ## Best Practices Avoid full access keys in production. Scope each key to only the permissions it needs. Create new keys and revoke old ones periodically, especially for production systems. Name keys after their purpose (e.g., "GitHub Actions CI", "Staging Environment") so you can identify them later. Promptly revoke keys that are no longer in use to minimize your attack surface. ## Next Steps - [API Reference](/api-reference/v1/introduction): Explore the full API documentation - [CLI Installation](/cli/installation): Install the Coval CLI and authenticate with your key - [CLI API Keys Commands](/cli/api-keys): Manage API keys programmatically from the command line - [GitHub Actions](/getting-started/github-actions-tutorial): Use API keys in your CI/CD pipeline --- ## Human Review via API Source: https://docs.coval.dev/guides/human-review-api Step-by-step guide to managing human review projects and annotations programmatically ## Overview The Coval Reviews API lets you programmatically create review projects, assign reviewers, and submit ground-truth annotations. This is useful for integrating human review into CI/CD pipelines, bulk-labeling workflows, or custom review dashboards. > **Info:** All requests require an `X-API-Key` header. See the [API Keys guide](/guides/api-keys) for setup. ## Key Concepts - **Review Projects** group simulations, metrics, and assignees together. Creating a project auto-generates annotations for every (simulation, metric, assignee) combination. - **Review Annotations** are individual review tasks. Each annotation links a simulation output to a metric and an assignee. Providing a ground-truth value auto-completes the annotation. > **Tip:** Using Claude Code? We have [skills to support human review](https://github.com/coval-ai/coval-external-skills/tree/main/skills/human-review) in your workflow. ## Step-by-Step: Create and Complete a Review Project **Step: Create a review project** Link your simulations, metrics, and assignees into a project. This auto-generates one annotation per (simulation, metric, assignee) combination. > **Info:** **Finding your IDs:** Retrieve metric IDs via [`GET /v1/metrics`](/api-reference/v1/metrics/metrics/list-metrics) and simulation IDs via [`GET /v1/simulations`](/api-reference/v1/simulations/simulations/list-simulations). Both endpoints return an `id` field for each resource. 
```bash curl -X POST https://api.coval.dev/v1/review-projects \ -H "X-API-Key: " \ -H "Content-Type: application/json" \ -d '{ "display_name": "Q1 Voice Agent Review", "description": "Review accuracy and latency for Q1 voice simulations", "assignees": ["alice@company.com", "bob@company.com"], "linked_simulation_ids": ["sim-output-001", "sim-output-002"], "linked_metric_ids": ["metric-accuracy", "metric-latency"], "project_type": "PROJECT_COLLABORATIVE", "notifications": true }' ``` | Field | Type | Required | Description | |-------|------|----------|-------------| | `display_name` | string | **Yes** | Human-readable project name | | `assignees` | string[] | **Yes** | Reviewer email addresses (at least one) | | `linked_simulation_ids` | string[] | **Yes** | Simulation output IDs to review | | `linked_metric_ids` | string[] | **Yes** | Metric IDs to evaluate | | `description` | string | No | Optional project description | | `project_type` | string | No | `PROJECT_COLLABORATIVE` (shared queue) or `PROJECT_INDIVIDUAL` (per-reviewer queues). Defaults to `PROJECT_INDIVIDUAL` | | `notifications` | boolean | No | Enable email notifications for assignees. Defaults to `true` | > **Tip:** Use `PROJECT_COLLABORATIVE` when you want one ground-truth label per conversation. Use `PROJECT_INDIVIDUAL` to measure inter-annotator agreement. **Step: List annotations for the project** After creating a project, annotations are auto-generated. List them to see what needs to be reviewed. ```bash curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22" \ -H "X-API-Key: " ``` Filter annotations by status to find pending work: ```bash curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22%20AND%20completion_status%3D%22PENDING%22" \ -H "X-API-Key: " ``` | Parameter | Description | |-----------|-------------| | `filter` | AIP-160 filter expression. Supports `simulation_output_id`, `metric_id`, `assignee`, `status`, `completion_status`, `project_id` | | `page_size` | Results per page (1–100, default 50) | | `page_token` | Pagination token from previous response | | `order_by` | Sort field with optional `-` prefix for descending. Valid: `create_time`, `update_time`, `assignee`, `priority` | **Step: Submit ground-truth values** Update each annotation with the reviewer's ground-truth assessment. Providing a ground-truth value automatically sets `completion_status` to `COMPLETED`. 
**For numeric metrics** (e.g., latency, numerical scores): ```bash curl -X PATCH https://api.coval.dev/v1/review-annotations/ \ -H "X-API-Key: " \ -H "Content-Type: application/json" \ -d '{ "ground_truth_float_value": 0.85, "reviewer_notes": "Agent responded accurately but with slight delay" }' ``` **For string/categorical metrics** (e.g., binary pass/fail, sentiment): ```bash curl -X PATCH https://api.coval.dev/v1/review-annotations/ \ -H "X-API-Key: " \ -H "Content-Type: application/json" \ -d '{ "ground_truth_string_value": "PASS", "reviewer_notes": "Correct greeting and resolution" }' ``` | Field | Type | Description | |-------|------|-------------| | `ground_truth_float_value` | number | Ground-truth numeric value (auto-completes annotation) | | `ground_truth_string_value` | string | Ground-truth string value (auto-completes annotation) | | `ground_truth_subvalues_by_timestamp` | array | Ground-truth subvalues keyed by timestamp (for audio region or per-segment metrics) | | `reviewer_notes` | string | Free-text reviewer notes | | `assignee` | string | Reassign to a different reviewer | | `priority` | string | `PRIORITY_PRIMARY` or `PRIORITY_STANDARD` | **Step: Track project progress** Re-fetch the project and its annotations to check completion status. ```bash # Get project details curl https://api.coval.dev/v1/review-projects/ \ -H "X-API-Key: " # Count completed annotations curl "https://api.coval.dev/v1/review-annotations?filter=project_id%3D%22%22%20AND%20completion_status%3D%22COMPLETED%22&page_size=1" \ -H "X-API-Key: " ``` **Step: Use results to improve metrics** Once annotations are complete, navigate to your metric in the [Coval Dashboard](https://app.coval.dev) to see agreement scores between human labels and AI-generated values. Use the metrics studio to draft and test improved metric prompts. ## Managing Review Projects ### List Projects ```bash curl "https://api.coval.dev/v1/review-projects?order_by=-create_time&page_size=10" \ -H "X-API-Key: " ``` ### Update a Project Add new assignees or link additional simulations: ```bash curl -X PATCH https://api.coval.dev/v1/review-projects/ \ -H "X-API-Key: " \ -H "Content-Type: application/json" \ -d '{ "assignees": ["alice@company.com", "bob@company.com", "charlie@company.com"], "linked_simulation_ids": ["sim-output-001", "sim-output-002", "sim-output-003"] }' ``` ### Delete a Project ```bash curl -X DELETE https://api.coval.dev/v1/review-projects/ \ -H "X-API-Key: " ``` ## Managing Review Annotations ### Create a Standalone Annotation You can create annotations outside of a project for ad-hoc reviews: ```bash curl -X POST https://api.coval.dev/v1/review-annotations \ -H "X-API-Key: " \ -H "Content-Type: application/json" \ -d '{ "simulation_output_id": "sim-output-abc123", "metric_id": "metric-accuracy-001", "assignee": "reviewer@company.com" }' ``` ### Get a Single Annotation ```bash curl https://api.coval.dev/v1/review-annotations/ \ -H "X-API-Key: " ``` ### Delete an Annotation ```bash curl -X DELETE https://api.coval.dev/v1/review-annotations/ \ -H "X-API-Key: " ``` Reviewers can also complete their assignments directly in the [Human Review platform](/guides/improving-metrics-with-human-review). --- ## Live Monitoring Source: https://docs.coval.dev/guides/observability Guide to uploading transcripts and audio to Coval for live monitoring and evaluation. ## Overview This guide provides comprehensive documentation for uploading transcripts to the Coval for live monitoring. Coval supports multiple formats. 
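Transcripts reach Coval through the conversations API; the API reference lists a `POST /v1/conversations:submit` endpoint for submitting a conversation for evaluation. The sketch below is only an orientation: the messages follow the OpenAI transcript format documented in this guide, but the top-level request fields (`transcript` here) are assumptions, so confirm them against the endpoint schema before use.

```bash
# Sketch only: submit a transcript for monitoring/evaluation.
# Endpoint path is from the API reference; body field names are assumptions.
curl -X POST "https://api.coval.dev/v1/conversations:submit" \
  -H "X-API-Key: $COVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcript": [
      {"role": "user", "content": "Hello, I would like assistance.", "start_time": 0.0, "end_time": 3.2},
      {"role": "assistant", "content": "Of course! How can I help you today?", "start_time": 3.2, "end_time": 6.8}
    ]
  }'
```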
### Video Tutorial [Video: Coval Observability Tutorial](https://www.youtube.com/embed/ocjxU6Eevyo) ### Required Fields > **Info:** **Essential transcript fields:** > > - **`role`**: Must be one of `"user"`, `"assistant"`, `"system"`, or `"tool"` > - **`content`**: The actual message content (string) > - **`beginning`**: Index position in the conversation (number) > - **`end`**: End position in the conversation (number) ### Optional Fields - **`start_timestamp`**: Unix timestamp for when the message started (number) - **`end_timestamp`**: Unix timestamp for when the message ended (number) - **`error`**: Error message if transcription failed (string) - **`transcriptionError`**: Boolean flag indicating transcription error - **`name`**: Name identifier for the message (string) ## Supported Formats **OpenAI Format (Recommended):** The system primarily expects transcripts in OpenAI's chat completion format: ```json [ { "role": "user", "content": "Hello, I would like assistance.", "start_time": 0.0, "end_time": 3.2 }, { "role": "assistant", "content": "Of course! How can I help you today?", "start_time": 3.2, "end_time": 6.8 }, { "role": "user", "content": "I'm having an issue with my recent order.", "start_time": 6.8, "end_time": 10.5 }, { "role": "assistant", "content": "I'm sorry to hear that. Could you provide me with your order number?", "start_time": 10.5, "end_time": 14.2 } ] ``` **Extended Studio Format:** For detailed transcripts with timing information: ```json [ { "role": "user", "content": "Hello, I would like assistance.", "start_time": 0.0, "end_time": 3.2, "beginning": 0, "end": 1, "start_timestamp": 1640995200, "end_timestamp": 1640995210 }, { "role": "assistant", "content": "Of course! How can I help you today?", "start_time": 3.2, "end_time": 6.8, "beginning": 1, "end": 2, "start_timestamp": 1640995210, "end_timestamp": 1640995220 } ] ``` **Raw Text Format:** The system can also accept raw text, which will be automatically converted: ``` User: Hello, I would like assistance. Assistant: Of course! How can I help you today? User: I'm having an issue with my recent order. Assistant: I'm sorry to hear that. Could you provide me with your order number? ``` ## Tool Call Messages For tool call messages, the `content` field should contain a JSON string that can be parsed to extract tool information. ### Tool Call Content Examples ```json Simple Tool Call { "role": "tool", "content": "{\"tool\": \"waiting_on_customer\"}", "start_time": 12.0, "end_time": 12.5, "beginning": 3, "end": 4 } ``` ```json Tool Call with Arguments { "role": "tool", "content": "{\"query\": \"search term\", \"tool\": \"query_knowledge\"}", "start_time": 15.2, "end_time": 15.8, "beginning": 4, "end": 5 } ``` ```json Standard Tool Call Format { "role": "tool", "content": "{\"tool_call\": \"function_name\", \"arguments\": {\"param1\": \"value1\"}}", "start_time": 18.5, "end_time": 19.1, "beginning": 5, "end": 6, "name": "function_name" } ``` ```json System Role with Tool Call { "role": "system", "content": "{\"tool_call\": \"database_query\", \"arguments\": {\"table\": \"users\"}}", "start_time": 22.0, "end_time": 22.3, "beginning": 6, "end": 7 } ``` ### Alternative Tool Call Formats The system supports these formats in the `content` field: 1. **Function format**: `{"function": "name", "arguments": {...}}` 2. **Tool format**: `{"tool": "name", ...}` (other fields become arguments) 3. 
**Custom backend format**: `{tool_call: name, arguments: {...}}` ## Validation Rules ### Content Limits > **Warning:** **Important limits to keep in mind:** > > - **Individual message content**: Maximum 1,000 characters > - **Total transcript size**: Maximum 40MB > - **Number of messages**: Maximum 1,000 messages per transcript ### Role Validation - Only `"user"`, `"assistant"`, `"system"`, and `"tool"` roles are accepted - Each message must have `role`, `content`, `start_time`, and `end_time` fields - `start_time` and `end_time` must be float values representing seconds ### Role Normalization For monitoring and evaluation purposes, roles may be normalized: - `"system"` messages with tool call content may be treated as `"tool"` for display purposes - Tool calls in `"system"` role are automatically detected and parsed - The UI will display tool calls with appropriate icons and formatting regardless of the original role ### Timing Validation - `beginning` and `end` values should be sequential integers - `start_timestamp` and `end_timestamp` should be valid Unix timestamps - If timestamps are provided, `end_timestamp` should be greater than `start_timestamp` ### Audio Requirements > **Warning:** **Audio files must be stereo (2 channels).** Mono audio files are not supported and will be rejected. When uploading audio files, the channel mapping determines speaker roles: | Channel | Position | Role | |---------|----------|------| | Channel 0 | Left | Agent | | Channel 1 | Right | User | The system uses channel position to assign roles during transcription—channel 0 (left) is always treated as the agent, and channel 1 (right) is always treated as the user. --- ## Webhooks Source: https://docs.coval.dev/guides/webhooks Set up webhooks to receive notifications when your evaluations complete ## Overview Webhooks allow you to receive notifications when your Coval evaluations complete. When a run finishes, Coval will send a POST request to your configured webhook endpoints. ## Configure Webhook Endpoints Navigate to https://app.coval.dev/settings The webhook configuration interface includes: 1. **Webhook Type Dropdown**: Select "Job Complete" from the available options 2. **Webhook URL Input**: Enter the URL where you want to receive notifications 3. **Add Webhook Button**: Click to save your webhook configuration 4. **Webhook List**: View and manage your existing webhooks 5. **Delete Button**: Remove webhooks using the trash icon ### Supported Webhook Types - **Job Complete**: Triggers when evaluation runs complete ## Webhook Data Format When a run completes, your webhook endpoint will receive a POST request with: ```json { "organization_id": "string", "message": "▲ Run on [Test Set Name] Succeeded: https://app.coval.dev/your-org/runs/[run-id]\n Created By: [user]\n\n[metric results if available]", "run_id": "string", "status": "COMPLETED|FAILED" } ``` The payload includes: - `organization_id`: Your organization's ID - `message`: The formatted alert message with run details and metrics - `run_id`: The ID of the completed run - `status`: The status of the run (COMPLETED or FAILED) --- ## Scheduled Runs Source: https://docs.coval.dev/guides/scheduled-runs Automate recurring evaluations to catch regressions and monitor agent quality over time ## Overview Scheduled Runs let you run evaluations on a recurring cadence—hourly, daily, weekly, or on a custom interval. 
They're built on top of [Templates](/concepts/templates/overview), which capture your full evaluation configuration (agent, test set, personas, metrics, and mutations). Each time the schedule fires, a new run is launched automatically with those exact parameters. Common use cases: - **Regression detection**: Catch when a new deployment breaks expected behaviors - **Continuous quality monitoring**: Track metric trends across agent versions - **Health checks**: Validate your agent is responsive and performing correctly at regular intervals ## Prerequisites Before creating a scheduled run, you need: 1. **An agent configured** in Coval — see [Agents](/concepts/agents/overview) 2. **A test set** with the conversation scenarios to run — see [Test Sets](/concepts/test-sets/overview) 3. **At least one metric** selected for evaluation — see [Metrics](/concepts/metrics/overview) 4. **A saved template** that ties these together — see [Templates](/concepts/templates/overview) ## Full Setup Flow **Step: Create a Template** Navigate to **Templates** in the sidebar and click **New Template**. Configure your evaluation: - **Agent**: The voice or chat agent to test - **Test Set**: The conversation scenarios to run - **Persona(s)**: How the simulated user should behave - **Iterations**: How many times each test case runs - **Concurrency**: How many simulations run in parallel - **Metrics**: Which metrics to evaluate against Click **Create Template** to save. This template will be the source of truth for every scheduled run — all parameters are inherited automatically. **Step: Schedule the Template** From the **Templates** list, click **Schedule** on your template. This opens the Schedule Evaluation dialog. Fill in the schedule configuration: **Name** Give the scheduled run a descriptive name (e.g., "Nightly Regression – Disputes Flow"). This appears in the Scheduled Runs list and in run history. **Schedule Type** Choose between two scheduling modes: **Interval-based:** Runs fire at a fixed cadence from when the schedule is created. Select a quick preset: | Preset | Interval | |--------|----------| | 15 min | Every 15 minutes | | 30 min | Every 30 minutes | | 1 hour | Every hour | | 6 hours | Every 6 hours | | 12 hours | Every 12 hours | | Daily | Once per day | | Weekly | Once per week | | Monthly | Every 30 days | Or set a **Custom Interval** by entering a number and selecting minutes, hours, or days. The minimum is 15 minutes and the maximum is 30 days. **Time of Day:** Runs fire at a specific time, either daily or on selected days of the week. - **Time**: Set the hour, minute, and AM/PM. Times are anchored to your local timezone. - **Recurrence**: Choose **Daily** (every day at that time) or **Weekly** (select specific days). - For weekly schedules, select which days of the week to run (e.g., Mon–Fri for weekdays only). Click **Schedule** to create the scheduled run. It activates immediately. **Step: Monitor Your Scheduled Runs** Navigate to **Scheduled Runs** in the sidebar to see all your configured schedules. The list shows: - **Status**: Active (running on schedule) or Disabled (paused) - **Name**: The label you gave the scheduled run - **Agent**: Which agent is being evaluated - **Schedule**: Human-readable frequency (e.g., "Daily at 9:00 AM", "Every 6 hours") - **Template**: The template powering this schedule (click to view its configuration) - **Created**: When the schedule was set up Use the **search bar** to filter by name or ID, or filter the list by **Active** / **Disabled** status. 
Click any row to open the run history for that schedule — you'll see every evaluation it has launched, with pass/fail results for each metric. ## Managing Scheduled Runs ### Enable and Disable To pause a schedule without deleting it, open the actions menu (`⋮`) on any row and select **Disable**. Re-enable it the same way. You can also bulk-enable or bulk-disable: check multiple rows, then use the **Enable Selected** or **Disable Selected** buttons in the toolbar that appears. ### Edit a Schedule To change the name or timing of an existing schedule, open the actions menu and select **Edit Schedule**. You can update the display name, switch between interval and time-of-day modes, or adjust the frequency. > **Note:** Editing a schedule does not affect the underlying template or evaluation parameters — only the timing changes. To update what gets evaluated (agent, metrics, test cases), edit the template directly. ### Delete a Schedule Scheduled runs must be **disabled before they can be deleted**. Once disabled, open the actions menu and select **Delete**. This action is permanent. To delete multiple schedules at once, disable them first, then select them and use **Delete Selected**. ## Viewing Run History Click any scheduled run to open its detail page. Here you can see: - Every run triggered by this schedule - The pass/fail result for each metric per run - Trend data showing how metric scores change over time This is useful for spotting regressions: if a metric score drops across consecutive runs, something likely changed in your agent or its environment. ## Tips - **Start with daily schedules** during active development. Hourly is better suited for production monitoring where you need fast feedback. - **Name schedules clearly** — include the agent name and what it tests (e.g., "Hourly – Billing Bot – Core Flows"). - **Use the Template link** in the Scheduled Runs list to quickly verify what configuration is being used before debugging a failing run. - **Disable rather than delete** schedules you might want to resume — deleted schedules and their history are gone permanently. --- ## GitHub Actions Source: https://docs.coval.dev/getting-started/github-actions-tutorial Launch Coval evaluation runs from GitHub Actions ## Prerequisites **Step: Get Your API Key** Navigate to your [Coval dashboard](https://app.coval.dev) and generate an API key from Settings. **Step: Add GitHub Secret** 1. Go to your repository **Settings > Secrets and variables > Actions** 2. Click **New repository secret** 3. Name: `COVAL_API_KEY` 4. 
Value: Your Coval API key **Step: Gather Required IDs** You'll need the following identifiers: - **Agent ID** (22 chars): Found in Agents page → Select agent → Copy ID - **Persona ID** (22 chars): Found in Personas page → Select persona → Copy ID - **Test Set ID** (8 chars): Found in Test Sets page → Select test set → Copy ID - **Metric IDs** (22 chars each, optional): Found in Metrics page → Click metric → Copy ID ## Quick Start ### Automatic PR Checks Create `.github/workflows/coval-eval.yml`: ```yaml name: Coval Evaluation on: pull_request: branches: [main] jobs: evaluate-agent: runs-on: ubuntu-latest steps: - name: Run Coval Evaluation uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "aB1cD2eF" ``` ### Manual Workflow Dispatch Create `.github/workflows/manual-eval.yml`: ```yaml name: Manual Evaluation on: workflow_dispatch: inputs: agent_id: description: "Agent ID (22 characters)" required: true type: string persona_id: description: "Persona ID (22 characters)" required: true type: string test_set_id: description: "Test Set ID (8 characters)" required: true type: string jobs: evaluate: runs-on: ubuntu-latest steps: - name: Run Evaluation uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: ${{ inputs.agent_id }} persona_id: ${{ inputs.persona_id }} test_set_id: ${{ inputs.test_set_id }} ``` To trigger: 1. Navigate to **Actions** tab 2. Select **Manual Evaluation** 3. Click **Run workflow** 4. Enter your IDs and click **Run workflow** ## Advanced Configuration ### Custom Metrics and Options ```yaml - name: Advanced Evaluation uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "aB1cD2eF" # Specific metrics to evaluate metric_ids: '["iM5lM1oRs4zTu7wY0aBdEe", "jN6mN2pSt5aUv8xZ1bCeFf"]' # Run each test case 3 times iteration_count: 3 # Run 2 simulations concurrently concurrency: 2 # Custom metadata for tracking metadata: '{"campaign": "q4_2025", "env": "staging"}' ``` ### Using Outputs ```yaml - name: Run Evaluation id: coval uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "aB1cD2eF" - name: Post Results run: | echo "Run ID: ${{ steps.coval.outputs.run_id }}" echo "Status: ${{ steps.coval.outputs.status }}" echo "View: ${{ steps.coval.outputs.run_url }}" - name: Comment on PR if: github.event_name == 'pull_request' uses: actions/github-script@v7 with: script: | github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body: '✓ Evaluation complete: ${{ steps.coval.outputs.run_url }}' }) ``` ## Configuration Reference ### Inputs | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `agent_id` | string | Yes | - | Agent to test (22 chars) | | `persona_id` | string | Yes | - | Simulated persona (22 chars) | | `test_set_id` | string | Yes | - | Test set with test cases (8 chars) | | `metric_ids` | JSON array | No | Agent defaults | Metric IDs to evaluate (22 chars each) | | `iteration_count` | integer | No | `1` | Runs per test case (1-10) | | `concurrency` | integer | No | `1` | Concurrent simulations (1-5) | | `metadata` | JSON object | No | 
`{}` | Custom metadata for tracking | | `max_wait_time` | integer | No | `600` | Max wait time in seconds | | `check_interval` | integer | No | `30` | Status check interval in seconds | ### Outputs | Output | Type | Description | |--------|------|-------------| | `run_id` | string | Unique run identifier | | `status` | string | Final status (COMPLETED, FAILED, etc.) | | `run_url` | string | Dashboard URL to view results | ### Environment Variables | Variable | Required | Description | |----------|----------|-------------| | `COVAL_API_KEY` | Yes | Your Coval API key | ## API Details The action uses the Coval v1 Runs API: ### Launch Run **Endpoint:** `POST https://api.coval.dev/v1/runs` **Request:** ```json { "agent_id": "gk3jK9mPq2xRt5vW8yZaBc", "persona_id": "hL4kL0nQr3ySt6vX9zAcDd", "test_set_id": "aB1cD2eF", "metric_ids": ["iM5lM1oRs4zTu7wY0aBdEe"], "options": { "iteration_count": 3, "concurrency": 2 }, "metadata": { "campaign": "q4_2025" } } ``` **Response:** ```json { "run": { "run_id": "8EktrIgaVxn9LfxkIynagX", "status": "PENDING", "create_time": "2025-10-14T12:00:00Z" } } ``` ### Monitor Run **Endpoint:** `GET https://api.coval.dev/v1/runs/{run_id}` **Response:** ```json { "run": { "run_id": "8EktrIgaVxn9LfxkIynagX", "status": "IN PROGRESS", "progress": { "total_test_cases": 10, "completed_test_cases": 5, "failed_test_cases": 0, "in_progress_test_cases": 1 } } } ``` ### Run Statuses | Status | Description | |--------|-------------| | `PENDING` | Waiting to start | | `IN QUEUE` | Queued for execution | | `IN PROGRESS` | Running test cases | | `COMPLETED` | Successfully completed | | `FAILED` | Run failed | ## Examples ### Environment-Based Testing ```yaml name: Multi-Environment Testing on: push: branches: [main, staging, dev] jobs: evaluate: runs-on: ubuntu-latest steps: - name: Set Environment id: env run: | if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then echo "agent=prodAgentId12345678" >> $GITHUB_OUTPUT echo "env=production" >> $GITHUB_OUTPUT elif [[ "${{ github.ref }}" == "refs/heads/staging" ]]; then echo "agent=stgAgentId123456789" >> $GITHUB_OUTPUT echo "env=staging" >> $GITHUB_OUTPUT else echo "agent=devAgentId123456789" >> $GITHUB_OUTPUT echo "env=development" >> $GITHUB_OUTPUT fi - name: Evaluate uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: ${{ steps.env.outputs.agent }} persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "aB1cD2eF" metadata: '{"env": "${{ steps.env.outputs.env }}", "commit": "${{ github.sha }}"}' ``` ### Parallel Persona Testing ```yaml name: Multi-Persona Testing on: workflow_dispatch: jobs: test: runs-on: ubuntu-latest strategy: matrix: persona: - { id: "persona1234567890abcd", name: "Friendly" } - { id: "persona1234567890efgh", name: "Frustrated" } - { id: "persona1234567890ijkl", name: "Technical" } steps: - name: Test ${{ matrix.persona.name }} uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: ${{ matrix.persona.id }} test_set_id: "aB1cD2eF" metadata: '{"persona": "${{ matrix.persona.name }}"}' ``` ### Scheduled Regression Testing ```yaml name: Nightly Regression on: schedule: - cron: '0 2 * * *' # 2 AM daily jobs: regression: runs-on: ubuntu-latest steps: - name: Run Tests uses: coval-ai/coval-github-action@v1 env: COVAL_API_KEY: ${{ secrets.COVAL_API_KEY }} with: agent_id: "gk3jK9mPq2xRt5vW8yZaBc" persona_id: "hL4kL0nQr3ySt6vX9zAcDd" test_set_id: "regrTest" iteration_count: 5 concurrency: 3 
max_wait_time: 1800 ``` ## Troubleshooting ### Invalid API Key ``` Status Code: 401 Error Code: UNAUTHENTICATED Message: Invalid or missing API key ``` **Solution:** Verify `COVAL_API_KEY` is set correctly in GitHub Secrets. ### Invalid Agent ID ``` Status Code: 400 Error Code: INVALID_ARGUMENT Message: Invalid agent_id: Agent not found ``` **Solution:** Confirm the agent ID is 22 characters and exists in your organization. ### Validation Errors ``` Status Code: 400 Details: - iteration_count: Value must be between 1 and 10 ``` **Solution:** Ensure all parameters meet the constraints listed in the Configuration Reference. ### Timeout **Solution:** Increase `max_wait_time` for larger test sets or check the Coval dashboard for run status. ### Invalid JSON ```yaml # Wrong - will fail metric_ids: ["id1", "id2"] # Correct - use single quotes around JSON metric_ids: '["id1", "id2"]' ``` ## Resources - [API Reference](/api-reference/v1/runs/runs/launch-run) - [Coval Dashboard](https://app.coval.dev) - [GitHub Action Repository](https://github.com/coval-ai/coval-github-action) - [Support](mailto:support@coval.dev) --- ## Overview Source: https://docs.coval.dev/use-cases/overview Explore real-world examples of how teams use Coval to evaluate and improve their AI agents. --- ## Airline Help Desk Source: https://docs.coval.dev/use-cases/leveraging-test-users This example demonstrates how to leverage test users to evaluate an airline help desk voice agent. We'll assume the voice agent has access to an internal system that maintains customer accounts. ## Goal Ensure that the airline voice agent books users on the correct flights. ## Step One: Configure Your Agent Attributes The first step in testing the agent is to configure a list of test users that exist in the agent's internal system. These users will be used throughout your test sets. Navigate to [the Agent Details page](https://app.coval.dev/coval/agents) and add the following attributes: ```json { "qa_accounts": { "user1": { "tier": "platinum", "miles": 100000, "credit_card": "379923037966854", "user_token": "duhfsaihd1234567654323456789" }, "user2": { "tier": "standard", "miles": 8, "credit_card": "4134823389064963", "user_token": "8976890dfaoisfuapsd80873248179" } } } ``` Notice that `user1` is a platinum member while `user2` is a standard member. This allows us to compare the agent's behavior between different user tiers. ## Step Two: Create Booking Test Cases ![Example Booking Test Cases](/images/use-cases/leveraging-users/test-case.png) Now we can use these users in test cases. Let's examine the first test case configuration. ### Setting Up Test Case Metadata The goal of the metadata is to store values we need for deterministic validation. When we create metrics later, we'll need to know the exact flight path (in airport codes) to perform simple comparisons on the ticket. For this test case, we configure the following metadata: - **source**: `SFO` - **destination**: `LAX` - **user**: `user1` The `user` field identifies which user account the flight will be booked on, allowing us to verify the booking was made for the correct account. ### Test Case Prompt ```markdown You are calling an airline help desk. Book a flight from {{test_case.source}} to {{test_case.destination}}. Use your credit card: {{agent.qa_accounts.user1.credit_card}} ``` Using the test case metadata and agent attributes in the prompt allows everything to be fully in sync. That way, you only have to change the value in one place. 
However, this will ultimately be processed as: ```markdown You are calling an airline help desk. Book a flight from SFO to LAX. Use your credit card: 379923037966854 ``` You can create many permutations of this test case, requesting different sources, with different users, etc. ## Step Three: Create an API State Match Metric After a simulation, we want to check if the airline's internal database has a ticket for our user. In Step Two, you created a test set with many users and ticket combinations. To do this, navigate to the [metric creation page](https://app.coval.dev/appointmentdemo/metrics/create) and create an API State Match metric. ![Example Metric](/images/use-cases/leveraging-users/metric.png) In our example, the airline has an API that allows us to see all tickets for a specific user. It takes in a `userId` and a `user_token`, and outputs a list of tickets. After the simulation, we will call the API with `{{test_case.user}}`, which will be transformed to `user1` for our first test case. ```json { "user": "user1" } ``` We will receive the response: ```json { "tickets": [ { "source": "SFO", "destination": "LAX", "date_booked": "12/12/25", "confirmed": true } ] } ``` Use the match path `tickets[source={{test_case.source}},destination={{test_case.destination}}].confirmed`. This will be rendered as, for example, `tickets[source=SFO,destination=LAX].confirmed` for a given test case. It will select the first ticket that matches both source and destination, and verify the `confirmed` field. If the ticket exists, the metric will return `MATCH`. If the ticket exists but is not confirmed, it will return `DIFF`. If the ticket doesn't exist, it will return `NOT_FOUND`. ## Step Four: Run Your Simulations! Now, we have all the building blocks to run our simulations. --- ## Hackathons Source: https://docs.coval.dev/collaborate/hackathons/overview Take part in the hackathons we support and collaborate with us in advancing responsible AI development by testing your agents! ## Upcoming Hackathons *No upcoming hackathons at this time. Check back soon!* ## Past Hackathons ### Gemini x Pipecat Hackathon **Saturday, October 11, 2024 at 9am** **YC SF Office** We were excited to co-host a special voice and realtime AI hackathon with Daily (W16) and Google DeepMind at the YC office. The event brought together builders to create multimodal AI applications, working alongside engineers from Google, Daily, Boundary (W23), Coval (S24), Langfuse (W23), Tavus (S21), and the AI Tinkerers community. The hackathon featured prizes including a guaranteed YC interview, lunch with Google engineers, and special swag for winners. [Image: Hackathon participants collaborating] [Image: Hackathon team presentations] [Image: Hackathon workspace] [Image: Hackathon networking] --- ## API Reference Source: https://docs.coval.dev/api-reference/v1/introduction The Coval REST API enables you to programmatically launch voice and chat evaluations, manage test data, and analyze AI agent performance. 
## Most Used - [Runs](/api-reference/v1/runs/runs/launch-run): Launch evaluation runs and track results across your agents - [Agents](/api-reference/v1/agents/agents/list-agents): Connect and configure your AI agents for testing - [Simulations](/api-reference/v1/simulations/simulations/list-simulations): View simulation results with transcripts and metric scores - [Test Sets](/api-reference/v1/test-sets/test-sets/list-test-sets): Create and manage test cases for your evaluations ## Getting Started **Step: Get your API key** Obtain your API key from the [Coval Dashboard](https://app.coval.dev/settings). See the [API Keys guide](/guides/api-keys) for detailed setup instructions. **Step: Authenticate your requests** Include your API key in the `X-API-Key` header: ```bash curl https://api.coval.dev/v1/agents \ -H "X-API-Key: your_api_key" ``` **Step: Create your first agent** Set up an agent configuration for testing: ```bash curl -X POST https://api.coval.dev/v1/agents \ -H "X-API-Key: your_api_key" \ -H "Content-Type: application/json" \ -d '{ "display_name": "My Test Agent", "model_type": "VOICE", "phone_number": "+15551234567" }' ``` **Step: Launch a simulation** Run evaluations against your agent using the Simulations API. ## Base URL ``` https://api.coval.dev/v1 ``` ## OpenAPI Specification We publish our OpenAPI specifications at public endpoints (no authentication required). ### List available specs ```bash GET https://api.coval.dev/v1/openapi ``` Returns the available `spec_name` values and URLs. ### Fetch a specific spec ```bash GET https://api.coval.dev/v1/openapi/{spec_name} ``` - **Default response**: YAML (`application/yaml`) - **JSON response**: set `Accept: application/json` ### Examples ```bash # List all available specs curl -s https://api.coval.dev/v1/openapi # Fetch a spec as YAML (default) curl -s https://api.coval.dev/v1/openapi/agents # Fetch a spec as JSON curl -s -H "Accept: application/json" https://api.coval.dev/v1/openapi/agents ``` ## Authentication All API requests require an `X-API-Key` header.
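## Pagination and Filtering

Most list endpoints are paginated and filterable. The example below reuses the `filter`, `order_by`, and `page_size` conventions shown for the reviews endpoints earlier in this documentation; whether every list endpoint accepts exactly these parameters is an assumption, so check the individual endpoint schemas.

```bash
# List recent completed runs, newest first
# (filter/order_by/page_size usage assumed to match the reviews endpoints)
curl -s "https://api.coval.dev/v1/runs?filter=status%3D%22COMPLETED%22&order_by=-create_time&page_size=10" \
  -H "X-API-Key: $COVAL_API_KEY"
```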
--- ## list-runs Source: https://docs.coval.dev/api-reference/v1/runs/runs/list-runs --- openapi: get /runs --- --- ## launch-run Source: https://docs.coval.dev/api-reference/v1/runs/runs/launch-run --- openapi: post /runs --- --- ## get-run Source: https://docs.coval.dev/api-reference/v1/runs/runs/get-run --- openapi: get /runs/{run_id} --- --- ## delete-run Source: https://docs.coval.dev/api-reference/v1/runs/runs/delete-run --- openapi: delete /runs/{run_id} --- --- ## list-simulations Source: https://docs.coval.dev/api-reference/v1/simulations/simulations/list-simulations --- openapi: get /v1/simulations --- --- ## get-simulation Source: https://docs.coval.dev/api-reference/v1/simulations/simulations/get-simulation --- openapi: get /v1/simulations/{simulation_id} --- --- ## delete-or-cancel-simulation Source: https://docs.coval.dev/api-reference/v1/simulations/simulations/delete-or-cancel-simulation --- openapi: delete /v1/simulations/{simulation_id} --- --- ## get-audio-file-url Source: https://docs.coval.dev/api-reference/v1/simulations/simulations/get-audio-file-url --- openapi: get /v1/simulations/{simulation_id}/audio --- --- ## list-metrics Source: https://docs.coval.dev/api-reference/v1/simulations/metric-outputs/list-metrics --- openapi: get /v1/simulations/{simulation_id}/metrics --- --- ## get-metric Source: https://docs.coval.dev/api-reference/v1/simulations/metric-outputs/get-metric --- openapi: get /v1/simulations/{simulation_id}/metrics/{metric_output_id} --- --- ## list-conversations Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/list-conversations --- openapi: get /v1/conversations --- --- ## submit-conversation-for-evaluation Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/submit-conversation-for-evaluation --- openapi: post /v1/conversations:submit --- --- ## get-conversation-details Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/get-conversation-details --- openapi: get /v1/conversations/{conversation_id} --- --- ## delete-or-cancel-conversation Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/delete-or-cancel-conversation --- openapi: delete /v1/conversations/{conversation_id} --- --- ## list-conversation-metrics Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/list-conversation-metrics --- openapi: get /v1/conversations/{conversation_id}/metrics --- --- ## get-single-conversation-metric Source: https://docs.coval.dev/api-reference/v1/conversations/conversations/get-single-conversation-metric --- openapi: get /v1/conversations/{conversation_id}/metrics/{metric_output_id} --- --- ## get-conversation-audio Source: https://docs.coval.dev/api-reference/v1/conversations/audio/get-conversation-audio --- openapi: get /v1/conversations/{conversation_id}/audio --- --- ## list-agents Source: https://docs.coval.dev/api-reference/v1/agents/agents/list-agents --- openapi: get /v1/agents --- --- ## connect-an-agent Source: https://docs.coval.dev/api-reference/v1/agents/agents/connect-an-agent --- openapi: post /v1/agents --- --- ## get-agent Source: https://docs.coval.dev/api-reference/v1/agents/agents/get-agent --- openapi: get /v1/agents/{agent_id} --- --- ## update-agent Source: https://docs.coval.dev/api-reference/v1/agents/agents/update-agent --- openapi: patch /v1/agents/{agent_id} --- --- ## delete-agent Source: https://docs.coval.dev/api-reference/v1/agents/agents/delete-agent --- openapi: delete /v1/agents/{agent_id} --- --- ## 
list-mutations Source: https://docs.coval.dev/api-reference/v1/mutations/mutations/list-mutations --- openapi: get /v1/agents/{agent_id}/mutations --- --- ## create-mutation Source: https://docs.coval.dev/api-reference/v1/mutations/mutations/create-mutation --- openapi: post /v1/agents/{agent_id}/mutations --- --- ## get-mutation Source: https://docs.coval.dev/api-reference/v1/mutations/mutations/get-mutation --- openapi: get /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## update-mutation Source: https://docs.coval.dev/api-reference/v1/mutations/mutations/update-mutation --- openapi: patch /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## delete-mutation Source: https://docs.coval.dev/api-reference/v1/mutations/mutations/delete-mutation --- openapi: delete /v1/agents/{agent_id}/mutations/{mutation_id} --- --- ## list-test-sets Source: https://docs.coval.dev/api-reference/v1/test-sets/test-sets/list-test-sets --- openapi: get /test-sets --- --- ## create-test-set Source: https://docs.coval.dev/api-reference/v1/test-sets/test-sets/create-test-set --- openapi: post /test-sets --- --- ## get-test-set Source: https://docs.coval.dev/api-reference/v1/test-sets/test-sets/get-test-set --- openapi: get /test-sets/{test_set_id} --- --- ## update-test-set Source: https://docs.coval.dev/api-reference/v1/test-sets/test-sets/update-test-set --- openapi: patch /test-sets/{test_set_id} --- --- ## delete-test-set Source: https://docs.coval.dev/api-reference/v1/test-sets/test-sets/delete-test-set --- openapi: delete /test-sets/{test_set_id} --- --- ## list-test-cases Source: https://docs.coval.dev/api-reference/v1/test-cases/test-cases/list-test-cases --- openapi: get /test-cases --- --- ## create-test-case Source: https://docs.coval.dev/api-reference/v1/test-cases/test-cases/create-test-case --- openapi: post /test-cases --- --- ## get-test-case Source: https://docs.coval.dev/api-reference/v1/test-cases/test-cases/get-test-case --- openapi: get /test-cases/{test_case_id} --- --- ## delete-test-case Source: https://docs.coval.dev/api-reference/v1/test-cases/test-cases/delete-test-case --- openapi: delete /test-cases/{test_case_id} --- --- ## update-test-case Source: https://docs.coval.dev/api-reference/v1/test-cases/test-cases/update-test-case --- openapi: patch /test-cases/{test_case_id} --- --- ## list-personas Source: https://docs.coval.dev/api-reference/v1/personas/personas/list-personas --- openapi: get /personas --- --- ## create-persona Source: https://docs.coval.dev/api-reference/v1/personas/personas/create-persona --- openapi: post /personas --- --- ## get-persona Source: https://docs.coval.dev/api-reference/v1/personas/personas/get-persona --- openapi: get /personas/{persona_id} --- --- ## update-persona Source: https://docs.coval.dev/api-reference/v1/personas/personas/update-persona --- openapi: patch /personas/{persona_id} --- --- ## delete-persona Source: https://docs.coval.dev/api-reference/v1/personas/personas/delete-persona --- openapi: delete /personas/{persona_id} --- --- ## list-available-voices Source: https://docs.coval.dev/api-reference/v1/personas/personas/list-available-voices --- openapi: get /personas/voices --- --- ## list-phone-number-mappings Source: https://docs.coval.dev/api-reference/v1/personas/personas/list-phone-number-mappings --- openapi: get /personas/phone-numbers --- --- ## list-metrics Source: https://docs.coval.dev/api-reference/v1/metrics/metrics/list-metrics --- openapi: get /v1/metrics --- --- ## create-metric Source: 
https://docs.coval.dev/api-reference/v1/metrics/metrics/create-metric --- openapi: post /v1/metrics --- --- ## get-metric Source: https://docs.coval.dev/api-reference/v1/metrics/metrics/get-metric --- openapi: get /v1/metrics/{metric_id} --- --- ## update-metric Source: https://docs.coval.dev/api-reference/v1/metrics/metrics/update-metric --- openapi: patch /v1/metrics/{metric_id} --- --- ## delete-metric Source: https://docs.coval.dev/api-reference/v1/metrics/metrics/delete-metric --- openapi: delete /v1/metrics/{metric_id} --- --- ## list-run-templates Source: https://docs.coval.dev/api-reference/v1/run-templates/run-templates/list-run-templates --- openapi: get /v1/run-templates --- --- ## create-run-template Source: https://docs.coval.dev/api-reference/v1/run-templates/run-templates/create-run-template --- openapi: post /v1/run-templates --- --- ## get-run-template Source: https://docs.coval.dev/api-reference/v1/run-templates/run-templates/get-run-template --- openapi: get /v1/run-templates/{run_template_id} --- --- ## update-run-template Source: https://docs.coval.dev/api-reference/v1/run-templates/run-templates/update-run-template --- openapi: patch /v1/run-templates/{run_template_id} --- --- ## delete-run-template Source: https://docs.coval.dev/api-reference/v1/run-templates/run-templates/delete-run-template --- openapi: delete /v1/run-templates/{run_template_id} --- --- ## list-scheduled-runs Source: https://docs.coval.dev/api-reference/v1/scheduled-runs/scheduled-runs/list-scheduled-runs --- openapi: get /v1/scheduled-runs --- --- ## create-scheduled-run Source: https://docs.coval.dev/api-reference/v1/scheduled-runs/scheduled-runs/create-scheduled-run --- openapi: post /v1/scheduled-runs --- --- ## get-scheduled-run Source: https://docs.coval.dev/api-reference/v1/scheduled-runs/scheduled-runs/get-scheduled-run --- openapi: get /v1/scheduled-runs/{scheduled_run_id} --- --- ## update-scheduled-run Source: https://docs.coval.dev/api-reference/v1/scheduled-runs/scheduled-runs/update-scheduled-run --- openapi: patch /v1/scheduled-runs/{scheduled_run_id} --- --- ## delete-scheduled-run Source: https://docs.coval.dev/api-reference/v1/scheduled-runs/scheduled-runs/delete-scheduled-run --- openapi: delete /v1/scheduled-runs/{scheduled_run_id} --- --- ## list-review-projects Source: https://docs.coval.dev/api-reference/v1/reviews/review-projects/list-review-projects --- openapi: get /v1/review-projects --- --- ## create-review-project Source: https://docs.coval.dev/api-reference/v1/reviews/review-projects/create-review-project --- openapi: post /v1/review-projects --- --- ## get-review-project Source: https://docs.coval.dev/api-reference/v1/reviews/review-projects/get-review-project --- openapi: get /v1/review-projects/{project_id} --- --- ## update-review-project Source: https://docs.coval.dev/api-reference/v1/reviews/review-projects/update-review-project --- openapi: patch /v1/review-projects/{project_id} --- --- ## delete-review-project Source: https://docs.coval.dev/api-reference/v1/reviews/review-projects/delete-review-project --- openapi: delete /v1/review-projects/{project_id} --- --- ## list-review-annotations Source: https://docs.coval.dev/api-reference/v1/reviews/review-annotations/list-review-annotations --- openapi: get /v1/review-annotations --- --- ## create-review-annotation Source: https://docs.coval.dev/api-reference/v1/reviews/review-annotations/create-review-annotation --- openapi: post /v1/review-annotations --- --- ## get-review-annotation Source: 
https://docs.coval.dev/api-reference/v1/reviews/review-annotations/get-review-annotation --- openapi: get /v1/review-annotations/{annotation_id} --- --- ## update-review-annotation Source: https://docs.coval.dev/api-reference/v1/reviews/review-annotations/update-review-annotation --- openapi: patch /v1/review-annotations/{annotation_id} --- --- ## delete-review-annotation Source: https://docs.coval.dev/api-reference/v1/reviews/review-annotations/delete-review-annotation --- openapi: delete /v1/review-annotations/{annotation_id} --- --- ## list-dashboards Source: https://docs.coval.dev/api-reference/v1/dashboards/dashboards/list-dashboards --- openapi: get /v1/dashboards --- --- ## create-dashboard Source: https://docs.coval.dev/api-reference/v1/dashboards/dashboards/create-dashboard --- openapi: post /v1/dashboards --- --- ## get-dashboard Source: https://docs.coval.dev/api-reference/v1/dashboards/dashboards/get-dashboard --- openapi: get /v1/dashboards/{dashboard_id} --- --- ## update-dashboard Source: https://docs.coval.dev/api-reference/v1/dashboards/dashboards/update-dashboard --- openapi: patch /v1/dashboards/{dashboard_id} --- --- ## delete-dashboard Source: https://docs.coval.dev/api-reference/v1/dashboards/dashboards/delete-dashboard --- openapi: delete /v1/dashboards/{dashboard_id} --- --- ## list-widgets Source: https://docs.coval.dev/api-reference/v1/dashboards/widgets/list-widgets --- openapi: get /v1/dashboards/{dashboard_id}/widgets --- --- ## create-widget Source: https://docs.coval.dev/api-reference/v1/dashboards/widgets/create-widget --- openapi: post /v1/dashboards/{dashboard_id}/widgets --- --- ## get-widget Source: https://docs.coval.dev/api-reference/v1/dashboards/widgets/get-widget --- openapi: get /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## update-widget Source: https://docs.coval.dev/api-reference/v1/dashboards/widgets/update-widget --- openapi: patch /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## delete-widget Source: https://docs.coval.dev/api-reference/v1/dashboards/widgets/delete-widget --- openapi: delete /v1/dashboards/{dashboard_id}/widgets/{widget_id} --- --- ## list-monitors Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/list-monitors --- openapi: get /monitors --- --- ## create-a-monitor Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/create-a-monitor --- openapi: post /monitors --- --- ## get-a-monitor Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/get-a-monitor --- openapi: get /monitors/{monitor_id} --- --- ## update-a-monitor Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/update-a-monitor --- openapi: patch /monitors/{monitor_id} --- --- ## delete-a-monitor Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/delete-a-monitor --- openapi: delete /monitors/{monitor_id} --- --- ## test-evaluate-a-monitor Source: https://docs.coval.dev/api-reference/v1/monitors/monitors/test-evaluate-a-monitor --- openapi: post /monitors/{monitor_id}/test-evaluate --- --- ## list-monitor-events Source: https://docs.coval.dev/api-reference/v1/monitors/monitor-events/list-monitor-events --- openapi: get /monitors/{monitor_id}/events --- --- ## list-api-keys Source: https://docs.coval.dev/api-reference/v1/api-keys/api-keys/list-api-keys --- openapi: get /v1/api-keys --- --- ## create-api-key Source: https://docs.coval.dev/api-reference/v1/api-keys/api-keys/create-api-key --- openapi: post /v1/api-keys --- --- ## update-api-key-status Source: 
https://docs.coval.dev/api-reference/v1/api-keys/api-keys/update-api-key-status --- openapi: patch /v1/api-keys/{api_key_id} --- --- ## delete-api-key Source: https://docs.coval.dev/api-reference/v1/api-keys/api-keys/delete-api-key --- openapi: delete /v1/api-keys/{api_key_id} --- --- ## CLI Source: https://docs.coval.dev/cli/overview Command-line interface for the Coval AI evaluation platform The **Coval CLI** provides terminal access to Coval's evaluation APIs for scripting, automation, and CI/CD integration. - [GitHub Repository](https://github.com/coval-ai/cli): View source, releases, and contribute ## Quick Start **Step: Install the CLI** ```bash brew install coval-ai/tap/coval ``` See [Installation](/cli/installation) for all methods. **Step: Authenticate** ```bash coval login ``` **Step: Launch an evaluation** ```bash coval runs launch \ --agent-id \ --persona-id \ --test-set-id ``` **Step: Watch progress** ```bash coval runs watch ``` ## Command Reference - [Agents](/cli/agents): Create, list, update, and delete agent configurations - [Runs](/cli/runs): Launch evaluations and monitor progress in real time - [Simulations](/cli/simulations): Inspect individual simulation results and download audio - [Test Sets](/cli/test-sets): Organize test cases into collections - [Test Cases](/cli/test-cases): Define inputs and expected outputs for evaluations - [Personas](/cli/personas): Configure simulated callers with voice and language settings - [Metrics](/cli/metrics): Define how simulations are scored and evaluated - [Mutations](/cli/mutations): Test agent variations with config overrides - [API Keys](/cli/api-keys): Manage API keys for programmatic access - [Run Templates](/cli/run-templates): Save reusable evaluation configurations - [Scheduled Runs](/cli/scheduled-runs): Schedule recurring evaluation runs - [Dashboards](/cli/dashboards): Create dashboards and widgets for monitoring ## Global Flags All commands support these flags: | Flag | Description | Default | |------|-------------|---------| | `--format ` | Output format: `table` or `json` | `table` | | `--api-key ` | Override API key for this command | — | | `--api-url ` | Override API base URL | — | | `--help` | Show help for any command | — | ## JSON Output for Scripting Use `--format json` to get machine-readable output: ```bash # Get run status coval runs get abc123 --format json | jq '.status' # List agent IDs coval agents list --format json | jq '.[].id' # Extract simulation transcript coval simulations get sim123 --format json | jq '.transcript' ``` ## Requirements - macOS, Linux, or Windows - Coval API key from [Dashboard Settings](https://app.coval.dev/settings) --- ## Installation & Configuration Source: https://docs.coval.dev/cli/installation Install the Coval CLI and configure authentication ## Installation **Homebrew:** ```bash brew install coval-ai/tap/coval ``` **Cargo:** ```bash cargo install coval ``` Requires [Rust](https://rustup.rs/) to be installed. **Binary:** Download pre-built binaries from [GitHub Releases](https://github.com/coval-ai/cli/releases). Verify your installation: ```bash coval --help ``` ## Authentication ### Interactive Login ```bash coval login ``` You will be prompted to enter your API key. Get one from [Dashboard Settings](https://app.coval.dev/settings). ### API Key Flag Pass your API key directly: ```bash coval login --api-key sk_your_api_key ``` > **Warning:** Passing API keys as command arguments can expose them in shell history and process lists. 
For CI/CD pipelines, prefer using the `COVAL_API_KEY` environment variable or your CI provider's secret management instead. ### Verify Authentication ```bash coval whoami ``` Displays your masked API key (e.g., `sk_...****`) to confirm you are authenticated. ## Configuration The CLI stores configuration in a platform-specific config directory. Run `coval config path` to see the exact location on your system. ### View Config Path ```bash coval config path ``` ### Get a Config Value ```bash coval config get api_key coval config get api_url ``` ### Set a Config Value ```bash coval config set api_key sk_your_api_key coval config set api_url https://api.coval.dev ``` ### Config File Format ```toml api_key = "sk_..." api_url = "https://api.coval.dev" ``` ## Environment Variables Environment variables override config file values: | Variable | Description | |----------|-------------| | `COVAL_API_KEY` | API key (overrides config file) | | `COVAL_API_URL` | API base URL (overrides config file) | ```bash # Use in CI/CD pipelines export COVAL_API_KEY=sk_your_api_key coval runs launch --agent-id abc123 --persona-id xyz789 --test-set-id ts123 ``` ## Supported Platforms - macOS (Intel and Apple Silicon) - Linux (x86_64) - Windows --- ## Agents Source: https://docs.coval.dev/cli/agents Manage AI agent configurations with the Coval CLI ## List Agents ```bash coval agents list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (e.g., `model_type="voice"`) | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Output columns:** ID, NAME, TYPE, CREATED ```bash # List all agents coval agents list # Filter by type coval agents list --filter 'model_type="voice"' # JSON output coval agents list --format json ``` ## Get Agent ```bash coval agents get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID | Returns full agent details as JSON including configuration, metadata, and associated resources. ```bash coval agents get ag_abc123 ``` ## Create Agent ```bash coval agents create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name for the agent | | `--type` | string | **Yes** | Agent type (see below) | | `--phone-number` | string | Conditional | Phone number in E.164 format (required for `voice`, `sms`) | | `--endpoint` | string | Conditional | Webhook URL (required for `outbound-voice`) | | `--prompt` | string | No | System prompt / instructions | | `--metadata` | string | No | JSON string for agent metadata (e.g., `chat_endpoint`, `input_template`) | | `--metric-ids` | string | No | Comma-separated metric IDs to associate | | `--test-set-ids` | string | No | Comma-separated test set IDs to associate | ### Agent Types | Type | Description | Required Fields | |------|-------------|-----------------| | `voice` | Inbound voice agent | `--phone-number` | | `outbound-voice` | Outbound voice agent | `--endpoint` | | `chat` | Chat/text-based agent | `metadata.chat_endpoint` | | `sms` | SMS messaging agent | `--phone-number` | | `websocket` | WebSocket-based agent | `metadata.endpoint`, `metadata.initialization_json` | > **Info:** For `chat` and `websocket` agents, required fields like `chat_endpoint` and `initialization_json` are passed via the `--metadata` flag as a JSON string. 
```bash # Create a voice agent coval agents create \ --name "Support Agent" \ --type voice \ --phone-number "+15551234567" # Create a chat agent with metadata coval agents create \ --name "Chat Bot" \ --type chat \ --metadata '{"chat_endpoint":"https://api.example.com/chat"}' # Create with associated metrics and test sets coval agents create \ --name "Support Agent" \ --type voice \ --phone-number "+15551234567" \ --metric-ids "met_abc,met_def" \ --test-set-ids "ts_123" ``` ## Update Agent ```bash coval agents update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--type` | string | New agent type | | `--phone-number` | string | New phone number | | `--endpoint` | string | New endpoint URL | | `--prompt` | string | New system prompt | | `--metadata` | string | JSON string for agent metadata | | `--metric-ids` | string | Comma-separated metric IDs | | `--test-set-ids` | string | Comma-separated test set IDs | ```bash # Update agent name coval agents update ag_abc123 --name "Updated Agent Name" # Update agent metadata (e.g., chat endpoint and input template) coval agents update ag_abc123 \ --metadata '{"chat_endpoint":"https://proxy.example.com/chat","input_template":"{\"user_id\":\"{{user_id}}\"}"}' ``` ## Delete Agent ```bash coval agents delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `agent_id` | string | **Yes** | The agent ID to delete | ```bash coval agents delete ag_abc123 ``` --- ## Runs Source: https://docs.coval.dev/cli/runs Launch and manage evaluation runs with the Coval CLI ## List Runs ```bash coval runs list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (e.g., `status="COMPLETED"`) | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Output columns:** ID, STATUS, PROGRESS, CREATED ```bash # List all runs coval runs list # Filter completed runs coval runs list --filter 'status="COMPLETED"' ``` ## Get Run ```bash coval runs get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_id` | string | **Yes** | The run ID | Returns full run details as JSON including status, progress, results, and metrics. 
```bash coval runs get run_abc123 ``` ## Launch Run ```bash coval runs launch [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | Agent ID to evaluate | | `--persona-id` | string | **Yes** | Persona ID for simulated caller | | `--test-set-id` | string | **Yes** | Test set ID containing test cases | | `--iterations` | number | No | Iterations per test case (default: 1) | | `--concurrency` | number | No | Parallel simulations | | `--name` | string | No | Display name for the run | | `--mutation-id` | string | No | Single mutation ID to test | | `--mutation-ids` | string | No | Comma-separated mutation IDs | ```bash # Basic run coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 # Run with options coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --iterations 3 \ --concurrency 5 \ --name "Regression Test" # Run with mutations coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --mutation-ids "mut_001,mut_002,mut_003" ``` ## Watch Run Monitor a run's progress in real time with a live progress bar. ```bash coval runs watch [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_id` | string | **Yes** | The run ID to watch | | Option | Type | Default | Description | |--------|------|---------|-------------| | `--interval` | number | 2 | Poll interval in seconds | ```bash # Watch with default interval coval runs watch run_abc123 # Watch with faster polling coval runs watch run_abc123 --interval 1 ``` The watch command displays a progress bar and exits when the run reaches a terminal status. ## Delete Run ```bash coval runs delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_id` | string | **Yes** | The run ID to delete | ## Run Statuses | Status | Description | |--------|-------------| | `PENDING` | Run is created but not yet started | | `IN_QUEUE` | Run is queued for execution | | `IN_PROGRESS` | Simulations are actively running | | `COMPLETED` | All simulations finished successfully | | `FAILED` | Run encountered an error | | `CANCELLED` | Run was cancelled | | `DELETED` | Run was deleted | > **Info:** When using `--filter`, use the underscore-separated enum values (e.g., `status="IN_PROGRESS"`). --- ## Simulations Source: https://docs.coval.dev/cli/simulations View simulation results and download audio with the Coval CLI ## List Simulations ```bash coval simulations list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--run-id` | string | — | Filter by run ID | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output columns:** ID, STATUS, RUN, TEST CASE, AUDIO ```bash # List all simulations coval simulations list # Filter by run coval simulations list --run-id run_abc123 # Combine filters coval simulations list --filter 'status="COMPLETED"' --run-id run_abc123 ``` ## Get Simulation ```bash coval simulations get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `simulation_id` | string | **Yes** | The simulation ID | Returns full simulation details as JSON including transcript, status, and metadata. 
```bash coval simulations get sim_abc123 ``` ## Download Audio Download or get the audio URL for a simulation recording. ```bash coval simulations audio [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `simulation_id` | string | **Yes** | The simulation ID | | Option | Type | Description | |--------|------|-------------| | `-o, --output` | string | File path to save audio | ```bash # Print audio URL coval simulations audio sim_abc123 # Download audio file coval simulations audio sim_abc123 -o recording.wav ``` When using `-o`, a progress bar shows the download status. ## Delete Simulation ```bash coval simulations delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `simulation_id` | string | **Yes** | The simulation ID to delete | ## Simulation Statuses | Status | Description | |--------|-------------| | `PENDING` | Simulation is created but not yet started | | `IN_QUEUE` | Simulation is queued for execution | | `IN_PROGRESS` | Simulation is actively running | | `COMPLETED` | Simulation finished successfully | | `FAILED` | Simulation encountered an error | | `CANCELLED` | Simulation was cancelled | | `DELETED` | Simulation was deleted | > **Info:** When using `--filter`, use the underscore-separated enum values (e.g., `status="IN_PROGRESS"`). --- ## Test Sets Source: https://docs.coval.dev/cli/test-sets Manage test set collections with the Coval CLI ## List Test Sets ```bash coval test-sets list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output columns:** ID, NAME, TYPE, CASES, CREATED ```bash coval test-sets list ``` ## Get Test Set ```bash coval test-sets get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_set_id` | string | **Yes** | The test set ID | ```bash coval test-sets get ts_abc123 ``` ## Create Test Set ```bash coval test-sets create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Test set name | | `--slug` | string | No | URL-friendly identifier (auto-generated if omitted) | | `--description` | string | No | Description of the test set | | `--type` | string | No | Test set type: `DEFAULT`, `SCENARIO`, `TRANSCRIPT`, or `WORKFLOW` | ```bash # Create a basic test set coval test-sets create --name "Customer Support Scenarios" # Create with all options coval test-sets create \ --name "Billing Scenarios" \ --slug "billing-scenarios" \ --description "Test cases for billing-related inquiries" \ --type SCENARIO ``` ## Update Test Set ```bash coval test-sets update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_set_id` | string | **Yes** | The test set ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New name | | `--slug` | string | New slug | | `--description` | string | New description | ```bash coval test-sets update ts_abc123 --name "Updated Name" ``` ## Delete Test Set ```bash coval test-sets delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_set_id` | string | **Yes** | The test set ID to delete | --- ## Test Cases Source: https://docs.coval.dev/cli/test-cases Manage 
individual test cases with the Coval CLI ## List Test Cases ```bash coval test-cases list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--test-set-id` | string | — | Filter by test set ID | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output columns:** ID, INPUT, TYPE, TEST SET, CREATED ```bash # List all test cases coval test-cases list # Filter by test set coval test-cases list --test-set-id ts_abc123 ``` ## Get Test Case ```bash coval test-cases get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_case_id` | string | **Yes** | The test case ID | ```bash coval test-cases get tc_abc123 ``` ## Create Test Case Create a single test case or bulk import from stdin. ```bash coval test-cases create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--test-set-id` | string | **Yes** | Test set to add the case to | | `--input` | string | No | Test case input text | | `--expected` | string | No | Expected output | | `--description` | string | No | Test case description | | `--stdin` | flag | No | Read test cases from stdin (JSON) | > **Info:** You must provide exactly one of `--input` or `--stdin`. They are mutually exclusive — supplying both or neither will result in an error. ### Single Test Case ```bash coval test-cases create \ --test-set-id ts_abc123 \ --input "I need help with my order" \ --expected "Order assistance provided" \ --description "Basic order help request" ``` ### Bulk Import from Stdin Pass `--stdin` to read one JSON object per line: ```bash echo '{"input_str": "I need a refund", "expected_output_str": "Refund processed", "description": "Refund request"} {"input_str": "Where is my order?", "expected_output_str": "Order status provided", "description": "Order tracking"}' \ | coval test-cases create --test-set-id ts_abc123 --stdin ``` Or import from a file: ```bash cat test_cases.jsonl | coval test-cases create --test-set-id ts_abc123 --stdin ``` Each line must be valid JSON with the following fields: | Field | Type | Required | Description | |-------|------|----------|-------------| | `input_str` | string | **Yes** | Input text | | `expected_output_str` | string | No | Expected output | | `description` | string | No | Description | ## Update Test Case ```bash coval test-cases update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_case_id` | string | **Yes** | The test case ID to update | | Option | Type | Description | |--------|------|-------------| | `--input` | string | New input text | | `--expected` | string | New expected output | | `--description` | string | New description | ```bash coval test-cases update tc_abc123 --input "Updated input text" ``` ## Delete Test Case ```bash coval test-cases delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `test_case_id` | string | **Yes** | The test case ID to delete | --- ## Personas Source: https://docs.coval.dev/cli/personas Manage simulated personas with the Coval CLI ## List Personas ```bash coval personas list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output 
columns:** ID, NAME, VOICE, LANGUAGE, CREATED ```bash coval personas list ``` ## Get Persona ```bash coval personas get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `persona_id` | string | **Yes** | The persona ID | ```bash coval personas get per_abc123 ``` ## Create Persona ```bash coval personas create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Persona display name | | `--voice` | string | **Yes** | Voice name (see available voices below) | | `--language` | string | **Yes** | Language code (e.g., `en-US`) | | `--prompt` | string | No | Persona system prompt / behavior instructions | | `--background` | string | No | Background sound during simulation | | `--wait-seconds` | number | No | Wait time between responses | ```bash # Create a basic persona coval personas create \ --name "Frustrated Customer" \ --voice "Aria" \ --language "en-US" # Create with full configuration coval personas create \ --name "Impatient Caller" \ --voice "Callum" \ --language "en-US" \ --prompt "You are an impatient customer who wants quick answers" \ --background "office" \ --wait-seconds 1.5 ``` ### Available Voices Alejandro, Angela, Aria, Ashwin, Autumn, Brynn, Callum, Caspian, Corwin, Darrow, Delphine, Dorian, Elara, Erika, Harry, Kieran, Lysander, Marina, Mark, Monika, Naveen, Orion, Raju, Rowan, Skye, Soren, Vera ## Update Persona ```bash coval personas update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `persona_id` | string | **Yes** | The persona ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--voice` | string | New voice name | | `--language` | string | New language code | | `--prompt` | string | New system prompt | | `--background` | string | New background sound | | `--wait-seconds` | number | New wait time | ```bash coval personas update per_abc123 --voice "Brynn" --wait-seconds 2.0 ``` ## Delete Persona ```bash coval personas delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `persona_id` | string | **Yes** | The persona ID to delete | --- ## Metrics Source: https://docs.coval.dev/cli/metrics Manage evaluation metrics with the Coval CLI ## List Metrics ```bash coval metrics list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (supports metric_type, metric_name, create_time) | | `--page-size` | number | 50 | Results per page (1-100) | | `--order-by` | string | — | Sort field, prefix with `-` for descending | | `--include-builtin` | flag | — | Include built-in metrics (e.g. 
Turn Count, Audio Duration) | **Output columns:** ID, NAME, TYPE, CREATED ```bash coval metrics list ``` ## Get Metric ```bash coval metrics get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `metric_id` | string | **Yes** | The metric ID | ```bash coval metrics get met_abc123 ``` ## Create Metric ```bash coval metrics create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Metric display name | | `--description` | string | **Yes** | What this metric evaluates | | `--type` | string | **Yes** | Metric type (see below) | | `--prompt` | string | No | LLM evaluation prompt (required for `llm-binary`, `categorical`, `numerical` and their audio variants) | | `--categories` | string | No | Comma-separated categories (required for `categorical`, `audio-categorical`) | | `--min-value` | number | No | Minimum value (required for `numerical`, `audio-numerical`) | | `--max-value` | number | No | Maximum value (required for `numerical`, `audio-numerical`) | | `--regex-pattern` | string | No | Regex pattern to match (required for `regex`) | | `--role` | string | No | Transcript role to match against (optional for `regex`) | | `--match-mode` | string | No | `presence` (default) or `absence` — absence returns 1 if pattern NOT found | | `--position` | string | No | `any` (default), `first`, or `last` message of the role | | `--case-insensitive` | boolean | No | Enable case-insensitive matching | | `--metadata-field-type` | string | No | Metadata field type (required for `metadata`) | | `--metadata-field-key` | string | No | Metadata field key to extract (required for `metadata`) | | `--min-pause-duration-seconds` | number | No | Minimum pause duration threshold (required for `pause`) | ### Metric Types | Type | Description | Type-Specific Options | |------|-------------|----------------------| | `llm-binary` | Binary (yes/no) LLM judgment | `--prompt` | | `categorical` | Categorical LLM judgment with defined options | `--prompt`, `--categories` | | `numerical` | Numerical score from LLM judgment | `--prompt`, `--min-value`, `--max-value` | | `audio-binary` | Binary audio analysis | `--prompt` | | `audio-categorical` | Categorical audio analysis | `--prompt`, `--categories` | | `audio-numerical` | Numerical audio analysis | `--prompt`, `--min-value`, `--max-value` | | `toolcall` | Tool call success verification | — | | `metadata` | Extract metadata field value | `--metadata-field-type`, `--metadata-field-key` | | `regex` | Match transcript against a regex pattern | `--regex-pattern`, `--role`, `--match-mode`, `--position`, `--case-insensitive` | | `pause` | Analyze pause durations in audio | `--min-pause-duration-seconds` | ### Examples ```bash # LLM Binary coval metrics create \ --name "Issue Resolved" \ --description "Did the agent resolve the customer issue?" \ --type llm-binary \ --prompt "Was the customer's issue fully resolved?" # Categorical coval metrics create \ --name "Sentiment" \ --description "Customer sentiment during the call" \ --type categorical \ --categories "positive,neutral,negative" \ --prompt "What was the customer's overall sentiment?" 
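# Audio Categorical (illustrative sketch, not from the original examples; the name, categories, and prompt are placeholders)
coval metrics create \
  --name "Caller Tone" \
  --description "Classify the caller's tone from the audio" \
  --type audio-categorical \
  --categories "calm,neutral,frustrated" \
  --prompt "Based on the audio, how would you classify the caller's tone?"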
# Numerical coval metrics create \ --name "Professionalism Score" \ --description "Rate the agent's professionalism" \ --type numerical \ --min-value 1 \ --max-value 10 \ --prompt "Rate the agent's professionalism on a scale of 1-10" # Audio Binary coval metrics create \ --name "Background Noise" \ --description "Is there excessive background noise?" \ --type audio-binary \ --prompt "Is there excessive background noise in the audio?" # Audio Numerical coval metrics create \ --name "Audio Clarity" \ --description "Rate the audio clarity" \ --type audio-numerical \ --min-value 1 \ --max-value 5 \ --prompt "Rate the audio clarity on a scale of 1-5" # Tool Call coval metrics create \ --name "Tool Usage" \ --description "Did the agent use the correct tool?" \ --type toolcall # Metadata coval metrics create \ --name "Response Time" \ --description "Extract the response time from metadata" \ --type metadata \ --metadata-field-type "number" \ --metadata-field-key "response_time_ms" # Regex — basic pattern match coval metrics create \ --name "Greeting Check" \ --description "Did the agent greet the customer?" \ --type regex \ --regex-pattern "(hello|hi|welcome|good morning)" \ --role "agent" \ --case-insensitive # Regex — compliance (absence mode) coval metrics create \ --name "No Unauthorized Promises" \ --description "Agent must not make unauthorized promises" \ --type regex \ --regex-pattern "(guarantee|promise|definitely)" \ --role "agent" \ --match-mode "absence" \ --case-insensitive # Regex — first message disclosure coval metrics create \ --name "Recording Disclosure" \ --description "Agent must state recording disclosure in first message" \ --type regex \ --regex-pattern "this call may be recorded" \ --role "agent" \ --position "first" \ --case-insensitive # Pause coval metrics create \ --name "Long Pauses" \ --description "Detect pauses longer than 3 seconds" \ --type pause \ --min-pause-duration-seconds 3.0 ``` ## Update Metric ```bash coval metrics update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `metric_id` | string | **Yes** | The metric ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--description` | string | New description | | `--prompt` | string | New evaluation prompt | ```bash coval metrics update met_abc123 --prompt "Updated evaluation prompt" ``` ## Delete Metric ```bash coval metrics delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `metric_id` | string | **Yes** | The metric ID to delete | --- ## Mutations Source: https://docs.coval.dev/cli/mutations Test agent variations with config overrides using the Coval CLI Mutations let you test variations of an agent by overriding configuration values without modifying the original agent. This is useful for A/B testing prompts, parameters, or model settings. 
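As a concrete sketch of an A/B test, you can create two prompt variants with the `mutations create` command documented below and compare them in a single run via the `--mutation-ids` flag on `runs launch`. All IDs and config values here are illustrative:

```bash
# Sketch: compare two prompt variants against the same test set (IDs are illustrative)
coval mutations create \
  --agent-id ag_abc123 \
  --name "Concise Prompt" \
  --config '{"prompt": "Keep answers under two sentences."}'

coval mutations create \
  --agent-id ag_abc123 \
  --name "Detailed Prompt" \
  --config '{"prompt": "Explain each answer step by step."}'

# Launch one run covering both variants (mutation IDs come from the create output)
coval runs launch \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456 \
  --mutation-ids "mut_001,mut_002"
```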
## List Mutations ```bash coval mutations list --agent-id [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | The parent agent ID | | `--page-size` | number | No | Results per page (default: 50) | **Output columns:** ID, NAME, PARAMETERS, CREATED ```bash coval mutations list --agent-id ag_abc123 ``` ## Get Mutation ```bash coval mutations get --agent-id ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `mutation_id` | string | **Yes** | The mutation ID | | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | The parent agent ID | ```bash coval mutations get --agent-id ag_abc123 mut_xyz789 ``` ## Create Mutation ```bash coval mutations create --agent-id [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | The parent agent ID | | `--name` | string | **Yes** | Mutation display name | | `--description` | string | No | Description of what this mutation changes | | `--config` | string | No | JSON config overrides | ```bash # Create a mutation with config overrides coval mutations create \ --agent-id ag_abc123 \ --name "Higher Temperature" \ --description "Test with increased temperature" \ --config '{"temperature": 0.9}' # Create a prompt variation coval mutations create \ --agent-id ag_abc123 \ --name "Formal Tone" \ --description "Agent uses formal language" \ --config '{"prompt": "You are a formal customer service agent. Always use professional language."}' ``` ## Update Mutation ```bash coval mutations update --agent-id [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `mutation_id` | string | **Yes** | The mutation ID to update | | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | The parent agent ID | | `--name` | string | No | New display name | | `--description` | string | No | New description | | `--config` | string | No | New JSON config overrides | ```bash coval mutations update --agent-id ag_abc123 mut_xyz789 \ --config '{"temperature": 0.7}' ``` ## Delete Mutation ```bash coval mutations delete --agent-id ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `mutation_id` | string | **Yes** | The mutation ID to delete | | Option | Type | Required | Description | |--------|------|----------|-------------| | `--agent-id` | string | **Yes** | The parent agent ID | ```bash coval mutations delete --agent-id ag_abc123 mut_xyz789 ``` ## Using Mutations in Runs Pass mutation IDs when launching a run to test agent variations: ```bash # Test a single mutation coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --mutation-id mut_001 # Test multiple mutations coval runs launch \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --mutation-ids "mut_001,mut_002,mut_003" ``` --- ## API Keys Source: https://docs.coval.dev/cli/api-keys Manage API keys for programmatic access with the Coval CLI API keys provide programmatic access to the Coval API. You can create keys scoped to specific environments and permissions. > **Tip:** You can also create and manage API keys from the dashboard. See the [API Keys guide](/guides/api-keys) for instructions. 
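A typical CI flow, sketched with illustrative values: create a service key once, then feed it to the CLI through the `COVAL_API_KEY` environment variable described in the configuration section. The key value below is a placeholder; the full key is printed only once at creation time.

```bash
# Sketch: create a service key for CI, then use it via the environment variable
coval api-keys create \
  --name "GitHub Actions" \
  --type service \
  --environment staging

# Copy the key from the create output (shown only once), then in your CI config:
export COVAL_API_KEY=sk_your_new_key
coval runs list
```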
## List API Keys ```bash coval api-keys list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | | `--status` | string | — | Filter by status (`active`, `revoked`, `suspended`, `expired`) | | `--environment` | string | — | Filter by environment (`production`, `staging`, `development`) | **Output columns:** ID, NAME, TYPE, ENV, STATUS, PERMISSIONS, LAST USED ```bash # List all API keys coval api-keys list # Filter by environment coval api-keys list --environment production # List only active keys coval api-keys list --status active ``` ## Create API Key ```bash coval api-keys create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name for the key | | `--description` | string | No | Optional description | | `--type` | string | **Yes** | Key type (`service` or `user`) | | `--environment` | string | **Yes** | Target environment (`production`, `staging`, `development`) | | `--permissions` | string | No | Comma-separated permission scopes | > **Warning:** The full API key is only shown once at creation time. Store it securely — it cannot be retrieved later. ### Key Types | Type | Description | |------|-------------| | `service` | For server-to-server integrations and CI/CD pipelines | | `user` | For individual user access | ```bash # Create a production service key coval api-keys create \ --name "CI Pipeline" \ --type service \ --environment production # Create a development key with description coval api-keys create \ --name "Dev Testing" \ --type user \ --environment development \ --description "Key for local development" ``` ## Update API Key ```bash coval api-keys update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `api_key_id` | string | **Yes** | The API key ID to update | | Option | Type | Required | Description | |--------|------|----------|-------------| | `--status` | string | **Yes** | New status (`active`, `revoked`, `suspended`, `expired`) | | `--reason` | string | No | Reason for the status change | ```bash # Revoke a key coval api-keys update ak_abc123 --status revoked # Revoke with a reason coval api-keys update ak_abc123 --status revoked --reason "Key compromised" ``` ## Delete API Key ```bash coval api-keys delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `api_key_id` | string | **Yes** | The API key ID to delete | ```bash coval api-keys delete ak_abc123 ``` --- ## Human Review Source: https://docs.coval.dev/cli/human-review Manage human review projects and annotations with the Coval CLI > **Tip:** Using Claude Code? We have [skills to support human review](https://github.com/coval-ai/coval-external-skills/tree/main/skills/human-review) in your workflow. 
## Review Projects ### List Review Projects ```bash coval review-projects list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Output columns:** ID, NAME, TYPE, ASSIGNEES, SIMULATIONS, METRICS, CREATED ```bash # List all review projects coval review-projects list # Sort by most recent coval review-projects list --order-by "-create_time" # JSON output coval review-projects list --format json ``` ### Get Review Project ```bash coval review-projects get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `project_id` | string | **Yes** | The review project ID | Returns full project details as JSON including assignees, linked simulations, and linked metrics. ```bash coval review-projects get 01HXYZ1234567890ABCDEF ``` ### Create Review Project ```bash coval review-projects create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name for the project | | `--assignees` | string | **Yes** | Comma-separated reviewer email addresses | | `--simulation-ids` | string | **Yes** | Comma-separated simulation output IDs | | `--metric-ids` | string | **Yes** | Comma-separated metric IDs | | `--description` | string | No | Project description | | `--type` | string | No | `collaborative` or `individual` (default: `individual`) | | `--notifications` | boolean | No | Enable email notifications (default: `true`) | > **Info:** Creating a project auto-generates review annotations for every (simulation, metric, assignee) combination. > **Info:** **Finding your IDs:** Run `coval metrics list` to get metric IDs and `coval simulations list` to get simulation IDs. 
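For scripting, the same lookups can be chained. A sketch assuming `--format json` output (shown for other list commands on this page) and jq-style field names that may differ from the actual response shape:

```bash
# Sketch: gather metric and simulation IDs for a review project (JSON field names are assumptions)
coval metrics list --format json | jq -r '.[].id'
coval simulations list --run-id run_abc123 --format json | jq -r '.[].id'
```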
```bash # Create a collaborative review project coval review-projects create \ --name "Q1 Voice Agent Review" \ --assignees "alice@company.com,bob@company.com" \ --simulation-ids "sim-output-001,sim-output-002" \ --metric-ids "metric-accuracy,metric-latency" \ --type collaborative # Create with description and notifications disabled coval review-projects create \ --name "Internal Audit" \ --assignees "reviewer@company.com" \ --simulation-ids "sim-output-003" \ --metric-ids "metric-accuracy" \ --description "Spot-check accuracy labels" \ --notifications false ``` ### Update Review Project ```bash coval review-projects update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `project_id` | string | **Yes** | The project ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | Updated display name | | `--assignees` | string | Updated comma-separated reviewer emails | | `--simulation-ids` | string | Updated comma-separated simulation IDs | | `--metric-ids` | string | Updated comma-separated metric IDs | | `--description` | string | Updated description | | `--notifications` | boolean | Updated notification setting | ```bash # Add a new assignee coval review-projects update 01HXYZ1234567890ABCDEF \ --assignees "alice@company.com,bob@company.com,charlie@company.com" # Update project name coval review-projects update 01HXYZ1234567890ABCDEF \ --name "Q1 Voice Agent Review - Updated" ``` ### Delete Review Project ```bash coval review-projects delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `project_id` | string | **Yes** | The project ID to delete | ```bash coval review-projects delete 01HXYZ1234567890ABCDEF ``` --- ## Review Annotations ### List Review Annotations ```bash coval review-annotations list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression (e.g., `project_id="abc"`) | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order (e.g., `-create_time`) | **Supported filter fields:** `simulation_output_id`, `metric_id`, `assignee`, `status` (`ACTIVE`/`ARCHIVED`), `completion_status` (`PENDING`/`COMPLETED`), `project_id` **Output columns:** ID, SIMULATION, METRIC, ASSIGNEE, STATUS, PRIORITY ```bash # List all annotations coval review-annotations list # Filter by project coval review-annotations list --filter 'project_id="01HXYZ1234567890ABCDEF"' # Filter pending annotations for a specific assignee coval review-annotations list \ --filter 'completion_status="PENDING" AND assignee="alice@company.com"' # JSON output coval review-annotations list --format json ``` ### Get Review Annotation ```bash coval review-annotations get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `annotation_id` | string | **Yes** | The annotation ID | Returns full annotation details as JSON including ground-truth values, reviewer notes, and completion status. 
```bash coval review-annotations get abc123def456ghi789jklm ``` ### Create Review Annotation ```bash coval review-annotations create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--simulation-id` | string | **Yes** | Simulation output ID to link | | `--metric-id` | string | **Yes** | Metric ID to link | | `--assignee` | string | **Yes** | Reviewer email address | | `--ground-truth-float` | number | No | Ground-truth numeric value (auto-completes) | | `--ground-truth-string` | string | No | Ground-truth string value (auto-completes) | | `--notes` | string | No | Reviewer notes | | `--priority` | string | No | `primary` or `standard` (default: `standard`) | ```bash # Create a basic annotation coval review-annotations create \ --simulation-id sim-output-abc123 \ --metric-id metric-accuracy-001 \ --assignee reviewer@company.com # Create with ground truth (auto-completes) coval review-annotations create \ --simulation-id sim-output-abc123 \ --metric-id metric-accuracy-001 \ --assignee reviewer@company.com \ --ground-truth-float 0.95 \ --notes "Verified correct response" ``` ### Update Review Annotation ```bash coval review-annotations update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `annotation_id` | string | **Yes** | The annotation ID to update | | Option | Type | Description | |--------|------|-------------| | `--ground-truth-float` | number | Ground-truth numeric value (auto-completes) | | `--ground-truth-string` | string | Ground-truth string value (auto-completes) | | `--notes` | string | Reviewer notes | | `--assignee` | string | Reassign to a different reviewer | | `--priority` | string | `primary` or `standard` | ```bash # Submit a ground-truth value coval review-annotations update abc123def456ghi789jklm \ --ground-truth-float 0.85 \ --notes "Agent responded accurately but with slight delay" # Reassign an annotation coval review-annotations update abc123def456ghi789jklm \ --assignee new-reviewer@company.com ``` ### Delete Review Annotation ```bash coval review-annotations delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `annotation_id` | string | **Yes** | The annotation ID to delete | ```bash coval review-annotations delete abc123def456ghi789jklm ``` --- ## Completion Statuses | Status | Description | |--------|-------------| | `PENDING` | Annotation has not been reviewed yet | | `COMPLETED` | Ground-truth value has been submitted | ## Annotation Priorities | Priority | Description | |----------|-------------| | `PRIORITY_PRIMARY` | High-priority annotation — surfaces first in reviewer queues | | `PRIORITY_STANDARD` | Default priority | ## Project Types | Type | Description | |------|-------------| | `collaborative` | All reviewers share a single queue with one annotation per simulation-metric pair | | `individual` | Each reviewer gets their own private queue and annotations | > **Tip:** Use `collaborative` projects when building ground-truth datasets. Use `individual` projects when measuring inter-annotator agreement. --- ## Run Templates Source: https://docs.coval.dev/cli/run-templates Create reusable evaluation configurations with the Coval CLI Run templates save a full evaluation configuration — agent, persona, test set, metrics, and parameters — so you can re-launch identical runs without specifying every option each time. 
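For example, a smoke-test template that runs a reproducible sample of a large test set can be defined with the create options documented below; the IDs and values are illustrative:

```bash
# Sketch: template that runs a reproducible 10-case sample for quick checks (IDs are illustrative)
coval run-templates create \
  --name "Quick Smoke Test" \
  --agent-id ag_abc123 \
  --persona-id per_xyz789 \
  --test-set-id ts_123456 \
  --sub-sample-size 10 \
  --sub-sample-seed 42
```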
## List Run Templates ```bash coval run-templates list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output columns:** ID, NAME, AGENT, PERSONA, TEST SET, ITERATIONS, CONCURRENCY ```bash # List all templates coval run-templates list # JSON output coval run-templates list --format json ``` ## Get Run Template ```bash coval run-templates get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_template_id` | string | **Yes** | The run template ID | ```bash coval run-templates get rt_abc123 ``` ## Create Run Template ```bash coval run-templates create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name for the template | | `--agent-id` | string | No | Agent to evaluate | | `--persona-id` | string | No | Persona for simulations | | `--test-set-id` | string | No | Test set to use | | `--metric-ids` | string | No | Comma-separated metric IDs | | `--mutation-ids` | string | No | Comma-separated mutation IDs | | `--iteration-count` | number | No | Number of iterations per test case | | `--concurrency` | number | No | Max concurrent simulations | | `--sub-sample-size` | number | No | Number of test cases to sample | | `--sub-sample-seed` | number | No | Random seed for sampling | ```bash # Create a basic template coval run-templates create \ --name "Nightly Regression" \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 # Create a template with metrics and concurrency coval run-templates create \ --name "Full Evaluation" \ --agent-id ag_abc123 \ --persona-id per_xyz789 \ --test-set-id ts_123456 \ --metric-ids "met_001,met_002,met_003" \ --iteration-count 3 \ --concurrency 5 ``` ## Update Run Template ```bash coval run-templates update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_template_id` | string | **Yes** | The run template ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--agent-id` | string | New agent ID | | `--persona-id` | string | New persona ID | | `--test-set-id` | string | New test set ID | | `--metric-ids` | string | New comma-separated metric IDs | | `--mutation-ids` | string | New comma-separated mutation IDs | | `--iteration-count` | number | New iteration count | | `--concurrency` | number | New concurrency limit | | `--sub-sample-size` | number | New sample size | | `--sub-sample-seed` | number | New sample seed | ```bash coval run-templates update rt_abc123 \ --concurrency 10 \ --iteration-count 5 ``` ## Delete Run Template ```bash coval run-templates delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `run_template_id` | string | **Yes** | The run template ID to delete | > **Info:** Deleting a run template will fail with a 409 error if it has active scheduled runs. Remove or disable associated scheduled runs first. ```bash coval run-templates delete rt_abc123 ``` --- ## Scheduled Runs Source: https://docs.coval.dev/cli/scheduled-runs Schedule recurring evaluation runs with the Coval CLI Scheduled runs automatically launch evaluations on a recurring basis using a run template and a cron-style schedule expression. 
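Putting the two together, a typical setup creates a run template once and then schedules it. A sketch with illustrative IDs; the template ID comes from the `run-templates create` output:

```bash
# Sketch: schedule an existing run template to execute nightly at 02:00 UTC
coval scheduled-runs create \
  --name "Nightly Regression" \
  --template-id rt_abc123 \
  --schedule "0 2 * * *"
```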
## List Scheduled Runs ```bash coval scheduled-runs list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | | `--enabled` | boolean | — | Filter by enabled status (`true` or `false`) | | `--template-id` | string | — | Filter by run template ID | **Output columns:** ID, NAME, TEMPLATE, SCHEDULE, TIMEZONE, ENABLED, LAST RUN ```bash # List all scheduled runs coval scheduled-runs list # List only enabled schedules coval scheduled-runs list --enabled true # Filter by template coval scheduled-runs list --template-id rt_abc123 ``` ## Get Scheduled Run ```bash coval scheduled-runs get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `scheduled_run_id` | string | **Yes** | The scheduled run ID | ```bash coval scheduled-runs get sr_abc123 ``` ## Create Scheduled Run ```bash coval scheduled-runs create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name | | `--template-id` | string | **Yes** | Run template to execute | | `--schedule` | string | **Yes** | Cron expression (e.g., `0 9 * * *`) | | `--timezone` | string | No | IANA timezone (default: UTC) | | `--enabled` | boolean | No | Whether the schedule is active | ```bash # Run every day at 9am UTC coval scheduled-runs create \ --name "Daily Regression" \ --template-id rt_abc123 \ --schedule "0 9 * * *" # Run weekdays at 6am Pacific, starting disabled coval scheduled-runs create \ --name "Weekday Check" \ --template-id rt_abc123 \ --schedule "0 6 * * 1-5" \ --timezone "America/Los_Angeles" \ --enabled false ``` ## Update Scheduled Run ```bash coval scheduled-runs update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `scheduled_run_id` | string | **Yes** | The scheduled run ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--schedule` | string | New cron expression | | `--timezone` | string | New IANA timezone | | `--enabled` | boolean | Enable or disable the schedule | ```bash # Disable a schedule coval scheduled-runs update sr_abc123 --enabled false # Change schedule to hourly coval scheduled-runs update sr_abc123 --schedule "0 * * * *" ``` ## Delete Scheduled Run ```bash coval scheduled-runs delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `scheduled_run_id` | string | **Yes** | The scheduled run ID to delete | ```bash coval scheduled-runs delete sr_abc123 ``` --- ## Dashboards Source: https://docs.coval.dev/cli/dashboards Create and manage dashboards and widgets with the Coval CLI Dashboards provide customizable views for monitoring evaluation results. Each dashboard contains widgets that display charts, tables, or text. 
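End to end, a dashboard is created first and widgets are then attached to it with the commands documented below. The IDs and the widget config are illustrative, mirroring the inline config shape shown in the Create Widget examples:

```bash
# Sketch: create a dashboard, then add a chart widget to it (dashboard ID comes from the create output)
coval dashboards create --name "Voice Agent Quality"
coval dashboards widgets create db_abc123 \
  --name "Issue Resolution Trend" \
  --type chart \
  --config '{"metric_id": "met_001"}'
```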
## List Dashboards ```bash coval dashboards list [OPTIONS] ``` | Option | Type | Default | Description | |--------|------|---------|-------------| | `--filter` | string | — | Filter expression | | `--page-size` | number | 50 | Results per page | | `--order-by` | string | — | Sort order | **Output columns:** ID, NAME, CREATED, UPDATED ```bash coval dashboards list ``` ## Get Dashboard ```bash coval dashboards get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The dashboard ID | ```bash coval dashboards get db_abc123 ``` ## Create Dashboard ```bash coval dashboards create [OPTIONS] ``` | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Display name | ```bash coval dashboards create --name "Production Metrics" ``` ## Update Dashboard ```bash coval dashboards update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The dashboard ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | ```bash coval dashboards update db_abc123 --name "Staging Metrics" ``` ## Delete Dashboard ```bash coval dashboards delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The dashboard ID to delete | ```bash coval dashboards delete db_abc123 ``` --- ## Widgets Widgets are visual components that live inside a dashboard. All widget commands are nested under `coval dashboards widgets`. ### List Widgets ```bash coval dashboards widgets list [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The parent dashboard ID | | Option | Type | Default | Description | |--------|------|---------|-------------| | `--page-size` | number | 50 | Results per page | **Output columns:** ID, NAME, TYPE, GRID, CREATED ```bash coval dashboards widgets list db_abc123 ``` ### Get Widget ```bash coval dashboards widgets get ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The parent dashboard ID | | `widget_id` | string | **Yes** | The widget ID | ```bash coval dashboards widgets get db_abc123 wgt_xyz789 ``` ### Create Widget ```bash coval dashboards widgets create [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The parent dashboard ID | | Option | Type | Required | Description | |--------|------|----------|-------------| | `--name` | string | **Yes** | Widget display name | | `--type` | string | **Yes** | Widget type (see below) | | `--config` | string | No | JSON config string or `@filepath` to read from file | | `--grid-w` | number | No | Grid width | | `--grid-h` | number | No | Grid height | | `--grid-x` | number | No | Grid X position | | `--grid-y` | number | No | Grid Y position | ### Widget Types | Type | Description | |------|-------------| | `chart` | Line, bar, or area chart visualization | | `table` | Tabular data display | | `text` | Static text or markdown content | ```bash # Create a chart widget coval dashboards widgets create db_abc123 \ --name "Score Trends" \ --type chart \ --config '{"metric_id": "met_001"}' \ --grid-w 6 \ --grid-h 4 # Create a widget with config from a file 
coval dashboards widgets create db_abc123 \ --name "Detailed Report" \ --type table \ --config @widget-config.json ``` ### Update Widget ```bash coval dashboards widgets update [OPTIONS] ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The parent dashboard ID | | `widget_id` | string | **Yes** | The widget ID to update | | Option | Type | Description | |--------|------|-------------| | `--name` | string | New display name | | `--type` | string | New widget type | | `--config` | string | New JSON config or `@filepath` | | `--grid-w` | number | New grid width | | `--grid-h` | number | New grid height | | `--grid-x` | number | New grid X position | | `--grid-y` | number | New grid Y position | ```bash coval dashboards widgets update db_abc123 wgt_xyz789 \ --grid-w 12 --grid-h 6 ``` ### Delete Widget ```bash coval dashboards widgets delete ``` | Argument | Type | Required | Description | |----------|------|----------|-------------| | `dashboard_id` | string | **Yes** | The parent dashboard ID | | `widget_id` | string | **Yes** | The widget ID to delete | ```bash coval dashboards widgets delete db_abc123 wgt_xyz789 ``` --- ## Evaluations for Agents Source: https://docs.coval.dev/agents/overview Give your AI coding agents the tools and knowledge to evaluate AI quality — via Skills, MCP, CLI, or API. Coval works with any AI coding agent. Whether you use Claude Code, Cursor, Windsurf, Codex, or another tool, your agent can run evaluations, manage test sets, and score AI outputs through the interface that fits your workflow. ## Get Started - [Agent Skills](/agents/skills): Install evaluation expertise with one command. Your agent learns how to build test sets, select metrics, and run evals. - [Guided Onboarding](/agents/onboarding): Run `/onboard` and your agent walks you through setting up a complete evaluation from scratch. - [MCP Server](/mcp/overview): Connect the Coval MCP server for native tool access in Claude Desktop, Cursor, and other MCP clients. - [CLI](/cli/overview): The Coval CLI gives agents structured JSON output for scripting evaluations in any terminal. ## Three Ways Agents Use Coval | Layer | What It Does | Install | |-------|-------------|---------| | **Agent Skills** | Teaches agents *how* to evaluate well (knowledge) | `npx skills add coval-ai/coval-external-skills` | | **MCP Server** | Gives agents *tools* to execute evaluations | `npx coval-mcp` | | **CLI** | Runs evaluations from *any terminal* with JSON output | `brew install coval-ai/tap/coval` | Skills and MCP are complementary — Skills give your agent the expertise to design good evaluations, while MCP and CLI let it execute them. Use whichever combination fits your workflow. ## Supported Agents Every major AI coding agent (Claude Code, Cursor, Windsurf, Codex, and others) is supported through some combination of Skills, MCP, CLI, and API. ## AI-Readable Documentation Coval publishes machine-readable documentation following the [llms.txt standard](https://llmstxt.org): - **[llms.txt](https://docs.coval.dev/llms.txt)** — Curated index of all documentation pages (~7KB) - **[llms-full.txt](https://docs.coval.dev/llms-full.txt)** — Complete documentation in a single file (~386KB) Point your agent at these files when it needs context about Coval's platform, API, or concepts. --- ## Guided Onboarding Source: https://docs.coval.dev/agents/onboarding Run /onboard to set up a complete AI evaluation interactively — from connecting your agent to viewing results.
The `/onboard` skill guides you through setting up your first Coval evaluation step by step. Your AI coding agent asks questions about your use case, then creates all the resources and launches the evaluation using the Coval CLI. ## Quick Start ```bash # 1. Install Coval skills npx skills add coval-ai/coval-external-skills # 2. Open your AI coding agent (Claude Code, Cursor, etc.) # 3. Run the onboarding skill /onboard ``` The skill handles everything from there — including installing the CLI and authenticating if you haven't already. ## What Gets Created The onboarding flow creates a complete evaluation setup: | Resource | What It Is | |----------|-----------| | **Agent** | Your AI agent connected to Coval (voice, chat, SMS, or WebSocket) | | **Persona** | A simulated caller with voice, language, and behavior settings | | **Test Set** | 3 test cases: happy path, edge case, and compliance scenario | | **Metrics** | Use-case-specific metrics plus built-in audio and conversation metrics | | **Run Template** | Reusable configuration bundling everything above | | **Evaluation Run** | Your first evaluation, launched and monitored | ## The Flow The skill walks through 6 phases: **Step: Setup** Checks if the Coval CLI is installed and you're authenticated. Guides installation if needed. Detects any existing resources so you don't duplicate work. **Step: Connect Agent** Asks your agent type (voice, chat, SMS, WebSocket) and connection details (phone number or endpoint URL). **Step: Discover Use Case** Asks what your agent does (customer support, insurance, healthcare, sales, etc.) and what language it speaks. Creates a persona tailored to your vertical. **Step: Build Test Cases** Generates 3 test cases based on your use case — a happy path, an edge case, and a compliance scenario. Each includes expected behaviors your agent should follow. **Step: Select Metrics** Recommends metrics based on your use case and agent type. Includes custom LLM judge metrics, audio quality metrics (for voice), and built-in metrics like latency and sentiment. **Step: Launch and Review** Bundles everything into a reusable template, launches the evaluation, watches progress, and presents results with scores per test case. ## Supported Verticals The skill includes templates for these use cases, with pre-built personas, test cases, and metrics for each: | Vertical | Persona | Custom Metric | |----------|---------|---------------| | Customer Support | Jordan | Issue Resolution | | Scheduling & Booking | Taylor | Booking Accuracy | | Sales | Morgan | Sales Accuracy | | Insurance Claims | Sarah | Identity Verification | | Healthcare Intake | Michael | HIPAA Compliance | | Restaurant Orders | Alex | Order Accuracy | | Debt Collection | Chris | Regulatory Compliance | | IT Helpdesk | Pat | Ticket Resolution | If your use case doesn't match a vertical, the skill uses a general-purpose template and adapts based on your description. ## After Onboarding Once your first evaluation completes, you can: - **Add more test cases**: `coval test-cases create --test-set-id --input "..."` - **Schedule recurring runs**: `coval scheduled-runs create --template-id --schedule "0 9 * * MON"` - **Listen to recordings**: `coval simulations audio -o recording.wav` - **Iterate on metrics**: Adjust prompts based on what you learned from results - **View in dashboard**: Visit `app.coval.dev` to see full results with transcripts ## Requirements - An AI coding agent that supports skills (Claude Code, Cursor, Windsurf, Codex, etc.)
- An AI agent to evaluate (voice or chat, accessible via phone number or endpoint) - A Coval account ([sign up at coval.dev](https://coval.dev)) --- ## Agent Skills Source: https://docs.coval.dev/agents/skills Install evaluation expertise into your AI coding agent with one command. Agent Skills are modular knowledge packages that teach your AI coding agent how to evaluate effectively. They follow the open [Agent Skills standard](https://agentskills.io) and work with Claude Code, Cursor, Windsurf, Codex, and 40+ other agents. ## Install ```bash npx skills add coval-ai/coval-external-skills ``` This installs all Coval skills into your agent's skills directory. Skills are loaded on demand — only the name and description are in memory until activated. ## Skills vs MCP vs CLI | | Skills | MCP Server | CLI | |---|--------|-----------|-----| | **What it provides** | Knowledge (how to evaluate well) | Tools (execute operations) | Operations (run from terminal) | | **Install** | `npx skills add coval-ai/coval-external-skills` | `npx coval-mcp` | `brew install coval-ai/tap/coval` | | **Use when** | Agent needs to *design* evaluations | Agent needs to *run* evaluations natively | Scripting, CI/CD, any terminal | | **Works with** | Any agent supporting skills | MCP-compatible clients | Any shell environment | We recommend **Skills + CLI** for the most complete experience. Skills teach your agent what to create, and the CLI executes it with structured JSON output. ## Available Skills ### Onboarding - [onboard](/agents/onboarding): Interactive guided setup for your first evaluation. Walks through connecting an agent, creating personas, building test cases, selecting metrics, and launching a run. ### Runs | Skill | Description | |-------|-------------| | **launch-run** | Launch an evaluation run against an AI agent | | **watch-run** | Monitor a run's progress with live status updates | | **quick-eval** | Full workflow — launch, watch, and summarize results in one go | ### Simulations | Skill | Description | |-------|-------------| | **get-results** | Retrieve and analyze simulation results from a run | | **download-audio** | Download audio recordings from voice simulations | ### Resources | Skill | Description | |-------|-------------| | **coval-resources** | Complete reference for all Coval resources, their hierarchy, relationships, API endpoints, and ID formats | ### Dashboards | Skill | Description | |-------|-------------| | **create-dashboard** | Create a new dashboard and populate it with metric widgets | | **add-widget** | Add a chart, table, or text widget to a dashboard | | **manage-dashboard** | Get, update, or delete a dashboard | | **manage-widgets** | List, update, resize, or delete widgets | | **list-dashboards** | List all dashboards with filtering | ### Test Cases | Skill | Description | |-------|-------------| | **huggingface-import** | Import datasets from HuggingFace and convert them to Coval test sets | ### Migrations | Skill | Description | |-------|-------------| | **migrate-bluejay** | Migrate configuration from Bluejay voice AI testing platform to Coval | ### Human Review | Skill | Description | |-------|-------------| | **review-llm-annotations-and-improve-prompt** | Calculate agreement between human and machine labels, then propose improved metric prompts | ## How Skills Work Skills use **progressive disclosure** to stay lightweight: 1. **At startup** (~100 tokens per skill): Only the `name` and `description` are loaded 2. 
**When activated** (<5000 tokens): The full skill instructions load when your agent detects a relevant task 3. **On demand**: Reference files (templates, examples) load only when needed This means having all Coval skills installed adds minimal overhead to your agent's context. ## Skill Structure Each skill follows the [Agent Skills spec](https://agentskills.io/specification): ``` skill-name/ ├── SKILL.md # Instructions (required) ├── references/ # Templates, detailed docs (optional) ├── scripts/ # Executable code (optional) └── assets/ # Static resources (optional) ``` ## Source Code All skills are open source: [github.com/coval-ai/coval-external-skills](https://github.com/coval-ai/coval-external-skills) --- ## MCP Server Source: https://docs.coval.dev/mcp/overview Use Coval directly from Claude Desktop, Cursor, and other MCP-compatible clients The **Coval MCP Server** enables AI assistants to interact with Coval's evaluation APIs through the [Model Context Protocol](https://modelcontextprotocol.io). ![Coval MCP in Claude Desktop](/images/mcp/claude-desktop.png) ## What You Can Do With the MCP server, you can ask Claude or Cursor to: - **Launch evaluations** - "Run the billing test set against my support agent" - **Monitor runs** - "What's the status of my latest evaluation?" - **Manage agents** - "Create a new voice agent for customer service" - **View metrics** - "Show me the metrics for run abc123" - **Organize tests** - "List my test sets and their configurations" ## Quick Start **Step: Get your API key** Go to [Coval Dashboard](https://app.coval.dev/settings) and copy your API key. **Step: Configure Claude Desktop** Add to `~/Library/Application Support/Claude/claude_desktop_config.json`: ```json { "mcpServers": { "coval": { "command": "npx", "args": ["-y", "@covalai/mcp-server"], "env": { "COVAL_API_KEY": "your_api_key_here" } } } } ``` **Step: Restart Claude Desktop** Quit and reopen Claude Desktop to load the server. **Step: Start using Coval** Ask Claude: "List my Coval agents" or "Show my recent evaluation runs" ## Available Tools The MCP server exposes 18 tools across 6 categories: | Category | Tools | Description | |----------|-------|-------------| | **Runs** | `list_runs`, `get_run`, `create_run`, `delete_run` | Launch and monitor evaluations | | **Agents** | `list_agents`, `get_agent`, `create_agent`, `update_agent` | Manage agent configurations | | **Test Sets** | `list_test_sets`, `get_test_set`, `create_test_set` | Organize test cases | | **Test Cases** | `list_test_cases`, `get_test_case`, `create_test_case`, `update_test_case` | Manage individual test cases | | **Metrics** | `list_metrics`, `get_metric` | View evaluation metrics | | **Personas** | `list_personas`, `get_persona` | Configure simulated users | - [Tools Reference](/mcp/tools): See complete parameter documentation for all tools ## Example Usage Once connected, you can ask Claude or Cursor things like: - "Show me my recent evaluation runs" - "List all my agents" - "Run an evaluation of my customer-support-agent against the billing-inquiries test set" - "What are the metrics for run abc123?" - "Create a new test set for voice agent scenarios" ## Requirements - Node.js 20+ - Coval API key - MCP-compatible client (Claude Desktop, Cursor, etc.) 
## Example Usage

Once connected, you can ask Claude or Cursor things like:

- "Show me my recent evaluation runs"
- "List all my agents"
- "Run an evaluation of my customer-support-agent against the billing-inquiries test set"
- "What are the metrics for run abc123?"
- "Create a new test set for voice agent scenarios"

## Requirements

- Node.js 20+
- Coval API key
- MCP-compatible client (Claude Desktop, Cursor, etc.)

## Support

- [GitHub Issues](https://github.com/coval-ai/mcp-server/issues)
- [Coval Support](mailto:support@coval.dev)

---

## Installation

Source: https://docs.coval.dev/mcp/installation

Configure the Coval MCP server for your AI assistant

## Installation

```bash
npx @covalai/mcp-server
```

## Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

**NPX (Recommended):**

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

**Remote Server:**

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://mcp.coval.dev/mcp",
        "--header",
        "X-API-KEY: ${COVAL_API_KEY}"
      ],
      "env": {
        "COVAL_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

> **Warning:** Restart Claude Desktop after modifying the config file.

## Cursor

Add to `.cursor/mcp.json` in your project:

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your_api_key_here"
      }
    }
  }
}
```

## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `COVAL_API_KEY` | **Yes** | - | Your API key from the [dashboard](https://app.coval.dev/settings) |
| `COVAL_API_BASE_URL` | No | `https://api.coval.dev/v1` | Custom API endpoint |
| `LOG_LEVEL` | No | `info` | Logging level (`debug`, `info`, `warn`, `error`) |
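If you need to point the server at a different API endpoint or turn up logging, the optional variables go in the same `env` block as the API key. A sketch for the Claude Desktop config above (the base URL shown is simply the documented default, and `debug` is one of the supported log levels):

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your_api_key_here",
        "COVAL_API_BASE_URL": "https://api.coval.dev/v1",
        "LOG_LEVEL": "debug"
      }
    }
  }
}
```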
## Local Development

Clone and run from source:

```bash
git clone https://github.com/coval-ai/mcp-server
cd mcp-server
npm install
npm run build

# Set your API key
export COVAL_API_KEY=your_api_key_here

# Run the server
npm start
```

### Testing with MCP Inspector

```bash
npm run inspector
```

This launches a web UI for testing tool calls interactively.

## Troubleshooting

**Server not loading in Claude Desktop**

1. Check that Node.js 20+ is installed: `node --version`
2. Verify the config file path is correct
3. Ensure the JSON syntax is valid (no trailing commas)
4. Restart Claude Desktop completely (quit from the menu bar)

**Authentication errors**

1. Verify your API key in the [dashboard](https://app.coval.dev/settings)
2. Check that the key is correctly set in `COVAL_API_KEY`
3. Ensure there is no extra whitespace around the key

**Tools not appearing**

1. Check the Claude Desktop logs: `~/Library/Logs/Claude/`
2. Run the server manually to see errors: `COVAL_API_KEY=xxx npx @covalai/mcp-server`
3. Verify you have a valid API key (without one, only the `ping` tool is available)

---

## Tools Reference

Source: https://docs.coval.dev/mcp/tools

Complete reference for all MCP server tools

## Run Management

### list_runs

List evaluation runs with filtering and pagination.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token from previous response |
| `order_by` | string | No | Sort order (e.g., `-create_time` for newest first) |
| `filter` | string | No | Filter expression (e.g., `status="COMPLETED"`) |

### get_run

Get detailed information about a specific run.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `run_id` | string | **Yes** | The unique run ID |

Returns run details including status, progress, and metrics (if completed).

### create_run

Launch a new evaluation run.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `agent_id` | string | **Yes** | Agent ID from `list_agents` |
| `persona_id` | string | **Yes** | Persona ID from `list_personas` |
| `test_set_id` | string | **Yes** | Test set ID from `list_test_sets` |
| `metric_ids` | string[] | No | Specific metrics to evaluate |
| `options.iteration_count` | number | No | Iterations per test case (1-10, default: 1) |
| `options.concurrency` | number | No | Parallel simulations (1-5, default: 1) |
| `metadata` | object | No | Custom metadata for tracking |
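Putting the parameters together, a `create_run` call that runs each test case three times with two simulations in parallel might look like the sketch below. The IDs are placeholders (look up real ones with `list_agents`, `list_personas`, and `list_test_sets`), and the `metadata` keys are entirely up to you:

```json
{
  "name": "create_run",
  "arguments": {
    "agent_id": "your_agent_id",
    "persona_id": "your_persona_id",
    "test_set_id": "your_test_set_id",
    "options": {
      "iteration_count": 3,
      "concurrency": 2
    },
    "metadata": {
      "triggered_by": "mcp-example"
    }
  }
}
```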
---

## Agent Management

### list_agents

List all configured agents.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token |
| `order_by` | string | No | Sort order |
| `filter` | string | No | Filter by `model_type`, `display_name`, etc. |

### get_agent

Get detailed configuration for a specific agent.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `agent_id` | string | **Yes** | Agent ID from `list_agents` |

### create_agent

Create a new agent configuration.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `display_name` | string | **Yes** | Human-readable name (1-200 chars) |
| `model_type` | string | **Yes** | Agent type (see below) |
| `phone_number` | string | No | E.164 format for voice agents |
| `endpoint` | string | No | Webhook or WebSocket URL |
| `prompt` | string | No | System prompt/instructions |
| `metadata` | object | No | Custom metadata |

**Model Types:**

- `MODEL_TYPE_VOICE` - Inbound voice
- `MODEL_TYPE_OUTBOUND_VOICE` - Outbound voice
- `MODEL_TYPE_CHAT` - Chat/text
- `MODEL_TYPE_SMS` - SMS messaging
- `MODEL_TYPE_WEBSOCKET` - WebSocket

### update_agent

Update an existing agent configuration.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `agent_id` | string | **Yes** | Agent to update |
| `display_name` | string | No | New name |
| `phone_number` | string | No | New phone number |
| `endpoint` | string | No | New endpoint URL |
| `prompt` | string | No | New system prompt |
| `metadata` | object | No | New metadata |
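As an illustration, registering an inbound voice agent reachable at a test phone number could look like the call below. The display name, phone number, and prompt are placeholders, not values the API expects:

```json
{
  "name": "create_agent",
  "arguments": {
    "display_name": "Customer Support Voice Agent",
    "model_type": "MODEL_TYPE_VOICE",
    "phone_number": "+15551234567",
    "prompt": "You are a helpful customer support agent for a billing team."
  }
}
```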
---

## Test Set Management

### list_test_sets

List all test sets available for evaluation.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token |
| `order_by` | string | No | Sort order |
| `filter` | string | No | Filter expression |

### get_test_set

Get detailed information about a test set.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `test_set_id` | string | **Yes** | Test set ID from `list_test_sets` |

### create_test_set

Create a new test set.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `display_name` | string | **Yes** | Test set name (1-100 chars) |
| `slug` | string | No | URL-friendly ID (auto-generated if omitted) |
| `description` | string | No | Test set description |
| `test_set_type` | string | No | `DEFAULT`, `SCENARIO`, `TRANSCRIPT`, or `WORKFLOW` |
| `test_set_metadata` | object | No | Configuration metadata |
| `parameters` | object | No | Test parameterization |

---

## Test Case Management

### list_test_cases

List test cases with optional filtering by test set.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `test_set_id` | string | No | Filter by test set ID |
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token |
| `order_by` | string | No | Sort order |
| `filter` | string | No | Filter expression |

### get_test_case

Get detailed information about a test case.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `test_case_id` | string | **Yes** | Test case ID from `list_test_cases` |

### create_test_case

Create a new test case in a test set.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `test_set_id` | string | **Yes** | Test set to add the case to |
| `display_name` | string | **Yes** | Test case name |
| `description` | string | No | Test case description |
| `input` | object | No | Input data for the test |
| `expected_output` | object | No | Expected output for validation |
| `metadata` | object | No | Custom metadata |

### update_test_case

Update an existing test case.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `test_case_id` | string | **Yes** | Test case to update |
| `display_name` | string | No | New name |
| `description` | string | No | New description |
| `input` | object | No | New input data |
| `expected_output` | object | No | New expected output |
| `metadata` | object | No | New metadata |

---

## Metrics

### list_metrics

List available evaluation metrics.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token |
| `order_by` | string | No | Sort order |
| `filter` | string | No | Filter expression |
| `include_builtin` | boolean | No | Include built-in metrics |

### get_metric

Get detailed configuration for a specific metric.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `metric_id` | string | **Yes** | Metric ID from `list_metrics` |

---

## Personas

### list_personas

List available simulated personas for testing.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `page_size` | number | No | Results per page (1-100, default: 50) |
| `page_token` | string | No | Pagination token |
| `order_by` | string | No | Sort order |
| `filter` | string | No | Filter expression |

### get_persona

Get detailed configuration for a specific persona.

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `persona_id` | string | **Yes** | Persona ID from `list_personas` |

Returns persona configuration including voice settings, language, and behavior.

---

## Getting More Out of Coval with Claude (MCP Guide)

Source: https://docs.coval.dev/mcp/beginners-guide

A practical guide for Coval UI users who want to work faster using Claude and the MCP server

If you already use the Coval UI, you know how much you can do — but there's a faster way to work.

Coval's MCP server connects Claude directly to your Coval workspace, so you can create and manage evaluations just by describing what you want. No clicking through menus. No copy-pasting prompts. Just tell Claude what you need.

---

## What you can do

### Build test sets faster

Instead of manually entering test cases one by one, describe your scenarios to Claude:

"Create a test set for a billing support bot — include cases for refund requests, subscription changes, and payment failures."

Claude generates the cases and adds them directly to your workspace (see the sketch at the end of this guide).

### Create and refine metrics

Describe the behavior you want to evaluate in plain language. Claude can draft a Composite Evaluation criteria set, and you can iterate on it conversationally until it captures exactly what matters.

### Trigger simulation runs

Ask Claude to kick off a run against a specific agent and test set. There's no need to navigate to the UI — Claude handles it and can summarize the results when it finishes.

### Check results and debug

Ask "Which test cases are failing?" or "What's the pass rate on my escalation metric?" Claude pulls the data and explains what it finds.

---

## How to get set up

MCP requires Claude Desktop (the downloadable app) — it doesn't run in the browser at claude.ai.

### Option 1: Via the app

1. Download Claude Desktop at [anthropic.com/download](https://anthropic.com/download)
2. Open **Settings → Developer → Edit Config**
3. Paste in the Coval MCP server config (copy it from the [Installation](/mcp/installation) page)
4. Restart Claude Desktop

### Option 2: Via terminal

```bash
# 1. Install Claude Desktop (if you haven't already)
#    Download from anthropic.com/download and run the installer

# 2. Open the Claude Desktop config file
open ~/Library/Application\ Support/Claude/claude_desktop_config.json
```

Add the Coval MCP server entry to the `mcpServers` section:

```json
{
  "mcpServers": {
    "coval": {
      "command": "npx",
      "args": ["-y", "@covalai/mcp-server"],
      "env": {
        "COVAL_API_KEY": "your-api-key-here"
      }
    }
  }
}
```

```bash
# 3. Restart Claude Desktop
```

Your Coval API key can be found in the Coval UI under **Settings → API Keys**.
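To make the "build test sets faster" example concrete, here is a hedged sketch of the kind of tool calls Claude issues behind the scenes for the billing-support request: one `create_test_set` call followed by a `create_test_case` call for each scenario. The names, descriptions, and the shapes of `input` and `expected_output` are illustrative (Claude chooses its own wording), and the real `test_set_id` comes back in the server's response to the first call:

```json
[
  {
    "name": "create_test_set",
    "arguments": {
      "display_name": "Billing Support Scenarios",
      "description": "Refund requests, subscription changes, and payment failures",
      "test_set_type": "SCENARIO"
    }
  },
  {
    "name": "create_test_case",
    "arguments": {
      "test_set_id": "id_returned_by_create_test_set",
      "display_name": "Refund request for duplicate charge",
      "description": "Caller was charged twice and wants one charge refunded",
      "input": { "scenario": "Caller reports a duplicate charge on this month's invoice" },
      "expected_output": { "expected_behavior": "Agent verifies the charge, apologizes, and initiates a refund" }
    }
  }
]
```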