You can send traces from your agent to Coval using the OpenTelemetry SDK. This lets you capture detailed span data (such as tool calls, LLM invocations, and other operations) and export it directly to Coval for analysis alongside your simulation or conversation results. Tracing works for both simulations (where Coval calls your agent) and conversations (where you submit post-hoc call data). The setup differs only in how you identify the call; everything else (instrumentation, span naming, viewing) is the same.
Prerequisites
- A Coval account with an API key (manage your keys)
- A simulation output ID (for simulations) or a conversation ID (for conversations)
- Python 3.8+ with the OpenTelemetry SDK installed
Configuration
Configure the OpenTelemetry tracer provider to export spans to Coval’s trace ingestion endpoint. The exporter takes the following parameters:
| Parameter | Description |
|---|---|
| endpoint | Coval’s OTLP trace ingestion URL: https://api.coval.dev/v1/traces |
| X-API-Key | Your Coval API key |
| X-Simulation-Id | The simulation output ID for the individual call being traced. This is per-simulation-call, not the run ID. |
| timeout | Export timeout in seconds. Must be set to 30 (see note below) |
| SERVICE_NAME | A name identifying your agent service |
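A minimal sketch of this configuration in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages are installed. The helper names coval_exporter_kwargs and configure_tracing are illustrative and not part of any Coval API; only the endpoint, the header names, and the 30-second timeout come from the table above.

```python
from typing import Any, Dict

COVAL_TRACES_ENDPOINT = "https://api.coval.dev/v1/traces"


def coval_exporter_kwargs(api_key: str, simulation_id: str) -> Dict[str, Any]:
    """Build the OTLP exporter arguments listed in the parameter table."""
    if not api_key or not simulation_id:
        raise ValueError("api_key and simulation_id are both required")
    return {
        "endpoint": COVAL_TRACES_ENDPOINT,
        "headers": {"X-API-Key": api_key, "X-Simulation-Id": simulation_id},
        "timeout": 30,  # seconds
    }


def configure_tracing(api_key: str, simulation_id: str, service_name: str):
    """Install a tracer provider that exports each finished span to Coval."""
    # Imported inside the function so the helper above works without the SDK.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor

    provider = TracerProvider(resource=Resource.create({SERVICE_NAME: service_name}))
    exporter = OTLPSpanExporter(**coval_exporter_kwargs(api_key, simulation_id))
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return provider
```

Because X-Simulation-Id is fixed when the exporter is created, a setup like this would typically run once per call, after the simulation output ID is known.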
The timeout parameter must be set to 30 seconds to ensure spans are exported reliably. We are working on reducing this requirement in a future update.
Getting the Simulation Output ID
The X-Simulation-Id header must be set to the simulation output ID for the specific call you’re tracing. The simulation output ID is a per-call identifier, different from the run ID. Here’s how to obtain it at runtime.
Inbound voice agents
When Coval places an inbound call, it passes the simulation output ID as a SIP header: X-Coval-Simulation-Id. Read this header when the call arrives and use it to configure your OTLP exporter.
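How SIP headers reach your code depends on your telephony stack. As a sketch, assuming the stack hands you the headers as a string-to-string mapping, a lookup could be written as follows (the function name is hypothetical):

```python
from typing import Mapping, Optional

COVAL_SIP_HEADER = "X-Coval-Simulation-Id"


def simulation_id_from_sip_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the simulation output ID from a call's SIP headers, if present."""
    wanted = COVAL_SIP_HEADER.lower()
    for name, value in headers.items():
        # SIP header names are case-insensitive, so compare in lowercase.
        if name.lower() == wanted and value.strip():
            return value.strip()
    return None
```

Returning None rather than raising lets the agent keep handling calls that did not originate from Coval, and simply skip trace export for them.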
Twilio Programmable Voice (PSTN): Standard Twilio phone numbers route over the public telephone network, which strips SIP headers. Use the pre_call_webhook_url agent config instead: Coval will POST the simulation ID to your agent before dialing. See the Twilio ConversationRelay guide.
Outbound voice agents
Coval’s outbound trigger POST can include the simulation output ID in the request payload. Add simulation_output_id to your trigger_call_payload configuration in your template, then read it when your webhook receives the trigger and use it to configure the exporter.
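A sketch of reading the ID in the webhook handler. It assumes the trigger body is JSON and that simulation_output_id arrives as a top-level key; the actual shape depends on how your trigger_call_payload template is written, so treat the key lookup as an assumption to adapt.

```python
import json
from typing import Optional, Union


def simulation_id_from_trigger(body: Union[str, bytes]) -> Optional[str]:
    """Extract simulation_output_id from a trigger request body, if present."""
    payload = json.loads(body)
    # Assumption: the field is a top-level key of a JSON object.
    value = payload.get("simulation_output_id") if isinstance(payload, dict) else None
    return value if isinstance(value, str) and value else None
```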
Tracing for Conversations
For conversations (post-hoc call evaluation), there is no Coval-initiated call, so there is no simulation output ID available at call time. Instead, you use a conversation ID to associate traces with a conversation. The conversation ID is only available after the call ends and you submit the transcript to Coval, which means you can’t configure the OTLP exporter up front. The solution is to buffer spans in memory during the call, then flush them once you have the ID.
Buffer spans during the call
Use InMemorySpanExporter (included in opentelemetry-sdk) to hold spans locally during the call instead of exporting them in real time.
Submit the conversation after the call ends
Post the transcript (and optionally audio) to POST /v1/conversations:submit. The response contains the conversation_id you need for trace export. See POST /v1/conversations:submit for the full request schema including optional audio, metadata, and metrics fields.
When you export the buffered spans, identify the conversation with this header:
| Parameter | Description |
|---|---|
| X-Conversation-Id | The conversation_id returned by POST /v1/conversations:submit. Use this instead of X-Simulation-Id. |
Full conversation tracing example
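The following is a sketch of the buffer-then-flush flow under stated assumptions, not a reference implementation. It assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http are installed. The function names, and the run_call and submit_conversation callables, are hypothetical: run_call drives your agent with a tracer, and submit_conversation wraps your own call to POST /v1/conversations:submit and returns the conversation_id (the request body is deliberately left to you, since its schema is documented separately).

```python
from typing import Any, Callable, Dict

COVAL_TRACES_ENDPOINT = "https://api.coval.dev/v1/traces"


def conversation_headers(api_key: str, conversation_id: str) -> Dict[str, str]:
    """Headers that attach exported spans to a submitted conversation."""
    return {"X-API-Key": api_key, "X-Conversation-Id": conversation_id}


def trace_conversation(
    api_key: str,
    service_name: str,
    run_call: Callable[[Any], None],
    submit_conversation: Callable[[], str],
) -> bool:
    """Buffer spans during the call, then flush them under the conversation ID."""
    # Imported inside the function so the helper above works without the SDK.
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor, SpanExportResult
    from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

    # 1. Hold spans in memory while the call is in progress.
    buffer = InMemorySpanExporter()
    provider = TracerProvider(resource=Resource.create({SERVICE_NAME: service_name}))
    provider.add_span_processor(SimpleSpanProcessor(buffer))
    run_call(provider.get_tracer(service_name))

    # 2. Submit the transcript; the caller-supplied function returns the ID.
    conversation_id = submit_conversation()

    # 3. Export everything that was buffered, tagged with the conversation ID.
    exporter = OTLPSpanExporter(
        endpoint=COVAL_TRACES_ENDPOINT,
        headers=conversation_headers(api_key, conversation_id),
        timeout=30,
    )
    result = exporter.export(buffer.get_finished_spans())
    exporter.shutdown()
    provider.shutdown()
    return result is SpanExportResult.SUCCESS
```

This sketch flushes all spans in a single request. For long calls, the guidance in Payload Limits & Batching applies: the flush may need to be split across several requests.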
Uploading Traces via the Dashboard
You can also upload traces directly from the Coval dashboard without using the SDK. In the Conversations page, click Upload to Conversations and:
- Add your audio file or transcript as usual
- In the Traces (Optional) section, select your OTLP traces JSON file (must contain a resourceSpans array)
- Click Upload. The conversation and traces are submitted together
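Before uploading, it can help to confirm that the file has the required top-level shape. The check below is a small local sanity test with hypothetical function names; it only verifies that a resourceSpans array exists and does not validate the rest of the OTLP structure.

```python
import json
from typing import Any


def has_resource_spans(document: Any) -> bool:
    """True if the parsed JSON is an object with a resourceSpans array."""
    return isinstance(document, dict) and isinstance(document.get("resourceSpans"), list)


def check_traces_file(path: str) -> bool:
    """Load a traces JSON file from disk and check its top-level shape."""
    with open(path, "r", encoding="utf-8") as handle:
        return has_resource_spans(json.load(handle))
```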
Payload Limits & Batching
A single export request to /v1/traces has a size limit. Large buffered exports, most commonly the end-of-call flush in the conversation flow above, can exceed it and fail with 413 Request Entity Too Large.
Keep each export request under roughly 3–4 MB. Treat this as a practical target, not a fixed contract: stay comfortably below it rather than tuning to an exact boundary.
Splitting spans across requests
You can split one call’s spans across multiple export requests. Every request carrying the same X-Conversation-Id (or X-Simulation-Id) is merged server-side into a single trace, reconstructed from each span’s parent/child relationships. There is no ordering requirement between requests.
The simplest way to stay under the limit is BatchSpanProcessor with a bounded batch size, which chunks exports for you. Lower max_export_batch_size if your spans carry large attributes such as full transcripts or prompts.
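A sketch of both approaches, assuming opentelemetry-sdk is installed for the BatchSpanProcessor part. The helper names and the batch size of 64 are illustrative starting points, not values recommended by Coval; pick a size that keeps each request comfortably under the limit for your span sizes.

```python
from typing import Any, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(spans: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size spans."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(spans), batch_size):
        yield spans[start:start + batch_size]


def bounded_batch_processor(exporter: Any, batch_size: int = 64):
    """Wrap an exporter so live spans are sent in bounded batches."""
    # Imported inside the function so the other helpers work without the SDK.
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    return BatchSpanProcessor(exporter, max_export_batch_size=batch_size)


def flush_in_batches(exporter: Any, spans: Sequence[T], batch_size: int = 64) -> int:
    """Export buffered spans as several requests; returns the request count."""
    requests = 0
    for batch in chunked(spans, batch_size):
        # Real code should also inspect the result of each export call.
        exporter.export(batch)
        requests += 1
    return requests
```

flush_in_batches fits the end-of-call flush in the conversation flow, where spans were buffered in memory; bounded_batch_processor fits live export during a simulation.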
Spans can arrive before POST /v1/conversations:submit has finished registering the conversation. They are still attributed correctly and reconcile automatically, so no special handling is needed on your side.
Instrumenting Your Agent
Once the tracer is configured, wrap operations in spans to capture trace data.
Shutdown: Call provider.shutdown() when your agent exits. With SimpleSpanProcessor, spans are exported synchronously as each span ends (not buffered), so they are already in Coval before shutdown is called. Shutdown is still good practice for clean resource teardown.
Span Naming Conventions
Coval’s trace viewer applies semantic colors and labels to well-known span names. Using these names gives a richer experience in the UI and enables built-in trace metrics.
| Span Name | Use For | Required Attributes | Optional / Recommended Attributes | Accepted Compatibility Aliases |
|---|---|---|---|---|
| llm | LLM invocations | - | metrics.ttfb (seconds), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, llm.finish_reason (stop, tool_calls, length, content_filter) | - |
| tts | Text-to-Speech | - | metrics.ttfb (seconds) | - |
| stt | Speech-to-Text | transcript when using STT Word Error Rate or the Audio Upload variant | metrics.ttfb (seconds), stt.confidence (ASR confidence 0.0-1.0) | stt.transcription is accepted by STT WER for older integrations, but new integrations should emit transcript |
| stt.provider.<name> | Per-provider STT attempt (child of stt) | - | stt.providerName, stt.confidence, metrics.ttfb | - |
| vad | Voice Activity Detection | - | - | - |
| llm_tool_call | Individual tool/function calls | - | function.name, tool_call_id, function.arguments | Span name tool_call; attributes tool.name, tool.call_id, tool.arguments |
| turn | A single conversation turn | - | - | - |
| conversation | Full conversation | - | - | - |
| pipeline | Processing pipeline | - | - | - |
| transport | Audio/network transport | - | - | - |
Set service.name in your Resource to group spans by service.
For complete working implementations, see the voice agent examples on GitHub — Vapi, Pipecat, and LiveKit agents that emit the full span schema.
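As an illustration of the naming conventions, the sketch below wraps one turn in turn, llm, and llm_tool_call spans. The tracer argument is any object exposing start_as_current_span, such as the tracer returned by opentelemetry.trace.get_tracer. The function name and the shape of the reply dictionary (input_tokens, output_tokens, tool_calls) are hypothetical stand-ins for whatever your LLM client returns.

```python
import json
import time
from typing import Any, Callable, Dict


def run_turn(
    tracer: Any,
    call_llm: Callable[[], Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Run one turn: an llm span, plus an llm_tool_call span per requested tool."""
    with tracer.start_as_current_span("turn"):
        with tracer.start_as_current_span("llm") as llm_span:
            started = time.monotonic()
            reply = call_llm()
            # Stand-in for time to first byte: this sketch does not stream,
            # so it records the full response time instead.
            llm_span.set_attribute("metrics.ttfb", time.monotonic() - started)
            llm_span.set_attribute("gen_ai.usage.input_tokens", reply["input_tokens"])
            llm_span.set_attribute("gen_ai.usage.output_tokens", reply["output_tokens"])
        for call in reply.get("tool_calls", []):
            with tracer.start_as_current_span("llm_tool_call") as tool_span:
                tool_span.set_attribute("function.name", call["name"])
                tool_span.set_attribute("tool_call_id", call["id"])
                tool_span.set_attribute("function.arguments", json.dumps(call["arguments"]))
                call["result"] = tools[call["name"]](**call["arguments"])
    return reply
```

Nesting the context managers is what produces the parent/child relationships: the llm and llm_tool_call spans become children of the enclosing turn span.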
Instrumenting STT Spans
To use the STT Word Error Rate metric (or its Audio Upload variant), your agent must emit stt spans with a transcript attribute containing the transcribed text. This is what allows Coval to compare your agent’s STT output against a reference transcript. Coval also accepts the older stt.transcription alias for compatibility, but transcript is the canonical attribute for new integrations. We also recommend attaching stt.confidence when your STT provider exposes a per-utterance confidence score.
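A framework-agnostic sketch of such a span. The tracer argument is any object exposing start_as_current_span, such as the tracer returned by opentelemetry.trace.get_tracer. The function name and the transcribe callable, assumed here to return a (text, confidence) pair, are hypothetical placeholders for your STT client.

```python
from typing import Any, Callable, Optional, Tuple


def traced_transcribe(
    tracer: Any,
    transcribe: Callable[[bytes], Tuple[str, Optional[float]]],
    audio: bytes,
) -> str:
    """Transcribe audio inside an stt span that carries the final transcript."""
    with tracer.start_as_current_span("stt") as span:
        text, confidence = transcribe(audio)
        # transcript is the canonical attribute read by STT Word Error Rate.
        span.set_attribute("transcript", text)
        if confidence is not None:
            span.set_attribute("stt.confidence", float(confidence))
        return text
```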
With the Pipecat framework and PipelineTask(..., enable_tracing=True), Pipecat still emits the standard stt span, and a custom subclass can add stt.confidence onto that same span.
Whatever framework you use, name the span "stt" and include the transcript attribute with the transcribed text. stt.confidence is optional, but when present it should be a 0.0-1.0 score for the final utterance.
Instrumenting LLM Spans
Include llm.finish_reason on llm spans so you can tell why the model stopped generating. This is especially useful when debugging responses that were silently cut off because llm.finish_reason=length.
In Pipecat, you can enrich the built-in traced llm span. In other frameworks, set the attribute on your llm span after the provider response finishes.
Typical values are stop, length, tool_calls, and content_filter.
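A sketch of setting the attribute once the response has finished. The tracer argument is any object exposing start_as_current_span. The function name and the complete callable, assumed to return a (text, finish_reason) pair, are hypothetical; map your provider's own stop reason onto one of the values above before recording it.

```python
from typing import Any, Callable, Tuple


def traced_completion(
    tracer: Any, complete: Callable[[], Tuple[str, str]]
) -> Tuple[str, bool]:
    """Run an LLM call in an llm span; return (text, was_truncated)."""
    with tracer.start_as_current_span("llm") as span:
        text, finish_reason = complete()
        span.set_attribute("llm.finish_reason", finish_reason)
        # "length" means the output hit the token limit and was cut off.
        return text, finish_reason == "length"
```

Returning the truncation flag alongside the text lets the caller react, for example by retrying with a larger token budget, while the span keeps the evidence for later debugging.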
Provider Fallback Spans
Many voice agents use a provider fallback chain for STT, for example Deepgram → Google → Azure. Without per-provider spans, a single stt span only shows the final result; there is no visibility into which provider served the call, how long each attempt took, or why a fallback triggered.
The convention is to create one stt.provider.<name> child span per provider attempt, nested inside the parent stt span.
Span attributes
| Attribute | Type | Description |
|---|---|---|
| stt.providerName | string | Provider name, e.g. "deepgram", "google", "azure" |
| stt.confidence | float | ASR confidence score from this provider (0.0–1.0) |
| metrics.ttfb | float | Time to first byte for this provider attempt (seconds) |
Code example
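A sketch of the convention, under stated assumptions. The tracer argument is any object exposing start_as_current_span whose spans support set_attribute and record_exception, as OpenTelemetry spans do. The function name, the Provider type, and the idea that each provider is a callable returning (text, confidence) are hypothetical; real provider clients will differ.

```python
import time
from typing import Any, Callable, Optional, Sequence, Tuple

# Hypothetical shape: (provider name, callable returning (text, confidence)).
Provider = Tuple[str, Callable[[bytes], Tuple[str, float]]]


def transcribe_with_fallback(
    tracer: Any, providers: Sequence[Provider], audio: bytes
) -> Optional[str]:
    """Try each STT provider in order, with one child span per attempt."""
    with tracer.start_as_current_span("stt") as parent:
        for name, transcribe in providers:
            with tracer.start_as_current_span(f"stt.provider.{name}") as attempt:
                attempt.set_attribute("stt.providerName", name)
                started = time.monotonic()
                try:
                    text, confidence = transcribe(audio)
                except Exception as exc:
                    # Provider failed: record why, then fall back to the next one.
                    attempt.record_exception(exc)
                    continue
                finally:
                    # Stand-in for time to first byte: measures the whole attempt.
                    attempt.set_attribute("metrics.ttfb", time.monotonic() - started)
                attempt.set_attribute("stt.confidence", confidence)
            parent.set_attribute("transcript", text)
            return text
    return None
```

Recording the exception on the failed attempt's span is what makes the reason for a fallback visible in the trace, alongside how long that attempt took.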
Viewing Traces in Coval
After a simulation completes or conversation traces are received, an OTel Traces card automatically appears in the metric grid on the result page when trace data is available. The card shows the total span count and a View Traces button that navigates directly to the trace viewer. To view traces: open a run or conversation result, click into a result, and click the OTel Traces card. You can also navigate to the trace viewer directly via URL.
Trace viewer features
The trace viewer has two visualization modes you can switch between using the toggle in the header:
Waterfall view: Shows spans as horizontal bars on a timeline, nested by parent-child relationships. Use the collapse/expand controls to focus on specific parts of the call hierarchy. You can filter by span type using the color-coded legend pills in the header.
Flame graph view: Shows all spans stacked by depth, giving a bird's-eye view of where time is spent. Interactions include:
- Scroll to pan the timeline left/right
- Ctrl/Cmd + scroll to zoom in and out
- Drag-select a region to zoom into that time range
- Double-click a span to zoom to fit that span’s duration
- Press F to reset the view to fit the full trace
- A mini-map above the flame graph shows the full trace with your current viewport highlighted — drag it to pan quickly
Transition Hotspots
Transition Hotspots give you a run-level view of how conversations flow through your agent’s states, and where they fail. Rather than inspecting individual simulations one by one, you can see the full distribution of state-to-state transitions across an entire run at a glance.
Walkthrough
Accessing Transition Hotspots
The Hotspots tab appears on the run results page when at least one simulation in the run has OTel trace data. Navigate to a run, then click the Hotspots tab. If the tab is not visible, the run does not contain any traced simulations. You can also access it directly via the ?view=hotspots query parameter on the run results URL.
Reading the Heatmap
The Hotspots view displays a heatmap matrix where:
- Rows represent the origin state of a transition (the “from” state)
- Columns represent the destination state (the “to” state)
- Each cell represents a pair of states — for example, “greeting → account_lookup”
The heatmap offers two views:
| View | Description |
|---|---|
| Counts | Each cell shows how many times that state-to-state transition occurred across all simulations in the run |
| Failure Rate | Each cell shows the percentage of simulations that failed when hitting that transition |
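To make the two views concrete, here is an illustrative computation of both numbers from per-simulation state sequences. This is not Coval's implementation. The function name and input shape are hypothetical, and it assumes that the failure rate of a cell is the share of simulations containing that transition that ended in failure.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Transition = Tuple[str, str]


def transition_stats(
    simulations: Iterable[Tuple[List[str], bool]],
) -> Dict[Transition, Dict[str, float]]:
    """Aggregate (state sequence, failed) pairs into per-transition stats."""
    counts: Counter = Counter()       # occurrences of each transition
    sims_hit: Counter = Counter()     # simulations containing the transition
    sims_failed: Counter = Counter()  # of those, how many failed
    for states, failed in simulations:
        pairs = list(zip(states, states[1:]))
        counts.update(pairs)
        for pair in set(pairs):
            sims_hit[pair] += 1
            if failed:
                sims_failed[pair] += 1
    return {
        pair: {
            "count": counts[pair],
            "failure_rate": 100.0 * sims_failed[pair] / sims_hit[pair],
        }
        for pair in counts
    }
```

Each key of the result corresponds to one heatmap cell: the first element is the row (origin state) and the second is the column (destination state).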
Drilling Down
Click any cell in the heatmap to open a detail panel showing:
- The total count and failure count for that transition
- Exemplar simulations — individual simulations that passed through that state transition, with direct links to review them
Top Hotspots Sidebar
The Top Hotspots sidebar ranks state transitions by failure count, making it easy to find the most impactful problems without scanning the full matrix. The top-ranked transitions are the ones where the most simulations failed.
Span Filters
Use the span type filters to include or exclude specific span types from the transition analysis. Wrapper spans (such as conversation, pipeline, transport, and session:* spans) are automatically collapsed and filtered by default, so the heatmap focuses on the meaningful transitions within your agent’s processing logic.
Full Example
Using Span Attributes in Custom Metrics
Any numeric span attribute your agent emits can be measured using a Custom Trace Metric (METRIC_CUSTOM_TRACE). This lets you track latency, token counts, or any other numeric value from your traces without writing custom evaluation code.
To create a custom trace metric, specify:
- Span Name: the span_name of the spans to aggregate (e.g. llm, tts, or any custom span you create)
- Metric Attribute: the span attribute key containing the numeric value (e.g. metrics.ttfb, token_count)
- Aggregation Method: how to aggregate across turns: average, median, p90, max, or min
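To preview locally what each aggregation method would produce for a set of attribute values, a rough equivalent can be sketched as below. The function names are hypothetical, and the p90 here uses the nearest-rank definition, which is an assumption: Coval's exact percentile method is not specified in this page and may differ slightly.

```python
import statistics
from typing import Callable, Dict, Sequence


def p90(values: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


AGGREGATIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "average": statistics.mean,
    "median": statistics.median,
    "p90": p90,
    "max": max,
    "min": min,
}


def aggregate(values: Sequence[float], method: str) -> float:
    """Aggregate one numeric span attribute across the turns of a call."""
    if not values:
        raise ValueError("no span attribute values to aggregate")
    return AGGREGATIONS[method](values)
```

For example, passing the metrics.ttfb values collected from every llm span in a call with method "p90" gives a tail-latency figure for that call.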

