The transcript tells you what the agent said. It doesn’t tell you whether the tool call actually fired, what arguments it received, how long it took, or what it returned. To validate tool calls, instrument your agent with tracing so you can see everything that happened underneath the conversation. This cookbook follows one running example, an auto-insurance voice agent named claims-bot, from instrumentation all the way to a custom LLM judge that catches it making up data.
The Example
A caller dials claims-bot after locking their keys in the car. The agent needs to:
- Verify the caller’s identity (DOB + last 4 of policy number)
- Look up their roadside-assistance coverage
- Dispatch a locksmith and quote an ETA
Each step maps to one tool: verify_caller, lookup_policy, dispatch_roadside. We want every factual claim the agent makes on the call (the policy tier, the dispatch ETA, the claim ID) to be backed by a real tool result, not invented because a tool errored.
Step One: Instrument claims-bot with OpenTelemetry
Send traces from your agent to Coval using the OpenTelemetry SDK. This captures detailed span data — tool calls, LLM invocations, and other operations — and exports it directly to Coval alongside your simulation results.
Follow the setup guide: OpenTelemetry Traces.
When you instrument the LLM, emit a tool_call span (or llm_tool_call, depending on your convention) per tool invocation with the tool’s arguments, result, and any error. See Instrumenting LLM Spans for the exact shape Coval expects.
For our claims-bot example, a single turn that calls lookup_policy produces one tool_call span. Across the full locksmith call you should see three tool_call spans, each nested under the LLM turn that produced it: verify_caller, lookup_policy, dispatch_roadside.
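As a minimal sketch of the one-span-per-tool-invocation idea, the helper below wraps a tool function and records its arguments, result, error, and latency on a `tool_call` span. It assumes an OpenTelemetry-style tracer object is passed in; the attribute keys (`tool.name`, `tool.arguments`, `tool.result`, `tool.error`, `tool.latency_ms`) and the `lookup_policy` stub are illustrative assumptions, not the schema Coval expects. Check Instrumenting LLM Spans for the real shape.

```python
import json
import time


def call_tool_with_span(tracer, tool_name, tool_fn, arguments):
    """Run one tool and record it as a single `tool_call` span.

    `tracer` is any OpenTelemetry-style tracer exposing
    `start_as_current_span`. The attribute keys are illustrative
    assumptions, not Coval's required schema.
    """
    with tracer.start_as_current_span("tool_call") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.arguments", json.dumps(arguments))
        started = time.perf_counter()
        try:
            result = tool_fn(**arguments)
        except Exception as exc:
            # Note the failure on the span, then re-raise so the agent
            # has to handle the error instead of improvising a result.
            span.set_attribute("tool.error", str(exc))
            raise
        else:
            span.set_attribute("tool.result", json.dumps(result))
            return result
        finally:
            # Runs on both the success and the error path.
            elapsed_ms = (time.perf_counter() - started) * 1000
            span.set_attribute("tool.latency_ms", elapsed_ms)


def lookup_policy(policy_id):
    # Hypothetical stand-in for the real policy lookup.
    return {"policy_id": policy_id, "tier": "plus", "roadside": True}
```

With the OpenTelemetry Python API, `tracer` would typically come from `opentelemetry.trace.get_tracer(__name__)`, and a call would look like `call_tool_with_span(tracer, "lookup_policy", lookup_policy, {"policy_id": "P-48213"})`.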
Without tool_call spans, Coval can only show you the surrounding conversation. The span is what unlocks every step below.
Step Two: Inspect the Trace for the Locksmith Call
Run a simulation of the locksmith scenario, then open its detail page and click the Traces card. You’ll see each tool call in order with everything you need to debug it:

| What you see in the trace | Example value from the locksmith call |
|---|---|
| Tool name | dispatch_roadside |
| Arguments | {"policy_id": "P-48213", "service": "locksmith", "location": "37.7749,-122.4194"} |
| Result | {"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."} |
| Latency | 1.8s |
| Error (if any) | null |
| Span metadata | model="gpt-4o", turn=4, parent_span=llm_turn |
Step Three: Search Tool Calls Across All claims-bot Simulations
To look across many simulations at once, use Trace Search. Try a query like “tool calls in last week” to pull every tool call span from the past 7 days:
Open Trace Search →
For claims-bot last week, this surfaces a pattern: dispatch_roadside returned SERVICE_UNAVAILABLE on 12 of 340 calls (≈3.5%). From the search results you can:
- Drill into specific simulations: open any of those 12 calls and see how the agent responded after the error. Did it tell the caller dispatch was unavailable, or did it confidently quote a fake claim ID?
- View the failure matrix: see whether errors cluster on a specific region, policy tier, or time of day.
- Refine the query: narrow to dispatch_roadside SERVICE_UNAVAILABLE to look only at the failure cohort, or to a single agent if you run multiple.
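The arithmetic behind the failure rate and the clustering question can be sketched offline. The snippet below assumes you have tool-call spans as plain dictionaries with hypothetical `tool`, `error`, and `region` keys; that record shape is an assumption for illustration, not an export format Coval documents here. The data is synthetic, built only to reproduce the 12-of-340 count quoted above, and the regions are made up.

```python
from collections import Counter


def failure_rate(spans, tool_name, error_code):
    """Return (failures, total, rate) for one tool and one error code."""
    calls = [s for s in spans if s["tool"] == tool_name]
    failures = [s for s in calls if s.get("error") == error_code]
    rate = len(failures) / len(calls) if calls else 0.0
    return len(failures), len(calls), rate


def failures_by(spans, tool_name, error_code, key):
    """Count failures grouped by one metadata field, e.g. region."""
    return Counter(
        s.get(key, "unknown")
        for s in spans
        if s["tool"] == tool_name and s.get("error") == error_code
    )


# Synthetic records: 328 successes plus 12 SERVICE_UNAVAILABLE errors.
ok = {"tool": "dispatch_roadside", "error": None, "region": "west"}
bad_west = {"tool": "dispatch_roadside", "error": "SERVICE_UNAVAILABLE", "region": "west"}
bad_east = {"tool": "dispatch_roadside", "error": "SERVICE_UNAVAILABLE", "region": "east"}
spans = [ok] * 328 + [bad_west] * 9 + [bad_east] * 3

failed, total, rate = failure_rate(spans, "dispatch_roadside", "SERVICE_UNAVAILABLE")
print(f"{failed} of {total} calls failed ({rate:.1%})")  # 12 of 340 calls failed (3.5%)
print(failures_by(spans, "dispatch_roadside", "SERVICE_UNAVAILABLE", "region"))
```

Grouping by a different key (policy tier, hour of day) only requires that the field exists on each record.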
Step Four: Catch Fabrication With a Custom Trace Metric
Having drilled into one of the 12 failure cases above, we want to score automatically, across every future simulation, whether the agent fabricated data after a tool error. We do this with a custom LLM judge metric that runs over the trace, not just the transcript. Create the metric in the dashboard:
- Name: Tool-Grounded Claim Integrity
- Metric Type: Text LLM Judge
- Output Type: Binary (YES / NO)
- includeTraces=True ← this is the critical setting
Run the metric on your claims-bot simulations. The metric should:
- Return YES on the clean locksmith run from Step Two (every claim is grounded in a real span).
- Return NO on calls where dispatch_roadside errored but the agent still quoted an ETA, exactly the failure mode Trace Search surfaced in Step Three.
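To make the YES / NO criterion concrete, here is a deterministic sketch of the rule the judge is asked to apply: every value the agent states must appear in the result of a tool call that did not error. This is not how Coval's LLM judge works internally. The span dictionaries and the `claims` mapping are hypothetical shapes, and extracting claims from the transcript (the part the LLM judge handles) is assumed to have happened already.

```python
def grounded_values(spans):
    """Collect every value returned by a tool call that did not error."""
    values = set()
    for span in spans:
        if span.get("error") is None and span.get("result"):
            values.update(str(v) for v in span["result"].values())
    return values


def claim_integrity(claims, spans):
    """Return "YES" if every claimed value is backed by a tool result, else "NO".

    `claims` maps a label (e.g. "eta_minutes") to the value the agent
    stated on the call.
    """
    backed = grounded_values(spans)
    return "YES" if all(str(v) in backed for v in claims.values()) else "NO"


# Clean locksmith run: values taken from the trace table in Step Two.
clean_run = [{
    "tool": "dispatch_roadside",
    "error": None,
    "result": {"claim_id": "RA-90412", "eta_minutes": 35, "provider": "Bay Locksmith Co."},
}]
# Failure cohort: the tool errored, so there is nothing to ground an ETA in.
failed_run = [{"tool": "dispatch_roadside", "error": "SERVICE_UNAVAILABLE", "result": None}]

print(claim_integrity({"claim_id": "RA-90412", "eta_minutes": 35}, clean_run))  # YES
print(claim_integrity({"eta_minutes": 35}, failed_run))  # NO
```

A rule-based check like this is brittle around paraphrase ("about half an hour" versus 35), which is one reason to hand the comparison to an LLM judge that can read both the transcript and the trace.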
Coming Soon: Latency Metrics on Tool Calls
We’re rolling out custom trace metrics that evaluate tool call timing directly:
- “How long did tool calls take to return (avg / p50 / p95)?”
- “How long did the dispatch_roadside tool specifically take (avg / p95)?”
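Until those metrics are available, the same statistics can be approximated offline. The sketch below is not the upcoming Coval feature; it assumes spans as dictionaries with hypothetical `tool` and `latency_s` keys and uses the nearest-rank definition of a percentile. The latency values are invented for illustration, apart from the 1.8s taken from the trace table above.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(spans, tool_name=None):
    """Summarize tool-call latency in seconds, optionally for one tool."""
    latencies = [
        s["latency_s"] for s in spans
        if tool_name is None or s["tool"] == tool_name
    ]
    if not latencies:
        return None
    return {
        "avg": mean(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
    }


# Hypothetical latencies for illustration.
spans = [{"tool": "dispatch_roadside", "latency_s": x} for x in (1.2, 1.8, 2.0, 6.5)]
spans.append({"tool": "lookup_policy", "latency_s": 0.4})

print(latency_summary(spans))                       # all tool calls
print(latency_summary(spans, "dispatch_roadside"))  # one tool only
```

With only four samples, p95 collapses to the slowest call; the number becomes meaningful once it is computed across many simulations.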

