WebSocket (Voice)

Overview

WebSocket voice agents stream audio over a single persistent WebSocket connection. Coval can exchange raw binary PCM frames or JSON envelopes that wrap base64-encoded PCM or MP3 audio, and it can capture configured non-audio events (such as cart updates or session signals) that the agent emits. Use this connection type for voice agents that:

Stream audio over WebSocket rather than SIP, WebRTC, or HTTP.
Receive Coval’s Linear PCM audio at a fixed sample rate and return PCM or MP3 audio.
Optionally send structured side-events, such as cart updates or session status messages, alongside audio.

For text-only WebSocket agents (token-by-token chat), see Chat WebSocket.

Connection modes

Mode	When to use
Direct	The agent exposes a stable `wss://` URL Coval can dial directly.
HTTP-first	The agent requires an HTTP setup call to provision a per-session WebSocket URL before the audio stream begins.

In HTTP-first mode, Coval makes the configured HTTP request, extracts the WebSocket URL using websocket_url_response_path, and opens the audio WebSocket against that URL.
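
The extraction step can be pictured as a dot-path lookup over the parsed setup response. The sketch below is illustrative only and is not Coval's implementation: `resolve_path` is a hypothetical helper, and the response body is an invented example of what an agent's setup endpoint might return.

```python
import json
from typing import Any


def resolve_path(document: Any, path: str) -> Any:
    """Walk a dot-notation path such as 'data.websocket_url' through nested dicts."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path {path!r} not found at segment {key!r}")
        current = current[key]
    return current


# Hypothetical setup response; a real agent defines its own shape.
setup_response = json.loads(
    '{"data": {"websocket_url": "wss://example.com/ws/session-1"}}'
)

# With websocket_url_response_path set to "data.websocket_url":
websocket_url = resolve_path(setup_response, "data.websocket_url")
print(websocket_url)
```

If the path does not match the response shape, the lookup fails, so it is worth checking the configured path against a real setup response before running a simulation.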

Authentication

WebSocket voice agents authenticate during the WebSocket upgrade.

Authorization header — set authorization_header to the auth value Coval should send during the WebSocket upgrade. Values like Bearer <ACCESS_TOKEN> and Basic <BASE64_CREDENTIALS> are sent as the Authorization header value. Values like X-API-Key <KEY> are sent as the X-API-Key header.
Query-string token — when the agent only supports browser-style auth, encode the token directly in the endpoint, for example wss://example.com/ws?token=....
Custom headers — custom_headers accepts additional upgrade headers. In the UI, add header name/value rows. Through the API, send metadata.custom_headers as a JSON object or as a JSON-encoded object string, for example {"X-Foo":"bar"} or "{\"X-Foo\":\"bar\"}".

The UI masks the authorization header field. Do not pass JSON in authorization_header; use custom_headers for additional named headers. Tokens included directly in the endpoint query string may be visible anywhere URLs are logged, so prefer authorization_header when the agent supports it.
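
To make the header mapping concrete, here is a minimal sketch of how the two fields could combine into upgrade headers. The rule used to decide the header name (the `Bearer` and `Basic` schemes go under `Authorization`, any other leading token is treated as the header name) is an assumption inferred from the examples above, and `build_upgrade_headers` is a hypothetical helper rather than a Coval API.

```python
import json
from typing import Dict, Union

# Assumption: only these schemes are sent under the Authorization header name.
AUTHORIZATION_SCHEMES = ("Bearer", "Basic")


def build_upgrade_headers(
    authorization_header: str = "",
    custom_headers: Union[str, Dict[str, str], None] = None,
) -> Dict[str, str]:
    """Return the extra headers for the WebSocket upgrade request."""
    headers: Dict[str, str] = {}
    value = authorization_header.strip()
    if value:
        name, _, rest = value.partition(" ")
        if name in AUTHORIZATION_SCHEMES or not rest:
            # "Bearer <token>" and "Basic <credentials>" are sent whole.
            headers["Authorization"] = value
        else:
            # "X-API-Key <key>" becomes a header named X-API-Key.
            headers[name] = rest.strip()
    if isinstance(custom_headers, str):
        # The API also accepts a JSON-encoded object string.
        custom_headers = json.loads(custom_headers) if custom_headers.strip() else {}
    headers.update(custom_headers or {})
    return headers


print(build_upgrade_headers("X-API-Key <KEY>", '{"X-Foo":"bar"}'))
```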

Audio transport

Audio can be exchanged as raw PCM bytes or as JSON envelopes containing a base64-encoded audio payload. The default JSON shape is audio_chunk / data; the JSON audio preset uses audio_message / audio_bytes. Both JSON shapes are configurable per agent, and setting send_audio_template to exactly {{audio_data}} makes outbound audio raw bytes instead. Coval’s simulator only sends Linear PCM. The JSON audio preset uses:

Codec: PCM (linear)
Sample rate: 16 000 Hz
Bit depth: 16-bit
Endianness: little-endian
Channels: 1 (mono)
Recommended frame duration for peer implementations: 20-100 ms

Message frames can be bidirectional. Use audio_message_type_value to identify the agent frames that contain inbound audio, and use send_audio_template to shape Coval-originated audio frames. For the JSON audio preset, Coval sends audio_message frames with sender: "USER" and the agent should send its own audio_message frames with sender: "AI".
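
The outbound side can be sketched as template substitution plus some frame-size arithmetic. This is an illustration under stated assumptions, not Coval's code: `frame_size_bytes` and `render_audio_frame` are hypothetical names, and the 20 ms frame length is simply the low end of the recommended range.

```python
import base64
import json
from typing import Union

PLACEHOLDER = "{{audio_data}}"


def frame_size_bytes(sample_rate_hz: int, frame_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of linear PCM in one frame (16-bit mono by default)."""
    return sample_rate_hz * frame_ms * bytes_per_sample * channels // 1000


def render_audio_frame(template: str, pcm: bytes) -> Union[bytes, str]:
    """Return raw bytes for the bare placeholder, else a JSON text frame."""
    if template == PLACEHOLDER:
        return pcm  # raw binary frame, no JSON wrapping
    if PLACEHOLDER not in template:
        raise ValueError("send_audio_template must contain {{audio_data}}")
    encoded = base64.b64encode(pcm).decode("ascii")
    return template.replace(PLACEHOLDER, encoded)


pcm = bytes(frame_size_bytes(16000, 20))  # 20 ms of silence at 16 kHz
default_template = '{"type":"audio_chunk","data":"{{audio_data}}"}'
frame = render_audio_frame(default_template, pcm)
print(len(pcm), json.loads(frame)["type"])  # 640 audio_chunk
```

At 16 kHz, 16-bit mono, each millisecond of audio is 32 bytes, so the recommended 20-100 ms range corresponds to 640-3200 bytes of PCM per frame before base64 encoding.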

Audio format fields

Field	Default	Purpose
`endpoint`	– (required)	`wss://` URL Coval connects to in direct mode. Plain `ws://`, `http://`, and `https://` endpoints are rejected for direct WebSocket connections.
`connection_mode`	`direct`	`direct` or `http_first`.
`initialization_json`	empty	Optional JSON object Coval sends after the WebSocket upgrade and before any ready-message wait.
`send_audio_template`	`{"type":"audio_chunk","data":"{{audio_data}}"}`	Outbound JSON template. Must contain `{{audio_data}}`. Setting it to literally `{{audio_data}}` sends raw PCM bytes (no JSON wrapping).
`message_type_path`	`type`	Dot-notation path to the field that names the message kind.
`audio_message_type_value`	`audio_chunk`	Value that identifies an inbound audio frame. Use `*` to treat every JSON message as audio.
`audio_data_path`	`data`	Dot-notation path to the base64 audio payload inside an inbound frame.
`audio_encoding`	`pcm`	Inbound JSON audio payload encoding: `pcm` or `mp3`. MP3 frames are decoded to 16 kHz mono PCM before evaluation.
`receive_audio_channels`	`2`	`1` for mono inbound JSON PCM, `2` to keep the legacy stereo-to-mono averaging behavior.
`send_sample_rate_hertz`	`16000`	Outbound sample rate Coval sends to the agent. Allowed: 8 000, 16 000, 24 000, 48 000.
`receive_sample_rate_hertz`	`48000`	Sample rate the agent sends. Allowed: 8 000, 16 000, 24 000, 48 000.
`pipeline_sample_rate_hertz`	`16000`	Coval processing rate; must stay 16 000.
`pace_inbound_binary_audio`	inferred	Pace inbound binary PCM in real time so resampling and metrics see realistic timing. Defaults on when outbound audio is configured for raw PCM bytes and off for JSON templates.

The default receive rate is higher than the send rate because many voice integrations return higher-rate audio while receiving 16 kHz audio from Coval. Paths use dot notation for nested fields, for example payload.audio.data. Match send_sample_rate_hertz / receive_sample_rate_hertz to the agent’s actual stream format; mismatched sample rates can cause speed, pitch, or quality issues.
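
As a rough model of how the inbound fields fit together, the sketch below classifies a JSON text frame and decodes its payload. `get_path` and `extract_audio` are hypothetical helpers written for this example; they mirror the documented field semantics but are not Coval's implementation.

```python
import base64
import json
from typing import Any, Optional


def get_path(message: Any, path: str) -> Any:
    """Follow a dot-notation path; return None when any segment is missing."""
    current = message
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def extract_audio(raw_text: str,
                  message_type_path: str = "type",
                  audio_message_type_value: str = "audio_chunk",
                  audio_data_path: str = "data") -> Optional[bytes]:
    """Return decoded audio bytes, or None when the frame is not audio."""
    message = json.loads(raw_text)
    is_audio = (
        audio_message_type_value == "*"  # wildcard: every JSON message is audio
        or get_path(message, message_type_path) == audio_message_type_value
    )
    if not is_audio:
        return None
    payload = get_path(message, audio_data_path)
    if not isinstance(payload, str):
        return None
    return base64.b64decode(payload)


# Example frame with a nested payload, matching audio_data_path "payload.audio.data".
frame = json.dumps({
    "type": "audio_chunk",
    "payload": {"audio": {"data": base64.b64encode(b"\x00\x01").decode("ascii")}},
})
print(extract_audio(frame, audio_data_path="payload.audio.data"))
```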

HTTP-first setup fields

These fields apply when connection_mode is http_first:

Field	Default	Purpose
`http_url`	– (required)	`https://` setup endpoint Coval calls before opening the WebSocket.
`http_method`	`POST`	Setup request method. Allowed: `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `HEAD`, `OPTIONS`.
`http_request_body`	`{}`	JSON object body for the setup request.
`http_headers`	`{}`	JSON object of headers for the setup request.
`websocket_url_response_path`	– (required)	Dot-notation path to the WebSocket URL in the setup response, for example `data.websocket_url`.
`authorization_header`	empty	Auth value for the WebSocket upgrade after setup. This is separate from `http_headers`.
`custom_headers`	`{}`	Additional headers for the WebSocket upgrade after setup.

Handshake

Field	Default	Purpose
`handshake_ready_message_type`	`session_ready` in direct mode; empty in HTTP-first mode	Message type Coval waits for before sending audio. Set to an empty string to skip the ready-message wait.
`handshake_requires_session_id`	`true` in direct mode; `false` in HTTP-first mode	When `true`, the ready message must include `session_id`.
`handshake_timeout_seconds`	`30`	Seconds Coval waits for the ready message.

With the default message_type_path of type, a direct-mode ready message looks like:

{
  "type": "session_ready",
  "session_id": "abc123"
}

If you customize message_type_path, Coval uses that same path to find the ready-message type.
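
The ready-message check can be sketched as follows. `is_ready_message` is a hypothetical helper, and two details are assumptions: that `session_id` is read from the top level of the message (as in the example above) and that an empty value does not count as present.

```python
import json


def is_ready_message(raw_text: str,
                     ready_type: str = "session_ready",
                     message_type_path: str = "type",
                     requires_session_id: bool = True) -> bool:
    """Return True when a text frame satisfies the handshake settings."""
    try:
        message = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    current = message
    for key in message_type_path.split("."):
        if not isinstance(current, dict):
            return False
        current = current.get(key)
    if current != ready_type:
        return False
    # Assumption: session_id lives at the top level and must be non-empty.
    if requires_session_id and not message.get("session_id"):
        return False
    return True


print(is_ready_message('{"type": "session_ready", "session_id": "abc123"}'))
```

An agent that never sends a matching message leaves Coval waiting until handshake_timeout_seconds elapses, so logging the ready message on the agent side is a cheap way to confirm the configured type and path.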

Non-audio event capture

Many voice agents emit side-events alongside the audio stream — cart updates, transcript fragments, session telemetry. By default, Coval ignores non-audio JSON messages. To tell Coval which message types to accept, set:

{ "non_audio_event_message_types": ["system_notify"] }

For each matching message, Coval reads:

event_type — the value at message_type_path (for example system_notify).
event_name — the optional event field from the payload (for example ocb:cart-updated).
payload — the full parsed JSON message.

For example, when message_type_path is action and non_audio_event_message_types includes system_notify, this inbound message is accepted as a non-audio event:

{
  "action": "system_notify",
  "event": "ocb:cart-updated",
  "payload": {
    "items": [
      {
        "name": "latte",
        "quantity": 1,
        "modifiers": ["oat milk"],
        "price": 4.5
      }
    ],
    "subtotal": 4.5,
    "total": 4.5
  }
}

Coval does not emit these events to your agent. It only receives configured message types and ignores unconfigured non-audio JSON messages. Accepted event messages are stored with the simulation transcript as websocket_event entries. Transcript-based metrics, including LLM judge metrics, see JSON that includes event_type, event_name, and the full payload, so they can evaluate structured side-channel data such as cart contents, selected menu items, modifiers, quantities, and prices alongside the spoken conversation.
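
The filtering and field extraction described above can be approximated like this. `capture_event` is a hypothetical helper, and the returned dictionary only illustrates the three documented fields; the exact shape of a stored websocket_event entry is not specified here.

```python
import json
from typing import Any, Dict, List, Optional


def capture_event(raw_text: str,
                  message_type_path: str,
                  accepted_types: List[str]) -> Optional[Dict[str, Any]]:
    """Return an event record for configured message types, else None."""
    message = json.loads(raw_text)
    event_type: Any = message
    for key in message_type_path.split("."):
        event_type = event_type.get(key) if isinstance(event_type, dict) else None
    if event_type not in accepted_types:
        return None  # unconfigured non-audio messages are ignored
    return {
        "event_type": event_type,
        "event_name": message.get("event"),  # optional; None when absent
        "payload": message,                  # the full parsed JSON message
    }


raw = json.dumps({
    "action": "system_notify",
    "event": "ocb:cart-updated",
    "payload": {"items": [{"name": "latte", "quantity": 1}], "total": 4.5},
})
event = capture_event(raw, "action", ["system_notify"])
print(event["event_type"], event["event_name"])
```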

Media (image) frames

Voice WebSocket simulations can attach images from a test case mid-conversation. send_media_template controls the outbound shape:

{
  "type": "media",
  "name": "{{media_name}}",
  "mime_type": "{{mime_type}}",
  "data": "{{media_data}}"
}

Template rules:

{{media_data}} is required.
{{media_name}} and {{mime_type}} are optional placeholders.
If the template is exactly {{media_data}}, Coval sends raw bytes.
Otherwise, Coval base64-encodes the image and substitutes it into your JSON template.
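
These rules can be sketched as a small rendering function. `render_media_frame` is a hypothetical helper, and escaping the file name and MIME type for JSON is an assumption made here so that the example always produces valid JSON; how Coval handles special characters in those values is not documented in this section.

```python
import base64
import json
from typing import Union

MEDIA_DATA = "{{media_data}}"


def render_media_frame(template: str, image: bytes,
                       media_name: str = "", mime_type: str = "") -> Union[bytes, str]:
    """Return raw bytes for the bare placeholder, else a JSON text frame."""
    if template == MEDIA_DATA:
        return image  # raw bytes, no JSON wrapping
    if MEDIA_DATA not in template:
        raise ValueError("send_media_template must contain {{media_data}}")
    # json.dumps(...)[1:-1] escapes quotes and backslashes for use inside a JSON string.
    return (
        template
        .replace(MEDIA_DATA, base64.b64encode(image).decode("ascii"))
        .replace("{{mime_type}}", json.dumps(mime_type)[1:-1])
        .replace("{{media_name}}", json.dumps(media_name)[1:-1])
    )


template = (
    '{"type":"media","name":"{{media_name}}",'
    '"mime_type":"{{mime_type}}","data":"{{media_data}}"}'
)
frame = json.loads(render_media_frame(template, b"\x89PNG", "receipt.png", "image/png"))
print(frame["name"], frame["mime_type"])
```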

See Test Sets — image attachments for the test-case side.

Examples

Initialization payload:

{
  "action": "start_session",
  "session_type": "simulation",
  "metadata": {
    "source": "coval",
    "test_mode": true
  }
}

Custom WebSocket upgrade headers:

{
  "X-Client-ID": "coval-simulation",
  "X-API-Version": "2024-01",
  "X-Environment": "production"
}

JSON audio preset

The agent UI ships a JSON audio preset that fills the metadata for JSON audio WebSocket agents. It sets:

{
  "connection_mode": "direct",
  "websocket_compat_profile": "json_audio",
  "initialization_json": "",
  "handshake_ready_message_type": "",
  "handshake_requires_session_id": false,
  "send_sample_rate_hertz": 16000,
  "receive_sample_rate_hertz": 16000,
  "send_audio_template": "{\"action\":\"audio_message\",\"payload\":{},\"audio_bytes\":\"{{audio_data}}\",\"sender\":\"USER\"}",
  "message_type_path": "action",
  "audio_message_type_value": "audio_message",
  "audio_data_path": "audio_bytes",
  "audio_encoding": "pcm",
  "receive_audio_channels": 1,
  "non_audio_event_message_types": ["system_notify"]
}
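
Before saving metadata like the preset above, it can help to check it against the allowed values listed under Audio format fields. The validator below is a local convenience sketch with a hypothetical name (`validate_metadata`); it covers only a handful of fields and does not reproduce whatever validation Coval performs server-side.

```python
from typing import Any, Dict, List

ALLOWED_RATES = {8000, 16000, 24000, 48000}
DEFAULT_TEMPLATE = '{"type":"audio_chunk","data":"{{audio_data}}"}'


def validate_metadata(metadata: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    if metadata.get("connection_mode", "direct") not in {"direct", "http_first"}:
        problems.append("connection_mode must be direct or http_first")
    for field, default in (("send_sample_rate_hertz", 16000),
                           ("receive_sample_rate_hertz", 48000)):
        if metadata.get(field, default) not in ALLOWED_RATES:
            problems.append(f"{field} must be one of {sorted(ALLOWED_RATES)}")
    if metadata.get("pipeline_sample_rate_hertz", 16000) != 16000:
        problems.append("pipeline_sample_rate_hertz must stay 16000")
    if metadata.get("audio_encoding", "pcm") not in {"pcm", "mp3"}:
        problems.append("audio_encoding must be pcm or mp3")
    if metadata.get("receive_audio_channels", 2) not in {1, 2}:
        problems.append("receive_audio_channels must be 1 or 2")
    if "{{audio_data}}" not in metadata.get("send_audio_template", DEFAULT_TEMPLATE):
        problems.append("send_audio_template must contain {{audio_data}}")
    return problems


preset = {
    "connection_mode": "direct",
    "send_sample_rate_hertz": 16000,
    "receive_sample_rate_hertz": 16000,
    "send_audio_template": '{"action":"audio_message","payload":{},'
                           '"audio_bytes":"{{audio_data}}","sender":"USER"}',
    "audio_encoding": "pcm",
    "receive_audio_channels": 1,
}
print(validate_metadata(preset))  # []
```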

Set authorization_header to Bearer <ACCESS_TOKEN> after picking the preset if the agent requires auth (most production endpoints do).

Setup

Prepare the agent endpoint. Confirm that the wss:// endpoint is reachable and that the audio format matches the configuration above, and decide whether the agent requires Bearer auth.
Create the agent in Coval. Open the Agents page in your Coval org, choose WebSocket as the connection type, and either fill the fields manually or apply the JSON audio preset.
Smoke test. Build a small test set with a single voice persona and run a simulation. The transcript should show alternating turns, the result page should expose usable audio, and any configured side-events should be available to transcript-based metrics.

How simulations work

Coval performs any HTTP-first setup, then opens the WebSocket with any configured Bearer token or custom headers.
If handshake_ready_message_type is set, Coval waits for the ready message before sending audio.
Coval streams persona audio outward using send_audio_template at the configured sample rate: raw PCM bytes when the template is exactly {{audio_data}}, or JSON text frames for any JSON template.
Inbound binary frames or matching JSON audio frames are decoded and resampled if needed.
Inbound non-audio JSON messages whose type is in non_audio_event_message_types are accepted; unconfigured non-audio messages are ignored.
When the persona finishes, Coval closes the WebSocket cleanly.

Troubleshooting

Empty transcript with audio frames flowing. Check that audio_message_type_value matches the agent’s field, that audio_data_path points at the base64 payload, and that audio_encoding matches the wire format.
Inbound audio sounds half-speed or distorted. Confirm receive_audio_channels. JSON PCM that arrives mono should be configured with receive_audio_channels: 1; the historical default 2 averages two channels and halves the apparent rate when the source is mono.
Cart events / status messages look ignored. Add the action value to non_audio_event_message_types. Without it, Coval ignores non-audio JSON messages.
Auth failures during handshake. Verify the authorization_header value, or move the token to a ?token=... query string when the agent only supports browser-style auth.
Connection refused locally. Tunnel the agent’s ws:// server through ngrok or Cloudflare Tunnel and use the resulting wss:// URL as the agent endpoint.

ngrok http 8080
# Use the generated wss:// URL as the agent endpoint.

If your tunnel provider shows an https:// URL, use the corresponding wss:// URL in Coval. Update the agent configuration when the tunnel URL changes, or use a reserved tunnel domain for a stable endpoint.
Unreadable audio or media payloads. For JSON audio/media templates, Coval substitutes base64 data into {{audio_data}} / {{media_data}}; for raw templates, the agent must expect raw PCM or media bytes. Verify the JSON is valid, the configured message fields match the agent payload, audio_encoding is correct, and send_media_template includes {{media_name}} / {{mime_type}} when the agent needs file metadata.
Timeouts or no response. Confirm the agent keeps the WebSocket open for the whole conversation, processes incoming audio frames without blocking, sends audio responses in the configured shape, and logs initialization / ready messages while testing.

Best practices

Pick the JSON audio preset (or a similar named preset) instead of hand-filling fields when one exists. It keeps the metadata canonical for the agent shape.
Mirror the agent’s sample rate exactly in send_sample_rate_hertz / receive_sample_rate_hertz. Resampling is supported but degrades audio.
Capture the side-events you care about by adding their action values to non_audio_event_message_types. The agent emitting them is not enough, because Coval ignores message types that are not configured.
Keep the agent’s WebSocket handler long-lived and avoid closing the connection while the simulation is active.
Log initialization payloads, ready messages, and payload parsing errors during initial setup.
Rotate Bearer tokens on a schedule; Coval re-reads the value at every connection setup.
