
Overview

The human review workflow in Coval lets you manually review conversations or runs to verify quality and catch issues that automated metrics miss. This guide walks you through conducting reviews and providing actionable feedback.
Pro Tip: Regular human reviews are essential for maintaining high-quality AI interactions and identifying areas for improvement in your agent’s performance.

Getting Started with Human Review

1. Select a Run or Conversation

[Screenshot: Runs Page]
  1. Navigate to either the Simulations or Conversations page
  2. Choose the specific run or conversation you want to review
  3. Use keyboard shortcuts to navigate (summarized in the sketch after this list):
    • j / k, w / s, or up / down to move through rows
    • Enter to open the selected run or conversation
    • Once you are in a result view, h / l, a / d, or left / right move between neighboring conversations
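For reference, the shortcuts above amount to a small key-to-action map. The sketch below is purely illustrative: the type and action names are not Coval identifiers, and the up/down direction assignments assume the usual vim and WASD conventions.

```typescript
// Illustrative only: the shortcuts above expressed as key → action maps.
// Action names are hypothetical; directions assume vim/WASD conventions.
type ReviewAction =
  | "prevRow"
  | "nextRow"
  | "open"
  | "prevConversation"
  | "nextConversation";

// Row navigation on the Simulations / Conversations list pages.
const listShortcuts: Record<string, ReviewAction> = {
  k: "prevRow", w: "prevRow", ArrowUp: "prevRow",
  j: "nextRow", s: "nextRow", ArrowDown: "nextRow",
  Enter: "open",
};

// Moving between neighboring conversations inside a result view.
const resultViewShortcuts: Record<string, ReviewAction> = {
  h: "prevConversation", a: "prevConversation", ArrowLeft: "prevConversation",
  l: "nextConversation", d: "nextConversation", ArrowRight: "nextConversation",
};
```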
2. Review the Content

[Screenshot: Review Interface]
  • Review the agent’s performance against each metric
    • Determine the correct value based on your assessment
    • Update the metric if the automated value is incorrect
  • Add notes to the run to provide feedback
    • Notes can be dragged and positioned anywhere in the review interface
3. Track Reviewed Content

[Screenshot: Human Eval Page]
  • Reviewed or partially reviewed content automatically appears in the Human Eval page
  • View all your reviewed runs from both simulations and conversations

Supported Metric Types

Not all metrics support human review — only those with a defined annotation mechanism can be labeled in the review interface. Metrics fall into four categories based on how reviewers interact with them.

Direct Value Metrics

Reviewers provide a single value for the entire conversation using buttons, a number input, or a dropdown.

Binary (Pass/Fail)

Reviewers select Yes, No, or N/A using on-screen buttons or keyboard shortcuts.
  • Applies to: binary LLM judge metrics, audio binary judge, agent repeats itself

Numerical

Reviewers enter a number within a configured min/max range.
  • Applies to: numerical LLM judge, audio numerical judge

Categorical

Reviewers select from a configured list of categories using a dropdown.
  • Applies to: categorical LLM judge, audio categorical judge
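Conceptually, the three direct-value types above differ only in the shape of the value a reviewer supplies. The sketch below models that distinction in TypeScript purely for illustration; it is an assumption about the shape of the data, not Coval's actual API or schema.

```typescript
// Hypothetical shapes for direct-value human annotations; illustrative only.
type DirectValueAnnotation =
  | { kind: "binary"; value: "yes" | "no" | "n/a" }                 // Pass/Fail buttons or shortcuts
  | { kind: "numerical"; value: number; min: number; max: number }  // number input within a configured range
  | { kind: "categorical"; value: string; options: string[] };      // dropdown over configured categories

// Example: a reviewer overrides an automated binary judge result with "No".
const reviewerOverride: DirectValueAnnotation = { kind: "binary", value: "no" };
```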

Transcript Sentiment Analysis

Reviewers select a sentiment label (e.g. Rude, Polite, Encouraging, Professional) using category buttons.

Composite Evaluation

Reviewers assess each criterion individually using MET / NOT_MET / UNKNOWN toggles.
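Because composite evaluation is reviewed criterion by criterion, a completed review can be thought of as a map from criterion names to one of the three verdicts. The following is a hypothetical sketch; the type names and example criteria are assumptions, not Coval's schema.

```typescript
// Hypothetical shape for a composite-evaluation review; illustrative only.
type CriterionVerdict = "MET" | "NOT_MET" | "UNKNOWN";

interface CompositeReview {
  // One verdict per configured criterion, toggled individually by the reviewer.
  criteria: Record<string, CriterionVerdict>;
}

const compositeExample: CompositeReview = {
  criteria: {
    "greeted the caller": "MET",
    "confirmed account details": "NOT_MET",
    "offered a callback": "UNKNOWN",
  },
};
```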

Audio Region Metrics

Reviewers mark or edit regions on an audio waveform timeline. These metrics require an audio recording to be present on the conversation. Includes: interruption rate, latency, abrupt pitch changes, volume/pitch misalignment, non-expressive pauses, vocal fry, music detection, time to first audio, volume variance, custom pause analysis, agent needs reprompting.
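Since audio region metrics are reviewed by marking time spans on the waveform, each annotation is essentially a labeled interval. Here is a hypothetical sketch of that shape; the field names are assumptions made for illustration, not Coval's schema.

```typescript
// Hypothetical shape for a waveform region annotation; illustrative only.
interface AudioRegion {
  startSec: number; // region start, in seconds from the start of the recording
  endSec: number;   // region end, in seconds
  metric: string;   // e.g. "interruption rate", "latency", "vocal fry"
  note?: string;    // optional reviewer comment on this region
}

// Example: a reviewer marks a short span where the agent talked over the caller.
const markedInterruption: AudioRegion = {
  startSec: 12.4,
  endSec: 13.1,
  metric: "interruption rate",
  note: "agent talked over the caller",
};
```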

Per-Segment Labeling

Reviewers assign a label to each speaking segment in the conversation.
  • Audio sentiment — label each segment as Neutral, Angry, Happy, or Sad

Per-Message Review

Reviewers provide a value for each individual message in the transcript.
  • Words per message — count of words per assistant message
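Per-segment labeling and per-message review both attach a value to individual units of the conversation rather than to the conversation as a whole: a sentiment label per speaking segment, or a number per assistant message. The sketch below illustrates both shapes; the type and field names are assumptions, not Coval's schema.

```typescript
// Hypothetical per-unit annotation shapes; illustrative only.
type SegmentSentiment = "Neutral" | "Angry" | "Happy" | "Sad";

interface SegmentLabel {
  segmentIndex: number;        // which speaking segment in the conversation
  sentiment: SegmentSentiment; // audio sentiment label chosen by the reviewer
}

interface MessageValue {
  messageIndex: number; // which assistant message in the transcript
  value: number;        // e.g. a words-per-message count for that message
}

const reviewedSegments: SegmentLabel[] = [
  { segmentIndex: 0, sentiment: "Neutral" },
  { segmentIndex: 1, sentiment: "Happy" },
];

const wordsPerMessage: MessageValue[] = [
  { messageIndex: 0, value: 24 },
  { messageIndex: 2, value: 51 },
];
```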

Next Steps

After reviewing runs, you can:
  1. Improve Your Agent
    • Use the feedback to update prompts and capabilities
    • Run new simulations to test improvements
  2. Refine Your Metrics
    • Test metric changes in simulations before deploying
    • Use Create Metrics to update existing metrics or test new ones
  3. Assign More Reviews
    • Delegate runs to team members for additional review
    • Track review progress in the Human Eval page
Continuous Improvement: Use these insights to iteratively enhance both your agent and metrics, creating a feedback loop that drives better performance.
For the full keyboard model, see the Keyboard Navigation guide.