Overview
The human review workflow in Coval allows you to manually review conversations or runs to ensure quality and identify areas for improvement. This guide walks you through conducting reviews and providing actionable feedback.

Pro Tip: Regular human reviews are essential for maintaining high-quality AI interactions and identifying areas for improvement in your agent's performance.
Getting Started with Human Review
Select a Run or Conversation

- Navigate to either the Simulations or Conversations pages
- Choose the specific run or conversation you want to review
- Use keyboard shortcuts to navigate:
  - `j`/`k`, `w`/`s`, or `up`/`down` to move through rows
  - `Enter` to open the selected run or conversation
  - Once you are in a result view, `h`/`l`, `a`/`d`, or `left`/`right` move between neighboring conversations
Review the Content

- Review the agent’s performance against each metric
- Determine the correct value based on your assessment
- Update the metric if the automated value is incorrect
- Add notes to the run to provide feedback
- Notes can be dragged and positioned anywhere in the review interface
Supported Metric Types
Not all metrics support human review: only those with a defined annotation mechanism can be labeled in the review interface. Metrics fall into four categories based on how reviewers interact with them.

Direct Value Metrics
Reviewers provide a single value for the entire conversation using buttons, a number input, or a dropdown.

Binary (Pass/Fail)
Reviewers select Yes, No, or N/A using on-screen buttons or keyboard shortcuts.
- Applies to: binary LLM judge metrics, audio binary judge, agent repeats itself
Numerical
Reviewers enter a number within a configured min/max range.
- Applies to: numerical LLM judge, audio numerical judge
Categorical
Reviewers select from a configured list of categories using a dropdown.
- Applies to: categorical LLM judge, audio categorical judge
Transcript Sentiment Analysis
Reviewers select a sentiment label (e.g. Rude, Polite, Encouraging, Professional) using category buttons.

Composite Evaluation
Reviewers assess each criterion individually using MET / NOT_MET / UNKNOWN toggles.

Audio Region Metrics
Reviewers mark or edit regions on an audio waveform timeline. These metrics require an audio recording to be present on the conversation. Includes: interruption rate, latency, abrupt pitch changes, volume/pitch misalignment, non-expressive pauses, vocal fry, music detection, time to first audio, volume variance, custom pause analysis, agent needs reprompting.

Per-Segment Labeling
Reviewers assign a label to each speaking segment in the conversation.
- Audio sentiment: label each segment as Neutral, Angry, Happy, or Sad
Per-Message Review
Reviewers provide a value for each individual message in the transcript.
- Words per message: count of words per assistant message
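The direct-value and composite categories above can be sketched as simple validation rules on reviewer input. This is an illustrative sketch only, not Coval's actual API: the metric representation and field names (`kind`, `min`, `max`, `categories`, `criteria`) are assumptions made for the example.

```python
def validate_label(metric, value):
    """Check that a human-review label matches its metric's annotation mechanism.

    `metric` is a hypothetical dict describing the metric's configuration;
    the field names here are assumptions, not Coval's schema.
    """
    kind = metric["kind"]
    if kind == "binary":
        # Binary metrics accept Yes, No, or N/A
        return value in ("Yes", "No", "N/A")
    if kind == "numerical":
        # Numerical metrics accept a number within the configured min/max range
        return isinstance(value, (int, float)) and metric["min"] <= value <= metric["max"]
    if kind == "categorical":
        # Categorical metrics accept one of the configured categories
        return value in metric["categories"]
    if kind == "composite":
        # Composite metrics take one verdict per configured criterion
        return (set(value) == set(metric["criteria"])
                and all(v in ("MET", "NOT_MET", "UNKNOWN") for v in value.values()))
    raise ValueError(f"unknown metric kind: {kind}")
```

For example, a reviewer entering 7 on a numerical metric configured with a 1–5 range would be rejected, while a composite label is accepted only if every configured criterion receives a verdict.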
Next Steps
After reviewing runs, you can:
- Improve Your Agent
- Use the feedback to update prompts and capabilities
- Run new simulations to test improvements
- Refine Your Metrics
- Test metric changes in simulations before deploying
- Use create metrics to update or test new metrics
- Assign More Reviews
- Delegate runs to team members for additional review
- Track review progress in the Human Eval page
Continuous Improvement: Use these insights to iteratively enhance both your agent and metrics, creating a feedback loop that drives better performance.


