## Install
## Skills vs MCP vs CLI
| | Skills | MCP Server | CLI |
|---|---|---|---|
| What it provides | Knowledge (how to evaluate well) | Tools (execute operations) | Operations (run from terminal) |
| Install | `npx skills add coval-ai/coval-external-skills` | `npx coval-mcp` | `brew install coval-ai/tap/coval` |
| Use when | Agent needs to design evaluations | Agent needs to run evaluations natively | Scripting, CI/CD, any terminal |
| Works with | Any agent supporting skills | MCP-compatible clients | Any shell environment |
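The install commands from the table can be run directly from a terminal. A typical setup uses only one of them; they are shown together here for reference:

```shell
# Add the skills bundle to a skills-capable agent
npx skills add coval-ai/coval-external-skills

# Or run the MCP server for MCP-compatible clients
npx coval-mcp

# Or install the standalone CLI via Homebrew
brew install coval-ai/tap/coval
```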
## Available Skills
### Onboarding

| Skill | Description |
|---|---|
| onboard | Interactive guided setup for your first evaluation: walks through connecting an agent, creating personas, building test cases, selecting metrics, and launching a run |
### Runs
| Skill | Description |
|---|---|
| launch-run | Launch an evaluation run against an AI agent |
| watch-run | Monitor a run’s progress with live status updates |
| quick-eval | Full workflow: launch, watch, and summarize results in one go |
### Simulations
| Skill | Description |
|---|---|
| get-results | Retrieve and analyze simulation results from a run |
| download-audio | Download audio recordings from voice simulations |
### Resources
| Skill | Description |
|---|---|
| coval-resources | Complete reference for all Coval resources, their hierarchy, relationships, API endpoints, and ID formats |
### Dashboards
| Skill | Description |
|---|---|
| create-dashboard | Create a new dashboard and populate it with metric widgets |
| add-widget | Add a chart, table, or text widget to a dashboard |
| manage-dashboard | Get, update, or delete a dashboard |
| manage-widgets | List, update, resize, or delete widgets |
| list-dashboards | List all dashboards with filtering |
### Test Cases
| Skill | Description |
|---|---|
| huggingface-import | Import datasets from HuggingFace and convert them to Coval test sets |
### Migrations
| Skill | Description |
|---|---|
| migrate-bluejay | Migrate configuration from Bluejay voice AI testing platform to Coval |
### Human Review
| Skill | Description |
|---|---|
| review-llm-annotations-and-improve-prompt | Calculate agreement between human and machine labels, then propose improved metric prompts |
## How Skills Work
Skills use progressive disclosure to stay lightweight:

- At startup (~100 tokens per skill): only the `name` and `description` are loaded
- When activated (under 5,000 tokens): the full skill instructions load when your agent detects a relevant task
- On demand: reference files (templates, examples) load only when needed
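The staged loading above can be sketched with plain files. This is a minimal illustration only; the `skills/quick-eval` directory layout and file names are assumptions, not Coval's actual on-disk format:

```shell
# Illustrative sketch of progressive disclosure (layout is assumed,
# not Coval's real format).
mkdir -p skills/quick-eval

# Lightweight metadata: the only part loaded at startup (~100 tokens)
printf 'name: quick-eval\ndescription: Launch, watch, and summarize results\n' \
  > skills/quick-eval/metadata.yaml

# Full instructions: read only when the agent activates the skill
printf 'Full step-by-step instructions for the quick-eval workflow...\n' \
  > skills/quick-eval/SKILL.md

# Startup cost: read just the metadata
cat skills/quick-eval/metadata.yaml

# Activation: load the full instructions on demand
cat skills/quick-eval/SKILL.md
```

Reference files (templates, examples) would sit alongside `SKILL.md` and be read only when a step actually needs them.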