# Eval Infrastructure Guide
Quantitative skill evaluation using parallel testing, grading, and human-in-the-loop feedback.
## Overview
Eval infrastructure tests skills via:
1. **Trigger accuracy** — Does the skill activate on the correct queries?
2. **Output quality** — Do outputs meet assertions?
3. **Performance comparison** — With-skill vs baseline metrics
## Workspace Structure
```
<skill-name>-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/outputs/
│   │   ├── without_skill/outputs/
│   │   └── eval_metadata.json
│   ├── eval-1-another-test/
│   ├── benchmark.json
│   ├── benchmark.md
│   └── timing.json
├── iteration-2/
└── feedback.json
```
## Step-by-Step Evaluation
### 1. Create Test Cases
Write `evals/evals.json`:
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 0,
      "prompt": "User task description",
      "expected_output": "What correct output looks like",
      "files": [],
      "assertions": [
        {"id": "a-1", "text": "Output is valid JSON"},
        {"id": "a-2", "text": "All input rows present in output"}
      ]
    }
  ]
}
```
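A minimal sketch of loading and sanity-checking this file before running evals (`load_evals` is a hypothetical helper, not part of the skill's scripts; it assumes the schema shown above):

```python
import json

def load_evals(path="evals/evals.json"):
    """Load the eval suite and check that each case has the required fields."""
    with open(path) as f:
        suite = json.load(f)
    for case in suite["evals"]:
        # Every case needs a prompt and at least the assertion scaffolding
        assert "prompt" in case and "assertions" in case, f"eval {case.get('id')} incomplete"
        for a in case["assertions"]:
            assert a["id"] and a["text"], "assertions need non-empty id and text"
    return suite
```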
### 2. Spawn Parallel Runs (CRITICAL)
**MUST** spawn with-skill AND baseline runs simultaneously, in the same turn.
- Sequential spawning = unfair timing comparison
- Capture timing data from subagent notifications immediately (the only opportunity)
- Draft assertions while the runs execute
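The simultaneous spawn can be sketched as follows, assuming each run reduces to a shell command (the command names and result shape are hypothetical; real subagent spawning works differently, but the fairness property is the same — both runs start together):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_eval(cmd):
    """Run one eval command, capturing wall-clock timing immediately."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {"elapsed_ms": int((time.monotonic() - start) * 1000),
            "returncode": result.returncode}

def spawn_pair(with_cmd, baseline_cmd):
    """Launch with-skill and baseline runs together for a fair timing comparison."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_eval, with_cmd),
                   pool.submit(run_eval, baseline_cmd)]
        return [f.result() for f in futures]
```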
### 3. Grade Outputs
Use grader agent template (`agents/grader.md`):
- Evaluates outputs against assertions
- Returns pass/fail with evidence for each assertion
- Output: `grading.json`
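The exact `grading.json` schema isn't specified here; assuming a shape like `{"results": [{"id": "a-1", "pass": true, "evidence": "..."}]}`, tallying a pass rate from it might look like:

```python
import json

def pass_rate(grading_path):
    """Fraction of assertions that passed, per the assumed grading.json shape."""
    with open(grading_path) as f:
        results = json.load(f)["results"]
    if not results:
        return 0.0
    return sum(1 for r in results if r["pass"]) / len(results)
```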
### 4. Aggregate Results
Run `scripts/aggregate_benchmark.py`:
- Consolidates multiple run results
- Calculates mean, stddev, min, max per metric
- Generates `benchmark.json` + `benchmark.md`
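The aggregation step amounts to per-metric descriptive statistics; a sketch (the per-run dict shape is an assumption — `aggregate_benchmark.py`'s real input format may differ):

```python
from statistics import mean, stdev

def aggregate(runs):
    """Consolidate per-run metric dicts into mean/stddev/min/max per metric.

    `runs` is a list of dicts like {"pass_rate": 0.8, "tokens_used": 1200, ...}.
    """
    metrics = {}
    for key in runs[0]:
        values = [r[key] for r in runs]
        metrics[key] = {
            "mean": mean(values),
            "stddev": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return metrics
```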
### 5. Launch Viewer
Run `scripts/generate_review.py`:
- Interactive HTML with two tabs:
  - **Outputs** — qualitative review, feedback textbox, prev/next
  - **Benchmark** — quantitative metrics, analyst observations
- Auto-saves feedback to `feedback.json`
### 6. Iterate
Read `feedback.json`, generalize from patterns:
- Don't overfit to test examples
- Keep prompts lean — remove ineffective instructions
- Scale test set to 5-10 cases for production skills
## Assertion Design
**Good (objective, discriminating):**
- "Output is valid JSON"
- "All input rows present in output"
- "Execution completes in <5 seconds"
**Bad (subjective, non-discriminating):**
- "Output is well-written" (subjective)
- "Skill executes" (passes with or without skill)
- "Output file exists" (too vague)
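A useful property of objective assertions is that they can often be checked mechanically rather than by a grader. A sketch for the first two good assertions above (helper names are hypothetical):

```python
import json

def is_valid_json(text):
    """Check: 'Output is valid JSON'."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def all_rows_present(input_rows, output_rows):
    """Check: 'All input rows present in output' (order-insensitive, multiset containment)."""
    remaining = list(output_rows)
    for row in input_rows:
        if row not in remaining:
            return False
        remaining.remove(row)
    return True
```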
## Performance Metrics
| Metric | Description |
|--------|-------------|
| pass_rate | Fraction of assertions passing (0.0-1.0) |
| tokens_used | Total input+output tokens |
| execution_time_ms | Wall-clock duration |
| tool_calls | Number of tool invocations |
| files_created | Output file count |
**Expected improvements:**
- Code generation: +40-70% pass rate, -20-30% tokens
- Data processing: +50-80% pass rate, -30-50% time
- Analysis: +30-50% pass rate
## Environment Adaptations
### Claude Code (Full)
- Spawn parallel with+without runs
- Full benchmarking + viewer
- Description optimization available
### Claude.ai (No subagents)
- Run tests sequentially
- Skip baseline runs
- Skip quantitative benchmarking
### Cowork (No browser)
- Use `--static <output_path>` for standalone HTML
- Download feedback.json from viewer