# Eval Infrastructure Guide

Quantitative skill evaluation using parallel testing, grading, and human-in-the-loop feedback.

## Overview

The eval infrastructure tests skills on three axes:

1. **Trigger accuracy** — Does the skill activate on the right queries?
2. **Output quality** — Do outputs meet the assertions?
3. **Performance comparison** — With-skill vs. baseline metrics

## Workspace Structure

```
<skill-name>-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/outputs/
│   │   ├── without_skill/outputs/
│   │   └── eval_metadata.json
│   ├── eval-1-another-test/
│   ├── benchmark.json
│   ├── benchmark.md
│   └── timing.json
├── iteration-2/
└── feedback.json
```
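
For orientation, `eval_metadata.json` might look like the sketch below. The field names are an assumption; the guide does not pin down the schema.

```python
# Hypothetical shape of eval_metadata.json (field names assumed, not documented).
eval_metadata = {
    "eval_id": 0,
    "prompt": "User task description",
    "with_skill": {"status": "completed", "execution_time_ms": 42150},
    "without_skill": {"status": "completed", "execution_time_ms": 63800},
}
```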
## Step-by-Step Evaluation

### 1. Create Test Cases

Write `evals/evals.json`:

```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 0,
      "prompt": "User task description",
      "expected_output": "What correct output looks like",
      "files": [],
      "assertions": [
        {"id": "a-1", "text": "Output is valid JSON"},
        {"id": "a-2", "text": "All input rows present in output"}
      ]
    }
  ]
}
```
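
A quick structural check before spawning runs catches malformed test cases early. A minimal sketch, assuming the field set shown above is the full required set:

```python
import json

REQUIRED_EVAL_KEYS = {"id", "prompt", "expected_output", "files", "assertions"}

def validate_evals(path: str = "evals/evals.json") -> None:
    """Fail fast on structural problems in the test-case file."""
    with open(path) as f:
        data = json.load(f)  # raises ValueError on invalid JSON
    assert "skill_name" in data and "evals" in data, "missing top-level keys"
    for ev in data["evals"]:
        missing = REQUIRED_EVAL_KEYS - ev.keys()
        assert not missing, f"eval {ev.get('id')}: missing {missing}"
        for a in ev["assertions"]:
            assert {"id", "text"} <= a.keys(), f"bad assertion in eval {ev['id']}"

validate_evals()
```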
### 2. Spawn Parallel Runs (CRITICAL)

**MUST** spawn the with-skill AND baseline runs simultaneously, in the same turn (see the sketch after this list).

- Sequential spawning makes the timing comparison unfair
- Capture timing data from subagent notifications immediately (this is the only opportunity)
- Draft assertions while the runs execute
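
In plain Python terms, "same turn" means submitting both runs before waiting on either. A sketch with a hypothetical `run_agent` stub standing in for the environment's real subagent API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_agent(prompt: str, use_skill: bool) -> dict:
    """Hypothetical stand-in for the environment's subagent API."""
    time.sleep(0.1)  # simulate agent work
    return {"use_skill": use_skill}

def run_eval_pair(prompt: str) -> dict:
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Submit BOTH runs before waiting on either, so they start together.
        with_skill = pool.submit(run_agent, prompt, True)
        baseline = pool.submit(run_agent, prompt, False)
        results = {"with_skill": with_skill.result(),
                   "without_skill": baseline.result()}
    # Record timing immediately; it cannot be recovered later.
    results["wall_clock_ms"] = int((time.monotonic() - start) * 1000)
    return results
```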
### 3. Grade Outputs

Use the grader agent template (`agents/grader.md`):

- Evaluates outputs against the assertions
- Returns pass/fail with evidence for each assertion
- Output: `grading.json` (a possible shape is sketched below)
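
The grader's output might take the shape below; the field names are an assumption built on the assertion format from step 1.

```python
# Hypothetical shape of grading.json (field names assumed, not documented).
grading = {
    "eval_id": 0,
    "run": "with_skill",
    "results": [
        {"assertion_id": "a-1", "passed": True,
         "evidence": "json.loads() succeeded on the output file"},
        {"assertion_id": "a-2", "passed": False,
         "evidence": "rows 41-43 missing from output"},
    ],
    "pass_rate": 0.5,  # 1 of 2 assertions passed
}
```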
### 4. Aggregate Results

Run `scripts/aggregate_benchmark.py`:

- Consolidates results from multiple runs
- Calculates mean, stddev, min, and max per metric (see the sketch below)
- Generates `benchmark.json` + `benchmark.md`
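
At its core, aggregation reduces a list of per-run metric values to summary statistics. A minimal sketch of that reduction (not the actual contents of `scripts/aggregate_benchmark.py`):

```python
import statistics

def summarize(runs: list[dict]) -> dict:
    """Per-metric mean/stddev/min/max across runs, e.g.
    runs = [{"pass_rate": 0.8, "tokens_used": 1200},
            {"pass_rate": 1.0, "tokens_used": 950}]."""
    metrics = {key for run in runs for key in run}
    summary = {}
    for m in sorted(metrics):
        values = [run[m] for run in runs if m in run]
        summary[m] = {
            "mean": statistics.mean(values),
            "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary
```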
### 5. Launch Viewer

Run `scripts/generate_review.py`:

- Interactive HTML with two tabs:
  - **Outputs** — qualitative review, feedback textbox, prev/next
  - **Benchmark** — quantitative metrics, analyst observations
- Auto-saves feedback to `feedback.json`
### 6. Iterate

Read `feedback.json` (a possible shape is sketched below) and generalize from the patterns it contains:

- Don't overfit to the test examples
- Keep prompts lean — remove instructions that prove ineffective
- Scale the test set to 5-10 cases for production skills
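
A plausible shape for `feedback.json`, assumed for illustration (the viewer defines the real schema):

```python
# Hypothetical shape of feedback.json (the viewer defines the real schema).
feedback = {
    "iteration": 1,
    "entries": [
        {"eval_id": 0, "run": "with_skill",
         "note": "Output correct but verbose; trim the preamble."},
    ],
}
```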
## Assertion Design

**Good (objective, discriminating):**

- "Output is valid JSON"
- "All input rows present in output"
- "Execution completes in <5 seconds"

**Bad (subjective, non-discriminating):**

- "Output is well-written" (subjective)
- "Skill executes" (passes with or without the skill)
- "Output file exists" (too vague)
## Performance Metrics

| Metric | Description |
|--------|-------------|
| pass_rate | Fraction of assertions passing (0.0-1.0) |
| tokens_used | Total input + output tokens |
| execution_time_ms | Wall-clock duration in milliseconds |
| tool_calls | Number of tool invocations |
| files_created | Output file count |
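
As a worked example of the first row: a run where 7 of 10 assertions pass scores a pass_rate of 0.7. A one-function sketch over the grading results shape assumed in step 3:

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of assertions passing, 0.0-1.0."""
    if not results:
        return 0.0
    return sum(r["passed"] for r in results) / len(results)
```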
**Expected improvements (with-skill vs. baseline):**

- Code generation: +40-70% pass rate, -20-30% tokens
- Data processing: +50-80% pass rate, -30-50% time
- Analysis: +30-50% pass rate
## Environment Adaptations

### Claude Code (Full)

- Spawn parallel with-skill + baseline runs
- Full benchmarking + viewer
- Description optimization available

### Claude.ai (No subagents)

- Run tests sequentially
- Skip baseline runs
- Skip quantitative benchmarking

### Cowork (No browser)

- Use `--static <output_path>` to generate standalone HTML
- Download `feedback.json` from the viewer
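
A possible invocation (the `--static` flag is the one named above; `review.html` is an arbitrary output filename):

```
python scripts/generate_review.py --static review.html  # review.html: arbitrary path
```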