# Eval Infrastructure Guide

Quantitative skill evaluation using parallel testing, grading, and human-in-the-loop feedback.

## Overview

Eval infrastructure tests skills via:

1. **Trigger accuracy** — Does the skill activate on the correct queries?
2. **Output quality** — Do outputs meet assertions?
3. **Performance comparison** — With-skill vs. baseline metrics

## Workspace Structure

```
-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/outputs/
│   │   ├── without_skill/outputs/
│   │   └── eval_metadata.json
│   ├── eval-1-another-test/
│   ├── benchmark.json
│   ├── benchmark.md
│   └── timing.json
├── iteration-2/
└── feedback.json
```

## Step-by-Step Evaluation

### 1. Create Test Cases

Write `evals/evals.json`:

```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 0,
      "prompt": "User task description",
      "expected_output": "What correct output looks like",
      "files": [],
      "assertions": [
        {"id": "a-1", "text": "Output is valid JSON"},
        {"id": "a-2", "text": "All input rows present in output"}
      ]
    }
  ]
}
```

### 2. Spawn Parallel Runs (CRITICAL)

**MUST** spawn the with-skill AND baseline runs simultaneously, in the same turn.

- Sequential spawning makes the timing comparison unfair
- Capture timing data from subagent notifications immediately (this is the only opportunity)
- Draft assertions while the runs execute

### 3. Grade Outputs

Use the grader agent template (`agents/grader.md`):

- Evaluates outputs against assertions
- Returns pass/fail with evidence for each assertion
- Output: `grading.json` (an illustrative sketch appears under Example Sketches below)

### 4. Aggregate Results

Run `scripts/aggregate_benchmark.py`:

- Consolidates multiple run results
- Calculates mean, stddev, min, max per metric (see the aggregation sketch under Example Sketches below)
- Generates `benchmark.json` + `benchmark.md`

### 5. Launch Viewer

Run `scripts/generate_review.py`:

- Interactive HTML with two tabs:
  - **Outputs** — qualitative review, feedback textbox, prev/next navigation
  - **Benchmark** — quantitative metrics, analyst observations
- Auto-saves feedback to `feedback.json`

### 6. Iterate

Read `feedback.json` and generalize from the patterns:

- Don't overfit to the test examples
- Keep prompts lean — remove ineffective instructions
- Scale the test set to 5-10 cases for production skills

## Assertion Design

**Good (objective, discriminating):**

- "Output is valid JSON"
- "All input rows present in output"
- "Execution completes in <5 seconds"

**Bad (subjective, non-discriminating):**

- "Output is well-written" (subjective)
- "Skill executes" (passes with or without the skill)
- "Output file exists" (too vague)

## Performance Metrics

| Metric | Description |
|--------|-------------|
| pass_rate | Fraction of assertions passing (0.0-1.0) |
| tokens_used | Total input + output tokens |
| execution_time_ms | Wall-clock duration |
| tool_calls | Number of tool invocations |
| files_created | Output file count |

**Expected improvements:**

- Code generation: +40-70% pass rate, -20-30% tokens
- Data processing: +50-80% pass rate, -30-50% time
- Analysis: +30-50% pass rate

## Environment Adaptations

### Claude Code (Full)

- Spawn parallel with-skill and baseline runs
- Full benchmarking + viewer
- Description optimization available

### Claude.ai (No subagents)

- Run tests sequentially
- Skip baseline runs
- Skip quantitative benchmarking

### Cowork (No browser)

- Use `--static` for standalone HTML
- Download `feedback.json` from the viewer
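
## Example Sketches

The sketches below are illustrative only; the real formats and interfaces are defined by `agents/grader.md`, `scripts/aggregate_benchmark.py`, and `scripts/generate_review.py`.

### Grading output and pass rate

A minimal sketch of how a per-eval `pass_rate` could be derived from `grading.json`, assuming the grader writes a list of assertion results with a boolean `pass` field. The field names and file location shown here are assumptions for illustration, not the grader's documented schema.

```python
import json
from pathlib import Path


def pass_rate(grading_path: Path) -> float:
    """Fraction of assertions that passed, in 0.0-1.0.

    Assumes (hypothetically) that grading.json looks like:
    {"assertions": [{"id": "a-1", "pass": true, "evidence": "..."}, ...]}
    """
    results = json.loads(grading_path.read_text())["assertions"]
    if not results:
        return 0.0
    return sum(1 for a in results if a["pass"]) / len(results)


# Usage with a hypothetical location inside the workspace:
# pass_rate(Path("iteration-1/eval-0-descriptive-name/with_skill/grading.json"))
```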
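
### Metric aggregation

A minimal sketch of the mean/stddev/min/max consolidation described in Step 4, assuming each run contributes a flat dict of the metrics listed in the Performance Metrics table. The actual `aggregate_benchmark.py` interface and `benchmark.json` layout may differ; the `aggregate` helper and the sample run values are made up for illustration.

```python
import json
import statistics
from pathlib import Path

METRICS = ["pass_rate", "tokens_used", "execution_time_ms", "tool_calls", "files_created"]


def aggregate(run_metrics: list[dict], out_path: Path) -> dict:
    """Summarise per-run metric dicts into mean/stddev/min/max and write them as JSON."""
    summary = {}
    for metric in METRICS:
        values = [run[metric] for run in run_metrics if metric in run]
        if not values:
            continue
        summary[metric] = {
            "mean": statistics.mean(values),
            "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(summary, indent=2))
    return summary


# Illustrative input: two runs of the same eval, aggregated into one benchmark file.
runs = [
    {"pass_rate": 1.0, "tokens_used": 1200, "execution_time_ms": 4100, "tool_calls": 6, "files_created": 1},
    {"pass_rate": 0.5, "tokens_used": 1900, "execution_time_ms": 5300, "tool_calls": 9, "files_created": 1},
]
aggregate(runs, Path("iteration-1/benchmark.json"))
```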