
Eval Infrastructure Guide

Quantitative skill evaluation using parallel testing, grading, and human-in-the-loop feedback.

Overview

Eval infrastructure tests skills via:

  1. Trigger accuracy — Does the skill activate on the correct queries?
  2. Output quality — Do outputs meet assertions?
  3. Performance comparison — With-skill vs baseline metrics

Workspace Structure

<skill-name>-workspace/
├── iteration-1/
│   ├── eval-0-descriptive-name/
│   │   ├── with_skill/outputs/
│   │   ├── without_skill/outputs/
│   │   └── eval_metadata.json
│   ├── eval-1-another-test/
│   ├── benchmark.json
│   ├── benchmark.md
│   └── timing.json
├── iteration-2/
└── feedback.json
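
The layout can be scaffolded before any runs are spawned. A minimal sketch, assuming the eval IDs come from evals/evals.json (see step 1) and leaving out the descriptive suffix on each eval directory, which is chosen by hand:

import json
from pathlib import Path

def scaffold_iteration(skill_name: str, iteration: int,
                       evals_path: str = "evals/evals.json") -> Path:
    # Create the per-iteration directory tree shown above.
    root = Path(f"{skill_name}-workspace") / f"iteration-{iteration}"
    evals = json.loads(Path(evals_path).read_text())["evals"]
    for case in evals:
        for condition in ("with_skill", "without_skill"):
            (root / f"eval-{case['id']}" / condition / "outputs").mkdir(
                parents=True, exist_ok=True)
    return root

scaffold_iteration("my-skill", 1)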

Step-by-Step Evaluation

1. Create Test Cases

Write evals/evals.json:

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 0,
      "prompt": "User task description",
      "expected_output": "What correct output looks like",
      "files": [],
      "assertions": [
        {"id": "a-1", "text": "Output is valid JSON"},
        {"id": "a-2", "text": "All input rows present in output"}
      ]
    }
  ]
}
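
Before spawning any runs it can be worth sanity-checking the file against this shape. A minimal sketch; the required-field list simply mirrors the example above:

import json

def check_evals(path: str = "evals/evals.json") -> None:
    # Fail fast if evals.json is missing fields the later steps rely on.
    with open(path) as f:
        data = json.load(f)
    assert data.get("skill_name") and data.get("evals"), "skill_name and a non-empty evals list are required"
    for case in data["evals"]:
        for field in ("id", "prompt", "expected_output", "files", "assertions"):
            assert field in case, f"eval {case.get('id')} is missing {field!r}"
        for assertion in case["assertions"]:
            assert assertion.get("id") and assertion.get("text"), f"eval {case['id']} has an incomplete assertion"

check_evals()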

2. Spawn Parallel Runs (CRITICAL)

MUST spawn the with-skill AND baseline runs simultaneously, in the same turn.

  • Sequential spawning = unfair timing comparison
  • Capture timing data from subagent notifications immediately (this is the only opportunity)
  • Draft assertions while the runs execute (a timing sketch follows below)
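
When the runs are driven by a script rather than by same-turn subagents, the same fairness argument applies: start both conditions at once and record wall-clock time for each. A minimal sketch, assuming a hypothetical run_eval command; the command name and flags are placeholders, not part of this skill's tooling:

import json
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def timed_run(label: str, cmd: list[str]) -> dict:
    # Run one condition and record wall-clock duration in milliseconds.
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return {"run": label, "execution_time_ms": int((time.monotonic() - start) * 1000)}

# Hypothetical commands; in Claude Code these are two subagents spawned in one turn.
runs = {
    "with_skill": ["run_eval", "--eval-id", "0", "--with-skill"],
    "without_skill": ["run_eval", "--eval-id", "0"],
}
with ThreadPoolExecutor(max_workers=len(runs)) as pool:
    results = list(pool.map(lambda item: timed_run(*item), runs.items()))

with open("timing.json", "w") as f:
    json.dump(results, f, indent=2)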

3. Grade Outputs

Use grader agent template (agents/grader.md):

  • Evaluates outputs against assertions
  • Returns pass/fail with evidence for each assertion
  • Output: grading.json (a minimal shape is sketched below)
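
A minimal sketch of the grading loop and the file it produces; the grader agent does the actual judging, and the grading.json field names here are illustrative rather than the template's exact schema:

import json

def judge(output_text: str, assertion_text: str) -> tuple[bool, str]:
    # Stand-in for the grader agent, which reasons about each assertion.
    passed = bool(output_text)  # placeholder logic only
    return passed, "one-line evidence for the verdict"

def grade_eval(eval_case: dict, output_text: str, out_path: str = "grading.json") -> None:
    # Record pass/fail plus evidence per assertion, as the grader template does.
    results = []
    for assertion in eval_case["assertions"]:
        passed, evidence = judge(output_text, assertion["text"])
        results.append({"assertion_id": assertion["id"], "passed": passed, "evidence": evidence})
    summary = {
        "eval_id": eval_case["id"],
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "results": results,
    }
    with open(out_path, "w") as f:
        json.dump(summary, f, indent=2)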

4. Aggregate Results

Run scripts/aggregate_benchmark.py:

  • Consolidates multiple run results
  • Calculates mean, stddev, min, max per metric
  • Generates benchmark.json + benchmark.md (the core statistics are sketched below)
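
The aggregation itself is plain descriptive statistics over per-run metrics. A minimal sketch of the idea; scripts/aggregate_benchmark.py is the real implementation, and its input format is assumed here:

import json
from statistics import mean, stdev

def aggregate(run_metrics: list[dict]) -> dict:
    # Collapse per-run metric dicts into mean/stddev/min/max per metric.
    summary = {}
    for metric in run_metrics[0]:
        values = [run[metric] for run in run_metrics]
        summary[metric] = {
            "mean": mean(values),
            "stddev": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
    return summary

# Example inputs only; one dict per run, using the metrics listed under Performance Metrics.
runs = [
    {"pass_rate": 0.8, "tokens_used": 12000, "execution_time_ms": 41000},
    {"pass_rate": 1.0, "tokens_used": 10500, "execution_time_ms": 38000},
]
with open("benchmark.json", "w") as f:
    json.dump(aggregate(runs), f, indent=2)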

5. Launch Viewer

Run scripts/generate_review.py:

  • Interactive HTML with two tabs:
    • Outputs — qualitative review, feedback textbox, prev/next
    • Benchmark — quantitative metrics, analyst observations
  • Auto-saves feedback to feedback.json

6. Iterate

Read feedback.json and generalize from the patterns (a reading sketch follows the list):

  • Don't overfit to test examples
  • Keep prompts lean — remove ineffective instructions
  • Scale test set to 5-10 cases for production skills
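
A minimal reading sketch, assuming the viewer saves a flat mapping of eval identifier to free-text note (treat that shape as an assumption; inspect your feedback.json first):

import json

# List reviewer notes side by side: recurring complaints point at what to fix,
# while one-off notes are candidates to note and move past rather than overfit to.
with open("my-skill-workspace/feedback.json") as f:
    feedback = json.load(f)

for eval_id, note in sorted(feedback.items()):
    print(f"{eval_id}: {note}")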

Assertion Design

Good (objective, discriminating):

  • "Output is valid JSON"
  • "All input rows present in output"
  • "Execution completes in <5 seconds"

Bad (subjective, non-discriminating):

  • "Output is well-written" (subjective)
  • "Skill executes" (passes with or without skill)
  • "Output file exists" (too vague)

Performance Metrics

Metric              Description
pass_rate           % of assertions passing (0.0-1.0)
tokens_used         Total input + output tokens
execution_time_ms   Wall-clock duration
tool_calls          Number of tool invocations
files_created       Output file count

Expected improvements:

  • Code generation: +40-70% pass rate, -20-30% tokens
  • Data processing: +50-80% pass rate, -30-50% time
  • Analysis: +30-50% pass rate

Environment Adaptations

Claude Code (Full)

  • Spawn parallel with+without runs
  • Full benchmarking + viewer
  • Description optimization available

Claude.ai (No subagents)

  • Run tests sequentially
  • Skip baseline runs
  • Skip quantitative benchmarking

Cowork (No browser)

  • Use --static <output_path> for standalone HTML
  • Download feedback.json from viewer