
# Evaluation

Systematically assess agent performance and context engineering choices.

## Key Finding: 95% of Performance Variance Explained by Three Factors

- Token usage: 80% of variance
- Tool calls: ~10% of variance
- Model choice: ~5% of variance

Implication: token budgets matter more than model upgrades.
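
A minimal sketch of reproducing this attribution on your own eval logs, assuming each run records token usage, tool-call count, and an outcome score (all field names below are illustrative):

```python
from statistics import correlation  # Python 3.10+

def variance_explained(runs, factor):
    """Share of score variance explained by one numeric factor (single-predictor R^2)."""
    xs = [r[factor] for r in runs]   # e.g. r["tokens"] or r["tool_calls"]
    ys = [r["score"] for r in runs]
    return correlation(xs, ys) ** 2

# Usage (field names are assumptions, not part of this skill):
#   variance_explained(eval_runs, "tokens")      -> fraction explained by token usage
#   variance_explained(eval_runs, "tool_calls")  -> fraction explained by tool calls
```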

## Multi-Dimensional Rubric

| Dimension         | Weight | Description               |
|-------------------|--------|---------------------------|
| Factual Accuracy  | 30%    | Ground truth verification |
| Completeness      | 25%    | Coverage of requirements  |
| Tool Efficiency   | 20%    | Appropriate tool usage    |
| Citation Accuracy | 15%    | Sources match claims      |
| Source Quality    | 10%    | Authority/credibility     |
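
A minimal sketch of combining these weights into a single score, assuming each dimension has already been graded on a 0-1 scale (key names and the `dimension_scores` input are illustrative):

```python
# Weights mirror the rubric above; dimension scores are assumed to be 0-1 values
# produced by graders or an LLM judge for a single output.
RUBRIC_WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "tool_efficiency": 0.20,
    "citation_accuracy": 0.15,
    "source_quality": 0.10,
}

def rubric_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum over rubric dimensions; missing dimensions count as 0."""
    return sum(
        weight * dimension_scores.get(dim, 0.0)
        for dim, weight in RUBRIC_WEIGHTS.items()
    )
```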

## Evaluation Methods

### LLM-as-Judge

Beware these biases:

- Position: the first-listed answer is preferred
- Length: longer answers score higher
- Self-enhancement: the judge rates its own outputs higher
- Verbosity: more detail reads as better quality

Mitigation: position swapping (see the pairwise snippet below) and anti-bias prompting.
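
One way the anti-bias prompting could look; the wording below is an illustrative assumption, not a prescribed template:

```python
# Illustrative judge prompt that names the biases it must ignore.
JUDGE_PROMPT = """You are comparing two answers to the same task.
Judge ONLY factual accuracy, completeness, and citation accuracy.
Do NOT reward length, verbosity, or confident tone.
Ignore which answer appears first; the order is arbitrary.

Task: {task}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply with exactly one line: "A", "B", or "TIE"."""
```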

### Pairwise Comparison

```python
# Judge each pair in both orders; judge.compare(x, y) is assumed to return
# a 0-1 score that x is better than y.
score_ab = judge.compare(output_a, output_b)
score_ba = judge.compare(output_b, output_a)
# Consistent when the same output wins regardless of presentation order
consistent = (score_ab > 0.5) != (score_ba > 0.5)
```
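
When the two orderings disagree, a reasonable policy is to record the pair as a tie or re-judge it rather than trust either verdict.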

### Probe-Based Testing

| Probe        | Tests     | Example                 |
|--------------|-----------|-------------------------|
| Recall       | Facts     | "What was the error?"   |
| Artifact     | Files     | "Which files modified?" |
| Continuation | Planning  | "What's next?"          |
| Decision     | Reasoning | "Why chose X?"          |
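
A minimal probe-runner sketch, assuming an `ask(question)` callable that queries the agent after the session under test and an `expected_check(probe, answer)` predicate you supply (both hypothetical):

```python
PROBES = {
    "recall": "What was the error?",
    "artifact": "Which files were modified?",
    "continuation": "What's next?",
    "decision": "Why did you choose X?",
}

def run_probes(ask, expected_check):
    # Returns e.g. {"recall": True, "artifact": False, ...}
    return {
        name: expected_check(name, ask(question))
        for name, question in PROBES.items()
    }
```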

## Test Set Design

```python
from random import sample

class TestSet:
    # self.simple / self.medium / self.complex are lists of test cases
    # grouped by task complexity.
    def sample_stratified(self, n):
        # Draw an equal number of cases from each complexity level
        per_level = n // 3
        return (
            sample(self.simple, per_level) +
            sample(self.medium, per_level) +
            sample(self.complex, per_level)
        )
```
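
Possible usage (the attribute assignment is an assumption; the class above does not define a constructor):

```python
ts = TestSet()
ts.simple, ts.medium, ts.complex = simple_cases, medium_cases, complex_cases
batch = ts.sample_stratified(30)   # 10 simple + 10 medium + 10 complex cases
```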

## Production Monitoring

```python
from statistics import mean

class Monitor:
    sample_rate = 0.01       # evaluate 1% of production traffic
    alert_threshold = 0.85

    def check(self, scores):
        # Alert when average quality over the sampled window drops below threshold
        score = mean(scores)
        if score < self.alert_threshold:
            self.alert(f"Quality degraded: {score:.2f}")  # alert channel defined elsewhere
```

## Guidelines

1. Start with outcome evaluation, not step-by-step traces.
2. Use multi-dimensional rubrics.
3. Mitigate LLM-as-Judge biases.
4. Test with stratified complexity.
5. Implement continuous monitoring.
6. Focus on token efficiency (it accounts for ~80% of variance).