# Evaluation
Systematically assess agent performance and context engineering choices.
## Key Finding: 95% of Performance Variance Explained

Three factors together explain roughly 95% of the variance in agent performance:

- Token usage: ~80% of variance
- Number of tool calls: ~10% of variance
- Model choice: ~5% of variance

Implication: token budgets matter more than model upgrades.
## Multi-Dimensional Rubric
| Dimension | Weight | Description |
|---|---|---|
| Factual Accuracy | 30% | Ground truth verification |
| Completeness | 25% | Coverage of requirements |
| Tool Efficiency | 20% | Appropriate tool usage |
| Citation Accuracy | 15% | Sources match claims |
| Source Quality | 10% | Authority/credibility |
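
A minimal sketch of turning the rubric into a single score, assuming each dimension has already been judged on a 0-1 scale (the dict keys and helper name are illustrative, not from the source):

```python
# Weights mirror the rubric table above; scores are assumed in [0, 1].
RUBRIC_WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "tool_efficiency": 0.20,
    "citation_accuracy": 0.15,
    "source_quality": 0.10,
}

def rubric_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension judgments into one weighted total."""
    return sum(w * dimension_scores[d] for d, w in RUBRIC_WEIGHTS.items())
```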
## Evaluation Methods
### LLM-as-Judge

Beware of known judge biases:

- Position bias: the first-presented output is preferred
- Length bias: longer outputs receive higher scores
- Self-enhancement bias: a model rates its own outputs higher
- Verbosity bias: detailed-sounding answers are scored as better

Mitigations: position swapping (see Pairwise Comparison below) and anti-bias prompting, sketched next.
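
A minimal sketch of anti-bias prompting; the exact wording here is an assumption for illustration, not a canonical prompt:

```python
# Hypothetical judge instructions that preempt length/verbosity bias.
JUDGE_SYSTEM_PROMPT = """You are grading a single response on a 0-1 scale.
Do NOT reward length or verbosity: a concise, correct answer must score
at least as high as a long one. Ignore the order in which responses were
shown. Judge only factual accuracy, completeness, and grounding in the
cited sources, per the rubric provided.
"""
```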
### Pairwise Comparison

Position swapping in practice: judge each pair twice with the order reversed, and keep only verdicts that survive the swap. Assuming `judge.compare(x, y)` returns the probability that its first argument wins:

```python
score_ab = judge.compare(output_a, output_b)  # P(A wins) with A shown first
score_ba = judge.compare(output_b, output_a)  # P(B wins) with B shown first
# Consistent iff the same output is preferred under both orderings;
# if both calls favor whichever came first, the judge is position-biased.
consistent = (score_ab > 0.5) != (score_ba > 0.5)
```
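
When `consistent` is `False`, the judge's verdict flipped with presentation order, which signals position bias on that pair; common remedies are to discard the pair, treat it as a tie, or average `score_ab` with `1 - score_ba`.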
### Probe-Based Testing

| Probe | What it tests | Example question |
|---|---|---|
| Recall | Retention of facts | "What was the error?" |
| Artifact | Awareness of files touched | "Which files were modified?" |
| Continuation | Planning | "What's next?" |
| Decision | Reasoning | "Why did you choose X?" |
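
A hedged sketch of running probes after an agent session; the `agent.ask` interface and `grade` callback are assumed, and the reference answers are made up for illustration:

```python
# Each probe: (kind, question, reference answer the grader checks against).
PROBES = [
    ("recall",       "What was the error?",        "TimeoutError in fetch()"),
    ("artifact",     "Which files were modified?", "src/api.py"),
    ("continuation", "What's next?",               "Add retry with backoff"),
    ("decision",     "Why did you choose X?",      "Lower latency than Y"),
]

def run_probes(agent, grade):
    """Ask each probe and grade the answer against its reference."""
    return {
        kind: grade(agent.ask(question), reference)
        for kind, question, reference in PROBES
    }
```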
## Test Set Design

Stratify test cases by task complexity so every run covers easy, medium, and hard tasks in equal measure:

```python
import random

class TestSet:
    def __init__(self, simple, medium, complex_cases):
        # Test cases bucketed by task complexity.
        self.simple, self.medium, self.complex = simple, medium, complex_cases

    def sample_stratified(self, n):
        # Draw an equal number of cases from each complexity tier.
        per_level = n // 3
        return (
            random.sample(self.simple, per_level)
            + random.sample(self.medium, per_level)
            + random.sample(self.complex, per_level)
        )
```
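
Usage sketch (names hypothetical): `TestSet(simple_cases, medium_cases, complex_cases).sample_stratified(30)` draws ten cases from each tier; `random.sample` raises `ValueError` if a tier holds fewer cases than requested, which doubles as a guard against under-populated tiers.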
## Production Monitoring

Continuously score a small sample of production traffic and alert on regressions:

```python
import statistics

class Monitor:
    sample_rate = 0.01      # score 1% of production traffic
    alert_threshold = 0.85  # alert when mean quality falls below this

    def check(self, scores):
        mean = statistics.mean(scores)
        if mean < self.alert_threshold:
            # alert() is assumed to be a paging/notification hook.
            self.alert(f"Quality degraded: {mean:.2f}")
```
## Guidelines

- Start with outcome evaluation (did the task succeed?), not step-by-step process checks
- Score with multi-dimensional rubrics rather than a single holistic grade
- Mitigate LLM-as-Judge biases with position swapping and anti-bias prompting
- Test with complexity-stratified samples
- Monitor quality continuously in production
- Budget tokens carefully: token usage alone explains ~80% of performance variance