Files
2026-04-12 01:06:31 +07:00

90 lines
2.2 KiB
Markdown

# Evaluation
Systematically assess agent performance and context engineering choices.
## Key Finding: 95% Performance Variance
- **Token usage**: 80% of variance
- **Tool calls**: ~10% of variance
- **Model choice**: ~5% of variance
**Implication**: Token budgets matter more than model upgrades.
## Multi-Dimensional Rubric
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Factual Accuracy | 30% | Ground truth verification |
| Completeness | 25% | Coverage of requirements |
| Tool Efficiency | 20% | Appropriate tool usage |
| Citation Accuracy | 15% | Sources match claims |
| Source Quality | 10% | Authority/credibility |
## Evaluation Methods
### LLM-as-Judge
Beware biases:
- **Position**: First position preferred
- **Length**: Longer = higher score
- **Self-enhancement**: Rating own outputs higher
- **Verbosity**: Detailed = better
**Mitigation**: Position swapping, anti-bias prompting
### Pairwise Comparison
```python
score_ab = judge.compare(output_a, output_b)
score_ba = judge.compare(output_b, output_a)
consistent = (score_ab > 0.5) != (score_ba > 0.5)
```
### Probe-Based Testing
| Probe | Tests | Example |
|-------|-------|---------|
| Recall | Facts | "What was the error?" |
| Artifact | Files | "Which files modified?" |
| Continuation | Planning | "What's next?" |
| Decision | Reasoning | "Why chose X?" |
## Test Set Design
```python
class TestSet:
def sample_stratified(self, n):
per_level = n // 3
return (
sample(self.simple, per_level) +
sample(self.medium, per_level) +
sample(self.complex, per_level)
)
```
## Production Monitoring
```python
class Monitor:
sample_rate = 0.01 # 1% sampling
alert_threshold = 0.85
def check(self, scores):
if avg(scores) < self.alert_threshold:
self.alert(f"Quality degraded: {avg(scores):.2f}")
```
## Guidelines
1. Start with outcome evaluation, not step-by-step
2. Use multi-dimensional rubrics
3. Mitigate LLM-as-Judge biases
4. Test with stratified complexity
5. Implement continuous monitoring
6. Focus on token efficiency (80% variance)
## Related
- [Context Compression](./context-compression.md)
- [Tool Design](./tool-design.md)