init
This commit is contained in:
110
.opencode/skills/context-engineering/SKILL.md
Normal file
110
.opencode/skills/context-engineering/SKILL.md
Normal file
@@ -0,0 +1,110 @@
|
||||
---
|
||||
name: ck:context-engineering
|
||||
description: >-
|
||||
Check context usage limits, monitor time remaining, optimize token consumption, debug context failures.
|
||||
Use when asking about context percentage, rate limits, usage warnings, context optimization, agent architectures, memory systems.
|
||||
argument-hint: "[topic or question]"
|
||||
metadata:
|
||||
author: claudekit
|
||||
version: "1.0.0"
|
||||
---
|
||||
|
||||
# Context Engineering
|
||||
|
||||
Context engineering curates the smallest high-signal token set for LLM tasks. The goal: maximize reasoning quality while minimizing token usage.
|
||||
|
||||
## When to Activate
|
||||
|
||||
- Designing/debugging agent systems
|
||||
- Context limits constrain performance
|
||||
- Optimizing cost/latency
|
||||
- Building multi-agent coordination
|
||||
- Implementing memory systems
|
||||
- Evaluating agent performance
|
||||
- Developing LLM-powered pipelines
|
||||
|
||||
## Core Principles
|
||||
|
||||
1. **Context quality > quantity** - High-signal tokens beat exhaustive content
|
||||
2. **Attention is finite** - U-shaped curve favors beginning/end positions
|
||||
3. **Progressive disclosure** - Load information just-in-time
|
||||
4. **Isolation prevents degradation** - Partition work across sub-agents
|
||||
5. **Measure before optimizing** - Know your baseline
|
||||
|
||||
**IMPORTANT:**
|
||||
- Sacrifice grammar for the sake of concision.
|
||||
- Ensure token efficiency while maintaining high quality.
|
||||
- Pass these rules to subagents.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Topic | When to Use | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| **Fundamentals** | Understanding context anatomy, attention mechanics | [context-fundamentals.md](./references/context-fundamentals.md) |
|
||||
| **Degradation** | Debugging failures, lost-in-middle, poisoning | [context-degradation.md](./references/context-degradation.md) |
|
||||
| **Optimization** | Compaction, masking, caching, partitioning | [context-optimization.md](./references/context-optimization.md) |
|
||||
| **Compression** | Long sessions, summarization strategies | [context-compression.md](./references/context-compression.md) |
|
||||
| **Memory** | Cross-session persistence, knowledge graphs | [memory-systems.md](./references/memory-systems.md) |
|
||||
| **Multi-Agent** | Coordination patterns, context isolation | [multi-agent-patterns.md](./references/multi-agent-patterns.md) |
|
||||
| **Evaluation** | Testing agents, LLM-as-Judge, metrics | [evaluation.md](./references/evaluation.md) |
|
||||
| **Tool Design** | Tool consolidation, description engineering | [tool-design.md](./references/tool-design.md) |
|
||||
| **Pipelines** | Project development, batch processing | [project-development.md](./references/project-development.md) |
|
||||
| **Runtime Awareness** | Usage limits, context window monitoring | [runtime-awareness.md](./references/runtime-awareness.md) |
|
||||
|
||||
## Key Metrics
|
||||
|
||||
- **Token utilization**: Warning at 70%, trigger optimization at 80%
|
||||
- **Token variance**: Explains 80% of agent performance variance
|
||||
- **Multi-agent cost**: ~15x single agent baseline
|
||||
- **Compaction target**: 50-70% reduction, <5% quality loss
|
||||
- **Cache hit target**: 70%+ for stable workloads
|
||||
|
||||
## Four-Bucket Strategy
|
||||
|
||||
1. **Write**: Save context externally (scratchpads, files)
|
||||
2. **Select**: Pull only relevant context (retrieval, filtering)
|
||||
3. **Compress**: Reduce tokens while preserving info (summarization)
|
||||
4. **Isolate**: Split across sub-agents (partitioning)
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
- Exhaustive context over curated context
|
||||
- Critical info in middle positions
|
||||
- No compaction triggers before limits
|
||||
- Single agent for parallelizable tasks
|
||||
- Tools without clear descriptions
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Place critical info at beginning/end of context
|
||||
2. Implement compaction at 70-80% utilization
|
||||
3. Use sub-agents for context isolation, not role-play
|
||||
4. Design tools with 4-question framework (what, when, inputs, returns)
|
||||
5. Optimize for tokens-per-task, not tokens-per-request
|
||||
6. Validate with probe-based evaluation
|
||||
7. Monitor KV-cache hit rates in production
|
||||
8. Start minimal, add complexity only when proven necessary
|
||||
|
||||
## Runtime Awareness
|
||||
|
||||
The system automatically injects usage awareness via PostToolUse hook:
|
||||
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Claude Usage Limits: 5h=45%, 7d=32%
|
||||
Context Window Usage: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
**Thresholds:**
|
||||
- 70%: WARNING - consider optimization/compaction
|
||||
- 90%: CRITICAL - immediate action needed
|
||||
|
||||
**Data Sources:**
|
||||
- Usage limits: Anthropic OAuth API (`https://api.anthropic.com/api/oauth/usage`)
|
||||
- Context window: Statusline temp file (`/tmp/ck-context-{session_id}.json`)
|
||||
|
||||
## Scripts
|
||||
|
||||
- [context_analyzer.py](./scripts/context_analyzer.py) - Context health analysis, degradation detection
|
||||
- [compression_evaluator.py](./scripts/compression_evaluator.py) - Compression quality evaluation
|
||||
@@ -0,0 +1,84 @@
|
||||
# Context Compression
|
||||
|
||||
Strategies for long-running sessions exceeding context windows.
|
||||
|
||||
## Core Insight
|
||||
|
||||
Optimize **tokens-per-task** (total to completion), not tokens-per-request.
|
||||
Aggressive compression causing re-fetching costs more than better retention.
|
||||
|
||||
## Compression Methods
|
||||
|
||||
| Method | Compression | Quality | Best For |
|
||||
|--------|-------------|---------|----------|
|
||||
| **Anchored Iterative** | 98.6% | 3.70/5 | Best balance |
|
||||
| **Regenerative Full** | 98.7% | 3.44/5 | Readability |
|
||||
| **Opaque** | 99.3% | 3.35/5 | Max compression |
|
||||
|
||||
## Anchored Iterative Summary Template
|
||||
|
||||
```markdown
|
||||
## Session Intent
|
||||
Original goal: [preserved]
|
||||
|
||||
## Files Modified
|
||||
- file.py: Changes made
|
||||
|
||||
## Decisions Made
|
||||
- Key decisions with rationale
|
||||
|
||||
## Current State
|
||||
Progress summary
|
||||
|
||||
## Next Steps
|
||||
1. Next action items
|
||||
```
|
||||
|
||||
**On compression**: Merge new content into existing sections, don't regenerate.
|
||||
|
||||
## Compression Triggers
|
||||
|
||||
| Strategy | Trigger | Use Case |
|
||||
|----------|---------|----------|
|
||||
| Fixed threshold | 70-80% utilization | General purpose |
|
||||
| Sliding window | Keep last N turns + summary | Conversations |
|
||||
| Task-boundary | At logical completion | Multi-step workflows |
|
||||
|
||||
## Artifact Trail Problem
|
||||
|
||||
Weakest dimension (2.2-2.5/5.0). Coding agents need explicit tracking of:
|
||||
- Files created/modified/read
|
||||
- Function/variable names, error messages
|
||||
|
||||
**Solution**: Dedicated artifact section in summary.
|
||||
|
||||
## Probe-Based Evaluation
|
||||
|
||||
| Probe Type | Tests | Example |
|
||||
|------------|-------|---------|
|
||||
| Recall | Factual retention | "What was the error?" |
|
||||
| Artifact | File tracking | "Which files modified?" |
|
||||
| Continuation | Task planning | "What next?" |
|
||||
| Decision | Reasoning chains | "Why chose X?" |
|
||||
|
||||
## Six Evaluation Dimensions
|
||||
|
||||
1. **Accuracy** - Technical correctness
|
||||
2. **Context Awareness** - Conversation state
|
||||
3. **Artifact Trail** - File tracking (universally weak)
|
||||
4. **Completeness** - Coverage depth
|
||||
5. **Continuity** - Work continuation
|
||||
6. **Instruction Following** - Constraints
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Use anchored iterative for best quality/compression
|
||||
2. Maintain explicit artifact tracking section
|
||||
3. Trigger compression at 70% utilization
|
||||
4. Merge into sections, don't regenerate
|
||||
5. Evaluate with probes, not lexical metrics
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Evaluation](./evaluation.md)
|
||||
@@ -0,0 +1,93 @@
|
||||
# Context Degradation Patterns
|
||||
|
||||
Predictable degradation as context grows. Not binary - a continuum.
|
||||
|
||||
## Degradation Patterns
|
||||
|
||||
| Pattern | Cause | Detection |
|
||||
|---------|-------|-----------|
|
||||
| **Lost-in-Middle** | U-shaped attention | Critical info recall drops 10-40% |
|
||||
| **Context Poisoning** | Errors compound via reference | Persistent hallucinations despite correction |
|
||||
| **Context Distraction** | Irrelevant info overwhelms | Single distractor degrades performance |
|
||||
| **Context Confusion** | Multiple tasks mix | Wrong tool calls, mixed requirements |
|
||||
| **Context Clash** | Contradictory info | Conflicting outputs, inconsistent reasoning |
|
||||
|
||||
## Lost-in-Middle Phenomenon
|
||||
|
||||
- Information in middle gets 10-40% lower recall
|
||||
- Models allocate massive attention to first token (BOS sink)
|
||||
- As context grows, middle tokens fail to get sufficient attention
|
||||
- **Mitigation**: Place critical info at beginning/end
|
||||
|
||||
```markdown
|
||||
[CURRENT TASK] # Beginning - high attention
|
||||
- Critical requirements
|
||||
|
||||
[DETAILED CONTEXT] # Middle - lower attention
|
||||
- Supporting details
|
||||
|
||||
[KEY FINDINGS] # End - high attention
|
||||
- Important conclusions
|
||||
```
|
||||
|
||||
## Context Poisoning
|
||||
|
||||
**Entry points**:
|
||||
1. Tool outputs with errors/unexpected formats
|
||||
2. Retrieved docs with incorrect/outdated info
|
||||
3. Model-generated summaries with hallucinations
|
||||
|
||||
**Detection symptoms**:
|
||||
- Degraded quality on previously successful tasks
|
||||
- Tool misalignment (wrong tools/parameters)
|
||||
- Persistent hallucinations
|
||||
|
||||
**Recovery**:
|
||||
- Truncate to before poisoning point
|
||||
- Explicit note + re-evaluation request
|
||||
- Restart with clean context, preserve only verified info
|
||||
|
||||
## Model Degradation Thresholds
|
||||
|
||||
| Model | Degradation Onset | Severe Degradation |
|
||||
|-------|-------------------|-------------------|
|
||||
| GPT-5.2 | ~64K tokens | ~200K tokens |
|
||||
| Claude Opus 4.5 | ~100K tokens | ~180K tokens |
|
||||
| Claude Sonnet 4.5 | ~80K tokens | ~150K tokens |
|
||||
| Gemini 3 Pro | ~500K tokens | ~800K tokens |
|
||||
|
||||
## Four-Bucket Mitigation
|
||||
|
||||
1. **Write**: Save externally (scratchpads, files)
|
||||
2. **Select**: Pull only relevant (retrieval, filtering)
|
||||
3. **Compress**: Reduce tokens (summarization)
|
||||
4. **Isolate**: Split across sub-agents (partitioning)
|
||||
|
||||
## Detection Heuristics
|
||||
|
||||
```python
|
||||
def calculate_health(utilization, degradation_risk, poisoning_risk):
|
||||
"""Health score: 1.0 = healthy, 0.0 = critical"""
|
||||
score = 1.0
|
||||
score -= utilization * 0.5 if utilization > 0.7 else 0
|
||||
score -= degradation_risk * 0.3
|
||||
score -= poisoning_risk * 0.2
|
||||
return max(0, score)
|
||||
|
||||
# Thresholds: healthy >0.8, warning >0.6, degraded >0.4, critical <=0.4
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Monitor context length vs performance correlation
|
||||
2. Place critical info at beginning/end
|
||||
3. Implement compaction before degradation
|
||||
4. Validate retrieved docs before adding
|
||||
5. Use versioning to prevent outdated clash
|
||||
6. Segment tasks to prevent confusion
|
||||
7. Design for graceful degradation
|
||||
|
||||
## Related Topics
|
||||
|
||||
- [Context Optimization](./context-optimization.md) - Mitigation techniques
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md) - Isolation strategies
|
||||
@@ -0,0 +1,75 @@
|
||||
# Context Fundamentals
|
||||
|
||||
Context = all input provided to LLM for task completion.
|
||||
|
||||
## Anatomy of Context
|
||||
|
||||
| Component | Purpose | Token Impact |
|
||||
|-----------|---------|--------------|
|
||||
| System Prompt | Identity, constraints, guidelines | Stable, cacheable |
|
||||
| Tool Definitions | Action specs with params/returns | Grows with capabilities |
|
||||
| Retrieved Docs | Domain knowledge, just-in-time | Variable, selective |
|
||||
| Message History | Conversation state, task progress | Accumulates over time |
|
||||
| Tool Outputs | Results from actions | 83.9% of typical context |
|
||||
|
||||
## Attention Mechanics
|
||||
|
||||
- **U-shaped curve**: Beginning/end get more attention than middle
|
||||
- **Attention budget**: n^2 relationships for n tokens depletes with growth
|
||||
- **Position encoding**: Interpolation allows longer sequences with degradation
|
||||
- **First-token sink**: BOS token absorbs large attention budget
|
||||
|
||||
## System Prompt Structure
|
||||
|
||||
```xml
|
||||
<BACKGROUND_INFORMATION>Domain knowledge, role definition</BACKGROUND_INFORMATION>
|
||||
<INSTRUCTIONS>Step-by-step procedures</INSTRUCTIONS>
|
||||
<TOOL_GUIDANCE>When/how to use tools</TOOL_GUIDANCE>
|
||||
<OUTPUT_DESCRIPTION>Format requirements</OUTPUT_DESCRIPTION>
|
||||
```
|
||||
|
||||
## Progressive Disclosure Levels
|
||||
|
||||
1. **Metadata** (~100 words) - Always in context
|
||||
2. **SKILL.md body** (<5k words) - When skill triggers
|
||||
3. **Bundled resources** (Unlimited) - As needed
|
||||
|
||||
## Token Budget Allocation
|
||||
|
||||
| Component | Typical Range | Notes |
|
||||
|-----------|---------------|-------|
|
||||
| System Prompt | 500-2000 | Stable, optimize once |
|
||||
| Tool Definitions | 100-500 per tool | Keep under 20 tools |
|
||||
| Retrieved Docs | 1000-5000 | Selective loading |
|
||||
| Message History | Variable | Summarize at 70% |
|
||||
| Reserved Buffer | 10-20% | For responses |
|
||||
|
||||
## Document Management
|
||||
|
||||
**Strong identifiers**: `customer_pricing_rates.json` not `data/file1.json`
|
||||
**Chunk at semantic boundaries**: Paragraphs, sections, not arbitrary lengths
|
||||
**Include metadata**: Source, date, relevance score
|
||||
|
||||
## Message History Pattern
|
||||
|
||||
```python
|
||||
# Summary injection every 20 messages
|
||||
if len(messages) % 20 == 0:
|
||||
summary = summarize_conversation(messages[-20:])
|
||||
messages.append({"role": "system", "content": f"Summary: {summary}"})
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Treat context as finite with diminishing returns
|
||||
2. Place critical info at attention-favored positions
|
||||
3. Use file-system-based access for large documents
|
||||
4. Pre-load stable content, just-in-time load dynamic
|
||||
5. Design with explicit token budgets
|
||||
6. Monitor usage, implement compaction triggers at 70-80%
|
||||
|
||||
## Related Topics
|
||||
|
||||
- [Context Degradation](./context-degradation.md) - Failure patterns
|
||||
- [Context Optimization](./context-optimization.md) - Efficiency techniques
|
||||
- [Memory Systems](./memory-systems.md) - External storage
|
||||
@@ -0,0 +1,82 @@
|
||||
# Context Optimization
|
||||
|
||||
Extend effective context capacity through strategic techniques.
|
||||
|
||||
## Four Core Strategies
|
||||
|
||||
| Strategy | Target | Reduction | When to Use |
|
||||
|----------|--------|-----------|-------------|
|
||||
| **Compaction** | Full context | 50-70% | Approaching limits |
|
||||
| **Observation Masking** | Tool outputs | 60-80% | Verbose outputs >80% |
|
||||
| **KV-Cache Optimization** | Repeated prefixes | 70%+ hit | Stable prompts |
|
||||
| **Context Partitioning** | Work distribution | N/A | Parallelizable tasks |
|
||||
|
||||
## Compaction
|
||||
|
||||
Summarize context when approaching limits.
|
||||
|
||||
**Priority**: Tool outputs → Old turns → Retrieved docs → Never: System prompt
|
||||
|
||||
```python
|
||||
if context_tokens / context_limit > 0.8:
|
||||
context = compact_context(context)
|
||||
```
|
||||
|
||||
**Preserve**: Key findings, decisions, commitments (remove supporting details)
|
||||
|
||||
## Observation Masking
|
||||
|
||||
Replace verbose tool outputs with compact references.
|
||||
|
||||
```python
|
||||
if len(observation) > max_length:
|
||||
ref_id = store_observation(observation)
|
||||
return f"[Obs:{ref_id}. Key: {extract_key(observation)}]"
|
||||
```
|
||||
|
||||
**Never mask**: Current task critical, most recent turn, active reasoning
|
||||
**Always mask**: Repeated outputs, boilerplate, already summarized
|
||||
|
||||
## KV-Cache Optimization
|
||||
|
||||
Reuse cached Key/Value tensors for identical prefixes.
|
||||
|
||||
```python
|
||||
# Cache-friendly ordering (stable first)
|
||||
context = [system_prompt, tool_definitions] # Cacheable
|
||||
context += [unique_content] # Variable last
|
||||
```
|
||||
|
||||
**Tips**: Avoid timestamps in stable sections, consistent formatting, stable structure
|
||||
|
||||
## Context Partitioning
|
||||
|
||||
Split work across sub-agents with isolated contexts.
|
||||
|
||||
```python
|
||||
result = await sub_agent.process(subtask, clean_context=True)
|
||||
coordinator.receive(result.summary) # Only essentials
|
||||
```
|
||||
|
||||
## Decision Framework
|
||||
|
||||
| Dominant Component | Apply |
|
||||
|-------------------|-------|
|
||||
| Tool outputs | Observation masking |
|
||||
| Retrieved docs | Summarization or partitioning |
|
||||
| Message history | Compaction + summarization |
|
||||
| Multiple | Combine strategies |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Measure before optimizing
|
||||
2. Apply compaction before masking
|
||||
3. Design for cache stability
|
||||
4. Partition before context problematic
|
||||
5. Monitor effectiveness over time
|
||||
6. Balance savings vs quality
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Compression](./context-compression.md)
|
||||
- [Memory Systems](./memory-systems.md)
|
||||
@@ -0,0 +1,89 @@
|
||||
# Evaluation
|
||||
|
||||
Systematically assess agent performance and context engineering choices.
|
||||
|
||||
## Key Finding: 95% Performance Variance
|
||||
|
||||
- **Token usage**: 80% of variance
|
||||
- **Tool calls**: ~10% of variance
|
||||
- **Model choice**: ~5% of variance
|
||||
|
||||
**Implication**: Token budgets matter more than model upgrades.
|
||||
|
||||
## Multi-Dimensional Rubric
|
||||
|
||||
| Dimension | Weight | Description |
|
||||
|-----------|--------|-------------|
|
||||
| Factual Accuracy | 30% | Ground truth verification |
|
||||
| Completeness | 25% | Coverage of requirements |
|
||||
| Tool Efficiency | 20% | Appropriate tool usage |
|
||||
| Citation Accuracy | 15% | Sources match claims |
|
||||
| Source Quality | 10% | Authority/credibility |
|
||||
|
||||
## Evaluation Methods
|
||||
|
||||
### LLM-as-Judge
|
||||
|
||||
Beware biases:
|
||||
- **Position**: First position preferred
|
||||
- **Length**: Longer = higher score
|
||||
- **Self-enhancement**: Rating own outputs higher
|
||||
- **Verbosity**: Detailed = better
|
||||
|
||||
**Mitigation**: Position swapping, anti-bias prompting
|
||||
|
||||
### Pairwise Comparison
|
||||
|
||||
```python
|
||||
score_ab = judge.compare(output_a, output_b)
|
||||
score_ba = judge.compare(output_b, output_a)
|
||||
consistent = (score_ab > 0.5) != (score_ba > 0.5)
|
||||
```
|
||||
|
||||
### Probe-Based Testing
|
||||
|
||||
| Probe | Tests | Example |
|
||||
|-------|-------|---------|
|
||||
| Recall | Facts | "What was the error?" |
|
||||
| Artifact | Files | "Which files modified?" |
|
||||
| Continuation | Planning | "What's next?" |
|
||||
| Decision | Reasoning | "Why chose X?" |
|
||||
|
||||
## Test Set Design
|
||||
|
||||
```python
|
||||
class TestSet:
|
||||
def sample_stratified(self, n):
|
||||
per_level = n // 3
|
||||
return (
|
||||
sample(self.simple, per_level) +
|
||||
sample(self.medium, per_level) +
|
||||
sample(self.complex, per_level)
|
||||
)
|
||||
```
|
||||
|
||||
## Production Monitoring
|
||||
|
||||
```python
|
||||
class Monitor:
|
||||
sample_rate = 0.01 # 1% sampling
|
||||
alert_threshold = 0.85
|
||||
|
||||
def check(self, scores):
|
||||
if avg(scores) < self.alert_threshold:
|
||||
self.alert(f"Quality degraded: {avg(scores):.2f}")
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Start with outcome evaluation, not step-by-step
|
||||
2. Use multi-dimensional rubrics
|
||||
3. Mitigate LLM-as-Judge biases
|
||||
4. Test with stratified complexity
|
||||
5. Implement continuous monitoring
|
||||
6. Focus on token efficiency (80% variance)
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Compression](./context-compression.md)
|
||||
- [Tool Design](./tool-design.md)
|
||||
@@ -0,0 +1,88 @@
|
||||
# Memory Systems
|
||||
|
||||
Architectures for persistent context beyond the window.
|
||||
|
||||
## Memory Layer Architecture
|
||||
|
||||
| Layer | Scope | Persistence | Use Case |
|
||||
|-------|-------|-------------|----------|
|
||||
| L1: Working | Current window | None | Active reasoning |
|
||||
| L2: Short-Term | Session | Session | Task continuity |
|
||||
| L3: Long-Term | Cross-session | Persistent | User preferences |
|
||||
| L4: Entity | Per-entity | Persistent | Consistency |
|
||||
| L5: Temporal Graph | Time-aware | Persistent | Evolving facts |
|
||||
|
||||
## Benchmark Performance (DMR Accuracy)
|
||||
|
||||
| System | Accuracy | Approach |
|
||||
|--------|----------|----------|
|
||||
| Zep | 94.8% | Temporal knowledge graphs |
|
||||
| MemGPT | 93.4% | Hierarchical memory |
|
||||
| GraphRAG | 75-85% | Knowledge graphs |
|
||||
| Vector RAG | 60-70% | Embedding similarity |
|
||||
|
||||
## Vector Store with Metadata
|
||||
|
||||
```python
|
||||
class MetadataVectorStore:
|
||||
def add(self, text, embedding, metadata):
|
||||
doc = {
|
||||
"text": text, "embedding": embedding,
|
||||
"entities": metadata.get("entities", []),
|
||||
"timestamp": metadata.get("timestamp")
|
||||
}
|
||||
self.index_by_entity(doc)
|
||||
|
||||
def search_by_entity(self, entity, k=5):
|
||||
return self.entity_index.get(entity, [])[:k]
|
||||
```
|
||||
|
||||
## Temporal Knowledge Graph
|
||||
|
||||
```python
|
||||
class TemporalKnowledgeGraph:
|
||||
def add_fact(self, subject, predicate, obj, valid_from, valid_to=None):
|
||||
self.facts.append({
|
||||
"triple": (subject, predicate, obj),
|
||||
"valid_from": valid_from,
|
||||
"valid_to": valid_to or "current"
|
||||
})
|
||||
|
||||
def query_at_time(self, subject, predicate, timestamp):
|
||||
for fact in self.facts:
|
||||
if (fact["triple"][0] == subject and
|
||||
fact["valid_from"] <= timestamp <= fact["valid_to"]):
|
||||
return fact["triple"][2]
|
||||
```
|
||||
|
||||
## Memory Retrieval Patterns
|
||||
|
||||
| Pattern | Query | Use Case |
|
||||
|---------|-------|----------|
|
||||
| Semantic | "Similar to X" | General recall |
|
||||
| Entity-based | "About user John" | Consistency |
|
||||
| Temporal | "Valid on date" | Evolving facts |
|
||||
| Hybrid | Combine above | Production |
|
||||
|
||||
## File-System-as-Memory
|
||||
|
||||
```
|
||||
memory/
|
||||
├── sessions/{id}/summary.md
|
||||
├── entities/{id}.json
|
||||
└── facts/{timestamp}_{id}.json
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Start with file-system-as-memory (simplest)
|
||||
2. Add vector search for scale
|
||||
3. Use entity indexing for consistency
|
||||
4. Add temporal awareness for evolving facts
|
||||
5. Implement consolidation for health
|
||||
6. Measure retrieval accuracy
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Fundamentals](./context-fundamentals.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,90 @@
|
||||
# Multi-Agent Patterns
|
||||
|
||||
Distribute work across multiple context windows for isolation and scale.
|
||||
|
||||
## Core Insight
|
||||
|
||||
Sub-agents exist to **isolate context**, not anthropomorphize roles.
|
||||
|
||||
## Token Economics
|
||||
|
||||
| Architecture | Multiplier | Use Case |
|
||||
|--------------|------------|----------|
|
||||
| Single agent | 1x | Simple tasks |
|
||||
| Single + tools | ~4x | Moderate complexity |
|
||||
| Multi-agent | ~15x | Context isolation needed |
|
||||
|
||||
**Key**: Token usage explains 80% of performance variance.
|
||||
|
||||
## Patterns
|
||||
|
||||
### Supervisor/Orchestrator
|
||||
|
||||
```python
|
||||
class Supervisor:
|
||||
def process(self, task):
|
||||
subtasks = self.decompose(task)
|
||||
results = [worker.execute(st, clean_context=True) for st in subtasks]
|
||||
return self.aggregate(results)
|
||||
```
|
||||
|
||||
**Pros**: Control, human-in-loop | **Cons**: Bottleneck, telephone game
|
||||
|
||||
### Peer-to-Peer/Swarm
|
||||
|
||||
```python
|
||||
def process_with_handoff(agent, task):
|
||||
result = agent.process(task)
|
||||
if "handoff" in result:
|
||||
return process_with_handoff(select_agent(result["to"]), result["state"])
|
||||
return result
|
||||
```
|
||||
|
||||
**Pros**: No SPOF, scales | **Cons**: Complex coordination
|
||||
|
||||
### Hierarchical
|
||||
|
||||
Strategy → Planning → Execution layers
|
||||
**Pros**: Separation of concerns | **Cons**: Coordination overhead
|
||||
|
||||
## Context Isolation Patterns
|
||||
|
||||
| Pattern | Isolation | Use Case |
|
||||
|---------|-----------|----------|
|
||||
| Full delegation | None | Max capability |
|
||||
| Instruction passing | High | Simple tasks |
|
||||
| File coordination | Medium | Shared state |
|
||||
|
||||
## Consensus Mechanisms
|
||||
|
||||
```python
|
||||
def weighted_consensus(responses):
|
||||
scores = {}
|
||||
for r in responses:
|
||||
weight = r["confidence"] * r["expertise"]
|
||||
scores[r["answer"]] = scores.get(r["answer"], 0) + weight
|
||||
return max(scores, key=scores.get)
|
||||
```
|
||||
|
||||
## Failure Recovery
|
||||
|
||||
| Failure | Mitigation |
|
||||
|---------|------------|
|
||||
| Bottleneck | Output schemas, checkpointing |
|
||||
| Overhead | Clear handoffs, batching |
|
||||
| Divergence | Boundaries, convergence checks |
|
||||
| Errors | Validation, circuit breakers |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Use multi-agent for context isolation, not role-play
|
||||
2. Accept ~15x token cost for benefits
|
||||
3. Implement circuit breakers
|
||||
4. Use files for shared state
|
||||
5. Design clear handoffs
|
||||
6. Validate between agents
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Evaluation](./evaluation.md)
|
||||
@@ -0,0 +1,97 @@
|
||||
# Project Development
|
||||
|
||||
Design and build LLM-powered projects from ideation to deployment.
|
||||
|
||||
## Task-Model Fit
|
||||
|
||||
**LLM-Suited**: Synthesis, subjective judgment, NL output, error-tolerant batches
|
||||
**LLM-Unsuited**: Precise computation, real-time, perfect accuracy, deterministic output
|
||||
|
||||
## Manual Prototype First
|
||||
|
||||
Test one example with target model before automation.
|
||||
|
||||
## Pipeline Architecture
|
||||
|
||||
```
|
||||
acquire → prepare → process → parse → render
|
||||
(fetch) (prompt) (LLM) (extract) (output)
|
||||
```
|
||||
|
||||
Stages 1,2,4,5: Deterministic, cheap | Stage 3: Non-deterministic, expensive
|
||||
|
||||
## File System as State
|
||||
|
||||
```
|
||||
data/{id}/
|
||||
├── raw.json # acquire done
|
||||
├── prompt.md # prepare done
|
||||
├── response.md # process done
|
||||
└── parsed.json # parse done
|
||||
```
|
||||
|
||||
```python
|
||||
def get_stage(id):
|
||||
if exists(f"{id}/parsed.json"): return "render"
|
||||
if exists(f"{id}/response.md"): return "parse"
|
||||
# ... check backwards
|
||||
```
|
||||
|
||||
**Benefits**: Idempotent, resumable, debuggable
|
||||
|
||||
## Structured Output
|
||||
|
||||
```markdown
|
||||
## SUMMARY
|
||||
[Overview]
|
||||
|
||||
## KEY_FINDINGS
|
||||
- Finding 1
|
||||
|
||||
## SCORE
|
||||
[1-5]
|
||||
```
|
||||
|
||||
```python
|
||||
def parse(response):
|
||||
return {
|
||||
"summary": extract_section(response, "SUMMARY"),
|
||||
"findings": extract_list(response, "KEY_FINDINGS"),
|
||||
"score": extract_int(response, "SCORE")
|
||||
}
|
||||
```
|
||||
|
||||
## Cost Estimation
|
||||
|
||||
```python
|
||||
def estimate(items, tokens_per, price_per_1k):
|
||||
return len(items) * tokens_per / 1000 * price_per_1k * 1.1 # 10% buffer
|
||||
# 1000 items × 2000 tokens × $0.01/1k = $22
|
||||
```
|
||||
|
||||
## Case Studies
|
||||
|
||||
**Karpathy HN**: 930 items, $58, 1hr, 15 workers
|
||||
**Vercel d0**: 17→2 tools, 80%→100% success, 3.5x faster
|
||||
|
||||
## Single vs Multi-Agent
|
||||
|
||||
| Factor | Single | Multi |
|
||||
|--------|--------|-------|
|
||||
| Context | Fits window | Exceeds |
|
||||
| Tasks | Sequential | Parallel |
|
||||
| Tokens | Limited | 15x OK |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Validate manually before automating
|
||||
2. Use 5-stage pipeline
|
||||
3. Track state via files
|
||||
4. Design structured output
|
||||
5. Estimate costs first
|
||||
6. Start single, add multi when needed
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,202 @@
|
||||
# Runtime Awareness
|
||||
|
||||
Monitor usage limits and context window utilization in real-time to optimize Claude Code sessions.
|
||||
|
||||
## Overview
|
||||
|
||||
Runtime awareness provides visibility into two critical metrics:
|
||||
1. **Usage Limits** - API quota consumption (5-hour and 7-day rolling windows)
|
||||
2. **Context Window** - Current token utilization within the 200K context limit
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────┐ ┌──────────────────────────┐
|
||||
│ statusline.cjs │───▶│ /tmp/ck-context-*.json │
|
||||
│ (writes data) │ │ (context window data) │
|
||||
└─────────────────┘ └────────────┬─────────────┘
|
||||
│
|
||||
┌────────────▼─────────────┐
|
||||
│ usage-context-hook.cjs │◀── PostToolUse
|
||||
│ - Reads context file │
|
||||
│ - Fetches usage limits │
|
||||
│ - Injects awareness │
|
||||
└──────────────────────────┘
|
||||
```
|
||||
|
||||
## Usage Limits API
|
||||
|
||||
### Endpoint
|
||||
|
||||
```
|
||||
GET https://api.anthropic.com/api/oauth/usage
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
Requires OAuth Bearer token with `anthropic-beta: oauth-2025-04-20` header.
|
||||
|
||||
### Credential Locations
|
||||
|
||||
| Platform | Method | Location |
|
||||
|----------|--------|----------|
|
||||
| macOS | Keychain | `Claude Code-credentials` |
|
||||
| Windows | File | `%USERPROFILE%\.claude\.credentials.json` |
|
||||
| Linux | File | `~/.opencode/.credentials.json` |
|
||||
|
||||
### Response Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"five_hour": {
|
||||
"utilization": 45,
|
||||
"resets_at": "2025-01-15T18:00:00Z"
|
||||
},
|
||||
"seven_day": {
|
||||
"utilization": 32,
|
||||
"resets_at": "2025-01-22T00:00:00Z"
|
||||
},
|
||||
"seven_day_sonnet": {
|
||||
"utilization": 11,
|
||||
"resets_at": "2025-01-15T09:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `utilization`: Already a percentage (0-100), NOT a decimal
|
||||
- `resets_at`: ISO 8601 timestamp when quota resets
|
||||
- `seven_day_sonnet`: Model-specific limit (may be null)
|
||||
|
||||
## Context Window Data
|
||||
|
||||
### Source
|
||||
|
||||
Statusline writes context data to `/tmp/ck-context-{session_id}.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"percent": 67,
|
||||
"tokens": 134000,
|
||||
"size": 200000,
|
||||
"usage": {
|
||||
"input_tokens": 80000,
|
||||
"cache_creation_input_tokens": 30000,
|
||||
"cache_read_input_tokens": 24000
|
||||
},
|
||||
"timestamp": 1705312000000
|
||||
}
|
||||
```
|
||||
|
||||
### Token Calculation
|
||||
|
||||
```
|
||||
total = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
|
||||
percent = (total + AUTOCOMPACT_BUFFER) / context_window_size * 100
|
||||
```
|
||||
|
||||
Where `AUTOCOMPACT_BUFFER = 45000` (22.5% reserved).
|
||||
|
||||
## Hook Output
|
||||
|
||||
The PostToolUse hook injects awareness data every 5 minutes:
|
||||
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=45%, 7d=32%
|
||||
Context: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
### Warning Indicators
|
||||
|
||||
| Level | Threshold | Indicator |
|
||||
|-------|-----------|-----------|
|
||||
| Normal | < 70% | Plain percentage |
|
||||
| Warning | 70-89% | `[WARNING]` |
|
||||
| Critical | ≥ 90% | `[CRITICAL]` |
|
||||
|
||||
### Examples
|
||||
|
||||
Normal state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=45%, 7d=32%
|
||||
Context: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
Warning state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=75% [WARNING], 7d=32%
|
||||
Context: 78% [WARNING - consider compaction]
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
Critical state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=92% [CRITICAL], 7d=65%
|
||||
Context: 91% [CRITICAL - compaction needed]
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
## Recommendations by Threshold
|
||||
|
||||
### Context Window
|
||||
|
||||
| Utilization | Action |
|
||||
|-------------|--------|
|
||||
| < 70% | Continue normally |
|
||||
| 70-80% | Plan compaction strategy |
|
||||
| 80-90% | Execute compaction |
|
||||
| > 90% | Immediate compaction or session reset |
|
||||
|
||||
### Usage Limits
|
||||
|
||||
| 5-Hour | Action |
|
||||
|--------|--------|
|
||||
| < 70% | Normal usage |
|
||||
| 70-90% | Reduce parallelization, delegate to subagents |
|
||||
| > 90% | Wait for reset or use lower-tier models |
|
||||
|
||||
| 7-Day | Action |
|
||||
|-------|--------|
|
||||
| < 70% | Normal usage |
|
||||
| 70-90% | Monitor daily consumption |
|
||||
| > 90% | Limit usage to essential tasks |
|
||||
|
||||
## Configuration
|
||||
|
||||
### Hook Settings (`.opencode/settings.json`)
|
||||
|
||||
```json
|
||||
{
|
||||
"hooks": {
|
||||
"PostToolUse": [
|
||||
{
|
||||
"matcher": "*",
|
||||
"hooks": [{
|
||||
"type": "command",
|
||||
"command": "node .opencode/hooks/usage-quota-cache-refresh.cjs"
|
||||
}]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Throttling
|
||||
|
||||
- **Injection interval**: 5 minutes (300,000ms)
|
||||
- **API cache TTL**: 60 seconds
|
||||
- **Context data freshness**: 30 seconds
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Issue | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| No usage limits shown | No OAuth token | Run `claude login` |
|
||||
| Stale context data | Statusline not updating | Check statusline config |
|
||||
| 401 Unauthorized | Expired token | Re-authenticate |
|
||||
| Hook not firing | Settings misconfigured | Verify PostToolUse matcher |
|
||||
@@ -0,0 +1,86 @@
|
||||
# Tool Design
|
||||
|
||||
Design effective tools for agent systems.
|
||||
|
||||
## Consolidation Principle
|
||||
|
||||
Single comprehensive tools > multiple narrow tools. **Target**: 10-20 tools max.
|
||||
|
||||
## Architectural Reduction Evidence
|
||||
|
||||
| Metric | 17 Tools | 2 Tools | Improvement |
|
||||
|--------|----------|---------|-------------|
|
||||
| Time | 274.8s | 77.4s | 3.5x faster |
|
||||
| Success | 80% | 100% | +20% |
|
||||
| Tokens | 102k | 61k | 37% fewer |
|
||||
|
||||
**Key**: Good documentation replaces tool sophistication.
|
||||
|
||||
## When Reduction Works
|
||||
|
||||
**Prerequisites**: High docs quality, capable model, navigable problem
|
||||
**Avoid when**: Messy systems, specialized domain, safety-critical
|
||||
|
||||
## Description Engineering
|
||||
|
||||
Answer four questions:
|
||||
1. **What** does the tool do?
|
||||
2. **When** should it be used?
|
||||
3. **What inputs** does it accept?
|
||||
4. **What** does it return?
|
||||
|
||||
### Good Example
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "get_customer",
|
||||
"description": "Retrieve customer profile by ID. Use for order processing, support. Returns 404 if not found.",
|
||||
"parameters": {
|
||||
"customer_id": {"type": "string", "pattern": "^CUST-[0-9]{6}$"},
|
||||
"format": {"enum": ["concise", "detailed"]}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Poor Example
|
||||
|
||||
```json
|
||||
{"name": "search", "description": "Search for things", "parameters": {"q": {}}}
|
||||
```
|
||||
|
||||
## Error Messages
|
||||
|
||||
```python
|
||||
def format_error(code, message, resolution):
|
||||
return {
|
||||
"error": {"code": code, "message": message,
|
||||
"resolution": resolution, "retryable": code in RETRYABLE}
|
||||
}
|
||||
# "Use YYYY-MM-DD format, e.g., '2024-01-05'"
|
||||
```
|
||||
|
||||
## Response Formats
|
||||
|
||||
Offer concise vs detailed:
|
||||
|
||||
```python
|
||||
def get_data(id, format="concise"):
|
||||
if format == "concise":
|
||||
return {"name": data.name}
|
||||
return data.full() # Detailed
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Consolidate tools (target 10-20)
|
||||
2. Answer all four questions
|
||||
3. Use full parameter names
|
||||
4. Design errors for recovery
|
||||
5. Offer concise/detailed formats
|
||||
6. Test with agents before deploy
|
||||
7. Start minimal, add when proven
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Fundamentals](./context-fundamentals.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,349 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Compression Evaluator - Evaluate compression quality with probe-based testing.
|
||||
|
||||
Usage:
|
||||
python compression_evaluator.py evaluate <original_file> <compressed_file>
|
||||
python compression_evaluator.py generate-probes <context_file>
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import Optional
|
||||
|
||||
MAX_FILE_SIZE_MB = 100
|
||||
|
||||
|
||||
def load_file(path: str, as_json: bool = True):
|
||||
"""Load file with proper error handling and size validation."""
|
||||
try:
|
||||
size_mb = os.path.getsize(path) / (1024 * 1024)
|
||||
if size_mb > MAX_FILE_SIZE_MB:
|
||||
print(f"Error: File too large ({size_mb:.1f}MB). Max {MAX_FILE_SIZE_MB}MB", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
with open(path, encoding='utf-8') as f:
|
||||
return json.load(f) if as_json else f.read()
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File not found: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except PermissionError:
|
||||
print(f"Error: Permission denied: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error: Invalid JSON in {path}: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class ProbeType(Enum):
|
||||
RECALL = "recall" # Factual retention
|
||||
ARTIFACT = "artifact" # File tracking
|
||||
CONTINUATION = "continuation" # Task planning
|
||||
DECISION = "decision" # Reasoning chains
|
||||
|
||||
|
||||
@dataclass
|
||||
class Probe:
|
||||
type: ProbeType
|
||||
question: str
|
||||
ground_truth: str
|
||||
context_reference: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class ProbeResult:
|
||||
probe: Probe
|
||||
response: str
|
||||
scores: dict
|
||||
overall_score: float
|
||||
|
||||
|
||||
@dataclass
|
||||
class EvaluationReport:
|
||||
compression_ratio: float
|
||||
quality_score: float
|
||||
dimension_scores: dict
|
||||
probe_results: list
|
||||
recommendations: list = field(default_factory=list)
|
||||
|
||||
|
||||
# Six evaluation dimensions with weights
|
||||
DIMENSIONS = {
|
||||
"accuracy": {"weight": 0.20, "description": "Technical correctness"},
|
||||
"context_awareness": {"weight": 0.15, "description": "Conversation state"},
|
||||
"artifact_trail": {"weight": 0.20, "description": "File tracking"},
|
||||
"completeness": {"weight": 0.20, "description": "Coverage and depth"},
|
||||
"continuity": {"weight": 0.15, "description": "Work continuation"},
|
||||
"instruction_following": {"weight": 0.10, "description": "Constraint adherence"}
|
||||
}
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
"""Estimate token count."""
|
||||
return len(text) // 4
|
||||
|
||||
|
||||
def extract_facts(messages: list) -> list:
|
||||
"""Extract factual statements that can be probed."""
|
||||
facts = []
|
||||
patterns = [
|
||||
(r"error[:\s]+([^.]+)", "error"),
|
||||
(r"next step[s]?[:\s]+([^.]+)", "next_step"),
|
||||
(r"decided to\s+([^.]+)", "decision"),
|
||||
(r"implemented\s+([^.]+)", "implementation"),
|
||||
(r"found that\s+([^.]+)", "finding")
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern, fact_type in patterns:
|
||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
||||
for match in matches:
|
||||
facts.append({"type": fact_type, "content": match.strip()})
|
||||
return facts
|
||||
|
||||
|
||||
def extract_files(messages: list) -> list:
|
||||
"""Extract file references."""
|
||||
files = []
|
||||
patterns = [
|
||||
r"(?:created|modified|updated|edited|read)\s+[`'\"]?([a-zA-Z0-9_/.-]+\.[a-zA-Z]+)[`'\"]?",
|
||||
r"file[:\s]+[`'\"]?([a-zA-Z0-9_/.-]+\.[a-zA-Z]+)[`'\"]?"
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern in patterns:
|
||||
matches = re.findall(pattern, content)
|
||||
files.extend(matches)
|
||||
return list(set(files))
|
||||
|
||||
|
||||
def extract_decisions(messages: list) -> list:
|
||||
"""Extract decision points."""
|
||||
decisions = []
|
||||
patterns = [
|
||||
r"chose\s+([^.]+)\s+(?:because|since|over)",
|
||||
r"decided\s+(?:to\s+)?([^.]+)",
|
||||
r"went with\s+([^.]+)"
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern in patterns:
|
||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
||||
decisions.extend(matches)
|
||||
return decisions
|
||||
|
||||
|
||||
def generate_probes(messages: list) -> list:
|
||||
"""Generate probe set for evaluation."""
|
||||
probes = []
|
||||
|
||||
# Recall probes from facts
|
||||
facts = extract_facts(messages)
|
||||
for fact in facts[:3]: # Limit to 3 recall probes
|
||||
probes.append(Probe(
|
||||
type=ProbeType.RECALL,
|
||||
question=f"What was the {fact['type'].replace('_', ' ')}?",
|
||||
ground_truth=fact["content"]
|
||||
))
|
||||
|
||||
# Artifact probes from files
|
||||
files = extract_files(messages)
|
||||
if files:
|
||||
probes.append(Probe(
|
||||
type=ProbeType.ARTIFACT,
|
||||
question="Which files have been modified or created?",
|
||||
ground_truth=", ".join(files)
|
||||
))
|
||||
|
||||
# Continuation probe
|
||||
probes.append(Probe(
|
||||
type=ProbeType.CONTINUATION,
|
||||
question="What should be done next?",
|
||||
ground_truth="[Extracted from context]" # Would need LLM to generate
|
||||
))
|
||||
|
||||
# Decision probes
|
||||
decisions = extract_decisions(messages)
|
||||
for decision in decisions[:2]: # Limit to 2 decision probes
|
||||
probes.append(Probe(
|
||||
type=ProbeType.DECISION,
|
||||
question=f"Why was the decision made to {decision[:50]}...?",
|
||||
ground_truth=decision
|
||||
))
|
||||
|
||||
return probes
|
||||
|
||||
|
||||
def evaluate_response(probe: Probe, response: str) -> dict:
|
||||
"""
|
||||
Evaluate response against probe.
|
||||
Note: Production should use LLM-as-Judge.
|
||||
"""
|
||||
scores = {}
|
||||
response_lower = response.lower()
|
||||
ground_truth_lower = probe.ground_truth.lower()
|
||||
|
||||
# Heuristic scoring (replace with LLM evaluation in production)
|
||||
# Check for ground truth presence
|
||||
if ground_truth_lower in response_lower:
|
||||
base_score = 1.0
|
||||
elif any(word in response_lower for word in ground_truth_lower.split()[:3]):
|
||||
base_score = 0.6
|
||||
else:
|
||||
base_score = 0.3
|
||||
|
||||
# Adjust based on probe type
|
||||
if probe.type == ProbeType.ARTIFACT:
|
||||
# Check file mentions
|
||||
files_mentioned = len(re.findall(r'\.[a-z]+', response_lower))
|
||||
scores["artifact_trail"] = min(1.0, base_score + files_mentioned * 0.1)
|
||||
scores["accuracy"] = base_score
|
||||
elif probe.type == ProbeType.RECALL:
|
||||
scores["accuracy"] = base_score
|
||||
scores["completeness"] = base_score
|
||||
elif probe.type == ProbeType.CONTINUATION:
|
||||
scores["continuity"] = base_score
|
||||
scores["context_awareness"] = base_score
|
||||
elif probe.type == ProbeType.DECISION:
|
||||
scores["accuracy"] = base_score
|
||||
scores["context_awareness"] = base_score
|
||||
|
||||
return scores
|
||||
|
||||
|
||||
def calculate_compression_ratio(original: str, compressed: str) -> float:
|
||||
"""Calculate compression ratio."""
|
||||
original_tokens = estimate_tokens(original)
|
||||
compressed_tokens = estimate_tokens(compressed)
|
||||
if original_tokens == 0:
|
||||
return 0.0
|
||||
return 1.0 - (compressed_tokens / original_tokens)
|
||||
|
||||
|
||||
def evaluate_compression(original_messages: list, compressed_text: str,
|
||||
probes: Optional[list] = None) -> EvaluationReport:
|
||||
"""
|
||||
Evaluate compression quality.
|
||||
|
||||
Args:
|
||||
original_messages: Original context messages
|
||||
compressed_text: Compressed summary
|
||||
probes: Optional pre-generated probes
|
||||
|
||||
Returns:
|
||||
EvaluationReport with scores and recommendations
|
||||
"""
|
||||
# Generate probes if not provided
|
||||
if probes is None:
|
||||
probes = generate_probes(original_messages)
|
||||
|
||||
# Calculate compression ratio
|
||||
original_text = json.dumps(original_messages)
|
||||
compression_ratio = calculate_compression_ratio(original_text, compressed_text)
|
||||
|
||||
# Evaluate each probe (simulated - production uses LLM)
|
||||
probe_results = []
|
||||
dimension_scores = {dim: [] for dim in DIMENSIONS}
|
||||
|
||||
for probe in probes:
|
||||
# In production, send compressed_text + probe.question to LLM
|
||||
# Here we simulate with heuristic check
|
||||
scores = evaluate_response(probe, compressed_text)
|
||||
|
||||
overall = sum(scores.values()) / len(scores) if scores else 0
|
||||
probe_results.append(ProbeResult(
|
||||
probe=probe,
|
||||
response="[Would be LLM response]",
|
||||
scores=scores,
|
||||
overall_score=overall
|
||||
))
|
||||
|
||||
# Aggregate by dimension
|
||||
for dim, score in scores.items():
|
||||
if dim in dimension_scores:
|
||||
dimension_scores[dim].append(score)
|
||||
|
||||
# Calculate dimension averages
|
||||
avg_dimensions = {}
|
||||
for dim, scores in dimension_scores.items():
|
||||
avg_dimensions[dim] = sum(scores) / len(scores) if scores else 0.5
|
||||
|
||||
# Calculate weighted quality score
|
||||
quality_score = sum(
|
||||
avg_dimensions.get(dim, 0.5) * info["weight"]
|
||||
for dim, info in DIMENSIONS.items()
|
||||
)
|
||||
|
||||
# Generate recommendations
|
||||
recommendations = []
|
||||
if compression_ratio > 0.99:
|
||||
recommendations.append("Very high compression. Risk of information loss.")
|
||||
if avg_dimensions.get("artifact_trail", 1) < 0.5:
|
||||
recommendations.append("Artifact tracking weak. Add explicit file section to summary.")
|
||||
if avg_dimensions.get("continuity", 1) < 0.5:
|
||||
recommendations.append("Continuity low. Add 'Next Steps' section to summary.")
|
||||
if quality_score < 0.6:
|
||||
recommendations.append("Quality below threshold. Consider less aggressive compression.")
|
||||
|
||||
return EvaluationReport(
|
||||
compression_ratio=compression_ratio,
|
||||
quality_score=quality_score,
|
||||
dimension_scores=avg_dimensions,
|
||||
probe_results=probe_results,
|
||||
recommendations=recommendations
|
||||
)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Compression quality evaluator")
|
||||
subparsers = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
# Evaluate command
|
||||
eval_parser = subparsers.add_parser("evaluate", help="Evaluate compression quality")
|
||||
eval_parser.add_argument("original_file", help="JSON file with original messages")
|
||||
eval_parser.add_argument("compressed_file", help="Text file with compressed summary")
|
||||
|
||||
# Generate probes command
|
||||
probe_parser = subparsers.add_parser("generate-probes", help="Generate evaluation probes")
|
||||
probe_parser.add_argument("context_file", help="JSON file with context messages")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.command == "evaluate":
|
||||
original = load_file(args.original_file, as_json=True)
|
||||
messages = original if isinstance(original, list) else original.get("messages", [])
|
||||
compressed = load_file(args.compressed_file, as_json=False)
|
||||
|
||||
report = evaluate_compression(messages, compressed)
|
||||
print(json.dumps({
|
||||
"compression_ratio": f"{report.compression_ratio:.1%}",
|
||||
"quality_score": f"{report.quality_score:.2f}",
|
||||
"dimension_scores": {k: f"{v:.2f}" for k, v in report.dimension_scores.items()},
|
||||
"probe_count": len(report.probe_results),
|
||||
"recommendations": report.recommendations
|
||||
}, indent=2))
|
||||
|
||||
elif args.command == "generate-probes":
|
||||
data = load_file(args.context_file, as_json=True)
|
||||
messages = data if isinstance(data, list) else data.get("messages", [])
|
||||
|
||||
probes = generate_probes(messages)
|
||||
output = []
|
||||
for probe in probes:
|
||||
output.append({
|
||||
"type": probe.type.value,
|
||||
"question": probe.question,
|
||||
"ground_truth": probe.ground_truth
|
||||
})
|
||||
print(json.dumps(output, indent=2))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
317
.opencode/skills/context-engineering/scripts/context_analyzer.py
Normal file
317
.opencode/skills/context-engineering/scripts/context_analyzer.py
Normal file
@@ -0,0 +1,317 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Context Analyzer - Health analysis and degradation detection for agent contexts.
|
||||
|
||||
Usage:
|
||||
python context_analyzer.py analyze <context_file>
|
||||
python context_analyzer.py budget --system 2000 --tools 1500 --docs 3000 --history 5000
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import Optional
|
||||
|
||||
MAX_FILE_SIZE_MB = 100
|
||||
|
||||
|
||||
def load_json_file(path: str):
|
||||
"""Load JSON file with proper error handling and size validation."""
|
||||
try:
|
||||
size_mb = os.path.getsize(path) / (1024 * 1024)
|
||||
if size_mb > MAX_FILE_SIZE_MB:
|
||||
print(f"Error: File too large ({size_mb:.1f}MB). Max {MAX_FILE_SIZE_MB}MB", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
with open(path, encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File not found: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except PermissionError:
|
||||
print(f"Error: Permission denied: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error: Invalid JSON in {path}: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class HealthStatus(Enum):
|
||||
HEALTHY = "healthy"
|
||||
WARNING = "warning"
|
||||
DEGRADED = "degraded"
|
||||
CRITICAL = "critical"
|
||||
|
||||
|
||||
@dataclass
|
||||
class ContextAnalysis:
|
||||
total_tokens: int
|
||||
token_limit: int
|
||||
utilization: float
|
||||
health_status: HealthStatus
|
||||
health_score: float
|
||||
degradation_risk: float
|
||||
poisoning_risk: float
|
||||
recommendations: list = field(default_factory=list)
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
"""Estimate token count (~4 chars per token for English)."""
|
||||
return len(text) // 4
|
||||
|
||||
|
||||
def estimate_message_tokens(messages: list) -> int:
|
||||
"""Estimate tokens in message list."""
|
||||
total = 0
|
||||
for msg in messages:
|
||||
if isinstance(msg, dict):
|
||||
content = msg.get("content", "")
|
||||
total += estimate_tokens(str(content))
|
||||
# Add overhead for role, metadata
|
||||
total += 10
|
||||
else:
|
||||
total += estimate_tokens(str(msg))
|
||||
return total
|
||||
|
||||
|
||||
def measure_attention_distribution(context_length: int, sample_size: int = 100) -> list:
|
||||
"""
|
||||
Simulate U-shaped attention distribution.
|
||||
Real implementation would extract from model attention weights.
|
||||
"""
|
||||
attention = []
|
||||
for i in range(sample_size):
|
||||
position = i / sample_size
|
||||
# U-shaped curve: high at start/end, low in middle
|
||||
if position < 0.1:
|
||||
score = 0.9 - position * 2
|
||||
elif position > 0.9:
|
||||
score = 0.7 + (position - 0.9) * 2
|
||||
else:
|
||||
score = 0.3 + 0.1 * math.sin(position * math.pi)
|
||||
attention.append(score)
|
||||
return attention
|
||||
|
||||
|
||||
def detect_lost_in_middle(messages: list, critical_keywords: list) -> list:
|
||||
"""Identify critical items in attention-degraded regions."""
|
||||
if not messages:
|
||||
return []
|
||||
|
||||
total = len(messages)
|
||||
warnings = []
|
||||
|
||||
for i, msg in enumerate(messages):
|
||||
position = i / total
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
|
||||
# Middle region (10%-90%)
|
||||
if 0.1 < position < 0.9:
|
||||
for keyword in critical_keywords:
|
||||
if keyword.lower() in content.lower():
|
||||
warnings.append({
|
||||
"position": i,
|
||||
"position_pct": f"{position:.1%}",
|
||||
"keyword": keyword,
|
||||
"risk": "high" if 0.3 < position < 0.7 else "medium"
|
||||
})
|
||||
return warnings
|
||||
|
||||
|
||||
def detect_poisoning_patterns(messages: list) -> dict:
|
||||
"""Detect potential context poisoning indicators."""
|
||||
error_patterns = [
|
||||
r"error", r"failed", r"exception", r"cannot", r"unable",
|
||||
r"invalid", r"not found", r"undefined", r"null"
|
||||
]
|
||||
# Simple contradiction check - look for both positive and negative statements
|
||||
contradiction_keywords = [
|
||||
("is correct", "is not correct"),
|
||||
("should work", "should not work"),
|
||||
("will succeed", "will fail"),
|
||||
("is valid", "is invalid"),
|
||||
]
|
||||
|
||||
errors_found = []
|
||||
contradictions = []
|
||||
|
||||
for i, msg in enumerate(messages):
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg).lower()
|
||||
|
||||
# Check error patterns
|
||||
for pattern in error_patterns:
|
||||
if re.search(pattern, content):
|
||||
errors_found.append({"position": i, "pattern": pattern})
|
||||
|
||||
# Check for contradiction keywords (simplified)
|
||||
for pos_phrase, neg_phrase in contradiction_keywords:
|
||||
if pos_phrase in content and neg_phrase in content:
|
||||
contradictions.append({"position": i, "type": "self-contradiction"})
|
||||
|
||||
total = max(len(messages), 1)
|
||||
return {
|
||||
"error_density": len(errors_found) / total,
|
||||
"contradiction_count": len(contradictions),
|
||||
"poisoning_risk": min(1.0, (len(errors_found) * 0.1 + len(contradictions) * 0.3))
|
||||
}
|
||||
|
||||
|
||||
def calculate_health_score(utilization: float, degradation_risk: float, poisoning_risk: float) -> float:
|
||||
"""
|
||||
Calculate composite health score.
|
||||
1.0 = healthy, 0.0 = critical
|
||||
"""
|
||||
score = 1.0
|
||||
# Utilization penalty (kicks in after 70%)
|
||||
if utilization > 0.7:
|
||||
score -= (utilization - 0.7) * 1.5
|
||||
# Degradation penalty
|
||||
score -= degradation_risk * 0.3
|
||||
# Poisoning penalty
|
||||
score -= poisoning_risk * 0.2
|
||||
return max(0.0, min(1.0, score))


def get_health_status(score: float) -> HealthStatus:
    """Map health score to status."""
    if score > 0.8:
        return HealthStatus.HEALTHY
    elif score > 0.6:
        return HealthStatus.WARNING
    elif score > 0.4:
        return HealthStatus.DEGRADED
    return HealthStatus.CRITICAL


def analyze_context(messages: list, token_limit: int = 128000,
                    critical_keywords: Optional[list] = None) -> ContextAnalysis:
    """
    Comprehensive context health analysis.

    Args:
        messages: List of context messages
        token_limit: Model's context window size
        critical_keywords: Keywords that should be at attention-favored positions

    Returns:
        ContextAnalysis with health metrics and recommendations
    """
    critical_keywords = critical_keywords or ["goal", "task", "important", "critical", "must"]

    # Calculate token utilization
    total_tokens = estimate_message_tokens(messages)
    utilization = total_tokens / token_limit

    # Check for lost-in-middle issues
    middle_warnings = detect_lost_in_middle(messages, critical_keywords)
    degradation_risk = min(1.0, len(middle_warnings) * 0.2)

    # Check for poisoning
    poisoning = detect_poisoning_patterns(messages)
    poisoning_risk = poisoning["poisoning_risk"]

    # Calculate health
    health_score = calculate_health_score(utilization, degradation_risk, poisoning_risk)
    health_status = get_health_status(health_score)

    # Generate recommendations
    recommendations = []
    if utilization > 0.8:
        recommendations.append("URGENT: Context utilization >80%. Trigger compaction immediately.")
    elif utilization > 0.7:
        recommendations.append("WARNING: Context utilization >70%. Plan for compaction.")

    if middle_warnings:
        recommendations.append(f"Found {len(middle_warnings)} critical items in middle region. "
                               "Consider moving to beginning/end.")

    if poisoning_risk > 0.3:
        recommendations.append("High poisoning risk detected. Review recent tool outputs for errors.")

    if health_status == HealthStatus.CRITICAL:
        recommendations.append("CRITICAL: Consider context reset with clean state.")

    return ContextAnalysis(
        total_tokens=total_tokens,
        token_limit=token_limit,
        utilization=utilization,
        health_status=health_status,
        health_score=health_score,
        degradation_risk=degradation_risk,
        poisoning_risk=poisoning_risk,
        recommendations=recommendations
    )
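
# analyze_context, illustrative call (the single message below is made up):
#   report = analyze_context([{"role": "user", "content": "goal: migrate billing"}],
#                            token_limit=200000)
#   report.health_status    -> HealthStatus.HEALTHY  (tiny context, no thresholds tripped)
#   report.recommendations  -> []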


def calculate_budget(system: int, tools: int, docs: int, history: int,
                     buffer_pct: float = 0.15) -> dict:
    """Calculate context budget allocation."""
    subtotal = system + tools + docs + history
    buffer = int(subtotal * buffer_pct)
    total = subtotal + buffer

    return {
        "allocation": {
            "system_prompt": system,
            "tool_definitions": tools,
            "retrieved_docs": docs,
            "message_history": history,
            "reserved_buffer": buffer
        },
        "total_budget": total,
        "warning_threshold": int(total * 0.7),
        "critical_threshold": int(total * 0.8),
        "recommendations": [
            f"Trigger compaction at {int(total * 0.7):,} tokens",
            f"Aggressive optimization at {int(total * 0.8):,} tokens",
            f"Reserved {buffer:,} tokens ({buffer_pct:.0%}) for responses"
        ]
    }
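
# calculate_budget, worked example with the CLI defaults (2000/1500/3000/5000, 15% buffer):
#   subtotal 11,500 + buffer 1,725 = total_budget 13,225
#   warning_threshold 9,257 tokens (70%), critical_threshold 10,580 tokens (80%)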


def main():
    parser = argparse.ArgumentParser(description="Context health analyzer")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Analyze command
    analyze_parser = subparsers.add_parser("analyze", help="Analyze context health")
    analyze_parser.add_argument("context_file", help="JSON file with messages array")
    analyze_parser.add_argument("--limit", type=int, default=128000, help="Token limit")
    analyze_parser.add_argument("--keywords", nargs="+", help="Critical keywords to track")

    # Budget command
    budget_parser = subparsers.add_parser("budget", help="Calculate context budget")
    budget_parser.add_argument("--system", type=int, default=2000, help="System prompt tokens")
    budget_parser.add_argument("--tools", type=int, default=1500, help="Tool definitions tokens")
    budget_parser.add_argument("--docs", type=int, default=3000, help="Retrieved docs tokens")
    budget_parser.add_argument("--history", type=int, default=5000, help="Message history tokens")
    budget_parser.add_argument("--buffer", type=float, default=0.15, help="Buffer percentage")

    args = parser.parse_args()

    if args.command == "analyze":
        data = load_json_file(args.context_file)
        messages = data if isinstance(data, list) else data.get("messages", [])
        result = analyze_context(messages, args.limit, args.keywords)
        print(json.dumps({
            "total_tokens": result.total_tokens,
            "token_limit": result.token_limit,
            "utilization": f"{result.utilization:.1%}",
            "health_status": result.health_status.value,
            "health_score": f"{result.health_score:.2f}",
            "degradation_risk": f"{result.degradation_risk:.2f}",
            "poisoning_risk": f"{result.poisoning_risk:.2f}",
            "recommendations": result.recommendations
        }, indent=2))

    elif args.command == "budget":
        result = calculate_budget(args.system, args.tools, args.docs, args.history, args.buffer)
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
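
# Illustrative invocations (file name and token values are placeholders):
#   python context_analyzer.py analyze session.json --limit 200000 --keywords goal task
#   python context_analyzer.py budget --system 2500 --history 8000
# "analyze" prints the health report as JSON; "budget" prints the allocation plan.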

@@ -0,0 +1,246 @@
"""Tests for context-engineering edge case handling.

Tests the error handling improvements in compression_evaluator.py and context_analyzer.py:
- File not found
- Permission denied
- Invalid JSON
- File too large
- UTF-8 encoding
"""

import json
import os
import stat
import subprocess
import sys
import tempfile
from pathlib import Path

import pytest

SCRIPTS_DIR = Path(__file__).parent.parent
PYTHON = sys.executable


class TestCompressionEvaluatorEdgeCases:
    """Test edge cases in compression_evaluator.py"""

    @pytest.fixture
    def valid_json_file(self, tmp_path):
        """Create valid JSON file."""
        f = tmp_path / "valid.json"
        f.write_text('{"messages": [{"role": "user", "content": "hello"}]}', encoding='utf-8')
        return str(f)

    @pytest.fixture
    def valid_text_file(self, tmp_path):
        """Create valid text file."""
        f = tmp_path / "compressed.txt"
        f.write_text("Summary of conversation", encoding='utf-8')
        return str(f)

    def run_script(self, *args, timeout=30):
        """Run compression_evaluator.py with args."""
        cmd = [PYTHON, str(SCRIPTS_DIR / "compression_evaluator.py")] + list(args)
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result

    def test_missing_file_exits_1(self, tmp_path):
        """Test exit code 1 when file not found."""
        result = self.run_script("evaluate", "/nonexistent/file.json", str(tmp_path / "c.txt"))
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_missing_file_error_message(self, tmp_path):
        """Test error message format for missing file."""
        missing = "/this/path/does/not/exist/file.json"
        result = self.run_script("evaluate", missing, str(tmp_path / "c.txt"))
        assert result.returncode == 1
        assert missing in result.stderr or "not found" in result.stderr.lower()

    def test_invalid_json_exits_1(self, tmp_path, valid_text_file):
        """Test exit code 1 when JSON is invalid."""
        bad_json = tmp_path / "bad.json"
        bad_json.write_text("{invalid json content", encoding='utf-8')

        result = self.run_script("evaluate", str(bad_json), valid_text_file)
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_valid_files_succeed(self, valid_json_file, valid_text_file):
        """Test success with valid inputs."""
        result = self.run_script("evaluate", valid_json_file, valid_text_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert "compression_ratio" in output
        assert "quality_score" in output

    def test_generate_probes_missing_file(self):
        """Test generate-probes with missing file."""
        result = self.run_script("generate-probes", "/nonexistent/context.json")
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_generate_probes_invalid_json(self, tmp_path):
        """Test generate-probes with invalid JSON."""
        bad = tmp_path / "bad.json"
        bad.write_text("not valid json {{{", encoding='utf-8')

        result = self.run_script("generate-probes", str(bad))
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_generate_probes_success(self, valid_json_file):
        """Test generate-probes with valid file."""
        result = self.run_script("generate-probes", valid_json_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert isinstance(output, list)

    def test_utf8_content(self, tmp_path):
        """Test UTF-8 encoding with special characters."""
        utf8_file = tmp_path / "utf8.json"
        content = {"messages": [{"role": "user", "content": "日本語テスト émojis 🎉"}]}
        utf8_file.write_text(json.dumps(content), encoding='utf-8')

        compressed = tmp_path / "compressed.txt"
        compressed.write_text("Summary with 日本語 and émojis 🎉", encoding='utf-8')

        result = self.run_script("evaluate", str(utf8_file), str(compressed))
        assert result.returncode == 0

    @pytest.mark.skipif(os.name == 'nt', reason="Permission test not reliable on Windows")
    def test_permission_denied(self, tmp_path):
        """Test permission denied error."""
        protected = tmp_path / "protected.json"
        protected.write_text('{"messages": []}', encoding='utf-8')
        os.chmod(protected, 0o000)

        try:
            result = self.run_script("generate-probes", str(protected))
            assert result.returncode == 1
            assert "Permission denied" in result.stderr or "permission" in result.stderr.lower()
        finally:
            os.chmod(protected, stat.S_IRUSR | stat.S_IWUSR)


class TestContextAnalyzerEdgeCases:
    """Test edge cases in context_analyzer.py"""

    @pytest.fixture
    def valid_context_file(self, tmp_path):
        """Create valid context file."""
        f = tmp_path / "context.json"
        content = {
            "messages": [
                {"role": "user", "content": "implement feature X"},
                {"role": "assistant", "content": "I'll help with that"}
            ]
        }
        f.write_text(json.dumps(content), encoding='utf-8')
        return str(f)

    def run_script(self, *args, timeout=30):
        """Run context_analyzer.py with args."""
        cmd = [PYTHON, str(SCRIPTS_DIR / "context_analyzer.py")] + list(args)
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result

    def test_missing_file_exits_1(self):
        """Test exit code 1 when file not found."""
        result = self.run_script("analyze", "/nonexistent/context.json")
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_invalid_json_exits_1(self, tmp_path):
        """Test exit code 1 when JSON is invalid."""
        bad = tmp_path / "bad.json"
        bad.write_text("not json", encoding='utf-8')

        result = self.run_script("analyze", str(bad))
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_valid_file_succeeds(self, valid_context_file):
        """Test success with valid input."""
        result = self.run_script("analyze", valid_context_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert "health_status" in output or "health_score" in output

    def test_utf8_content(self, tmp_path):
        """Test UTF-8 encoding with international characters."""
        utf8_file = tmp_path / "utf8.json"
        content = {
            "messages": [
                {"role": "user", "content": "日本語で説明してください"},
                {"role": "assistant", "content": "はい、説明します。émojis: 🎉🚀"}
            ]
        }
        utf8_file.write_text(json.dumps(content, ensure_ascii=False), encoding='utf-8')

        result = self.run_script("analyze", str(utf8_file))
        assert result.returncode == 0

    def test_empty_messages_array(self, tmp_path):
        """Test handling of empty messages array."""
        f = tmp_path / "empty.json"
        f.write_text('{"messages": []}', encoding='utf-8')

        result = self.run_script("analyze", str(f))
        assert result.returncode == 0

    def test_direct_messages_list(self, tmp_path):
        """Test handling of direct messages list (no wrapper)."""
        f = tmp_path / "direct.json"
        content = [
            {"role": "user", "content": "hello"},
            {"role": "assistant", "content": "hi"}
        ]
        f.write_text(json.dumps(content), encoding='utf-8')

        result = self.run_script("analyze", str(f))
        assert result.returncode == 0

    @pytest.mark.skipif(os.name == 'nt', reason="Permission test not reliable on Windows")
    def test_permission_denied(self, tmp_path):
        """Test permission denied error."""
        protected = tmp_path / "protected.json"
        protected.write_text('{"messages": []}', encoding='utf-8')
        os.chmod(protected, 0o000)

        try:
            result = self.run_script("analyze", str(protected))
            assert result.returncode == 1
            assert "Permission denied" in result.stderr or "permission" in result.stderr.lower()
        finally:
            os.chmod(protected, stat.S_IRUSR | stat.S_IWUSR)

    def test_with_keywords_filter(self, valid_context_file):
        """Test analyze with keywords filter."""
        result = self.run_script("analyze", valid_context_file, "--keywords", "feature,implement")
        assert result.returncode == 0

    def test_with_limit(self, valid_context_file):
        """Test analyze with limit parameter."""
        result = self.run_script("analyze", valid_context_file, "--limit", "10")
        assert result.returncode == 0


class TestFileSizeValidation:
    """Test file size validation (100MB limit)."""

    def test_large_file_warning_in_code(self):
        """Verify MAX_FILE_SIZE_MB constant exists in scripts."""
        evaluator = SCRIPTS_DIR / "compression_evaluator.py"
        analyzer = SCRIPTS_DIR / "context_analyzer.py"

        eval_content = evaluator.read_text()
        analyzer_content = analyzer.read_text()

        assert "MAX_FILE_SIZE_MB = 100" in eval_content
        assert "MAX_FILE_SIZE_MB = 100" in analyzer_content


if __name__ == "__main__":
    pytest.main([__file__, "-v"])