init
This commit is contained in:
110
.opencode/skills/context-engineering/SKILL.md
Normal file
110
.opencode/skills/context-engineering/SKILL.md
Normal file
@@ -0,0 +1,110 @@
|
||||
---
|
||||
name: ck:context-engineering
|
||||
description: >-
|
||||
Check context usage limits, monitor time remaining, optimize token consumption, debug context failures.
|
||||
Use when asking about context percentage, rate limits, usage warnings, context optimization, agent architectures, memory systems.
|
||||
argument-hint: "[topic or question]"
|
||||
metadata:
|
||||
author: claudekit
|
||||
version: "1.0.0"
|
||||
---
|
||||
|
||||
# Context Engineering
|
||||
|
||||
Context engineering curates the smallest high-signal token set for LLM tasks. The goal: maximize reasoning quality while minimizing token usage.
|
||||
|
||||
## When to Activate
|
||||
|
||||
- Designing/debugging agent systems
|
||||
- Context limits constrain performance
|
||||
- Optimizing cost/latency
|
||||
- Building multi-agent coordination
|
||||
- Implementing memory systems
|
||||
- Evaluating agent performance
|
||||
- Developing LLM-powered pipelines
|
||||
|
||||
## Core Principles
|
||||
|
||||
1. **Context quality > quantity** - High-signal tokens beat exhaustive content
|
||||
2. **Attention is finite** - U-shaped curve favors beginning/end positions
|
||||
3. **Progressive disclosure** - Load information just-in-time
|
||||
4. **Isolation prevents degradation** - Partition work across sub-agents
|
||||
5. **Measure before optimizing** - Know your baseline
|
||||
|
||||
**IMPORTANT:**
|
||||
- Sacrifice grammar for the sake of concision.
|
||||
- Ensure token efficiency while maintaining high quality.
|
||||
- Pass these rules to subagents.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Topic | When to Use | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| **Fundamentals** | Understanding context anatomy, attention mechanics | [context-fundamentals.md](./references/context-fundamentals.md) |
|
||||
| **Degradation** | Debugging failures, lost-in-middle, poisoning | [context-degradation.md](./references/context-degradation.md) |
|
||||
| **Optimization** | Compaction, masking, caching, partitioning | [context-optimization.md](./references/context-optimization.md) |
|
||||
| **Compression** | Long sessions, summarization strategies | [context-compression.md](./references/context-compression.md) |
|
||||
| **Memory** | Cross-session persistence, knowledge graphs | [memory-systems.md](./references/memory-systems.md) |
|
||||
| **Multi-Agent** | Coordination patterns, context isolation | [multi-agent-patterns.md](./references/multi-agent-patterns.md) |
|
||||
| **Evaluation** | Testing agents, LLM-as-Judge, metrics | [evaluation.md](./references/evaluation.md) |
|
||||
| **Tool Design** | Tool consolidation, description engineering | [tool-design.md](./references/tool-design.md) |
|
||||
| **Pipelines** | Project development, batch processing | [project-development.md](./references/project-development.md) |
|
||||
| **Runtime Awareness** | Usage limits, context window monitoring | [runtime-awareness.md](./references/runtime-awareness.md) |
|
||||
|
||||
## Key Metrics
|
||||
|
||||
- **Token utilization**: Warning at 70%, trigger optimization at 80%
|
||||
- **Token variance**: Explains 80% of agent performance variance
|
||||
- **Multi-agent cost**: ~15x single agent baseline
|
||||
- **Compaction target**: 50-70% reduction, <5% quality loss
|
||||
- **Cache hit target**: 70%+ for stable workloads
|
||||
|
||||
## Four-Bucket Strategy
|
||||
|
||||
1. **Write**: Save context externally (scratchpads, files)
|
||||
2. **Select**: Pull only relevant context (retrieval, filtering)
|
||||
3. **Compress**: Reduce tokens while preserving info (summarization)
|
||||
4. **Isolate**: Split across sub-agents (partitioning)
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
- Exhaustive context over curated context
|
||||
- Critical info in middle positions
|
||||
- No compaction triggers before limits
|
||||
- Single agent for parallelizable tasks
|
||||
- Tools without clear descriptions
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Place critical info at beginning/end of context
|
||||
2. Implement compaction at 70-80% utilization
|
||||
3. Use sub-agents for context isolation, not role-play
|
||||
4. Design tools with 4-question framework (what, when, inputs, returns)
|
||||
5. Optimize for tokens-per-task, not tokens-per-request
|
||||
6. Validate with probe-based evaluation
|
||||
7. Monitor KV-cache hit rates in production
|
||||
8. Start minimal, add complexity only when proven necessary
|
||||
|
||||
## Runtime Awareness
|
||||
|
||||
The system automatically injects usage awareness via PostToolUse hook:
|
||||
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Claude Usage Limits: 5h=45%, 7d=32%
|
||||
Context Window Usage: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
**Thresholds:**
|
||||
- 70%: WARNING - consider optimization/compaction
|
||||
- 90%: CRITICAL - immediate action needed
|
||||
|
||||
**Data Sources:**
|
||||
- Usage limits: Anthropic OAuth API (`https://api.anthropic.com/api/oauth/usage`)
|
||||
- Context window: Statusline temp file (`/tmp/ck-context-{session_id}.json`)
|
||||
|
||||
## Scripts
|
||||
|
||||
- [context_analyzer.py](./scripts/context_analyzer.py) - Context health analysis, degradation detection
|
||||
- [compression_evaluator.py](./scripts/compression_evaluator.py) - Compression quality evaluation
|
||||
@@ -0,0 +1,84 @@
|
||||
# Context Compression
|
||||
|
||||
Strategies for long-running sessions exceeding context windows.
|
||||
|
||||
## Core Insight
|
||||
|
||||
Optimize **tokens-per-task** (total to completion), not tokens-per-request.
|
||||
Aggressive compression causing re-fetching costs more than better retention.
|
||||
|
||||
## Compression Methods
|
||||
|
||||
| Method | Compression | Quality | Best For |
|
||||
|--------|-------------|---------|----------|
|
||||
| **Anchored Iterative** | 98.6% | 3.70/5 | Best balance |
|
||||
| **Regenerative Full** | 98.7% | 3.44/5 | Readability |
|
||||
| **Opaque** | 99.3% | 3.35/5 | Max compression |
|
||||
|
||||
## Anchored Iterative Summary Template
|
||||
|
||||
```markdown
|
||||
## Session Intent
|
||||
Original goal: [preserved]
|
||||
|
||||
## Files Modified
|
||||
- file.py: Changes made
|
||||
|
||||
## Decisions Made
|
||||
- Key decisions with rationale
|
||||
|
||||
## Current State
|
||||
Progress summary
|
||||
|
||||
## Next Steps
|
||||
1. Next action items
|
||||
```
|
||||
|
||||
**On compression**: Merge new content into existing sections, don't regenerate.
|
||||
|
||||
## Compression Triggers
|
||||
|
||||
| Strategy | Trigger | Use Case |
|
||||
|----------|---------|----------|
|
||||
| Fixed threshold | 70-80% utilization | General purpose |
|
||||
| Sliding window | Keep last N turns + summary | Conversations |
|
||||
| Task-boundary | At logical completion | Multi-step workflows |
|
||||
|
||||
## Artifact Trail Problem
|
||||
|
||||
Weakest dimension (2.2-2.5/5.0). Coding agents need explicit tracking of:
|
||||
- Files created/modified/read
|
||||
- Function/variable names, error messages
|
||||
|
||||
**Solution**: Dedicated artifact section in summary.
|
||||
|
||||
## Probe-Based Evaluation
|
||||
|
||||
| Probe Type | Tests | Example |
|
||||
|------------|-------|---------|
|
||||
| Recall | Factual retention | "What was the error?" |
|
||||
| Artifact | File tracking | "Which files modified?" |
|
||||
| Continuation | Task planning | "What next?" |
|
||||
| Decision | Reasoning chains | "Why chose X?" |
|
||||
|
||||
## Six Evaluation Dimensions
|
||||
|
||||
1. **Accuracy** - Technical correctness
|
||||
2. **Context Awareness** - Conversation state
|
||||
3. **Artifact Trail** - File tracking (universally weak)
|
||||
4. **Completeness** - Coverage depth
|
||||
5. **Continuity** - Work continuation
|
||||
6. **Instruction Following** - Constraints
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Use anchored iterative for best quality/compression
|
||||
2. Maintain explicit artifact tracking section
|
||||
3. Trigger compression at 70% utilization
|
||||
4. Merge into sections, don't regenerate
|
||||
5. Evaluate with probes, not lexical metrics
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Evaluation](./evaluation.md)
|
||||
@@ -0,0 +1,93 @@
|
||||
# Context Degradation Patterns
|
||||
|
||||
Predictable degradation as context grows. Not binary - a continuum.
|
||||
|
||||
## Degradation Patterns
|
||||
|
||||
| Pattern | Cause | Detection |
|
||||
|---------|-------|-----------|
|
||||
| **Lost-in-Middle** | U-shaped attention | Critical info recall drops 10-40% |
|
||||
| **Context Poisoning** | Errors compound via reference | Persistent hallucinations despite correction |
|
||||
| **Context Distraction** | Irrelevant info overwhelms | Single distractor degrades performance |
|
||||
| **Context Confusion** | Multiple tasks mix | Wrong tool calls, mixed requirements |
|
||||
| **Context Clash** | Contradictory info | Conflicting outputs, inconsistent reasoning |
|
||||
|
||||
## Lost-in-Middle Phenomenon
|
||||
|
||||
- Information in middle gets 10-40% lower recall
|
||||
- Models allocate massive attention to first token (BOS sink)
|
||||
- As context grows, middle tokens fail to get sufficient attention
|
||||
- **Mitigation**: Place critical info at beginning/end
|
||||
|
||||
```markdown
|
||||
[CURRENT TASK] # Beginning - high attention
|
||||
- Critical requirements
|
||||
|
||||
[DETAILED CONTEXT] # Middle - lower attention
|
||||
- Supporting details
|
||||
|
||||
[KEY FINDINGS] # End - high attention
|
||||
- Important conclusions
|
||||
```
|
||||
|
||||
## Context Poisoning
|
||||
|
||||
**Entry points**:
|
||||
1. Tool outputs with errors/unexpected formats
|
||||
2. Retrieved docs with incorrect/outdated info
|
||||
3. Model-generated summaries with hallucinations
|
||||
|
||||
**Detection symptoms**:
|
||||
- Degraded quality on previously successful tasks
|
||||
- Tool misalignment (wrong tools/parameters)
|
||||
- Persistent hallucinations
|
||||
|
||||
**Recovery**:
|
||||
- Truncate to before poisoning point
|
||||
- Explicit note + re-evaluation request
|
||||
- Restart with clean context, preserve only verified info
|
||||
|
||||
## Model Degradation Thresholds
|
||||
|
||||
| Model | Degradation Onset | Severe Degradation |
|
||||
|-------|-------------------|-------------------|
|
||||
| GPT-5.2 | ~64K tokens | ~200K tokens |
|
||||
| Claude Opus 4.5 | ~100K tokens | ~180K tokens |
|
||||
| Claude Sonnet 4.5 | ~80K tokens | ~150K tokens |
|
||||
| Gemini 3 Pro | ~500K tokens | ~800K tokens |
|
||||
|
||||
## Four-Bucket Mitigation
|
||||
|
||||
1. **Write**: Save externally (scratchpads, files)
|
||||
2. **Select**: Pull only relevant (retrieval, filtering)
|
||||
3. **Compress**: Reduce tokens (summarization)
|
||||
4. **Isolate**: Split across sub-agents (partitioning)
|
||||
|
||||
## Detection Heuristics
|
||||
|
||||
```python
|
||||
def calculate_health(utilization, degradation_risk, poisoning_risk):
|
||||
"""Health score: 1.0 = healthy, 0.0 = critical"""
|
||||
score = 1.0
|
||||
score -= utilization * 0.5 if utilization > 0.7 else 0
|
||||
score -= degradation_risk * 0.3
|
||||
score -= poisoning_risk * 0.2
|
||||
return max(0, score)
|
||||
|
||||
# Thresholds: healthy >0.8, warning >0.6, degraded >0.4, critical <=0.4
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Monitor context length vs performance correlation
|
||||
2. Place critical info at beginning/end
|
||||
3. Implement compaction before degradation
|
||||
4. Validate retrieved docs before adding
|
||||
5. Use versioning to prevent outdated clash
|
||||
6. Segment tasks to prevent confusion
|
||||
7. Design for graceful degradation
|
||||
|
||||
## Related Topics
|
||||
|
||||
- [Context Optimization](./context-optimization.md) - Mitigation techniques
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md) - Isolation strategies
|
||||
@@ -0,0 +1,75 @@
|
||||
# Context Fundamentals
|
||||
|
||||
Context = all input provided to LLM for task completion.
|
||||
|
||||
## Anatomy of Context
|
||||
|
||||
| Component | Purpose | Token Impact |
|
||||
|-----------|---------|--------------|
|
||||
| System Prompt | Identity, constraints, guidelines | Stable, cacheable |
|
||||
| Tool Definitions | Action specs with params/returns | Grows with capabilities |
|
||||
| Retrieved Docs | Domain knowledge, just-in-time | Variable, selective |
|
||||
| Message History | Conversation state, task progress | Accumulates over time |
|
||||
| Tool Outputs | Results from actions | 83.9% of typical context |
|
||||
|
||||
## Attention Mechanics
|
||||
|
||||
- **U-shaped curve**: Beginning/end get more attention than middle
|
||||
- **Attention budget**: n^2 relationships for n tokens depletes with growth
|
||||
- **Position encoding**: Interpolation allows longer sequences with degradation
|
||||
- **First-token sink**: BOS token absorbs large attention budget
|
||||
|
||||
## System Prompt Structure
|
||||
|
||||
```xml
|
||||
<BACKGROUND_INFORMATION>Domain knowledge, role definition</BACKGROUND_INFORMATION>
|
||||
<INSTRUCTIONS>Step-by-step procedures</INSTRUCTIONS>
|
||||
<TOOL_GUIDANCE>When/how to use tools</TOOL_GUIDANCE>
|
||||
<OUTPUT_DESCRIPTION>Format requirements</OUTPUT_DESCRIPTION>
|
||||
```
|
||||
|
||||
## Progressive Disclosure Levels
|
||||
|
||||
1. **Metadata** (~100 words) - Always in context
|
||||
2. **SKILL.md body** (<5k words) - When skill triggers
|
||||
3. **Bundled resources** (Unlimited) - As needed
|
||||
|
||||
## Token Budget Allocation
|
||||
|
||||
| Component | Typical Range | Notes |
|
||||
|-----------|---------------|-------|
|
||||
| System Prompt | 500-2000 | Stable, optimize once |
|
||||
| Tool Definitions | 100-500 per tool | Keep under 20 tools |
|
||||
| Retrieved Docs | 1000-5000 | Selective loading |
|
||||
| Message History | Variable | Summarize at 70% |
|
||||
| Reserved Buffer | 10-20% | For responses |
|
||||
|
||||
## Document Management
|
||||
|
||||
**Strong identifiers**: `customer_pricing_rates.json` not `data/file1.json`
|
||||
**Chunk at semantic boundaries**: Paragraphs, sections, not arbitrary lengths
|
||||
**Include metadata**: Source, date, relevance score
|
||||
|
||||
## Message History Pattern
|
||||
|
||||
```python
|
||||
# Summary injection every 20 messages
|
||||
if len(messages) % 20 == 0:
|
||||
summary = summarize_conversation(messages[-20:])
|
||||
messages.append({"role": "system", "content": f"Summary: {summary}"})
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Treat context as finite with diminishing returns
|
||||
2. Place critical info at attention-favored positions
|
||||
3. Use file-system-based access for large documents
|
||||
4. Pre-load stable content, just-in-time load dynamic
|
||||
5. Design with explicit token budgets
|
||||
6. Monitor usage, implement compaction triggers at 70-80%
|
||||
|
||||
## Related Topics
|
||||
|
||||
- [Context Degradation](./context-degradation.md) - Failure patterns
|
||||
- [Context Optimization](./context-optimization.md) - Efficiency techniques
|
||||
- [Memory Systems](./memory-systems.md) - External storage
|
||||
@@ -0,0 +1,82 @@
|
||||
# Context Optimization
|
||||
|
||||
Extend effective context capacity through strategic techniques.
|
||||
|
||||
## Four Core Strategies
|
||||
|
||||
| Strategy | Target | Reduction | When to Use |
|
||||
|----------|--------|-----------|-------------|
|
||||
| **Compaction** | Full context | 50-70% | Approaching limits |
|
||||
| **Observation Masking** | Tool outputs | 60-80% | Verbose outputs >80% |
|
||||
| **KV-Cache Optimization** | Repeated prefixes | 70%+ hit | Stable prompts |
|
||||
| **Context Partitioning** | Work distribution | N/A | Parallelizable tasks |
|
||||
|
||||
## Compaction
|
||||
|
||||
Summarize context when approaching limits.
|
||||
|
||||
**Priority**: Tool outputs → Old turns → Retrieved docs → Never: System prompt
|
||||
|
||||
```python
|
||||
if context_tokens / context_limit > 0.8:
|
||||
context = compact_context(context)
|
||||
```
|
||||
|
||||
**Preserve**: Key findings, decisions, commitments (remove supporting details)
|
||||
|
||||
## Observation Masking
|
||||
|
||||
Replace verbose tool outputs with compact references.
|
||||
|
||||
```python
|
||||
if len(observation) > max_length:
|
||||
ref_id = store_observation(observation)
|
||||
return f"[Obs:{ref_id}. Key: {extract_key(observation)}]"
|
||||
```
|
||||
|
||||
**Never mask**: Current task critical, most recent turn, active reasoning
|
||||
**Always mask**: Repeated outputs, boilerplate, already summarized
|
||||
|
||||
## KV-Cache Optimization
|
||||
|
||||
Reuse cached Key/Value tensors for identical prefixes.
|
||||
|
||||
```python
|
||||
# Cache-friendly ordering (stable first)
|
||||
context = [system_prompt, tool_definitions] # Cacheable
|
||||
context += [unique_content] # Variable last
|
||||
```
|
||||
|
||||
**Tips**: Avoid timestamps in stable sections, consistent formatting, stable structure
|
||||
|
||||
## Context Partitioning
|
||||
|
||||
Split work across sub-agents with isolated contexts.
|
||||
|
||||
```python
|
||||
result = await sub_agent.process(subtask, clean_context=True)
|
||||
coordinator.receive(result.summary) # Only essentials
|
||||
```
|
||||
|
||||
## Decision Framework
|
||||
|
||||
| Dominant Component | Apply |
|
||||
|-------------------|-------|
|
||||
| Tool outputs | Observation masking |
|
||||
| Retrieved docs | Summarization or partitioning |
|
||||
| Message history | Compaction + summarization |
|
||||
| Multiple | Combine strategies |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Measure before optimizing
|
||||
2. Apply compaction before masking
|
||||
3. Design for cache stability
|
||||
4. Partition before context problematic
|
||||
5. Monitor effectiveness over time
|
||||
6. Balance savings vs quality
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Compression](./context-compression.md)
|
||||
- [Memory Systems](./memory-systems.md)
|
||||
@@ -0,0 +1,89 @@
|
||||
# Evaluation
|
||||
|
||||
Systematically assess agent performance and context engineering choices.
|
||||
|
||||
## Key Finding: 95% Performance Variance
|
||||
|
||||
- **Token usage**: 80% of variance
|
||||
- **Tool calls**: ~10% of variance
|
||||
- **Model choice**: ~5% of variance
|
||||
|
||||
**Implication**: Token budgets matter more than model upgrades.
|
||||
|
||||
## Multi-Dimensional Rubric
|
||||
|
||||
| Dimension | Weight | Description |
|
||||
|-----------|--------|-------------|
|
||||
| Factual Accuracy | 30% | Ground truth verification |
|
||||
| Completeness | 25% | Coverage of requirements |
|
||||
| Tool Efficiency | 20% | Appropriate tool usage |
|
||||
| Citation Accuracy | 15% | Sources match claims |
|
||||
| Source Quality | 10% | Authority/credibility |
|
||||
|
||||
## Evaluation Methods
|
||||
|
||||
### LLM-as-Judge
|
||||
|
||||
Beware biases:
|
||||
- **Position**: First position preferred
|
||||
- **Length**: Longer = higher score
|
||||
- **Self-enhancement**: Rating own outputs higher
|
||||
- **Verbosity**: Detailed = better
|
||||
|
||||
**Mitigation**: Position swapping, anti-bias prompting
|
||||
|
||||
### Pairwise Comparison
|
||||
|
||||
```python
|
||||
score_ab = judge.compare(output_a, output_b)
|
||||
score_ba = judge.compare(output_b, output_a)
|
||||
consistent = (score_ab > 0.5) != (score_ba > 0.5)
|
||||
```
|
||||
|
||||
### Probe-Based Testing
|
||||
|
||||
| Probe | Tests | Example |
|
||||
|-------|-------|---------|
|
||||
| Recall | Facts | "What was the error?" |
|
||||
| Artifact | Files | "Which files modified?" |
|
||||
| Continuation | Planning | "What's next?" |
|
||||
| Decision | Reasoning | "Why chose X?" |
|
||||
|
||||
## Test Set Design
|
||||
|
||||
```python
|
||||
class TestSet:
|
||||
def sample_stratified(self, n):
|
||||
per_level = n // 3
|
||||
return (
|
||||
sample(self.simple, per_level) +
|
||||
sample(self.medium, per_level) +
|
||||
sample(self.complex, per_level)
|
||||
)
|
||||
```
|
||||
|
||||
## Production Monitoring
|
||||
|
||||
```python
|
||||
class Monitor:
|
||||
sample_rate = 0.01 # 1% sampling
|
||||
alert_threshold = 0.85
|
||||
|
||||
def check(self, scores):
|
||||
if avg(scores) < self.alert_threshold:
|
||||
self.alert(f"Quality degraded: {avg(scores):.2f}")
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Start with outcome evaluation, not step-by-step
|
||||
2. Use multi-dimensional rubrics
|
||||
3. Mitigate LLM-as-Judge biases
|
||||
4. Test with stratified complexity
|
||||
5. Implement continuous monitoring
|
||||
6. Focus on token efficiency (80% variance)
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Compression](./context-compression.md)
|
||||
- [Tool Design](./tool-design.md)
|
||||
@@ -0,0 +1,88 @@
|
||||
# Memory Systems
|
||||
|
||||
Architectures for persistent context beyond the window.
|
||||
|
||||
## Memory Layer Architecture
|
||||
|
||||
| Layer | Scope | Persistence | Use Case |
|
||||
|-------|-------|-------------|----------|
|
||||
| L1: Working | Current window | None | Active reasoning |
|
||||
| L2: Short-Term | Session | Session | Task continuity |
|
||||
| L3: Long-Term | Cross-session | Persistent | User preferences |
|
||||
| L4: Entity | Per-entity | Persistent | Consistency |
|
||||
| L5: Temporal Graph | Time-aware | Persistent | Evolving facts |
|
||||
|
||||
## Benchmark Performance (DMR Accuracy)
|
||||
|
||||
| System | Accuracy | Approach |
|
||||
|--------|----------|----------|
|
||||
| Zep | 94.8% | Temporal knowledge graphs |
|
||||
| MemGPT | 93.4% | Hierarchical memory |
|
||||
| GraphRAG | 75-85% | Knowledge graphs |
|
||||
| Vector RAG | 60-70% | Embedding similarity |
|
||||
|
||||
## Vector Store with Metadata
|
||||
|
||||
```python
|
||||
class MetadataVectorStore:
|
||||
def add(self, text, embedding, metadata):
|
||||
doc = {
|
||||
"text": text, "embedding": embedding,
|
||||
"entities": metadata.get("entities", []),
|
||||
"timestamp": metadata.get("timestamp")
|
||||
}
|
||||
self.index_by_entity(doc)
|
||||
|
||||
def search_by_entity(self, entity, k=5):
|
||||
return self.entity_index.get(entity, [])[:k]
|
||||
```
|
||||
|
||||
## Temporal Knowledge Graph
|
||||
|
||||
```python
|
||||
class TemporalKnowledgeGraph:
|
||||
def add_fact(self, subject, predicate, obj, valid_from, valid_to=None):
|
||||
self.facts.append({
|
||||
"triple": (subject, predicate, obj),
|
||||
"valid_from": valid_from,
|
||||
"valid_to": valid_to or "current"
|
||||
})
|
||||
|
||||
def query_at_time(self, subject, predicate, timestamp):
|
||||
for fact in self.facts:
|
||||
if (fact["triple"][0] == subject and
|
||||
fact["valid_from"] <= timestamp <= fact["valid_to"]):
|
||||
return fact["triple"][2]
|
||||
```
|
||||
|
||||
## Memory Retrieval Patterns
|
||||
|
||||
| Pattern | Query | Use Case |
|
||||
|---------|-------|----------|
|
||||
| Semantic | "Similar to X" | General recall |
|
||||
| Entity-based | "About user John" | Consistency |
|
||||
| Temporal | "Valid on date" | Evolving facts |
|
||||
| Hybrid | Combine above | Production |
|
||||
|
||||
## File-System-as-Memory
|
||||
|
||||
```
|
||||
memory/
|
||||
├── sessions/{id}/summary.md
|
||||
├── entities/{id}.json
|
||||
└── facts/{timestamp}_{id}.json
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Start with file-system-as-memory (simplest)
|
||||
2. Add vector search for scale
|
||||
3. Use entity indexing for consistency
|
||||
4. Add temporal awareness for evolving facts
|
||||
5. Implement consolidation for health
|
||||
6. Measure retrieval accuracy
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Fundamentals](./context-fundamentals.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,90 @@
|
||||
# Multi-Agent Patterns
|
||||
|
||||
Distribute work across multiple context windows for isolation and scale.
|
||||
|
||||
## Core Insight
|
||||
|
||||
Sub-agents exist to **isolate context**, not anthropomorphize roles.
|
||||
|
||||
## Token Economics
|
||||
|
||||
| Architecture | Multiplier | Use Case |
|
||||
|--------------|------------|----------|
|
||||
| Single agent | 1x | Simple tasks |
|
||||
| Single + tools | ~4x | Moderate complexity |
|
||||
| Multi-agent | ~15x | Context isolation needed |
|
||||
|
||||
**Key**: Token usage explains 80% of performance variance.
|
||||
|
||||
## Patterns
|
||||
|
||||
### Supervisor/Orchestrator
|
||||
|
||||
```python
|
||||
class Supervisor:
|
||||
def process(self, task):
|
||||
subtasks = self.decompose(task)
|
||||
results = [worker.execute(st, clean_context=True) for st in subtasks]
|
||||
return self.aggregate(results)
|
||||
```
|
||||
|
||||
**Pros**: Control, human-in-loop | **Cons**: Bottleneck, telephone game
|
||||
|
||||
### Peer-to-Peer/Swarm
|
||||
|
||||
```python
|
||||
def process_with_handoff(agent, task):
|
||||
result = agent.process(task)
|
||||
if "handoff" in result:
|
||||
return process_with_handoff(select_agent(result["to"]), result["state"])
|
||||
return result
|
||||
```
|
||||
|
||||
**Pros**: No SPOF, scales | **Cons**: Complex coordination
|
||||
|
||||
### Hierarchical
|
||||
|
||||
Strategy → Planning → Execution layers
|
||||
**Pros**: Separation of concerns | **Cons**: Coordination overhead
|
||||
|
||||
## Context Isolation Patterns
|
||||
|
||||
| Pattern | Isolation | Use Case |
|
||||
|---------|-----------|----------|
|
||||
| Full delegation | None | Max capability |
|
||||
| Instruction passing | High | Simple tasks |
|
||||
| File coordination | Medium | Shared state |
|
||||
|
||||
## Consensus Mechanisms
|
||||
|
||||
```python
|
||||
def weighted_consensus(responses):
|
||||
scores = {}
|
||||
for r in responses:
|
||||
weight = r["confidence"] * r["expertise"]
|
||||
scores[r["answer"]] = scores.get(r["answer"], 0) + weight
|
||||
return max(scores, key=scores.get)
|
||||
```
|
||||
|
||||
## Failure Recovery
|
||||
|
||||
| Failure | Mitigation |
|
||||
|---------|------------|
|
||||
| Bottleneck | Output schemas, checkpointing |
|
||||
| Overhead | Clear handoffs, batching |
|
||||
| Divergence | Boundaries, convergence checks |
|
||||
| Errors | Validation, circuit breakers |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Use multi-agent for context isolation, not role-play
|
||||
2. Accept ~15x token cost for benefits
|
||||
3. Implement circuit breakers
|
||||
4. Use files for shared state
|
||||
5. Design clear handoffs
|
||||
6. Validate between agents
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Evaluation](./evaluation.md)
|
||||
@@ -0,0 +1,97 @@
|
||||
# Project Development
|
||||
|
||||
Design and build LLM-powered projects from ideation to deployment.
|
||||
|
||||
## Task-Model Fit
|
||||
|
||||
**LLM-Suited**: Synthesis, subjective judgment, NL output, error-tolerant batches
|
||||
**LLM-Unsuited**: Precise computation, real-time, perfect accuracy, deterministic output
|
||||
|
||||
## Manual Prototype First
|
||||
|
||||
Test one example with target model before automation.
|
||||
|
||||
## Pipeline Architecture
|
||||
|
||||
```
|
||||
acquire → prepare → process → parse → render
|
||||
(fetch) (prompt) (LLM) (extract) (output)
|
||||
```
|
||||
|
||||
Stages 1,2,4,5: Deterministic, cheap | Stage 3: Non-deterministic, expensive
|
||||
|
||||
## File System as State
|
||||
|
||||
```
|
||||
data/{id}/
|
||||
├── raw.json # acquire done
|
||||
├── prompt.md # prepare done
|
||||
├── response.md # process done
|
||||
└── parsed.json # parse done
|
||||
```
|
||||
|
||||
```python
|
||||
def get_stage(id):
|
||||
if exists(f"{id}/parsed.json"): return "render"
|
||||
if exists(f"{id}/response.md"): return "parse"
|
||||
# ... check backwards
|
||||
```
|
||||
|
||||
**Benefits**: Idempotent, resumable, debuggable
|
||||
|
||||
## Structured Output
|
||||
|
||||
```markdown
|
||||
## SUMMARY
|
||||
[Overview]
|
||||
|
||||
## KEY_FINDINGS
|
||||
- Finding 1
|
||||
|
||||
## SCORE
|
||||
[1-5]
|
||||
```
|
||||
|
||||
```python
|
||||
def parse(response):
|
||||
return {
|
||||
"summary": extract_section(response, "SUMMARY"),
|
||||
"findings": extract_list(response, "KEY_FINDINGS"),
|
||||
"score": extract_int(response, "SCORE")
|
||||
}
|
||||
```
|
||||
|
||||
## Cost Estimation
|
||||
|
||||
```python
|
||||
def estimate(items, tokens_per, price_per_1k):
|
||||
return len(items) * tokens_per / 1000 * price_per_1k * 1.1 # 10% buffer
|
||||
# 1000 items × 2000 tokens × $0.01/1k = $22
|
||||
```
|
||||
|
||||
## Case Studies
|
||||
|
||||
**Karpathy HN**: 930 items, $58, 1hr, 15 workers
|
||||
**Vercel d0**: 17→2 tools, 80%→100% success, 3.5x faster
|
||||
|
||||
## Single vs Multi-Agent
|
||||
|
||||
| Factor | Single | Multi |
|
||||
|--------|--------|-------|
|
||||
| Context | Fits window | Exceeds |
|
||||
| Tasks | Sequential | Parallel |
|
||||
| Tokens | Limited | 15x OK |
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Validate manually before automating
|
||||
2. Use 5-stage pipeline
|
||||
3. Track state via files
|
||||
4. Design structured output
|
||||
5. Estimate costs first
|
||||
6. Start single, add multi when needed
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Optimization](./context-optimization.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,202 @@
|
||||
# Runtime Awareness
|
||||
|
||||
Monitor usage limits and context window utilization in real-time to optimize Claude Code sessions.
|
||||
|
||||
## Overview
|
||||
|
||||
Runtime awareness provides visibility into two critical metrics:
|
||||
1. **Usage Limits** - API quota consumption (5-hour and 7-day rolling windows)
|
||||
2. **Context Window** - Current token utilization within the 200K context limit
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────┐ ┌──────────────────────────┐
|
||||
│ statusline.cjs │───▶│ /tmp/ck-context-*.json │
|
||||
│ (writes data) │ │ (context window data) │
|
||||
└─────────────────┘ └────────────┬─────────────┘
|
||||
│
|
||||
┌────────────▼─────────────┐
|
||||
│ usage-context-hook.cjs │◀── PostToolUse
|
||||
│ - Reads context file │
|
||||
│ - Fetches usage limits │
|
||||
│ - Injects awareness │
|
||||
└──────────────────────────┘
|
||||
```
|
||||
|
||||
## Usage Limits API
|
||||
|
||||
### Endpoint
|
||||
|
||||
```
|
||||
GET https://api.anthropic.com/api/oauth/usage
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
Requires OAuth Bearer token with `anthropic-beta: oauth-2025-04-20` header.
|
||||
|
||||
### Credential Locations
|
||||
|
||||
| Platform | Method | Location |
|
||||
|----------|--------|----------|
|
||||
| macOS | Keychain | `Claude Code-credentials` |
|
||||
| Windows | File | `%USERPROFILE%\.claude\.credentials.json` |
|
||||
| Linux | File | `~/.opencode/.credentials.json` |
|
||||
|
||||
### Response Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"five_hour": {
|
||||
"utilization": 45,
|
||||
"resets_at": "2025-01-15T18:00:00Z"
|
||||
},
|
||||
"seven_day": {
|
||||
"utilization": 32,
|
||||
"resets_at": "2025-01-22T00:00:00Z"
|
||||
},
|
||||
"seven_day_sonnet": {
|
||||
"utilization": 11,
|
||||
"resets_at": "2025-01-15T09:00:00Z"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `utilization`: Already a percentage (0-100), NOT a decimal
|
||||
- `resets_at`: ISO 8601 timestamp when quota resets
|
||||
- `seven_day_sonnet`: Model-specific limit (may be null)
|
||||
|
||||
## Context Window Data
|
||||
|
||||
### Source
|
||||
|
||||
Statusline writes context data to `/tmp/ck-context-{session_id}.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"percent": 67,
|
||||
"tokens": 134000,
|
||||
"size": 200000,
|
||||
"usage": {
|
||||
"input_tokens": 80000,
|
||||
"cache_creation_input_tokens": 30000,
|
||||
"cache_read_input_tokens": 24000
|
||||
},
|
||||
"timestamp": 1705312000000
|
||||
}
|
||||
```
|
||||
|
||||
### Token Calculation
|
||||
|
||||
```
|
||||
total = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
|
||||
percent = (total + AUTOCOMPACT_BUFFER) / context_window_size * 100
|
||||
```
|
||||
|
||||
Where `AUTOCOMPACT_BUFFER = 45000` (22.5% reserved).
|
||||
|
||||
## Hook Output
|
||||
|
||||
The PostToolUse hook injects awareness data every 5 minutes:
|
||||
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=45%, 7d=32%
|
||||
Context: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
### Warning Indicators
|
||||
|
||||
| Level | Threshold | Indicator |
|
||||
|-------|-----------|-----------|
|
||||
| Normal | < 70% | Plain percentage |
|
||||
| Warning | 70-89% | `[WARNING]` |
|
||||
| Critical | ≥ 90% | `[CRITICAL]` |
|
||||
|
||||
### Examples
|
||||
|
||||
Normal state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=45%, 7d=32%
|
||||
Context: 67%
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
Warning state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=75% [WARNING], 7d=32%
|
||||
Context: 78% [WARNING - consider compaction]
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
Critical state:
|
||||
```xml
|
||||
<usage-awareness>
|
||||
Limits: 5h=92% [CRITICAL], 7d=65%
|
||||
Context: 91% [CRITICAL - compaction needed]
|
||||
</usage-awareness>
|
||||
```
|
||||
|
||||
## Recommendations by Threshold
|
||||
|
||||
### Context Window
|
||||
|
||||
| Utilization | Action |
|
||||
|-------------|--------|
|
||||
| < 70% | Continue normally |
|
||||
| 70-80% | Plan compaction strategy |
|
||||
| 80-90% | Execute compaction |
|
||||
| > 90% | Immediate compaction or session reset |
|
||||
|
||||
### Usage Limits
|
||||
|
||||
| 5-Hour | Action |
|
||||
|--------|--------|
|
||||
| < 70% | Normal usage |
|
||||
| 70-90% | Reduce parallelization, delegate to subagents |
|
||||
| > 90% | Wait for reset or use lower-tier models |
|
||||
|
||||
| 7-Day | Action |
|
||||
|-------|--------|
|
||||
| < 70% | Normal usage |
|
||||
| 70-90% | Monitor daily consumption |
|
||||
| > 90% | Limit usage to essential tasks |
|
||||
|
||||
## Configuration
|
||||
|
||||
### Hook Settings (`.opencode/settings.json`)
|
||||
|
||||
```json
|
||||
{
|
||||
"hooks": {
|
||||
"PostToolUse": [
|
||||
{
|
||||
"matcher": "*",
|
||||
"hooks": [{
|
||||
"type": "command",
|
||||
"command": "node .opencode/hooks/usage-quota-cache-refresh.cjs"
|
||||
}]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Throttling
|
||||
|
||||
- **Injection interval**: 5 minutes (300,000ms)
|
||||
- **API cache TTL**: 60 seconds
|
||||
- **Context data freshness**: 30 seconds
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Issue | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| No usage limits shown | No OAuth token | Run `claude login` |
|
||||
| Stale context data | Statusline not updating | Check statusline config |
|
||||
| 401 Unauthorized | Expired token | Re-authenticate |
|
||||
| Hook not firing | Settings misconfigured | Verify PostToolUse matcher |
|
||||
@@ -0,0 +1,86 @@
|
||||
# Tool Design
|
||||
|
||||
Design effective tools for agent systems.
|
||||
|
||||
## Consolidation Principle
|
||||
|
||||
Single comprehensive tools > multiple narrow tools. **Target**: 10-20 tools max.
|
||||
|
||||
## Architectural Reduction Evidence
|
||||
|
||||
| Metric | 17 Tools | 2 Tools | Improvement |
|
||||
|--------|----------|---------|-------------|
|
||||
| Time | 274.8s | 77.4s | 3.5x faster |
|
||||
| Success | 80% | 100% | +20% |
|
||||
| Tokens | 102k | 61k | 37% fewer |
|
||||
|
||||
**Key**: Good documentation replaces tool sophistication.
|
||||
|
||||
## When Reduction Works
|
||||
|
||||
**Prerequisites**: High docs quality, capable model, navigable problem
|
||||
**Avoid when**: Messy systems, specialized domain, safety-critical
|
||||
|
||||
## Description Engineering
|
||||
|
||||
Answer four questions:
|
||||
1. **What** does the tool do?
|
||||
2. **When** should it be used?
|
||||
3. **What inputs** does it accept?
|
||||
4. **What** does it return?
|
||||
|
||||
### Good Example
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "get_customer",
|
||||
"description": "Retrieve customer profile by ID. Use for order processing, support. Returns 404 if not found.",
|
||||
"parameters": {
|
||||
"customer_id": {"type": "string", "pattern": "^CUST-[0-9]{6}$"},
|
||||
"format": {"enum": ["concise", "detailed"]}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Poor Example
|
||||
|
||||
```json
|
||||
{"name": "search", "description": "Search for things", "parameters": {"q": {}}}
|
||||
```
|
||||
|
||||
## Error Messages
|
||||
|
||||
```python
|
||||
def format_error(code, message, resolution):
|
||||
return {
|
||||
"error": {"code": code, "message": message,
|
||||
"resolution": resolution, "retryable": code in RETRYABLE}
|
||||
}
|
||||
# "Use YYYY-MM-DD format, e.g., '2024-01-05'"
|
||||
```
|
||||
|
||||
## Response Formats
|
||||
|
||||
Offer concise vs detailed:
|
||||
|
||||
```python
|
||||
def get_data(id, format="concise"):
|
||||
if format == "concise":
|
||||
return {"name": data.name}
|
||||
return data.full() # Detailed
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
1. Consolidate tools (target 10-20)
|
||||
2. Answer all four questions
|
||||
3. Use full parameter names
|
||||
4. Design errors for recovery
|
||||
5. Offer concise/detailed formats
|
||||
6. Test with agents before deploy
|
||||
7. Start minimal, add when proven
|
||||
|
||||
## Related
|
||||
|
||||
- [Context Fundamentals](./context-fundamentals.md)
|
||||
- [Multi-Agent Patterns](./multi-agent-patterns.md)
|
||||
@@ -0,0 +1,349 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Compression Evaluator - Evaluate compression quality with probe-based testing.
|
||||
|
||||
Usage:
|
||||
python compression_evaluator.py evaluate <original_file> <compressed_file>
|
||||
python compression_evaluator.py generate-probes <context_file>
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import Optional
|
||||
|
||||
MAX_FILE_SIZE_MB = 100
|
||||
|
||||
|
||||
def load_file(path: str, as_json: bool = True):
|
||||
"""Load file with proper error handling and size validation."""
|
||||
try:
|
||||
size_mb = os.path.getsize(path) / (1024 * 1024)
|
||||
if size_mb > MAX_FILE_SIZE_MB:
|
||||
print(f"Error: File too large ({size_mb:.1f}MB). Max {MAX_FILE_SIZE_MB}MB", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
with open(path, encoding='utf-8') as f:
|
||||
return json.load(f) if as_json else f.read()
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File not found: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except PermissionError:
|
||||
print(f"Error: Permission denied: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error: Invalid JSON in {path}: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class ProbeType(Enum):
|
||||
RECALL = "recall" # Factual retention
|
||||
ARTIFACT = "artifact" # File tracking
|
||||
CONTINUATION = "continuation" # Task planning
|
||||
DECISION = "decision" # Reasoning chains
|
||||
|
||||
|
||||
@dataclass
|
||||
class Probe:
|
||||
type: ProbeType
|
||||
question: str
|
||||
ground_truth: str
|
||||
context_reference: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class ProbeResult:
|
||||
probe: Probe
|
||||
response: str
|
||||
scores: dict
|
||||
overall_score: float
|
||||
|
||||
|
||||
@dataclass
|
||||
class EvaluationReport:
|
||||
compression_ratio: float
|
||||
quality_score: float
|
||||
dimension_scores: dict
|
||||
probe_results: list
|
||||
recommendations: list = field(default_factory=list)
|
||||
|
||||
|
||||
# Six evaluation dimensions with weights
|
||||
DIMENSIONS = {
|
||||
"accuracy": {"weight": 0.20, "description": "Technical correctness"},
|
||||
"context_awareness": {"weight": 0.15, "description": "Conversation state"},
|
||||
"artifact_trail": {"weight": 0.20, "description": "File tracking"},
|
||||
"completeness": {"weight": 0.20, "description": "Coverage and depth"},
|
||||
"continuity": {"weight": 0.15, "description": "Work continuation"},
|
||||
"instruction_following": {"weight": 0.10, "description": "Constraint adherence"}
|
||||
}
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
"""Estimate token count."""
|
||||
return len(text) // 4
|
||||
|
||||
|
||||
def extract_facts(messages: list) -> list:
|
||||
"""Extract factual statements that can be probed."""
|
||||
facts = []
|
||||
patterns = [
|
||||
(r"error[:\s]+([^.]+)", "error"),
|
||||
(r"next step[s]?[:\s]+([^.]+)", "next_step"),
|
||||
(r"decided to\s+([^.]+)", "decision"),
|
||||
(r"implemented\s+([^.]+)", "implementation"),
|
||||
(r"found that\s+([^.]+)", "finding")
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern, fact_type in patterns:
|
||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
||||
for match in matches:
|
||||
facts.append({"type": fact_type, "content": match.strip()})
|
||||
return facts
|
||||
|
||||
|
||||
def extract_files(messages: list) -> list:
|
||||
"""Extract file references."""
|
||||
files = []
|
||||
patterns = [
|
||||
r"(?:created|modified|updated|edited|read)\s+[`'\"]?([a-zA-Z0-9_/.-]+\.[a-zA-Z]+)[`'\"]?",
|
||||
r"file[:\s]+[`'\"]?([a-zA-Z0-9_/.-]+\.[a-zA-Z]+)[`'\"]?"
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern in patterns:
|
||||
matches = re.findall(pattern, content)
|
||||
files.extend(matches)
|
||||
return list(set(files))
|
||||
|
||||
|
||||
def extract_decisions(messages: list) -> list:
|
||||
"""Extract decision points."""
|
||||
decisions = []
|
||||
patterns = [
|
||||
r"chose\s+([^.]+)\s+(?:because|since|over)",
|
||||
r"decided\s+(?:to\s+)?([^.]+)",
|
||||
r"went with\s+([^.]+)"
|
||||
]
|
||||
|
||||
for msg in messages:
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
for pattern in patterns:
|
||||
matches = re.findall(pattern, content, re.IGNORECASE)
|
||||
decisions.extend(matches)
|
||||
return decisions
|
||||
|
||||
|
||||
def generate_probes(messages: list) -> list:
|
||||
"""Generate probe set for evaluation."""
|
||||
probes = []
|
||||
|
||||
# Recall probes from facts
|
||||
facts = extract_facts(messages)
|
||||
for fact in facts[:3]: # Limit to 3 recall probes
|
||||
probes.append(Probe(
|
||||
type=ProbeType.RECALL,
|
||||
question=f"What was the {fact['type'].replace('_', ' ')}?",
|
||||
ground_truth=fact["content"]
|
||||
))
|
||||
|
||||
# Artifact probes from files
|
||||
files = extract_files(messages)
|
||||
if files:
|
||||
probes.append(Probe(
|
||||
type=ProbeType.ARTIFACT,
|
||||
question="Which files have been modified or created?",
|
||||
ground_truth=", ".join(files)
|
||||
))
|
||||
|
||||
# Continuation probe
|
||||
probes.append(Probe(
|
||||
type=ProbeType.CONTINUATION,
|
||||
question="What should be done next?",
|
||||
ground_truth="[Extracted from context]" # Would need LLM to generate
|
||||
))
|
||||
|
||||
# Decision probes
|
||||
decisions = extract_decisions(messages)
|
||||
for decision in decisions[:2]: # Limit to 2 decision probes
|
||||
probes.append(Probe(
|
||||
type=ProbeType.DECISION,
|
||||
question=f"Why was the decision made to {decision[:50]}...?",
|
||||
ground_truth=decision
|
||||
))
|
||||
|
||||
return probes
|
||||
|
||||
|
||||
def evaluate_response(probe: Probe, response: str) -> dict:
|
||||
"""
|
||||
Evaluate response against probe.
|
||||
Note: Production should use LLM-as-Judge.
|
||||
"""
|
||||
scores = {}
|
||||
response_lower = response.lower()
|
||||
ground_truth_lower = probe.ground_truth.lower()
|
||||
|
||||
# Heuristic scoring (replace with LLM evaluation in production)
|
||||
# Check for ground truth presence
|
||||
if ground_truth_lower in response_lower:
|
||||
base_score = 1.0
|
||||
elif any(word in response_lower for word in ground_truth_lower.split()[:3]):
|
||||
base_score = 0.6
|
||||
else:
|
||||
base_score = 0.3
|
||||
|
||||
# Adjust based on probe type
|
||||
if probe.type == ProbeType.ARTIFACT:
|
||||
# Check file mentions
|
||||
files_mentioned = len(re.findall(r'\.[a-z]+', response_lower))
|
||||
scores["artifact_trail"] = min(1.0, base_score + files_mentioned * 0.1)
|
||||
scores["accuracy"] = base_score
|
||||
elif probe.type == ProbeType.RECALL:
|
||||
scores["accuracy"] = base_score
|
||||
scores["completeness"] = base_score
|
||||
elif probe.type == ProbeType.CONTINUATION:
|
||||
scores["continuity"] = base_score
|
||||
scores["context_awareness"] = base_score
|
||||
elif probe.type == ProbeType.DECISION:
|
||||
scores["accuracy"] = base_score
|
||||
scores["context_awareness"] = base_score
|
||||
|
||||
return scores
|
||||
|
||||
|
||||
def calculate_compression_ratio(original: str, compressed: str) -> float:
|
||||
"""Calculate compression ratio."""
|
||||
original_tokens = estimate_tokens(original)
|
||||
compressed_tokens = estimate_tokens(compressed)
|
||||
if original_tokens == 0:
|
||||
return 0.0
|
||||
return 1.0 - (compressed_tokens / original_tokens)
|
||||
|
||||
|
||||
def evaluate_compression(original_messages: list, compressed_text: str,
|
||||
probes: Optional[list] = None) -> EvaluationReport:
|
||||
"""
|
||||
Evaluate compression quality.
|
||||
|
||||
Args:
|
||||
original_messages: Original context messages
|
||||
compressed_text: Compressed summary
|
||||
probes: Optional pre-generated probes
|
||||
|
||||
Returns:
|
||||
EvaluationReport with scores and recommendations
|
||||
"""
|
||||
# Generate probes if not provided
|
||||
if probes is None:
|
||||
probes = generate_probes(original_messages)
|
||||
|
||||
# Calculate compression ratio
|
||||
original_text = json.dumps(original_messages)
|
||||
compression_ratio = calculate_compression_ratio(original_text, compressed_text)
|
||||
|
||||
# Evaluate each probe (simulated - production uses LLM)
|
||||
probe_results = []
|
||||
dimension_scores = {dim: [] for dim in DIMENSIONS}
|
||||
|
||||
for probe in probes:
|
||||
# In production, send compressed_text + probe.question to LLM
|
||||
# Here we simulate with heuristic check
|
||||
scores = evaluate_response(probe, compressed_text)
|
||||
|
||||
overall = sum(scores.values()) / len(scores) if scores else 0
|
||||
probe_results.append(ProbeResult(
|
||||
probe=probe,
|
||||
response="[Would be LLM response]",
|
||||
scores=scores,
|
||||
overall_score=overall
|
||||
))
|
||||
|
||||
# Aggregate by dimension
|
||||
for dim, score in scores.items():
|
||||
if dim in dimension_scores:
|
||||
dimension_scores[dim].append(score)
|
||||
|
||||
# Calculate dimension averages
|
||||
avg_dimensions = {}
|
||||
for dim, scores in dimension_scores.items():
|
||||
avg_dimensions[dim] = sum(scores) / len(scores) if scores else 0.5
|
||||
|
||||
# Calculate weighted quality score
|
||||
quality_score = sum(
|
||||
avg_dimensions.get(dim, 0.5) * info["weight"]
|
||||
for dim, info in DIMENSIONS.items()
|
||||
)
|
||||
|
||||
# Generate recommendations
|
||||
recommendations = []
|
||||
if compression_ratio > 0.99:
|
||||
recommendations.append("Very high compression. Risk of information loss.")
|
||||
if avg_dimensions.get("artifact_trail", 1) < 0.5:
|
||||
recommendations.append("Artifact tracking weak. Add explicit file section to summary.")
|
||||
if avg_dimensions.get("continuity", 1) < 0.5:
|
||||
recommendations.append("Continuity low. Add 'Next Steps' section to summary.")
|
||||
if quality_score < 0.6:
|
||||
recommendations.append("Quality below threshold. Consider less aggressive compression.")
|
||||
|
||||
return EvaluationReport(
|
||||
compression_ratio=compression_ratio,
|
||||
quality_score=quality_score,
|
||||
dimension_scores=avg_dimensions,
|
||||
probe_results=probe_results,
|
||||
recommendations=recommendations
|
||||
)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Compression quality evaluator")
|
||||
subparsers = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
# Evaluate command
|
||||
eval_parser = subparsers.add_parser("evaluate", help="Evaluate compression quality")
|
||||
eval_parser.add_argument("original_file", help="JSON file with original messages")
|
||||
eval_parser.add_argument("compressed_file", help="Text file with compressed summary")
|
||||
|
||||
# Generate probes command
|
||||
probe_parser = subparsers.add_parser("generate-probes", help="Generate evaluation probes")
|
||||
probe_parser.add_argument("context_file", help="JSON file with context messages")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.command == "evaluate":
|
||||
original = load_file(args.original_file, as_json=True)
|
||||
messages = original if isinstance(original, list) else original.get("messages", [])
|
||||
compressed = load_file(args.compressed_file, as_json=False)
|
||||
|
||||
report = evaluate_compression(messages, compressed)
|
||||
print(json.dumps({
|
||||
"compression_ratio": f"{report.compression_ratio:.1%}",
|
||||
"quality_score": f"{report.quality_score:.2f}",
|
||||
"dimension_scores": {k: f"{v:.2f}" for k, v in report.dimension_scores.items()},
|
||||
"probe_count": len(report.probe_results),
|
||||
"recommendations": report.recommendations
|
||||
}, indent=2))
|
||||
|
||||
elif args.command == "generate-probes":
|
||||
data = load_file(args.context_file, as_json=True)
|
||||
messages = data if isinstance(data, list) else data.get("messages", [])
|
||||
|
||||
probes = generate_probes(messages)
|
||||
output = []
|
||||
for probe in probes:
|
||||
output.append({
|
||||
"type": probe.type.value,
|
||||
"question": probe.question,
|
||||
"ground_truth": probe.ground_truth
|
||||
})
|
||||
print(json.dumps(output, indent=2))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
317
.opencode/skills/context-engineering/scripts/context_analyzer.py
Normal file
317
.opencode/skills/context-engineering/scripts/context_analyzer.py
Normal file
@@ -0,0 +1,317 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Context Analyzer - Health analysis and degradation detection for agent contexts.
|
||||
|
||||
Usage:
|
||||
python context_analyzer.py analyze <context_file>
|
||||
python context_analyzer.py budget --system 2000 --tools 1500 --docs 3000 --history 5000
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import Optional
|
||||
|
||||
MAX_FILE_SIZE_MB = 100
|
||||
|
||||
|
||||
def load_json_file(path: str):
|
||||
"""Load JSON file with proper error handling and size validation."""
|
||||
try:
|
||||
size_mb = os.path.getsize(path) / (1024 * 1024)
|
||||
if size_mb > MAX_FILE_SIZE_MB:
|
||||
print(f"Error: File too large ({size_mb:.1f}MB). Max {MAX_FILE_SIZE_MB}MB", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
with open(path, encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
except FileNotFoundError:
|
||||
print(f"Error: File not found: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except PermissionError:
|
||||
print(f"Error: Permission denied: {path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error: Invalid JSON in {path}: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class HealthStatus(Enum):
|
||||
HEALTHY = "healthy"
|
||||
WARNING = "warning"
|
||||
DEGRADED = "degraded"
|
||||
CRITICAL = "critical"
|
||||
|
||||
|
||||
@dataclass
|
||||
class ContextAnalysis:
|
||||
total_tokens: int
|
||||
token_limit: int
|
||||
utilization: float
|
||||
health_status: HealthStatus
|
||||
health_score: float
|
||||
degradation_risk: float
|
||||
poisoning_risk: float
|
||||
recommendations: list = field(default_factory=list)
|
||||
|
||||
|
||||
def estimate_tokens(text: str) -> int:
|
||||
"""Estimate token count (~4 chars per token for English)."""
|
||||
return len(text) // 4
|
||||
|
||||
|
||||
def estimate_message_tokens(messages: list) -> int:
|
||||
"""Estimate tokens in message list."""
|
||||
total = 0
|
||||
for msg in messages:
|
||||
if isinstance(msg, dict):
|
||||
content = msg.get("content", "")
|
||||
total += estimate_tokens(str(content))
|
||||
# Add overhead for role, metadata
|
||||
total += 10
|
||||
else:
|
||||
total += estimate_tokens(str(msg))
|
||||
return total
|
||||
|
||||
|
||||
def measure_attention_distribution(context_length: int, sample_size: int = 100) -> list:
|
||||
"""
|
||||
Simulate U-shaped attention distribution.
|
||||
Real implementation would extract from model attention weights.
|
||||
"""
|
||||
attention = []
|
||||
for i in range(sample_size):
|
||||
position = i / sample_size
|
||||
# U-shaped curve: high at start/end, low in middle
|
||||
if position < 0.1:
|
||||
score = 0.9 - position * 2
|
||||
elif position > 0.9:
|
||||
score = 0.7 + (position - 0.9) * 2
|
||||
else:
|
||||
score = 0.3 + 0.1 * math.sin(position * math.pi)
|
||||
attention.append(score)
|
||||
return attention
|
||||
|
||||
|
||||
def detect_lost_in_middle(messages: list, critical_keywords: list) -> list:
|
||||
"""Identify critical items in attention-degraded regions."""
|
||||
if not messages:
|
||||
return []
|
||||
|
||||
total = len(messages)
|
||||
warnings = []
|
||||
|
||||
for i, msg in enumerate(messages):
|
||||
position = i / total
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg)
|
||||
|
||||
# Middle region (10%-90%)
|
||||
if 0.1 < position < 0.9:
|
||||
for keyword in critical_keywords:
|
||||
if keyword.lower() in content.lower():
|
||||
warnings.append({
|
||||
"position": i,
|
||||
"position_pct": f"{position:.1%}",
|
||||
"keyword": keyword,
|
||||
"risk": "high" if 0.3 < position < 0.7 else "medium"
|
||||
})
|
||||
return warnings
|
||||
|
||||
|
||||
def detect_poisoning_patterns(messages: list) -> dict:
|
||||
"""Detect potential context poisoning indicators."""
|
||||
error_patterns = [
|
||||
r"error", r"failed", r"exception", r"cannot", r"unable",
|
||||
r"invalid", r"not found", r"undefined", r"null"
|
||||
]
|
||||
# Simple contradiction check - look for both positive and negative statements
|
||||
contradiction_keywords = [
|
||||
("is correct", "is not correct"),
|
||||
("should work", "should not work"),
|
||||
("will succeed", "will fail"),
|
||||
("is valid", "is invalid"),
|
||||
]
|
||||
|
||||
errors_found = []
|
||||
contradictions = []
|
||||
|
||||
for i, msg in enumerate(messages):
|
||||
content = str(msg.get("content", "") if isinstance(msg, dict) else msg).lower()
|
||||
|
||||
# Check error patterns
|
||||
for pattern in error_patterns:
|
||||
if re.search(pattern, content):
|
||||
errors_found.append({"position": i, "pattern": pattern})
|
||||
|
||||
# Check for contradiction keywords (simplified)
|
||||
for pos_phrase, neg_phrase in contradiction_keywords:
|
||||
if pos_phrase in content and neg_phrase in content:
|
||||
contradictions.append({"position": i, "type": "self-contradiction"})
|
||||
|
||||
total = max(len(messages), 1)
|
||||
return {
|
||||
"error_density": len(errors_found) / total,
|
||||
"contradiction_count": len(contradictions),
|
||||
"poisoning_risk": min(1.0, (len(errors_found) * 0.1 + len(contradictions) * 0.3))
|
||||
}
|
||||
|
||||
|
||||
def calculate_health_score(utilization: float, degradation_risk: float, poisoning_risk: float) -> float:
|
||||
"""
|
||||
Calculate composite health score.
|
||||
1.0 = healthy, 0.0 = critical
|
||||
"""
|
||||
score = 1.0
|
||||
# Utilization penalty (kicks in after 70%)
|
||||
if utilization > 0.7:
|
||||
score -= (utilization - 0.7) * 1.5
|
||||
# Degradation penalty
|
||||
score -= degradation_risk * 0.3
|
||||
# Poisoning penalty
|
||||
score -= poisoning_risk * 0.2
|
||||
return max(0.0, min(1.0, score))


def get_health_status(score: float) -> HealthStatus:
    """Map health score to status."""
    if score > 0.8:
        return HealthStatus.HEALTHY
    elif score > 0.6:
        return HealthStatus.WARNING
    elif score > 0.4:
        return HealthStatus.DEGRADED
    return HealthStatus.CRITICAL


def analyze_context(messages: list, token_limit: int = 128000,
                    critical_keywords: Optional[list] = None) -> ContextAnalysis:
    """
    Comprehensive context health analysis.

    Args:
        messages: List of context messages
        token_limit: Model's context window size
        critical_keywords: Keywords that should be at attention-favored positions

    Returns:
        ContextAnalysis with health metrics and recommendations
    """
    critical_keywords = critical_keywords or ["goal", "task", "important", "critical", "must"]

    # Calculate token utilization
    total_tokens = estimate_message_tokens(messages)
    utilization = total_tokens / token_limit

    # Check for lost-in-middle issues
    middle_warnings = detect_lost_in_middle(messages, critical_keywords)
    degradation_risk = min(1.0, len(middle_warnings) * 0.2)

    # Check for poisoning
    poisoning = detect_poisoning_patterns(messages)
    poisoning_risk = poisoning["poisoning_risk"]

    # Calculate health
    health_score = calculate_health_score(utilization, degradation_risk, poisoning_risk)
    health_status = get_health_status(health_score)

    # Generate recommendations
    recommendations = []
    if utilization > 0.8:
        recommendations.append("URGENT: Context utilization >80%. Trigger compaction immediately.")
    elif utilization > 0.7:
        recommendations.append("WARNING: Context utilization >70%. Plan for compaction.")

    if middle_warnings:
        recommendations.append(f"Found {len(middle_warnings)} critical items in middle region. "
                               "Consider moving to beginning/end.")

    if poisoning_risk > 0.3:
        recommendations.append("High poisoning risk detected. Review recent tool outputs for errors.")

    if health_status == HealthStatus.CRITICAL:
        recommendations.append("CRITICAL: Consider context reset with clean state.")

    return ContextAnalysis(
        total_tokens=total_tokens,
        token_limit=token_limit,
        utilization=utilization,
        health_status=health_status,
        health_score=health_score,
        degradation_risk=degradation_risk,
        poisoning_risk=poisoning_risk,
        recommendations=recommendations
    )
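
# analyze_context, illustrative call (the single message below is made up):
#   report = analyze_context([{"role": "user", "content": "goal: migrate billing"}],
#                            token_limit=200000)
#   report.health_status    -> HealthStatus.HEALTHY  (tiny context, no thresholds tripped)
#   report.recommendations  -> []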


def calculate_budget(system: int, tools: int, docs: int, history: int,
                     buffer_pct: float = 0.15) -> dict:
    """Calculate context budget allocation."""
    subtotal = system + tools + docs + history
    buffer = int(subtotal * buffer_pct)
    total = subtotal + buffer

    return {
        "allocation": {
            "system_prompt": system,
            "tool_definitions": tools,
            "retrieved_docs": docs,
            "message_history": history,
            "reserved_buffer": buffer
        },
        "total_budget": total,
        "warning_threshold": int(total * 0.7),
        "critical_threshold": int(total * 0.8),
        "recommendations": [
            f"Trigger compaction at {int(total * 0.7):,} tokens",
            f"Aggressive optimization at {int(total * 0.8):,} tokens",
            f"Reserved {buffer:,} tokens ({buffer_pct:.0%}) for responses"
        ]
    }
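
# calculate_budget, worked example with the CLI defaults (2000/1500/3000/5000, 15% buffer):
#   subtotal 11,500 + buffer 1,725 = total_budget 13,225
#   warning_threshold 9,257 tokens (70%), critical_threshold 10,580 tokens (80%)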


def main():
    parser = argparse.ArgumentParser(description="Context health analyzer")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Analyze command
    analyze_parser = subparsers.add_parser("analyze", help="Analyze context health")
    analyze_parser.add_argument("context_file", help="JSON file with messages array")
    analyze_parser.add_argument("--limit", type=int, default=128000, help="Token limit")
    analyze_parser.add_argument("--keywords", nargs="+", help="Critical keywords to track")

    # Budget command
    budget_parser = subparsers.add_parser("budget", help="Calculate context budget")
    budget_parser.add_argument("--system", type=int, default=2000, help="System prompt tokens")
    budget_parser.add_argument("--tools", type=int, default=1500, help="Tool definitions tokens")
    budget_parser.add_argument("--docs", type=int, default=3000, help="Retrieved docs tokens")
    budget_parser.add_argument("--history", type=int, default=5000, help="Message history tokens")
    budget_parser.add_argument("--buffer", type=float, default=0.15, help="Buffer percentage")

    args = parser.parse_args()

    if args.command == "analyze":
        data = load_json_file(args.context_file)
        messages = data if isinstance(data, list) else data.get("messages", [])
        result = analyze_context(messages, args.limit, args.keywords)
        print(json.dumps({
            "total_tokens": result.total_tokens,
            "token_limit": result.token_limit,
            "utilization": f"{result.utilization:.1%}",
            "health_status": result.health_status.value,
            "health_score": f"{result.health_score:.2f}",
            "degradation_risk": f"{result.degradation_risk:.2f}",
            "poisoning_risk": f"{result.poisoning_risk:.2f}",
            "recommendations": result.recommendations
        }, indent=2))

    elif args.command == "budget":
        result = calculate_budget(args.system, args.tools, args.docs, args.history, args.buffer)
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()
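
# Illustrative invocations (file name and token values are placeholders):
#   python context_analyzer.py analyze session.json --limit 200000 --keywords goal task
#   python context_analyzer.py budget --system 2500 --history 8000
# "analyze" prints the health report as JSON; "budget" prints the allocation plan.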

@@ -0,0 +1,246 @@
"""Tests for context-engineering edge case handling.

Tests the error handling improvements in compression_evaluator.py and context_analyzer.py:
- File not found
- Permission denied
- Invalid JSON
- File too large
- UTF-8 encoding
"""

import json
import os
import stat
import subprocess
import sys
import tempfile
from pathlib import Path

import pytest

SCRIPTS_DIR = Path(__file__).parent.parent
PYTHON = sys.executable


class TestCompressionEvaluatorEdgeCases:
    """Test edge cases in compression_evaluator.py"""

    @pytest.fixture
    def valid_json_file(self, tmp_path):
        """Create valid JSON file."""
        f = tmp_path / "valid.json"
        f.write_text('{"messages": [{"role": "user", "content": "hello"}]}', encoding='utf-8')
        return str(f)

    @pytest.fixture
    def valid_text_file(self, tmp_path):
        """Create valid text file."""
        f = tmp_path / "compressed.txt"
        f.write_text("Summary of conversation", encoding='utf-8')
        return str(f)

    def run_script(self, *args, timeout=30):
        """Run compression_evaluator.py with args."""
        cmd = [PYTHON, str(SCRIPTS_DIR / "compression_evaluator.py")] + list(args)
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result

    def test_missing_file_exits_1(self, tmp_path):
        """Test exit code 1 when file not found."""
        result = self.run_script("evaluate", "/nonexistent/file.json", str(tmp_path / "c.txt"))
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_missing_file_error_message(self, tmp_path):
        """Test error message format for missing file."""
        missing = "/this/path/does/not/exist/file.json"
        result = self.run_script("evaluate", missing, str(tmp_path / "c.txt"))
        assert result.returncode == 1
        assert missing in result.stderr or "not found" in result.stderr.lower()

    def test_invalid_json_exits_1(self, tmp_path, valid_text_file):
        """Test exit code 1 when JSON is invalid."""
        bad_json = tmp_path / "bad.json"
        bad_json.write_text("{invalid json content", encoding='utf-8')

        result = self.run_script("evaluate", str(bad_json), valid_text_file)
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_valid_files_succeed(self, valid_json_file, valid_text_file):
        """Test success with valid inputs."""
        result = self.run_script("evaluate", valid_json_file, valid_text_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert "compression_ratio" in output
        assert "quality_score" in output

    def test_generate_probes_missing_file(self):
        """Test generate-probes with missing file."""
        result = self.run_script("generate-probes", "/nonexistent/context.json")
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_generate_probes_invalid_json(self, tmp_path):
        """Test generate-probes with invalid JSON."""
        bad = tmp_path / "bad.json"
        bad.write_text("not valid json {{{", encoding='utf-8')

        result = self.run_script("generate-probes", str(bad))
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_generate_probes_success(self, valid_json_file):
        """Test generate-probes with valid file."""
        result = self.run_script("generate-probes", valid_json_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert isinstance(output, list)

    def test_utf8_content(self, tmp_path):
        """Test UTF-8 encoding with special characters."""
        utf8_file = tmp_path / "utf8.json"
        content = {"messages": [{"role": "user", "content": "日本語テスト émojis 🎉"}]}
        utf8_file.write_text(json.dumps(content), encoding='utf-8')

        compressed = tmp_path / "compressed.txt"
        compressed.write_text("Summary with 日本語 and émojis 🎉", encoding='utf-8')

        result = self.run_script("evaluate", str(utf8_file), str(compressed))
        assert result.returncode == 0

    @pytest.mark.skipif(os.name == 'nt', reason="Permission test not reliable on Windows")
    def test_permission_denied(self, tmp_path):
        """Test permission denied error."""
        protected = tmp_path / "protected.json"
        protected.write_text('{"messages": []}', encoding='utf-8')
        os.chmod(protected, 0o000)

        try:
            result = self.run_script("generate-probes", str(protected))
            assert result.returncode == 1
            assert "Permission denied" in result.stderr or "permission" in result.stderr.lower()
        finally:
            os.chmod(protected, stat.S_IRUSR | stat.S_IWUSR)


class TestContextAnalyzerEdgeCases:
    """Test edge cases in context_analyzer.py"""

    @pytest.fixture
    def valid_context_file(self, tmp_path):
        """Create valid context file."""
        f = tmp_path / "context.json"
        content = {
            "messages": [
                {"role": "user", "content": "implement feature X"},
                {"role": "assistant", "content": "I'll help with that"}
            ]
        }
        f.write_text(json.dumps(content), encoding='utf-8')
        return str(f)

    def run_script(self, *args, timeout=30):
        """Run context_analyzer.py with args."""
        cmd = [PYTHON, str(SCRIPTS_DIR / "context_analyzer.py")] + list(args)
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result

    def test_missing_file_exits_1(self):
        """Test exit code 1 when file not found."""
        result = self.run_script("analyze", "/nonexistent/context.json")
        assert result.returncode == 1
        assert "File not found" in result.stderr

    def test_invalid_json_exits_1(self, tmp_path):
        """Test exit code 1 when JSON is invalid."""
        bad = tmp_path / "bad.json"
        bad.write_text("not json", encoding='utf-8')

        result = self.run_script("analyze", str(bad))
        assert result.returncode == 1
        assert "Invalid JSON" in result.stderr or "JSON" in result.stderr

    def test_valid_file_succeeds(self, valid_context_file):
        """Test success with valid input."""
        result = self.run_script("analyze", valid_context_file)
        assert result.returncode == 0
        output = json.loads(result.stdout)
        assert "health_status" in output or "health_score" in output

    def test_utf8_content(self, tmp_path):
        """Test UTF-8 encoding with international characters."""
        utf8_file = tmp_path / "utf8.json"
        content = {
            "messages": [
                {"role": "user", "content": "日本語で説明してください"},
                {"role": "assistant", "content": "はい、説明します。émojis: 🎉🚀"}
            ]
        }
        utf8_file.write_text(json.dumps(content, ensure_ascii=False), encoding='utf-8')

        result = self.run_script("analyze", str(utf8_file))
        assert result.returncode == 0

    def test_empty_messages_array(self, tmp_path):
        """Test handling of empty messages array."""
        f = tmp_path / "empty.json"
        f.write_text('{"messages": []}', encoding='utf-8')

        result = self.run_script("analyze", str(f))
        assert result.returncode == 0

    def test_direct_messages_list(self, tmp_path):
        """Test handling of direct messages list (no wrapper)."""
        f = tmp_path / "direct.json"
        content = [
            {"role": "user", "content": "hello"},
            {"role": "assistant", "content": "hi"}
        ]
        f.write_text(json.dumps(content), encoding='utf-8')

        result = self.run_script("analyze", str(f))
        assert result.returncode == 0

    @pytest.mark.skipif(os.name == 'nt', reason="Permission test not reliable on Windows")
    def test_permission_denied(self, tmp_path):
        """Test permission denied error."""
        protected = tmp_path / "protected.json"
        protected.write_text('{"messages": []}', encoding='utf-8')
        os.chmod(protected, 0o000)

        try:
            result = self.run_script("analyze", str(protected))
            assert result.returncode == 1
            assert "Permission denied" in result.stderr or "permission" in result.stderr.lower()
        finally:
            os.chmod(protected, stat.S_IRUSR | stat.S_IWUSR)

    def test_with_keywords_filter(self, valid_context_file):
        """Test analyze with keywords filter."""
        result = self.run_script("analyze", valid_context_file, "--keywords", "feature,implement")
        assert result.returncode == 0

    def test_with_limit(self, valid_context_file):
        """Test analyze with limit parameter."""
        result = self.run_script("analyze", valid_context_file, "--limit", "10")
        assert result.returncode == 0


class TestFileSizeValidation:
    """Test file size validation (100MB limit)."""

    def test_large_file_warning_in_code(self):
        """Verify MAX_FILE_SIZE_MB constant exists in scripts."""
        evaluator = SCRIPTS_DIR / "compression_evaluator.py"
        analyzer = SCRIPTS_DIR / "context_analyzer.py"

        eval_content = evaluator.read_text()
        analyzer_content = analyzer.read_text()

        assert "MAX_FILE_SIZE_MB = 100" in eval_content
        assert "MAX_FILE_SIZE_MB = 100" in analyzer_content


if __name__ == "__main__":
    pytest.main([__file__, "-v"])