# Investigation Methodology
Five-step structured investigation for system-level issues, incidents, and multi-component failures.
## When to Use
- Server returning 500 errors or unexpected responses
- System behavior changed without obvious code changes
- Multi-component failures spanning services/databases/infrastructure
- Need to understand "what happened" before fixing
## Step 1: Initial Assessment
Gather scope and impact before diving in.
- **Collect symptoms** - Error messages, affected endpoints, user reports (see the symptom-count sketch below)
- **Identify affected components** - Which services, databases, queues are involved?
- **Determine timeframe** - When did the issue start? Correlate with deployments/changes
- **Assess severity** - Users affected? Data at risk? Revenue impact?
- **Check recent changes** - Git log, deployment history, config changes, dependency updates
```bash
# Recent deployments
gh run list --limit 10

# Recent commits
git log --oneline -20 --since="2 days ago"

# Config changes
git diff HEAD~5 -- '*.env*' '*.config*' '*.yml' '*.yaml' '*.json'
```
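To put numbers on the symptoms and surface the worst-affected endpoints, a minimal sketch assuming an nginx-style combined access log (the log path and field positions are assumptions; adjust for your format):

```bash
# Count 5xx responses per endpoint (combined log format:
# field 7 = request path, field 9 = status code)
awk '$9 ~ /^5/ {print $7}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20
```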
## Step 2: Data Collection
Gather evidence systematically before analysis.
- **Server/application logs** - Filter by timeframe and affected components (collection sketch below)
- **CI/CD pipeline logs** - Use `gh run view <run-id> --log-failed` for GitHub Actions
- **Database state** - Query relevant tables, check recent migrations
- **System metrics** - CPU, memory, disk, network utilization
- **External dependencies** - Third-party API status, DNS, CDN
```bash
# GitHub Actions: list recent workflow runs
gh run list --workflow=<workflow> --limit 5

# View failed run logs
gh run view <run-id> --log-failed

# Download full logs
gh run view <run-id> --log > /tmp/ci-logs.txt
```
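The other evidence sources in the list above can be captured with standard tooling; a hedged sketch, assuming a systemd-based host with placeholder service and hostnames:

```bash
# Application logs for the incident window (systemd journal assumed)
journalctl -u <service> --since "2 hours ago" > /tmp/service-logs.txt

# System metrics snapshot
df -h        # disk utilization
free -m      # memory
uptime       # load averages

# External dependencies: DNS resolution and third-party API health
dig +short api.example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' https://api.example.com/health
```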
For codebase understanding:
- Read `docs/codebase-summary.md` if it exists and is up-to-date (<2 days old; freshness check sketched below)
- Otherwise use `ck:repomix` to generate a fresh codebase summary
- Use `/ck:scout` or `/ck:scout ext` to find relevant files
- Use the `ck:docs-seeker` skill for package/plugin documentation
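A minimal sketch of the freshness check, assuming GNU `find` semantics (`-mtime -2` matches files modified within the last two days):

```bash
# Use the existing summary only if it is present and under 2 days old
if [ -n "$(find docs/codebase-summary.md -mtime -2 2>/dev/null)" ]; then
  echo "Summary is fresh; read docs/codebase-summary.md"
else
  echo "Summary missing or stale; regenerate with ck:repomix"
fi
```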
## Step 3: Analysis Process
Correlate evidence across sources.
- **Timeline reconstruction** - Order events chronologically across all log sources (sketched below)
- **Pattern identification** - Recurring errors, timing patterns, affected user segments
- **Execution path tracing** - Follow request flow through system components
- **Database analysis** - Query performance, table relationships, data integrity
- **Dependency mapping** - Which components depend on the failing one?
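For timeline reconstruction, a minimal sketch that assumes each log line begins with an ISO-8601 timestamp, so a lexical sort yields chronological order (adjust the sort key for other formats):

```bash
# Merge all collected logs into one chronological timeline
cat /tmp/service-logs.txt /tmp/ci-logs.txt | sort -k1,1 > /tmp/timeline.txt

# Locate the first error in the merged timeline
grep -n -i -m 1 -E 'error|exception|fatal' /tmp/timeline.txt
```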
Key questions:
- Does issue correlate with specific deployments or time windows?
- Is it intermittent or consistent?
- Does it affect all users or a subset?
- Are there related errors in upstream/downstream services?
## Step 4: Root Cause Identification
Systematic elimination with evidence.
- **List hypotheses** - Rank them by evidence strength
- **Test each** - Design the smallest experiment to confirm or eliminate it (e.g. the bisect sketch below)
- **Validate with evidence** - Logs, metrics, reproduction steps
- **Consider environmental factors** - Race conditions, resource limits, config drift
- **Document the chain** - Full event sequence from trigger to symptom

**Avoid:** Fixing the first hypothesis without testing alternatives. When multiple causes are plausible, eliminate them one by one.
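When the strongest hypothesis is "a recent commit introduced this", `git bisect` gives the smallest experiment per iteration; a sketch, with the reproduction step left as a placeholder:

```bash
# Binary-search the commit history for the change that introduced the failure
git bisect start
git bisect bad HEAD          # current commit reproduces the issue
git bisect good HEAD~20      # a commit known to be healthy
# Run your reproduction at each step, then mark the result:
#   git bisect good   # or: git bisect bad
git bisect reset             # return to the original branch when done
```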
## Step 5: Solution Development
Design targeted, evidence-backed fixes.
- **Immediate fix** - Minimum change to restore service (hotfix, rollback, config change; rollback sketch below)
- **Root cause fix** - Address the underlying issue permanently
- **Preventive measures** - Monitoring, alerting, validation to catch recurrence early
- **Verification plan** - How to confirm the fix works in production

**Prioritize** by impact × urgency: restore service first, then fix the root cause, then prevent recurrence.
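For the immediate fix, a hedged sketch of two common restore paths (the commit SHA, workflow file, tag, and branch names are placeholders; adapt to your deploy pipeline):

```bash
# Option 1: revert the suspect commit and ship through the normal pipeline
git revert <bad-commit-sha>
git push origin main

# Option 2: redeploy the last known-good release via a workflow dispatch
gh workflow run deploy.yml --ref v1.2.3   # placeholder tag for the last healthy release
```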
## Integration with Code-Level Debugging
When investigation narrows to specific code:
- Switch to `systematic-debugging.md` for the code-level fix
- Use `root-cause-tracing.md` if the error is deep in the call stack
- Apply `defense-in-depth.md` after fixing
- Always finish with `verification.md`