english/.opencode/skills/ck-debug/references/investigation-methodology.md
2026-04-12 01:06:31 +07:00

# Investigation Methodology
Five-step structured investigation for system-level issues, incidents, and multi-component failures.
## When to Use
- Server returning 500 errors or unexpected responses
- System behavior changed without obvious code changes
- Multi-component failures spanning services/databases/infrastructure
- Need to understand "what happened" before fixing
## Step 1: Initial Assessment
**Gather scope and impact before diving in.**
1. **Collect symptoms** - Error messages, affected endpoints, user reports
2. **Identify affected components** - Which services, databases, queues involved?
3. **Determine timeframe** - When did the issue start? Correlate with deployments/changes
4. **Assess severity** - Users affected? Data at risk? Revenue impact?
5. **Check recent changes** - Git log, deployment history, config changes, dependency updates
```bash
# Recent workflow runs (shows deployments only if they run through GitHub Actions)
gh run list --limit 10
# Recent commits
git log --oneline -20 --since="2 days ago"
# Config changes
git diff HEAD~5 -- '*.env*' '*.config*' '*.yml' '*.yaml' '*.json'
```
## Step 2: Data Collection
**Gather evidence systematically before analysis.**
1. **Server/application logs** - Filter by timeframe and affected components
2. **CI/CD pipeline logs** - Use `gh run view <run-id> --log-failed` for GitHub Actions
3. **Database state** - Query relevant tables, check recent migrations
4. **System metrics** - CPU, memory, disk, network utilization
5. **External dependencies** - Third-party API status, DNS, CDN
```bash
# GitHub Actions: list recent workflow runs
gh run list --workflow=<workflow> --limit 5
# View failed run logs
gh run view <run-id> --log-failed
# Download full logs
gh run view <run-id> --log > /tmp/ci-logs.txt
```
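For item 1 above, a minimal sketch of filtering a log file by timeframe. It assumes every log line starts with an ISO-8601 UTC timestamp (for example `2026-04-10T14:03:22Z`), so string order matches chronological order; the function name `logs_between` and the file name in the usage comment are hypothetical. Logs in other formats would need a different comparison.

```bash
# Print log lines whose leading timestamp falls within [start, end].
# Assumption: each line begins with an ISO-8601 UTC timestamp, so a plain
# string comparison is also a chronological comparison.
logs_between() {
  local start="$1" end="$2"
  shift 2
  # ($1 "") forces awk to compare the first field as a string.
  awk -v s="$start" -v e="$end" '($1 "") >= s && ($1 "") <= e' "$@"
}

# Usage (app.log is a placeholder path):
#   logs_between 2026-04-10T14:00:00Z 2026-04-10T15:00:00Z app.log
```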
**For codebase understanding:**
- Read `docs/codebase-summary.md` if it exists and is up to date (less than 2 days old)
- Otherwise use `ck:repomix` to generate fresh codebase summary
- Use `/ck:scout` or `/ck:scout ext` to find relevant files
- Use `ck:docs-seeker` skill for package/plugin documentation
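
The freshness check on `docs/codebase-summary.md` could be sketched as follows. This uses the file's modification time as a proxy for "up to date", which is an assumption: a fresh clone or checkout resets mtime, so treat the result as a hint. The function name `summary_is_fresh` is hypothetical.

```bash
# Succeed if the summary file exists and was modified less than 2 days ago.
# Assumption: file mtime is an acceptable proxy for when it was generated.
summary_is_fresh() {
  local file="${1:-docs/codebase-summary.md}"
  # find prints the path only when mtime is under 2 * 24 hours old.
  [ -f "$file" ] && [ -n "$(find "$file" -mtime -2)" ]
}

if summary_is_fresh; then
  echo "Reuse docs/codebase-summary.md"
else
  echo "Regenerate the codebase summary"
fi
```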
## Step 3: Analysis Process
**Correlate evidence across sources.**
1. **Timeline reconstruction** - Order events chronologically across all log sources
2. **Pattern identification** - Recurring errors, timing patterns, affected user segments
3. **Execution path tracing** - Follow request flow through system components
4. **Database analysis** - Query performance, table relationships, data integrity
5. **Dependency mapping** - Which components depend on the failing one?
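
Timeline reconstruction can be sketched as a merge of several log files into one chronologically ordered stream, with each line tagged by its source. This assumes every line starts with an ISO-8601 UTC timestamp and that the clocks of the sources are reasonably in sync; the function name `build_timeline` and the file names in the usage comment are hypothetical.

```bash
# Usage (api.log and db.log are placeholder paths):
#   build_timeline api.log db.log | less

# Merge log files into one timeline: "<timestamp> [<source file>] <message>".
# Assumption: each line begins with an ISO-8601 UTC timestamp and no
# leading whitespace.
build_timeline() {
  local f
  for f in "$@"; do
    awk -v src="$(basename "$f")" '{
      rest = substr($0, length($1) + 1)   # everything after the timestamp
      print $1 " [" src "]" rest
    }' "$f"
  done | LC_ALL=C sort -s -k1,1           # stable sort on the timestamp only
}
```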
**Key questions:**
- Does the issue correlate with specific deployments or time windows?
- Is it intermittent or consistent?
- Does it affect all users or a subset?
- Are there related errors in upstream/downstream services?
## Step 4: Root Cause Identification
**Systematic elimination with evidence.**
1. **List hypotheses** ranked by evidence strength
2. **Test each** - Design the smallest experiment that confirms or eliminates it
3. **Validate with evidence** - Logs, metrics, reproduction steps
4. **Consider environmental factors** - Race conditions, resource limits, config drift
5. **Document the chain** - Full event sequence from trigger to symptom
**Avoid:** Fixing the first hypothesis without testing alternatives. When several causes are plausible, eliminate each one with evidence.
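
When the hypothesis is "a recent commit introduced the regression", one small experiment is `git bisect`. The sketch below is only one way to do this: it assumes you have a known-good ref, a known-bad ref, and a command that exits 0 on good commits and non-zero on bad ones. The function name `bisect_regression`, the tag `v1.4.0`, and the script `./repro.sh` are hypothetical.

```bash
# Find the first commit where a reproduction command starts failing.
# Arguments: <good-ref> <bad-ref> <command...>
# Assumption: the command exits 0 on good commits, non-zero on bad ones.
bisect_regression() {
  local good="$1" bad="$2" first_bad
  shift 2
  git bisect start "$bad" "$good" >&2 || return 1
  git bisect run "$@" >&2
  # After a completed run, refs/bisect/bad points at the first bad commit.
  first_bad=$(git rev-parse --verify -q refs/bisect/bad)
  git bisect reset >&2
  printf '%s\n' "$first_bad"
}

# Usage (v1.4.0 and ./repro.sh are placeholders):
#   bisect_regression v1.4.0 HEAD ./repro.sh
```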
## Step 5: Solution Development
**Design targeted, evidence-backed fixes.**
1. **Immediate fix** - Minimum change to restore service (hotfix, rollback, config change)
2. **Root cause fix** - Address underlying issue permanently
3. **Preventive measures** - Monitoring, alerting, validation to catch recurrence early
4. **Verification plan** - How to confirm fix works in production
**Prioritize:** Impact × urgency. Restore service first, then fix root cause, then prevent recurrence.
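
For the verification plan, one simple signal is the share of 5xx responses before and after the fix. A minimal sketch, assuming an access log in Common or Combined Log Format where the status code is the ninth whitespace-separated field; the function name `error_rate` and the file names in the usage comment are hypothetical, and other log formats need a different field index.

```bash
# Print the percentage of 5xx responses in an access log.
# Assumption: Common/Combined Log Format, status code in field 9.
error_rate() {
  awk '$9 ~ /^5[0-9][0-9]$/ { errors++ }
       END {
         if (NR == 0) { print "0.00" }
         else         { printf "%.2f\n", 100 * errors / NR }
       }' "$@"
}

# Usage (file names are placeholders): compare the two numbers.
#   error_rate access-before-fix.log
#   error_rate access-after-fix.log
```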
## Integration with Code-Level Debugging
When investigation narrows to specific code:
- Switch to `systematic-debugging.md` for the code-level fix
- Use `root-cause-tracing.md` if the error is deep in the call stack
- Apply `defense-in-depth.md` after fixing
- Always finish with `verification.md`