# Investigation Methodology
Five-step structured investigation for system-level issues, incidents, and multi-component failures.
## When to Use
- Server returning 500 errors or unexpected responses
- System behavior changed without obvious code changes
- Multi-component failures spanning services/databases/infrastructure
- Need to understand "what happened" before fixing
## Step 1: Initial Assessment
Gather scope and impact before diving in.
- **Collect symptoms** - Error messages, affected endpoints, user reports (see the symptom-count sketch below)
- **Identify affected components** - Which services, databases, queues are involved?
- **Determine timeframe** - When did the issue start? Correlate with deployments/changes
- **Assess severity** - Users affected? Data at risk? Revenue impact?
- **Check recent changes** - Git log, deployment history, config changes, dependency updates
```bash
# Recent deployments
gh run list --limit 10

# Recent commits
git log --oneline -20 --since="2 days ago"

# Config changes
git diff HEAD~5 -- '*.env*' '*.config*' '*.yml' '*.yaml' '*.json'
```
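To put numbers on the symptoms and surface the worst-affected endpoints, a minimal sketch assuming an nginx-style combined access log (the log path and field positions are assumptions; adjust for your format):

```bash
# Count 5xx responses per endpoint (combined log format:
# field 7 = request path, field 9 = status code)
awk '$9 ~ /^5/ {print $7}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20
```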
## Step 2: Data Collection
Gather evidence systematically before analysis.
- **Server/application logs** - Filter by timeframe and affected components (collection sketch below)
- **CI/CD pipeline logs** - Use `gh run view <run-id> --log-failed` for GitHub Actions
- **Database state** - Query relevant tables, check recent migrations
- **System metrics** - CPU, memory, disk, network utilization
- **External dependencies** - Third-party API status, DNS, CDN
```bash
# GitHub Actions: list recent workflow runs
gh run list --workflow=<workflow> --limit 5

# View failed run logs
gh run view <run-id> --log-failed

# Download full logs
gh run view <run-id> --log > /tmp/ci-logs.txt
```
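The other evidence sources in the list above can be captured with standard tooling; a hedged sketch, assuming a systemd-based host with placeholder service and hostnames:

```bash
# Application logs for the incident window (systemd journal assumed)
journalctl -u <service> --since "2 hours ago" > /tmp/service-logs.txt

# System metrics snapshot
df -h        # disk utilization
free -m      # memory
uptime       # load averages

# External dependencies: DNS resolution and third-party API health
dig +short api.example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' https://api.example.com/health
```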
For codebase understanding:
- Read `docs/codebase-summary.md` if it exists and is up-to-date (<2 days old; freshness check sketched below)
- Otherwise use `ck:repomix` to generate a fresh codebase summary
- Use `/ck:scout` or `/ck:scout ext` to find relevant files
- Use the `ck:docs-seeker` skill for package/plugin documentation
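A minimal sketch of the freshness check, assuming GNU `find` semantics (`-mtime -2` matches files modified within the last two days):

```bash
# Use the existing summary only if it is present and under 2 days old
if [ -n "$(find docs/codebase-summary.md -mtime -2 2>/dev/null)" ]; then
  echo "Summary is fresh; read docs/codebase-summary.md"
else
  echo "Summary missing or stale; regenerate with ck:repomix"
fi
```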
## Step 3: Analysis Process
Correlate evidence across sources.
- **Timeline reconstruction** - Order events chronologically across all log sources (sketched below)
- **Pattern identification** - Recurring errors, timing patterns, affected user segments
- **Execution path tracing** - Follow request flow through system components
- **Database analysis** - Query performance, table relationships, data integrity
- **Dependency mapping** - Which components depend on the failing one?
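For timeline reconstruction, a minimal sketch that assumes each log line begins with an ISO-8601 timestamp, so a lexical sort yields chronological order (adjust the sort key for other formats):

```bash
# Merge all collected logs into one chronological timeline
cat /tmp/service-logs.txt /tmp/ci-logs.txt | sort -k1,1 > /tmp/timeline.txt

# Locate the first error in the merged timeline
grep -n -i -m 1 -E 'error|exception|fatal' /tmp/timeline.txt
```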
Key questions:
- Does issue correlate with specific deployments or time windows?
- Is it intermittent or consistent?
- Does it affect all users or a subset?
- Are there related errors in upstream/downstream services?
## Step 4: Root Cause Identification
Systematic elimination with evidence.
- **List hypotheses** - Rank them by evidence strength
- **Test each** - Design the smallest experiment to confirm or eliminate it (e.g. the bisect sketch below)
- **Validate with evidence** - Logs, metrics, reproduction steps
- **Consider environmental factors** - Race conditions, resource limits, config drift
- **Document the chain** - Full event sequence from trigger to symptom

**Avoid:** Fixing the first hypothesis without testing alternatives. When multiple causes are plausible, eliminate them one by one.
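When the strongest hypothesis is "a recent commit introduced this", `git bisect` gives the smallest experiment per iteration; a sketch, with the reproduction step left as a placeholder:

```bash
# Binary-search the commit history for the change that introduced the failure
git bisect start
git bisect bad HEAD          # current commit reproduces the issue
git bisect good HEAD~20      # a commit known to be healthy
# Run your reproduction at each step, then mark the result:
#   git bisect good   # or: git bisect bad
git bisect reset             # return to the original branch when done
```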
## Step 5: Solution Development
Design targeted, evidence-backed fixes.
- **Immediate fix** - Minimum change to restore service (hotfix, rollback, config change; rollback sketch below)
- **Root cause fix** - Address the underlying issue permanently
- **Preventive measures** - Monitoring, alerting, validation to catch recurrence early
- **Verification plan** - How to confirm the fix works in production

**Prioritize** by impact × urgency: restore service first, then fix the root cause, then prevent recurrence.
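For the immediate fix, a hedged sketch of two common restore paths (the commit SHA, workflow file, tag, and branch names are placeholders; adapt to your deploy pipeline):

```bash
# Option 1: revert the suspect commit and ship through the normal pipeline
git revert <bad-commit-sha>
git push origin main

# Option 2: redeploy the last known-good release via a workflow dispatch
gh workflow run deploy.yml --ref v1.2.3   # placeholder tag for the last healthy release
```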
## Integration with Code-Level Debugging
When investigation narrows to specific code:
- Switch to `systematic-debugging.md` for the code-level fix
- Use `root-cause-tracing.md` if the error is deep in the call stack
- Apply `defense-in-depth.md` after fixing
- Always finish with `verification.md`