Systematic Debugging

Four-phase debugging framework that ensures root cause investigation before attempting fixes.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If haven't completed Phase 1, cannot propose fixes.

Must complete each phase before proceeding to next.

BEFORE attempting ANY fix:

Read Error Messages Carefully - Don't skip past errors/warnings, read stack traces completely
Reproduce Consistently - Can trigger reliably? Exact steps? If not reproducible → gather more data
Check Recent Changes - What changed? Git diff, recent commits, new dependencies, config changes
Gather Evidence in Multi-Component Systems
- For EACH component boundary: log data entering/exiting, verify environment propagation
- Run once to gather evidence showing WHERE it breaks
- THEN analyze to identify failing component
Trace Data Flow - Where does bad value originate? Trace up call stack until finding source (see root-cause-tracing.md)

Find pattern before fixing:

Find Working Examples - Locate similar working code in same codebase
Compare Against References - Read reference implementation COMPLETELY, understand fully before applying
Identify Differences - List every difference however small, don't assume "that can't matter"
Understand Dependencies - What other components, settings, config, environment needed?

Scientific method:

Form Single Hypothesis - "I think X is root cause because Y", be specific not vague
Test Minimally - SMALLEST possible change to test hypothesis, one variable at a time
Verify Before Continuing - Worked? → Phase 4. Didn't work? → NEW hypothesis. DON'T add more fixes
When Don't Know - Say "I don't understand X", don't pretend, ask for help

Fix root cause, not symptom:

Create Failing Test Case - Simplest reproduction, automated if possible, MUST have before fixing
Implement Single Fix - Address root cause identified, ONE change, no "while I'm here" improvements
Verify Fix - Test passes? No other tests broken? Issue actually resolved?
If Fix Doesn't Work
- STOP. Count: How many fixes tried?
- If < 3: Return to Phase 1, re-analyze with new information
- If ≥ 3: STOP and question architecture
If 3+ Fixes Failed: Question Architecture
- Pattern: Each fix reveals new shared state/coupling problem elsewhere
- STOP and question fundamentals: Is pattern sound? Wrong architecture?
- Discuss with human partner before more fixes

If catch yourself thinking:

ALL mean: STOP. Return to Phase 1.

When see these: STOP. Return to Phase 1.

Excuse	Reality
"Issue is simple, don't need process"	Simple issues have root causes too
"Emergency, no time for process"	Systematic is FASTER than guess-and-check
"Just try this first, then investigate"	First fix sets pattern. Do right from start
"One more fix attempt" (after 2+ failures)	3+ failures = architectural problem

From debugging sessions: