Backend Debugging Strategies
Comprehensive debugging techniques, tools, and best practices for backend systems (2025).
Debugging Mindset
The Scientific Method for Debugging
- Observe - Gather symptoms and data
- Hypothesize - Form theories about the cause
- Test - Verify or disprove theories
- Iterate - Refine understanding
- Fix - Apply solution
- Verify - Confirm fix works
Golden Rules
- Reproduce first - Debugging without reproduction is guessing
- Simplify the problem - Isolate variables
- Read the logs - Error messages contain clues
- Check assumptions - "It should work" isn't debugging
- Use scientific method - Avoid random changes
- Document findings - Future you will thank you
Logging Best Practices
Structured Logging
Node.js (Pino - High Performance)
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
// pino-pretty is for local development; in production, emit raw JSON lines
transport: {
target: 'pino-pretty',
options: { colorize: true }
}
});
// Structured logging with context
logger.info({ userId: '123', action: 'login' }, 'User logged in');
// Error logging with stack trace
try {
await riskyOperation();
} catch (error) {
logger.error({ err: error, userId: '123' }, 'Operation failed');
}
Python (Structlog)
import structlog
logger = structlog.get_logger()
# Structured context
logger.info("user_login", user_id="123", ip="192.168.1.1")
# Error with exception
try:
    risky_operation()
except Exception:
    logger.error("operation_failed", user_id="123", exc_info=True)
Go (Zap - High Performance)
import "go.uber.org/zap"
logger, _ := zap.NewProduction()
defer logger.Sync()
// Structured fields
logger.Info("user logged in",
zap.String("user_id", "123"),
zap.String("ip", "192.168.1.1"),
)
// Error logging
if err := riskyOperation(); err != nil {
logger.Error("operation failed",
zap.Error(err),
zap.String("user_id", "123"),
)
}
Log Levels
| Level | Purpose | Example |
|---|---|---|
| TRACE | Very detailed, dev only | Request/response bodies |
| DEBUG | Detailed info for debugging | SQL queries, cache hits |
| INFO | General informational | User login, API calls |
| WARN | Potential issues | Deprecated API usage |
| ERROR | Error conditions | Failed API calls, exceptions |
| FATAL | Critical failures | Database connection lost |
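The level threshold is what makes these levels useful: messages below the configured minimum are dropped before they cost anything. A hypothetical minimal leveled logger (real libraries like Pino and Zap implement the same idea, plus serialization and transports):

```javascript
// Numeric weights per level; a message is emitted only if its weight
// meets the configured threshold.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

function createLogger(minLevel = 'info', sink = console.log) {
  const threshold = LEVELS[minLevel];
  const logger = {};
  for (const [name, weight] of Object.entries(LEVELS)) {
    logger[name] = (context, message) => {
      if (weight < threshold) return; // below threshold: dropped
      sink(JSON.stringify({ level: name, time: Date.now(), ...context, msg: message }));
    };
  }
  return logger;
}

// Usage: with a minimum level of 'info', debug lines never reach the sink
const lines = [];
const log = createLogger('info', (line) => lines.push(line));
log.debug({ query: 'SELECT 1' }, 'cache miss');  // suppressed
log.info({ userId: '123' }, 'user logged in');   // emitted
```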
What to Log
✅ DO LOG:
- Request/response metadata (not bodies in prod)
- Error messages with context
- Performance metrics (duration, size)
- Security events (login, permission changes)
- Business events (orders, payments)
❌ DON'T LOG:
- Passwords or secrets
- Credit card numbers
- Personally identifiable information (PII)
- Session tokens
- Full request bodies in production
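The "don't log" list can be enforced in code rather than by convention: strip known sensitive fields before an object ever reaches the logger. A sketch (the field names are illustrative; match them to your own payloads, and note that Pino also ships a built-in `redact` option for the same purpose):

```javascript
// Field names considered sensitive in this hypothetical example.
const SENSITIVE_KEYS = new Set(['password', 'secret', 'token', 'creditCard', 'ssn']);

// Recursively replace sensitive values; returns a new object, never
// mutating the input (the original may still be needed by the caller).
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [key, v] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value;
}

const safe = redact({ user: 'john', password: 'hunter2', meta: { token: 'abc' } });
// safe.password and safe.meta.token are now '[REDACTED]'
```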
Debugging Tools by Language
Node.js / TypeScript
1. Chrome DevTools (Built-in)
# Run with inspect flag
node --inspect-brk app.js
# Open chrome://inspect in Chrome
# Set breakpoints, step through code
2. VS Code Debugger
// .vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"type": "node",
"request": "launch",
"name": "Debug Server",
"skipFiles": ["<node_internals>/**"],
"program": "${workspaceFolder}/src/index.ts",
"preLaunchTask": "npm: build",
"outFiles": ["${workspaceFolder}/dist/**/*.js"]
}
]
}
3. Debug Module
import debug from 'debug';
const log = debug('app:server');
const error = debug('app:error');
log('Starting server on port %d', 3000);
error('Failed to connect to database');
// Run with: DEBUG=app:* node app.js
Python
1. PDB (Built-in Debugger)
import pdb
def problematic_function(data):
    # Set breakpoint (Python 3.7+ can use the built-in breakpoint() instead)
    pdb.set_trace()
    # Debugger commands:
    # l - list code
    # n - next line
    # s - step into
    # c - continue
    # p variable - print variable
    # q - quit
    result = process(data)
    return result
2. IPython Debugger (ipdb)
# pip install ipdb
import ipdb
def problematic_function(data):
    # Same commands as pdb, plus tab completion and syntax highlighting
    ipdb.set_trace()
    result = process(data)
    return result
3. VS Code Debugger
// .vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: FastAPI",
"type": "debugpy",
"request": "launch",
"module": "uvicorn",
"args": ["main:app", "--reload"],
"jinja": true
}
]
}
Go
1. Delve (Standard Debugger)
# Install
go install github.com/go-delve/delve/cmd/dlv@latest
# Debug
dlv debug main.go
# Commands:
# b main.main - set breakpoint
# c - continue
# n - next line
# s - step into
# p variable - print variable
# q - quit
2. VS Code Debugger
// .vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"name": "Launch Package",
"type": "go",
"request": "launch",
"mode": "debug",
"program": "${workspaceFolder}"
}
]
}
Rust
1. LLDB/GDB (Native Debuggers)
# Build with debug info
cargo build
# Debug with LLDB
rust-lldb ./target/debug/myapp
# Debug with GDB
rust-gdb ./target/debug/myapp
2. VS Code Debugger (CodeLLDB)
// .vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"type": "lldb",
"request": "launch",
"name": "Debug",
"program": "${workspaceFolder}/target/debug/myapp",
"args": [],
"cwd": "${workspaceFolder}"
}
]
}
Database Debugging
SQL Query Debugging (PostgreSQL)
1. EXPLAIN ANALYZE
-- Show query execution plan and actual timings
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 10;
-- Look for:
-- - Seq Scan on large tables (missing indexes)
-- - High execution time
-- - Large row estimates
2. Enable Slow Query Logging
-- PostgreSQL configuration
ALTER DATABASE mydb SET log_min_duration_statement = 1000; -- Log queries >1s
-- Check slow queries (requires the pg_stat_statements extension)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
3. Active Query Monitoring
-- See currently running queries
SELECT pid, now() - query_start as duration, query, state
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
-- Kill a long-running query (pass a pid from the query above)
SELECT pg_terminate_backend(pid);
MongoDB Debugging
1. Explain Query Performance
db.users.find({ email: 'test@example.com' }).explain('executionStats')
// Look for:
// - totalDocsExamined vs nReturned (should be close)
// - COLLSCAN (collection scan - needs index)
// - executionTimeMillis (should be low)
2. Profile Slow Queries
// Enable profiling for queries >100ms
db.setProfilingLevel(1, { slowms: 100 })
// View slow queries
db.system.profile.find().limit(5).sort({ ts: -1 }).pretty()
// Disable profiling
db.setProfilingLevel(0)
Redis Debugging
1. Monitor Commands
# See all commands in real-time
redis-cli MONITOR
# Check slow log
redis-cli SLOWLOG GET 10
# Set slow log threshold (microseconds)
redis-cli CONFIG SET slowlog-log-slower-than 10000
2. Memory Analysis
# Memory usage by key pattern
redis-cli --bigkeys
# Memory usage details
redis-cli INFO memory
# Analyze specific key
redis-cli MEMORY USAGE mykey
API Debugging
HTTP Request Debugging
1. cURL Testing
# Verbose output with headers
curl -v https://api.example.com/users
# Include response headers
curl -i https://api.example.com/users
# POST with JSON
curl -X POST https://api.example.com/users \
-H "Content-Type: application/json" \
-d '{"name":"John","email":"john@example.com"}' \
-v
# Save response to file
curl https://api.example.com/users -o response.json
2. HTTPie (User-Friendly)
# Install
pip install httpie
# Simple GET
http GET https://api.example.com/users
# POST with JSON
http POST https://api.example.com/users name=John email=john@example.com
# Custom headers
http GET https://api.example.com/users Authorization:"Bearer token123"
3. Request Logging Middleware
Express/Node.js:
import morgan from 'morgan';
// Development
app.use(morgan('dev'));
// Production (Apache combined format)
app.use(morgan('combined'));
// Custom format
app.use(morgan(':method :url :status :response-time ms - :res[content-length]'));
FastAPI/Python:
from fastapi import Request
import time
@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time
    logger.info(
        "request_processed",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=duration * 1000,
    )
    return response
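The same pattern works without any framework: wrap a plain Node.js request handler, time it, and emit one JSON line per request (field names mirror the FastAPI middleware above). A dependency-free sketch:

```javascript
// Wraps a (req, res) handler so every request emits a structured log line
// once the response has been sent ('finish' fires after the last byte).
function withRequestLogging(handler, sink = console.log) {
  return (req, res) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      sink(JSON.stringify({
        method: req.method,
        path: req.url,
        status_code: res.statusCode,
        duration_ms: durationMs,
      }));
    });
    return handler(req, res);
  };
}

// Usage: http.createServer(withRequestLogging(myHandler)).listen(3000)
```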
Performance Debugging
CPU Profiling
Node.js (0x)
# Install
npm install -g 0x
# Profile application
0x node app.js
# Open flamegraph in browser
# Identify hot spots (red areas)
Node.js (Clinic.js)
# Install
npm install -g clinic
# Diagnose overall health (event loop, CPU, memory)
clinic doctor -- node app.js
# CPU flamegraph
clinic flame -- node app.js
# Heap profiling
clinic heapprofiler -- node app.js
# Async flow / event loop analysis
clinic bubbleprof -- node app.js
Python (cProfile)
import cProfile
import pstats
# Profile function
profiler = cProfile.Profile()
profiler.enable()
# Your code
result = expensive_operation()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # Top 10 functions
Go (pprof)
import (
"net/http"
_ "net/http/pprof"
)
func main() {
// Enable profiling endpoint
go func() {
http.ListenAndServe("localhost:6060", nil)
}()
// Your application
startServer()
}
// Profile CPU
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Profile heap
// go tool pprof http://localhost:6060/debug/pprof/heap
Memory Debugging
Node.js (Heap Snapshots)
// Take heap snapshot programmatically
import { writeHeapSnapshot } from 'v8';
app.get('/debug/heap', (req, res) => { // protect this route; never expose it publicly
const filename = writeHeapSnapshot();
res.send(`Heap snapshot written to ${filename}`);
});
// Analyze in Chrome DevTools
// 1. Load heap snapshot
// 2. Compare snapshots to find memory leaks
// 3. Look for detached DOM nodes, large arrays
Python (Memory Profiler)
from memory_profiler import profile
@profile
def memory_intensive_function():
    large_list = [i for i in range(1000000)]
    return sum(large_list)
# Run with: python -m memory_profiler script.py
# Shows line-by-line memory usage
Production Debugging
Application Performance Monitoring (APM)
New Relic
// newrelic.js (CommonJS; the agent reads this file at startup)
exports.config = {
app_name: ['My Backend API'],
license_key: process.env.NEW_RELIC_LICENSE_KEY,
logging: { level: 'info' },
distributed_tracing: { enabled: true },
};
// Import at app entry
import 'newrelic';
DataDog
import tracer from 'dd-trace';
tracer.init({
service: 'backend-api',
env: process.env.NODE_ENV,
version: '1.0.0',
logInjection: true
});
Sentry (Error Tracking)
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
tracesSampleRate: 1.0, // 100% sampling; lower this in production
});
// Capture errors
try {
await riskyOperation();
} catch (error) {
Sentry.captureException(error, {
user: { id: userId },
tags: { operation: 'payment' },
});
}
Distributed Tracing
OpenTelemetry (Vendor-Agnostic)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces', // OTLP/HTTP endpoint; Jaeger accepts OTLP natively
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Traces HTTP, database, Redis automatically
Log Aggregation
ELK Stack (Elasticsearch, Logstash, Kibana)
# docker-compose.yml
version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - 5601:5601
Loki + Grafana (Lightweight)
# promtail config for log shipping
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: backend-api
          __path__: /var/log/app/*.log
Common Debugging Scenarios
1. High CPU Usage
Steps:
- Profile CPU (flamegraph)
- Identify hot functions
- Check for:
- Infinite loops
- Heavy regex operations
- Inefficient algorithms (O(n²))
- Blocking operations in event loop (Node.js)
Node.js Example:
// ❌ Bad: Blocking event loop
function fibonacci(n) {
if (n <= 1) return n;
return fibonacci(n - 1) + fibonacci(n - 2); // Exponential time
}
// ✅ Good: Memoized or iterative
const memo = new Map();
function fibonacciMemo(n) {
if (n <= 1) return n;
if (memo.has(n)) return memo.get(n);
const result = fibonacciMemo(n - 1) + fibonacciMemo(n - 2);
memo.set(n, result);
return result;
}
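Hot synchronous code like the naive fibonacci above stalls the event loop, which shows up as every request getting slower at once. A simple detector is to schedule a repeating timer and measure how late it fires; sustained lag means something is blocking. (Node.js also ships `monitorEventLoopDelay` in `perf_hooks` for a histogram-based version of the same idea.)

```javascript
// Calls onLag with how many ms late each tick fired. A healthy loop
// reports lag near 0; a blocked loop reports lag close to the block length.
function measureEventLoopLag(intervalMs, onLag) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const lag = Date.now() - expected;   // how late this tick fired
    expected = Date.now() + intervalMs;
    onLag(Math.max(0, lag));
  }, intervalMs);
  return () => clearInterval(timer);     // returns a stop function
}

// Usage: warn when the loop is blocked for more than ~100ms
// const stop = measureEventLoopLag(500, (lag) => {
//   if (lag > 100) console.warn(`event loop lag: ${lag}ms`);
// });
```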
2. Memory Leaks
Symptoms:
- Memory usage grows over time
- Eventually crashes (OOM)
- Performance degradation
Common Causes:
// ❌ Memory leak: Event listeners not removed
class DataService {
constructor(eventBus) {
eventBus.on('data', (data) => this.processData(data));
// Listener never removed, holds reference to DataService
}
}
// ✅ Fix: Remove listeners
class DataService {
constructor(eventBus) {
this.eventBus = eventBus;
this.handler = (data) => this.processData(data);
eventBus.on('data', this.handler);
}
destroy() {
this.eventBus.off('data', this.handler);
}
}
// ❌ Memory leak: Global cache without limits
const cache = new Map();
function getCachedData(key) {
if (!cache.has(key)) {
cache.set(key, expensiveOperation(key)); // Grows forever
}
return cache.get(key);
}
// ✅ Fix: LRU cache with size limit
import { LRUCache } from 'lru-cache'; // named export since lru-cache v7
const cache = new LRUCache({ max: 1000, ttl: 1000 * 60 * 60 });
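The fix above uses the lru-cache package; the underlying idea fits in a few lines. A JavaScript Map preserves insertion order, so deleting and re-setting a key on each read moves it to the "most recently used" end, and the first key in iteration order is the eviction victim when the size cap is hit:

```javascript
// Minimal LRU cache sketch; production code should prefer lru-cache,
// which adds TTLs, size accounting, and disposal hooks.
class SimpleLRU {
  constructor(max) { this.max = max; this.map = new Map(); }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);          // move key to most-recent position
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value); // evict least recent
    }
  }
}

const lru = new SimpleLRU(2);
lru.set('a', 1); lru.set('b', 2);
lru.get('a');      // 'a' becomes most recent
lru.set('c', 3);   // cache is full, so least-recent 'b' is evicted
```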
Detection:
# Node.js: raise the heap limit and expose gc() for controlled measurements
node --expose-gc --max-old-space-size=4096 app.js
# Take periodic heap snapshots
# Compare snapshots in Chrome DevTools
3. Slow Database Queries
Steps:
- Enable slow query log
- Analyze with EXPLAIN
- Add indexes
- Optimize query
PostgreSQL Example:
-- Before: Slow full table scan
SELECT * FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;
-- EXPLAIN shows: Seq Scan on orders
-- Fix: Add index
CREATE INDEX idx_orders_user_id_created_at
ON orders(user_id, created_at DESC);
-- After: Index Scan using idx_orders_user_id_created_at
-- 100x faster
4. Connection Pool Exhaustion
Symptoms:
- "Connection pool exhausted" errors
- Requests hang indefinitely
- Database connections at max
Causes & Fixes:
// ❌ Bad: Connection leak
async function getUser(id) {
const client = await pool.connect();
const result = await client.query('SELECT * FROM users WHERE id = $1', [id]);
return result.rows[0];
// Connection never released!
}
// ✅ Good: Always release
async function getUser(id) {
const client = await pool.connect();
try {
const result = await client.query('SELECT * FROM users WHERE id = $1', [id]);
return result.rows[0];
} finally {
client.release(); // Always release
}
}
// ✅ Better: Use pool directly
async function getUser(id) {
const result = await pool.query('SELECT * FROM users WHERE id = $1', [id]);
return result.rows[0];
// Automatically releases
}
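The release bug above can be made structurally impossible by centralizing the acquire/try/finally pattern in one helper. `pool` here is any object whose `connect()` resolves to a client with a `release()` method (node-postgres fits this shape); the stub pool below only exists to demonstrate the guarantee:

```javascript
// Runs fn with a pooled client and guarantees release on every path.
async function withClient(pool, fn) {
  const client = await pool.connect();
  try {
    return await fn(client);   // callers never touch release() directly
  } finally {
    client.release();          // runs on success AND on throw
  }
}

// Stub pool that counts releases, for demonstration only.
const releases = { count: 0 };
const stubPool = {
  connect: async () => ({ release: () => { releases.count += 1; } }),
};
```

With this in place, a leaking `getUser` becomes `withClient(pool, (c) => c.query(...))` and cannot forget to release.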
5. Race Conditions
Example:
// ❌ Bad: Race condition between concurrent requests
let counter = 0;
async function incrementCounter() {
const current = counter; // Request A reads 0
await doSomethingAsync(); // Request B runs and also reads 0
counter = current + 1; // A writes 1, then B also writes 1
// Expected: 2, Actual: 1
}
// ✅ Fix: Atomic operations (Redis)
async function incrementCounter() {
return await redis.incr('counter');
// Atomic, thread-safe
}
// ✅ Fix: Database transactions
async function incrementCounter(userId) {
await db.transaction(async (trx) => {
const user = await trx('users')
.where({ id: userId })
.forUpdate() // Row-level lock
.first();
await trx('users')
.where({ id: userId })
.update({ counter: user.counter + 1 });
});
}
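Within a single Node.js process, the same read-modify-write race can also be closed with an in-process async mutex: each critical section is queued behind the previous one, so no interleaving across `await` points is possible. (Across multiple processes you still need Redis INCR or a database lock as shown above.) A sketch, where `doSomethingAsync` from the racy example is stood in by `Promise.resolve()`:

```javascript
// A minimal async mutex: lock(fn) runs fn only after all previously
// queued sections have finished, even if some of them threw.
function createMutex() {
  let tail = Promise.resolve();
  return function lock(criticalSection) {
    const run = tail.then(criticalSection); // queue behind prior holders
    tail = run.catch(() => {});             // keep the chain alive on errors
    return run;
  };
}

// The racy increment from above, now serialized.
let counter = 0;
const withLock = createMutex();
async function incrementCounter() {
  return withLock(async () => {
    const current = counter;
    await Promise.resolve();  // stand-in for doSomethingAsync()
    counter = current + 1;
  });
}
```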
Debugging Checklist
Before Diving Into Code:
- Read error message completely
- Check logs for context
- Reproduce the issue reliably
- Isolate the problem (binary search)
- Verify assumptions
Investigation:
- Enable debug logging
- Add strategic log points
- Use debugger breakpoints
- Profile performance if slow
- Check database queries
- Monitor system resources
Production Issues:
- Check APM dashboards
- Review distributed traces
- Analyze error rates
- Compare with previous baseline
- Check for recent deployments
- Review infrastructure changes
After Fix:
- Verify fix in development
- Add regression test
- Document the issue
- Deploy with monitoring
- Confirm fix in production
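"Isolate the problem (binary search)" from the checklist can itself be expressed as code: given an ordered list of candidates (commits, config changes, input rows) and a predicate that reports whether the bug is present, find the first bad candidate in O(log n) checks. This assumes the predicate is monotonic (once bad, always bad), which is the same assumption `git bisect` makes:

```javascript
// Returns the index of the first candidate where isBad is true,
// or -1 if no candidate fails.
function findFirstBad(candidates, isBad) {
  let lo = 0, hi = candidates.length - 1, firstBad = -1;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (isBad(candidates[mid])) { firstBad = mid; hi = mid - 1; }
    else { lo = mid + 1; }
  }
  return firstBad;
}

// Hypothetical example: 5 commits, bug introduced at 'c3'.
const commits = ['a1', 'b2', 'c3', 'd4', 'e5'];
const firstBroken = findFirstBad(commits, (sha) => ['c3', 'd4', 'e5'].includes(sha));
// firstBroken === 2, i.e. commit 'c3' introduced the bug
```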
Debugging Resources
Tools:
- Node.js: https://nodejs.org/en/docs/guides/debugging-getting-started/
- Chrome DevTools: https://developer.chrome.com/docs/devtools/
- Clinic.js: https://clinicjs.org/
- Sentry: https://docs.sentry.io/
- DataDog: https://docs.datadoghq.com/
- New Relic: https://docs.newrelic.com/
Best Practices:
- 12 Factor App Logs: https://12factor.net/logs
- Google SRE Book: https://sre.google/sre-book/table-of-contents/
- OpenTelemetry: https://opentelemetry.io/docs/
Database:
- PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/using-explain.html
- MongoDB Performance: https://www.mongodb.com/docs/manual/administration/analyzing-mongodb-performance/