# Log & CI/CD Analysis
Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures.
## GitHub Actions Analysis

### List and Inspect Runs
```bash
# List recent runs (all workflows)
gh run list --limit 10

# List runs for specific workflow
gh run list --workflow=ci.yml --limit 5

# View specific run details
gh run view <run-id>

# View failed job logs only
gh run view <run-id> --log-failed

# Download complete logs
gh run view <run-id> --log > /tmp/ci-full.txt

# Re-run failed jobs
gh run rerun <run-id> --failed
```
### Common CI/CD Failure Patterns
| Pattern | Likely Cause | Investigation |
|---|---|---|
| Passes locally, fails CI | Environment diff | Check Node/Python version, OS, env vars |
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
| Permission errors | Token/secret misconfiguration | Verify GITHUB_TOKEN, secret names |
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |
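For the intermittent-failure row, the quickest signal is simply re-running the suspect command several times and counting failures. A minimal sketch; `TEST_CMD` is a stand-in, substitute your real test invocation:

```shell
# Re-run a suspect command several times to gauge flakiness.
# TEST_CMD is a stand-in -- substitute your real test invocation.
TEST_CMD="true"
runs=5
failures=0
for i in $(seq 1 "$runs"); do
  $TEST_CMD > /dev/null 2>&1 || failures=$((failures + 1))
done
echo "$failures/$runs runs failed"
```

Any nonzero failure count on an unchanged working tree points at timing or shared state rather than the code under test.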
### Analyzing Failed Steps

- Identify which step failed - `gh run view <id>` shows step-by-step status
- Get the logs - `gh run view <id> --log-failed` for focused output
- Search for error patterns - Look for `Error:`, `FAIL`, `exit code`, stack traces
- Check annotations - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`
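The error-pattern search can be scripted over a downloaded log. A sketch using a sample log generated inline; in practice, point `LOG` at a real download such as `/tmp/ci-full.txt`:

```shell
# Grep a CI log for common failure markers (sample log generated inline).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:00:01Z INFO  build started
2024-05-01T10:00:09Z ERROR Error: module not found
2024-05-01T10:00:10Z INFO  tests: 3 passed, 1 FAIL
EOF
# -n adds line numbers so excerpts are easy to cite later.
grep -nE 'Error:|FAIL|exit code' "$LOG"
matches=$(grep -cE 'Error:|FAIL|exit code' "$LOG")
echo "$matches failure markers found"
```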
## Server Log Analysis

### Log Collection Strategy
- Identify log locations - Application logs, system logs, web server logs
- Filter by timeframe - Narrow to incident window
- Correlate request IDs - Trace single request across services
- Look for patterns - Repeated errors, error rate changes, unusual payloads
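The timeframe and request-ID steps can be sketched with standard tools. The field layout here (ISO 8601 timestamp first, request ID third) and the `req-2` identifier are illustrative assumptions; adjust the field positions to your log format:

```shell
# Narrow a log to the incident window, then trace one request ID.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:00:01Z INFO  req-1 accepted
2024-05-01T10:05:12Z ERROR req-2 upstream timeout
2024-05-01T10:06:30Z ERROR req-2 retry failed
2024-05-01T11:00:00Z INFO  req-3 accepted
EOF
# Lexicographic comparison works because ISO 8601 timestamps sort correctly.
window='$1 >= "2024-05-01T10:05:00Z" && $1 <= "2024-05-01T10:10:00Z"'
awk "$window" "$LOG"
incident_lines=$(awk "$window" "$LOG" | wc -l)
# Correlate: every line touching a single request across the file.
grep 'req-2' "$LOG"
req_lines=$(grep -c 'req-2' "$LOG")
```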
### Structured Log Queries

```bash
# Search application logs for errors (use Grep tool when possible)
# Pattern: timestamp, level, message
# Filter by time range and severity

# PostgreSQL slow query log (requires the pg_stat_statements extension)
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check database connections
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```
### Cross-Source Correlation

- Align timestamps - Normalize all log sources to a single timezone (ideally UTC) before comparing
- Build timeline - First error → propagation → user impact
- Identify trigger - What changed immediately before first error?
- Map blast radius - Which services/endpoints affected?
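The alignment and timeline steps can be sketched by merging sources on their timestamps, assuming both already use ISO 8601 UTC. Tagging each line with its origin keeps provenance after the merge (the sample lines are hypothetical):

```shell
# Merge two log sources into one timeline, tagging each line with its origin.
APP=$(mktemp); NGINX=$(mktemp)
echo '2024-05-01T10:05:12Z ERROR db connection refused' > "$APP"
echo '2024-05-01T10:05:13Z 502 GET /api/orders' > "$NGINX"
timeline=$( { sed 's/$/ [app]/' "$APP"; sed 's/$/ [nginx]/' "$NGINX"; } | sort -k1,1 )
echo "$timeline"
```

Here the application error precedes the gateway 502 by one second, which is exactly the trigger-before-impact ordering the timeline is meant to surface.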
## Application Log Analysis

### Error Pattern Recognition
- Sudden spike → Deployment, config change, external dependency failure
- Gradual increase → Resource leak, data growth, degradation
- Periodic failures → Cron jobs, scheduled tasks, resource contention
- Single endpoint → Code bug, data issue, specific dependency
- All endpoints → Infrastructure, database, network issue
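Bucketing errors per minute makes the spike-vs-gradual distinction visible. A sketch over a generated sample; `cut -c1-16` keeps an ISO 8601 timestamp down to the minute:

```shell
# Count ERROR lines per minute to see the shape of the error rate.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:04:59Z ERROR timeout
2024-05-01T10:05:01Z ERROR timeout
2024-05-01T10:05:40Z ERROR refused
EOF
# uniq -c turns the sorted minute buckets into counts per minute.
grep ERROR "$LOG" | cut -c1-16 | sort | uniq -c
buckets=$(grep ERROR "$LOG" | cut -c1-16 | sort | uniq -c | wc -l)
```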
### Key Log Fields
Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration
## Evidence Preservation
Always capture relevant log excerpts for the diagnostic report. Include:
- Exact error messages and stack traces
- Timestamps and request IDs
- Before/after comparison (normal vs error state)
- Counts and frequencies
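Capturing the first error with its surrounding normal lines covers the before/after comparison in one step. A sketch with a sample log generated inline:

```shell
# Preserve the first error plus one line of context on either side.
LOG=$(mktemp); REPORT=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:05:11Z INFO  request accepted
2024-05-01T10:05:12Z ERROR db connection refused
2024-05-01T10:05:13Z INFO  retrying
EOF
# -m1 stops at the first match; -B/-A keep the normal-vs-error context; -n cites line numbers.
grep -n -B1 -A1 -m1 'ERROR' "$LOG" > "$REPORT"
cat "$REPORT"
report_lines=$(wc -l < "$REPORT")
```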