# Log & CI/CD Analysis
Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures.
## GitHub Actions Analysis

### List and Inspect Runs
```bash
# List recent runs (all workflows)
gh run list --limit 10

# List runs for specific workflow
gh run list --workflow=ci.yml --limit 5

# View specific run details
gh run view <run-id>

# View failed job logs only
gh run view <run-id> --log-failed

# Download complete logs
gh run view <run-id> --log > /tmp/ci-full.txt

# Re-run failed jobs
gh run rerun <run-id> --failed
```
### Common CI/CD Failure Patterns
| Pattern | Likely Cause | Investigation |
|---|---|---|
| Passes locally, fails CI | Environment diff | Check Node/Python version, OS, env vars |
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
| Permission errors | Token/secret misconfiguration | Verify GITHUB_TOKEN, secret names |
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |
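For the intermittent-failure row, the quickest signal is simply re-running the suspect command several times and counting failures. A minimal sketch; `TEST_CMD` is a stand-in, substitute your real test invocation:

```shell
# Re-run a suspect command several times to gauge flakiness.
# TEST_CMD is a stand-in -- substitute your real test invocation.
TEST_CMD="true"
runs=5
failures=0
for i in $(seq 1 "$runs"); do
  $TEST_CMD > /dev/null 2>&1 || failures=$((failures + 1))
done
echo "$failures/$runs runs failed"
```

Any nonzero failure count on an unchanged working tree points at timing or shared state rather than the code under test.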
### Analyzing Failed Steps

- Identify which step failed - `gh run view <id>` shows step-by-step status
- Get the logs - `gh run view <id> --log-failed` for focused output
- Search for error patterns - Look for `Error:`, `FAIL`, `exit code`, stack traces
- Check annotations - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`
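The error-pattern search can be scripted over a downloaded log. A sketch using a sample log generated inline; in practice, point `LOG` at a real download such as `/tmp/ci-full.txt`:

```shell
# Grep a CI log for common failure markers (sample log generated inline).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:00:01Z INFO  build started
2024-05-01T10:00:09Z ERROR Error: module not found
2024-05-01T10:00:10Z INFO  tests: 3 passed, 1 FAIL
EOF
# -n adds line numbers so excerpts are easy to cite later.
grep -nE 'Error:|FAIL|exit code' "$LOG"
matches=$(grep -cE 'Error:|FAIL|exit code' "$LOG")
echo "$matches failure markers found"
```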
## Server Log Analysis

### Log Collection Strategy
- Identify log locations - Application logs, system logs, web server logs
- Filter by timeframe - Narrow to incident window
- Correlate request IDs - Trace single request across services
- Look for patterns - Repeated errors, error rate changes, unusual payloads
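The timeframe and request-ID steps can be sketched with standard tools. The field layout here (ISO 8601 timestamp first, request ID third) and the `req-2` identifier are illustrative assumptions; adjust the field positions to your log format:

```shell
# Narrow a log to the incident window, then trace one request ID.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:00:01Z INFO  req-1 accepted
2024-05-01T10:05:12Z ERROR req-2 upstream timeout
2024-05-01T10:06:30Z ERROR req-2 retry failed
2024-05-01T11:00:00Z INFO  req-3 accepted
EOF
# Lexicographic comparison works because ISO 8601 timestamps sort correctly.
window='$1 >= "2024-05-01T10:05:00Z" && $1 <= "2024-05-01T10:10:00Z"'
awk "$window" "$LOG"
incident_lines=$(awk "$window" "$LOG" | wc -l)
# Correlate: every line touching a single request across the file.
grep 'req-2' "$LOG"
req_lines=$(grep -c 'req-2' "$LOG")
```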
### Structured Log Queries

```bash
# Search application logs for errors (use Grep tool when possible)
# Pattern: timestamp, level, message
# Filter by time range and severity

# PostgreSQL slow query log (requires the pg_stat_statements extension)
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check database connections
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```
### Cross-Source Correlation

- Align timestamps - Normalize all log sources to a single timezone (ideally UTC) before comparing
- Build timeline - First error → propagation → user impact
- Identify trigger - What changed immediately before first error?
- Map blast radius - Which services/endpoints affected?
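The alignment and timeline steps can be sketched by merging sources on their timestamps, assuming both already use ISO 8601 UTC. Tagging each line with its origin keeps provenance after the merge (the sample lines are hypothetical):

```shell
# Merge two log sources into one timeline, tagging each line with its origin.
APP=$(mktemp); NGINX=$(mktemp)
echo '2024-05-01T10:05:12Z ERROR db connection refused' > "$APP"
echo '2024-05-01T10:05:13Z 502 GET /api/orders' > "$NGINX"
timeline=$( { sed 's/$/ [app]/' "$APP"; sed 's/$/ [nginx]/' "$NGINX"; } | sort -k1,1 )
echo "$timeline"
```

Here the application error precedes the gateway 502 by one second, which is exactly the trigger-before-impact ordering the timeline is meant to surface.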
## Application Log Analysis

### Error Pattern Recognition
- Sudden spike → Deployment, config change, external dependency failure
- Gradual increase → Resource leak, data growth, degradation
- Periodic failures → Cron jobs, scheduled tasks, resource contention
- Single endpoint → Code bug, data issue, specific dependency
- All endpoints → Infrastructure, database, network issue
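Bucketing errors per minute makes the spike-vs-gradual distinction visible. A sketch over a generated sample; `cut -c1-16` keeps an ISO 8601 timestamp down to the minute:

```shell
# Count ERROR lines per minute to see the shape of the error rate.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:04:59Z ERROR timeout
2024-05-01T10:05:01Z ERROR timeout
2024-05-01T10:05:40Z ERROR refused
EOF
# uniq -c turns the sorted minute buckets into counts per minute.
grep ERROR "$LOG" | cut -c1-16 | sort | uniq -c
buckets=$(grep ERROR "$LOG" | cut -c1-16 | sort | uniq -c | wc -l)
```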
### Key Log Fields
Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration
## Evidence Preservation
Always capture relevant log excerpts for the diagnostic report. Include:
- Exact error messages and stack traces
- Timestamps and request IDs
- Before/after comparison (normal vs error state)
- Counts and frequencies
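Capturing the first error with its surrounding normal lines covers the before/after comparison in one step. A sketch with a sample log generated inline:

```shell
# Preserve the first error plus one line of context on either side.
LOG=$(mktemp); REPORT=$(mktemp)
cat > "$LOG" <<'EOF'
2024-05-01T10:05:11Z INFO  request accepted
2024-05-01T10:05:12Z ERROR db connection refused
2024-05-01T10:05:13Z INFO  retrying
EOF
# -m1 stops at the first match; -B/-A keep the normal-vs-error context; -n cites line numbers.
grep -n -B1 -A1 -m1 'ERROR' "$LOG" > "$REPORT"
cat "$REPORT"
report_lines=$(wc -l < "$REPORT")
```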