# Log & CI/CD Analysis

Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures. Worked sketches for several of the steps below appear at the end of this section.

## GitHub Actions Analysis

### List and Inspect Runs

```bash
# List recent runs (all workflows)
gh run list --limit 10

# List runs for a specific workflow
gh run list --workflow=ci.yml --limit 5

# View specific run details (prompts for a run if no ID is given)
gh run view

# View failed job logs only
gh run view --log-failed

# Download complete logs
gh run view --log > /tmp/ci-full.txt

# Re-run failed jobs
gh run rerun --failed
```

### Common CI/CD Failure Patterns

| Pattern | Likely Cause | Investigation |
|---------|-------------|---------------|
| Passes locally, fails in CI | Environment diff | Check Node/Python version, OS, env vars |
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
| Permission errors | Token/secret misconfiguration | Verify `GITHUB_TOKEN`, secret names |
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |

### Analyzing Failed Steps

1. **Identify which step failed** - `gh run view <run-id>` shows step-by-step status
2. **Get the logs** - `gh run view <run-id> --log-failed` for focused output
3. **Search for error patterns** - Look for `Error:`, `FAIL`, `exit code`, stack traces
4. **Check annotations** - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`

## Server Log Analysis

### Log Collection Strategy

1. **Identify log locations** - Application logs, system logs, web server logs
2. **Filter by timeframe** - Narrow to the incident window
3. **Correlate request IDs** - Trace a single request across services
4. **Look for patterns** - Repeated errors, error-rate changes, unusual payloads

### Structured Log Queries

```bash
# Search application logs for errors (use Grep tool when possible)
# Pattern: timestamp, level, message
# Filter by time range and severity

# Top statements by mean execution time (requires the pg_stat_statements extension)
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check database connections by state
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```

### Cross-Source Correlation

1. **Align timestamps** across all log sources (timezone awareness)
2. **Build timeline** - First error → propagation → user impact
3. **Identify trigger** - What changed immediately before the first error?
4. **Map blast radius** - Which services/endpoints are affected?

## Application Log Analysis

### Error Pattern Recognition

- **Sudden spike** → Deployment, config change, external dependency failure
- **Gradual increase** → Resource leak, data growth, degradation
- **Periodic failures** → Cron jobs, scheduled tasks, resource contention
- **Single endpoint** → Code bug, data issue, specific dependency
- **All endpoints** → Infrastructure, database, network issue

### Key Log Fields

Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration.

### Evidence Preservation

Always capture relevant log excerpts for the diagnostic report. Include:

- Exact error messages and stack traces
- Timestamps and request IDs
- Before/after comparison (normal vs. error state)
- Counts and frequencies
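## Worked Sketches

The examples below are minimal sketches of the steps described above, not definitive implementations: file paths, request IDs, SHAs, and timestamps are hypothetical placeholders to replace with values from the incident at hand.

First, scanning a CI log downloaded with `gh run view --log` (see List and Inspect Runs) for the error markers listed under Analyzing Failed Steps:

```bash
# Scan the downloaded CI log for common failure markers
grep -nE 'Error:|FAIL|exit code [1-9]' /tmp/ci-full.txt | head -40

# Show context around each marker to expose the failing command and stack trace
grep -n -B 2 -A 8 -E 'Error:|FAIL' /tmp/ci-full.txt | head -80
```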
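A sketch of the annotations step. `gh api` substitutes `{owner}` and `{repo}` from the current repository; `<sha>` and `<check-run-id>` are placeholders to fill in from `gh run view` output:

```bash
# List check runs (with IDs) attached to the failed commit
gh api 'repos/{owner}/{repo}/commits/<sha>/check-runs' \
  --jq '.check_runs[] | "\(.id)\t\(.name)\t\(.conclusion)"'

# Fetch annotations for one failing check run
gh api 'repos/{owner}/{repo}/check-runs/<check-run-id>/annotations' \
  --jq '.[] | "\(.path):\(.start_line) [\(.annotation_level)] \(.message)"'
```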
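For timeframe filtering and request-ID correlation (Log Collection Strategy, steps 2-3), a sketch assuming plain-text logs whose lines begin with ISO-8601 timestamps:

```bash
# Narrow to the incident window; lexicographic comparison works because
# ISO-8601 timestamps sort correctly as strings
awk 'substr($0, 1, 16) >= "2024-01-15T14:00" &&
     substr($0, 1, 16) <  "2024-01-15T14:30"' /var/log/app/app.log

# Trace a single request across services via its request ID
grep -h 'req-7f3a9c' /var/log/app/*.log /var/log/nginx/access.log | sort
```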
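If the application emits structured JSON logs (one object per line), `jq` can express the same filter; the `timestamp` and `level` field names are assumptions about the log schema:

```bash
# Errors within the incident window, from JSON-lines logs
jq -c 'select(.level == "error" and
              .timestamp >= "2024-01-15T14:00:00Z" and
              .timestamp <  "2024-01-15T14:30:00Z")' /var/log/app/app.json
```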
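For Cross-Source Correlation, one way to build the timeline is to tag each line with its source file and merge everything by timestamp, assuming every source already logs ISO-8601 UTC timestamps at the start of the line:

```bash
# Tag lines with their source, then sort by the timestamp field
for f in /var/log/app/app.log /var/log/nginx/error.log; do
  awk -v src="$(basename "$f")" '{ print src "\t" $0 }' "$f"
done | sort -t $'\t' -k2 > /tmp/timeline.txt
```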
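To tell the "sudden spike" and "gradual increase" shapes apart (Error Pattern Recognition), bucket error counts per minute; the ` ERROR ` token and the leading-timestamp layout are assumptions about the log format:

```bash
# Errors per minute: a sharp jump suggests a deploy or dependency failure,
# a steady climb suggests a leak or data growth
grep ' ERROR ' /var/log/app/app.log | cut -c1-16 | sort | uniq -c | tail -30
```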
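Finally, a sketch for Evidence Preservation that bundles the excerpt, its context, and counts into a single file for the diagnostic report:

```bash
# Capture an evidence bundle for the report (hypothetical request ID)
{
  echo "== Excerpt with context: req-7f3a9c =="
  grep -n -B 3 -A 10 'req-7f3a9c' /var/log/app/app.log
  echo
  echo "== ERROR count in incident window =="
  awk 'substr($0, 1, 16) >= "2024-01-15T14:00" &&
       substr($0, 1, 16) <  "2024-01-15T14:30" && / ERROR /' /var/log/app/app.log | wc -l
} > /tmp/evidence-req-7f3a9c.txt
```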