Files
english/.opencode/skills/ck-debug/references/log-and-ci-analysis.md
2026-04-12 01:06:31 +07:00

98 lines
3.4 KiB
Markdown

# Log & CI/CD Analysis
Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures.
## GitHub Actions Analysis
### List and Inspect Runs
```bash
# List recent runs (all workflows)
gh run list --limit 10
# List runs for specific workflow
gh run list --workflow=ci.yml --limit 5
# View specific run details
gh run view <run-id>
# View failed job logs only
gh run view <run-id> --log-failed
# Download complete logs
gh run view <run-id> --log > /tmp/ci-full.txt
# Re-run failed jobs
gh run rerun <run-id> --failed
```
### Common CI/CD Failure Patterns
| Pattern | Likely Cause | Investigation |
|---------|-------------|---------------|
| Passes locally, fails CI | Environment diff | Check Node/Python version, OS, env vars |
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
| Permission errors | Token/secret misconfiguration | Verify `GITHUB_TOKEN`, secret names |
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |
### Analyzing Failed Steps
1. **Identify which step failed** - `gh run view <id>` shows step-by-step status
2. **Get the logs** - `gh run view <id> --log-failed` for focused output
3. **Search for error patterns** - Look for `Error:`, `FAIL`, `exit code`, stack traces
4. **Check annotations** - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`
## Server Log Analysis
### Log Collection Strategy
1. **Identify log locations** - Application logs, system logs, web server logs
2. **Filter by timeframe** - Narrow to incident window
3. **Correlate request IDs** - Trace single request across services
4. **Look for patterns** - Repeated errors, error rate changes, unusual payloads
### Structured Log Queries
```bash
# Search application logs for errors (use Grep tool when possible)
# Pattern: timestamp, level, message
# Filter by time range and severity
# PostgreSQL slow query log
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
# Check database connections
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```
### Cross-Source Correlation
1. **Align timestamps** across all log sources (timezone awareness)
2. **Build timeline** - First error → propagation → user impact
3. **Identify trigger** - What changed immediately before first error?
4. **Map blast radius** - Which services/endpoints affected?
## Application Log Analysis
### Error Pattern Recognition
- **Sudden spike** → Deployment, config change, external dependency failure
- **Gradual increase** → Resource leak, data growth, degradation
- **Periodic failures** → Cron jobs, scheduled tasks, resource contention
- **Single endpoint** → Code bug, data issue, specific dependency
- **All endpoints** → Infrastructure, database, network issue
### Key Log Fields
Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration
### Evidence Preservation
Always capture relevant log excerpts for the diagnostic report. Include:
- Exact error messages and stack traces
- Timestamps and request IDs
- Before/after comparison (normal vs error state)
- Counts and frequencies