init
This commit is contained in:
97
.opencode/skills/ck-debug/references/log-and-ci-analysis.md
Normal file
97
.opencode/skills/ck-debug/references/log-and-ci-analysis.md
Normal file
@@ -0,0 +1,97 @@
|
||||
# Log & CI/CD Analysis
|
||||
|
||||
Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures.
|
||||
|
||||
## GitHub Actions Analysis
|
||||
|
||||
### List and Inspect Runs
|
||||
|
||||
```bash
|
||||
# List recent runs (all workflows)
|
||||
gh run list --limit 10
|
||||
|
||||
# List runs for specific workflow
|
||||
gh run list --workflow=ci.yml --limit 5
|
||||
|
||||
# View specific run details
|
||||
gh run view <run-id>
|
||||
|
||||
# View failed job logs only
|
||||
gh run view <run-id> --log-failed
|
||||
|
||||
# Download complete logs
|
||||
gh run view <run-id> --log > /tmp/ci-full.txt
|
||||
|
||||
# Re-run failed jobs
|
||||
gh run rerun <run-id> --failed
|
||||
```
|
||||
|
||||
### Common CI/CD Failure Patterns
|
||||
|
||||
| Pattern | Likely Cause | Investigation |
|
||||
|---------|-------------|---------------|
|
||||
| Passes locally, fails CI | Environment diff | Check Node/Python version, OS, env vars |
|
||||
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
|
||||
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
|
||||
| Permission errors | Token/secret misconfiguration | Verify `GITHUB_TOKEN`, secret names |
|
||||
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
|
||||
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |
|
||||
|
||||
### Analyzing Failed Steps
|
||||
|
||||
1. **Identify which step failed** - `gh run view <id>` shows step-by-step status
|
||||
2. **Get the logs** - `gh run view <id> --log-failed` for focused output
|
||||
3. **Search for error patterns** - Look for `Error:`, `FAIL`, `exit code`, stack traces
|
||||
4. **Check annotations** - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`
|
||||
|
||||
## Server Log Analysis
|
||||
|
||||
### Log Collection Strategy
|
||||
|
||||
1. **Identify log locations** - Application logs, system logs, web server logs
|
||||
2. **Filter by timeframe** - Narrow to incident window
|
||||
3. **Correlate request IDs** - Trace single request across services
|
||||
4. **Look for patterns** - Repeated errors, error rate changes, unusual payloads
|
||||
|
||||
### Structured Log Queries
|
||||
|
||||
```bash
|
||||
# Search application logs for errors (use Grep tool when possible)
|
||||
# Pattern: timestamp, level, message
|
||||
# Filter by time range and severity
|
||||
|
||||
# PostgreSQL slow query log
|
||||
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
|
||||
|
||||
# Check database connections
|
||||
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
|
||||
```
|
||||
|
||||
### Cross-Source Correlation
|
||||
|
||||
1. **Align timestamps** across all log sources (timezone awareness)
|
||||
2. **Build timeline** - First error → propagation → user impact
|
||||
3. **Identify trigger** - What changed immediately before first error?
|
||||
4. **Map blast radius** - Which services/endpoints affected?
|
||||
|
||||
## Application Log Analysis
|
||||
|
||||
### Error Pattern Recognition
|
||||
|
||||
- **Sudden spike** → Deployment, config change, external dependency failure
|
||||
- **Gradual increase** → Resource leak, data growth, degradation
|
||||
- **Periodic failures** → Cron jobs, scheduled tasks, resource contention
|
||||
- **Single endpoint** → Code bug, data issue, specific dependency
|
||||
- **All endpoints** → Infrastructure, database, network issue
|
||||
|
||||
### Key Log Fields
|
||||
|
||||
Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration
|
||||
|
||||
### Evidence Preservation
|
||||
|
||||
Always capture relevant log excerpts for the diagnostic report. Include:
|
||||
- Exact error messages and stack traces
|
||||
- Timestamps and request IDs
|
||||
- Before/after comparison (normal vs error state)
|
||||
- Counts and frequencies
|
||||
Reference in New Issue
Block a user