
# Log & CI/CD Analysis

Collect and analyze logs from servers, CI/CD pipelines, and application layers to diagnose failures.

## GitHub Actions Analysis

### List and Inspect Runs

```bash
# List recent runs (all workflows)
gh run list --limit 10

# List runs for specific workflow
gh run list --workflow=ci.yml --limit 5

# View specific run details
gh run view <run-id>

# View failed job logs only
gh run view <run-id> --log-failed

# Download complete logs
gh run view <run-id> --log > /tmp/ci-full.txt

# Re-run failed jobs
gh run rerun <run-id> --failed
```

### Common CI/CD Failure Patterns

| Pattern | Likely Cause | Investigation |
|---|---|---|
| Passes locally, fails CI | Environment diff | Check Node/Python version, OS, env vars |
| Intermittent failures | Race conditions, flaky tests | Run 3x, check timing, shared state |
| Timeout failures | Resource limits, infinite loops | Check resource usage, add timeouts |
| Permission errors | Token/secret misconfiguration | Verify `GITHUB_TOKEN`, secret names |
| Dependency install fails | Registry issues, version conflicts | Check lockfile, registry status |
| Build succeeds, tests fail | Test environment setup | Check test config, database setup, fixtures |
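
For the "passes locally, fails CI" row, dumping the environment in a temporary debug step makes the diff concrete. A minimal sketch, assuming a Node-based project (swap the version commands for whatever your stack uses):

```bash
# Add as a temporary workflow step, then run the same commands locally and compare
uname -a           # OS and kernel
node --version     # runtime version (use python --version, go version, etc. as appropriate)
npm --version
env | sort | grep -iE 'node|npm|path|lang|tz|ci'   # environment variables that commonly differ
```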

### Analyzing Failed Steps

1. **Identify which step failed** - `gh run view <id>` shows step-by-step status
2. **Get the logs** - `gh run view <id> --log-failed` for focused output
3. **Search for error patterns** - look for `Error:`, `FAIL`, exit code, stack traces (see the sketch below)
4. **Check annotations** - `gh api repos/{owner}/{repo}/check-runs/{id}/annotations`
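
For step 3, a hedged sketch assuming the full log was downloaded to `/tmp/ci-full.txt` as shown above:

```bash
# List lines matching common failure markers, with line numbers
grep -niE 'error:|failed|exit code|traceback|panic' /tmp/ci-full.txt | head -50

# Print the first hard error with a few lines of surrounding context
grep -B5 -A10 -m1 -iE 'error:' /tmp/ci-full.txt
```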

## Server Log Analysis

### Log Collection Strategy

1. **Identify log locations** - application logs, system logs, web server logs
2. **Filter by timeframe** - narrow to the incident window (see the sketch after this list)
3. **Correlate request IDs** - trace a single request across services
4. **Look for patterns** - repeated errors, error rate changes, unusual payloads
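
One possible execution of steps 2 and 3 on a Linux host, assuming systemd-managed services and an application that logs a request-ID field; the unit name, paths, and `request_id=` format are placeholders:

```bash
# Step 2: narrow system/application logs to the incident window
journalctl -u myapp.service --since "2026-04-12 00:50" --until "2026-04-12 01:10"

# Step 3: trace a single request across services by its ID
grep -r "request_id=abc123" /var/log/myapp/ /var/log/nginx/
```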

### Structured Log Queries

```bash
# Search application logs for errors (use the Grep tool when possible)
# Pattern: timestamp, level, message
# Filter by time range and severity

# Slowest queries by mean execution time (requires the pg_stat_statements extension)
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Database connections grouped by state
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```

### Cross-Source Correlation

1. **Align timestamps** across all log sources (timezone awareness)
2. **Build timeline** - first error → propagation → user impact
3. **Identify trigger** - what changed immediately before the first error?
4. **Map blast radius** - which services/endpoints were affected?
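
A rough way to build that timeline from several sources, assuming each file starts lines with a sortable UTC timestamp; the file names are examples:

```bash
# Tag each line with its source, then merge everything into one timestamp-ordered timeline
awk '{print $0 " [app]"}'   /var/log/myapp/app.log             >  /tmp/timeline.txt
awk '{print $0 " [nginx]"}' /var/log/nginx/error.log           >> /tmp/timeline.txt
awk '{print $0 " [db]"}'    /var/log/postgresql/postgresql.log >> /tmp/timeline.txt
sort /tmp/timeline.txt > /tmp/timeline-sorted.txt

# Normalize a local timestamp to UTC when sources disagree (GNU date)
date -u -d "2026-04-12 08:06:31 +0700" '+%Y-%m-%d %H:%M:%S UTC'
```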

## Application Log Analysis

### Error Pattern Recognition

- **Sudden spike** → deployment, config change, external dependency failure
- **Gradual increase** → resource leak, data growth, degradation
- **Periodic failures** → cron jobs, scheduled tasks, resource contention
- **Single endpoint** → code bug, data issue, specific dependency
- **All endpoints** → infrastructure, database, network issue
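
Bucketing error counts over time and by endpoint usually distinguishes these cases quickly. A minimal sketch, assuming plain-text logs with a leading `YYYY-MM-DD HH:MM` timestamp and a `path=` field (both assumptions about your log format):

```bash
# Errors per minute: a vertical jump suggests a deploy/config change, a slow climb suggests a leak
grep ' ERROR ' /var/log/myapp/app.log | cut -c1-16 | sort | uniq -c

# Errors per endpoint: one hot endpoint points at a code/data bug, all endpoints at infrastructure
grep ' ERROR ' /var/log/myapp/app.log | grep -oE 'path=[^ ]+' | sort | uniq -c | sort -rn | head
```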

### Key Log Fields

Prioritize: timestamp, level, error message, stack trace, request ID, user ID, endpoint, response code, duration
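
For JSON logs, projecting just those fields makes scanning easier. A sketch with assumed field names; rename to match your schema:

```bash
# Tab-separated view of the high-signal fields
jq -r '[.timestamp, .level, .request_id, .user_id, .endpoint, .status, .duration_ms, .message] | @tsv' \
  /var/log/myapp/app.json.log | head -20
```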

### Evidence Preservation

Always capture relevant log excerpts for the diagnostic report. Include:

- Exact error messages and stack traces
- Timestamps and request IDs
- Before/after comparison (normal vs error state)
- Counts and frequencies
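
A sketch of capturing that evidence in one place before logs rotate; paths and patterns are examples:

```bash
# Snapshot key excerpts and counts next to the diagnostic report
mkdir -p /tmp/incident-evidence
grep -c ' ERROR ' /var/log/myapp/app.log                      > /tmp/incident-evidence/error-count.txt
grep -n -B3 -A10 ' ERROR ' /var/log/myapp/app.log | head -200 > /tmp/incident-evidence/error-excerpt.txt
gh run view <run-id> --log-failed                             > /tmp/incident-evidence/ci-failed-steps.txt
```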