init ingestion

2026-05-24 22:59:24 +07:00
commit 4e8c11d545
80 changed files with 5639 additions and 0 deletions
--- a/data-layer/CLAUDE_analytics.md
+++ b/data-layer/CLAUDE_analytics.md
@@ -0,0 +1,415 @@
+# CLAUDE.md — CDP Analytics Service
+
+> You are a senior software engineer building the **Analytics & Data Layer** for a self-hosted CDP platform.
+> This service **queries, explores, and activates** data already ingested into ClickHouse.
+>
+> **Scope boundary**: Read-side only. Never write raw events. Ingestion is handled by `cdp-ingestion`.
+
+---
+
+## What This Service Does
+
+Exposes ingested event data via the Query API for exploration and analysis. Computes Traits and Audience
+Segments from event history via background workers. Activates segments to external tools via Reverse ETL
+and webhooks.
+
+---
+
+## Repository Layout
+
+```
+cdp-analytics/
+├── api/          # Go — Query API, Profile API                   (port 4000)
+├── workers/      # Go — Background jobs: Computed Traits, Segment refresh
+├── console/      # React + Vite + shadcn/ui + Tailwind — Analytics UI
+└── infra/
+    ├── migrations/   # PostgreSQL migrations (golang-migrate)
+    └── clickhouse/   # ClickHouse query templates (.sql files)
+```
+
+---
+
+## Tech Stack
+
+### Go Services (api, workers)
+
+| Concern | Library | Notes |
+|---------|---------|-------|
+| HTTP router | `chi` | Lightweight, stdlib-compatible middleware |
+| Logger | `zap` | Structured, leveled logging with low allocation overhead |
+| PostgreSQL | `pgx/v5` | Native driver, no database/sql wrapper |
+| ClickHouse | `clickhouse-go/v2` | Official driver, native protocol, good batch support |
+| Redis | `rueidis` | Modern client, faster than go-redis |
+| Job queue | `riverqueue/river` | Postgres-backed, pgx/v5 native, built-in scheduler + retry |
+| Config | `caarlos0/env` | Parse env vars into structs, zero deps |
+| Validation | `go-playground/validator/v10` | Struct tags validation |
+| Migration | `golang-migrate` + pgx driver | CLI only — never auto-migrate on startup |
+| Test assertion | `testify` | assert + require + mock |
+| Integration test | `testcontainers-go` | Real PG / Redis / ClickHouse in tests |
+
+### React Console (console/)
+
+| Concern | Library |
+|---------|---------|
+| Build | Vite |
+| UI components | shadcn/ui + Tailwind |
+| Routing | React Router v6 |
+| Server state | TanStack Query |
+| Client state | Zustand |
+| Forms | react-hook-form + zod |
+| Charts | Recharts |
+| Icons | lucide-react |
+
+> **No new technology** without discussion. Every addition must justify why the existing stack cannot handle the need.
+
+---
+
+## Go Project Structure
+
+### api/
+
+```
+api/
+├── cmd/
+│   └── server/
+│       └── main.go        # wire everything, start server
+└── internal/
+    ├── handler/            # HTTP handlers — parse request, call service, write response
+    ├── service/            # business logic — no HTTP, no DB concerns
+    ├── repo/               # DB queries — PostgreSQL via pgx, ClickHouse via clickhouse-go
+    ├── middleware/         # auth, request ID, logging
+    └── config/             # env parsing via caarlos0/env
+```
+
+### workers/
+
+```
+workers/
+├── cmd/
+│   └── worker/
+│       └── main.go        # register jobs, start river worker
+└── internal/
+    ├── job/                # job definitions (ComputeTraitsJob, RefreshSegmentJob, ReverseETLJob)
+    ├── handler/            # job handlers — business logic per job type
+    ├── repo/               # DB queries shared across job handlers
+    └── config/
+```
+
+Rules:
+- In api/, `handler` depends on `service`, and `service` depends on `repo`. In workers/, `handler` depends on `repo`. Never the reverse.
+- `handler` never touches DB directly in api/.
+- `service` never imports `chi` or any HTTP package.
+- `repo` returns domain types, never raw `pgx.Rows` or `driver.Rows`.
+- ClickHouse queries live as `.sql` files in `infra/clickhouse/` — no inline SQL strings for complex queries.
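+
+The layering rules above can be sketched with a small interface boundary. This is a minimal illustration, not the project's actual code: `Profile`, `ProfileRepo`, `ProfileService`, `ErrEmptyID`, and `fakeRepo` are hypothetical names chosen for the example, and the in-memory repo stands in for a pgx-backed one.
+
+```go
+package main
+
+import (
+    "context"
+    "errors"
+)
+
+// Profile is a domain type returned by the repo layer (hypothetical shape).
+type Profile struct {
+    ID    string
+    Email string
+}
+
+// ProfileRepo is what the service needs from storage. A pgx-backed
+// implementation would live in internal/repo and return domain types only.
+type ProfileRepo interface {
+    GetProfile(ctx context.Context, workspaceID, profileID string) (Profile, error)
+}
+
+// ProfileService holds business logic only: no HTTP types, no driver types.
+type ProfileService struct {
+    repo ProfileRepo
+}
+
+// ErrEmptyID is returned when the caller passes no profile ID.
+var ErrEmptyID = errors.New("profile id is required")
+
+// Lookup validates input, then delegates to the repo.
+func (s *ProfileService) Lookup(ctx context.Context, workspaceID, profileID string) (Profile, error) {
+    if profileID == "" {
+        return Profile{}, ErrEmptyID
+    }
+    return s.repo.GetProfile(ctx, workspaceID, profileID)
+}
+
+// fakeRepo is an in-memory stand-in, the kind of double a unit test would use.
+type fakeRepo map[string]Profile
+
+func (f fakeRepo) GetProfile(_ context.Context, workspaceID, profileID string) (Profile, error) {
+    p, ok := f[workspaceID+"/"+profileID]
+    if !ok {
+        return Profile{}, errors.New("profile not found")
+    }
+    return p, nil
+}
+```
+
+Because `ProfileService` sees only the interface, the handler can be wired to a real repo in `main.go` while unit tests pass a fake, which matches the mock-only unit test rule in the Testing Strategy section.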
+
+---
+
+## Error Handling
+
+Same `AppError` pattern as ingestion. Never return raw `pgx` or `clickhouse-go` errors to handlers.
+
+```go
+// internal/apperr/apperr.go
+
+type AppError struct {
+    Code    int    // HTTP status code to return
+    Message string // user-facing message (safe to expose)
+    Field   string // optional: which field caused the error
+    Err     error  // original error for logging (not exposed to user)
+}
+
+func (e *AppError) Error() string { return e.Message }
+func (e *AppError) Unwrap() error { return e.Err }
+
+// Constructors
+func BadRequest(msg, field string, err error) *AppError
+func NotFound(msg string) *AppError
+func Forbidden(msg string) *AppError
+func Internal(err error) *AppError
+```
+
+Handler pattern — one place handles all errors:
+
+```go
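+// render.JSON is assumed to be a project-local helper with the signature
+// (w, status, payload); go-chi/render's JSON takes (w, r, v) instead.
+func writeError(w http.ResponseWriter, err error) {
+    var appErr *apperr.AppError
+    if errors.As(err, &appErr) {
+        render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field})
+        return
+    }
+    // Unknown error: never expose its text to the client.
+    render.JSON(w, http.StatusInternalServerError, ErrorResponse{Error: "internal server error"})
+}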
+```
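+
+The rule about never returning raw driver errors could look roughly like this at the repo boundary. This is a sketch under assumptions: `errNoRows` stands in for a driver sentinel such as `pgx.ErrNoRows`, and `wrapRepoErr` and `statusAndMessage` are hypothetical helpers. `AppError` is repeated in reduced form only so the example compiles on its own.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "net/http"
+)
+
+// AppError mirrors the struct above.
+type AppError struct {
+    Code    int
+    Message string
+    Field   string
+    Err     error
+}
+
+func (e *AppError) Error() string { return e.Message }
+func (e *AppError) Unwrap() error { return e.Err }
+
+func NotFound(msg string) *AppError {
+    return &AppError{Code: http.StatusNotFound, Message: msg}
+}
+
+func Internal(err error) *AppError {
+    return &AppError{Code: http.StatusInternalServerError, Message: "internal server error", Err: err}
+}
+
+// errNoRows stands in for a driver sentinel such as pgx.ErrNoRows (assumption).
+var errNoRows = errors.New("no rows in result set")
+
+// wrapRepoErr converts a driver error into an AppError before it leaves the repo.
+func wrapRepoErr(err error, what string) error {
+    switch {
+    case err == nil:
+        return nil
+    case errors.Is(err, errNoRows):
+        return NotFound(what + " not found")
+    default:
+        // The original error is kept for logging but hidden from the client.
+        return Internal(fmt.Errorf("load %s: %w", what, err))
+    }
+}
+
+// statusAndMessage is the decision writeError makes, without the HTTP writer.
+func statusAndMessage(err error) (int, string) {
+    var appErr *AppError
+    if errors.As(err, &appErr) {
+        return appErr.Code, appErr.Message
+    }
+    return http.StatusInternalServerError, "internal server error"
+}
+```
+
+With this shape, a missing row surfaces as a 404 with a safe message, while a connection failure surfaces as a generic 500 and its details stay in `Err` for the logger.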
+
+---
+
+## ClickHouse Query Pattern
+
+Use raw SQL only. No query builder — ClickHouse SQL has its own syntax that builders handle poorly.
+
+```
+infra/clickhouse/
+├── event_explorer.sql
+├── funnel_analysis.sql
+├── retention_cohort.sql
+└── session_analysis.sql
+```
+
+Load templates at startup, inject parameters safely:
+
+```go
+// Never fmt.Sprintf into SQL — use named parameters
+query, err := templates.Load("funnel_analysis.sql")
+rows, err := chConn.Query(ctx, query, clickhouse.Named("workspace_id", id), ...)
+```
+
+Rules:
+- All ClickHouse queries must have a corresponding `.sql` file in `infra/clickhouse/`
+- No multi-line SQL strings inline in Go code
+- Every ClickHouse schema change must have a DDL file in `infra/clickhouse/`
+
+---
+
+## Job Queue (river)
+
+Background workers use `riverqueue/river` backed by PostgreSQL.
+
+```go
+// Define a job
+type ComputeTraitsArgs struct {
+    WorkspaceID string `json:"workspace_id"`
+    TraitID     string `json:"trait_id"`
+}
+func (ComputeTraitsArgs) Kind() string { return "compute_traits" }
+
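+// Register handler
+workers := river.NewWorkers()
+river.AddWorker(workers, &ComputeTraitsWorker{repo: repo})
+
+// Enqueue (Insert returns an error that must be checked)
+if _, err := client.Insert(ctx, ComputeTraitsArgs{WorkspaceID: "ws_123", TraitID: "t_456"}, nil); err != nil {
+    return err
+}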
+```
+
+Scheduled jobs (periodic):
+
+```go
+// Hourly trait recompute; the hourly segment refresh is registered the same way
+river.NewPeriodicJob(
+    river.PeriodicInterval(time.Hour),
+    func() (river.JobArgs, *river.InsertOpts) {
+        return ComputeTraitsArgs{}, nil
+    },
+    nil, // optional *river.PeriodicJobOpts
+)
+```
+
+Rules:
+- Workers must be idempotent — river may retry on failure
+- Use `river`'s built-in retry with exponential backoff; do not implement custom retry
+- Log job start, job end, duration, and error with full context (job_id, args)
+
+---
+
+## Cache Strategy (Redis)
+
+Semantic key structure — allows per-workspace invalidation:
+
+```
+cache:query:events:{workspace_id}:{hash(params)}     TTL 60s
+cache:query:funnel:{workspace_id}:{hash(params)}     TTL 60s
+cache:query:retention:{workspace_id}:{hash(params)}  TTL 60s
+cache:dashboard:{workspace_id}                       TTL 60s
+cache:profile:{workspace_id}:{profile_id}            TTL 30s
+```
+
+Rules:
+- Default TTL: 60s for aggregate queries, 30s for profile lookups
+- TTL is configurable per query type via env vars
+- On cache miss: query ClickHouse, write result to Redis, return result
+- Never cache Custom SQL results — each query is arbitrary
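+
+The key layout and the cache-miss rule above might be implemented along these lines. This is a sketch under assumptions: the `Cache` interface, `QueryKey`, `Cached`, and `memCache` are hypothetical names, the hash is SHA-256 truncated to 8 bytes (the source only says `hash(params)`), and the in-memory map stands in for Redis and ignores TTL.
+
+```go
+package main
+
+import (
+    "crypto/sha256"
+    "encoding/hex"
+    "encoding/json"
+    "fmt"
+    "time"
+)
+
+// Cache is the small surface needed from Redis; a rueidis-backed type
+// could implement it (hypothetical interface).
+type Cache interface {
+    Get(key string) ([]byte, bool)
+    Set(key string, val []byte, ttl time.Duration)
+}
+
+// QueryKey builds cache:query:{kind}:{workspace_id}:{hash(params)}.
+// json.Marshal sorts map keys, so equal params produce equal keys.
+func QueryKey(kind, workspaceID string, params any) (string, error) {
+    raw, err := json.Marshal(params)
+    if err != nil {
+        return "", fmt.Errorf("marshal params: %w", err)
+    }
+    sum := sha256.Sum256(raw)
+    return fmt.Sprintf("cache:query:%s:%s:%s", kind, workspaceID, hex.EncodeToString(sum[:8])), nil
+}
+
+// Cached returns the stored bytes on a hit. On a miss it runs query,
+// writes the result to the cache, and returns it. The bool reports cache_hit.
+func Cached(c Cache, key string, ttl time.Duration, query func() ([]byte, error)) ([]byte, bool, error) {
+    if val, ok := c.Get(key); ok {
+        return val, true, nil
+    }
+    val, err := query()
+    if err != nil {
+        return nil, false, err
+    }
+    c.Set(key, val, ttl)
+    return val, false, nil
+}
+
+// memCache is an in-memory stand-in that ignores TTL; for demonstration only.
+type memCache map[string][]byte
+
+func (m memCache) Get(key string) ([]byte, bool) {
+    v, ok := m[key]
+    return v, ok
+}
+
+func (m memCache) Set(key string, val []byte, _ time.Duration) {
+    m[key] = val
+}
+```
+
+Keeping `workspace_id` as its own key segment is what makes per-workspace invalidation possible, and the returned hit flag maps directly onto the `cache_hit` field in the logging policy.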
+
+---
+
+## Custom SQL Sandbox
+
+`POST /query/sql` allows arbitrary SQL on ClickHouse. Two layers of protection:
+
+**Layer 1 — App-level parse (Go):**
+```go
+// forbiddenKeyword matches DDL/DML keywords as whole words only, so identifiers
+// such as created_at or updated_at do not trigger a false rejection.
+var forbiddenKeyword = regexp.MustCompile(`\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)\b`)
+
+// Reject anything that is not a SELECT statement
+func validateReadOnly(sql string) error {
+    normalized := strings.TrimSpace(strings.ToUpper(sql))
+    if !strings.HasPrefix(normalized, "SELECT") {
+        return apperr.BadRequest("only SELECT statements are allowed", "sql", nil)
+    }
+    // Reject common DDL/DML keywords
+    if kw := forbiddenKeyword.FindString(normalized); kw != "" {
+        return apperr.BadRequest("statement contains forbidden keyword: "+kw, "sql", nil)
+    }
+    return nil
+}
+```
+
+**Layer 2 — ClickHouse read-only user:**
+- Custom SQL queries run as a separate ClickHouse user with `SELECT`-only grants
+- DDL/DML rejected at DB level even if app-level check is bypassed
+
+---
+
+## Testing Strategy
+
+### Unit tests — handler + service layer
+- Mock interfaces with `testify/mock`
+- No real DB, no real Redis, no real ClickHouse
+- File: `foo_test.go` alongside the file being tested
+
+### Integration tests — repo layer only
+- Use `testcontainers-go` to spin up real PostgreSQL, Redis, ClickHouse
+- File: `internal/repo/event_repo_test.go`
+- Tag: `//go:build integration`
+
+```bash
+make test              # unit only (fast, no containers)
+make test/integration  # repo layer with real DBs (slower, CI)
+```
+
+---
+
+## Migration Workflow
+
+```bash
+make migrate/new name=add_profile_traits   # create up+down files
+make migrate/up                            # apply all pending
+make migrate/down                          # rollback one step
+make migrate/status                        # show current version
+```
+
+- Migration files: `infra/migrations/{version}_{name}.up.sql` + `.down.sql`
+- **Never** auto-run migrations on server startup
+- Every PostgreSQL schema change **must** have a migration file
+
+---
+
+## PostgreSQL Schema (Analytics-owned tables)
+
+```sql
+-- Computed trait values per profile
+CREATE TABLE profile_traits (
+    profile_id   UUID,
+    trait_key    TEXT,
+    trait_value  JSONB,
+    computed_at  TIMESTAMPTZ
+);
+
+-- Segment membership history (used for delta Reverse ETL)
+CREATE TABLE segment_memberships (
+    segment_id   UUID,
+    profile_id   UUID,
+    entered_at   TIMESTAMPTZ,
+    exited_at    TIMESTAMPTZ   -- NULL = currently a member
+);
+```
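+
+The delta that `segment_memberships` supports could be computed roughly as follows after each full re-evaluation. This is a sketch, not the worker's actual code: `Delta` and `ComputeDelta` are hypothetical names, and profile IDs are plain strings here rather than UUIDs.
+
+```go
+package main
+
+import "sort"
+
+// Delta lists profiles that entered or exited a segment between two evaluations.
+type Delta struct {
+    Entered []string
+    Exited  []string
+}
+
+// ComputeDelta compares the previous member set (rows with exited_at IS NULL)
+// with the freshly evaluated one.
+func ComputeDelta(previous, current []string) Delta {
+    prev := make(map[string]struct{}, len(previous))
+    for _, id := range previous {
+        prev[id] = struct{}{}
+    }
+    cur := make(map[string]struct{}, len(current))
+    for _, id := range current {
+        cur[id] = struct{}{}
+    }
+
+    var d Delta
+    for id := range cur {
+        if _, ok := prev[id]; !ok {
+            d.Entered = append(d.Entered, id)
+        }
+    }
+    for id := range prev {
+        if _, ok := cur[id]; !ok {
+            d.Exited = append(d.Exited, id)
+        }
+    }
+    // Sort for deterministic output; map iteration order is random.
+    sort.Strings(d.Entered)
+    sort.Strings(d.Exited)
+    return d
+}
+```
+
+Each entry in `Entered` would become a new row with `entered_at` set, and each entry in `Exited` would set `exited_at` on the open row. Re-running the job against unchanged membership yields an empty delta, which fits the idempotency requirement for workers.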
+
+---
+
+## Data Sources (Read-only)
+
+This service **only reads** data written by `cdp-ingestion`. Never write to these tables.
+
+| Source | Data |
+|--------|------|
+| ClickHouse `events` | Flattened, schema-managed raw events |
+| PostgreSQL `profiles` | Identity graph, unified profiles |
+| PostgreSQL `sources` / `destinations` | Config metadata |
+| PostgreSQL `schemas` | Schema registry from ingestion |
+
+---
+
+## Key Design Decisions
+
+| Problem | Decision |
+|---------|---------|
+| Job queue | `river` on PostgreSQL — no Temporal, no Celery |
+| Computed Traits refresh | Hourly default, configurable per trait |
+| Segment re-evaluate | Full re-evaluate — simpler than incremental |
+| Query cache | Redis semantic keys, TTL 60s default |
+| Custom SQL | App-level SELECT-only check + ClickHouse read-only user |
+| Reverse ETL | Delta only (entered/exited) — never push full member list |
+| ClickHouse queries | Raw SQL in `.sql` template files — no query builder |
+| Scaling | Vertical — increase RAM/CPU, not instances |
+| Migration | CLI only — never auto-migrate on startup |
+
+---
+
+## API Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/query/events` | Filter + query raw events |
+| `POST` | `/query/sql` | Custom SQL on ClickHouse (SELECT only) |
+| `POST` | `/query/funnel` | Funnel analysis |
+| `POST` | `/query/retention` | Retention cohort |
+| `GET` | `/profiles/:id` | Unified profile lookup |
+| `GET` | `/profiles/:id/events` | User event timeline |
+| `GET` | `/segments` | List segments |
+| `POST` | `/segments` | Create segment |
+| `GET` | `/segments/:id/members` | Segment members |
+| `GET` | `/traits/definitions` | List computed trait definitions |
+| `GET` | `/health` | Health check |
+| `GET` | `/ready` | Readiness check |
+
+Every endpoint must have a request struct with `validate` tags. Validation runs before any business logic.
+
+---
+
+## Feature Priorities
+
+| Priority | Features |
+|----------|---------|
+| **P0** | Event Explorer, Custom SQL, Profile Lookup, Event Timeline, Saved Queries |
+| **P1** | Funnel Analysis, Retention Analysis, Session Analysis, Pre-built Dashboards |
+| **P2** | Computed Traits, Audience Segments, Background Worker |
+| **P3** | Reverse ETL, Webhook Push, Schema Registry, Data Catalog |
+
+Build in priority order. Do not start P1 before P0 is stable.
+
+---
+
+## Logging Policy (zap)
+
+```
+Query requests  → log workspace_id, query_type, duration_ms, rows_returned, cache_hit
+Worker jobs     → log job_id, job_kind, args, duration_ms, status (success/error)
+Errors          → log full error chain with context
+```
+
+---
+
+## Coding Rules
+
+- **Do not write code unless asked** — discuss architecture/features first
+- **Ask when scope is unclear**, especially when multiple valid approaches exist
+- **YAGNI + KISS** — do not build what is not needed yet
+- **Correctness before performance** — optimize only when profiling proves it necessary
+- **Every PostgreSQL schema change must have a migration file** in `infra/migrations/`
+- **Every ClickHouse query must have a `.sql` file** in `infra/clickhouse/`
+- **Every API endpoint must have a request struct with `validate` tags**
+- **Never write raw events** — this service is read-side only
+- Discuss in **Vietnamese**, write code and comments in **English**
+
+---
+
+## Common Pitfalls
+
+- Do not query ClickHouse directly for computed traits at request time — serve from PostgreSQL
+- Do not run full segment scans on every API request — that is the worker's job
+- Do not cache Custom SQL results — queries are arbitrary, cache would be useless
+- Do not inline complex SQL strings in Go — use `.sql` template files
+- Do not return raw `pgx` or `clickhouse-go` errors to HTTP handlers — wrap with `AppError`
+- Do not run migrations on server startup — use `make migrate/up` explicitly
+- Reverse ETL must push delta only (entered/exited), never the full member list per run
+- Workers must be idempotent — `river` retries on failure, job may run more than once
+- `service` layer must never import `net/http` or `chi`