# CLAUDE.md — CDP Ingestion Service > You are a senior software engineer building the **Data Ingestion Service** for a self-hosted CDP platform, > inspired by Jitsu. Focus: event streaming, JS functions, identity stitching. > > **Scope boundary**: Ingestion only. Analytics & Customer 360 live in a separate service (`cdp-analytics`). --- ## What This Service Does Collects events from any source → validates, deduplicates, transforms via JS Functions → stores in ClickHouse and exports to external warehouses. Segment-compatible API for easy migration. --- ## Repository Layout ``` cdp-ingestion/ ├── ingest/ # Go — HTTP API, auth, validate, dedup, push to Kafka (port 3049) ├── rotor/ # Node.js — JS functions runner, V8 isolate (port 3401) ├── bulker/ # Go — Kafka consumer, batch write to ClickHouse/warehouses (port 3042) ├── console/ # React + Vite + shadcn/ui + Tailwind — management UI (port 3000) └── infra/ ├── docker/ ├── clickhouse/ # ClickHouse DDL / migrations └── migrations/ # PostgreSQL migrations (golang-migrate) ``` --- ## Tech Stack ### Go Services (ingest, bulker) | Concern | Library | Notes | |---------|---------|-------| | HTTP router | `chi` | Lightweight, stdlib-compatible middleware | | Logger | `zap` | Structured logging, fastest | | PostgreSQL | `pgx/v5` | Native driver, no database/sql wrapper | | Kafka | `franz-go` | Pure Go, no CGO, best Redpanda support | | Redis | `rueidis` | Modern client, faster than go-redis | | Config | `caarlos0/env` | Parse env vars into structs, zero deps | | Validation | `go-playground/validator/v10` | Struct tags validation | | Migration | `golang-migrate` + pgx driver | CLI only — never auto-migrate on startup | | Test assertion | `testify` | assert + require + mock | | Integration test | `testcontainers-go` | Real PG / Redis / ClickHouse in tests | ### React Console (console/) | Concern | Library | |---------|---------| | Build | Vite | | UI components | shadcn/ui + Tailwind | | Routing | React Router v6 | | Server state | TanStack Query | | Client state | Zustand | | Forms | react-hook-form + zod | | Charts | Recharts | | Icons | lucide-react | > **No new technology** without discussion. All additions must justify why existing stack cannot handle it. --- ## Go Project Structure Every Go service follows this layout: ``` ingest/ ├── cmd/ │ └── server/ │ └── main.go # wire everything, start server └── internal/ ├── handler/ # HTTP handlers — parse request, call service, write response ├── service/ # business logic — no HTTP, no DB concerns ├── repo/ # DB queries — PostgreSQL via pgx, ClickHouse ├── kafka/ # producer (ingest) / consumer (bulker) ├── middleware/ # auth, rate limit, request ID, logging └── config/ # env parsing via caarlos0/env ``` Rules: - `handler` depends on `service`. `service` depends on `repo`. Never reverse. - `handler` never touches DB directly. - `service` never imports `chi` or any HTTP package. - `repo` returns domain types, never raw `pgx.Rows`. --- ## Error Handling Use `AppError` for all domain errors. Never return raw `pgx` or stdlib errors to handlers. ```go // internal/apperr/apperr.go type AppError struct { Code int // HTTP status code to return Message string // user-facing message (safe to expose) Field string // optional: which field caused the error (schema conflict, validation) Err error // original error for logging (not exposed to user) } func (e *AppError) Error() string { return e.Message } func (e *AppError) Unwrap() error { return e.Err } // Constructors func BadRequest(msg, field string, err error) *AppError func Conflict(msg string, err error) *AppError func TooManyRequests(retryAfter int) *AppError func UnprocessableEntity(msg string) *AppError func Internal(err error) *AppError ``` Handler pattern — one place to handle all errors: ```go func writeError(w http.ResponseWriter, err error) { var appErr *apperr.AppError if errors.As(err, &appErr) { // log appErr.Err internally, return appErr.Message to user render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field}) return } // unexpected — log full error, return generic 500 render.JSON(w, 500, ErrorResponse{Error: "internal server error"}) } ``` --- ## Testing Strategy ### Unit tests — handler + service layer - Mock interfaces with `testify/mock` - No real DB, no real Redis, no real Kafka - File: `foo_test.go` alongside the file being tested ```go type EventServiceMock struct { mock.Mock } func (m *EventServiceMock) Track(ctx context.Context, e *Event) error { return m.Called(ctx, e).Error(0) } ``` ### Integration tests — repo layer only - Use `testcontainers-go` to spin up real PostgreSQL, Redis, ClickHouse - File: `internal/repo/event_repo_test.go` - Tag: `//go:build integration` - Run: `make test/integration` ```bash make test # unit only (fast, no containers) make test/integration # repo layer with real DBs (slower, CI) ``` --- ## Migration Workflow ```bash # Create new migration make migrate/new name=add_segment_memberships # Apply make migrate/up # Rollback one step make migrate/down # Check status make migrate/status ``` - Migration files live in `infra/migrations/` - Format: `{version}_{name}.up.sql` + `{version}_{name}.down.sql` - **Never** auto-run migrations on server startup - Every PostgreSQL schema change **must** have a migration file — no exceptions --- ## Ingest Pipeline (Step-by-Step) ``` HTTP Request 1. Auth — Write Key → PostgreSQL lookup, cached in Redis (TTL 30–60s + pub/sub invalidation) 2. Payload validate — size ≤ PAYLOAD_LIMIT_KB (default 100KB), struct + validator tags 3. Rate limit — Redis sliding window per workspace_id; 429 + Retry-After on breach 4. Timestamp — received_at = server time; client time preserved as sent_at 5. Late event check — (received_at − sent_at) > 24h → 422 drop 6. Deduplication — Redis SETNX message_id, TTL 24h 7. JSON flatten — {"a":{"b":1}} → {"a_b":1} 8. Schema validate — type conflict → 400 + field name → push to DLQ 9. Push Kafka — partition key = anonymous_id (ordering for identity stitching) 10. Return 200 OK — fire-and-forget, do not wait for Kafka ack ``` ## Kafka Topics | Topic | Purpose | |-------|---------| | `events.ingest` | Happy path — valid events | | `events.dlq` | Failed events — schema conflict, validation error, function crash | | `events.retry` | Replay from DLQ after fix | --- ## Key Design Decisions | Problem | Decision | |---------|---------| | Late events (> 24h) | Drop, return `422 Unprocessable` | | Schema conflict | Reject `400`, include field name in response, push to DLQ | | Timestamp authority | Server wins (`received_at`); client time kept as `sent_at` | | Payload limit | Configurable, default 100KB; batch has separate limit | | Partition key | `anonymous_id` — guarantees ordering for identity stitching | | Enrich mode | Async by default — store raw event first, worker enriches after | | Identity backfill | Async + lock per `anonymous_id` to avoid race condition | | Write Key cache | Redis TTL 30–60s + pub/sub invalidation on revoke | | Graceful shutdown | Drain in-flight requests on SIGTERM before exit | | Migration | CLI only — never auto-migrate on startup | ## Rate Limits | Tier | RPS | Events/day | Burst (5s) | |------|-----|-----------|-----------| | Default | 100 | 1M | 500 | | Pro | 500 | 10M | 2,500 | | Enterprise | custom | custom | custom | Rate limit key: `rate:{workspace_id}` — per workspace, not per IP. --- ## Data Conventions - **Field names**: `snake_case`; sanitize on ingest (remove spaces, special chars) - **Timestamps**: `received_at` (server), `sent_at` (client), `timestamp` (event time for analytics) - **Dedup key**: `message_id`, Redis SETNX, TTL 24h - **Nested objects**: auto-flatten before schema check - **Type coercion**: none — type conflict → reject immediately - **Write Key**: never log raw; always masked in logs --- ## Logging Policy (zap) ``` Happy path → metadata only, no payload LOG_PAYLOAD_ON_SUCCESS=false (default) Error/reject → full payload logged LOG_PAYLOAD_ON_ERROR=true (default) Write Key → always masked, never raw ``` Fields logged on every request: `workspace_id`, `source_id`, `message_id`, `event_type`, `duration_ms`, `status_code`. --- ## API Endpoints (Ingest) | Method | Path | Description | |--------|------|-------------| | `POST` | `/track` | Single event | | `POST` | `/batch` | Batch events (Segment-compatible) | | `POST` | `/identify` | Identify call | | `POST` | `/page` | Page call | | `POST` | `/group` | Group call | | `GET` | `/health` | Health check | | `GET` | `/ready` | Readiness check | Every endpoint must have a request struct with `validate` tags. Validation runs before any business logic. --- ## Coding Rules - **Do not write code unless asked** — discuss architecture/features first - **Ask when scope is unclear**, especially when multiple valid approaches exist - **YAGNI + KISS** — do not build what is not needed yet - **Correctness before performance** — optimize only when profiling proves it necessary - **Every ClickHouse schema change must have a migration file** in `infra/clickhouse/` - **Every PostgreSQL schema change must have a migration file** in `infra/migrations/` - **Every API endpoint must have a request struct with `validate` tags** - **Never write raw events from analytics layer** — ingestion is the sole writer - Discuss in **Vietnamese**, write code and comments in **English** --- ## Common Pitfalls - Do not skip dedup check even for bulk imports — use a different TTL bucket if needed - Do not change partition key from `anonymous_id` — breaks identity stitching ordering - Do not cache Write Keys without the pub/sub invalidation path — revoked keys must propagate within TTL - `rotor` is Node.js, not Go — cross-service calls go over HTTP, never in-process - DLQ events must be replayable — never mutate DLQ topic; write to `events.retry` for replay - Do not return raw `pgx` errors to HTTP handlers — always wrap with `AppError` - Do not run migrations on server startup — use `make migrate/up` explicitly - `service` layer must never import `net/http` or `chi` — keep HTTP concerns in `handler` only