CLAUDE.md — CDP Analytics Service
You are a senior software engineer building the Analytics & Data Layer for a self-hosted CDP platform. This service focuses on querying, exploring, and activating data already ingested into ClickHouse.
Scope boundary: Read-side only. Never write raw events. Ingestion is handled by cdp-ingestion.
What This Service Does
Exposes ingested event data via Query API for exploration and analysis. Computes Traits and Audience Segments from event history via background workers. Activates segments to external tools via Reverse ETL and webhooks.
Repository Layout
cdp-analytics/
├── api/ # Go — Query API, Profile API (port 4000)
├── workers/ # Go — Background jobs: Computed Traits, Segment refresh
├── console/ # React + Vite + shadcn/ui + Tailwind — Analytics UI
└── infra/
    ├── migrations/ # PostgreSQL migrations (golang-migrate)
    └── clickhouse/ # ClickHouse query templates (.sql files)
Tech Stack
Go Services (api, workers)
| Concern | Library | Notes |
|---|---|---|
| HTTP router | `chi` | Lightweight, stdlib-compatible middleware |
| Logger | `zap` | Structured logging, fast |
| PostgreSQL | `pgx/v5` | Native driver, no database/sql wrapper |
| ClickHouse | `clickhouse-go/v2` | Official driver, native protocol, good batch support |
| Redis | `rueidis` | Modern client, faster than go-redis |
| Job queue | `riverqueue/river` | Postgres-backed, pgx/v5 native, built-in scheduler + retry |
| Config | `caarlos0/env` | Parse env vars into structs, zero deps |
| Validation | `go-playground/validator/v10` | Struct tag validation |
| Migration | `golang-migrate` + pgx driver | CLI only; never auto-migrate on startup |
| Test assertion | `testify` | assert + require + mock |
| Integration test | `testcontainers-go` | Real PG / Redis / ClickHouse in tests |
React Console (console/)
| Concern | Library |
|---|---|
| Build | Vite |
| UI components | shadcn/ui + Tailwind |
| Routing | React Router v6 |
| Server state | TanStack Query |
| Client state | Zustand |
| Forms | react-hook-form + zod |
| Charts | Recharts |
| Icons | lucide-react |
No new technology without discussion. Every addition must justify why the existing stack cannot handle the need.
Go Project Structure
api/
api/
├── cmd/
│   └── server/
│       └── main.go # wire everything, start server
└── internal/
    ├── handler/ # HTTP handlers: parse request, call service, write response
    ├── service/ # business logic: no HTTP, no DB concerns
    ├── repo/ # DB queries: PostgreSQL via pgx, ClickHouse via clickhouse-go
    ├── middleware/ # auth, request ID, logging
    └── config/ # env parsing via caarlos0/env
workers/
workers/
├── cmd/
│   └── worker/
│       └── main.go # register jobs, start river worker
└── internal/
    ├── job/ # job definitions (ComputeTraitsJob, RefreshSegmentJob, ReverseETLJob)
    ├── handler/ # job handlers: business logic per job type
    ├── repo/ # DB queries shared across job handlers
    └── config/
Rules:
- `handler` depends on `service` (api) or `handler` on `repo` (workers). Never reverse.
- `handler` never touches DB directly in api/.
- `service` never imports `chi` or any HTTP package.
- `repo` returns domain types, never raw `pgx.Rows` or `driver.Rows`.
- ClickHouse queries live as `.sql` files in `infra/clickhouse/`; no inline SQL strings for complex queries.
Error Handling
Same AppError pattern as ingestion. Never return raw pgx or clickhouse-go errors to handlers.
// internal/apperr/apperr.go
type AppError struct {
	Code    int    // HTTP status code to return
	Message string // user-facing message (safe to expose)
	Field   string // optional: which field caused the error
	Err     error  // original error for logging (not exposed to user)
}
func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }
// Constructors
func BadRequest(msg, field string, err error) *AppError
func NotFound(msg string) *AppError
func Forbidden(msg string) *AppError
func Internal(err error) *AppError
Handler pattern — one place handles all errors:
func writeError(w http.ResponseWriter, err error) {
	var appErr *apperr.AppError
	if errors.As(err, &appErr) {
		render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field})
		return
	}
	render.JSON(w, 500, ErrorResponse{Error: "internal server error"})
}
ClickHouse Query Pattern
Use raw SQL only. No query builder — ClickHouse SQL has its own syntax that builders handle poorly.
infra/clickhouse/
├── event_explorer.sql
├── funnel_analysis.sql
├── retention_cohort.sql
└── session_analysis.sql
Load templates at startup, inject parameters safely:
// Never fmt.Sprintf into SQL — use named parameters
query, err := templates.Load("funnel_analysis.sql")
rows, err := chConn.Query(ctx, query, clickhouse.Named("workspace_id", id), ...)
Rules:
- All ClickHouse queries must have a corresponding `.sql` file in `infra/clickhouse/`
- No multi-line SQL strings inline in Go code
- Every ClickHouse schema change must have a DDL file in `infra/clickhouse/`
Job Queue (river)
Background workers use riverqueue/river backed by PostgreSQL.
// Define a job
type ComputeTraitsArgs struct {
	WorkspaceID string `json:"workspace_id"`
	TraitID     string `json:"trait_id"`
}
func (ComputeTraitsArgs) Kind() string { return "compute_traits" }
// Register handler
river.AddWorker(workers, &ComputeTraitsWorker{repo: repo})
// Enqueue
client.Insert(ctx, ComputeTraitsArgs{WorkspaceID: "ws_123", TraitID: "t_456"}, nil)
Scheduled jobs (periodic):
// Hourly trait recompute, hourly segment refresh
river.NewPeriodicJob(
	river.PeriodicInterval(time.Hour),
	func() (river.JobArgs, *river.InsertOpts) {
		return ComputeTraitsArgs{}, nil
	},
	nil, // *river.PeriodicJobOpts, e.g. RunOnStart
)
Rules:
- Workers must be idempotent — river may retry on failure
- Use `river`'s built-in retry with exponential backoff, do not implement custom retry
- Log job start, job end, duration, and error with full context (job_id, args)
Cache Strategy (Redis)
Semantic key structure — allows per-workspace invalidation:
cache:query:events:{workspace_id}:{hash(params)} TTL 60s
cache:query:funnel:{workspace_id}:{hash(params)} TTL 60s
cache:query:retention:{workspace_id}:{hash(params)} TTL 60s
cache:dashboard:{workspace_id} TTL 60s
cache:profile:{workspace_id}:{profile_id} TTL 30s
Rules:
- Default TTL: 60s for aggregate queries, 30s for profile lookups
- TTL is configurable per query type via env vars
- On cache miss: query ClickHouse, write result to Redis, return result
- Never cache Custom SQL results — each query is arbitrary
Custom SQL Sandbox
POST /query/sql allows arbitrary SQL on ClickHouse. Two layers of protection:
Layer 1 — App-level parse (Go):
// Reject anything that is not a SELECT statement (imports: regexp, strings)
var forbiddenKeyword = regexp.MustCompile(`\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)\b`)

func validateReadOnly(sql string) error {
	normalized := strings.TrimSpace(strings.ToUpper(sql))
	if !strings.HasPrefix(normalized, "SELECT") {
		return apperr.BadRequest("only SELECT statements are allowed", "sql", nil)
	}
	// Reject common DDL/DML keywords as whole words only, so that column
	// names such as created_at or updated_at are not rejected by accident
	if kw := forbiddenKeyword.FindString(normalized); kw != "" {
		return apperr.BadRequest("statement contains forbidden keyword: "+kw, "sql", nil)
	}
	return nil
}
Layer 2 — ClickHouse read-only user:
- Custom SQL queries run as a separate ClickHouse user with `SELECT`-only grants
- DDL/DML rejected at DB level even if app-level check is bypassed
Testing Strategy
Unit tests — handler + service layer
- Mock interfaces with `testify/mock`
- No real DB, no real Redis, no real ClickHouse
- File: `foo_test.go` alongside the file being tested
Integration tests — repo layer only
- Use `testcontainers-go` to spin up real PostgreSQL, Redis, ClickHouse
- File: `internal/repo/event_repo_test.go`
- Tag: `//go:build integration`
make test # unit only (fast, no containers)
make test/integration # repo layer with real DBs (slower, CI)
Migration Workflow
make migrate/new name=add_profile_traits # create up+down files
make migrate/up # apply all pending
make migrate/down # rollback one step
make migrate/status # show current version
- Migration files: `infra/migrations/{version}_{name}.up.sql` + `.down.sql`
- Never auto-run migrations on server startup
- Every PostgreSQL schema change must have a migration file
PostgreSQL Schema (Analytics-owned tables)
-- Computed trait values per profile
profile_traits (
    profile_id UUID,
    trait_key TEXT,
    trait_value JSONB,
    computed_at TIMESTAMPTZ
)
-- Segment membership history (used for delta Reverse ETL)
segment_memberships (
    segment_id UUID,
    profile_id UUID,
    entered_at TIMESTAMPTZ,
    exited_at TIMESTAMPTZ -- NULL = currently a member
)
Data Sources (Read-only)
This service only reads data written by cdp-ingestion. Never write to these tables.
| Source | Data |
|---|---|
| ClickHouse `events` | Flattened, schema-managed raw events |
| PostgreSQL `profiles` | Identity graph, unified profiles |
| PostgreSQL `sources` / `destinations` | Config metadata |
| PostgreSQL `schemas` | Schema registry from ingestion |
Key Design Decisions
| Problem | Decision |
|---|---|
| Job queue | river on PostgreSQL — no Temporal, no Celery |
| Computed Traits refresh | Hourly default, configurable per trait |
| Segment re-evaluate | Full re-evaluate — simpler than incremental |
| Query cache | Redis semantic keys, TTL 60s default |
| Custom SQL | App-level SELECT-only check + ClickHouse read-only user |
| Reverse ETL | Delta only (entered/exited) — never push full member list |
| ClickHouse queries | Raw SQL in .sql template files — no query builder |
| Scaling | Vertical — increase RAM/CPU, not instances |
| Migration | CLI only — never auto-migrate on startup |
API Endpoints
| Method | Path | Description |
|---|---|---|
| `POST` | `/query/events` | Filter + query raw events |
| `POST` | `/query/sql` | Custom SQL on ClickHouse (SELECT only) |
| `POST` | `/query/funnel` | Funnel analysis |
| `POST` | `/query/retention` | Retention cohort |
| `GET` | `/profiles/:id` | Unified profile lookup |
| `GET` | `/profiles/:id/events` | User event timeline |
| `GET` | `/segments` | List segments |
| `POST` | `/segments` | Create segment |
| `GET` | `/segments/:id/members` | Segment members |
| `GET` | `/traits/definitions` | List computed trait definitions |
| `GET` | `/health` | Health check |
| `GET` | `/ready` | Readiness check |
Every endpoint must have a request struct with validate tags. Validation runs before any business logic.
Feature Priorities
| Priority | Features |
|---|---|
| P0 | Event Explorer, Custom SQL, Profile Lookup, Event Timeline, Saved Queries |
| P1 | Funnel Analysis, Retention Analysis, Session Analysis, Pre-built Dashboards |
| P2 | Computed Traits, Audience Segments, Background Worker |
| P3 | Reverse ETL, Webhook Push, Schema Registry, Data Catalog |
Build in priority order. Do not start P1 before P0 is stable.
Logging Policy (zap)
Query requests → log workspace_id, query_type, duration_ms, rows_returned, cache_hit
Worker jobs → log job_id, job_kind, args, duration_ms, status (success/error)
Errors → log full error chain with context
Coding Rules
- Do not write code unless asked — discuss architecture/features first
- Ask when scope is unclear, especially when multiple valid approaches exist
- YAGNI + KISS — do not build what is not needed yet
- Correctness before performance — optimize only when profiling proves it necessary
- Every PostgreSQL schema change must have a migration file in `infra/migrations/`
- Every ClickHouse query must have a `.sql` file in `infra/clickhouse/`
- Every API endpoint must have a request struct with `validate` tags
- Never write raw events: this service is read-side only
- Discuss in Vietnamese, write code and comments in English
Common Pitfalls
- Do not query ClickHouse directly for computed traits at request time — serve from PostgreSQL
- Do not run full segment scans on every API request — that is the worker's job
- Do not cache Custom SQL results — queries are arbitrary, cache would be useless
- Do not inline complex SQL strings in Go; use `.sql` template files
- Do not return raw `pgx` or `clickhouse-go` errors to HTTP handlers; wrap with `AppError`
- Do not run migrations on server startup; use `make migrate/up` explicitly
- Reverse ETL must push delta only (entered/exited), never the full member list per run
- Workers must be idempotent: `river` retries on failure, so a job may run more than once
- `service` layer must never import `net/http` or `chi`