cdp/data-layer/CLAUDE_analytics.md
2026-05-24 22:59:24 +07:00

CLAUDE.md — CDP Analytics Service

You are a senior software engineer building the Analytics & Data Layer for a self-hosted CDP platform. This service focuses on querying, exploring, and activating data already ingested into ClickHouse.

Scope boundary: Read-side only. Never write raw events. Ingestion is handled by cdp-ingestion.


What This Service Does

Exposes ingested event data via Query API for exploration and analysis. Computes Traits and Audience Segments from event history via background workers. Activates segments to external tools via Reverse ETL and webhooks.


Repository Layout

cdp-analytics/
├── api/          # Go — Query API, Profile API                   (port 4000)
├── workers/      # Go — Background jobs: Computed Traits, Segment refresh
├── console/      # React + Vite + shadcn/ui + Tailwind — Analytics UI
└── infra/
    ├── migrations/   # PostgreSQL migrations (golang-migrate)
    └── clickhouse/   # ClickHouse query templates (.sql files)

Tech Stack

Go Services (api, workers)

Concern           Library                       Notes
HTTP router       chi                           Lightweight, stdlib-compatible middleware
Logger            zap                           Structured logging, fastest
PostgreSQL        pgx/v5                        Native driver, no database/sql wrapper
ClickHouse        clickhouse-go/v2              Official driver, native protocol, good batch support
Redis             rueidis                       Modern client, faster than go-redis
Job queue         riverqueue/river              Postgres-backed, pgx/v5 native, built-in scheduler + retry
Config            caarlos0/env                  Parse env vars into structs, zero deps
Validation        go-playground/validator/v10   Struct tags validation
Migration         golang-migrate + pgx driver   CLI only; never auto-migrate on startup
Test assertion    testify                       assert + require + mock
Integration test  testcontainers-go             Real PG / Redis / ClickHouse in tests

React Console (console/)

Concern         Library
Build           Vite
UI components   shadcn/ui + Tailwind
Routing         React Router v6
Server state    TanStack Query
Client state    Zustand
Forms           react-hook-form + zod
Charts          Recharts
Icons           lucide-react

No new technology without discussion. All additions must justify why existing stack cannot handle it.


Go Project Structure

api/

api/
├── cmd/
│   └── server/
│       └── main.go        # wire everything, start server
└── internal/
    ├── handler/            # HTTP handlers — parse request, call service, write response
    ├── service/            # business logic — no HTTP, no DB concerns
    ├── repo/               # DB queries — PostgreSQL via pgx, ClickHouse via clickhouse-go
    ├── middleware/         # auth, request ID, logging
    └── config/             # env parsing via caarlos0/env

workers/

workers/
├── cmd/
│   └── worker/
│       └── main.go        # register jobs, start river worker
└── internal/
    ├── job/                # job definitions (ComputeTraitsJob, RefreshSegmentJob, ReverseETLJob)
    ├── handler/            # job handlers — business logic per job type
    ├── repo/               # DB queries shared across job handlers
    └── config/

Rules:

  • Dependencies point one way: handler → service in api/, handler → repo in workers/. Never the reverse.
  • handler never touches DB directly in api/.
  • service never imports chi or any HTTP package.
  • repo returns domain types, never raw pgx.Rows or driver.Rows.
  • ClickHouse queries live as .sql files in infra/clickhouse/ — no inline SQL strings for complex queries.
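
The layering rules above can be sketched with plain interfaces. This is a minimal illustration, not the service's actual code: Profile, ProfileRepo, memRepo, and serve are hypothetical names, net/http stands in for chi, and errors are written with http.Error rather than the AppError pattern to keep the example short.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Profile is a domain type. The repo returns this, never driver rows.
type Profile struct {
	ID    string `json:"id"`
	Email string `json:"email"`
}

var errProfileNotFound = errors.New("profile not found")

// ProfileRepo is all the service knows about storage (hypothetical interface).
type ProfileRepo interface {
	GetProfile(ctx context.Context, id string) (Profile, error)
}

// memRepo is an in-memory stand-in for a pgx-backed repo.
type memRepo map[string]Profile

func (m memRepo) GetProfile(_ context.Context, id string) (Profile, error) {
	p, ok := m[id]
	if !ok {
		return Profile{}, errProfileNotFound
	}
	return p, nil
}

// ProfileService holds business logic. Its methods use no HTTP types.
type ProfileService struct{ repo ProfileRepo }

func (s *ProfileService) Lookup(ctx context.Context, id string) (Profile, error) {
	if id == "" {
		return Profile{}, errors.New("profile id is required")
	}
	return s.repo.GetProfile(ctx, id)
}

// ProfileHandler parses the request, calls the service, writes the response.
type ProfileHandler struct{ svc *ProfileService }

func (h *ProfileHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/profiles/")
	p, err := h.svc.Lookup(r.Context(), id)
	if errors.Is(err, errProfileNotFound) {
		http.Error(w, "profile not found", http.StatusNotFound)
		return
	}
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(p)
}

// serve wires repo -> service -> handler and returns the status for one GET.
func serve(path string) int {
	repo := memRepo{"p_1": {ID: "p_1", Email: "a@example.com"}}
	h := &ProfileHandler{svc: &ProfileService{repo: repo}}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(serve("/profiles/p_1"), serve("/profiles/missing"))
}
```

Because the service sees only the ProfileRepo interface, a unit test can swap in a fake repo without touching PostgreSQL, which is what the testing strategy later in this file relies on.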

Error Handling

Same AppError pattern as ingestion. Never return raw pgx or clickhouse-go errors to handlers.

// internal/apperr/apperr.go

type AppError struct {
    Code    int    // HTTP status code to return
    Message string // user-facing message (safe to expose)
    Field   string // optional: which field caused the error
    Err     error  // original error for logging (not exposed to user)
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

// Constructors
func BadRequest(msg, field string, err error) *AppError
func NotFound(msg string) *AppError
func Forbidden(msg string) *AppError
func Internal(err error) *AppError

Handler pattern — one place handles all errors:

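// render is assumed to be a project-local helper with the signature
// JSON(w, status, body); go-chi/render's JSON takes (w, r, v) instead.
func writeError(w http.ResponseWriter, err error) {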
    var appErr *apperr.AppError
    if errors.As(err, &appErr) {
        render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field})
        return
    }
    render.JSON(w, 500, ErrorResponse{Error: "internal server error"})
}
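
One possible shape for the constructors and the single error writer, using only the standard library. The status codes chosen per constructor, the generic message used by Internal, the ErrorResponse JSON tags, and the respond helper are assumptions made for this sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type AppError struct {
	Code    int    // HTTP status code to return
	Message string // user-facing message (safe to expose)
	Field   string // optional: which field caused the error
	Err     error  // original error for logging (not exposed to user)
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

func BadRequest(msg, field string, err error) *AppError {
	return &AppError{Code: http.StatusBadRequest, Message: msg, Field: field, Err: err}
}

func NotFound(msg string) *AppError {
	return &AppError{Code: http.StatusNotFound, Message: msg}
}

func Forbidden(msg string) *AppError {
	return &AppError{Code: http.StatusForbidden, Message: msg}
}

// Internal hides the cause behind a generic message; Err is kept for logs.
func Internal(err error) *AppError {
	return &AppError{Code: http.StatusInternalServerError, Message: "internal server error", Err: err}
}

type ErrorResponse struct {
	Error string `json:"error"`
	Field string `json:"field,omitempty"`
}

// writeError maps any error to a JSON response. Anything that is not an
// AppError becomes a generic 500, so driver error text never reaches clients.
func writeError(w http.ResponseWriter, err error) {
	resp, code := ErrorResponse{Error: "internal server error"}, http.StatusInternalServerError
	var appErr *AppError
	if errors.As(err, &appErr) {
		resp, code = ErrorResponse{Error: appErr.Message, Field: appErr.Field}, appErr.Code
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(resp)
}

// respond returns the status and body that writeError produces for err.
func respond(err error) (int, string) {
	rec := httptest.NewRecorder()
	writeError(rec, err)
	return rec.Code, rec.Body.String()
}

func main() {
	// An AppError wrapped with %w is still found by errors.As.
	wrapped := fmt.Errorf("load profile: %w", NotFound("profile not found"))
	fmt.Println(respond(wrapped))
	// A raw driver-style error falls through to the generic 500.
	fmt.Println(respond(errors.New("pgx: connection refused")))
}
```

Keeping the original error in Err lets the logging middleware record the full chain while the client only ever sees Message.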

ClickHouse Query Pattern

Use raw SQL only. No query builder — ClickHouse SQL has its own syntax that builders handle poorly.

infra/clickhouse/
├── event_explorer.sql
├── funnel_analysis.sql
├── retention_cohort.sql
└── session_analysis.sql

Load templates at startup, inject parameters safely:

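// Never fmt.Sprintf into SQL; bind named parameters (@workspace_id in the template)
query, err := templates.Load("funnel_analysis.sql")
// handle err before using query
rows, err := chConn.Query(ctx, query, clickhouse.Named("workspace_id", id), ...)
// handle err, then defer rows.Close()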

Rules:

  • All ClickHouse queries must have a corresponding .sql file in infra/clickhouse/
  • No multi-line SQL strings inline in Go code
  • Every ClickHouse schema change must have a DDL file in infra/clickhouse/
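
A minimal sketch of what a startup-time template loader could look like, assuming the templates are exposed through an fs.FS (for example os.DirFS or an embed.FS). The Templates type, LoadTemplates, demoFS, and the sample query with its event_name column are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// Templates holds every .sql file read at startup, keyed by file name.
type Templates struct{ queries map[string]string }

// LoadTemplates reads all *.sql files in the root of fsys once. Failing here,
// at startup, surfaces a missing or unreadable template before any request.
func LoadTemplates(fsys fs.FS) (*Templates, error) {
	names, err := fs.Glob(fsys, "*.sql")
	if err != nil {
		return nil, fmt.Errorf("glob sql templates: %w", err)
	}
	t := &Templates{queries: make(map[string]string, len(names))}
	for _, name := range names {
		b, err := fs.ReadFile(fsys, name)
		if err != nil {
			return nil, fmt.Errorf("read %s: %w", name, err)
		}
		t.queries[name] = strings.TrimSpace(string(b))
	}
	return t, nil
}

// Load returns the query text for name, or an error for unknown templates.
func (t *Templates) Load(name string) (string, error) {
	q, ok := t.queries[name]
	if !ok {
		return "", fmt.Errorf("unknown sql template %q", name)
	}
	return q, nil
}

// demoFS stands in for os.DirFS("infra/clickhouse") or an embed.FS.
// The query text and its column names are illustrative only.
func demoFS() fs.FS {
	return fstest.MapFS{
		"event_explorer.sql": {Data: []byte(
			"SELECT event_name, count() AS n\n" +
				"FROM events\n" +
				"WHERE workspace_id = @workspace_id\n" +
				"GROUP BY event_name\n")},
	}
}

func main() {
	tpl, err := LoadTemplates(demoFS())
	if err != nil {
		panic(err)
	}
	query, err := tpl.Load("event_explorer.sql")
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
}
```

The @workspace_id placeholder in the template is what a clickhouse.Named("workspace_id", ...) argument binds to, so user input never becomes part of the SQL text.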

Job Queue (river)

Background workers use riverqueue/river backed by PostgreSQL.

// Define a job
type ComputeTraitsArgs struct {
    WorkspaceID string `json:"workspace_id"`
    TraitID     string `json:"trait_id"`
}
func (ComputeTraitsArgs) Kind() string { return "compute_traits" }

// Register handler
river.AddWorker(workers, &ComputeTraitsWorker{repo: repo})

// Enqueue
client.Insert(ctx, ComputeTraitsArgs{WorkspaceID: "ws_123", TraitID: "t_456"}, nil)

Scheduled jobs (periodic):

// Hourly trait recompute, hourly segment refresh.
// Periodic jobs are built with river.NewPeriodicJob and passed in river.Config.PeriodicJobs.
river.NewPeriodicJob(
    river.PeriodicInterval(time.Hour),
    func() (river.JobArgs, *river.InsertOpts) {
        return ComputeTraitsArgs{}, nil
    },
    &river.PeriodicJobOpts{},
)

Rules:

  • Workers must be idempotent — river may retry on failure
  • Use river's built-in retry with exponential backoff, do not implement custom retry
  • Log job start, job end, duration, and error with full context (job_id, args)
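
To make the idempotency rule concrete, here is a small in-memory sketch. It assumes a trait is recomputed from the full event history and written with upsert semantics keyed by (profile_id, trait_key); traitStore, computeEventCount, and the "event_count" trait are hypothetical names, and the map stands in for the profile_traits table.

```go
package main

import (
	"fmt"
	"time"
)

type traitKey struct{ ProfileID, TraitKey string }

type traitValue struct {
	Value      int
	ComputedAt time.Time
}

// traitStore is an in-memory stand-in for the profile_traits table.
type traitStore map[traitKey]traitValue

// upsert overwrites the row for (profile_id, trait_key). In PostgreSQL this
// would typically be INSERT ... ON CONFLICT ... DO UPDATE.
func (s traitStore) upsert(k traitKey, v traitValue) { s[k] = v }

// computeEventCount derives an "event_count" trait from the full event
// history on every run. The result depends only on the input, not on how
// many times the job has run, so a retry cannot double count.
func computeEventCount(store traitStore, events []string, now time.Time) {
	counts := map[string]int{}
	for _, profileID := range events {
		counts[profileID]++
	}
	for profileID, n := range counts {
		store.upsert(traitKey{profileID, "event_count"}, traitValue{Value: n, ComputedAt: now})
	}
}

// runTwice simulates a retry: same job, same args, executed twice.
func runTwice() traitStore {
	store := traitStore{}
	events := []string{"p_1", "p_2", "p_1"} // one entry per event, by profile
	computeEventCount(store, events, time.Unix(0, 0))
	computeEventCount(store, events, time.Unix(60, 0))
	return store
}

func main() {
	for k, v := range runTwice() {
		fmt.Printf("%s %s=%d\n", k.ProfileID, k.TraitKey, v.Value)
	}
}
```

A worker that instead incremented a stored counter on each run would report p_1 with a count of 4 after the retry; recomputing and overwriting keeps it at 2.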

Cache Strategy (Redis)

Semantic key structure — allows per-workspace invalidation:

cache:query:events:{workspace_id}:{hash(params)}     TTL 60s
cache:query:funnel:{workspace_id}:{hash(params)}     TTL 60s
cache:query:retention:{workspace_id}:{hash(params)}  TTL 60s
cache:dashboard:{workspace_id}                       TTL 60s
cache:profile:{workspace_id}:{profile_id}            TTL 30s

Rules:

  • Default TTL: 60s for aggregate queries, 30s for profile lookups
  • TTL is configurable per query type via env vars
  • On cache miss: query ClickHouse, write result to Redis, return result
  • Never cache Custom SQL results — each query is arbitrary
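
The key structure and the cache-miss rule can be sketched as follows. The Cache interface, memCache, getOrCompute, and demo are hypothetical; the sketch assumes hash(params) is a SHA-256 of the JSON-encoded parameters, which is one reasonable choice rather than something this document specifies.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// cacheKey builds cache:query:{queryType}:{workspaceID}:{hash(params)}.
// encoding/json sorts map keys, so equal params give equal hashes.
func cacheKey(queryType, workspaceID string, params map[string]any) (string, error) {
	b, err := json.Marshal(params)
	if err != nil {
		return "", fmt.Errorf("marshal params: %w", err)
	}
	sum := sha256.Sum256(b)
	return fmt.Sprintf("cache:query:%s:%s:%s", queryType, workspaceID, hex.EncodeToString(sum[:])), nil
}

// Cache is the minimal surface this sketch needs from Redis (hypothetical).
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string, ttl time.Duration)
}

// memCache is an in-memory stand-in; it ignores the TTL.
type memCache map[string]string

func (m memCache) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

func (m memCache) Set(key, value string, _ time.Duration) { m[key] = value }

// getOrCompute is cache-aside: on a miss, run the query, store, return.
// The bool result is the cache_hit flag that the logging policy asks for.
func getOrCompute(c Cache, key string, ttl time.Duration, query func() (string, error)) (string, bool, error) {
	if v, ok := c.Get(key); ok {
		return v, true, nil
	}
	v, err := query()
	if err != nil {
		return "", false, err // failed queries are not cached
	}
	c.Set(key, v, ttl)
	return v, false, nil
}

// demo issues the same funnel query twice and reports each cache_hit flag
// plus how many times the stand-in ClickHouse query actually ran.
func demo() (firstHit, secondHit bool, calls int) {
	cache := memCache{}
	params := map[string]any{"steps": []string{"signup", "purchase"}, "days": 7}
	key, err := cacheKey("funnel", "ws_123", params)
	if err != nil {
		panic(err)
	}
	query := func() (string, error) { calls++; return "placeholder result", nil }
	_, firstHit, _ = getOrCompute(cache, key, 60*time.Second, query)
	_, secondHit, _ = getOrCompute(cache, key, 60*time.Second, query)
	return firstHit, secondHit, calls
}

func main() {
	fmt.Println(demo())
}
```

Because the workspace ID sits before the hash, all keys for one workspace share a prefix, which is what makes per-workspace invalidation possible.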

Custom SQL Sandbox

POST /query/sql allows arbitrary SQL on ClickHouse. Two layers of protection:

Layer 1 — App-level parse (Go):

// forbiddenKeyword matches DDL/DML keywords as whole words only, so
// identifiers such as created_at or updated_at are not rejected.
// Requires the "regexp" and "strings" packages.
var forbiddenKeyword = regexp.MustCompile(`\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)\b`)

// Reject anything that is not a SELECT statement
func validateReadOnly(sql string) error {
    normalized := strings.TrimSpace(strings.ToUpper(sql))
    if !strings.HasPrefix(normalized, "SELECT") {
        return apperr.BadRequest("only SELECT statements are allowed", "sql", nil)
    }
    // Reject common DDL/DML keywords
    if kw := forbiddenKeyword.FindString(normalized); kw != "" {
        return apperr.BadRequest("statement contains forbidden keyword: "+kw, "sql", nil)
    }
    return nil
}

Layer 2 — ClickHouse read-only user:

  • Custom SQL queries run as a separate ClickHouse user with SELECT-only grants
  • DDL/DML rejected at DB level even if app-level check is bypassed

Testing Strategy

Unit tests — handler + service layer

  • Mock interfaces with testify/mock
  • No real DB, no real Redis, no real ClickHouse
  • File: foo_test.go alongside the file being tested

Integration tests — repo layer only

  • Use testcontainers-go to spin up real PostgreSQL, Redis, ClickHouse
  • File: internal/repo/event_repo_test.go
  • Tag: //go:build integration

make test              # unit only (fast, no containers)
make test/integration  # repo layer with real DBs (slower, CI)

Migration Workflow

make migrate/new name=add_profile_traits   # create up+down files
make migrate/up                            # apply all pending
make migrate/down                          # rollback one step
make migrate/status                        # show current version

  • Migration files: infra/migrations/{version}_{name}.up.sql + .down.sql
  • Never auto-run migrations on server startup
  • Every PostgreSQL schema change must have a migration file

PostgreSQL Schema (Analytics-owned tables)

-- Computed trait values per profile
profile_traits (
    profile_id   UUID,
    trait_key    TEXT,
    trait_value  JSONB,
    computed_at  TIMESTAMPTZ
)

-- Segment membership history (used for delta Reverse ETL)
segment_memberships (
    segment_id   UUID,
    profile_id   UUID,
    entered_at   TIMESTAMPTZ,
    exited_at    TIMESTAMPTZ   -- NULL = currently a member
)
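
The segment_memberships table exists so that Reverse ETL can push only changes. A minimal sketch of the delta computation, assuming the previous member set comes from rows with exited_at IS NULL and the current set from a fresh segment evaluation; Delta and diffMembers are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Delta lists profiles that entered or exited a segment between two runs.
type Delta struct{ Entered, Exited []string }

// diffMembers compares the open memberships from the previous run against
// the freshly evaluated member set.
func diffMembers(previous, current []string) Delta {
	prev := make(map[string]struct{}, len(previous))
	for _, id := range previous {
		prev[id] = struct{}{}
	}
	cur := make(map[string]struct{}, len(current))
	for _, id := range current {
		cur[id] = struct{}{}
	}

	var d Delta
	for id := range cur {
		if _, ok := prev[id]; !ok {
			d.Entered = append(d.Entered, id) // new member: insert row with entered_at
		}
	}
	for id := range prev {
		if _, ok := cur[id]; !ok {
			d.Exited = append(d.Exited, id) // former member: set exited_at
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(d.Entered)
	sort.Strings(d.Exited)
	return d
}

func main() {
	d := diffMembers([]string{"p_1", "p_2"}, []string{"p_2", "p_3"})
	fmt.Println("entered:", d.Entered, "exited:", d.Exited)
	// prints: entered: [p_3] exited: [p_1]
}
```

Profiles present in both sets (p_2 here) produce no output, so an unchanged segment results in an empty delta and nothing to push.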

Data Sources (Read-only)

This service only reads data written by cdp-ingestion. Never write to these tables.

Source                              Data
ClickHouse events                   Flattened, schema-managed raw events
PostgreSQL profiles                 Identity graph, unified profiles
PostgreSQL sources / destinations   Config metadata
PostgreSQL schemas                  Schema registry from ingestion

Key Design Decisions

Problem                   Decision
Job queue                 river on PostgreSQL; no Temporal, no Celery
Computed Traits refresh   Hourly default, configurable per trait
Segment re-evaluation     Full re-evaluation; simpler than incremental
Query cache               Redis semantic keys, TTL 60s default
Custom SQL                App-level SELECT-only check + ClickHouse read-only user
Reverse ETL               Delta only (entered/exited); never push full member list
ClickHouse queries        Raw SQL in .sql template files; no query builder
Scaling                   Vertical: increase RAM/CPU, not instances
Migration                 CLI only; never auto-migrate on startup

API Endpoints

Method   Path                     Description
POST     /query/events            Filter + query raw events
POST     /query/sql               Custom SQL on ClickHouse (SELECT only)
POST     /query/funnel            Funnel analysis
POST     /query/retention         Retention cohort
GET      /profiles/:id            Unified profile lookup
GET      /profiles/:id/events     User event timeline
GET      /segments                List segments
POST     /segments                Create segment
GET      /segments/:id/members    Segment members
GET      /traits/definitions      List computed trait definitions
GET      /health                  Health check
GET      /ready                   Readiness check

Every endpoint must have a request struct with validate tags. Validation runs before any business logic.


Feature Priorities

Priority   Features
P0         Event Explorer, Custom SQL, Profile Lookup, Event Timeline, Saved Queries
P1         Funnel Analysis, Retention Analysis, Session Analysis, Pre-built Dashboards
P2         Computed Traits, Audience Segments, Background Worker
P3         Reverse ETL, Webhook Push, Schema Registry, Data Catalog

Build in priority order. Do not start P1 before P0 is stable.


Logging Policy (zap)

Query requests  → log workspace_id, query_type, duration_ms, rows_returned, cache_hit
Worker jobs     → log job_id, job_kind, args, duration_ms, status (success/error)
Errors          → log full error chain with context

Coding Rules

  • Do not write code unless asked — discuss architecture/features first
  • Ask when scope is unclear, especially when multiple valid approaches exist
  • YAGNI + KISS — do not build what is not needed yet
  • Correctness before performance — optimize only when profiling proves it necessary
  • Every PostgreSQL schema change must have a migration file in infra/migrations/
  • Every ClickHouse query must have a .sql file in infra/clickhouse/
  • Every API endpoint must have a request struct with validate tags
  • Never write raw events — this service is read-side only
  • Discuss in Vietnamese, write code and comments in English

Common Pitfalls

  • Do not query ClickHouse directly for computed traits at request time — serve from PostgreSQL
  • Do not run full segment scans on every API request — that is the worker's job
  • Do not cache Custom SQL results — queries are arbitrary, cache would be useless
  • Do not inline complex SQL strings in Go — use .sql template files
  • Do not return raw pgx or clickhouse-go errors to HTTP handlers — wrap with AppError
  • Do not run migrations on server startup — use make migrate/up explicitly
  • Reverse ETL must push delta only (entered/exited), never the full member list per run
  • Workers must be idempotent — river retries on failure, job may run more than once
  • service layer must never import net/http or chi