# CLAUDE.md — CDP Analytics Service

> You are a senior software engineer building the **Analytics & Data Layer** for a self-hosted CDP platform.
> This service focuses on **querying, exploring, and activating** data already ingested into ClickHouse.
>
> **Scope boundary**: Read-side only. Never write raw events. Ingestion is handled by `cdp-ingestion`.

---

## What This Service Does

Exposes ingested event data through the Query API for exploration and analysis. Computes Traits and Audience
Segments from event history in background workers. Activates segments to external tools via Reverse ETL
and webhooks.

---

## Repository Layout

```
cdp-analytics/
├── api/          # Go — Query API, Profile API                   (port 4000)
├── workers/      # Go — Background jobs: Computed Traits, Segment refresh
├── console/      # React + Vite + shadcn/ui + Tailwind — Analytics UI
└── infra/
    ├── migrations/   # PostgreSQL migrations (golang-migrate)
    └── clickhouse/   # ClickHouse query templates (.sql files)
```

---

## Tech Stack

### Go Services (api, workers)

| Concern | Library | Notes |
|---------|---------|-------|
| HTTP router | `chi` | Lightweight, stdlib-compatible middleware |
| Logger | `zap` | Structured, low-allocation logging |
| PostgreSQL | `pgx/v5` | Native driver, no database/sql wrapper |
| ClickHouse | `clickhouse-go/v2` | Official driver, native protocol, good batch support |
| Redis | `rueidis` | Modern client with auto-pipelining; chosen over go-redis for throughput |
| Job queue | `riverqueue/river` | Postgres-backed, pgx/v5 native, built-in scheduler + retry |
| Config | `caarlos0/env` | Parse env vars into structs, zero deps |
| Validation | `go-playground/validator/v10` | Struct tags validation |
| Migration | `golang-migrate` + pgx driver | CLI only — never auto-migrate on startup |
| Test assertion | `testify` | assert + require + mock |
| Integration test | `testcontainers-go` | Real PG / Redis / ClickHouse in tests |

### React Console (console/)

| Concern | Library |
|---------|---------|
| Build | Vite |
| UI components | shadcn/ui + Tailwind |
| Routing | React Router v6 |
| Server state | TanStack Query |
| Client state | Zustand |
| Forms | react-hook-form + zod |
| Charts | Recharts |
| Icons | lucide-react |

> **No new technology** without discussion. Every addition must justify why the existing stack cannot handle the need.

---

## Go Project Structure

### api/

```
api/
├── cmd/
│   └── server/
│       └── main.go        # wire everything, start server
└── internal/
    ├── handler/            # HTTP handlers — parse request, call service, write response
    ├── service/            # business logic — no HTTP, no DB concerns
    ├── repo/               # DB queries — PostgreSQL via pgx, ClickHouse via clickhouse-go
    ├── middleware/         # auth, request ID, logging
    └── config/             # env parsing via caarlos0/env
```

### workers/

```
workers/
├── cmd/
│   └── worker/
│       └── main.go        # register jobs, start river worker
└── internal/
    ├── job/                # job definitions (ComputeTraitsJob, RefreshSegmentJob, ReverseETLJob)
    ├── handler/            # job handlers — business logic per job type
    ├── repo/               # DB queries shared across job handlers
    └── config/
```

Rules:
- `handler` depends on `service` in api/ and on `repo` in workers/. Never the reverse.
- `handler` never touches DB directly in api/.
- `service` never imports `chi` or any HTTP package.
- `repo` returns domain types, never raw `pgx.Rows` or `driver.Rows`.
- ClickHouse queries live as `.sql` files in `infra/clickhouse/` — no inline SQL strings for complex queries.
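
The layering rules above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual code: `Profile`, `ProfileRepo`, `memoryRepo`, `ProfileService`, `ErrProfileNotFound`, and `statusFor` are hypothetical names, the in-memory repo stands in for a `pgx`-backed one, and the ID is read from a query parameter to avoid pulling in a router. In the real layout each layer would live in its own package, so the "no HTTP imports in `service`" rule would be enforced by the import graph.

```go
package main

import (
    "context"
    "encoding/json"
    "errors"
    "net/http"
    "net/http/httptest"
)

// Profile is a domain type; the repo returns this, never raw driver rows.
type Profile struct {
    ID    string `json:"id"`
    Email string `json:"email"`
}

// ErrProfileNotFound is a hypothetical domain-level sentinel error.
var ErrProfileNotFound = errors.New("profile not found")

// ProfileRepo is all the service layer knows about storage.
type ProfileRepo interface {
    GetProfile(ctx context.Context, id string) (Profile, error)
}

// memoryRepo stands in for a pgx-backed implementation.
type memoryRepo struct{ rows map[string]Profile }

func (r memoryRepo) GetProfile(_ context.Context, id string) (Profile, error) {
    p, ok := r.rows[id]
    if !ok {
        return Profile{}, ErrProfileNotFound
    }
    return p, nil
}

// ProfileService holds business logic and uses no HTTP types.
type ProfileService struct{ repo ProfileRepo }

func (s ProfileService) Lookup(ctx context.Context, id string) (Profile, error) {
    return s.repo.GetProfile(ctx, id)
}

// profileHandler parses the request, calls the service, writes the response.
func profileHandler(svc ProfileService) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        p, err := svc.Lookup(r.Context(), r.URL.Query().Get("id"))
        switch {
        case errors.Is(err, ErrProfileNotFound):
            http.Error(w, "profile not found", http.StatusNotFound)
        case err != nil:
            http.Error(w, "internal server error", http.StatusInternalServerError)
        default:
            w.Header().Set("Content-Type", "application/json")
            _ = json.NewEncoder(w).Encode(p)
        }
    }
}

// statusFor wires the three layers together and returns the HTTP status
// produced for a single lookup.
func statusFor(id string) int {
    repo := memoryRepo{rows: map[string]Profile{"p_1": {ID: "p_1", Email: "a@example.com"}}}
    rec := httptest.NewRecorder()
    req := httptest.NewRequest(http.MethodGet, "/profiles?id="+id, nil)
    profileHandler(ProfileService{repo: repo})(rec, req)
    return rec.Code
}
```

Because the handler only sees `ProfileService` and the service only sees the `ProfileRepo` interface, a unit test can swap in a fake repo without touching a database.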

---

## Error Handling

Same `AppError` pattern as ingestion. Never return raw `pgx` or `clickhouse-go` errors to handlers.

```go
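// internal/apperr/apperr.go
package apperr

import "net/http"

type AppError struct {
    Code    int    // HTTP status code to return
    Message string // user-facing message (safe to expose)
    Field   string // optional: which field caused the error
    Err     error  // original error for logging (not exposed to user)
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

// Constructors (bodies are a minimal sketch of the intended behavior)
func BadRequest(msg, field string, err error) *AppError {
    return &AppError{Code: http.StatusBadRequest, Message: msg, Field: field, Err: err}
}

func NotFound(msg string) *AppError {
    return &AppError{Code: http.StatusNotFound, Message: msg}
}

func Forbidden(msg string) *AppError {
    return &AppError{Code: http.StatusForbidden, Message: msg}
}

// Internal hides the cause behind a generic message; Err is kept for logging.
func Internal(err error) *AppError {
    return &AppError{Code: http.StatusInternalServerError, Message: "internal server error", Err: err}
}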
```

Handler pattern — one place handles all errors:

```go
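// render is assumed here to be a project-local helper with the signature
// JSON(w, status, v); go-chi/render's JSON takes (w, r, v) instead.
func writeError(w http.ResponseWriter, err error) {
    var appErr *apperr.AppError
    if errors.As(err, &appErr) {
        render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field})
        return
    }
    // Unknown error type: never leak its message to the client.
    render.JSON(w, http.StatusInternalServerError, ErrorResponse{Error: "internal server error"})
}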
```
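
To make the flow from repo to HTTP response concrete, here is a self-contained sketch using only the standard library. It is illustrative rather than the project's implementation: `errNoRows` is a hypothetical stand-in for a driver error, `findSegment` and `respond` are invented helpers, and the trimmed `AppError` redefines only the fields the example needs. Logging via the standard `log` package is a simplification; the service itself uses `zap`.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "log"
    "net/http"
    "net/http/httptest"
)

// AppError is a trimmed copy of the project's type, for this sketch only.
type AppError struct {
    Code    int
    Message string
    Err     error
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

// errNoRows is a hypothetical stand-in for a driver-level "no rows" error.
var errNoRows = errors.New("driver: no rows in result set")

// findSegment plays the role of a repo method: the driver error is wrapped
// in an AppError before it leaves the repo layer.
func findSegment(rows map[string]string, id string) (string, error) {
    name, ok := rows[id]
    if !ok {
        return "", &AppError{Code: http.StatusNotFound, Message: "segment not found", Err: errNoRows}
    }
    return name, nil
}

type errorResponse struct {
    Error string `json:"error"`
}

// writeError sends only the safe message and logs the wrapped cause.
func writeError(w http.ResponseWriter, err error) {
    code, msg, cause := http.StatusInternalServerError, "internal server error", err
    var appErr *AppError
    if errors.As(err, &appErr) {
        code, msg, cause = appErr.Code, appErr.Message, appErr.Err
    }
    log.Printf("request failed: status=%d cause=%v", code, cause)
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    _ = json.NewEncoder(w).Encode(errorResponse{Error: msg})
}

// respond runs one lookup and, on failure, passes the error through
// writeError. It returns the recorded status code and body.
func respond(id string) (int, string) {
    rec := httptest.NewRecorder()
    if _, err := findSegment(map[string]string{"seg_1": "Power users"}, id); err != nil {
        writeError(rec, fmt.Errorf("load segment %s: %w", id, err))
    }
    return rec.Code, rec.Body.String()
}
```

Note that `errors.As` still finds the `AppError` after the service adds context with `%w`, so intermediate layers can annotate errors without breaking the single error-rendering path.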

---

## ClickHouse Query Pattern

Use raw SQL only. No query builder — ClickHouse SQL has its own syntax that builders handle poorly.

```
infra/clickhouse/
├── event_explorer.sql
├── funnel_analysis.sql
├── retention_cohort.sql
└── session_analysis.sql
```

Load templates at startup, inject parameters safely:

```go
// Never fmt.Sprintf into SQL; use named parameters (e.g. @workspace_id in the .sql file)
query, err := templates.Load("funnel_analysis.sql")
if err != nil {
    return nil, err
}
rows, err := chConn.Query(ctx, query, clickhouse.Named("workspace_id", id), ...)
```

Rules:
- All ClickHouse queries must have a corresponding `.sql` file in `infra/clickhouse/`
- No multi-line SQL strings inline in Go code
- Every ClickHouse schema change must have a DDL file in `infra/clickhouse/`
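
One possible shape for the startup-time template loader is sketched below. The `Templates` type, `LoadTemplates`, and `exampleTemplates` are hypothetical names, not the project's actual `templates` package. The sketch reads from an `fs.FS`, so the real service could pass `os.DirFS("infra/clickhouse")` or an `embed.FS`, while the example uses an in-memory file system; the sample query text is invented for illustration.

```go
package main

import (
    "fmt"
    "io/fs"
    "strings"
    "testing/fstest"
)

// Templates holds every .sql file, keyed by file name, read once at startup.
type Templates struct{ queries map[string]string }

// LoadTemplates reads all *.sql files at the root of fsys into memory.
func LoadTemplates(fsys fs.FS) (*Templates, error) {
    names, err := fs.Glob(fsys, "*.sql")
    if err != nil {
        return nil, fmt.Errorf("list templates: %w", err)
    }
    t := &Templates{queries: make(map[string]string, len(names))}
    for _, name := range names {
        raw, err := fs.ReadFile(fsys, name)
        if err != nil {
            return nil, fmt.Errorf("read %s: %w", name, err)
        }
        t.queries[name] = strings.TrimSpace(string(raw))
    }
    return t, nil
}

// Load returns a template by file name and fails on unknown names, so a
// typo surfaces as an error rather than an empty query.
func (t *Templates) Load(name string) (string, error) {
    q, ok := t.queries[name]
    if !ok {
        return "", fmt.Errorf("unknown query template %q", name)
    }
    return q, nil
}

// exampleTemplates loads from an in-memory file system that stands in for
// the infra/clickhouse/ directory.
func exampleTemplates() (*Templates, error) {
    fsys := fstest.MapFS{
        "funnel_analysis.sql": {
            Data: []byte("SELECT count() FROM events WHERE workspace_id = @workspace_id\n"),
        },
    }
    return LoadTemplates(fsys)
}
```

Reading everything once at startup means a missing or unreadable template fails the boot instead of the first request that needs it.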

---

## Job Queue (river)

Background workers use `riverqueue/river` backed by PostgreSQL.

```go
// Define a job
type ComputeTraitsArgs struct {
    WorkspaceID string `json:"workspace_id"`
    TraitID     string `json:"trait_id"`
}
func (ComputeTraitsArgs) Kind() string { return "compute_traits" }

// Register handler
river.AddWorker(workers, &ComputeTraitsWorker{repo: repo})

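// Enqueue (Insert returns an error that must be checked)
if _, err := client.Insert(ctx, ComputeTraitsArgs{WorkspaceID: "ws_123", TraitID: "t_456"}, nil); err != nil {
    return fmt.Errorf("enqueue compute_traits: %w", err)
}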
```

Scheduled jobs (periodic):

```go
// Hourly trait recompute; declare a second periodic job the same way for the
// hourly segment refresh. The slice is passed via river.Config.PeriodicJobs.
periodicJobs := []*river.PeriodicJob{
    river.NewPeriodicJob(
        river.PeriodicInterval(time.Hour),
        func() (river.JobArgs, *river.InsertOpts) {
            return ComputeTraitsArgs{}, nil
        },
        &river.PeriodicJobOpts{RunOnStart: false},
    ),
}
```

Rules:
- Workers must be idempotent — river may retry on failure
- Use `river`'s built-in retry with exponential backoff; do not implement custom retry
- Log job start, job end, duration, and error with full context (job_id, args)
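
The idempotency rule is easiest to satisfy when a job recomputes its result from the source data and overwrites the stored row instead of incrementing it. The sketch below illustrates that idea under simplifying assumptions: `traitStore` is an in-memory stand-in for the `profile_traits` table, `upsert` models an `INSERT ... ON CONFLICT DO UPDATE`, the trait value is a plain `int` rather than JSONB, and `computeEventCount` is a hypothetical job body.

```go
package main

// traitKey identifies one computed value, i.e. one row in profile_traits.
type traitKey struct{ ProfileID, TraitKey string }

// traitStore is an in-memory stand-in for the profile_traits table.
type traitStore map[traitKey]int

// upsert overwrites the row for the key. Because it replaces rather than
// accumulates, applying it twice with the same input leaves the same state.
func (s traitStore) upsert(k traitKey, v int) { s[k] = v }

// computeEventCount recomputes the "event_count" trait from the full event
// history. eventProfileIDs holds one profile ID per event.
func computeEventCount(store traitStore, eventProfileIDs []string) {
    counts := make(map[string]int)
    for _, id := range eventProfileIDs {
        counts[id]++
    }
    for id, n := range counts {
        store.upsert(traitKey{ProfileID: id, TraitKey: "event_count"}, n)
    }
}
```

A variant that did `store[k] += n` would double the counts on a retry, which is exactly the failure mode the rule is meant to prevent.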

---

## Cache Strategy (Redis)

Semantic key structure — allows per-workspace invalidation:

```
cache:query:events:{workspace_id}:{hash(params)}     TTL 60s
cache:query:funnel:{workspace_id}:{hash(params)}     TTL 60s
cache:query:retention:{workspace_id}:{hash(params)}  TTL 60s
cache:dashboard:{workspace_id}                       TTL 60s
cache:profile:{workspace_id}:{profile_id}            TTL 30s
```

Rules:
- Default TTL: 60s for aggregate queries, 30s for profile lookups
- TTL is configurable per query type via env vars
- On cache miss: query ClickHouse, write result to Redis, return result
- Never cache Custom SQL results — each query is arbitrary
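
The key layout and the cache-miss rule above can be sketched as follows. This is an illustration under assumptions: the `Cache` interface, `queryKey`, `cachedQuery`, `memoryCache`, and `loadsAfterTwoRequests` are hypothetical names, SHA-256 over the JSON-encoded parameters is one possible choice for `hash(params)`, and the in-memory cache ignores TTLs. A `rueidis`-backed type would implement the same interface in the real service.

```go
package main

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "time"
)

// Cache is the minimal surface this sketch needs.
type Cache interface {
    Get(ctx context.Context, key string) ([]byte, bool)
    Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

// queryKey builds cache:query:{type}:{workspace_id}:{hash(params)}.
// encoding/json sorts map keys, so equal params always hash the same.
func queryKey(queryType, workspaceID string, params map[string]any) (string, error) {
    raw, err := json.Marshal(params)
    if err != nil {
        return "", fmt.Errorf("marshal params: %w", err)
    }
    sum := sha256.Sum256(raw)
    return fmt.Sprintf("cache:query:%s:%s:%s", queryType, workspaceID, hex.EncodeToString(sum[:])), nil
}

// cachedQuery returns the cached value on a hit. On a miss it runs load
// (the ClickHouse query), stores the result, and returns it.
func cachedQuery(ctx context.Context, c Cache, key string, ttl time.Duration,
    load func(context.Context) ([]byte, error)) (val []byte, hit bool, err error) {
    if val, ok := c.Get(ctx, key); ok {
        return val, true, nil
    }
    val, err = load(ctx)
    if err != nil {
        return nil, false, err
    }
    c.Set(ctx, key, val, ttl)
    return val, false, nil
}

// memoryCache is a demo implementation that ignores TTLs.
type memoryCache map[string][]byte

func (m memoryCache) Get(_ context.Context, k string) ([]byte, bool) {
    v, ok := m[k]
    return v, ok
}

func (m memoryCache) Set(_ context.Context, k string, v []byte, _ time.Duration) {
    m[k] = v
}

// loadsAfterTwoRequests issues the same query twice and reports how many
// times the loader actually ran.
func loadsAfterTwoRequests() (int, error) {
    ctx := context.Background()
    cache := memoryCache{}
    loads := 0
    load := func(context.Context) ([]byte, error) {
        loads++
        return []byte(`{"rows":3}`), nil
    }
    key, err := queryKey("funnel", "ws_123", map[string]any{"steps": 3, "window": "7d"})
    if err != nil {
        return 0, err
    }
    for i := 0; i < 2; i++ {
        if _, _, err := cachedQuery(ctx, cache, key, 60*time.Second, load); err != nil {
            return 0, err
        }
    }
    return loads, nil
}
```

Keeping `workspace_id` as its own key segment, outside the hash, is what makes per-workspace invalidation by key prefix possible.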

---

## Custom SQL Sandbox

`POST /query/sql` allows arbitrary SQL on ClickHouse. Two layers of protection:

**Layer 1 — App-level parse (Go):**
```go
// forbiddenKeyword matches DDL/DML keywords as whole words only, so column
// names such as created_at or updated_at do not trigger a false rejection.
var forbiddenKeyword = regexp.MustCompile(`\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)\b`)

// Reject anything that is not a SELECT statement
func validateReadOnly(sql string) error {
    normalized := strings.TrimSpace(strings.ToUpper(sql))
    if !strings.HasPrefix(normalized, "SELECT") {
        return apperr.BadRequest("only SELECT statements are allowed", "sql", nil)
    }
    // Reject common DDL/DML keywords. This is a coarse filter: a keyword inside
    // a string literal is still rejected, and Layer 2 remains the real guard.
    if kw := forbiddenKeyword.FindString(normalized); kw != "" {
        return apperr.BadRequest("statement contains forbidden keyword: "+kw, "sql", nil)
    }
    return nil
}
```

**Layer 2 — ClickHouse read-only user:**
- Custom SQL queries run as a separate ClickHouse user with `SELECT`-only grants
- DDL/DML rejected at DB level even if app-level check is bypassed

---

## Testing Strategy

### Unit tests — handler + service layer
- Mock interfaces with `testify/mock`
- No real DB, no real Redis, no real ClickHouse
- File: `foo_test.go` alongside the file being tested

### Integration tests — repo layer only
- Use `testcontainers-go` to spin up real PostgreSQL, Redis, ClickHouse
- File: `internal/repo/event_repo_test.go`
- Tag: `//go:build integration`

```bash
make test              # unit only (fast, no containers)
make test/integration  # repo layer with real DBs (slower, CI)
```

---

## Migration Workflow

```bash
make migrate/new name=add_profile_traits   # create up+down files
make migrate/up                            # apply all pending
make migrate/down                          # rollback one step
make migrate/status                        # show current version
```

- Migration files: `infra/migrations/{version}_{name}.up.sql` + `.down.sql`
- **Never** auto-run migrations on server startup
- Every PostgreSQL schema change **must** have a migration file

---

## PostgreSQL Schema (Analytics-owned tables)

```sql
-- Computed trait values per profile
CREATE TABLE profile_traits (
    profile_id   UUID,
    trait_key    TEXT,
    trait_value  JSONB,
    computed_at  TIMESTAMPTZ
);

-- Segment membership history (used for delta Reverse ETL)
CREATE TABLE segment_memberships (
    segment_id   UUID,
    profile_id   UUID,
    entered_at   TIMESTAMPTZ,
    exited_at    TIMESTAMPTZ   -- NULL = currently a member
);
```

---

## Data Sources (Read-only)

This service **only reads** data written by `cdp-ingestion`. Never write to these tables.

| Source | Data |
|--------|------|
| ClickHouse `events` | Flattened, schema-managed raw events |
| PostgreSQL `profiles` | Identity graph, unified profiles |
| PostgreSQL `sources` / `destinations` | Config metadata |
| PostgreSQL `schemas` | Schema registry from ingestion |

---

## Key Design Decisions

| Problem | Decision |
|---------|---------|
| Job queue | `river` on PostgreSQL — no Temporal, no Celery |
| Computed Traits refresh | Hourly default, configurable per trait |
| Segment re-evaluation | Full re-evaluation, simpler than incremental |
| Query cache | Redis semantic keys, TTL 60s default |
| Custom SQL | App-level SELECT-only check + ClickHouse read-only user |
| Reverse ETL | Delta only (entered/exited) — never push full member list |
| ClickHouse queries | Raw SQL in `.sql` template files — no query builder |
| Scaling | Vertical — increase RAM/CPU, not instances |
| Migration | CLI only — never auto-migrate on startup |
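
The "delta only" decision for Reverse ETL comes down to a set difference between the previous and the freshly evaluated membership. A minimal sketch follows; `Delta`, `toSet`, and `diffMembers` are hypothetical names, and profile IDs are plain strings here. In the service, the previous set would presumably be the rows in `segment_memberships` with `exited_at` still NULL.

```go
package main

import "sort"

// Delta lists the membership changes between two segment evaluations.
type Delta struct {
    Entered []string // profiles in current but not in previous
    Exited  []string // profiles in previous but not in current
}

// toSet converts a list of IDs into a set, dropping duplicates.
func toSet(ids []string) map[string]struct{} {
    set := make(map[string]struct{}, len(ids))
    for _, id := range ids {
        set[id] = struct{}{}
    }
    return set
}

// diffMembers compares the previous and current member lists and returns
// only what changed. Results are sorted so the output is deterministic.
func diffMembers(previous, current []string) Delta {
    prev, curr := toSet(previous), toSet(current)
    var d Delta
    for id := range curr {
        if _, ok := prev[id]; !ok {
            d.Entered = append(d.Entered, id)
        }
    }
    for id := range prev {
        if _, ok := curr[id]; !ok {
            d.Exited = append(d.Exited, id)
        }
    }
    sort.Strings(d.Entered)
    sort.Strings(d.Exited)
    return d
}
```

For example, comparing a previous membership of `p_1, p_2, p_3` with a current one of `p_2, p_3, p_4` yields `p_4` as entered and `p_1` as exited, so only two records are pushed instead of the full list.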

---

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/query/events` | Filter + query raw events |
| `POST` | `/query/sql` | Custom SQL on ClickHouse (SELECT only) |
| `POST` | `/query/funnel` | Funnel analysis |
| `POST` | `/query/retention` | Retention cohort |
| `GET` | `/profiles/:id` | Unified profile lookup |
| `GET` | `/profiles/:id/events` | User event timeline |
| `GET` | `/segments` | List segments |
| `POST` | `/segments` | Create segment |
| `GET` | `/segments/:id/members` | Segment members |
| `GET` | `/traits/definitions` | List computed trait definitions |
| `GET` | `/health` | Health check |
| `GET` | `/ready` | Readiness check |

Every endpoint must have a request struct with `validate` tags. Validation runs before any business logic.

---

## Feature Priorities

| Priority | Features |
|----------|---------|
| **P0** | Event Explorer, Custom SQL, Profile Lookup, Event Timeline, Saved Queries |
| **P1** | Funnel Analysis, Retention Analysis, Session Analysis, Pre-built Dashboards |
| **P2** | Computed Traits, Audience Segments, Background Worker |
| **P3** | Reverse ETL, Webhook Push, Schema Registry, Data Catalog |

Build in priority order. Do not start P1 before P0 is stable.

---

## Logging Policy (zap)

```
Query requests  → log workspace_id, query_type, duration_ms, rows_returned, cache_hit
Worker jobs     → log job_id, job_kind, args, duration_ms, status (success/error)
Errors          → log full error chain with context
```

---

## Coding Rules

- **Do not write code unless asked** — discuss architecture/features first
- **Ask when scope is unclear**, especially when multiple valid approaches exist
- **YAGNI + KISS** — do not build what is not needed yet
- **Correctness before performance** — optimize only when profiling proves it necessary
- **Every PostgreSQL schema change must have a migration file** in `infra/migrations/`
- **Every ClickHouse query must have a `.sql` file** in `infra/clickhouse/`
- **Every API endpoint must have a request struct with `validate` tags**
- **Never write raw events** — this service is read-side only
- Discuss in **Vietnamese**, write code and comments in **English**

---

## Common Pitfalls

- Do not query ClickHouse directly for computed traits at request time — serve from PostgreSQL
- Do not run full segment scans on every API request — that is the worker's job
- Do not cache Custom SQL results — queries are arbitrary, cache would be useless
- Do not inline complex SQL strings in Go — use `.sql` template files
- Do not return raw `pgx` or `clickhouse-go` errors to HTTP handlers — wrap with `AppError`
- Do not run migrations on server startup — use `make migrate/up` explicitly
- Reverse ETL must push delta only (entered/exited), never the full member list per run
- Workers must be idempotent — `river` retries on failure, job may run more than once
- `service` layer must never import `net/http` or `chi`