init ingestion
data-layer/CLAUDE_analytics.md
# CLAUDE.md — CDP Analytics Service

> You are a senior software engineer building the **Analytics & Data Layer** for a self-hosted CDP platform.
> This service focuses on **query, explore, and activate** data already ingested into ClickHouse.
>
> **Scope boundary**: Read-side only. Never write raw events. Ingestion is handled by `cdp-ingestion`.

---

## What This Service Does

Exposes ingested event data through the Query API for exploration and analysis. Background workers compute Traits and Audience Segments from event history. Segments are activated to external tools via Reverse ETL and webhooks.

---

## Repository Layout

```
cdp-analytics/
├── api/              # Go — Query API, Profile API (port 4000)
├── workers/          # Go — Background jobs: Computed Traits, Segment refresh
├── console/          # React + Vite + shadcn/ui + Tailwind — Analytics UI
└── infra/
    ├── migrations/   # PostgreSQL migrations (golang-migrate)
    └── clickhouse/   # ClickHouse query templates (.sql files)
```

---

## Tech Stack

### Go Services (api, workers)

| Concern | Library | Notes |
|---------|---------|-------|
| HTTP router | `chi` | Lightweight, stdlib-compatible middleware |
| Logger | `zap` | Structured, leveled logging with low overhead |
| PostgreSQL | `pgx/v5` | Native driver, no `database/sql` wrapper |
| ClickHouse | `clickhouse-go/v2` | Official driver, native protocol, good batch support |
| Redis | `rueidis` | Modern client, faster than go-redis |
| Job queue | `riverqueue/river` | Postgres-backed, pgx/v5 native, built-in scheduler + retry |
| Config | `caarlos0/env` | Parse env vars into structs, zero deps |
| Validation | `go-playground/validator/v10` | Struct tag validation |
| Migration | `golang-migrate` + pgx driver | CLI only — never auto-migrate on startup |
| Test assertion | `testify` | assert + require + mock |
| Integration test | `testcontainers-go` | Real PG / Redis / ClickHouse in tests |

### React Console (console/)

| Concern | Library |
|---------|---------|
| Build | Vite |
| UI components | shadcn/ui + Tailwind |
| Routing | React Router v6 |
| Server state | TanStack Query |
| Client state | Zustand |
| Forms | react-hook-form + zod |
| Charts | Recharts |
| Icons | lucide-react |

> **No new technology** without discussion. All additions must justify why existing stack cannot handle it.

---

## Go Project Structure

### api/

```
api/
├── cmd/
│   └── server/
│       └── main.go       # wire everything, start server
└── internal/
    ├── handler/          # HTTP handlers — parse request, call service, write response
    ├── service/          # business logic — no HTTP, no DB concerns
    ├── repo/             # DB queries — PostgreSQL via pgx, ClickHouse via clickhouse-go
    ├── middleware/       # auth, request ID, logging
    └── config/           # env parsing via caarlos0/env
```

### workers/

```
workers/
├── cmd/
│   └── worker/
│       └── main.go       # register jobs, start river worker
└── internal/
    ├── job/              # job definitions (ComputeTraitsJob, RefreshSegmentJob, ReverseETLJob)
    ├── handler/          # job handlers — business logic per job type
    ├── repo/             # DB queries shared across job handlers
    └── config/
```

Rules:

- In api/, `handler` depends on `service`; in workers/, `handler` depends on `repo`. Never the reverse.
- `handler` never touches the DB directly in api/.
- `service` never imports `chi` or any HTTP package.
- `repo` returns domain types, never raw `pgx.Rows` or `driver.Rows`.
- ClickHouse queries live as `.sql` files in `infra/clickhouse/` — no inline SQL strings for complex queries.
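As a rough illustration of the dependency direction in the rules above, the sketch below wires a handler to a service to a repo through an interface. Everything in it (`Segment`, `SegmentRepo`, `SegmentService`, `SegmentHandler`, `memRepo`, `serve`) is a hypothetical name chosen for the example, the types share one file only to keep it runnable, and error mapping is simplified to a plain 400 rather than the `AppError` flow. In the real layout each type would live in its own `internal/` package.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Segment is a domain type returned by the repo layer (hypothetical fields).
type Segment struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// SegmentRepo is the only layer allowed to talk to a database.
type SegmentRepo interface {
	ListSegments(ctx context.Context, workspaceID string) ([]Segment, error)
}

// SegmentService holds business logic and uses no HTTP types.
type SegmentService struct{ repo SegmentRepo }

func (s *SegmentService) List(ctx context.Context, workspaceID string) ([]Segment, error) {
	if workspaceID == "" {
		return nil, fmt.Errorf("workspace id is required")
	}
	return s.repo.ListSegments(ctx, workspaceID)
}

// SegmentHandler parses the request, calls the service, writes the response.
type SegmentHandler struct{ svc *SegmentService }

func (h *SegmentHandler) List(w http.ResponseWriter, r *http.Request) {
	segments, err := h.svc.List(r.Context(), r.URL.Query().Get("workspace_id"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(segments)
}

// memRepo is an in-memory stand-in for a pgx-backed repo.
type memRepo struct{}

func (memRepo) ListSegments(_ context.Context, _ string) ([]Segment, error) {
	return []Segment{{ID: "s_1", Name: "Power users"}}, nil
}

// serve runs one GET request through the handler and returns status and body.
func serve(url string) (int, string) {
	h := &SegmentHandler{svc: &SegmentService{repo: memRepo{}}}
	rec := httptest.NewRecorder()
	h.List(rec, httptest.NewRequest(http.MethodGet, url, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := serve("/segments?workspace_id=ws_123")
	fmt.Println(code, body)
}
```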

---

## Error Handling

Same `AppError` pattern as ingestion. Never return raw `pgx` or `clickhouse-go` errors to handlers.

```go
// internal/apperr/apperr.go
package apperr

import "net/http"

type AppError struct {
	Code    int    // HTTP status code to return
	Message string // user-facing message (safe to expose)
	Field   string // optional: which field caused the error
	Err     error  // original error for logging (not exposed to user)
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

// Constructors
func BadRequest(msg, field string, err error) *AppError {
	return &AppError{Code: http.StatusBadRequest, Message: msg, Field: field, Err: err}
}
func NotFound(msg string) *AppError {
	return &AppError{Code: http.StatusNotFound, Message: msg}
}
func Forbidden(msg string) *AppError {
	return &AppError{Code: http.StatusForbidden, Message: msg}
}
func Internal(err error) *AppError {
	return &AppError{Code: http.StatusInternalServerError, Message: "internal server error", Err: err}
}
```

Handler pattern — one place handles all errors:

```go
// render is assumed to be a project-local helper with JSON(w, status, v);
// go-chi/render's JSON takes (w, r, v) instead.
func writeError(w http.ResponseWriter, err error) {
	var appErr *apperr.AppError
	if errors.As(err, &appErr) {
		render.JSON(w, appErr.Code, ErrorResponse{Error: appErr.Message, Field: appErr.Field})
		return
	}
	render.JSON(w, http.StatusInternalServerError, ErrorResponse{Error: "internal server error"})
}
```
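To make the wrapping rule concrete, here is a minimal, self-contained sketch of a repo-style function that translates a driver error into an `AppError` before it leaves the repo layer, and of the `errors.As` lookup that `writeError` relies on. `errNoRows`, `lookup`, `findProfile`, and `statusFor` are hypothetical names for the example; `errNoRows` only stands in for a driver sentinel such as `pgx.ErrNoRows`. Because `AppError` implements `Unwrap`, the original cause stays reachable for logging even though only the safe message is exposed.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// AppError mirrors the struct from internal/apperr.
type AppError struct {
	Code    int
	Message string
	Field   string
	Err     error
}

func (e *AppError) Error() string { return e.Message }
func (e *AppError) Unwrap() error { return e.Err }

// errNoRows stands in for a driver sentinel (assumption for this sketch).
var errNoRows = errors.New("no rows in result set")

// lookup simulates a driver call: "p_1" exists, "" fails, anything else is missing.
func lookup(id string) error {
	switch id {
	case "p_1":
		return nil
	case "":
		return fmt.Errorf("query failed: %w", errors.New("connection refused"))
	default:
		return fmt.Errorf("scan profile %q: %w", id, errNoRows)
	}
}

// findProfile is a hypothetical repo method that never leaks driver errors.
func findProfile(id string) (string, error) {
	err := lookup(id)
	switch {
	case err == nil:
		return "profile:" + id, nil
	case errors.Is(err, errNoRows):
		return "", &AppError{Code: http.StatusNotFound, Message: "profile not found", Err: err}
	default:
		return "", &AppError{Code: http.StatusInternalServerError, Message: "internal server error", Err: err}
	}
}

// statusFor recovers the AppError the way writeError does, with a 500 fallback.
func statusFor(err error) (int, string) {
	var appErr *AppError
	if errors.As(err, &appErr) {
		return appErr.Code, appErr.Message
	}
	return http.StatusInternalServerError, "internal server error"
}

func main() {
	_, err := findProfile("p_404")
	// Extra wrapping by the service layer does not hide the AppError.
	code, msg := statusFor(fmt.Errorf("service: %w", err))
	fmt.Println(code, msg)
	fmt.Println(errors.Is(err, errNoRows)) // cause is still available for logs
}
```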

---

## ClickHouse Query Pattern

Use raw SQL only. No query builder — ClickHouse SQL has its own syntax that builders handle poorly.

```
infra/clickhouse/
├── event_explorer.sql
├── funnel_analysis.sql
├── retention_cohort.sql
└── session_analysis.sql
```

Load templates at startup, inject parameters safely:

```go
// Never fmt.Sprintf into SQL — use named parameters
query, err := templates.Load("funnel_analysis.sql")
rows, err := chConn.Query(ctx, query, clickhouse.Named("workspace_id", id), ...)
```

Rules:

- All ClickHouse queries must have a corresponding `.sql` file in `infra/clickhouse/`
- No multi-line SQL strings inline in Go code
- Every ClickHouse schema change must have a DDL file in `infra/clickhouse/`
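One possible shape for the `templates.Load` helper used above is sketched here, assuming the `.sql` files are read once at startup from an `fs.FS` (for example `os.DirFS("infra/clickhouse")` or an `embed.FS`). The `Templates` type, `LoadTemplates`, and `demoFS` are assumptions made for this example, not part of the existing codebase, and the `@workspace_id` placeholder assumes the query is later bound with `clickhouse.Named`. Failing on an unknown template name turns a typo into an explicit error instead of an empty query.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strings"
	"testing/fstest"
)

// Templates holds SQL text keyed by file name, loaded once at startup.
type Templates struct{ byName map[string]string }

// LoadTemplates reads every *.sql file at the root of fsys.
func LoadTemplates(fsys fs.FS) (*Templates, error) {
	names, err := fs.Glob(fsys, "*.sql")
	if err != nil {
		return nil, fmt.Errorf("glob templates: %w", err)
	}
	t := &Templates{byName: make(map[string]string, len(names))}
	for _, name := range names {
		raw, err := fs.ReadFile(fsys, name)
		if err != nil {
			return nil, fmt.Errorf("read %s: %w", name, err)
		}
		t.byName[path.Base(name)] = strings.TrimSpace(string(raw))
	}
	return t, nil
}

// Load returns the SQL for name, or an error if no such file was loaded.
func (t *Templates) Load(name string) (string, error) {
	q, ok := t.byName[name]
	if !ok {
		return "", fmt.Errorf("unknown query template %q", name)
	}
	return q, nil
}

// demoFS is an in-memory stand-in for the infra/clickhouse directory.
func demoFS() fs.FS {
	return fstest.MapFS{
		"funnel_analysis.sql": {Data: []byte("SELECT count() FROM events WHERE workspace_id = @workspace_id\n")},
	}
}

func main() {
	t, err := LoadTemplates(demoFS())
	if err != nil {
		panic(err)
	}
	q, err := t.Load("funnel_analysis.sql")
	fmt.Println(q, err)
	_, err = t.Load("missing.sql")
	fmt.Println(err)
}
```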

---

## Job Queue (river)

Background workers use `riverqueue/river` backed by PostgreSQL.

```go
// Define a job
type ComputeTraitsArgs struct {
	WorkspaceID string `json:"workspace_id"`
	TraitID     string `json:"trait_id"`
}

func (ComputeTraitsArgs) Kind() string { return "compute_traits" }

// Register handler
workers := river.NewWorkers()
river.AddWorker(workers, &ComputeTraitsWorker{repo: repo})

// Enqueue (Insert returns the inserted job and an error)
_, err := client.Insert(ctx, ComputeTraitsArgs{WorkspaceID: "ws_123", TraitID: "t_456"}, nil)
```

Scheduled jobs (periodic):

```go
// Hourly trait recompute, hourly segment refresh.
// Periodic jobs are built with river.NewPeriodicJob; the PeriodicJob
// struct has no exported fields to set directly.
river.NewPeriodicJob(
	river.PeriodicInterval(time.Hour),
	func() (river.JobArgs, *river.InsertOpts) {
		return ComputeTraitsArgs{}, nil
	},
	nil, // optional *river.PeriodicJobOpts
)
```

Rules:

- Workers must be idempotent — river may retry on failure
- Use `river`'s built-in retry with exponential backoff, do not implement custom retry
- Log job start, job end, duration, and error with full context (job_id, args)

---

## Cache Strategy (Redis)

Semantic key structure — allows per-workspace invalidation:

```
cache:query:events:{workspace_id}:{hash(params)}      TTL 60s
cache:query:funnel:{workspace_id}:{hash(params)}      TTL 60s
cache:query:retention:{workspace_id}:{hash(params)}   TTL 60s
cache:dashboard:{workspace_id}                        TTL 60s
cache:profile:{workspace_id}:{profile_id}             TTL 30s
```

Rules:

- Default TTL: 60s for aggregate queries, 30s for profile lookups
- TTL is configurable per query type via env vars
- On cache miss: query ClickHouse, write result to Redis, return result
- Never cache Custom SQL results — each query is arbitrary
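The key layout and the cache-miss rule above could be sketched roughly as follows. This is an illustration under assumptions: the `Cache` interface, `QueryKey`, `GetOrCompute`, `memCache`, and `demo` are hypothetical names, the in-memory cache ignores TTL and stands in for a `rueidis`-backed one, and SHA-256 over JSON-encoded params is just one way to realise `hash(params)`. `encoding/json` writes map keys in sorted order, so two requests with the same params produce the same key regardless of map iteration order.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Cache is the small surface the service layer needs.
type Cache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

// QueryKey builds cache:query:{type}:{workspace_id}:{hash(params)}.
func QueryKey(queryType, workspaceID string, params map[string]any) (string, error) {
	raw, err := json.Marshal(params)
	if err != nil {
		return "", fmt.Errorf("marshal params: %w", err)
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("cache:query:%s:%s:%s", queryType, workspaceID, hex.EncodeToString(sum[:])), nil
}

// GetOrCompute returns the cached value, or runs the query, stores the
// result with the given TTL, and returns it. hit reports a cache hit.
func GetOrCompute(ctx context.Context, c Cache, key string, ttl time.Duration,
	run func(context.Context) ([]byte, error)) (val []byte, hit bool, err error) {
	if cached, ok := c.Get(ctx, key); ok {
		return cached, true, nil
	}
	val, err = run(ctx)
	if err != nil {
		return nil, false, err
	}
	c.Set(ctx, key, val, ttl)
	return val, false, nil
}

// memCache is an in-memory stand-in that ignores TTL.
type memCache map[string][]byte

func (m memCache) Get(_ context.Context, key string) ([]byte, bool) {
	v, ok := m[key]
	return v, ok
}

func (m memCache) Set(_ context.Context, key string, val []byte, _ time.Duration) {
	m[key] = val
}

// demo runs the same query twice and reports hits and backend calls.
func demo() (firstHit, secondHit bool, calls int) {
	ctx, cache := context.Background(), memCache{}
	key, _ := QueryKey("funnel", "ws_123", map[string]any{"steps": 3, "window": "7d"})
	run := func(context.Context) ([]byte, error) {
		calls++ // stands in for a ClickHouse round trip
		return []byte(`{"rows":[]}`), nil
	}
	_, firstHit, _ = GetOrCompute(ctx, cache, key, 60*time.Second, run)
	_, secondHit, _ = GetOrCompute(ctx, cache, key, 60*time.Second, run)
	return firstHit, secondHit, calls
}

func main() {
	fmt.Println(demo())
}
```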

---

## Custom SQL Sandbox

`POST /query/sql` allows arbitrary SQL on ClickHouse. Two layers of protection:

**Layer 1 — App-level parse (Go):**

```go
// Keywords are matched on word boundaries so that identifiers such as
// created_at or updated_at are not rejected by a substring match.
var forbiddenKeyword = regexp.MustCompile(`\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)\b`)

// Reject anything that is not a SELECT statement
func validateReadOnly(sql string) error {
	normalized := strings.TrimSpace(strings.ToUpper(sql))
	if !strings.HasPrefix(normalized, "SELECT") {
		return apperr.BadRequest("only SELECT statements are allowed", "sql", nil)
	}
	// Reject common DDL/DML keywords
	if kw := forbiddenKeyword.FindString(normalized); kw != "" {
		return apperr.BadRequest("statement contains forbidden keyword: "+kw, "sql", nil)
	}
	return nil
}
```

**Layer 2 — ClickHouse read-only user:**

- Custom SQL queries run as a separate ClickHouse user with `SELECT`-only grants
- DDL/DML rejected at DB level even if app-level check is bypassed

---

## Testing Strategy

### Unit tests — handler + service layer

- Mock interfaces with `testify/mock`
- No real DB, no real Redis, no real ClickHouse
- File: `foo_test.go` alongside the file being tested

### Integration tests — repo layer only

- Use `testcontainers-go` to spin up real PostgreSQL, Redis, ClickHouse
- File: `internal/repo/event_repo_test.go`
- Tag: `//go:build integration`

```bash
make test               # unit only (fast, no containers)
make test/integration   # repo layer with real DBs (slower, CI)
```

---

## Migration Workflow

```bash
make migrate/new name=add_profile_traits   # create up+down files
make migrate/up                            # apply all pending
make migrate/down                          # rollback one step
make migrate/status                        # show current version
```

- Migration files: `infra/migrations/{version}_{name}.up.sql` + `.down.sql`
- **Never** auto-run migrations on server startup
- Every PostgreSQL schema change **must** have a migration file

---

## PostgreSQL Schema (Analytics-owned tables)

```sql
-- Computed trait values per profile
CREATE TABLE profile_traits (
    profile_id  UUID,
    trait_key   TEXT,
    trait_value JSONB,
    computed_at TIMESTAMPTZ
);

-- Segment membership history (used for delta Reverse ETL)
CREATE TABLE segment_memberships (
    segment_id UUID,
    profile_id UUID,
    entered_at TIMESTAMPTZ,
    exited_at  TIMESTAMPTZ  -- NULL = currently a member
);
```
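The `segment_memberships` history is what makes delta-only Reverse ETL possible: a worker can compare the currently open memberships (`exited_at IS NULL`) against the result of the latest full re-evaluation and push only the difference. The sketch below shows that comparison in isolation. `Delta` and `ComputeDelta` are hypothetical names, and profile IDs are plain strings for brevity. Running it a second time against the already-updated state yields an empty delta, which is the idempotency property the workers need when `river` retries a job.

```go
package main

import (
	"fmt"
	"sort"
)

// Delta lists profiles that entered or exited a segment between two evaluations.
type Delta struct {
	Entered []string
	Exited  []string
}

// ComputeDelta compares the open memberships with the freshly evaluated
// member list. Duplicate IDs in evaluated are counted once.
func ComputeDelta(current, evaluated []string) Delta {
	cur := make(map[string]struct{}, len(current))
	for _, id := range current {
		cur[id] = struct{}{}
	}

	var d Delta
	seen := make(map[string]struct{}, len(evaluated))
	for _, id := range evaluated {
		if _, dup := seen[id]; dup {
			continue
		}
		seen[id] = struct{}{}
		if _, ok := cur[id]; !ok {
			d.Entered = append(d.Entered, id) // new member: insert a row with entered_at
		}
	}
	for id := range cur {
		if _, ok := seen[id]; !ok {
			d.Exited = append(d.Exited, id) // no longer matches: set exited_at
		}
	}

	// Sort for deterministic output; map iteration order is random.
	sort.Strings(d.Entered)
	sort.Strings(d.Exited)
	return d
}

func main() {
	d := ComputeDelta([]string{"p1", "p2", "p3"}, []string{"p2", "p3", "p4"})
	fmt.Println(d.Entered, d.Exited) // [p4] [p1]
}
```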

---

## Data Sources (Read-only)

This service **only reads** data written by `cdp-ingestion`. Never write to these tables.

| Source | Data |
|--------|------|
| ClickHouse `events` | Flattened, schema-managed raw events |
| PostgreSQL `profiles` | Identity graph, unified profiles |
| PostgreSQL `sources` / `destinations` | Config metadata |
| PostgreSQL `schemas` | Schema registry from ingestion |

---

## Key Design Decisions

| Problem | Decision |
|---------|---------|
| Job queue | `river` on PostgreSQL — no Temporal, no Celery |
| Computed Traits refresh | Hourly default, configurable per trait |
| Segment re-evaluate | Full re-evaluate — simpler than incremental |
| Query cache | Redis semantic keys, TTL 60s default |
| Custom SQL | App-level SELECT-only check + ClickHouse read-only user |
| Reverse ETL | Delta only (entered/exited) — never push full member list |
| ClickHouse queries | Raw SQL in `.sql` template files — no query builder |
| Scaling | Vertical — increase RAM/CPU, not instances |
| Migration | CLI only — never auto-migrate on startup |

---

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/query/events` | Filter + query raw events |
| `POST` | `/query/sql` | Custom SQL on ClickHouse (SELECT only) |
| `POST` | `/query/funnel` | Funnel analysis |
| `POST` | `/query/retention` | Retention cohort |
| `GET` | `/profiles/:id` | Unified profile lookup |
| `GET` | `/profiles/:id/events` | User event timeline |
| `GET` | `/segments` | List segments |
| `POST` | `/segments` | Create segment |
| `GET` | `/segments/:id/members` | Segment members |
| `GET` | `/traits/definitions` | List computed trait definitions |
| `GET` | `/health` | Health check |
| `GET` | `/ready` | Readiness check |

Every endpoint must have a request struct with `validate` tags. Validation runs before any business logic.

---

## Feature Priorities

| Priority | Features |
|----------|---------|
| **P0** | Event Explorer, Custom SQL, Profile Lookup, Event Timeline, Saved Queries |
| **P1** | Funnel Analysis, Retention Analysis, Session Analysis, Pre-built Dashboards |
| **P2** | Computed Traits, Audience Segments, Background Worker |
| **P3** | Reverse ETL, Webhook Push, Schema Registry, Data Catalog |

Build in priority order. Do not start P1 before P0 is stable.

---

## Logging Policy (zap)

```
Query requests  → log workspace_id, query_type, duration_ms, rows_returned, cache_hit
Worker jobs     → log job_id, job_kind, args, duration_ms, status (success/error)
Errors          → log full error chain with context
```

---

## Coding Rules

- **Do not write code unless asked** — discuss architecture/features first
- **Ask when scope is unclear**, especially when multiple valid approaches exist
- **YAGNI + KISS** — do not build what is not needed yet
- **Correctness before performance** — optimize only when profiling proves it necessary
- **Every PostgreSQL schema change must have a migration file** in `infra/migrations/`
- **Every ClickHouse query must have a `.sql` file** in `infra/clickhouse/`
- **Every API endpoint must have a request struct with `validate` tags**
- **Never write raw events** — this service is read-side only
- Discuss in **Vietnamese**, write code and comments in **English**

---

## Common Pitfalls

- Do not query ClickHouse directly for computed traits at request time — serve from PostgreSQL
- Do not run full segment scans on every API request — that is the worker's job
- Do not cache Custom SQL results — queries are arbitrary, cache would be useless
- Do not inline complex SQL strings in Go — use `.sql` template files
- Do not return raw `pgx` or `clickhouse-go` errors to HTTP handlers — wrap with `AppError`
- Do not run migrations on server startup — use `make migrate/up` explicitly
- Reverse ETL must push delta only (entered/exited), never the full member list per run
- Workers must be idempotent — `river` retries on failure, job may run more than once
- `service` layer must never import `net/http` or `chi`