Files
english/.opencode/skills/databases/analytics.md
2026-04-12 01:06:31 +07:00

199 lines
5.3 KiB
Markdown

# Analytics (OLAP) Rules
> **Note:** Core naming conventions, workflow, and checklist are in `SKILL.md` or `db-design.md` (always loaded).
Guidelines for designing schemas for statistics and reporting tables.
---
## General Principles
- **Separate** from transactional tables - don't mix analytics logic into business tables
- When heavy analytics queries/aggregations repeat → create separate tables
- Use **Star Schema**: Fact tables at center, Dimension tables around
---
## Design Process
### 1. Analyze Statistics Requirements
Ask user to clarify:
- **Analysis dimensions**: by date, by customer, by product, by channel, by region?
- **Granularity**: per order, per item, per day, per month?
- **Metrics**: order_count, revenue, margin, conversion_rate, avg_order_value?
### 2. Define Fact Granularity
**Important**: What does 1 row in fact table represent?
| Fact Table | Granularity | Use case |
|------------|-------------|----------|
| `fact_orders` | 1 row = 1 order | Statistics by order |
| `fact_order_items` | 1 row = 1 order item | Statistics by product |
| `fact_daily_sales` | 1 row = 1 day + store | Daily summary |
### 3. Identify Required Dimensions
Create separate dim table when:
- Reused in multiple places
- Has many descriptive attributes
- Subject to slow changes (Slowly Changing Dimension)
---
## Fact Tables
### Fact table structure
```sql
CREATE TABLE fact_orders (
fact_id BIGINT PRIMARY KEY AUTO_INCREMENT,
-- Dimension keys
date_key INT NOT NULL, -- FK to dim_date
customer_key BIGINT NOT NULL, -- FK to dim_customer
store_key INT,
channel_key INT,
-- Degenerate dimensions (no separate dim needed)
order_id BIGINT NOT NULL,
order_number VARCHAR(50),
-- Measures
item_count INT NOT NULL,
gross_amount DECIMAL(18,2) NOT NULL,
discount_amount DECIMAL(18,2) DEFAULT 0,
net_amount DECIMAL(18,2) NOT NULL,
INDEX idx_fact_orders_date (date_key),
INDEX idx_fact_orders_customer (customer_key),
INDEX idx_fact_orders_date_store (date_key, store_key)
);
```
---
## Dimension Tables
### dim_date (required for every analytics schema)
```sql
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- Format: YYYYMMDD (20241215)
full_date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL, -- 1-4
month INT NOT NULL, -- 1-12
month_name VARCHAR(20), -- 'January', 'February'
week_of_year INT NOT NULL,
day_of_month INT NOT NULL,
day_of_week INT NOT NULL, -- 1=Monday, 7=Sunday
day_name VARCHAR(20),
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN DEFAULT FALSE,
UNIQUE (full_date)
);
-- Pre-populate for multiple years (2020-2030)
```
### dim_customer
```sql
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY AUTO_INCREMENT, -- Surrogate key
customer_id BIGINT NOT NULL, -- Natural key from users
customer_name VARCHAR(255),
email VARCHAR(255),
segment VARCHAR(50), -- 'VIP', 'Regular', 'New'
city VARCHAR(100),
region VARCHAR(100),
first_order_date DATE,
-- SCD Type 2 columns (if history needed)
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN DEFAULT TRUE,
INDEX idx_dim_customer_id (customer_id),
INDEX idx_dim_customer_current (is_current, customer_id)
);
```
---
## Summary Tables (Pre-aggregated)
When pre-aggregation needed for dashboard performance:
```sql
CREATE TABLE summary_daily_sales (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
date_key INT NOT NULL,
store_key INT,
channel_key INT,
-- Pre-aggregated measures
order_count INT NOT NULL,
item_count INT NOT NULL,
gross_revenue DECIMAL(18,2) NOT NULL,
net_revenue DECIMAL(18,2) NOT NULL,
unique_customers INT NOT NULL,
avg_order_value DECIMAL(18,2),
UNIQUE (date_key, store_key, channel_key),
INDEX idx_summary_date (date_key)
);
```
---
## Slowly Changing Dimensions (SCD)
### Type 1 - Overwrite
Overwrite old value, no history kept:
```sql
UPDATE dim_customer SET segment = 'VIP' WHERE customer_id = 123;
```
### Type 2 - Add new row (Recommended when history needed)
```sql
-- 1. Close old row
UPDATE dim_customer
SET effective_to = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;
-- 2. Add new row
INSERT INTO dim_customer (customer_id, segment, effective_from, is_current)
VALUES (123, 'VIP', CURRENT_DATE, TRUE);
```
---
## Indexing for Analytics
### Fact tables
- Index FKs to dimensions: `date_key`, `customer_key`, `product_key`
- Composite index based on query patterns: `INDEX (date_key, store_key)`
### Dimension tables
- PK: surrogate key
- Index natural key: `customer_id`, `product_id`
- Index for SCD: `(is_current, customer_id)`
---
## Naming Convention
- Fact tables: `fact_*` or `fct_*`
- Dimension tables: `dim_*`
- Summary tables: `summary_*` or `agg_*`
---
## Checklist
- [ ] Granularity defined for each fact table
- [ ] dim_date exists or created (pre-populate multiple years)
- [ ] Surrogate keys for dimensions
- [ ] Index FKs in fact tables
- [ ] SCD strategy for changing dimensions (Type 1 or Type 2)
- [ ] Naming: `fact_*`, `dim_*`, `summary_*`
- [ ] Refresh strategy: see [incremental-etl.md](incremental-etl.md)