5.3 KiB
5.3 KiB
Analytics (OLAP) Rules
Note: Core naming conventions, workflow, and checklist are in
SKILL.mdordb-design.md(always loaded).
Guidelines for designing schemas for statistics and reporting tables.
General Principles
- Separate from transactional tables - don't mix analytics logic into business tables
- When heavy analytics queries/aggregations repeat → create separate tables
- Use Star Schema: Fact tables at center, Dimension tables around
Design Process
1. Analyze Statistics Requirements
Ask user to clarify:
- Analysis dimensions: by date, by customer, by product, by channel, by region?
- Granularity: per order, per item, per day, per month?
- Metrics: order_count, revenue, margin, conversion_rate, avg_order_value?
2. Define Fact Granularity
Important: What does 1 row in fact table represent?
| Fact Table | Granularity | Use case |
|---|---|---|
fact_orders |
1 row = 1 order | Statistics by order |
fact_order_items |
1 row = 1 order item | Statistics by product |
fact_daily_sales |
1 row = 1 day + store | Daily summary |
3. Identify Required Dimensions
Create separate dim table when:
- Reused in multiple places
- Has many descriptive attributes
- Subject to slow changes (Slowly Changing Dimension)
Fact Tables
Fact table structure
CREATE TABLE fact_orders (
fact_id BIGINT PRIMARY KEY AUTO_INCREMENT,
-- Dimension keys
date_key INT NOT NULL, -- FK to dim_date
customer_key BIGINT NOT NULL, -- FK to dim_customer
store_key INT,
channel_key INT,
-- Degenerate dimensions (no separate dim needed)
order_id BIGINT NOT NULL,
order_number VARCHAR(50),
-- Measures
item_count INT NOT NULL,
gross_amount DECIMAL(18,2) NOT NULL,
discount_amount DECIMAL(18,2) DEFAULT 0,
net_amount DECIMAL(18,2) NOT NULL,
INDEX idx_fact_orders_date (date_key),
INDEX idx_fact_orders_customer (customer_key),
INDEX idx_fact_orders_date_store (date_key, store_key)
);
Dimension Tables
dim_date (required for every analytics schema)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- Format: YYYYMMDD (20241215)
full_date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL, -- 1-4
month INT NOT NULL, -- 1-12
month_name VARCHAR(20), -- 'January', 'February'
week_of_year INT NOT NULL,
day_of_month INT NOT NULL,
day_of_week INT NOT NULL, -- 1=Monday, 7=Sunday
day_name VARCHAR(20),
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN DEFAULT FALSE,
UNIQUE (full_date)
);
-- Pre-populate for multiple years (2020-2030)
dim_customer
CREATE TABLE dim_customer (
customer_key BIGINT PRIMARY KEY AUTO_INCREMENT, -- Surrogate key
customer_id BIGINT NOT NULL, -- Natural key from users
customer_name VARCHAR(255),
email VARCHAR(255),
segment VARCHAR(50), -- 'VIP', 'Regular', 'New'
city VARCHAR(100),
region VARCHAR(100),
first_order_date DATE,
-- SCD Type 2 columns (if history needed)
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN DEFAULT TRUE,
INDEX idx_dim_customer_id (customer_id),
INDEX idx_dim_customer_current (is_current, customer_id)
);
Summary Tables (Pre-aggregated)
When pre-aggregation needed for dashboard performance:
CREATE TABLE summary_daily_sales (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
date_key INT NOT NULL,
store_key INT,
channel_key INT,
-- Pre-aggregated measures
order_count INT NOT NULL,
item_count INT NOT NULL,
gross_revenue DECIMAL(18,2) NOT NULL,
net_revenue DECIMAL(18,2) NOT NULL,
unique_customers INT NOT NULL,
avg_order_value DECIMAL(18,2),
UNIQUE (date_key, store_key, channel_key),
INDEX idx_summary_date (date_key)
);
Slowly Changing Dimensions (SCD)
Type 1 - Overwrite
Overwrite old value, no history kept:
UPDATE dim_customer SET segment = 'VIP' WHERE customer_id = 123;
Type 2 - Add new row (Recommended when history needed)
-- 1. Close old row
UPDATE dim_customer
SET effective_to = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 123 AND is_current = TRUE;
-- 2. Add new row
INSERT INTO dim_customer (customer_id, segment, effective_from, is_current)
VALUES (123, 'VIP', CURRENT_DATE, TRUE);
Indexing for Analytics
Fact tables
- Index FKs to dimensions:
date_key,customer_key,product_key - Composite index based on query patterns:
INDEX (date_key, store_key)
Dimension tables
- PK: surrogate key
- Index natural key:
customer_id,product_id - Index for SCD:
(is_current, customer_id)
Naming Convention
- Fact tables:
fact_*orfct_* - Dimension tables:
dim_* - Summary tables:
summary_*oragg_*
Checklist
- Granularity defined for each fact table
- dim_date exists or created (pre-populate multiple years)
- Surrogate keys for dimensions
- Index FKs in fact tables
- SCD strategy for changing dimensions (Type 1 or Type 2)
- Naming:
fact_*,dim_*,summary_* - Refresh strategy: see incremental-etl.md