# BigQuery Rules

Guidelines for designing schemas specific to Google BigQuery.

---

## Data Types

| Use case | Type | Notes |
|----------|------|-------|
| Integer | `INT64` | |
| Decimal | `NUMERIC` or `BIGNUMERIC` | |
| Float | `FLOAT64` | |
| Boolean | `BOOL` | |
| String | `STRING` | |
| Timestamp | `TIMESTAMP` | |
| Date | `DATE` | |
| Time | `TIME` | |
| DateTime | `DATETIME` | No timezone |
| JSON | `JSON` | |
| Bytes | `BYTES` | |
| Nested | `STRUCT` | Record type |
| Repeated | `ARRAY` | |

---

## Description (NO metadata tables needed)

BigQuery supports description via `OPTIONS`:

### Table description
```sql
CREATE TABLE orders (
    ...
) OPTIONS(description='Table storing customer orders');
```

### Column descriptions
```sql
CREATE TABLE orders (
    id INT64 NOT NULL OPTIONS(description='Primary key'),
    order_number STRING NOT NULL OPTIONS(description='Unique order code'),
    status STRING NOT NULL OPTIONS(description='pending|confirmed|shipped|cancelled'),
    total_amount NUMERIC OPTIONS(description='Total value')
);
```

### Query descriptions
```sql
SELECT column_name, description
FROM `project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'orders';
```

---

## Partitioning & Clustering

BigQuery **does not have traditional indexes**. Instead, use partitioning and clustering:

### Partitioning (required for large tables)

```sql
-- Partition by date column
CREATE TABLE fact_orders (
    order_date DATE,
    customer_id INT64,
    order_id INT64,
    amount NUMERIC
)
PARTITION BY order_date;

-- Partition by timestamp (daily)
CREATE TABLE events (
    event_time TIMESTAMP,
    ...
)
PARTITION BY DATE(event_time);

-- Integer range partitioning
CREATE TABLE logs (
    log_id INT64,
    ...
)
PARTITION BY RANGE_BUCKET(log_id, GENERATE_ARRAY(0, 1000000, 10000));
```

### Clustering (optimize filter/group)

```sql
CREATE TABLE fact_orders (
    order_date DATE,
    customer_id INT64,
    product_id INT64,
    store_id INT64,
    amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, product_id;
-- Can cluster up to 4 columns
```

### When to use what?
- **Partition**: Filter by date range → reduces data scan
- **Cluster**: Filter/Group by specific columns → data is organized closer together

---

## Nested & Repeated Fields (Denormalization)

BigQuery encourages denormalization with STRUCT and ARRAY:

```sql
CREATE TABLE orders (
    order_id INT64,
    order_date DATE,
    customer STRUCT<
        id INT64,
        name STRING,
        email STRING
    >,
    items ARRAY<STRUCT<
        product_id INT64,
        product_name STRING,
        quantity INT64,
        unit_price NUMERIC
    >>
);

-- Query nested fields
SELECT
    order_id,
    customer.name,
    item.product_name,
    item.quantity
FROM orders, UNNEST(items) AS item;
```

---

## Best Practices

### Query optimization
- **Avoid `SELECT *`** - only select needed columns
- **Filter early** - use partition column in WHERE
- **Avoid cross-join** with large tables

### Table design
- Partition tables > 1GB
- Cluster by frequently filtered/grouped columns
- Denormalize with STRUCT/ARRAY when appropriate
- Avoid too many small tables

### Cost control
- Partition pruning: always filter by partition column
- Use `--dry_run` to estimate cost
- Set up query quotas

---

## Example DDL

```sql
CREATE TABLE `project.dataset.fact_orders` (
    -- Keys
    order_id INT64 NOT NULL OPTIONS(description='PK from source'),
    order_date DATE NOT NULL OPTIONS(description='Order date'),

    -- Dimensions (denormalized)
    customer_id INT64 NOT NULL,
    customer_name STRING,
    customer_segment STRING OPTIONS(description='VIP|Regular|New'),

    -- Measures
    item_count INT64 NOT NULL,
    gross_amount NUMERIC NOT NULL,
    discount_amount NUMERIC DEFAULT 0,
    net_amount NUMERIC NOT NULL,

    -- Nested items
    items ARRAY<STRUCT<
        product_id INT64,
        product_name STRING,
        quantity INT64,
        unit_price NUMERIC
    >> OPTIONS(description='Product details in order'),

    -- Metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY order_date
CLUSTER BY customer_id
OPTIONS(
    description='Fact table for orders, partitioned by date, clustered by customer'
);
```

---

## Scheduled Queries / Materialized Views

### Materialized View
```sql
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
PARTITION BY report_date
CLUSTER BY store_id
AS
SELECT
    DATE(order_date) as report_date,
    store_id,
    COUNT(*) as order_count,
    SUM(net_amount) as revenue
FROM `project.dataset.fact_orders`
GROUP BY 1, 2;
```

### Scheduled Query (ETL)
Create via BigQuery Console or API with schedule expression.

---

## Checklist

- [ ] Partition large tables by date column
- [ ] Cluster by frequently filtered/grouped columns (max 4)
- [ ] `OPTIONS(description=...)` for tables and columns
- [ ] Denormalize with STRUCT/ARRAY when appropriate
- [ ] Avoid `SELECT *` and cross-join large tables
- [ ] Filter by partition column in queries
- [ ] Consider Materialized Views for frequently used aggregations