Files
2026-04-12 01:06:31 +07:00

232 lines
5.1 KiB
Markdown

# BigQuery Rules
Guidelines for designing schemas specific to Google BigQuery.
---
## Data Types
| Use case | Type | Notes |
|----------|------|-------|
| Integer | `INT64` | |
| Decimal | `NUMERIC` or `BIGNUMERIC` | |
| Float | `FLOAT64` | |
| Boolean | `BOOL` | |
| String | `STRING` | |
| Timestamp | `TIMESTAMP` | |
| Date | `DATE` | |
| Time | `TIME` | |
| DateTime | `DATETIME` | No timezone |
| JSON | `JSON` | |
| Bytes | `BYTES` | |
| Nested | `STRUCT` | Record type |
| Repeated | `ARRAY` | |
---
## Description (NO metadata tables needed)
BigQuery supports description via `OPTIONS`:
### Table description
```sql
CREATE TABLE orders (
...
) OPTIONS(description='Table storing customer orders');
```
### Column descriptions
```sql
CREATE TABLE orders (
id INT64 NOT NULL OPTIONS(description='Primary key'),
order_number STRING NOT NULL OPTIONS(description='Unique order code'),
status STRING NOT NULL OPTIONS(description='pending|confirmed|shipped|cancelled'),
total_amount NUMERIC OPTIONS(description='Total value')
);
```
### Query descriptions
```sql
SELECT column_name, description
FROM `project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'orders';
```
---
## Partitioning & Clustering
BigQuery **does not have traditional indexes**. Instead, use partitioning and clustering:
### Partitioning (required for large tables)
```sql
-- Partition by date column
CREATE TABLE fact_orders (
order_date DATE,
customer_id INT64,
order_id INT64,
amount NUMERIC
)
PARTITION BY order_date;
-- Partition by timestamp (daily)
CREATE TABLE events (
event_time TIMESTAMP,
...
)
PARTITION BY DATE(event_time);
-- Integer range partitioning
CREATE TABLE logs (
log_id INT64,
...
)
PARTITION BY RANGE_BUCKET(log_id, GENERATE_ARRAY(0, 1000000, 10000));
```
### Clustering (optimize filter/group)
```sql
CREATE TABLE fact_orders (
order_date DATE,
customer_id INT64,
product_id INT64,
store_id INT64,
amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, product_id;
-- Can cluster up to 4 columns
```
### When to use what?
- **Partition**: Filter by date range → reduces data scan
- **Cluster**: Filter/Group by specific columns → data is organized closer together
---
## Nested & Repeated Fields (Denormalization)
BigQuery encourages denormalization with STRUCT and ARRAY:
```sql
CREATE TABLE orders (
order_id INT64,
order_date DATE,
customer STRUCT<
id INT64,
name STRING,
email STRING
>,
items ARRAY<STRUCT<
product_id INT64,
product_name STRING,
quantity INT64,
unit_price NUMERIC
>>
);
-- Query nested fields
SELECT
order_id,
customer.name,
item.product_name,
item.quantity
FROM orders, UNNEST(items) AS item;
```
---
## Best Practices
### Query optimization
- **Avoid `SELECT *`** - only select needed columns
- **Filter early** - use partition column in WHERE
- **Avoid cross-join** with large tables
### Table design
- Partition tables > 1GB
- Cluster by frequently filtered/grouped columns
- Denormalize with STRUCT/ARRAY when appropriate
- Avoid too many small tables
### Cost control
- Partition pruning: always filter by partition column
- Use `--dry_run` to estimate cost
- Set up query quotas
---
## Example DDL
```sql
CREATE TABLE `project.dataset.fact_orders` (
-- Keys
order_id INT64 NOT NULL OPTIONS(description='PK from source'),
order_date DATE NOT NULL OPTIONS(description='Order date'),
-- Dimensions (denormalized)
customer_id INT64 NOT NULL,
customer_name STRING,
customer_segment STRING OPTIONS(description='VIP|Regular|New'),
-- Measures
item_count INT64 NOT NULL,
gross_amount NUMERIC NOT NULL,
discount_amount NUMERIC DEFAULT 0,
net_amount NUMERIC NOT NULL,
-- Nested items
items ARRAY<STRUCT<
product_id INT64,
product_name STRING,
quantity INT64,
unit_price NUMERIC
>> OPTIONS(description='Product details in order'),
-- Metadata
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY order_date
CLUSTER BY customer_id
OPTIONS(
description='Fact table for orders, partitioned by date, clustered by customer'
);
```
---
## Scheduled Queries / Materialized Views
### Materialized View
```sql
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
PARTITION BY report_date
CLUSTER BY store_id
AS
SELECT
DATE(order_date) as report_date,
store_id,
COUNT(*) as order_count,
SUM(net_amount) as revenue
FROM `project.dataset.fact_orders`
GROUP BY 1, 2;
```
### Scheduled Query (ETL)
Create via BigQuery Console or API with schedule expression.
---
## Checklist
- [ ] Partition large tables by date column
- [ ] Cluster by frequently filtered/grouped columns (max 4)
- [ ] `OPTIONS(description=...)` for tables and columns
- [ ] Denormalize with STRUCT/ARRAY when appropriate
- [ ] Avoid `SELECT *` and cross-join large tables
- [ ] Filter by partition column in queries
- [ ] Consider Materialized Views for frequently used aggregations