5.1 KiB
5.1 KiB
BigQuery Rules
Guidelines for designing schemas specific to Google BigQuery.
Data Types
| Use case | Type | Notes |
|---|---|---|
| Integer | INT64 |
|
| Decimal | NUMERIC or BIGNUMERIC |
|
| Float | FLOAT64 |
|
| Boolean | BOOL |
|
| String | STRING |
|
| Timestamp | TIMESTAMP |
|
| Date | DATE |
|
| Time | TIME |
|
| DateTime | DATETIME |
No timezone |
| JSON | JSON |
|
| Bytes | BYTES |
|
| Nested | STRUCT |
Record type |
| Repeated | ARRAY |
Description (NO metadata tables needed)
BigQuery supports description via OPTIONS:
Table description
CREATE TABLE orders (
...
) OPTIONS(description='Table storing customer orders');
Column descriptions
CREATE TABLE orders (
id INT64 NOT NULL OPTIONS(description='Primary key'),
order_number STRING NOT NULL OPTIONS(description='Unique order code'),
status STRING NOT NULL OPTIONS(description='pending|confirmed|shipped|cancelled'),
total_amount NUMERIC OPTIONS(description='Total value')
);
Query descriptions
SELECT column_name, description
FROM `project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'orders';
Partitioning & Clustering
BigQuery does not have traditional indexes. Instead, use partitioning and clustering:
Partitioning (required for large tables)
-- Partition by date column
CREATE TABLE fact_orders (
order_date DATE,
customer_id INT64,
order_id INT64,
amount NUMERIC
)
PARTITION BY order_date;
-- Partition by timestamp (daily)
CREATE TABLE events (
event_time TIMESTAMP,
...
)
PARTITION BY DATE(event_time);
-- Integer range partitioning
CREATE TABLE logs (
log_id INT64,
...
)
PARTITION BY RANGE_BUCKET(log_id, GENERATE_ARRAY(0, 1000000, 10000));
Clustering (optimize filter/group)
CREATE TABLE fact_orders (
order_date DATE,
customer_id INT64,
product_id INT64,
store_id INT64,
amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, product_id;
-- Can cluster up to 4 columns
When to use what?
- Partition: Filter by date range → reduces data scan
- Cluster: Filter/Group by specific columns → data is organized closer together
Nested & Repeated Fields (Denormalization)
BigQuery encourages denormalization with STRUCT and ARRAY:
CREATE TABLE orders (
order_id INT64,
order_date DATE,
customer STRUCT<
id INT64,
name STRING,
email STRING
>,
items ARRAY<STRUCT<
product_id INT64,
product_name STRING,
quantity INT64,
unit_price NUMERIC
>>
);
-- Query nested fields
SELECT
order_id,
customer.name,
item.product_name,
item.quantity
FROM orders, UNNEST(items) AS item;
Best Practices
Query optimization
- Avoid
SELECT *- only select needed columns - Filter early - use partition column in WHERE
- Avoid cross-join with large tables
Table design
- Partition tables > 1GB
- Cluster by frequently filtered/grouped columns
- Denormalize with STRUCT/ARRAY when appropriate
- Avoid too many small tables
Cost control
- Partition pruning: always filter by partition column
- Use
--dry_runto estimate cost - Set up query quotas
Example DDL
CREATE TABLE `project.dataset.fact_orders` (
-- Keys
order_id INT64 NOT NULL OPTIONS(description='PK from source'),
order_date DATE NOT NULL OPTIONS(description='Order date'),
-- Dimensions (denormalized)
customer_id INT64 NOT NULL,
customer_name STRING,
customer_segment STRING OPTIONS(description='VIP|Regular|New'),
-- Measures
item_count INT64 NOT NULL,
gross_amount NUMERIC NOT NULL,
discount_amount NUMERIC DEFAULT 0,
net_amount NUMERIC NOT NULL,
-- Nested items
items ARRAY<STRUCT<
product_id INT64,
product_name STRING,
quantity INT64,
unit_price NUMERIC
>> OPTIONS(description='Product details in order'),
-- Metadata
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY order_date
CLUSTER BY customer_id
OPTIONS(
description='Fact table for orders, partitioned by date, clustered by customer'
);
Scheduled Queries / Materialized Views
Materialized View
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
PARTITION BY report_date
CLUSTER BY store_id
AS
SELECT
DATE(order_date) as report_date,
store_id,
COUNT(*) as order_count,
SUM(net_amount) as revenue
FROM `project.dataset.fact_orders`
GROUP BY 1, 2;
Scheduled Query (ETL)
Create via BigQuery Console or API with schedule expression.
Checklist
- Partition large tables by date column
- Cluster by frequently filtered/grouped columns (max 4)
OPTIONS(description=...)for tables and columns- Denormalize with STRUCT/ARRAY when appropriate
- Avoid
SELECT *and cross-join large tables - Filter by partition column in queries
- Consider Materialized Views for frequently used aggregations