Files
2026-04-12 01:06:31 +07:00

5.1 KiB

BigQuery Rules

Guidelines for designing schemas specific to Google BigQuery.


Data Types

Use case Type Notes
Integer INT64
Decimal NUMERIC or BIGNUMERIC
Float FLOAT64
Boolean BOOL
String STRING
Timestamp TIMESTAMP
Date DATE
Time TIME
DateTime DATETIME No timezone
JSON JSON
Bytes BYTES
Nested STRUCT Record type
Repeated ARRAY

Description (NO metadata tables needed)

BigQuery supports description via OPTIONS:

Table description

CREATE TABLE orders (
    ...
) OPTIONS(description='Table storing customer orders');

Column descriptions

CREATE TABLE orders (
    id INT64 NOT NULL OPTIONS(description='Primary key'),
    order_number STRING NOT NULL OPTIONS(description='Unique order code'),
    status STRING NOT NULL OPTIONS(description='pending|confirmed|shipped|cancelled'),
    total_amount NUMERIC OPTIONS(description='Total value')
);

Query descriptions

SELECT column_name, description
FROM `project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name = 'orders';

Partitioning & Clustering

BigQuery does not have traditional indexes. Instead, use partitioning and clustering:

Partitioning (required for large tables)

-- Partition by date column
CREATE TABLE fact_orders (
    order_date DATE,
    customer_id INT64,
    order_id INT64,
    amount NUMERIC
)
PARTITION BY order_date;

-- Partition by timestamp (daily)
CREATE TABLE events (
    event_time TIMESTAMP,
    ...
)
PARTITION BY DATE(event_time);

-- Integer range partitioning
CREATE TABLE logs (
    log_id INT64,
    ...
)
PARTITION BY RANGE_BUCKET(log_id, GENERATE_ARRAY(0, 1000000, 10000));

Clustering (optimize filter/group)

CREATE TABLE fact_orders (
    order_date DATE,
    customer_id INT64,
    product_id INT64,
    store_id INT64,
    amount NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id, product_id;
-- Can cluster up to 4 columns

When to use what?

  • Partition: Filter by date range → reduces data scan
  • Cluster: Filter/Group by specific columns → data is organized closer together

Nested & Repeated Fields (Denormalization)

BigQuery encourages denormalization with STRUCT and ARRAY:

CREATE TABLE orders (
    order_id INT64,
    order_date DATE,
    customer STRUCT<
        id INT64,
        name STRING,
        email STRING
    >,
    items ARRAY<STRUCT<
        product_id INT64,
        product_name STRING,
        quantity INT64,
        unit_price NUMERIC
    >>
);

-- Query nested fields
SELECT
    order_id,
    customer.name,
    item.product_name,
    item.quantity
FROM orders, UNNEST(items) AS item;

Best Practices

Query optimization

  • Avoid SELECT * - only select needed columns
  • Filter early - use partition column in WHERE
  • Avoid cross-join with large tables

Table design

  • Partition tables > 1GB
  • Cluster by frequently filtered/grouped columns
  • Denormalize with STRUCT/ARRAY when appropriate
  • Avoid too many small tables

Cost control

  • Partition pruning: always filter by partition column
  • Use --dry_run to estimate cost
  • Set up query quotas

Example DDL

CREATE TABLE `project.dataset.fact_orders` (
    -- Keys
    order_id INT64 NOT NULL OPTIONS(description='PK from source'),
    order_date DATE NOT NULL OPTIONS(description='Order date'),

    -- Dimensions (denormalized)
    customer_id INT64 NOT NULL,
    customer_name STRING,
    customer_segment STRING OPTIONS(description='VIP|Regular|New'),

    -- Measures
    item_count INT64 NOT NULL,
    gross_amount NUMERIC NOT NULL,
    discount_amount NUMERIC DEFAULT 0,
    net_amount NUMERIC NOT NULL,

    -- Nested items
    items ARRAY<STRUCT<
        product_id INT64,
        product_name STRING,
        quantity INT64,
        unit_price NUMERIC
    >> OPTIONS(description='Product details in order'),

    -- Metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
)
PARTITION BY order_date
CLUSTER BY customer_id
OPTIONS(
    description='Fact table for orders, partitioned by date, clustered by customer'
);

Scheduled Queries / Materialized Views

Materialized View

CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
PARTITION BY report_date
CLUSTER BY store_id
AS
SELECT
    DATE(order_date) as report_date,
    store_id,
    COUNT(*) as order_count,
    SUM(net_amount) as revenue
FROM `project.dataset.fact_orders`
GROUP BY 1, 2;

Scheduled Query (ETL)

Create via BigQuery Console or API with schedule expression.


Checklist

  • Partition large tables by date column
  • Cluster by frequently filtered/grouped columns (max 4)
  • OPTIONS(description=...) for tables and columns
  • Denormalize with STRUCT/ARRAY when appropriate
  • Avoid SELECT * and cross-join large tables
  • Filter by partition column in queries
  • Consider Materialized Views for frequently used aggregations