init
This commit is contained in:
231
.opencode/skills/databases/stacks/bigquery.md
Normal file
231
.opencode/skills/databases/stacks/bigquery.md
Normal file
@@ -0,0 +1,231 @@
|
||||
# BigQuery Rules
|
||||
|
||||
Guidelines for designing schemas specific to Google BigQuery.
|
||||
|
||||
---
|
||||
|
||||
## Data Types
|
||||
|
||||
| Use case | Type | Notes |
|
||||
|----------|------|-------|
|
||||
| Integer | `INT64` | |
|
||||
| Decimal | `NUMERIC` or `BIGNUMERIC` | |
|
||||
| Float | `FLOAT64` | |
|
||||
| Boolean | `BOOL` | |
|
||||
| String | `STRING` | |
|
||||
| Timestamp | `TIMESTAMP` | |
|
||||
| Date | `DATE` | |
|
||||
| Time | `TIME` | |
|
||||
| DateTime | `DATETIME` | No timezone |
|
||||
| JSON | `JSON` | |
|
||||
| Bytes | `BYTES` | |
|
||||
| Nested | `STRUCT` | Record type |
|
||||
| Repeated | `ARRAY` | |
|
||||
|
||||
---
|
||||
|
||||
## Description (NO metadata tables needed)
|
||||
|
||||
BigQuery supports description via `OPTIONS`:
|
||||
|
||||
### Table description
|
||||
```sql
|
||||
CREATE TABLE orders (
|
||||
...
|
||||
) OPTIONS(description='Table storing customer orders');
|
||||
```
|
||||
|
||||
### Column descriptions
|
||||
```sql
|
||||
CREATE TABLE orders (
|
||||
id INT64 NOT NULL OPTIONS(description='Primary key'),
|
||||
order_number STRING NOT NULL OPTIONS(description='Unique order code'),
|
||||
status STRING NOT NULL OPTIONS(description='pending|confirmed|shipped|cancelled'),
|
||||
total_amount NUMERIC OPTIONS(description='Total value')
|
||||
);
|
||||
```
|
||||
|
||||
### Query descriptions
|
||||
```sql
|
||||
SELECT column_name, description
|
||||
FROM `project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
|
||||
WHERE table_name = 'orders';
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Partitioning & Clustering
|
||||
|
||||
BigQuery **does not have traditional indexes**. Instead, use partitioning and clustering:
|
||||
|
||||
### Partitioning (required for large tables)
|
||||
|
||||
```sql
|
||||
-- Partition by date column
|
||||
CREATE TABLE fact_orders (
|
||||
order_date DATE,
|
||||
customer_id INT64,
|
||||
order_id INT64,
|
||||
amount NUMERIC
|
||||
)
|
||||
PARTITION BY order_date;
|
||||
|
||||
-- Partition by timestamp (daily)
|
||||
CREATE TABLE events (
|
||||
event_time TIMESTAMP,
|
||||
...
|
||||
)
|
||||
PARTITION BY DATE(event_time);
|
||||
|
||||
-- Integer range partitioning
|
||||
CREATE TABLE logs (
|
||||
log_id INT64,
|
||||
...
|
||||
)
|
||||
PARTITION BY RANGE_BUCKET(log_id, GENERATE_ARRAY(0, 1000000, 10000));
|
||||
```
|
||||
|
||||
### Clustering (optimize filter/group)
|
||||
|
||||
```sql
|
||||
CREATE TABLE fact_orders (
|
||||
order_date DATE,
|
||||
customer_id INT64,
|
||||
product_id INT64,
|
||||
store_id INT64,
|
||||
amount NUMERIC
|
||||
)
|
||||
PARTITION BY order_date
|
||||
CLUSTER BY customer_id, product_id;
|
||||
-- Can cluster up to 4 columns
|
||||
```
|
||||
|
||||
### When to use what?
|
||||
- **Partition**: Filter by date range → reduces data scan
|
||||
- **Cluster**: Filter/Group by specific columns → data is organized closer together
|
||||
|
||||
---
|
||||
|
||||
## Nested & Repeated Fields (Denormalization)
|
||||
|
||||
BigQuery encourages denormalization with STRUCT and ARRAY:
|
||||
|
||||
```sql
|
||||
CREATE TABLE orders (
|
||||
order_id INT64,
|
||||
order_date DATE,
|
||||
customer STRUCT<
|
||||
id INT64,
|
||||
name STRING,
|
||||
email STRING
|
||||
>,
|
||||
items ARRAY<STRUCT<
|
||||
product_id INT64,
|
||||
product_name STRING,
|
||||
quantity INT64,
|
||||
unit_price NUMERIC
|
||||
>>
|
||||
);
|
||||
|
||||
-- Query nested fields
|
||||
SELECT
|
||||
order_id,
|
||||
customer.name,
|
||||
item.product_name,
|
||||
item.quantity
|
||||
FROM orders, UNNEST(items) AS item;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Query optimization
|
||||
- **Avoid `SELECT *`** - only select needed columns
|
||||
- **Filter early** - use partition column in WHERE
|
||||
- **Avoid cross-join** with large tables
|
||||
|
||||
### Table design
|
||||
- Partition tables > 1GB
|
||||
- Cluster by frequently filtered/grouped columns
|
||||
- Denormalize with STRUCT/ARRAY when appropriate
|
||||
- Avoid too many small tables
|
||||
|
||||
### Cost control
|
||||
- Partition pruning: always filter by partition column
|
||||
- Use `--dry_run` to estimate cost
|
||||
- Set up query quotas
|
||||
|
||||
---
|
||||
|
||||
## Example DDL
|
||||
|
||||
```sql
|
||||
CREATE TABLE `project.dataset.fact_orders` (
|
||||
-- Keys
|
||||
order_id INT64 NOT NULL OPTIONS(description='PK from source'),
|
||||
order_date DATE NOT NULL OPTIONS(description='Order date'),
|
||||
|
||||
-- Dimensions (denormalized)
|
||||
customer_id INT64 NOT NULL,
|
||||
customer_name STRING,
|
||||
customer_segment STRING OPTIONS(description='VIP|Regular|New'),
|
||||
|
||||
-- Measures
|
||||
item_count INT64 NOT NULL,
|
||||
gross_amount NUMERIC NOT NULL,
|
||||
discount_amount NUMERIC DEFAULT 0,
|
||||
net_amount NUMERIC NOT NULL,
|
||||
|
||||
-- Nested items
|
||||
items ARRAY<STRUCT<
|
||||
product_id INT64,
|
||||
product_name STRING,
|
||||
quantity INT64,
|
||||
unit_price NUMERIC
|
||||
>> OPTIONS(description='Product details in order'),
|
||||
|
||||
-- Metadata
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
|
||||
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
|
||||
)
|
||||
PARTITION BY order_date
|
||||
CLUSTER BY customer_id
|
||||
OPTIONS(
|
||||
description='Fact table for orders, partitioned by date, clustered by customer'
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Scheduled Queries / Materialized Views
|
||||
|
||||
### Materialized View
|
||||
```sql
|
||||
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
|
||||
PARTITION BY report_date
|
||||
CLUSTER BY store_id
|
||||
AS
|
||||
SELECT
|
||||
DATE(order_date) as report_date,
|
||||
store_id,
|
||||
COUNT(*) as order_count,
|
||||
SUM(net_amount) as revenue
|
||||
FROM `project.dataset.fact_orders`
|
||||
GROUP BY 1, 2;
|
||||
```
|
||||
|
||||
### Scheduled Query (ETL)
|
||||
Create via BigQuery Console or API with schedule expression.
|
||||
|
||||
---
|
||||
|
||||
## Checklist
|
||||
|
||||
- [ ] Partition large tables by date column
|
||||
- [ ] Cluster by frequently filtered/grouped columns (max 4)
|
||||
- [ ] `OPTIONS(description=...)` for tables and columns
|
||||
- [ ] Denormalize with STRUCT/ARRAY when appropriate
|
||||
- [ ] Avoid `SELECT *` and cross-join large tables
|
||||
- [ ] Filter by partition column in queries
|
||||
- [ ] Consider Materialized Views for frequently used aggregations
|
||||
Reference in New Issue
Block a user