Dimensional Modeling for Labor Data — Lessons

The Four-Layer Architecture

Raw/Landing → Staging → Core Warehouse → Analyst Marts

Each layer has a specific job:

Raw/Landing — Store the downloaded file exactly as received. Never modify it. Record the URL, download time, checksum, and source release label. If something goes wrong downstream, you can always reprocess from raw.
Staging — Parse the raw file into relational tables. Standardize column names (everything becomes snake_case). Apply explicit typing. Do not interpret, aggregate, or transform business meaning. Example: stage__bls__oews_national.
Core Warehouse — Conform the staged data into shared dimensions and facts. This is where business meaning gets assigned. Dimensions (dim_occupation, dim_geography, dim_industry) provide context; facts (fact_occupation_employment_wages) record measurements; bridges (bridge_occupation_skill) link many-to-many relationships.
Analyst Marts — Pre-joined, denormalized views optimized for specific questions. No new business logic — only reshaping what the core warehouse already established. Example: occupation_summary joins occupation + wages + skills into one queryable view.

Why Dimensions Matter

A dimension table like dim_geography stores the descriptive attributes of a geographic area: its code (37), type (state), name (North Carolina), and whether it's the current version. The fact table stores only the surrogate key (geography_key = 14), not the full description.

This separation means:

Geographic names can change without rewriting fact history
Multiple facts can reference the same geography without storing redundant strings
Queries filter on dimension attributes and join to facts via integer keys (fast)

The Grain Rule

Every fact table has a "grain" — the combination of dimensions that uniquely identifies a row. For fact_occupation_employment_wages, the grain is:

(reference_period, geography_key, industry_key, ownership_code, occupation_key, source_dataset)

Rule: If you insert a row that duplicates this grain, something has gone wrong. The loader enforces this with delete-before-insert at the (source_dataset, source_release_id) level.

Bridge Tables for Many-to-Many

An occupation has many skills. A skill applies to many occupations. The bridge_occupation_skill table resolves this with three columns: occupation_key, element_id, and a score.

This is deliberately separate from a generic EAV table — each O*NET domain (skills, knowledge, abilities, tasks) gets its own bridge table for clarity. More tables is acceptable when it makes the schema self-documenting.