Dimensional Modeling for Labor Data
The Four-Layer Architecture
Raw/Landing → Staging → Core Warehouse → Analyst Marts
Each layer has a specific job:
- Raw/Landing — Store the downloaded file exactly as received. Never modify it. Record the URL, download time, checksum, and source release label. If something goes wrong downstream, you can always reprocess from raw.
- Staging — Parse the raw file into relational tables. Standardize column names (everything becomes snake_case). Apply explicit typing. Do not interpret, aggregate, or transform business meaning. Example: stage__bls__oews_national.
- Core Warehouse — Conform the staged data into shared dimensions and facts. This is where business meaning gets assigned. Dimensions (dim_occupation, dim_geography, dim_industry) provide context; facts (fact_occupation_employment_wages) record measurements; bridges (bridge_occupation_skill) link many-to-many relationships.
- Analyst Marts — Pre-joined, denormalized views optimized for specific questions. No new business logic — only reshaping what the core warehouse already established. Example: occupation_summary joins occupation + wages + skills into one queryable view.
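The staging step above can be sketched in a few lines. This is a minimal illustration, not the actual loader: the column headers and sample row are hypothetical, and a real OEWS file has many more fields.

```python
import csv
import io
import re

def to_snake_case(name: str) -> str:
    """Normalize a raw column header to snake_case."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())          # punctuation -> underscore
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)         # split camelCase boundaries
    return name.strip("_").lower()

# Hypothetical raw extract: headers as published, every value a string.
raw = io.StringIO("OCC_CODE,OCC_TITLE,TOT_EMP\n15-1252,Software Developers,1534790\n")

reader = csv.DictReader(raw)
columns = {h: to_snake_case(h) for h in reader.fieldnames}

staged = [
    {
        columns["OCC_CODE"]: row["OCC_CODE"],        # keep as text: codes are not numbers
        columns["OCC_TITLE"]: row["OCC_TITLE"],
        columns["TOT_EMP"]: int(row["TOT_EMP"]),     # explicit typing, no business meaning
    }
    for row in reader
]
```

Note what the sketch does not do: no aggregation, no joins, no interpretation — just names, types, and rows, exactly as the staging contract requires.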
Why Dimensions Matter
A dimension table like dim_geography stores the descriptive attributes of a geographic
area: its code (37), type (state), name (North Carolina),
and whether it's the current version. The fact table stores only the surrogate key
(geography_key = 14), not the full description.
This separation means:
- Geographic names can change without rewriting fact history
- Multiple facts can reference the same geography without storing redundant strings
- Queries filter on dimension attributes and join to facts via integer keys (fast)
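A minimal sketch of this separation, using SQLite for brevity. Column names follow the document's dim_geography example, but the exact schema and the employment figure are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_geography (
    geography_key  INTEGER PRIMARY KEY,  -- surrogate key referenced by facts
    geography_code TEXT NOT NULL,        -- e.g. '37'
    geography_type TEXT NOT NULL,        -- e.g. 'state'
    geography_name TEXT NOT NULL,
    is_current     INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE fact_occupation_employment_wages (
    geography_key INTEGER NOT NULL REFERENCES dim_geography(geography_key),
    tot_emp       INTEGER               -- illustrative measure only
);
""")

conn.execute("INSERT INTO dim_geography VALUES (14, '37', 'state', 'North Carolina', 1)")
conn.execute("INSERT INTO fact_occupation_employment_wages VALUES (14, 4500000)")  # made-up figure

# Filter on dimension attributes; join to the fact via the integer surrogate key.
row = conn.execute("""
    SELECT d.geography_name, f.tot_emp
    FROM fact_occupation_employment_wages f
    JOIN dim_geography d USING (geography_key)
    WHERE d.geography_type = 'state' AND d.geography_code = '37'
""").fetchone()
```

If North Carolina were renamed, only the one dim_geography row changes; every fact row keeps pointing at key 14.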
The Grain Rule
Every fact table has a "grain" — the combination of dimensions that uniquely identifies a row.
For fact_occupation_employment_wages, the grain is:
(reference_period, geography_key, industry_key, ownership_code, occupation_key, source_dataset)
Rule: If you insert a row that duplicates this grain, something has gone wrong.
The loader enforces this with delete-before-insert at the
(source_dataset, source_release_id) level.
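Both halves of this rule can be made concrete: a unique index declares the grain so duplicates fail loudly, and a delete-before-insert loader makes re-running a release idempotent. This is a sketch under assumed column types, not the project's actual loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_occupation_employment_wages (
    reference_period  TEXT NOT NULL,
    geography_key     INTEGER NOT NULL,
    industry_key      INTEGER NOT NULL,
    ownership_code    TEXT NOT NULL,
    occupation_key    INTEGER NOT NULL,
    source_dataset    TEXT NOT NULL,
    source_release_id TEXT NOT NULL,
    tot_emp           INTEGER
)
""")
# The grain, declared as a unique index: inserting a duplicate grain now raises.
conn.execute("""
CREATE UNIQUE INDEX uq_fact_grain ON fact_occupation_employment_wages
    (reference_period, geography_key, industry_key,
     ownership_code, occupation_key, source_dataset)
""")

def load_release(conn, source_dataset, source_release_id, rows):
    """Idempotent load: delete the release's prior rows, then insert fresh."""
    conn.execute(
        "DELETE FROM fact_occupation_employment_wages "
        "WHERE source_dataset = ? AND source_release_id = ?",
        (source_dataset, source_release_id),
    )
    conn.executemany(
        "INSERT INTO fact_occupation_employment_wages VALUES (?,?,?,?,?,?,?,?)",
        rows,
    )

rows = [("2023", 14, 1, "5", 101, "oews", "2023-a", 1000)]  # illustrative values
load_release(conn, "oews", "2023-a", rows)
load_release(conn, "oews", "2023-a", rows)  # re-run: replaces, does not duplicate
count = conn.execute(
    "SELECT COUNT(*) FROM fact_occupation_employment_wages"
).fetchone()[0]
```

The delete scope, (source_dataset, source_release_id), is deliberately coarser than the grain: one release may span many grain combinations, and replacing the whole release at once keeps partial loads from mixing old and new rows.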
Bridge Tables for Many-to-Many
An occupation has many skills. A skill applies to many occupations. The
bridge_occupation_skill table resolves this with three columns:
occupation_key, element_id, and a score.
This is deliberately separate from a generic EAV table — each O*NET domain (skills, knowledge, abilities, tasks) gets its own bridge table for clarity. More tables is acceptable when it makes the schema self-documenting.
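A minimal sketch of the bridge in use, with made-up element IDs and scores (real O*NET element IDs and importance scores differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_occupation (
    occupation_key   INTEGER PRIMARY KEY,
    occupation_title TEXT NOT NULL
);
CREATE TABLE bridge_occupation_skill (
    occupation_key INTEGER NOT NULL REFERENCES dim_occupation(occupation_key),
    element_id     TEXT NOT NULL,   -- O*NET skill element identifier
    score          REAL NOT NULL,   -- strength of the occupation-skill link
    PRIMARY KEY (occupation_key, element_id)
);
""")

conn.execute("INSERT INTO dim_occupation VALUES (101, 'Software Developers')")
conn.executemany(
    "INSERT INTO bridge_occupation_skill VALUES (?, ?, ?)",
    [(101, "2.A.1.a", 4.1), (101, "2.B.3.e", 4.6)],  # illustrative rows
)

# One occupation, many skills; the bridge row itself carries the score.
skills = conn.execute("""
    SELECT b.element_id, b.score
    FROM bridge_occupation_skill b
    JOIN dim_occupation o USING (occupation_key)
    WHERE o.occupation_title = 'Software Developers'
    ORDER BY b.score DESC
""").fetchall()
```

A bridge_occupation_knowledge or bridge_occupation_ability table would have the same three-column shape, which is exactly the self-documenting repetition the section argues for.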