The Federal Labor Data Landscape

Lesson 1 — Data sources & relationships

The Occupation Code Is Everything

Every data product in this system revolves around the Standard Occupational Classification (SOC) code. A SOC code like 11-1011 identifies "Chief Executives": the 11 is the major group (Management), and the 1011 narrows it to the specific occupation. SOC 2018 defines 867 detailed occupations.
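To make the code structure concrete, here is a minimal Python sketch that splits a SOC code into its two parts. The names `SocCode` and `parse_soc` are illustrative and not part of any BLS tooling; the sketch assumes the plain `NN-NNNN` form used in this lesson.

```python
import re
from typing import NamedTuple

# Detailed SOC codes are written as two digits, a hyphen, then four digits.
SOC_PATTERN = re.compile(r"^(\d{2})-(\d{4})$")


class SocCode(NamedTuple):
    code: str         # full code, e.g. "11-1011"
    major_group: str  # first two digits, e.g. "11"
    detail: str       # last four digits, e.g. "1011"

    @property
    def major_group_code(self) -> str:
        # Major groups are coded with a 0000 suffix, e.g. "11-0000".
        return f"{self.major_group}-0000"


def parse_soc(raw: str) -> SocCode:
    """Split a SOC code string into its major group and detail parts."""
    match = SOC_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"not a SOC code: {raw!r}")
    major, detail = match.groups()
    return SocCode(code=f"{major}-{detail}", major_group=major, detail=detail)


chief_executives = parse_soc("11-1011")
print(chief_executives.major_group)       # the Management major group prefix
print(chief_executives.major_group_code)  # the code of the major group itself
```

Validating the format at the boundary keeps malformed codes from silently failing to join later on.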

The SOC taxonomy is published by BLS and updated roughly every 10 years. The current version is SOC 2018. This matters because when the taxonomy changes, occupation codes can split, merge, or disappear — breaking any naive time-series comparison.
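One way to see the problem is to compare the code sets of two taxonomy versions before joining data across years. The sketch below is an assumption about how such a check might look; `diff_soc_versions` is a hypothetical helper, and the `99-90xx` codes are made-up placeholders, not entries from any official crosswalk.

```python
def diff_soc_versions(old_codes: set[str], new_codes: set[str]) -> dict[str, list[str]]:
    """Classify codes by whether they survive from one SOC version to the next."""
    return {
        "kept": sorted(old_codes & new_codes),     # present in both versions
        "dropped": sorted(old_codes - new_codes),  # only in the old version
        "added": sorted(new_codes - old_codes),    # only in the new version
    }


# Placeholder code sets: 11-1011 is real, the 99-90xx codes are invented for illustration.
old_version = {"11-1011", "99-9001"}
new_version = {"11-1011", "99-9002"}

report = diff_soc_versions(old_version, new_version)
for status, codes in report.items():
    print(status, codes)
```

A set comparison only shows which codes vanished or appeared. Telling a split from a merge or a renumbering still requires the crosswalk published alongside each SOC revision.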

Four Data Products, One Key

| Source | Publisher | What It Contains | Update Frequency |
| --- | --- | --- | --- |
| SOC | BLS | Occupation hierarchy: major → minor → broad → detailed | ~10 years |
| OEWS | BLS | Employment counts and wage statistics by occupation, geography, and industry | Annual (May reference period) |
| O*NET | DOL / O*NET Center | Skills, knowledge, abilities, and tasks for each occupation | Continuous (versioned releases) |
| Employment Projections | BLS | 10-year employment outlook with growth rates | Annual (biennial before the 2019-29 release) |

How They Connect

SOC (taxonomy backbone)
 ├── OEWS links via occupation_code → how many people, how much they earn
 ├── O*NET links via SOC code → what skills and tasks the job requires
 └── Projections link via SOC code → where employment is heading

The SOC taxonomy must load first. Every other data product joins to dim_occupation through the SOC code. If a source uses a code that doesn't exist in the loaded SOC version, that row is excluded. This is expected for about five codes in the 2024 National Employment Matrix (NEM), the occupation list behind Employment Projections, that don't map to SOC 2018.
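The load-order rule and the exclusion behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's actual loader: `load_facts`, the `soc_code` and `source` field names, and the `99-9001` placeholder code are all hypothetical, and `dim_occupation` is modeled as a plain dictionary keyed by SOC code.

```python
from typing import Iterable


def load_facts(
    rows: Iterable[dict], dim_occupation: dict[str, str]
) -> tuple[list[dict], list[dict]]:
    """Keep rows whose SOC code exists in dim_occupation; set the rest aside."""
    if not dim_occupation:
        # The taxonomy is the join target, so it has to be loaded before any facts.
        raise RuntimeError("load the SOC taxonomy before any other data product")

    loaded, excluded = [], []
    for row in rows:
        if row["soc_code"] in dim_occupation:
            loaded.append(row)
        else:
            excluded.append(row)  # code not in the loaded SOC version
    return loaded, excluded


# dim_occupation after loading the taxonomy (one real entry for illustration).
dim_occupation = {"11-1011": "Chief Executives"}

# Incoming rows; 99-9001 is a made-up code standing in for an unmapped NEM code.
incoming = [
    {"source": "OEWS", "soc_code": "11-1011"},
    {"source": "Projections", "soc_code": "99-9001"},
]

loaded, excluded = load_facts(incoming, dim_occupation)
print(len(loaded), "loaded;", len(excluded), "excluded")
```

Returning the excluded rows instead of dropping them silently makes it easy to confirm that the exclusions match the handful of expected unmapped codes.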

Why This Matters

Key insight: You cannot treat these as four independent datasets. They are four views of the same occupational reality, and the SOC code is the thread that connects them. Designing the warehouse around this fact — occupation as the stable external key — is the single most important architectural decision in this project.