The Multi-Vintage Challenge
What Is a "Vintage"?
A vintage is one release of a dataset. OEWS publishes annually with a May reference period, so
2021.05, 2022.05, and 2023.05 are three vintages of the
same dataset. Each vintage is a complete snapshot — not a diff.
The Naive Approach (and Why It Breaks)
The simplest approach: load each vintage independently. Each vintage creates its own dimension rows. OEWS 2021 creates geography key 1 for "National", OEWS 2022 creates geography key 2 for "National", OEWS 2023 creates geography key 3 for "National".
Now try to compute year-over-year change in employment for a national-level occupation. You need to join facts from 2021 and 2023 on geography — but they have different geography keys for the same physical area. The join produces zero rows.
The problem: Per-vintage dimension keys make cross-vintage analysis impossible because the same real-world entity gets different surrogate keys in each vintage.
The Fix: Deduplicate on Business Key
Geography doesn't change between vintages. North Carolina is still North Carolina. So the geography
dimension should check for existing rows using only the business key (geo_type + geo_code),
not the source release. If a matching row exists, reuse its surrogate key.
Before (broken)
WHERE geo_type = ? AND geo_code = ? AND source_release_id = ?
After (correct)
WHERE geo_type = ? AND geo_code = ?
This single change — removing source_release_id from the geography lookup —
is what makes cross-vintage time-series analysis possible.
Manifest-Based Vintage Management
The source manifest (config/source_manifest.yaml) uses suffix-based naming for vintages:
- dataset_name: oews_national_2021
dataset_url: https://www.bls.gov/oes/special-requests/oesm21nat.zip
- dataset_name: oews_national_2022
dataset_url: https://www.bls.gov/oes/special-requests/oesm22nat.zip
- dataset_name: oews_national_2023
dataset_url: https://www.bls.gov/oes/special-requests/oesm23nat.zip
The orchestrator pairs national + state entries by matching their suffix:
oews_national_2021 pairs with oews_state_2021. This convention is
simple, extensible, and easy to debug.