The Multi-Vintage Challenge

Lesson 3 — Multi-year data integration

What Is a "Vintage"?

A vintage is one release of a dataset. OEWS publishes annually with a May reference period, so 2021.05, 2022.05, and 2023.05 are three vintages of the same dataset. Each vintage is a complete snapshot — not a diff.

The Naive Approach (and Why It Breaks)

The simplest approach: load each vintage independently. Each vintage creates its own dimension rows. OEWS 2021 creates geography key 1 for "National", OEWS 2022 creates geography key 2 for "National", OEWS 2023 creates geography key 3 for "National".

Now try to compute year-over-year change in employment for a national-level occupation. You need to join facts from 2021 and 2023 on geography — but they have different geography keys for the same physical area. The join produces zero rows.

The problem: Per-vintage dimension keys make cross-vintage analysis impossible because the same real-world entity gets different surrogate keys in each vintage.

The Fix: Deduplicate on Business Key

Geography doesn't change between vintages. North Carolina is still North Carolina. So the geography dimension should check for existing rows using only the business key (geo_type + geo_code), not the source release. If a matching row exists, reuse its surrogate key.

Before (broken)

WHERE geo_type = ? AND geo_code = ? AND source_release_id = ?

After (correct)

WHERE geo_type = ? AND geo_code = ?

This single change — removing source_release_id from the geography lookup — is what makes cross-vintage time-series analysis possible.

Manifest-Based Vintage Management

The source manifest (config/source_manifest.yaml) uses suffix-based naming for vintages:

- dataset_name: oews_national_2021
  dataset_url: https://www.bls.gov/oes/special-requests/oesm21nat.zip
- dataset_name: oews_national_2022
  dataset_url: https://www.bls.gov/oes/special-requests/oesm22nat.zip
- dataset_name: oews_national_2023
  dataset_url: https://www.bls.gov/oes/special-requests/oesm23nat.zip

The orchestrator pairs national + state entries by matching their suffix: oews_national_2021 pairs with oews_state_2021. This convention is simple, extensible, and easy to debug.