Multi-Vintage Query Pitfalls — Lessons

The Core Problem

Once the warehouse holds multiple vintages of the same dataset (OEWS 2021, 2022, 2023), every query must explicitly decide: do I want the latest snapshot, or all history? Forgetting this decision produces silent data quality bugs.

Trap 1: Duplicate Rows in Snapshot Views

The wages comparison page (/occupation/{soc}/wages) showed 150 rows for 50 states — three rows per state, one per OEWS vintage. The page was designed to show the latest snapshot, but the query had no vintage filter:

# BROKEN: returns all vintages
SELECT g.geo_name, f.mean_annual_wage, ...
FROM fact_occupation_employment_wages f
JOIN dim_geography g ON f.geography_key = g.geography_key
WHERE o.soc_code = '11-2021' AND g.geo_type = 'state'

# Result: Alabama (2021), Alabama (2022), Alabama (2023), Alaska (2021), ...
# 50 states × 3 vintages = 150 rows

# FIXED: latest vintage only
SELECT MAX(source_release_id) FROM fact_occupation_employment_wages
-- → '2023.05'

SELECT g.geo_name, f.mean_annual_wage, ...
WHERE ... AND f.source_release_id = '2023.05'

# Result: Alabama, Alaska, ... (50 rows)

Rule: Any endpoint that shows a point-in-time snapshot must filter to a specific source_release_id. Only time-series and trend endpoints should return multiple vintages.

Trap 2: Derived Metrics Tied to Comparability Mode

The Trend Explorer page showed YoY Change and YoY % as N/A for every year. The calculations were correct — but they only existed in comparable mode, while the UI defaulted to as_published.

# fact_derived_series stores YoY only for comparable mode:
comparability_mode = 'comparable'  → yoy_change = 33420, yoy_pct = 15.69
comparability_mode = 'as_published' → (no rows)

# The API query correctly LEFT JOINs to derived series,
# but when comparability_mode = 'as_published', the join finds nothing → NULL

The fix was simple: change the default <select> option from as_published to comparable. The data existed all along — the UI was pointing at the wrong slice.

Rule: When derived metrics depend on a specific comparability mode, the UI must default to that mode. Otherwise users see empty columns and assume the feature is broken.

Trap 3: Percentages Without Absolute Context

The Ranked Movers page showed only YoY percent change. Without the absolute change, users couldn't distinguish meaningful moves from noise:

# Without absolute context:
Athletes and Sports Competitors:  +206%   ← How much is that?
Broadcast Announcers:             +178%   ← Meaningful or tiny base?

# With absolute context:
Athletes and Sports Competitors:  +$241,150  (+206%)   ← Massive jump
Broadcast Announcers:             +$99,620   (+178%)   ← Also large

The fix: join to yoy_absolute_change in the movers API query and display both columns in the UI. Wage metrics format as dollars; employment metrics format as counts.

Decision Guide: Latest vs. All Vintages

Page / Endpoint	Vintage Policy	Filter
Occupation profile, wages comparison	Latest only	`source_release_id = MAX(...)`
Trend Explorer, time-series API	All vintages	No filter (ordered by year)
Ranked Movers	Latest derived only	Implicit (latest period in derived series)
Geography comparison	Specific year (user-selected or latest)	`tp.year = ?`