Geography Comparison Pitfalls

Lesson 19 — State-level data challenges

State vs. National Gaps

One of the most useful derived metrics in the warehouse is the state-vs-national gap: how much does a state's wage for a given occupation differ from the national average? The computation is a single INSERT…SELECT that pairs each state observation with the corresponding national observation for the same occupation, metric, and period.

-- State minus national for the same metric, occupation, period
INSERT INTO fact_derived_series (...)
SELECT
    derived_key, base_key,
    state_obs.occupation_key,
    state_obs.geography_key,
    state_obs.period_key,
    state_obs.comparability_mode,
    state_obs.observed_value - nat_obs.observed_value,
    'state_vs_national_gap', run_id
FROM fact_time_series_observation state_obs
JOIN dim_geography g ON state_obs.geography_key = g.geography_key
JOIN dim_geography g_nat ON g_nat.geo_type = 'national'
JOIN fact_time_series_observation nat_obs
  ON nat_obs.occupation_key = state_obs.occupation_key
  AND nat_obs.geography_key = g_nat.geography_key
  AND nat_obs.period_key = state_obs.period_key
WHERE g.geo_type = 'state'
  AND state_obs.observed_value IS NOT NULL
  AND nat_obs.observed_value IS NOT NULL

The derived value is simply state minus national. A positive gap means the state pays more than the national average; a negative gap means less. The join on occupation_key and period_key ensures each comparison is apples-to-apples — same occupation, same time period, same metric. The IS NOT NULL filters exclude suppressed values, so the derived series only contains gaps where both sides have real data.

The Geography Compare Endpoint

The API endpoint that powers the geography comparison view follows a common year-resolution pattern. If the caller does not specify a year, the endpoint automatically finds the latest available year in the data.

# compare_geography(soc_code, metric, year)
# 1. If year not specified, find MAX(year) from data
# 2. Filter observations by:
#    - soc_code (specific occupation)
#    - metric (mean_annual_wage, etc.)
#    - geo_type = 'state'
#    - year (specific period)
# 3. Return list of {geo_name, geo_code, value, source_release_id}
# 4. Order by geo_name for consistent display

The year-resolution pattern (default to latest available) is a common UX strategy. It ensures users always see data even if they don't specify a year. This is especially important because different data vintages may have different latest years — OEWS 2023 data might be the newest available, while another source might only go through 2022. The endpoint resolves the year within the context of the specific occupation and metric being queried.

Cost-of-Living Confounders

A $10,000 state-vs-national wage gap does NOT mean the state is "better paid." California's software developers earn roughly $20K above the national average, but San Francisco housing costs $30K more per year than the national average. A positive gap in nominal wages can mask a negative gap in purchasing power.

This is not a solvable problem within the OEWS data alone. The Bureau of Economic Analysis publishes Regional Price Parities (RPPs) that could adjust for geographic cost differences, but the warehouse does not currently ingest RPP data. Until it does, geographic wage comparisons must carry an explicit caveat.

Important: The warehouse does not include cost-of-living data. All geographic wage comparisons are nominal. A positive gap signals higher pay, not higher purchasing power. Users must bring their own cost-of-living context to interpret these numbers.

Suppressed Cells in Small States

States with few workers in a given occupation have suppressed OEWS data. Wyoming may have data for "Registered Nurses" but not for "Industrial-Organizational Psychologists" (a niche occupation with perhaps 1–2 practitioners statewide). The BLS suppresses cells to protect confidentiality when the number of respondents is too small.

Occupation States with data States suppressed
Registered Nurses (29-1141) 51 0
Software Developers (15-1252) 51 0
Dancers (27-2031) 12 39
Astronomers (19-2011) 7 44

The API returns only non-NULL values — suppressed states simply don't appear in the response. This means small-occupation geographic comparisons show an incomplete picture. A map visualization for "Dancers" would have 39 blank states, which could be mistaken for zero employment rather than suppressed data. The distinction matters: suppressed means "data exists but can't be disclosed," not "no workers in this state."

Dimension-Driven Filtering

The geography dimension table is the key to flexible geographic queries. Every observation links to dim_geography through a surrogate key, and the dimension carries the geo_type classifier that enables level-based filtering.

dim_geography:
  geo_key | geo_type  | geo_code | geo_name
  1       | national  | US       | National
  2       | state     | CA       | California
  3       | state     | NY       | New York
  4       | metro     | 31080    | Los Angeles-Long Beach-Anaheim

All endpoints filter by geo_type — the same query pattern works across national, state, and metro comparisons. The dimension design means adding a new geography level (e.g., county) requires no schema changes — just new rows in dim_geography. The fact tables don't know or care what kind of geography they're pointing to; the dimension carries that context.

This is a textbook example of why dimensional modeling favors explicit type columns over separate tables. A dim_state table and a dim_metro table would force every query to know which table to join. A single dim_geography with a geo_type filter lets the same SQL work for any level.