Geography Comparison Pitfalls
State vs. National Gaps
One of the most useful derived metrics in the warehouse is the state-vs-national gap: how much does a state's wage for a given occupation differ from the national average? The computation is a single INSERT…SELECT that pairs each state observation with the corresponding national observation for the same occupation, metric, and period.
-- State minus national for the same metric, occupation, period
INSERT INTO fact_derived_series (...)
SELECT
derived_key, base_key,
state_obs.occupation_key,
state_obs.geography_key,
state_obs.period_key,
state_obs.comparability_mode,
state_obs.observed_value - nat_obs.observed_value,
'state_vs_national_gap', run_id
FROM fact_time_series_observation state_obs
JOIN dim_geography g ON state_obs.geography_key = g.geography_key
JOIN dim_geography g_nat ON g_nat.geo_type = 'national'
JOIN fact_time_series_observation nat_obs
ON nat_obs.occupation_key = state_obs.occupation_key
AND nat_obs.geography_key = g_nat.geography_key
AND nat_obs.period_key = state_obs.period_key
WHERE g.geo_type = 'state'
AND state_obs.observed_value IS NOT NULL
AND nat_obs.observed_value IS NOT NULL
The derived value is simply state minus national. A positive gap means the state pays
more than the national average; a negative gap means less. The join on
occupation_key and period_key ensures each comparison is
apples-to-apples — same occupation, same time period, same metric. The
IS NOT NULL filters exclude suppressed values, so the derived series
only contains gaps where both sides have real data.
The Geography Compare Endpoint
The API endpoint that powers the geography comparison view follows a common year-resolution pattern. If the caller does not specify a year, the endpoint automatically finds the latest available year in the data.
# compare_geography(soc_code, metric, year)
# 1. If year not specified, find MAX(year) from data
# 2. Filter observations by:
# - soc_code (specific occupation)
# - metric (mean_annual_wage, etc.)
# - geo_type = 'state'
# - year (specific period)
# 3. Return list of {geo_name, geo_code, value, source_release_id}
# 4. Order by geo_name for consistent display
The year-resolution pattern (default to latest available) is a common UX strategy. It ensures users always see data even if they don't specify a year. This is especially important because different data vintages may have different latest years — OEWS 2023 data might be the newest available, while another source might only go through 2022. The endpoint resolves the year within the context of the specific occupation and metric being queried.
Cost-of-Living Confounders
A $10,000 state-vs-national wage gap does NOT mean the state is "better paid." California's software developers earn roughly $20K above the national average, but San Francisco housing costs $30K more per year than the national average. A positive gap in nominal wages can mask a negative gap in purchasing power.
This is not a solvable problem within the OEWS data alone. The Bureau of Economic Analysis publishes Regional Price Parities (RPPs) that could adjust for geographic cost differences, but the warehouse does not currently ingest RPP data. Until it does, geographic wage comparisons must carry an explicit caveat.
Important: The warehouse does not include cost-of-living data. All geographic wage comparisons are nominal. A positive gap signals higher pay, not higher purchasing power. Users must bring their own cost-of-living context to interpret these numbers.
Suppressed Cells in Small States
States with few workers in a given occupation have suppressed OEWS data. Wyoming may have data for "Registered Nurses" but not for "Industrial-Organizational Psychologists" (a niche occupation with perhaps 1–2 practitioners statewide). The BLS suppresses cells to protect confidentiality when the number of respondents is too small.
| Occupation | States with data | States suppressed |
|---|---|---|
| Registered Nurses (29-1141) | 51 | 0 |
| Software Developers (15-1252) | 51 | 0 |
| Dancers (27-2031) | 12 | 39 |
| Astronomers (19-2011) | 7 | 44 |
The API returns only non-NULL values — suppressed states simply don't appear in the response. This means small-occupation geographic comparisons show an incomplete picture. A map visualization for "Dancers" would have 39 blank states, which could be mistaken for zero employment rather than suppressed data. The distinction matters: suppressed means "data exists but can't be disclosed," not "no workers in this state."
Dimension-Driven Filtering
The geography dimension table is the key to flexible geographic queries. Every
observation links to dim_geography through a surrogate key, and the
dimension carries the geo_type classifier that enables level-based
filtering.
dim_geography:
geo_key | geo_type | geo_code | geo_name
1 | national | US | National
2 | state | CA | California
3 | state | NY | New York
4 | metro | 31080 | Los Angeles-Long Beach-Anaheim
All endpoints filter by geo_type — the same query pattern works
across national, state, and metro comparisons. The dimension design means adding a
new geography level (e.g., county) requires no schema changes — just new rows
in dim_geography. The fact tables don't know or care what kind of
geography they're pointing to; the dimension carries that context.
This is a textbook example of why dimensional modeling favors explicit type columns
over separate tables. A dim_state table and a dim_metro
table would force every query to know which table to join. A single
dim_geography with a geo_type filter lets the same SQL
work for any level.