Derived Metrics from Base Observations
Why Derived Metrics Need Their Own Table
Base observations are source truth from BLS and O*NET. Derived values — year-over-year change, rolling averages, real wages — are computed artifacts. Mixing them in the same table creates ambiguity: is this value a measurement or a calculation? A consumer querying wage data shouldn't have to guess whether a number came directly from BLS or was produced by the pipeline.
The pipeline stores base observations in fact_time_series_observation
and all derived values in fact_derived_series. Each derived row carries
a derivation_method label (e.g., 'yoy_absolute_change',
'cpi_deflation') and a base_metric_key pointing back to
the metric it was computed from. This separation means downstream queries always
know exactly what kind of value they are working with based on which table they
read from.
The Self-Join Pattern for Year-over-Year
Computing year-over-year change requires connecting each observation to its
prior-year counterpart. The core pattern is a self-join on
fact_time_series_observation through the time dimension.
-- Year-over-year absolute change
INSERT INTO fact_derived_series (...)
SELECT
derived_key, base_key,
curr.occupation_key, curr.geography_key,
curr.period_key, curr.comparability_mode,
curr.observed_value - prev.observed_value,
'yoy_absolute_change', run_id
FROM fact_time_series_observation curr
JOIN dim_time_period tp_curr ON curr.period_key = tp_curr.period_key
JOIN dim_time_period tp_prev
ON tp_prev.year = tp_curr.year - 1
JOIN fact_time_series_observation prev
ON prev.metric_key = curr.metric_key
AND prev.occupation_key = curr.occupation_key
AND prev.geography_key = curr.geography_key
AND prev.period_key = tp_prev.period_key
WHERE curr.observed_value IS NOT NULL
AND prev.observed_value IS NOT NULL
The self-join connects each observation to its prior-year counterpart through
dim_time_period. The tp_prev.year = tp_curr.year - 1
condition finds the matching period, and the subsequent join on
metric_key, occupation_key, geography_key,
and period_key ensures we compare the exact same measure for the
exact same occupation in the exact same geography. If either year is missing or
NULL, no derived value is produced — the pipeline never imputes.
Rolling Averages via Triple-Join
A 3-year rolling average smooths out single-year volatility and reveals underlying trends. The SQL pattern extends the self-join approach by connecting three consecutive years.
-- 3-year rolling average requires three consecutive years
JOIN dim_time_period tp_1 ON tp_1.year = tp_curr.year - 1
JOIN dim_time_period tp_2 ON tp_2.year = tp_curr.year - 2
JOIN fact_time_series_observation p1
ON p1.period_key = tp_1.period_key
AND p1.metric_key = curr.metric_key
AND p1.occupation_key = curr.occupation_key
AND p1.geography_key = curr.geography_key
JOIN fact_time_series_observation p2
ON p2.period_key = tp_2.period_key
AND p2.metric_key = curr.metric_key
AND p2.occupation_key = curr.occupation_key
AND p2.geography_key = curr.geography_key
...
ROUND((p2.observed_value + p1.observed_value + curr.observed_value) / 3.0, 2)
The triple-join naturally handles the "3 consecutive years required" constraint — if any year is missing, the JOIN produces no rows. No explicit gap detection is needed. This is a general pattern: require N consecutive values by adding N-1 self-joins through the time dimension. The database engine handles the absence check for free through inner join semantics.
Idempotent Delete-Before-Insert
Every derived metric computation follows the same two-step pattern: delete all existing rows for that metric, then insert the freshly computed values.
# Every compute function follows this pattern:
derived_key = get_metric_key(conn, "yoy_absolute_change")
conn.execute(
"DELETE FROM fact_derived_series WHERE metric_key = ?",
[derived_key]
)
conn.execute(
"INSERT INTO fact_derived_series (...) SELECT ..."
)
Key principle: Delete-before-insert is safe for derived metrics because every value is fully reproducible from base observations. Unlike base data (which could be lost), derived data can always be recomputed. This pattern makes the pipeline idempotent: running it twice produces the same result.
The delete is scoped to a single metric_key, so recomputing
year-over-year change doesn't touch rolling averages or real wages. Each derived
metric is an independent computation that can be re-run in isolation without
affecting any other metric in the table.
The Derived Metrics Registry
The pipeline computes seven derived metrics, each with a specific derivation method
and minimum data requirement. All are stored in fact_derived_series
with a base_metric_key linking back to the source observation metric.
| Metric | Units | Method | Min Data Required |
|---|---|---|---|
| yoy_absolute_change | varies | Self-join on year-1 | 2 consecutive years |
| yoy_percent_change | percent | (curr - prev) / prev × 100 | 2 consecutive years |
| rolling_avg_3yr | varies | Average of 3 consecutive years | 3 consecutive years |
| state_vs_national_gap | varies | state_value - national_value | State + national data |
| rank_delta | rank_change | rank(year-1) - rank(year) | 2 consecutive years |
| real_mean_annual_wage | dollars | CPI deflation | CPI data for year |
| real_median_annual_wage | dollars | CPI deflation | CPI data for year |
Each metric is registered in dim_metric with its units and description.
The derivation_method column in fact_derived_series makes
it possible to trace exactly how any value was produced, and the
base_metric_key connects each derived row to the source observation
metric it was computed from.