Derived Metrics from Base Observations — Lessons

Why Derived Metrics Need Their Own Table

Base observations are source truth from BLS and O*NET. Derived values — year-over-year change, rolling averages, real wages — are computed artifacts. Mixing them in the same table creates ambiguity: is this value a measurement or a calculation? A consumer querying wage data shouldn't have to guess whether a number came directly from BLS or was produced by the pipeline.

The pipeline stores base observations in fact_time_series_observation and all derived values in fact_derived_series. Each derived row carries a derivation_method label (e.g., 'yoy_absolute_change', 'cpi_deflation') and a base_metric_key pointing back to the metric it was computed from. This separation means downstream queries always know exactly what kind of value they are working with based on which table they read from.

The Self-Join Pattern for Year-over-Year

Computing year-over-year change requires connecting each observation to its prior-year counterpart. The core pattern is a self-join on fact_time_series_observation through the time dimension.

-- Year-over-year absolute change
INSERT INTO fact_derived_series (...)
SELECT
    derived_key, base_key,
    curr.occupation_key, curr.geography_key,
    curr.period_key, curr.comparability_mode,
    curr.observed_value - prev.observed_value,
    'yoy_absolute_change', run_id
FROM fact_time_series_observation curr
JOIN dim_time_period tp_curr ON curr.period_key = tp_curr.period_key
JOIN dim_time_period tp_prev
  ON tp_prev.year = tp_curr.year - 1
JOIN fact_time_series_observation prev
  ON prev.metric_key = curr.metric_key
  AND prev.occupation_key = curr.occupation_key
  AND prev.geography_key = curr.geography_key
  AND prev.period_key = tp_prev.period_key
WHERE curr.observed_value IS NOT NULL
  AND prev.observed_value IS NOT NULL

The self-join connects each observation to its prior-year counterpart through dim_time_period. The tp_prev.year = tp_curr.year - 1 condition finds the matching period, and the subsequent join on metric_key, occupation_key, geography_key, and period_key ensures we compare the exact same measure for the exact same occupation in the exact same geography. If either year is missing or NULL, no derived value is produced — the pipeline never imputes.

Rolling Averages via Triple-Join

A 3-year rolling average smooths out single-year volatility and reveals underlying trends. The SQL pattern extends the self-join approach by connecting three consecutive years.

-- 3-year rolling average requires three consecutive years
JOIN dim_time_period tp_1 ON tp_1.year = tp_curr.year - 1
JOIN dim_time_period tp_2 ON tp_2.year = tp_curr.year - 2
JOIN fact_time_series_observation p1
  ON p1.period_key = tp_1.period_key
  AND p1.metric_key = curr.metric_key
  AND p1.occupation_key = curr.occupation_key
  AND p1.geography_key = curr.geography_key
JOIN fact_time_series_observation p2
  ON p2.period_key = tp_2.period_key
  AND p2.metric_key = curr.metric_key
  AND p2.occupation_key = curr.occupation_key
  AND p2.geography_key = curr.geography_key
...
ROUND((p2.observed_value + p1.observed_value + curr.observed_value) / 3.0, 2)

The triple-join naturally handles the "3 consecutive years required" constraint — if any year is missing, the JOIN produces no rows. No explicit gap detection is needed. This is a general pattern: require N consecutive values by adding N-1 self-joins through the time dimension. The database engine handles the absence check for free through inner join semantics.

Idempotent Delete-Before-Insert

Every derived metric computation follows the same two-step pattern: delete all existing rows for that metric, then insert the freshly computed values.

# Every compute function follows this pattern:
derived_key = get_metric_key(conn, "yoy_absolute_change")
conn.execute(
    "DELETE FROM fact_derived_series WHERE metric_key = ?",
    [derived_key]
)
conn.execute(
    "INSERT INTO fact_derived_series (...) SELECT ..."
)

Key principle: Delete-before-insert is safe for derived metrics because every value is fully reproducible from base observations. Unlike base data (which could be lost), derived data can always be recomputed. This pattern makes the pipeline idempotent: running it twice produces the same result.

The delete is scoped to a single metric_key, so recomputing year-over-year change doesn't touch rolling averages or real wages. Each derived metric is an independent computation that can be re-run in isolation without affecting any other metric in the table.

The Derived Metrics Registry

The pipeline computes seven derived metrics, each with a specific derivation method and minimum data requirement. All are stored in fact_derived_series with a base_metric_key linking back to the source observation metric.

Metric	Units	Method	Min Data Required
yoy_absolute_change	varies	Self-join on year-1	2 consecutive years
yoy_percent_change	percent	(curr - prev) / prev × 100	2 consecutive years
rolling_avg_3yr	varies	Average of 3 consecutive years	3 consecutive years
state_vs_national_gap	varies	state_value - national_value	State + national data
rank_delta	rank_change	rank(year-1) - rank(year)	2 consecutive years
real_mean_annual_wage	dollars	CPI deflation	CPI data for year
real_median_annual_wage	dollars	CPI deflation	CPI data for year

Each metric is registered in dim_metric with its units and description. The derivation_method column in fact_derived_series makes it possible to trace exactly how any value was produced, and the base_metric_key connects each derived row to the source observation metric it was computed from.