Idempotent Pipeline Design

Lesson 6 — Pipeline reliability patterns

Why Idempotency Matters

Data pipelines fail. Downloads timeout, parsers hit unexpected formats, database connections drop. When you restart the pipeline, it must not create duplicate rows for data that was already loaded.

Definition: Idempotency means running the same operation twice produces the same result as running it once.

Pattern: Delete-Before-Insert at Release Grain

The OEWS loader uses this pattern:

-- 1. Delete any existing rows for this release
DELETE FROM stage__bls__oews_national WHERE source_release_id = ?

-- 2. Insert the new rows
INSERT INTO stage__bls__oews_national (...) VALUES (...)

If the pipeline crashes after step 1 but before step 2, the next run will delete nothing (already gone) and insert fresh. If it crashes after both steps, the next run will delete the partial data and re-insert complete. Either way, no duplicates.

Pattern: Check-Before-Insert for Dimensions

Dimension tables use existence checks:

-- Only insert if this business key doesn't already exist
INSERT INTO dim_geography (geo_type, geo_code, geo_name, ...)
SELECT ?, ?, ?, ...
WHERE NOT EXISTS (
    SELECT 1 FROM dim_geography WHERE geo_type = ? AND geo_code = ?
)

This means the dimension grows over time (new geographies get added) but existing rows are never duplicated or modified.

Pattern: Fact Table Grain Uniqueness

The fact table's unique constraint on its grain columns is the final safety net. Even if the loader logic has a bug, the database will reject duplicate facts:

UNIQUE (reference_period, geography_key, industry_key, ownership_code,
        occupation_key, source_dataset)

Testing pattern: Every test suite includes idempotence tests — load data, record the row count, load again, verify the count hasn't changed.