Data Quality Traps in Government Sources

Lesson 4 — Data quality & defensive parsing

Trap 1: ZIP Files With Garbage Inside

BLS publishes OEWS data as ZIP files containing Excel spreadsheets. The 2022 state ZIP (oesm22st.zip) also contained ~$state_M2022_dl.xlsx — an Excel temp/lock file that someone at BLS left in the archive. Our ZIP extractor tried to read this as the data file and failed with "File is not a zip file."

Defense

Filter any filename starting with ~$ before selecting the XLSX to extract. BLS does not scrub their archives before publishing.

xlsx_names = [
    n for n in zf.namelist()
    if n.lower().endswith(".xlsx") and not n.split("/")[-1].startswith("~$")
]

Trap 2: Column Names Change Between Vintages

OEWS 2023 uses the column header O_GROUP. OEWS 2021 uses OCC_GROUP. Older files might use just GROUP. They all mean the same thing.

Similarly: AREA_NAME in one vintage, AREA_TITLE in another.

Defense

Maintain an alias map in the parser that normalizes all known column name variations to a single canonical name:

_OEWS_COLUMN_ALIASES = {
    "group": "o_group",
    "occ_group": "o_group",
    "area_name": "area_title",
    ...
}

When adding a new vintage: The first thing to check is whether its column headers match the alias map. New columns or renamed columns will cause silent data loss if not mapped.

Trap 3: Broad/Detailed Group Overlap

BLS categorizes occupations into hierarchical groups: major, minor, broad, and detailed. In the OEWS data, a "detailed" occupation like 11-1011 (Chief Executives) also appears under its "broad" group 11-1010. Both rows carry identical employment counts and wage values.

If you load both rows into the fact table and then try to normalize into time-series observations, the unique constraint fires because you're inserting the same (occupation_key, geography_key, period_key) twice with the same values.

Defense

Use SELECT DISTINCT when normalizing from fact to time-series observation tables. The duplicates have identical values, so deduplication is safe and correct.

Trap 4: Suppressed Values Are Not Zeros

BLS suppresses wage and employment data for confidentiality when the number of respondents is too small. These values arrive as null, asterisks (*), or special markers (**). They mean "data exists but cannot be disclosed" — fundamentally different from zero.

Rule: Preserve nulls through every layer. Never impute, never substitute zero. Display "Data suppressed" or "Not available" in the UI. This is a legal and ethical requirement, not just a data quality preference.

Trap 5: BLS Blocks Bare HTTP Requests

BLS.gov returns 403 Forbidden for HTTP requests that don't include browser-like headers. Specifically, you need Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site, and Sec-Fetch-User headers.

The downloader sends a full set of browser-like headers. This is not scraping — the data is public — but BLS's CDN enforces these checks.