Data Quality Traps in Government Sources
Trap 1: ZIP Files With Garbage Inside
BLS publishes OEWS data as ZIP files containing Excel spreadsheets. The 2022 state ZIP
(oesm22st.zip) also contained ~$state_M2022_dl.xlsx —
an Excel temp/lock file that someone at BLS left in the archive. Our ZIP extractor tried
to read this as the data file and failed with "File is not a zip file."
Defense
Filter any filename starting with ~$ before selecting the XLSX to extract.
BLS does not scrub their archives before publishing.
xlsx_names = [
n for n in zf.namelist()
if n.lower().endswith(".xlsx") and not n.split("/")[-1].startswith("~$")
]
Trap 2: Column Names Change Between Vintages
OEWS 2023 uses the column header O_GROUP. OEWS 2021 uses OCC_GROUP.
Older files might use just GROUP. They all mean the same thing.
Similarly: AREA_NAME in one vintage, AREA_TITLE in another.
Defense
Maintain an alias map in the parser that normalizes all known column name variations to a single canonical name:
_OEWS_COLUMN_ALIASES = {
"group": "o_group",
"occ_group": "o_group",
"area_name": "area_title",
...
}
When adding a new vintage: The first thing to check is whether its column headers match the alias map. New columns or renamed columns will cause silent data loss if not mapped.
Trap 3: Broad/Detailed Group Overlap
BLS categorizes occupations into hierarchical groups: major, minor, broad, and detailed.
In the OEWS data, a "detailed" occupation like 11-1011 (Chief Executives) also
appears under its "broad" group 11-1010. Both rows carry identical employment
counts and wage values.
If you load both rows into the fact table and then try to normalize into time-series
observations, the unique constraint fires because you're inserting the same
(occupation_key, geography_key, period_key) twice with the same values.
Defense
Use SELECT DISTINCT when normalizing from fact to time-series observation
tables. The duplicates have identical values, so deduplication is safe and correct.
Trap 4: Suppressed Values Are Not Zeros
BLS suppresses wage and employment data for confidentiality when the number of respondents
is too small. These values arrive as null, asterisks (*), or special markers
(**). They mean "data exists but cannot be disclosed" — fundamentally
different from zero.
Rule: Preserve nulls through every layer. Never impute, never substitute zero. Display "Data suppressed" or "Not available" in the UI. This is a legal and ethical requirement, not just a data quality preference.
Trap 5: BLS Blocks Bare HTTP Requests
BLS.gov returns 403 Forbidden for HTTP requests that don't include browser-like headers.
Specifically, you need Sec-Fetch-Dest, Sec-Fetch-Mode,
Sec-Fetch-Site, and Sec-Fetch-User headers.
The downloader sends a full set of browser-like headers. This is not scraping — the data is public — but BLS's CDN enforces these checks.