Crosswalk & Taxonomy Evolution

Lesson 15 — Occupation code versioning

The Illusion of Stable Codes

SOC code 15-1131 meant "Computer Programmers" in SOC 2010. In SOC 2018, the occupation was renumbered to 15-1251, and its definition was revised. Naively joining on SOC code across vintages either drops the occupation from the series entirely or, where a code number was reused for a different occupation, silently mixes two different occupation definitions. Same code does not mean same occupation.

This is not an edge case. The SOC 2010-to-2018 revision restructured entire occupation groups. Some codes were split into multiple new codes. Others were merged. A few shifted to different major groups entirely. Any time-series analysis that treats SOC codes as stable identifiers across revisions will produce misleading results.

The solution is a crosswalk — an explicit mapping from every old code to every new code, with metadata about the type of change. The BLS publishes this mapping as a downloadable CSV. The pipeline parses it, classifies each pair, and stores the result in a bridge table that downstream queries can join against.

Crosswalk Structure

The parser produces one CrosswalkRow per (source, target) pair. Each row captures both the old and new code, plus metadata about the mapping relationship.

from dataclasses import dataclass

@dataclass
class CrosswalkRow:
    source_soc_code: str        # SOC 2010 code
    source_soc_title: str
    source_soc_version: str     # "2010"
    target_soc_code: str        # SOC 2018 code
    target_soc_title: str
    target_soc_version: str     # "2018"
    mapping_type: str           # "1:1", "split", "merge", "complex"
    source_release_id: str
    parser_version: str

Each row maps one 2010 code to one 2018 code. The mapping_type field is not present in the source data — it is computed by analyzing cardinality across all pairs. A 2010 code that maps to exactly one 2018 code (and vice versa) is 1:1. A 2010 code that maps to multiple 2018 codes is a split. The classification depends on the full set of pairs, not any single row in isolation.
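As a concrete instance, the 15-1131 → 15-1251 renumbering from the opening example would be captured as a single row. This is an illustrative sketch: the release and parser identifiers are placeholder values, not the pipeline's actual metadata.

```python
from dataclasses import dataclass

@dataclass
class CrosswalkRow:
    source_soc_code: str        # SOC 2010 code
    source_soc_title: str
    source_soc_version: str
    target_soc_code: str        # SOC 2018 code
    target_soc_title: str
    target_soc_version: str
    mapping_type: str           # "1:1", "split", "merge", "complex"
    source_release_id: str
    parser_version: str

# Illustrative row; release_id and parser_version are placeholders.
row = CrosswalkRow(
    source_soc_code="15-1131",
    source_soc_title="Computer Programmers",
    source_soc_version="2010",
    target_soc_code="15-1251",
    target_soc_title="Computer Programmers",
    target_soc_version="2018",
    mapping_type="1:1",
    source_release_id="soc-crosswalk-2018",
    parser_version="1.0.0",
)
print(row.source_soc_code, "->", row.target_soc_code)
```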

Cardinality-Based Classification

The mapping type is derived from two counts: how many targets each source maps to (fan-out), and how many sources each target receives (fan-in). The algorithm builds these dictionaries in a first pass, then classifies each pair in a second pass.

from collections import defaultdict

# Count fan-out and fan-in
source_targets = defaultdict(set)  # 2010 code → set of 2018 codes
target_sources = defaultdict(set)  # 2018 code → set of 2010 codes

for pair in all_pairs:
    source_targets[pair.source].add(pair.target)
    target_sources[pair.target].add(pair.source)

# Classify each pair
for pair in all_pairs:
    src_fan = len(source_targets[pair.source])
    tgt_fan = len(target_sources[pair.target])
    if src_fan == 1 and tgt_fan == 1:
        pair.mapping_type = "1:1"      # Unchanged
    elif src_fan > 1 and tgt_fan == 1:
        pair.mapping_type = "split"    # One old → many new
    elif src_fan == 1 and tgt_fan > 1:
        pair.mapping_type = "merge"    # Many old → one new
    else:
        pair.mapping_type = "complex"  # Many-to-many

Type      Fan-out   Fan-in   Safe for wage comparison?
1:1       1         1        Yes — same occupation, different code
split     >1        1        No — wages can't be disaggregated
merge     1         >1       No — wages can't be averaged meaningfully
complex   >1        >1       No — requires manual analysis

The four types cover every possible relationship. In the SOC 2010→2018 crosswalk, the majority of codes are 1:1 mappings. Splits and merges are less common but affect some of the most analytically interesting occupations (technology roles, healthcare specialties). Complex mappings are rare and typically involve major group restructuring.
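The classification above can be exercised end-to-end on a toy set of pairs. The codes here are invented placeholders, not real SOC codes; the function mirrors the two-pass logic shown earlier.

```python
from collections import defaultdict

def classify_pairs(pairs):
    """Classify each (source, target) pair by crosswalk cardinality."""
    source_targets = defaultdict(set)  # fan-out: source -> targets
    target_sources = defaultdict(set)  # fan-in:  target -> sources
    for src, tgt in pairs:
        source_targets[src].add(tgt)
        target_sources[tgt].add(src)

    result = {}
    for src, tgt in pairs:
        src_fan = len(source_targets[src])
        tgt_fan = len(target_sources[tgt])
        if src_fan == 1 and tgt_fan == 1:
            result[(src, tgt)] = "1:1"
        elif src_fan > 1 and tgt_fan == 1:
            result[(src, tgt)] = "split"
        elif src_fan == 1 and tgt_fan > 1:
            result[(src, tgt)] = "merge"
        else:
            result[(src, tgt)] = "complex"
    return result

# Toy crosswalk: A is unchanged, B splits in two, C and D merge into Z.
pairs = [("A", "A2"), ("B", "B1"), ("B", "B2"), ("C", "Z"), ("D", "Z")]
types = classify_pairs(pairs)
print(types[("A", "A2")])  # 1:1
print(types[("B", "B1")])  # split
print(types[("C", "Z")])   # merge
```

Note that moving a single pair in or out of `pairs` can change the classification of other pairs, which is exactly why the algorithm needs global knowledge.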

Why This Matters for Time Series

Key principle: Only 1:1 mappings are safe for direct wage comparison across SOC versions. For splits, employment counts can still be rolled up, because employment is additive, but wages cannot (you can't split a mean). Merges lose granularity. The crosswalk bridge table stores all mapping types, but the comparable-history pipeline only uses 1:1 pairs.

Consider a split: SOC 2010 code 15-1100 (Computer Occupations) splits into three SOC 2018 codes. The 2010 vintage reports a single mean wage for the combined group. You cannot divide that mean wage into three parts because the component occupations have different wage distributions. However, you can sum the 2018 employment counts back to the 2010 group level, because employment is additive.
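The asymmetry can be sketched with invented employment figures: summing 2018 counts for a split's successor codes recovers the 2010 group total, while no analogous operation exists for the group's mean wage. The successor codes and counts below are illustrative, not actual BLS data.

```python
# Hypothetical split: one 2010 group code -> three 2018 codes.
# Codes and employment figures are invented for illustration.
split_targets = {"15-1100": ["15-1251", "15-1252", "15-1254"]}

employment_2018 = {
    "15-1251": 200_000,
    "15-1252": 1_500_000,
    "15-1254": 150_000,
}

# Employment is additive: roll the 2018 counts up to the 2010 group.
group_total = sum(employment_2018[code] for code in split_targets["15-1100"])
print(group_total)  # 1850000

# Wages are not additive: the single 2010 mean wage cannot be divided
# among the successors, because their wage distributions differ.
```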

Merges have the opposite problem: two SOC 2010 codes collapse into one SOC 2018 code. You could average the 2010 wages, but an unweighted average misrepresents the combined group. A weighted average requires employment counts, which introduces circular dependencies when both wages and employment are the quantities you're trying to compare.
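A quick numeric sketch, with invented figures, shows how far an unweighted average can drift from the employment-weighted one when two occupations of very different sizes merge:

```python
# Invented figures: two 2010 occupations merging into one 2018 code.
wages = {"A": 100_000, "B": 40_000}        # mean annual wage
employment = {"A": 900_000, "B": 100_000}  # employment counts

# Unweighted mean treats both occupations as equal in size.
unweighted = sum(wages.values()) / len(wages)

# Weighted mean reflects the actual composition of the merged group.
weighted = (
    sum(wages[c] * employment[c] for c in wages)
    / sum(employment.values())
)

print(unweighted)  # 70000.0
print(weighted)    # 94000.0 -- dominated by the larger occupation
```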

The pipeline sidesteps these problems by restricting comparable history to 1:1 mappings. This sacrifices coverage (some occupations can't be tracked across versions) in exchange for correctness (every comparison that is made is statistically valid).
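The restriction itself amounts to a filter over the bridge rows. The sketch below assumes a minimal stand-in for the full CrosswalkRow, with only the fields the filter needs; the sample rows are illustrative.

```python
from dataclasses import dataclass

@dataclass
class BridgeRow:  # minimal stand-in for the full CrosswalkRow
    source_soc_code: str
    target_soc_code: str
    mapping_type: str

rows = [
    BridgeRow("11-1011", "11-1011", "1:1"),
    BridgeRow("15-1100", "15-1251", "split"),
    BridgeRow("15-1100", "15-1252", "split"),
]

# Comparable history: keep only mappings safe for direct wage comparison.
comparable = {
    (r.source_soc_code, r.target_soc_code)
    for r in rows
    if r.mapping_type == "1:1"
}
print(comparable)  # {('11-1011', '11-1011')}
```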

The Parse Pipeline

The crosswalk parser uses a two-pass approach. The first pass collects all (source, target) pairs from the CSV. The second pass classifies each pair using the cardinality dictionaries built during the first pass.

from collections import defaultdict

# Pass 1: Collect all pairs and build cardinality maps
pairs = []
source_targets = defaultdict(set)
target_sources = defaultdict(set)

for row in csv_reader:
    src = normalize_code(row[source_col])
    tgt = normalize_code(row[target_col])
    pairs.append((src, tgt, row))
    source_targets[src].add(tgt)
    target_sources[tgt].add(src)

# Pass 2: Classify and emit CrosswalkRow objects
for src, tgt, row in pairs:
    mapping_type = classify(src, tgt, source_targets, target_sources)
    yield CrosswalkRow(
        source_soc_code=src,
        target_soc_code=tgt,
        mapping_type=mapping_type,
        ...  # titles, versions, and release metadata omitted here
    )

Column aliases handle both possible header formats in the BLS CSV — "2010 SOC Code" vs. "Old SOC Code". The parser tries each alias in order and uses the first match. Regex validation ensures all codes match the XX-XXXX pattern before processing, catching malformed rows early.
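A minimal sketch of the alias lookup and code validation described above. The alias list and helper names (`resolve_column`, `normalize_code`) are assumptions for illustration, not the pipeline's actual identifiers.

```python
import re

SOC_CODE_RE = re.compile(r"^\d{2}-\d{4}$")  # the XX-XXXX pattern

# Hypothetical alias list for the source-code column.
SOURCE_ALIASES = ["2010 SOC Code", "Old SOC Code"]

def resolve_column(header, aliases):
    """Return the first alias present in the CSV header row."""
    for alias in aliases:
        if alias in header:
            return alias
    raise ValueError(f"no known column among {aliases!r}")

def normalize_code(raw):
    """Trim whitespace and validate the XX-XXXX shape."""
    code = raw.strip()
    if not SOC_CODE_RE.match(code):
        raise ValueError(f"malformed SOC code: {raw!r}")
    return code

header = ["Old SOC Code", "Old SOC Title", "2018 SOC Code", "2018 SOC Title"]
print(resolve_column(header, SOURCE_ALIASES))  # Old SOC Code
print(normalize_code(" 15-1131 "))             # 15-1131
```

Validating before classification means a malformed row fails loudly at parse time rather than polluting the cardinality counts.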

The two-pass design is necessary because classification requires global knowledge. You cannot determine whether a pair is 1:1 or part of a split until you've seen every pair involving that source code. Streaming single-pass processing would require deferred classification with a flush step — more complex for no real benefit given the small size of the crosswalk file (under 2,000 rows).