Choosing the Right Similarity Algorithm
The Problem: Dancer = Lawyer?
The occupation similarity feature was designed to answer a simple question: which occupations require similar skills? The initial implementation used Jaccard similarity, a classic set-based metric that measures overlap between two sets. It should have worked.
But every occupation showed 100% similarity to every other occupation. Dancers were identical to Pile Driver Operators. Bakers matched Lawyers. The entire similarity matrix was useless.
Root cause: The algorithm was correct — the data model assumption was wrong. Jaccard similarity measures set overlap, but O*NET doesn't use sets.
Why Jaccard Fails on O*NET Skills
Jaccard similarity computes |A ∩ B| / |A ∪ B|: the size of the intersection divided by the size of the union. It works when items either belong to a set or don't: a recipe either uses flour or it doesn't.
O*NET doesn't work this way. It rates all 35 skills for every occupation on importance scales from 1 to 5. Every occupation has every skill — they differ in how much each skill matters, not in which skills they have.
# What the data actually looks like:
# Dancer (27-2031)
Active Listening IM = 3.75
Mathematics IM = 1.62
Programming IM = 1.12
# Lawyer (23-1011)
Active Listening IM = 4.25
Mathematics IM = 2.12
Programming IM = 1.25
# Both have ALL 35 skills rated.
# Jaccard sees: |{35 skills} ∩ {35 skills}| / |{35 skills} ∪ {35 skills}|
# = 35/35 = 1.0 (100% similar)
When every item has every feature, set-based metrics collapse to a constant. The signal is in the scores, not the membership.
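The collapse is easy to reproduce. A minimal Python sketch (the skill names and scores are illustrative, trimmed to three of the 35 skills):

```python
def jaccard(a: set, b: set) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Jaccard works when membership actually varies:
bread = {"flour", "water", "yeast", "salt"}
cake = {"flour", "water", "sugar", "eggs"}
print(jaccard(bread, cake))  # 2/6 ≈ 0.33, genuinely partial overlap

# But every O*NET occupation has every skill rated, so the SETS are identical:
dancer_skills = {"Active Listening", "Mathematics", "Programming"}
lawyer_skills = {"Active Listening", "Mathematics", "Programming"}
print(jaccard(dancer_skills, lawyer_skills))  # 1.0, and the same for ANY pair
```

Because membership never varies, the importance scores never enter the computation at all.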
Cosine Similarity: Using the Scores
Cosine similarity treats each occupation's skill importance scores as a vector in 35-dimensional space. It measures the angle between two vectors — occupations that emphasize the same skills at similar levels point in the same direction.
cosine(A, B) = (A · B) / (||A|| × ||B||)
# Dancer vs. Choreographer (high similarity)
Both emphasize coordination, social perceptiveness, time management
→ cosine ≈ 0.98
# Dancer vs. Lawyer (low similarity)
Dancer: high coordination, monitoring; low critical thinking, writing
Lawyer: high critical thinking, writing; low coordination, monitoring
→ cosine ≈ 0.82
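A minimal sketch of the computation, using only the three sampled importance scores from above. Note that with just three dimensions the two vectors are nearly parallel; the ≈0.82 figure comes from the full 35-dimensional profiles:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """cosine(A, B) = (A · B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Importance scores for Active Listening, Mathematics, Programming
dancer = [3.75, 1.62, 1.12]
lawyer = [4.25, 2.12, 1.25]
print(round(cosine(dancer, lawyer), 4))  # 0.9985 in 3 dims; full 35-dim vectors separate further
```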
The range is 0 to 1 (since all importance scores are positive). Values above 0.95 indicate genuinely similar skill profiles; values below 0.85 indicate meaningfully different ones.
The Hidden Duplicate Problem
Switching to cosine similarity revealed a second bug: some occupations had similarity scores greater than 1.0. Cosine similarity is bounded by 1.0, so something was inflating the dot product.
The cause: "All Other" catch-all occupations (e.g., 17-2199
Engineers, All Other) had up to 8 duplicate rows per skill in the
bridge_occupation_skill table. Each duplicate multiplied the dot product
without increasing the norm proportionally.
-- Before fix: duplicates inflate the dot product
SELECT b.occupation_key, s.element_id, COUNT(*) AS n
FROM bridge_occupation_skill b
JOIN dim_skill s ON b.skill_key = s.skill_key
WHERE b.scale_id = 'IM'
GROUP BY b.occupation_key, s.element_id
HAVING COUNT(*) > 1;
-- Returns rows with n = 2, 4, 6, 8 for "All Other" occupations
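The inflation mechanism can be reproduced outside SQL. In this sketch (hypothetical occupations and scores), the dot product is computed join-style, so a duplicated row pairs twice, while the norm grows only by the square root of the duplication:

```python
from math import sqrt
from collections import defaultdict

# (occupation, skill, importance); occupation "B" has a duplicated row
rows = [
    ("A", "listening", 4.0), ("A", "math", 2.0),
    ("B", "listening", 3.0), ("B", "listening", 3.0),  # duplicate!
    ("B", "math", 2.5),
]

def cosine_from_rows(rows, occ1, occ2):
    # Norms over raw rows: the duplicate adds its square a second time
    sumsq = defaultdict(float)
    for occ, _skill, v in rows:
        sumsq[occ] += v * v
    # Dot product via a join on skill: every matching row PAIR contributes,
    # so the duplicated row doubles its term in the sum
    dot = sum(v1 * v2
              for o1, s1, v1 in rows if o1 == occ1
              for o2, s2, v2 in rows if o2 == occ2 and s2 == s1)
    return dot / (sqrt(sumsq[occ1]) * sqrt(sumsq[occ2]))

print(cosine_from_rows(rows, "A", "B"))  # > 1.0: the "impossible" similarity
```

Dropping the duplicate row (or averaging it away, as the fix below does) brings the score back inside the 0-to-1 bound.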
-- Fix: AVG + GROUP BY deduplicates before computing similarity
WITH occ_skill_scores AS (
    SELECT b.occupation_key, s.element_id,
           AVG(b.data_value) AS importance       -- ← collapses duplicates
    FROM bridge_occupation_skill b
    JOIN dim_skill s ON b.skill_key = s.skill_key
    WHERE b.scale_id = 'IM'
    GROUP BY b.occupation_key, s.element_id      -- ← one row per skill
)
Lesson: When your metric exceeds its theoretical bounds, look for data duplication before doubting the formula. The math was right — it was getting the wrong inputs.
The Final Implementation
The corrected view computes cosine similarity in three CTEs:
-- 1. occ_skill_scores: one importance score per occupation × skill
-- (AVG deduplicates "All Other" catch-all rows)
-- 2. norms: L2 norm per occupation (√Σ importance²)
-- 3. dot_products: pairwise dot products (self-join on element_id)
-- Final SELECT:
ROUND(dot_product / (norm_a * norm_b), 4) AS similarity_score
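The same three-stage pipeline can be mirrored in Python as a sanity check (illustrative data; column names follow the SQL above):

```python
from math import sqrt
from collections import defaultdict

# Raw bridge rows (occupation_key, element_id, data_value),
# including an "All Other" style duplicate
bridge = [
    (1, "2.A.1.1", 3.75), (1, "2.A.1.2", 1.62),
    (2, "2.A.1.1", 4.25), (2, "2.A.1.2", 2.12),
    (3, "2.A.1.1", 3.00), (3, "2.A.1.1", 3.00),  # duplicate row
    (3, "2.A.1.2", 2.50),
]

# CTE 1: occ_skill_scores, AVG + GROUP BY collapses duplicates
grouped = defaultdict(list)
for occ, elem, val in bridge:
    grouped[(occ, elem)].append(val)
scores = {k: sum(v) / len(v) for k, v in grouped.items()}

# CTE 2: norms, L2 norm per occupation (sqrt of sum of squared importances)
sumsq = defaultdict(float)
for (occ, _elem), v in scores.items():
    sumsq[occ] += v * v
norms = {occ: sqrt(s) for occ, s in sumsq.items()}

# CTE 3: dot_products, pairwise join on element_id
dots = defaultdict(float)
for (o1, e1), v1 in scores.items():
    for (o2, e2), v2 in scores.items():
        if o1 < o2 and e1 == e2:
            dots[(o1, o2)] += v1 * v2

# Final SELECT: similarity_score, bounded by 1.0 once the input is deduplicated
similarity = {pair: round(d / (norms[pair[0]] * norms[pair[1]]), 4)
              for pair, d in dots.items()}
print(similarity)
```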
The view column is still named jaccard_similarity in the SQL to avoid a
cascading rename across the codebase. The API exposes it as similarity_score.
Choosing a Similarity Algorithm: Decision Guide
| Data Structure | Algorithm | Example |
|---|---|---|
| Binary presence/absence | Jaccard similarity | Recipe ingredients, tag sets |
| Continuous scores, same features | Cosine similarity | Skill ratings, TF-IDF vectors |
| Continuous scores, magnitude matters | Euclidean distance | Salary comparisons, physical measurements |
| Rankings or ordinal data | Spearman correlation | Preference rankings, survey responses |
Key takeaway: Before choosing a similarity algorithm, understand your data's structure. Binary presence/absence metrics are useless when every item has every feature rated. Use score-based similarity (cosine, correlation) when values carry the signal, not membership.
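The four rows of the guide, as minimal reference implementations on toy inputs (not O*NET data):

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)                        # binary membership

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def euclidean(a, b) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # magnitude matters

def spearman(a, b) -> float:
    # Rank correlation for untied data: 1 - 6*Σd² / (n(n² - 1))
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(jaccard({"flour", "yeast"}, {"flour", "sugar"}))  # 1/3
print(round(cosine([3.75, 1.62], [4.25, 2.12]), 4))     # direction of scores
print(euclidean([50_000, 40], [60_000, 45]))            # dominated by salary scale
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))         # 1.0, same ordering
```

Note how Euclidean distance is dominated by the largest-scale feature, which is why it suits cases where magnitude itself is the signal.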