Choosing the Right Similarity Algorithm
The Problem: Dancer = Lawyer?
The occupation similarity feature was designed to answer a simple question: which occupations require similar skills? The initial implementation used Jaccard similarity, a classic set-based metric that measures overlap between two sets. It should have worked.
But every occupation showed 100% similarity to every other occupation. Dancers were identical to Pile Driver Operators. Bakers matched Lawyers. The entire similarity matrix was useless.
Root cause: The algorithm was correct — the data model assumption was wrong. Jaccard similarity measures set overlap, but O*NET doesn't use sets.
Why Jaccard Fails on O*NET Skills
Jaccard similarity computes |A ∩ B| / |A ∪ B|: the size of the intersection divided by the size of the union. It works when items either belong to a set or don't: a recipe either uses flour or it doesn't.
O*NET doesn't work this way. It rates all 35 skills for every occupation on importance scales from 1 to 5. Every occupation has every skill — they differ in how much each skill matters, not in which skills they have.
# What the data actually looks like:
# Dancer (27-2031)
Active Listening IM = 3.75
Mathematics IM = 1.62
Programming IM = 1.12
# Lawyer (23-1011)
Active Listening IM = 4.25
Mathematics IM = 2.12
Programming IM = 1.25
# Both have ALL 35 skills rated.
# Jaccard sees: |{35 skills} ∩ {35 skills}| / |{35 skills} ∪ {35 skills}|
# = 35/35 = 1.0 (100% similar)
When every item has every feature, set-based metrics collapse to a constant. The signal is in the scores, not the membership.
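The collapse is easy to reproduce. A minimal Python sketch (the skill names and scores are illustrative, trimmed to three of the 35 skills):

```python
def jaccard(a: set, b: set) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Jaccard works when membership actually varies:
bread = {"flour", "water", "yeast", "salt"}
cake = {"flour", "water", "sugar", "eggs"}
print(jaccard(bread, cake))  # 2/6 ≈ 0.33, genuinely partial overlap

# But every O*NET occupation has every skill rated, so the SETS are identical:
dancer_skills = {"Active Listening", "Mathematics", "Programming"}
lawyer_skills = {"Active Listening", "Mathematics", "Programming"}
print(jaccard(dancer_skills, lawyer_skills))  # 1.0, and the same for ANY pair
```

Because membership never varies, the importance scores never enter the computation at all.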
Cosine Similarity: Using the Scores
Cosine similarity treats each occupation's skill importance scores as a vector in 35-dimensional space. It measures the angle between two vectors — occupations that emphasize the same skills at similar levels point in the same direction.
cosine(A, B) = (A · B) / (||A|| × ||B||)
# Dancer vs. Choreographer (high similarity)
Both emphasize coordination, social perceptiveness, time management
→ cosine ≈ 0.98
# Dancer vs. Lawyer (low similarity)
Dancer: high coordination, monitoring; low critical thinking, writing
Lawyer: high critical thinking, writing; low coordination, monitoring
→ cosine ≈ 0.82
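A minimal sketch of the computation, using only the three sampled importance scores from above. Note that with just three dimensions the two vectors are nearly parallel; the ≈0.82 figure comes from the full 35-dimensional profiles:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """cosine(A, B) = (A · B) / (||A|| * ||B||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Importance scores for Active Listening, Mathematics, Programming
dancer = [3.75, 1.62, 1.12]
lawyer = [4.25, 2.12, 1.25]
print(round(cosine(dancer, lawyer), 4))  # 0.9985 in 3 dims; full 35-dim vectors separate further
```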
The range is 0 to 1 (since all importance scores are positive). Values above 0.95 indicate genuinely similar skill profiles; values below 0.85 indicate meaningfully different ones.
The Hidden Duplicate Problem
Switching to cosine similarity revealed a second bug: some occupations had similarity scores greater than 1.0. Cosine similarity is bounded by 1.0, so something was inflating the dot product.
The cause: "All Other" catch-all occupations (e.g., 17-2199
Engineers, All Other) had up to 8 duplicate rows per skill in the
bridge_occupation_skill table. Each duplicate multiplied the dot product
without increasing the norm proportionally.
-- Before fix: duplicates inflate the dot product
SELECT b.occupation_key, s.element_id, COUNT(*) AS n
FROM bridge_occupation_skill b
JOIN dim_skill s ON b.skill_key = s.skill_key
WHERE b.scale_id = 'IM'
GROUP BY b.occupation_key, s.element_id
HAVING COUNT(*) > 1;
-- Returns rows with n = 2, 4, 6, 8 for "All Other" occupations
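The inflation mechanism can be reproduced outside SQL. In this sketch (hypothetical occupations and scores), the dot product is computed join-style, so a duplicated row pairs twice, while the norm grows only by the square root of the duplication:

```python
from math import sqrt
from collections import defaultdict

# (occupation, skill, importance); occupation "B" has a duplicated row
rows = [
    ("A", "listening", 4.0), ("A", "math", 2.0),
    ("B", "listening", 3.0), ("B", "listening", 3.0),  # duplicate!
    ("B", "math", 2.5),
]

def cosine_from_rows(rows, occ1, occ2):
    # Norms over raw rows: the duplicate adds its square a second time
    sumsq = defaultdict(float)
    for occ, _skill, v in rows:
        sumsq[occ] += v * v
    # Dot product via a join on skill: every matching row PAIR contributes,
    # so the duplicated row doubles its term in the sum
    dot = sum(v1 * v2
              for o1, s1, v1 in rows if o1 == occ1
              for o2, s2, v2 in rows if o2 == occ2 and s2 == s1)
    return dot / (sqrt(sumsq[occ1]) * sqrt(sumsq[occ2]))

print(cosine_from_rows(rows, "A", "B"))  # > 1.0: the "impossible" similarity
```

Dropping the duplicate row (or averaging it away, as the fix below does) brings the score back inside the 0-to-1 bound.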
-- Fix: AVG + GROUP BY deduplicates before computing similarity
WITH occ_skill_scores AS (
    SELECT b.occupation_key, s.element_id,
           AVG(b.data_value) AS importance       -- ← collapses duplicates
    FROM bridge_occupation_skill b
    JOIN dim_skill s ON b.skill_key = s.skill_key
    WHERE b.scale_id = 'IM'
    GROUP BY b.occupation_key, s.element_id      -- ← one row per skill
)
Lesson: When your metric exceeds its theoretical bounds, look for data duplication before doubting the formula. The math was right — it was getting the wrong inputs.
The Final Implementation
The corrected view computes cosine similarity in three CTEs:
-- 1. occ_skill_scores: one importance score per occupation × skill
-- (AVG deduplicates "All Other" catch-all rows)
-- 2. norms: L2 norm per occupation (√Σ importance²)
-- 3. dot_products: pairwise dot products (self-join on element_id)
-- Final SELECT:
ROUND(dot_product / (norm_a * norm_b), 4) AS similarity_score
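The same three-stage pipeline can be mirrored in Python as a sanity check (illustrative data; column names follow the SQL above):

```python
from math import sqrt
from collections import defaultdict

# Raw bridge rows (occupation_key, element_id, data_value),
# including an "All Other" style duplicate
bridge = [
    (1, "2.A.1.1", 3.75), (1, "2.A.1.2", 1.62),
    (2, "2.A.1.1", 4.25), (2, "2.A.1.2", 2.12),
    (3, "2.A.1.1", 3.00), (3, "2.A.1.1", 3.00),  # duplicate row
    (3, "2.A.1.2", 2.50),
]

# CTE 1: occ_skill_scores, AVG + GROUP BY collapses duplicates
grouped = defaultdict(list)
for occ, elem, val in bridge:
    grouped[(occ, elem)].append(val)
scores = {k: sum(v) / len(v) for k, v in grouped.items()}

# CTE 2: norms, L2 norm per occupation (sqrt of sum of squared importances)
sumsq = defaultdict(float)
for (occ, _elem), v in scores.items():
    sumsq[occ] += v * v
norms = {occ: sqrt(s) for occ, s in sumsq.items()}

# CTE 3: dot_products, pairwise join on element_id
dots = defaultdict(float)
for (o1, e1), v1 in scores.items():
    for (o2, e2), v2 in scores.items():
        if o1 < o2 and e1 == e2:
            dots[(o1, o2)] += v1 * v2

# Final SELECT: similarity_score, bounded by 1.0 once the input is deduplicated
similarity = {pair: round(d / (norms[pair[0]] * norms[pair[1]]), 4)
              for pair, d in dots.items()}
print(similarity)
```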
The view column is still named jaccard_similarity in the SQL to avoid a
cascading rename across the codebase. The API exposes it as similarity_score.
Choosing a Similarity Algorithm: Decision Guide
| Data Structure | Algorithm | Example |
|---|---|---|
| Binary presence/absence | Jaccard similarity | Recipe ingredients, tag sets |
| Continuous scores, same features | Cosine similarity | Skill ratings, TF-IDF vectors |
| Continuous scores, magnitude matters | Euclidean distance | Salary comparisons, physical measurements |
| Rankings or ordinal data | Spearman correlation | Preference rankings, survey responses |
Key takeaway: Before choosing a similarity algorithm, understand your data's structure. Binary presence/absence metrics are useless when every item has every feature rated. Use score-based similarity (cosine, correlation) when values carry the signal, not membership.
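The four rows of the guide, as minimal reference implementations on toy inputs (not O*NET data):

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)                        # binary membership

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def euclidean(a, b) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # magnitude matters

def spearman(a, b) -> float:
    # Rank correlation for untied data: 1 - 6*Σd² / (n(n² - 1))
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(jaccard({"flour", "yeast"}, {"flour", "sugar"}))  # 1/3
print(round(cosine([3.75, 1.62], [4.25, 2.12]), 4))     # direction of scores
print(euclidean([50_000, 40], [60_000, 45]))            # dominated by salary scale
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))         # 1.0, same ordering
```

Note how Euclidean distance is dominated by the largest-scale feature, which is why it suits cases where magnitude itself is the signal.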