James Swirhun
Credible

Entity Matching with Embeddings

TL;DR: Entity matching—linking records that refer to the same real-world thing—is one of the most expensive data problems in business. Every scaled company has some version of it, and they all follow the same painful arc: start with exact matching, pile on heuristics, end up with brittle scripts nobody wants to touch. Embedding models dramatically outperform string-matching approaches, but the real unlock isn't the model — it's having a pipeline you can evaluate, iterate on, and trust. This post shows how.

The Entity Matching Problem

If you've worked at a scaled data company, you've felt this pain. Somewhere in your organization, a team is maintaining a brittle collection of scripts, lookup tables, and manual review queues — all to answer a question that seems like it should be simple: "are these two records the same thing?"

  • E-commerce & retail: Product listings from retailers and POS systems ("Apple iPhone 15 Pro Max 256GB Natural Titanium," "IPHONE 15 PRO MAX 256 NAT TI") must match canonical product catalogs
  • Customer data: Records from different systems (CRM, billing, support) must be linked to the same customer
  • Media measurement: Publisher-submitted program names ("The Championship Game," "Championship Game Live," "The Big Game") must match canonical TV schedules

Every domain has its version of this challenge. And they all follow the same evolution:

  1. Start with exact matching—works for 60% of cases
  2. Add normalization rules—lowercase, trim whitespace, remove punctuation
  3. Add fuzzy matching—Jaro-Winkler, Levenshtein distance, n-gram overlap
  4. Add heuristics—regex patterns, lookup tables, special-case handling
  5. End up with 500 lines of SQL nobody wants to touch

The real problem isn't the matching algorithm—it's that the complexity becomes unmaintainable. These pipelines break silently, they resist iteration, and they scale with headcount rather than data. Every new edge case means another rule. Every new data source means another integration. The team that owns it becomes a bottleneck, and everyone else waits.

Why Embeddings Change Everything

Embedding models encode semantic similarity. They convert text into dense vectors where similar concepts cluster together—regardless of surface-level word overlap.

| Record A | Record B | String Similarity | Embedding Similarity |
|---|---|---|---|
| "Apple iPhone 15 Pro" | "iPhone 15 Pro - Apple" | ~0.68 | ~0.97 |
| "Tito's HM Vodka 1.5oz" | "TITOS VODKA" | ~0.47 | ~0.82 |
| "Nike Air Max 90 White" | "AIR MAX 90 WHT - NIKE" | ~0.47 | ~0.86 |
| "Samsung Galaxy S24 Ultra" | "Galaxy S24U - Samsung" | ~0.62 | ~0.87 |

String-matching heuristics fail when the same entity has different names, abbreviations, or word order. Embeddings capture that "Galaxy S24U - Samsung" and "Samsung Galaxy S24 Ultra" refer to the same product — even when the surface-level text looks quite different.
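To make the gap concrete, here is a small Python sketch (not part of the Malloy pipeline) comparing a character-level string metric against cosine similarity over embedding vectors. The 3-dimensional vectors are toy stand-ins for real model output:

```python
import math
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    # Character-level similarity, a rough stand-in for Jaro-Winkler / n-gram metrics
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Same product, different surface text: the string metric scores it low
print(string_similarity("Samsung Galaxy S24 Ultra", "Galaxy S24U - Samsung"))

# Embeddings of the two strings (toy 3-d vectors here) can still be nearly parallel
emb_a = [0.80, 0.10, 0.20]   # hypothetical embedding of Record A
emb_b = [0.79, 0.12, 0.18]   # hypothetical embedding of Record B
print(cosine_similarity(emb_a, emb_b))   # > 0.99
```

In practice the vectors come from a real embedding model, but the comparison math is exactly this: word order and abbreviations wreck the character-level score while barely moving the cosine.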

But bolting embeddings onto an already-tangled pipeline just adds another layer of complexity: API calls, batching logic, result storage, monitoring. What you need is a foundation that makes embedding operations understandable, auditable, and measurable.

This post builds on two previous pieces—Rethinking Data Transformation with Malloy (where persistence as a language feature replaces external materialization config) and The Data Warehouse Became an ML Platform (a structured framework for warehouse-native ML pipelines). With those foundations, entity matching becomes tractable. The entire workflow, from exploratory analysis through embedding generation to production-grade evaluation, lives in a single tool and language.

The Pattern: Embedding-Based Entity Matching

Every embedding-based matching workflow follows the same five-step structure:

1. Feature Engineering    → Build match strings from source data

2. Embedding Generation   → Convert match strings to vectors (persist!)

3. Candidate Filtering    → Join on shared attributes to narrow the search space

4. Similarity Scoring     → Compute embedding distance + supplementary signals

5. Threshold & Evaluation → Score candidates, measure accuracy, iterate

Let's walk through a complete example.

Example: Product Catalog Matching for E-Commerce

You're an e-commerce platform aggregating inventory from hundreds of retailers. Each retailer submits product data in their own format:

  • Retailer A: "Apple iPhone 15 Pro Max 256GB Natural Titanium"
  • Retailer B: "iPhone 15 Pro Max - 256 GB (Natural Titanium) - Apple"
  • Retailer C: "APPLE iPHONE 15 PRO MAX 256GB NAT TITANIUM"

These all refer to the same SKU in your supplier catalog, but exact matching fails and regex normalization becomes a maintenance nightmare.

Step 1: Feature Engineering

Combine key product attributes into match strings, and normalize categories for candidate filtering:

source: retailer_listings is warehouse.table('retailer.products') extend {
  dimension:
    // Combine key attributes into a single string for semantic matching
    match_string is concat(
      coalesce(brand, ''), ' ',
      coalesce(product_name, ''), ' ',
      coalesce(variant, ''), ' ',
      coalesce(size, '')
    )
 
    // Normalized category for filtering candidates (join efficiency)
    category_normalized is lower(trim(category))
}

The supplier catalog source follows the same pattern, using product_description as its match string.

Step 2: Generate and Persist Embeddings

The expensive step—generating embeddings—gets the #@ persist tag so results are saved as a warehouse table and don't regenerate on every pipeline run. (For background on how materialization works as a language feature in Malloy, see Rethinking Data Transformation.)

#@ persist name=retailer_embeddings
source: retailer_with_embeddings is warehouse.sql("""
  SELECT * FROM ML.GENERATE_EMBEDDING(
    MODEL `project.product_embedding_model`,
    (SELECT match_string AS content, listing_id FROM %{ retailer_listings }%),
    STRUCT(TRUE AS flatten_json_output, 'SEMANTIC_SIMILARITY' AS task_type)
  )
""") extend {
  join_one: retailer_listings on listing_id = retailer_listings.listing_id
  dimension:
    embedding is ml_generate_embedding_result
}

Generate corresponding embeddings for the supplier catalog side. Now embeddings persist—subsequent queries read from the persisted table instead of recomputing them.
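Outside Malloy, this persist-then-reuse behavior is plain caching. A hypothetical Python sketch of the idea (file-based cache standing in for the persisted warehouse table, and a made-up `embed_fn` standing in for the model API):

```python
import json
import os

def get_embeddings(texts, embed_fn, cache_path="retailer_embeddings.json"):
    """Call the (expensive) embedding model once, then reuse the persisted result.

    `embed_fn` is a placeholder for whatever model API you call; the cache file
    plays the role of the persisted warehouse table.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)                    # cheap: read persisted vectors
    embeddings = {t: embed_fn(t) for t in texts}   # expensive: hit the model
    with open(cache_path, "w") as f:
        json.dump(embeddings, f)
    return embeddings
```

Subsequent runs read the file instead of re-calling the model, which is the same behavior the `#@ persist` tag gives you at warehouse scale.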

Step 3: Narrow the Search Space

Before computing similarity, filter candidates by joining on product category. You only want to compare electronics against electronics, not electronics against apparel:

source: matching_candidates is retailer_with_embeddings extend {
  join_many: supplier_with_embeddings
    on category_normalized = supplier_with_embeddings.category_normalized
}

This step is crucial. Without it, you'd compute similarity for every retailer listing against every supplier SKU — the comparison count explodes quadratically. Filtering on categories, brands, or other coarse attributes keeps the problem tractable.
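A quick back-of-the-envelope sketch (plain Python, synthetic data) shows why blocking matters: with both sides split evenly across 10 categories, a single blocking key cuts the candidate pairs by 10x:

```python
from collections import defaultdict

def comparison_counts(left, right, key):
    """Candidate-pair counts without and with blocking on `key`."""
    unblocked = len(left) * len(right)                 # full cross join
    by_key = defaultdict(list)
    for r in right:
        by_key[key(r)].append(r)
    blocked = sum(len(by_key[key(row)]) for row in left)  # only same-key pairs
    return unblocked, blocked

retailer = [{"id": i, "category": i % 10} for i in range(1000)]
supplier = [{"id": i, "category": i % 10} for i in range(1000)]
un, bl = comparison_counts(retailer, supplier, key=lambda r: r["category"])
print(un, bl)   # 1000000 100000
```

The reduction compounds as you add blocking keys (category and brand, say), which is what keeps similarity scoring affordable at catalog scale.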

Step 4: Similarity Scoring

Combine embedding similarity with supplementary signals:

source: scored_matches is matching_candidates extend {
  dimension:
    // Primary: cosine similarity between embeddings
    product_similarity is 1 - `ML.DISTANCE`!number(
      embedding, supplier_with_embeddings.embedding, 'COSINE'
    )
 
    // Supplementary signal: exact brand match
    brand_match is
      pick 1 when lower(brand) = lower(supplier_with_embeddings.brand)
      else 0
 
    // Blended score
    match_score is
      pick (product_similarity + brand_match) / 2 when brand_match > 0
      else product_similarity
 
    // Rank candidates per retailer listing
    match_rank is row_number() over (
      partition by listing_id order by match_score desc
    )
}

Notice that the logic—which signals to combine, how to weight them—is explicit in the code. No opaque Python function, no mystery scoring buried in a stored procedure.
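For readers who want to sanity-check the arithmetic, the same scoring logic sketched in Python (the weights and signals mirror the Malloy above; nothing here is a fixed recipe):

```python
def blended_score(similarity: float, retailer_brand: str, supplier_brand: str) -> float:
    """Average embedding similarity with an exact-brand signal when brands agree."""
    brand_match = 1 if retailer_brand.lower() == supplier_brand.lower() else 0
    return (similarity + brand_match) / 2 if brand_match else similarity

def rank_candidates(candidates):
    """Order one listing's candidates best-first; index 0 is match_rank = 1."""
    return sorted(candidates, key=lambda c: c["match_score"], reverse=True)

print(blended_score(0.84, "Nike", "NIKE"))    # brand agrees: averaged up toward 1.0
print(blended_score(0.84, "Nike", "Adidas"))  # brand differs: similarity passes through
```

Because the blend is just an average with a binary signal, swapping in different weights or extra signals (size match, price proximity) is a one-line change you can evaluate immediately.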

Step 5: Apply Thresholds

#@ persist name=product_matches
source: final_matches is scored_matches extend {
  dimension:
    is_confident_match is match_score > 0.90 and match_rank = 1   // Automate
    needs_review is match_score > 0.75 and match_score <= 0.90 and match_rank = 1  // Human review
    no_match is match_score <= 0.75 or match_rank > 1             // No match
}

Notice materialization at two layers: embeddings are materialized so they don't regenerate, and final match results are materialized so downstream dashboards and exports read from a table, not a live computation. (Malloy uses the #@ persist annotation for materialization — see Rethinking Data Transformation for a deeper dive.) When you tune thresholds or add signals, only the layers that changed need to rebuild.
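The threshold routing reduces to a tiny decision function. A Python sketch (the 0.90 / 0.75 cutoffs are the illustrative values from above, not universal constants):

```python
def route_match(match_score: float, match_rank: int) -> str:
    """Route a candidate: automate, send to human review, or discard."""
    if match_rank != 1:
        return "no_match"        # only the top-ranked candidate is eligible
    if match_score > 0.90:
        return "confident"       # safe to automate
    if match_score > 0.75:
        return "needs_review"    # human-in-the-loop queue
    return "no_match"
```

Tuning these cutoffs is exactly the kind of cheap experiment the persisted embedding layer enables: only this layer rebuilds.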

Evaluation: Measuring Match Quality

Here's why legacy pipelines stay stuck: even if you could swap in an embedding model, how would you know it's better? String-matching heuristics carry a false confidence: a rule either fires or it doesn't, so there is nothing to measure. Embedding-based matching is probabilistic. You need to measure accuracy, not just inspect individual results. (This is the same evaluation discipline described in Warehouse-Native ML Pipelines, applied specifically to entity matching.)

Evaluation is what enables iteration. Every failed match is a signal to improve the model — capture the gap, refine the feature engineering, adjust the thresholds, and measure the impact. Without this feedback loop, you're guessing. With it, your pipeline gets better every day.

Golden Test Sets and Top-N Accuracy

Create a representative sample with human-reviewed correct answers, then join your scored matches against it:

source: evaluation is scored_matches extend {
  join_one: golden_test_set on listing_id = golden_test_set.source_record_id
 
  dimension:
    is_correct is supplier_sku = golden_test_set.correct_target_id
    is_in_golden_set is golden_test_set.source_record_id is not null
 
  measure:
    golden_samples is count() { where: is_in_golden_set and match_rank = 1 }
    correct_at_rank_1 is count() { where: is_correct and match_rank = 1 }
}

From this, you can build dashboards showing Top-1 Accuracy (how often the top-ranked match is correct—your automation ceiling) and Top-3 or Top-5 Accuracy (how useful a shortlist is for human-in-the-loop review).
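In plain Python, Top-N accuracy over a golden set is a short function. Here `scored` maps each listing to (candidate, score) pairs and `golden` holds the human-verified answer; both shapes are hypothetical, chosen only to illustrate the metric:

```python
def top_n_accuracy(scored, golden, n=1):
    """Fraction of golden listings whose correct target appears in the top-n candidates."""
    hits = 0
    for listing_id, correct_target in golden.items():
        ranked = sorted(scored.get(listing_id, []), key=lambda c: c[1], reverse=True)
        hits += correct_target in [target for target, _ in ranked[:n]]
    return hits / len(golden)

scored = {
    "L1": [("SKU-9", 0.97), ("SKU-4", 0.81)],   # correct answer ranked first
    "L2": [("SKU-2", 0.88), ("SKU-7", 0.86)],   # correct answer ranked second
}
golden = {"L1": "SKU-9", "L2": "SKU-7"}
print(top_n_accuracy(scored, golden, n=1))   # 0.5
print(top_n_accuracy(scored, golden, n=3))   # 1.0
```

The gap between Top-1 and Top-3 is itself informative: a large gap means the right answer is usually in the shortlist, so human review will be fast even where automation isn't safe.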

Error Analysis

When matches look wrong, you need to understand why. A simple error review query surfaces the source record, the correct answer, and the top candidates the algorithm returned with their similarity scores. Patterns emerge: maybe the embedding model struggles with abbreviations, or the blocking key is too restrictive, or certain categories need different feature engineering.

This error analysis drives a tight iteration loop:

  1. Run evaluation against the golden test set
  2. Identify failure patterns
  3. Hypothesize improvements (adjust features, tune thresholds, add signals)
  4. Implement changes in Malloy code
  5. Re-run evaluation to measure impact

Every experiment is version-controlled. Evaluations run at warehouse scale. And because embeddings are persisted, downstream experiments run in seconds—you're tuning thresholds and feature weights, not waiting for regeneration. The entire iteration loop — explore data, adjust features, re-score candidates, measure accuracy — happens in Malloy without switching between a Python notebook for experimentation and a SQL pipeline for production. When the numbers look right, the same code you used to explore and evaluate is what runs in production — no translation step, no handoff. One language, one workflow, from exploration to deployment.

From Heuristics to State-of-the-Art

Here's what changes when you build entity matching this way:

| Legacy Approach | Malloy + Embeddings |
|---|---|
| 500-line SQL files nobody wants to touch | Composable dimensions that build on each other |
| String-matching heuristics with endless edge cases | Embedding models that handle semantic variation |
| No way to measure if changes improve accuracy | Evaluation dashboards with golden test sets |
| Expensive to experiment (regenerate everything) | Persistence makes iteration affordable |
| Ad-hoc Python scripts for ML operations | Warehouse-native ML called from readable code |
| Separate tools for exploration, ML, and production | One language from analysis to production deployment |

Conclusion

Entity matching doesn't have to stay stuck in the past. The barrier isn't that better approaches don't exist—embedding models dramatically outperform string-matching heuristics. The barrier is that adoption requires pipelines that are governed, evaluable, and auditable. Malloy provides that foundation: readable transformation code, warehouse-native ML, materialization for fast experimentation, and built-in evaluation to prove that changes actually improve accuracy. You can finally adopt state-of-the-art approaches and iterate with confidence, instead of accumulating edge-case heuristics forever.