TL;DR: Entity matching—linking records that refer to the same real-world thing—is one of the most expensive data problems in business. Every scaled company has some version of it, and they all follow the same painful arc: start with exact matching, pile on heuristics, end up with brittle scripts nobody wants to touch. Embedding models dramatically outperform string-matching approaches, but the real unlock isn't the model — it's having a pipeline you can evaluate, iterate on, and trust. This post shows how.
The Entity Matching Problem
If you've worked at a scaled data company, you've felt this pain. Somewhere in your organization, a team is maintaining a brittle collection of scripts, lookup tables, and manual review queues — all to answer a question that seems like it should be simple: "are these two records the same thing?"
- E-commerce & retail: Product listings from retailers and POS systems ("Apple iPhone 15 Pro Max 256GB Natural Titanium," "IPHONE 15 PRO MAX 256 NAT TI") must match canonical product catalogs
- Customer data: Records from different systems (CRM, billing, support) must be linked to the same customer
- Media measurement: Publisher-submitted program names ("The Championship Game," "Championship Game Live," "The Big Game") must match canonical TV schedules
Every domain has its version of this challenge. And they all follow the same evolution:
- Start with exact matching—works for 60% of cases
- Add normalization rules—lowercase, trim whitespace, remove punctuation
- Add fuzzy matching—Jaro-Winkler, Levenshtein distance, n-gram overlap
- Add heuristics—regex patterns, lookup tables, special-case handling
- End up with 500 lines of SQL nobody wants to touch
The real problem isn't the matching algorithm—it's that the complexity becomes unmaintainable. These pipelines break silently, they resist iteration, and they scale with headcount rather than data. Every new edge case means another rule. Every new data source means another integration. The team that owns it becomes a bottleneck, and everyone else waits.
Why Embeddings Change Everything
Embedding models encode semantic similarity. They convert text into dense vectors where similar concepts cluster together—regardless of surface-level word overlap.
String-matching heuristics fail when the same entity has different names, abbreviations, or word order. Embeddings capture that "Galaxy S24U - Samsung" and "Samsung Galaxy S24 Ultra" refer to the same product — even when the surface-level text looks quite different.
But bolting embeddings onto an already-tangled pipeline just adds another layer of complexity: API calls, batching logic, result storage, monitoring. What you need is a foundation that makes embedding operations understandable, auditable, and measurable.
This post builds on two previous pieces—Rethinking Data Transformation with Malloy (where persistence as a language feature replaces external materialization config) and The Data Warehouse Became an ML Platform (a structured framework for warehouse-native ML pipelines). With those foundations, entity matching becomes tractable. The entire workflow, from exploratory analysis through embedding generation to production-grade evaluation, lives in a single tool and language.
The Pattern: Embedding-Based Entity Matching
Every embedding-based matching workflow follows the same five-step structure:
1. Feature Engineering → Build match strings from source data
2. Embedding Generation → Convert match strings to vectors (persist!)
3. Candidate Filtering → Join on shared attributes to narrow the search space
4. Similarity Scoring → Compute embedding distance + supplementary signals
5. Threshold & Evaluation → Score candidates, measure accuracy, iterate
Let's walk through a complete example.
Example: Product Catalog Matching for E-Commerce
You're an e-commerce platform aggregating inventory from hundreds of retailers. Each retailer submits product data in their own format:
- Retailer A: "Apple iPhone 15 Pro Max 256GB Natural Titanium"
- Retailer B: "iPhone 15 Pro Max - 256 GB (Natural Titanium) - Apple"
- Retailer C: "APPLE iPHONE 15 PRO MAX 256GB NAT TITANIUM"
These all refer to the same SKU in your supplier catalog, but exact matching fails and regex normalization becomes a maintenance nightmare.
Step 1: Feature Engineering
Combine key product attributes into match strings, and normalize categories for candidate filtering:
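The exact schema varies by source; a minimal sketch, assuming a DuckDB-backed `retailer_listings` table with `brand`, `product_name`, `storage`, `color`, and `category` columns (all names illustrative), might look like:

```malloy
source: retailer_listings is duckdb.table('retailer_listings.parquet') extend {
  dimension:
    // The text the embedding model will see for this listing
    match_string is concat(brand, ' ', product_name, ' ', storage, ' ', color)
    // Normalized category, used later as the candidate-filtering key
    category_norm is lower(trim(category))
}
```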
The supplier catalog source follows the same pattern, using product_description as its match string.
Step 2: Generate and Persist Embeddings
The expensive step—generating embeddings—gets the #@ persist tag so results are saved as a warehouse table and don't regenerate on every pipeline run. (For background on how materialization works as a language feature in Malloy, see Rethinking Data Transformation.)
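A sketch of that step, assuming the `retailer_listings` source above and a warehouse-side embedding function — here a hypothetical `generate_embedding`, invoked through Malloy's raw-function passthrough; in BigQuery this role would be played by something like `ML.GENERATE_EMBEDDING`:

```malloy
// #@ persist materializes this query's result as a warehouse table,
// so embeddings are generated once, not on every downstream run.
#@ persist
source: retailer_embeddings is retailer_listings -> {
  select:
    listing_id
    brand
    match_string
    category_norm
    // Hypothetical warehouse embedding call; the real function name and
    // its return type depend on your warehouse dialect.
    embedding is generate_embedding!(match_string)
}
```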
Generate corresponding embeddings for the supplier catalog side. Now embeddings persist—subsequent queries read from the persisted table instead of recomputing them.
Step 3: Narrow the Search Space
Before computing similarity, filter candidates by joining on product category. You only want to compare electronics against electronics, not electronics against apparel:
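A sketch of the blocking join, assuming persisted embedding sources named `retailer_embeddings` and `catalog_embeddings` that each carry a `category_norm` dimension:

```malloy
source: match_candidates is retailer_embeddings extend {
  // Blocking join: only listings and SKUs in the same normalized
  // category become candidate pairs.
  join_many: catalog is catalog_embeddings
    on category_norm = catalog.category_norm
}
```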
This step is crucial. Without it, you'd compute similarity for every retailer listing against every supplier SKU — the comparison count explodes quadratically. Filtering on categories, brands, or other coarse attributes keeps the problem tractable.
Step 4: Similarity Scoring
Combine embedding similarity with supplementary signals:
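One way the scoring query can be sketched, with `cosine_similarity` standing in for whatever vector-distance function your warehouse exposes, and weights that are illustrative starting points rather than tuned values:

```malloy
query: scored_matches is match_candidates -> {
  select:
    listing_id
    sku is catalog.sku
    // Hypothetical warehouse function, via raw-function passthrough
    embedding_sim is cosine_similarity!(embedding, catalog.embedding)
    // Supplementary signal: exact brand agreement
    brand_match is pick 1 when brand = catalog.brand else 0
} -> {
  select:
    *
    // Illustrative weighting; tune against the golden test set
    match_score is embedding_sim * 0.8 + brand_match * 0.2
}
```

The weights and signals here are placeholders; the point is that they live in readable code rather than a stored procedure.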
Notice that the logic—which signals to combine, how to weight them—is explicit in the code. No opaque Python function, no mystery scoring buried in a stored procedure.
Step 5: Apply Thresholds
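A sketch of the final layer, assuming the hypothetical `scored_matches` query from the scoring step; the 0.85 cutoff is purely illustrative:

```malloy
// Persist final matches so dashboards and exports read a table,
// not a live computation.
#@ persist
source: product_matches is scored_matches -> {
  where: match_score > 0.85  // illustrative; tune via evaluation
  select: *
}
```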
Notice materialization at two layers: embeddings are materialized so they don't regenerate, and final match results are materialized so downstream dashboards and exports read from a table, not a live computation. (Malloy uses the #@ persist annotation for materialization — see Rethinking Data Transformation for a deeper dive.) When you tune thresholds or add signals, only the layers that changed need to rebuild.
Evaluation: Measuring Match Quality
Here's why legacy pipelines stay stuck: even if you could swap in an embedding model, how would you know it's better? String-matching heuristics carry a false confidence: a rule either matches or it doesn't. Embedding-based matching is probabilistic, so you need to measure accuracy, not just inspect individual results. (This is the same evaluation discipline described in Warehouse-Native ML Pipelines, applied specifically to entity matching.)
Evaluation is what enables iteration. Every failed match is a signal to improve the model — capture the gap, refine the feature engineering, adjust the thresholds, and measure the impact. Without this feedback loop, you're guessing. With it, your pipeline gets better every day.
Golden Test Sets and Top-N Accuracy
Create a representative sample with human-reviewed correct answers, then join your scored matches against it:
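A sketch, assuming a hypothetical human-reviewed `golden_matches` table mapping each `listing_id` to its `correct_sku`, and a `scored_matches` query from the scoring step:

```malloy
source: eval is scored_matches extend {
  join_one: golden is golden_matches on listing_id = golden.listing_id
}

// For each golden listing, keep the top-ranked candidate, then measure
// how often it agrees with the reviewed answer.
run: eval -> {
  group_by: listing_id, correct_sku is golden.correct_sku
  nest: best is {
    select: sku, match_score
    order_by: match_score desc
    limit: 1
  }
} -> {
  aggregate:
    top1_accuracy is avg(pick 1 when best.sku = correct_sku else 0)
}
```

Extending `limit: 1` to `limit: 3` or `limit: 5` gives the Top-N variants.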
From this, you can build dashboards showing Top-1 Accuracy (how often the top-ranked match is correct—your automation ceiling) and Top-3 or Top-5 Accuracy (how useful a shortlist is for human-in-the-loop review).
Error Analysis
When matches look wrong, you need to understand why. A simple error review query surfaces the source record, the correct answer, and the top candidates the algorithm returned with their similarity scores. Patterns emerge: maybe the embedding model struggles with abbreviations, or the blocking key is too restrictive, or certain categories need different feature engineering.
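One shape such a query can take, assuming scored matches have been joined against the golden set into a source (here called `eval`, a hypothetical name):

```malloy
run: eval -> {
  group_by:
    listing_id
    match_string
    correct_sku is golden.correct_sku
  // The top candidates the pipeline returned, side by side with the
  // reviewed answer, so failure patterns become visible.
  nest: top_candidates is {
    select: sku, match_score
    order_by: match_score desc
    limit: 5
  }
}
```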
This error analysis drives a tight iteration loop:
- Run evaluation against the golden test set
- Identify failure patterns
- Hypothesize improvements (adjust features, tune thresholds, add signals)
- Implement changes in Malloy code
- Re-run evaluation to measure impact
Every experiment is version-controlled. Evaluations run at warehouse scale. And because embeddings are persisted, downstream experiments run in seconds—you're tuning thresholds and feature weights, not waiting for regeneration. The entire iteration loop — explore data, adjust features, re-score candidates, measure accuracy — happens in Malloy without switching between a Python notebook for experimentation and a SQL pipeline for production. When the numbers look right, the same code you used to explore and evaluate is what runs in production — no translation step, no handoff. One language, one workflow, from exploration to deployment.
From Heuristics to State-of-the-Art
Here's what changes when you build entity matching this way:
- Brittle, special-cased scripts become readable, version-controlled transformation code
- Silent breakage becomes measured accuracy against a golden test set
- A new rule per edge case becomes a feedback loop that improves features, thresholds, and models
- The notebook-to-SQL handoff becomes one language from exploration to production
Getting Started
- Try Malloy: malloydata.dev — open source, with extensions for VS Code and Cursor
- Talk to Credible: credibledata.com — managed entity matching infrastructure with evaluation, governed models, and AI-ready data at scale
- Keep reading: Rethinking Data Transformation · Warehouse-Native ML Pipelines
Conclusion
Entity matching doesn't have to stay stuck in the past. The barrier isn't that better approaches don't exist—embedding models dramatically outperform string-matching heuristics. The barrier is that adoption requires pipelines that are governed, evaluable, and auditable. Malloy provides that foundation: readable transformation code, warehouse-native ML, materialization for fast experimentation, and built-in evaluation to prove that changes actually improve accuracy. You can finally adopt state-of-the-art approaches and iterate with confidence, instead of accumulating edge-case heuristics forever.
