Probabilistic matching
Audience: Platform admin, data or analytics engineer
Prerequisites: IDR overview →, Setup steps →, Prepare your data →, Golden Record →
Use probabilistic matching to link records that likely belong to the same person—even when values don’t match exactly.
What is probabilistic matching?
Probabilistic matching uses data normalization, fuzzy comparison, and AI to connect similar records. It compares multiple fields—like email, phone, name, and address—and assigns a confidence score to each match.
For example:
john.doe@hightouch.com and johndoe@gmail.com might be linked if other traits (like phone or ZIP code) overlap.
You define how strict or loose the logic is by setting match thresholds.
When to use it
Scenario | Example | Why It Helps |
---|---|---|
Typos and misspellings | John Doe vs. Jhon Doe | Fuzzy scoring tolerates inexact values |
Multiple accounts for the same person | johndoe@gmail.com vs. john.doe@company.com | Scores improve when combining multiple identifiers |
Format variations | (415) 555-1234 vs. 4155551234 | Probabilistic matching applies normalization logic |
Sparse or user-entered records | Lead forms, event RSVPs, loyalty sign-ups | Matches based on partial or inconsistent information |
Cross-channel identity stitching | CDP → CRM → POS | Links identities even when shared IDs aren’t available |
Multiple identifiers per person | Name + phone + ZIP code | Higher match confidence from overlapping traits |
How it works
Probabilistic matching links records that likely refer to the same person, even when the data is inconsistent or incomplete. Behind the scenes, matching happens in three steps:
- Normalize the data
We clean and standardize field values to make them easier to compare, for example by handling nicknames, email casing, and formatting differences.
- Compare fields
We compare values across key fields (like name, email, or address) and use AI to generate a similarity score for each field pair.
- Score the record pair
Our proprietary AI model evaluates the individual field scores to form a single record-level similarity score that reflects how likely the records are to belong to the same person.
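The three steps above can be sketched in a few lines of Python. This is only an illustration of the normalize, compare, and score flow, not the actual implementation: the product uses a proprietary AI model, while this sketch swaps in the standard library's `difflib.SequenceMatcher` for field similarity and a weighted average for the record score. The nickname table, field weights, and function names are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical nickname table; a real system would use a much larger one.
NICKNAMES = {"jon": "john", "johnny": "john", "bill": "william"}


def normalize_email(value: str) -> str:
    """Trim whitespace and lowercase the address."""
    return value.strip().lower()


def normalize_phone(value: str) -> str:
    """Keep digits only, so '(415) 555-1234' becomes '4155551234'."""
    return re.sub(r"\D", "", value)


def normalize_name(value: str) -> str:
    """Lowercase, collapse whitespace, and map known nicknames."""
    tokens = value.strip().lower().split()
    return " ".join(NICKNAMES.get(token, token) for token in tokens)


NORMALIZERS = {
    "email": normalize_email,
    "phone": normalize_phone,
    "name": normalize_name,
}

# Assumed weights for illustration; not the product's real scoring.
WEIGHTS = {"email": 0.4, "phone": 0.4, "name": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalized values."""
    return SequenceMatcher(None, a, b).ratio()


def record_score(left: dict, right: dict) -> float:
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        a, b = left.get(field), right.get(field)
        if not a or not b:
            continue  # skip fields missing from either record
        normalize = NORMALIZERS[field]
        total += weight * field_similarity(normalize(a), normalize(b))
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


left = {"name": "Johnny Doe", "email": "John.Doe@Example.com", "phone": "(415) 555-1234"}
right = {"name": "john doe", "email": "john.doe@example.com", "phone": "4155551234"}
print(record_score(left, right))
```

Skipping fields that are missing on either side keeps sparse records comparable, which is one simple way to handle the partial data described earlier.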
You decide what counts as a match
Once record pairs are scored, you choose the confidence level that fits your use case. Records that meet or exceed your threshold are grouped into the same identity. Even if two records don’t match directly, they can be linked through shared matches: if record A matches record B and record B matches record C, all three are grouped into one identity.
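Linking through shared matches amounts to treating matches as edges in a graph and grouping connected records. The sketch below shows one common way to do this, a union-find structure; it is an assumption about how such grouping could work, not a description of the product's internals. The function name, record IDs, scores, and threshold are hypothetical.

```python
def group_identities(record_ids, scored_pairs, threshold):
    """Group records whose pairwise scores meet the threshold, transitively."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for record_id in record_ids:
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(group) for group in groups.values())


pairs = [
    ("A", "B", 0.93),  # direct match
    ("B", "C", 0.91),  # direct match
    ("A", "C", 0.40),  # A and C do not match directly
    ("C", "D", 0.55),  # below threshold
]
print(group_identities(["A", "B", "C", "D"], pairs, threshold=0.90))
# → [['A', 'B', 'C'], ['D']]
```

Here A and C end up in the same identity only because both match B, which is also why a threshold that is too low can chain unrelated records together.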
Confidence tiers
Confidence tiers let you control how strict or flexible your matching is:
Tier | Description | Use Case |
---|---|---|
Exact | Near-identical records | Operations and transactional emails |
Strict | Strong match with minor variation | Lifecycle messaging, retargeting |
Loose | Possible match, broader reach | Ads, retargeting, analytics |
Lower tiers capture more matches but increase the chance of false positives. Higher tiers keep matching more conservative. You can adjust tiers to fit your data quality and business goals.
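To make the trade-off concrete, here is a minimal sketch of how a similarity score could map to a tier. The numeric thresholds are made up for illustration; the actual values are whatever you configure in your identity model, and the function names are hypothetical.

```python
# Hypothetical thresholds, ordered from strictest to loosest.
TIER_THRESHOLDS = [("Exact", 0.95), ("Strict", 0.85), ("Loose", 0.70)]


def assign_tier(score):
    """Return the strictest tier whose threshold the score meets, or None."""
    for tier, minimum in TIER_THRESHOLDS:
        if score >= minimum:
            return tier
    return None


def matches_at_or_above(scores, tier):
    """Scores that would count as matches when matching at the given tier."""
    minimum = dict(TIER_THRESHOLDS)[tier]
    return [score for score in scores if score >= minimum]


scores = [0.99, 0.90, 0.75, 0.50]
for tier, _ in TIER_THRESHOLDS:
    print(tier, len(matches_at_or_above(scores, tier)))
```

With these sample scores, each step down in tier admits one more pair, which is the "more matches, more risk of false positives" effect described above.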
How to enable it
Probabilistic matching is optional and can be added to any IDR model.
- When configuring your identity model, toggle on Probabilistic Matching
- Choose your match thresholds for Exact, Strict, and Loose
- Use as many probabilistic identifiers as possible (e.g. name, email, phone, address)
You can adjust thresholds anytime based on QA results.
Learn more → Prepare your data to build a customer identity graph
How to QA your results
After enabling probabilistic matching:
- Open the Summary tab to view match rates by confidence tier
- Use the Profiles tab to inspect what contributed to each match
- Adjust thresholds if you're over- or under-merging records
Look for improved match rates compared to deterministic-only baselines.
Learn more → Match summary & profile review
When to activate probabilistic matches
Use probabilistic matching when:
- ✅ Your data has inconsistencies, nicknames, or formatting issues
- ✅ You need broader reach for campaigns
- ✅ You want to unify across systems without shared IDs
Avoid using Loose matches without QA:
- ❌ Don’t sync all Loose records without validation
- ❌ Avoid Loose for operational and transactional use cases
What’s next?
Now that you understand deterministic and probabilistic matching, you’re ready to:
- Review and QA your matches using Summary and Profiles views
- Build audiences based on clean, deduplicated traits
- Sync downstream with confidence-level filters
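As an illustration of a confidence-level filter, the sketch below keeps only rows at or above a minimum tier before a sync. It assumes each row carries a `tier` field holding one of the tier names; that field name, the row shape, and the function name are hypothetical and do not reflect the product's actual output schema.

```python
# Higher rank means a more conservative tier.
TIER_RANK = {"Exact": 3, "Strict": 2, "Loose": 1}


def filter_by_min_tier(rows, min_tier):
    """Keep rows whose match tier is at least as strict as min_tier."""
    floor = TIER_RANK[min_tier]
    return [row for row in rows if TIER_RANK[row["tier"]] >= floor]


rows = [
    {"identity_id": 1, "tier": "Exact"},
    {"identity_id": 2, "tier": "Strict"},
    {"identity_id": 3, "tier": "Loose"},
]

# Transactional sends: Exact only. Ads or analytics: Loose and above.
print(filter_by_min_tier(rows, "Exact"))
print(filter_by_min_tier(rows, "Loose"))
```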