Prepare your data to build a customer identity graph

Audience: Platform admin, data or analytics engineer
Prerequisites: IDR overview →, Setup steps →

Before you can resolve identities in Hightouch, you'll need to prepare your source data. This article walks you through how to structure and configure your data for use in an Identity Resolution (IDR) project, which powers your customer identity graph.

What is a customer identity graph?

A customer identity graph connects identifiers (like emails, device IDs, and phone numbers) across your datasets to form unified customer profiles. Each graph is built from an IDR project that defines:

Which source tables to include
Which columns represent identifiers
How records are matched across models
Whether to use deterministic, probabilistic, or hybrid matching strategies

The result is a set of deduplicated identities, each with a unique HT_ID, that you can use across Hightouch for targeting, analytics, and personalization.
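To make the idea of a graph concrete, here is a minimal sketch of deterministic linking: records that share any identifier value end up in the same profile. It uses a generic union-find structure and is not Hightouch's implementation. The function name `resolve_identities`, the sample records, and the use of list indexes in place of an HT_ID are all assumptions made for illustration.

```python
from collections import defaultdict


def resolve_identities(records):
    """Group records that share any identifier value (illustrative only)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lowest record index as the root for stable output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (identifier type, value) -> first record index with it
    for idx, record in enumerate(records):
        for id_type, value in record.items():
            if value is None:
                continue  # missing identifiers never link records
            owner = first_seen.setdefault((id_type, value), idx)
            union(owner, idx)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Hypothetical sample data: three rows belong to one person, one row to another.
records = [
    {"email": "ana@example.com", "phone": None},
    {"email": "ana@example.com", "phone": "+15550100"},
    {"email": None, "phone": "+15550100", "device_id": "d-1"},
    {"email": "bo@example.com", "phone": None},
]
profiles = resolve_identities(records)
```

In this sketch the third record has no email, but it still joins the first profile because it shares a phone number with the second record.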

What you'll prepare

Element	Description
Primary input model	A primary table where each row represents an individual (e.g. `users`)
Identifier mappings	Map model columns (e.g. `email`, `phone_number`) to identifier types used for matching
Additional input models	Supporting datasets (e.g. `orders`, `devices`, `web_events`) joined via shared identifiers
Match strategy	Select deterministic, probabilistic, or both based on your data
Confidence thresholds	(Optional) Define match strength tiers (Exact / Strict / Loose) for probabilistic matching
Golden Record	(Optional) Rules for selecting the most trusted value per trait

Choose a match strategy

Your data quality and structure will determine which match strategy to use:

Use case	Recommended strategy
Stable IDs, clean login events	Deterministic
Messy, user-entered data (e.g. lead forms)	Probabilistic
Mixed-quality data across systems	Hybrid

You can use deterministic matching on its own, or enable probabilistic matching alongside it to improve coverage.

Probabilistic matching uses similarity across multiple identifiers (e.g. name, email, phone) and assigns confidence scores to each match.
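The following sketch shows one way a confidence score could be derived from several identifiers. It is not how Hightouch computes scores: the weights, the tier cutoffs, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are assumptions chosen only to illustrate the idea of weighted similarity mapped onto Exact / Strict / Loose tiers.

```python
from difflib import SequenceMatcher

# Hypothetical weights and tier cutoffs, chosen only for illustration.
WEIGHTS = {"name": 0.3, "email": 0.5, "phone": 0.2}
TIERS = [("Exact", 1.0), ("Strict", 0.9), ("Loose", 0.75)]


def similarity(a, b):
    """Return a 0..1 string similarity; missing values contribute nothing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_confidence(left, right):
    """Weighted sum of per-identifier similarities."""
    return sum(
        weight * similarity(left.get(field), right.get(field))
        for field, weight in WEIGHTS.items()
    )


def tier(score):
    """Map a score to the first tier whose cutoff it meets, or None."""
    for name, cutoff in TIERS:
        if score >= cutoff:
            return name
    return None


# Hypothetical records that differ only by a typo in the name.
a = {"name": "Ana Lopez", "email": "ana.lopez@example.com", "phone": "+15550100"}
b = {"name": "Anna Lopez", "email": "ana.lopez@example.com", "phone": "+15550100"}
score = match_confidence(a, b)
match_tier = tier(score)
```

Because the email and phone agree and only the name differs slightly, the pair scores just under a perfect match in this sketch.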

Step-by-step: prepare your data for Identity Resolution

Select a data source

Go to Identity Resolution and click Add identity graph.
Choose a Lightning-supported data warehouse (Snowflake, Databricks, or BigQuery) that contains the data you want to use.

Info: Identity graphs are warehouse-specific. To build graphs across multiple sources, create one per warehouse.

Choose your models

Choose your input model (e.g. users, customers, contacts).
- Each model must include a timestamp column for incremental processing:
  - Use an event timestamp for event models
  - Use a `last_updated_at` or similar field for static records
  - If no timestamp exists, define one in your model SQL (e.g. `CURRENT_TIMESTAMP`)
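To show why a timestamp column matters, here is a rough sketch of incremental processing using a high-water mark: each run picks up only rows whose timestamp is newer than the previous run. The function `rows_since`, the column name, and the sample rows are hypothetical, and the way Hightouch actually tracks processed rows may differ.

```python
from datetime import datetime, timezone


def rows_since(rows, last_run_at, timestamp_column="last_updated_at"):
    """Return rows changed after the previous run and the new high-water mark."""
    fresh = [row for row in rows if row[timestamp_column] > last_run_at]
    high_water_mark = max(
        (row[timestamp_column] for row in fresh), default=last_run_at
    )
    return fresh, high_water_mark


# Hypothetical rows from a static `users` model.
rows = [
    {"user_id": "u1", "last_updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"user_id": "u2", "last_updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"user_id": "u3", "last_updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]
last_run_at = datetime(2024, 1, 2, tzinfo=timezone.utc)
fresh, next_mark = rows_since(rows, last_run_at)
```

In this sketch, a column defined as `CURRENT_TIMESTAMP` would make every row look new on every run, so a real event or update timestamp is preferable when one exists.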

Map identifier columns

Within each model, map relevant columns to standard identifier types. These mappings determine which identifiers Hightouch uses when evaluating record matches.

What are identifiers?

Identifiers are fields that help link records across systems. Common examples include:

Email address
Phone number
Full name
User ID or customer ID
Anonymous ID (e.g. session ID)
Mailing address or postal code

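Review the consistency and formatting of identifiers across models before mapping them. For example, differences in email casing, stray whitespace, or phone number formats can stop otherwise identical values from matching exactly.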
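As an illustration of the kind of cleanup that helps, the sketch below normalizes emails and phone numbers before matching. The helper names are hypothetical, and the phone rule (treat a bare 10-digit number as belonging to a default country code of `1`) is a simplifying assumption rather than a general-purpose phone parser. In practice you would likely apply equivalent logic in your model SQL.

```python
import re


def normalize_email(value):
    """Trim whitespace and lowercase; empty strings become None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def normalize_phone(value, default_country_code="1"):
    """Reduce a phone number to '+' followed by digits (simplified assumption)."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", value)  # drop spaces, dashes, parentheses
    if not digits:
        return None
    if not value.strip().startswith("+") and len(digits) == 10:
        # Assumption: a bare 10-digit number is missing its country code.
        digits = default_country_code + digits
    return "+" + digits


email = normalize_email("  Ana@Example.COM ")
phone_local = normalize_phone("(555) 010-0199")
phone_intl = normalize_phone("+1 555-010-0199")
```

After normalization, the two differently formatted phone numbers in this example reduce to the same value, so an exact comparison can link them.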

Configure identifier rules

Once you've mapped identifier columns, configure identifier rules to control how each field contributes to matching.

What are identifier rules?

Identifier rules determine how Hightouch uses your mapped identifiers in deterministic and probabilistic matching.

For deterministic matching, you'll define:

Priority order: Which identifiers should be used first when evaluating exact matches
Limit rules: Optional boundaries to prevent identifiers from over-linking across unrelated people (e.g. shared devices or generic emails)

For probabilistic matching, identifiers are automatically combined into a weighted model that calculates match confidence.
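The deterministic rules above can be sketched roughly as follows. This is an illustration of the two concepts, not Hightouch's rule engine: the priority list, the per-identifier limits, and the helper names `overused_values` and `first_match` are all hypothetical. A limit here means "ignore an identifier value once it appears on more than N records".

```python
from collections import Counter

PRIORITY = ["user_id", "email", "phone"]  # hypothetical priority order
LIMITS = {"email": 3, "phone": 2}         # hypothetical max records per value


def overused_values(records, limits):
    """Find identifier values that appear on more records than their limit."""
    counts = Counter(
        (field, record[field])
        for record in records
        for field in limits
        if record.get(field) is not None
    )
    return {key for key, seen in counts.items() if seen > limits[key[0]]}


def first_match(candidate, records, priority, blocked):
    """Try identifiers in priority order, skipping values blocked by a limit."""
    for field in priority:
        value = candidate.get(field)
        if value is None or (field, value) in blocked:
            continue
        for idx, record in enumerate(records):
            if record.get(field) == value:
                return field, idx
    return None


# Hypothetical records: three different people share one household phone.
records = [
    {"user_id": "u1", "email": "ana@example.com", "phone": "+15550100"},
    {"user_id": "u2", "email": "bo@example.com", "phone": "+15550100"},
    {"user_id": "u3", "email": "cy@example.com", "phone": "+15550100"},
]
blocked = overused_values(records, LIMITS)
phone_only = first_match({"phone": "+15550100"}, records, PRIORITY, blocked)
with_email = first_match(
    {"email": "bo@example.com", "phone": "+15550100"}, records, PRIORITY, blocked
)
```

Here the shared phone number exceeds its limit, so a candidate that carries only that phone is left unmatched, while a candidate that also has a known email is linked through the email instead.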

Supported identifier types

Identifier Type	Example Fields	Matching Supported
Email	`email`, `user_email`	Deterministic + Probabilistic
Phone	`phone_number`	Deterministic + Probabilistic
Name	`first_name`, `last_name`	Probabilistic only
Address	`street_address`, `state`, `city`, `postal_code`	Probabilistic only
User ID	`user_id`, `customer_id`	Deterministic only
Anonymous ID	`anonymous_id`	Deterministic only
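When planning mappings, it can help to check which of your identifiers apply to each strategy. The sketch below encodes the table above as a lookup. The dictionary keys and the function `identifiers_for` are hypothetical names for illustration, not part of any Hightouch API.

```python
# Matching support per identifier type, transcribed from the table above.
SUPPORTED_MATCHING = {
    "email": {"deterministic", "probabilistic"},
    "phone": {"deterministic", "probabilistic"},
    "name": {"probabilistic"},
    "address": {"probabilistic"},
    "user_id": {"deterministic"},
    "anonymous_id": {"deterministic"},
}


def identifiers_for(strategy, mapped_types):
    """Return the mapped identifier types that a given strategy can use."""
    unknown = [t for t in mapped_types if t not in SUPPORTED_MATCHING]
    if unknown:
        raise ValueError(f"Unsupported identifier types: {unknown}")
    return [t for t in mapped_types if strategy in SUPPORTED_MATCHING[t]]


# Hypothetical set of identifiers mapped on a `users` model.
mapped = ["email", "name", "user_id"]
deterministic_ids = identifiers_for("deterministic", mapped)
probabilistic_ids = identifiers_for("probabilistic", mapped)
```

A check like this makes it easy to spot a model whose only mapped identifiers are names and addresses, which would contribute nothing to deterministic matching.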