Identity Resolution in SQL

First off: what is identity, and how does it relate to customer data? Identity can have many different meanings, but they all involve unifying different pieces of data around your customers. Identity can include projects like merging CRM data with product usage data, or bridging user data between multiple business units.

This blog post is going to focus on a simple, common scenario around first-party event data. This is data that you collect on your customer, rather than receiving from another source. These events are typically collected on landing pages and online applications. First-party event data usually falls into two categories:

• Anonymous data - the user has not logged in, and so events are only associated with a randomly generated anonymous ID

• Identified data - the user has logged in and events are associated with a known user ID

A common problem with collecting this event data is in unifying identified and anonymous data once a user logs in.

Let's imagine we're an e-commerce store. Users can come to our site, browse items anonymously, and then optionally login and purchase items. Let's also assume that we're collecting events for both anonymous (before login) and identified (after login) traffic on our site.

The challenge now is how to associate and unify anonymous data and identified data. Specifically, the challenge is in asking questions like "How many times did this user view products of this category?" when the events they performed while anonymous aren't associated with their user ID.

This blog posts assumes all of these events are collected into a modern warehouse like Snowflake or BigQuery, and then proposes a warehouse-native solution to this problem. If this feels daunting, or you're looking for a code-free way to unify identities within the data warehouse, we've built an Identity Resolution feature that will help you instead.

Warehouse specifics

Typically services like Segment, Google Analytics, RudderStack, or in-house solutions handle event collection and then push these events into a data warehouse. All events around your users are stored in a single place, alongside all of the other customer data relevant to your business.

Depending on the specific event collection service being used, the data will be loaded in your warehouse with slightly different schemas. In this article we'll use a schema resembling that of Segment. Most event collection schemas look quite similar, so even if yours doesn't look exactly like this much of the principles in this blog post remain the same.

We're going to have two tables in our warehouse relating to event collection. Normally these tables are provisioned by the service pushing events into your warehouse, but we'll show the schemas here to give a sense of how this data is usually structured.

events

The events table holds a sequence of events: the name of the event, relevant properties, the timestamp, and who performed it. If the the event is anonymous, it'll include only an anonymous ID. If the event is identified, it'll include a user ID, and typically an anonymous ID as well.

CREATE TABLE events(
    timestamp timestamp,
    user_id text,
    anonymous_id text,
    event text,
    properties jsonb
)

user_identities

The user identities table tracks relationships between anonymous and user IDs. This table is usually populated during user during signup or login.

CREATE TABLE user_identities(
    user_id text,
    anonymous_id text,
    PRIMARY KEY ( user_id, anonymous_id )
)

Here's the challenge, and where identity handling comes in. The data in these tables aren't in a format that's ideal for analysis. If you wanted to email all users who viewed a certain item, you'd likely want to include users who viewed that item anonymously (before they logged in).

Solution

Let's explore a solution where identity is resolved in the warehouse. Specifically, we're going to produce a table called identified_events that contains events after identity resolution.

Here's an example scenario that we can work with. Someone visits your site, views two items, logs in, and then views two more products.

example=> select * from events;

timestamp	user_id	anonymous_id	anonymous_id	anonymous_id
2020-09-10 01:52:40.844441		782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 100}`
2020-09-10 01:53:40.844441		782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 105}`
2020-09-10 01:54:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Logged In	`{}`
2020-09-10 01:55:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 111}`
2020-09-10 01:59:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 115}`

(4 rows)

Notice how user_id is NULL for the first two events. These events were actually performed by user with ID 8fe77474-2c75-471e-8334-ee947b98a5a3, but the raw data doesn't reflect this. In identified_events we'd want this user ID to be backfilled into these events.

When using a service like Segment, you generally perform an identify call on login. This call takes both a user ID and anonymous ID and gets populated in the user_identities table.

example=> select * from user_identities;

user_id	anonymous_id
8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39
`(1 row)`

This is the table that establishes the link between user ID 8fe77474-2c75-471e-8334-ee947b98a5a3 and 782d69de-86cd-41d2-8cb8-4d27d1f2ff39. As a result, this is the table we're going to join our events table with to backfill user IDs.

We're going to create identified_events as a SQL view. To summarize the logic in this view, we're filling in the user ID of the event by looking up the correct ID in the user_identities table, and leaving events that already have a user ID the same.

CREATE VIEW identified_events AS
(
    SELECT
        events.timestamp,
        user_identities.user_id,
        events.anonymous_id,
        events.event,
        events.properties
    FROM events
    JOIN user_identities ON user_identities.anonymous_id = events.anonymous_id
    WHERE events.user_id IS NULL
)
UNION ALL
(
    SELECT
        events.timestamp,
        events.user_id,
        events.anonymous_id,
        events.event,
        events.properties
    FROM events
    WHERE events.user_id IS NOT NULL
)

Now if we look at identified_events we'll get the following.

dbt=> select * from identified_events;

timestamp	user_id	anonymous_id	anonymous_id	anonymous_id
2020-09-10 01:52:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 100}`
2020-09-10 01:52:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 115}`
2020-09-10 01:52:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 105}`
2020-09-10 01:52:40.844441	8fe77474-2c75-471e-8334-ee947b98a5a3	782d69de-86cd-41d2-8cb8-4d27d1f2ff39	Product Viewed	`{"product_id": 111}`

(4 rows)

Success! We can now use the identified_events view anywhere we would have used the raw events table.

For simple query example, we could find the number of product views for user with ID '8fe77474-2c75-471e-8334-ee947b98a5a3.

dbt=> select count(*) from identified_events where user_id = '8fe77474-2c75-471e-8334-ee947b98a5a3';

 count
-------
     4
(1 row)

Were we using the raw events table, the result would be two, which is not the full picture for this user.

Depending on the situation, it might make sense to make identified_events a materialized view. These views are pre-computed and saved to disk. The upside is that querying these views are faster because they're pre-computed, and the downside is that they need to be "refreshed" and can contain stale data. dbt (https://www.getdbt.com/) is a popular tool for organizing and scheduling materialized views.

Conclusion

What this blog post contained was a very simple example of identity, namely joining anonymous and identified event streams. Identity projects as a whole can be much more complicated. Here are some examples of other scenarios where getting a unified view of customer identity is much harder.

• You have multiple business units, each with completely different applications, SaaS tooling, and user databases

• You're a B2B company, and context on your customer is scattered across systems like Salesforce, Zendesk, and internal databases

• You have both an online and physical (brick-and-mortar) presence

Identity problems and solutions vary a lot between companies are highly specific to your business. At Hightouch, we believe the best place to manage customer identity is in data warehouses, the typical place where all data on your customers resides. This is why we've built a feature to help organizations resolve complex identity and entity resolution processes in their warehouse, with a code-free editor.

A special thanks to Huy Nguyen of Holistics and Graham Murphy for giving feedback on this article!‍