Step 5: Unifying historical and new data

After successfully setting up Hightouch Events and getting event data flowing, you'll almost certainly want to integrate your historical data with the newly collected events. While it's possible that an organization wants a fresh start to its event data or has completely rewritten its data model, if you're migrating from an existing solution like Segment or Rudderstack and have downstream applications, you'll want to unify your historical and new data to ensure continuity in your analytics and a clean transition for your data consumers.

Planning your unification strategy

Before you begin, consider the following:

Time frame: Based on your use cases, decide how far back you need to unify data (for example, 1 year, 2 years, or all historical data). Your decided time frame affects processing time and storage requirements and ensures you have enough historical context for your analyses.
Data volume: Assess the volume of historical data you need to unify. This could impact processing time and storage costs and may influence your choice of unification method.
Schema differences: Identify any differences between your old and new event schemas if you haven't in a prior step. If there are changes, this will inform any necessary data transformations during the unification process.
Storage considerations: Determine where the unified data will be stored (for example, an existing or new table in your data warehouse). Your decision can affect query performance, how you update downstream destinations, and may have cost implications.

There are three main options for where to store your unified data: in the previous provider's events tables, in the Hightouch tables, or in an intermediary materialized table. Any of these options would work, but we recommend joining the prior events tables into the Hightouch tables to avoid recurring unioning costs from an intermediate table and for ease of use (Hightouch will write events into the source of truth tables).

Unioning historical and new data

To ensure continuity in your analytics and a smooth transition for data consumers, we'll union the historical Segment data with the new Hightouch Events data.

Your process will depend on where you've decided to store the unioned data and any needed data transformations. We've sketched in SQL here for inserting 2 years of historical data from a prior provider (in this case Segment) into the Hightouch tables with minimal transformation.

In this particular case, we're only going to insert events from Segment that don't have a migrationId, since any event in Segment with a migrationId should already be in the Hightouch tables.

-- Insert historical data from Segment to Hightouch identifies table
INSERT INTO HIGHTOUCH.identifies (
  id, anonymous_id, user_id, timestamp, email, name, migrationId
)
SELECT
  s.id, s.anonymous_id, s.user_id, s.timestamp, s.email, s.name, s.migrationId
FROM SEGMENT.identifies s
WHERE s.migrationId IS NULL
  AND s.timestamp >= DATEADD(day, -730, CURRENT_DATE());  -- Adjust time frame as needed

Modify this for table names and to include any traits or other properties you need. Repeat similar INSERT statements for tracks, pages, and other relevant tables, mapping the columns appropriately.

If there are differences between Segment and Hightouch schemas, you may need to transform the data during insertion. If you are migrating from Segment, for example, the schemas are quite similar and should only require basic SQL transformations, if any.

Validating the unified dataset

We've previously validated that Hightouch Events is collecting an equivalent volume of events with matching values. For peace of mind, after unifying the de-duped data, you can perform some basic validation steps:

Check event counts: Ensure the total event count makes sense based on the historical data and new Hightouch events.

SELECT
    DATE_TRUNC('day', timestamp) as date,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as user_count
FROM HIGHTOUCH.tracks
GROUP BY 1
ORDER BY 1;

Verify data continuity: Look for any unexpected drops or spikes in event volume, especially around the migration period.
Compare with Segment data: Run a comparison query to ensure all historical data was properly migrated.
Sample data: Manually review a sample of events from different periods to ensure data format and content consistency.

By following these steps, you'll create a unified dataset in your Hightouch tables that combines your historical Segment data with new events from Hightouch Events. This approach ensures data continuity and provides a single source of truth for all your event data.

Remember to adjust the SQL queries as needed to match your specific schema and table names. Also, consider running these operations in batches if you're dealing with a large volume of historical data.

In the next section, we'll set up alerts and monitoring for long-term success of your events infrastructure.

Step 5: Unifying historical and new data

Planning your unification strategy

Unioning historical and new data

Validating the unified dataset

Ready to get started?

Need help?

Feature requests?