Hightouch is a Data Activation platform that connects and orchestrates data from sources to business tools. The platform manages the varying integrations and logic to activate data models from sources.
A Source is wherever business data is stored, such as a data warehouse, database, CSV file, SFTP server, or even a BI tool. It is most commonly a source of truth for business data.
A Destination is a tool or service receiving data from a Source. This is typically where end-users consume data (outside of analysis). Hightouch integrates with 100+ Destinations including CRM systems, ad platforms, marketing automation, and support tools.
Both Sources and Destinations are configured in Hightouch by either logging in with OAuth or providing an API key. Hightouch continues to add support for more Sources and Destinations; check our changelog for the latest additions.
In order for Hightouch to know what data to sync, a Model is defined. A Model organizes elements of data to be queried from a data source. For most Sources, a Model is defined with SQL; Hightouch sends the SQL directly to the Source to query data. Alternatively, a Model can be defined with dbt Models or Looker Looks to leverage existing data models.
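As a rough illustration of a SQL-defined Model, the sketch below runs a query against an in-memory SQLite database standing in for a Source. The table, the column names, and the `run_model` helper are all hypothetical and are not part of Hightouch; the point is only that a Model is a query whose result rows are what gets synced.

```python
import sqlite3

# Stand-in for a Source: an in-memory SQLite database (an assumption for
# illustration; a real Source would be a warehouse or database).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE users (id INTEGER, email TEXT, plan TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "ada@example.com", "pro", 1),
        (2, "bob@example.com", "free", 0),
        (3, "cy@example.com", "pro", 1),
    ],
)

# Hypothetical Model definition: plain SQL selecting the rows to sync.
# The `id` column plays the role of the Model's Primary Key.
MODEL_SQL = "SELECT id, email, plan FROM users WHERE active = 1 ORDER BY id"


def run_model(connection, sql):
    """Run the Model's SQL against the Source and return rows as dicts."""
    return [dict(row) for row in connection.execute(sql)]


model_rows = run_model(conn, MODEL_SQL)
```

Here `model_rows` holds only the active users, each keyed by `id`.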
Hightouch’s Visual Audience Builder can be used to segment a Model to build audiences or cohorts of data (with no code) before syncing the data to a destination tool. This process creates an Audience that generates SQL that acts as a segmented Model.
Regardless of how a Model is built, it is configured with a unique Primary Key that is used by Hightouch to search and keep track of records. This is important to ensure Hightouch is only syncing new and updated data to a destination tool. How Hightouch manages difference checking will be covered in Diffing.
Once Hightouch knows what data model to query from a Source, a Sync is configured to map the data from the Source to the Destination. The Sync manages how a Destination will receive data from the Source as well as the frequency of the pipeline. A Sync can be scheduled to trigger periodically, manually, or automatically via Airflow Operator, dbt Cloud, or the REST API.
The configuration of a Sync varies from Destination to Destination, but for the most part the experience is the same: declaratively map data from Source fields to Destination fields and choose a sync mode (Upsert, Insert, Update, etc.). Some Destinations have different sync types for different data types, such as Users vs. Accounts vs. Events.
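The declarative mapping described above can be pictured with a small sketch. The `SYNC_CONFIG` structure, its keys, and the Destination field names are assumptions made up for illustration; they are not Hightouch's actual configuration schema.

```python
# Hypothetical Sync configuration: these keys do not reflect Hightouch's
# real schema; they only illustrate a declarative field mapping.
SYNC_CONFIG = {
    "mode": "upsert",
    "mappings": {  # Source field -> Destination field
        "email": "Email",
        "first_name": "FirstName",
        "plan": "Plan__c",
    },
}


def apply_mappings(row, mappings):
    """Rename Source fields to Destination fields, dropping unmapped ones."""
    return {dest: row[src] for src, dest in mappings.items() if src in row}


source_row = {
    "id": 7,
    "email": "ada@example.com",
    "first_name": "Ada",
    "plan": "pro",
    "internal_score": 0.9,  # not mapped, so it is never sent
}
payload = apply_mappings(source_row, SYNC_CONFIG["mappings"])
```

Only the mapped fields end up in `payload`; `id` and `internal_score` stay in the Source.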
A single Model can be configured with multiple Syncs to different Destinations. For example, a Model containing customer data is commonly configured to sync to Sales (e.g., Salesforce), Marketing (e.g., Iterable), and Support (e.g., Zendesk) tools. Doing so enables all business tools to leverage the same source of truth.
Hightouch employs diffing to ensure the platform doesn’t send excessive requests for all rows in a Model every time a sync triggers; only deltas in your data model are synced to your destinations. A record of the data mapped and synced between a Model and a Destination (the diff file) is updated after each run. When a new Sync runs, the diff file is used to identify incremental changes to the Model. This is how Hightouch is able to only send requests for new and changed data in a Model.
The Primary Key specified in the Model is used as a waypoint to search and track records. When a new Sync triggers, Hightouch compares the Primary Keys in the new dataset with the previous dataset in the diff file. If the Primary Key for a record exists in both the new dataset and in the diff file, Hightouch scans the columns for any changes to the data. If the Primary Key is missing in the new dataset, Hightouch considers this a deleted record, whereas if the Primary Key is missing in the diff file, Hightouch considers this a new record.
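The comparison described above can be sketched in a few lines. This is a simplified model of primary-key diffing and not Hightouch's implementation; the function name, the row shape, and the default `id` key are assumptions.

```python
def diff_by_primary_key(previous, current, primary_key="id"):
    """Classify rows as added, changed, or removed between two query results.

    `previous` plays the role of the diff file from the last run and
    `current` is the fresh query result; both are lists of dicts.
    """
    prev = {row[primary_key]: row for row in previous}
    curr = {row[primary_key]: row for row in current}

    # Key missing from the previous snapshot: a new record.
    added = [curr[k] for k in curr if k not in prev]
    # Key missing from the new dataset: a deleted record.
    removed = [prev[k] for k in prev if k not in curr]
    # Key present in both: compare the columns for changes.
    changed = [curr[k] for k in curr if k in prev and curr[k] != prev[k]]

    return {"added": added, "changed": changed, "removed": removed}


previous_rows = [
    {"id": 1, "email": "ada@example.com", "plan": "free"},
    {"id": 2, "email": "bob@example.com", "plan": "free"},
]
current_rows = [
    {"id": 1, "email": "ada@example.com", "plan": "pro"},  # plan changed
    {"id": 3, "email": "cy@example.com", "plan": "free"},  # new key
]
result = diff_by_primary_key(previous_rows, current_rows)
```

In this example, record 1 is reported as changed, record 3 as added, and record 2 as removed.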
By default, the diffing compute is done by Hightouch’s infrastructure (local diffing) and does not require WRITE permissions back to the Source. Alternatively, diffing can be done entirely in your warehouse with warehouse planning. With this option, the Change Data Capture computation runs within the Source warehouse to achieve faster syncs at higher volumes. This gives you the flexibility of providing write or read-only access to your warehouse with no loss in functionality.
Hightouch only tracks changes in columns based on the fields mapped in a Sync configuration. For example, if only 10 fields are being mapped in a Sync from a Model that queries 20 fields, Hightouch will only track these 10 fields.
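To make the mapped-fields rule concrete, the sketch below compares two versions of a row only on the fields a Sync maps. The helper names and field names are hypothetical, and this is a simplified illustration, not Hightouch's internal logic.

```python
def project(row, fields):
    """Keep only the fields that the Sync maps."""
    return {field: row.get(field) for field in fields}


def has_tracked_change(previous_row, current_row, mapped_fields):
    """Return True only if a mapped field differs between the two rows."""
    return project(previous_row, mapped_fields) != project(current_row, mapped_fields)


mapped_fields = ["email", "plan"]  # fields mapped in the Sync (assumed)

before = {"id": 1, "email": "ada@example.com", "plan": "pro", "last_login": "2024-01-01"}
after_unmapped = {"id": 1, "email": "ada@example.com", "plan": "pro", "last_login": "2024-02-01"}
after_mapped = {"id": 1, "email": "ada@example.com", "plan": "free", "last_login": "2024-02-01"}

unmapped_only = has_tracked_change(before, after_unmapped, mapped_fields)
mapped_changed = has_tracked_change(before, after_mapped, mapped_fields)
```

A change to `last_login` alone is ignored because that field is not mapped, while a change to `plan` is picked up.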
Hightouch only compares the diff file from a current sync with the most recent diff file from that sync. Hightouch doesn’t maintain a historical record of all rows and all columns (fields) that have ever been sent.
If a row drops out of a Sync and later reappears, it's considered a new row even though it may have been sent in the past, because Hightouch doesn't store all primary keys that have ever been sent. Consequently, Hightouch recommends the following best practice:
Your warehouse should be your single source of truth. It is not a good practice to update data only in your end tool.
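The effect of comparing only against the most recent run can be shown with a short simulation. This is a sketch under the assumption that each run keeps just the previous set of primary keys; the `classify` helper and the sample keys are hypothetical.

```python
def classify(previous_keys, current_keys):
    """Compare one run's primary keys against the immediately preceding run."""
    return {
        "added": sorted(current_keys - previous_keys),
        "removed": sorted(previous_keys - current_keys),
    }


# Primary keys returned by the Model on three consecutive runs.
runs = [{1, 2, 3}, {1, 3}, {1, 2, 3}]

previous = set()
history = []
for keys in runs:
    history.append(classify(previous, keys))
    previous = keys  # only the latest snapshot is kept, not the full history
```

On the third run, key 2 is reported as added again even though it was already sent on the first run, since the second run's snapshot no longer contains it.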
When Hightouch executes a Sync, the platform runs a query against the specified Source and syncs the generated diff file to a cloud bucket where the diff check occurs. This is considered local diffing, and it can be hosted either on Hightouch's infrastructure or your own infrastructure.
If warehouse planning is enabled, instead of moving the diff file to a cloud bucket, Hightouch stores the diff data and computes the diff check directly in the Source. Warehouse planning makes a significant difference in speed when dealing with millions or hundreds of millions of records.