Hightouch is a Data Activation platform that syncs data from sources to business applications and developer tools. This frees up valuable engineering time for your data team and delivers actionable data directly to business teams. It also ensures data consistency across your organization.
Keep reading to gain a high-level understanding of Hightouch's core concepts.
A source is wherever business data is stored. The most frequently used sources include data warehouses like Snowflake and Google BigQuery. Sources can also be databases, CSV files, SFTP, or BI tools.
To add a source to your Hightouch workspace, go to the Sources overview page and click the Add source button.
A destination is any tool or service you want to send source data to. Destinations are where end users typically consume data. Hightouch integrates with 125+ destinations, including CRM systems, ad platforms, marketing automation tools, and support tools.
To add a destination to your Hightouch workspace, go to the Destinations overview page and click the Add destination button.
For Hightouch to know what data to sync, you need to create a model. Models define the data you want to pull from a source.
You can define models by:
- writing a query in the SQL editor
- using the visual table selector
- or leveraging existing dbt models or Looker Looks
You can also use Hightouch's no-code Customer Studio feature to define cohorts before syncing data to a destination. Each cohort acts as a segmented model.
Regardless of how you build your models, you must configure them with a unique primary key. A primary key is a special value used to uniquely identify each row in a dataset. It's like a unique ID for each entry in a table or dataset.
For example, if you have a table containing customer information, you might have a column named CustomerID that contains a unique number for each customer. This CustomerID column would be the primary key because it's the value you can use to distinguish one customer from all the others.
Without a primary key, you might have multiple rows with the same name and address, for instance. The primary key gives each row its own unique identity.
Hightouch uses this unique identifier to keep track of records. Using a unique key lets Hightouch only sync new and updated data to your destinations. See the change data capture section to learn how Hightouch accomplishes this.
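To make the idea of a unique key concrete, here is a minimal Python sketch that checks whether a column could serve as a primary key. It is an illustration only, not Hightouch code: the sample rows and the `is_unique_key` helper are hypothetical.

```python
from collections import Counter

# Hypothetical customer rows; the values are illustrative only.
customers = [
    {"CustomerID": 1, "name": "Ada Park", "address": "12 Elm St"},
    {"CustomerID": 2, "name": "Ada Park", "address": "12 Elm St"},
    {"CustomerID": 3, "name": "Bo Chen", "address": "9 Oak Ave"},
]


def is_unique_key(rows, column):
    """Return True if every row has a distinct value in `column`."""
    counts = Counter(row[column] for row in rows)
    return all(count == 1 for count in counts.values())


print(is_unique_key(customers, "CustomerID"))  # True
print(is_unique_key(customers, "name"))        # False
```

The first two rows share a name and address, so only CustomerID distinguishes them.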
Once Hightouch knows what data model to query from a source, you can configure a sync to declare how you want that data to appear in your destination.
You can build multiple syncs to different destinations from the same model. For example, you can use a model containing customer data to configure syncs to sales, marketing, and support tools. Using the same data model ensures that all parts of your business are working off the same source of truth.
If you're syncing multiple object types to the same destination, for example, contacts and custom behavioral events to HubSpot, you must configure separate syncs for each sync type.
Sync configuration varies depending on the destination but generally includes the same steps. You select the appropriate sync mode for your use case—upsert, insert, update, etc.—and declaratively map model columns to destination fields.
Part of sync configuration is scheduling. Besides triggering syncs manually, you can schedule syncs to run on a recurring schedule. You can also trigger syncs automatically via dbt Cloud, Fivetran, Airflow, Dagster, Prefect, Mage, or the Hightouch REST API.
Refer to the sync overview page for more information on sync configuration and scheduling.
As part of configuring a model, you must select a unique primary key. Hightouch uses the primary key for change data capture (CDC) and to ensure that no duplicates are sent to downstream destinations.
Although you select a primary key during model creation, duplicate detection occurs during sync runs.
A sync run happens when Hightouch uses a model to query your data source and sends the query's results to a downstream destination. Because a model is a query and not the results of the query, Hightouch can't detect whether the column you select for your primary key is unique based on the model definition alone. It needs a particular set of query results to detect duplicates, which is why duplicate detection occurs during each sync run.
If Hightouch detects that a particular sync has two records with the same primary key, it sends neither record to the sync's destination. Hightouch has no way of knowing which is the preferred version of a duplicated record, so it's safer to send neither.
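The behavior described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not Hightouch's implementation: the `split_duplicates` helper and the sample rows are hypothetical.

```python
from collections import defaultdict


def split_duplicates(rows, primary_key):
    """Separate rows with a unique primary key from rows that share one."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[primary_key]].append(row)

    # Rows whose key appears exactly once are safe to send.
    sendable = [group[0] for group in by_key.values() if len(group) == 1]
    # Every row in a duplicated group is rejected, since no version is preferred.
    rejected = [row for group in by_key.values() if len(group) > 1 for row in group]
    return sendable, rejected


# Hypothetical query results containing one duplicated key.
results = [
    {"id": "a1", "email": "ada@example.com"},
    {"id": "b2", "email": "bo@example.com"},
    {"id": "b2", "email": "bo.chen@example.com"},
]

sendable, rejected = split_duplicates(results, "id")
print([row["id"] for row in sendable])  # ['a1']
print(len(rejected))                    # 2
```

Both rows with the key `b2` are held back, because nothing in the data says which one is correct.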
Duplicated records appear as errors in the live debugger. Because Hightouch doesn't send requests for duplicated records, you can't click into an errored row for more details. To become aware of these and other errors, it's best to set up alerting on your syncs.
Because duplicate detection relies on CDC, duplicated records aren't detected for All/Mirror syncs. This sync mode overwrites all existing records without performing CDC.
If Hightouch were to send all query results from a model at every sync, it would likely overwrite values that don't need updating. To avoid excessive API requests and send only necessary updates to your destinations, Hightouch uses a process commonly referred to as diffing or change data capture (CDC).
Whenever a new sync is triggered, Hightouch compares the previous sync run to the current set of query results. To do this, Hightouch keeps a record of the data sent in the last sync. This record is the diff file.
Using the model's primary key, Hightouch refers to the diff file to identify what has changed in the source data since the last run. The comparison follows these steps:
- Hightouch queries the source using the defined model.
- Hightouch compares all the primary keys in the query results with the primary keys in the diff file.
- For each primary key:
  - If it's in the most recent query results but not in the diff file, Hightouch treats this as a new record.
  - If it's in both, Hightouch scans columns for changes.
  - If it's in the diff file but missing from the most recent query results, Hightouch treats this as a deleted record.
- Hightouch creates a new diff file for the next comparison.
- Hightouch syncs changes, including any failed rows from the previous sync, to the destination.
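The comparison steps above can be sketched in a few lines of Python. This is a minimal illustration of difference-based comparison in general, not Hightouch's actual code: the `diff_rows` helper and the sample data are hypothetical, and retrying failed rows from the previous sync is left out.

```python
def diff_rows(previous, current, primary_key):
    """Compare the last synced rows with fresh query results by primary key."""
    prev_by_key = {row[primary_key]: row for row in previous}
    curr_by_key = {row[primary_key]: row for row in current}

    # In the new results but not in the previous snapshot: new records.
    added = [row for key, row in curr_by_key.items() if key not in prev_by_key]
    # In both: compare column values to find updated records.
    changed = [
        row
        for key, row in curr_by_key.items()
        if key in prev_by_key and row != prev_by_key[key]
    ]
    # In the previous snapshot but missing from the new results: deleted records.
    removed = [row for key, row in prev_by_key.items() if key not in curr_by_key]
    return added, changed, removed


# Hypothetical snapshot from the last run (the "diff file") and new results.
previous = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "free"},
    {"id": 3, "plan": "pro"},
]
current = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "pro"},
    {"id": 4, "plan": "free"},
]

added, changed, removed = diff_rows(previous, current, "id")
print([row["id"] for row in added])    # [4]
print([row["id"] for row in changed])  # [2]
print([row["id"] for row in removed])  # [3]
```

After a run, `current` would take the place of `previous` as the snapshot for the next comparison.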
This method is called difference-based CDC because it involves a full before-and-after comparison. It is only one of several CDC methods, but it is necessary when syncing data from data warehouses, since they can't produce CDC logs for arbitrary SQL queries or dbt models.
Online transactional processing (OLTP) databases like Postgres, MySQL, or Microsoft SQL Server natively log incremental changes that occur on data tables. Most modern ETL tools use these transaction logs to track changes when sending data to data warehouses or lakes. Since Hightouch does the reverse—sending data from warehouses to other destinations—it can't rely on log-based CDC. How CDC happens is a crucial difference between ETL and reverse ETL.
Hightouch performs change data capture after receiving the query results from a source and before sending data to your destination. When a new sync is triggered, you may see the sync status is Querying. This status means the sync is in one of these three states:
- Hightouch is waiting for query results from the source
- Hightouch is saving the diff file
- Hightouch is performing the change data capture computation
By default, Hightouch computes CDC and stores the diff file on Hightouch-managed infrastructure. If you're on a Business tier plan, you can configure Hightouch to use your own S3 or GCP bucket, so data is never stored in Hightouch's infrastructure.
For some sources, you can choose to do the CDC computation in your own warehouse. Using your warehouse has the advantage of faster syncs at higher volumes but requires granting write access to a separate Hightouch-managed schema in your warehouse. See the Lightning sync engine documentation to learn more.
For most SaaS destinations, Hightouch only tracks changes in columns that are part of the sync configuration. Suppose your model queries twenty columns from a source, but the only sync using that model maps ten of those. Then, Hightouch only tracks changes in the ten mapped fields.
As a general rule of thumb, changes in model configuration only matter for diffing purposes if they include columns used in syncs. However, there is a notable exception: custom destinations, such as HTTP Request, perform diffing on all columns, even those not used in the sync configuration.
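To illustrate the difference between diffing on mapped columns and diffing on all columns, here is a small Python sketch. It is an assumption-laden illustration rather than Hightouch's implementation: the `changed_keys` helper, the column names, and the sample rows are all hypothetical.

```python
def changed_keys(previous, current, primary_key, tracked_columns):
    """Return primary keys of existing rows whose tracked columns changed."""
    prev_by_key = {row[primary_key]: row for row in previous}
    changed = []
    for row in current:
        old = prev_by_key.get(row[primary_key])
        if old is None:
            continue  # New records are handled separately from updates.
        if any(row.get(col) != old.get(col) for col in tracked_columns):
            changed.append(row[primary_key])
    return changed


# Hypothetical rows: only last_login differs between the two runs.
previous = [{"id": 1, "email": "ada@example.com", "plan": "free", "last_login": "2024-01-01"}]
current = [{"id": 1, "email": "ada@example.com", "plan": "free", "last_login": "2024-02-15"}]

mapped_columns = ["email", "plan"]               # columns used in the sync mappings
all_columns = ["email", "plan", "last_login"]    # every column in the model

print(changed_keys(previous, current, "id", mapped_columns))  # []
print(changed_keys(previous, current, "id", all_columns))     # [1]
```

With only the mapped columns tracked, the change to `last_login` produces no update. Tracking every column, as described for custom destinations, flags the row.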
If you change a column's data type in a model (for example, changing it from a string to a number), Hightouch detects these as row changes during the next sync.
If you add or delete columns from your model and then run the sync, Hightouch creates a new diff file for future comparisons.
Since a column addition or deletion affects all rows, it has the effect of resyncing your full query as if it's the first time your sync has run.
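A rough way to see why these model changes touch every row is to compare rows before and after the change. The Python sketch below is only an illustration under simplifying assumptions: it uses plain value equality, and the `count_changed` helper and sample rows are hypothetical. Hightouch's internal comparison may work differently.

```python
def count_changed(previous, current, primary_key="id"):
    """Count rows in `current` that differ from the stored row with the same key."""
    prev_by_key = {row[primary_key]: row for row in previous}
    return sum(1 for row in current if prev_by_key.get(row[primary_key]) != row)


# Hypothetical rows stored from the last run.
previous = [{"id": 1, "zip": "94107"}, {"id": 2, "zip": "10001"}]

# Case 1: the zip column changes type from string to number.
retyped = [{"id": 1, "zip": 94107}, {"id": 2, "zip": 10001}]

# Case 2: a new column is added to the model.
widened = [
    {"id": 1, "zip": "94107", "plan": "free"},
    {"id": 2, "zip": "10001", "plan": "pro"},
]

print(count_changed(previous, retyped))   # 2
print(count_changed(previous, widened))   # 2
print(count_changed(previous, previous))  # 0
```

In both cases every row differs from its stored version, so the whole result set is sent again.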
As explained in the primary key updates section, if you update a model's primary key by selecting a different column, you need to trigger a full resync for all syncs that use that model. Otherwise, change data capture can't process your model data correctly, which can make your syncs fail.
You don't need to manually trigger a full resync if you change the primary key column's data type. If you change the primary key's data type in the model configuration, your sync will process normally. If you make this change in your source or in the SQL editor, the entire model query result set is automatically resynced as if you triggered a full resync. As outlined in the full resync prerequisites section, this can create duplicates in your destination data.
Learn more about changes to your model configuration in the model column changes section.
If you change your sync configuration's mappings, Hightouch reprocesses the entire model query result set during the next sync run. More information can be found in the Field mapping updates section.
Hightouch doesn't maintain a historical record of previous diff files. It keeps only the most recent diff file and compares every new sync run against it.