Hightouch is a Data Activation platform that syncs data from sources to business applications and developer tools. This frees up valuable engineering time for your data team and delivers actionable data directly to business teams. It also ensures data consistency across your organization.
Keep reading to gain a high-level understanding of Hightouch's core concepts.
A source is wherever business data is stored. The most frequently used sources include data warehouses like Snowflake and Google BigQuery. Sources can also be databases, CSV files, SFTP, or BI tools.
To add a source to your Hightouch workspace, go to the Sources overview page and click the Add source button.
A destination is any tool or service you want to send source data to. Destinations are where end users typically consume data. Hightouch integrates with 125+ destinations, including CRM systems, ad platforms, marketing automation tools, and support tools.
To add a destination to your Hightouch workspace, go to the Destinations overview page and click the Add destination button.
For Hightouch to know what data to sync, you need to create a model. Models define the data you want to pull from a source.
You can define models by:
- writing a query in the SQL editor
- using the visual table selector
- or leveraging existing dbt models or Looker Looks
You can also use Hightouch's no-code Audiences feature to define cohorts before syncing data to a destination. Each cohort acts as a segmented model.
Regardless of how you build your models, you must configure them with a unique primary key. Hightouch uses this unique identifier to keep track of records, which lets it sync only new and updated data to your destinations. See the change data capture section to learn how Hightouch accomplishes this.
To ensure your destinations receive all desired data, it's imperative to select a truly unique primary key. For event syncs, for example, it's best to use a hash function combining all columns in the event data, including member ID, timestamp, etc.
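As an illustration of the hashing approach, the sketch below derives a deterministic key from every column of an event row. The function name `event_primary_key` and the sample columns are hypothetical. In practice you would more likely compute the hash in the model's SQL with your warehouse's own hash function; Python is used here only to show the idea.

```python
import hashlib
import json


def event_primary_key(event: dict) -> str:
    """Return a SHA-256 hex digest computed over every column of an event row."""
    # Sort keys so that column order never changes the hash; default=str
    # serializes values such as datetimes that JSON can't encode natively.
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical event rows for illustration.
event = {"member_id": 42, "event_type": "page_view", "timestamp": "2024-01-01T12:00:00Z"}
reordered = {"timestamp": "2024-01-01T12:00:00Z", "event_type": "page_view", "member_id": 42}
later = {**event, "timestamp": "2024-01-01T12:00:01Z"}

# The same values always produce the same key, regardless of column order.
assert event_primary_key(event) == event_primary_key(reordered)
# Any difference in any column produces a different key.
assert event_primary_key(event) != event_primary_key(later)
```

Because the key covers all columns, two events from the same member are treated as distinct records as long as at least one value, such as the timestamp, differs.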
Once Hightouch knows what data model to query from a source, you can configure a sync to declare how you want that data to appear in your destination.
You can build multiple syncs to different destinations from the same model. For example, you can use a model containing customer data to configure syncs to sales, marketing, and support tools. Using the same data model ensures that all parts of your business are working off the same source of truth.
Sync configuration varies depending on the destination but generally includes the same steps. You select the appropriate sync mode for your use case—upsert, insert, update, etc.—and declaratively map source columns to destination fields.
Part of sync configuration is scheduling. Besides triggering syncs manually, you can set syncs to run on a recurring schedule. You can also trigger syncs automatically via dbt Cloud, Fivetran, Airflow, Dagster, Prefect, or our REST API.
Refer to the sync overview page for more information on sync configuration and scheduling.
If Hightouch were to send all query results from a model at every sync, we'd likely be overwriting values that don't need updating. To avoid excessive API requests and send only necessary updates to your destinations, Hightouch uses a process commonly referred to as diffing or change data capture (CDC).
Whenever a new sync is triggered, Hightouch compares the previous sync run to the current set of query results. To do this, Hightouch keeps a record of the data sent in the last sync. This record is called the diff file.
Using the model's primary key, Hightouch compares the query results against the diff file to identify what has changed in the source data since the last run. The comparison follows these steps:
- Hightouch queries the source using the defined model.
- Hightouch compares all the primary keys in the query results with the primary keys in the diff file.
- For each primary key:
  - If it's in the most recent query results but not in the diff file, Hightouch treats this as a new record.
  - If it's in both, Hightouch scans columns for changes.
  - If it's in the diff file but missing from the most recent query results, Hightouch treats this as a deleted record.
- Hightouch creates a new diff file for the next comparison.
- Hightouch syncs changes to the destination.
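The comparison steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Hightouch's implementation: the diff file is modeled as an in-memory list of rows, and `diff_rows` and the keys of its return value are hypothetical names.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(diff_file: list[Row], query_results: list[Row], primary_key: str) -> dict[str, list[Row]]:
    """Classify rows as added, changed, or removed by comparing primary keys."""
    # Index both snapshots by primary key for constant-time lookups.
    previous = {row[primary_key]: row for row in diff_file}
    current = {row[primary_key]: row for row in query_results}

    # In the query results but not in the diff file: new record.
    added = [row for key, row in current.items() if key not in previous]
    # In both: compare the column values to detect changes.
    changed = [row for key, row in current.items() if key in previous and row != previous[key]]
    # In the diff file but not in the query results: deleted record.
    removed = [row for key, row in previous.items() if key not in current]

    return {
        "added": added,
        "changed": changed,
        "removed": removed,
        # The current results become the diff file for the next comparison.
        "next_diff_file": list(current.values()),
    }


# Hypothetical data: row 1 changes, row 2 disappears, row 3 is new, row 4 is untouched.
diff_file = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 4, "email": "d@example.com"},
]
query_results = [
    {"id": 1, "email": "a@new.example.com"},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

result = diff_rows(diff_file, query_results, primary_key="id")
print([row["id"] for row in result["added"]])    # [3]
print([row["id"] for row in result["changed"]])  # [1]
print([row["id"] for row in result["removed"]])  # [2]
```

In this sketch, indexing rows by key silently collapses rows that share a key, which is one way to see why a truly unique primary key matters.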
The CDC method Hightouch uses is called difference-based CDC because it involves a full before and after comparison. Difference-based CDC is only one of several CDC methods, but it's necessary when syncing data from data warehouses, since warehouses can't produce CDC logs for arbitrary SQL queries or dbt models.
Online transactional processing (OLTP) databases like Postgres, MySQL, or Microsoft SQL Server natively log incremental changes that occur on data tables. Most modern ETL tools use these transaction logs to track changes when sending data to data warehouses or lakes. Since Hightouch does the reverse—sending data from warehouses to other destinations—it can't rely on log-based CDC. How CDC happens is a crucial difference between ETL and reverse ETL.
Hightouch performs change data capture after receiving the query results from a source and before sending data to your destination. When a new sync is triggered, you may see the sync status is Querying. This status means the sync is in one of these three states:
- Hightouch is waiting for query results from the source
- Hightouch is saving the diff file
- Hightouch is performing the change data capture computation
By default, Hightouch computes CDC and stores the diff file on Hightouch-managed infrastructure. If you're on a Business Tier plan, you can configure Hightouch to use your own S3 or GCP bucket, so data is never stored in Hightouch's infrastructure.
For some sources, you can choose to do the CDC computation in your own warehouse. Using your warehouse has the advantage of faster syncs at higher volumes but requires granting write access to a separate Hightouch-managed schema in your warehouse. See the Lightning sync engine documentation to learn more.
For most SaaS destinations, Hightouch only tracks changes in columns that are part of the sync configuration. Suppose your model queries twenty columns from a source, but the only sync using that model maps ten of those. Then, Hightouch only tracks changes in the ten mapped fields.
As a general rule of thumb, changes in model configuration only matter for diffing purposes if they include columns used in syncs. However, there is a notable exception: custom destinations, such as HTTP Request, perform diffing on all columns, even those not used in the sync configuration.
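To make the column-tracking rule concrete, here is a minimal sketch that compares two versions of a row over a chosen set of columns. The helper `has_tracked_change` and the sample columns are hypothetical; the sketch only illustrates the behavior described above and is not Hightouch's code.

```python
from typing import Any

Row = dict[str, Any]


def has_tracked_change(previous: Row, current: Row, tracked_columns: list[str]) -> bool:
    """Return True if any tracked column differs between two versions of a row."""
    return any(previous.get(col) != current.get(col) for col in tracked_columns)


# Only last_login, a column that is not mapped in the sync, changes between runs.
previous = {"id": 1, "email": "a@example.com", "plan": "free", "last_login": "2024-01-01"}
current = {"id": 1, "email": "a@example.com", "plan": "free", "last_login": "2024-02-01"}

mapped_columns = ["email", "plan"]  # columns mapped in the sync configuration
all_columns = list(current)         # every column the model returns

# Typical SaaS destination: only mapped columns are compared, so nothing to sync.
print(has_tracked_change(previous, current, mapped_columns))  # False
# Custom destination such as HTTP Request: all columns are compared.
print(has_tracked_change(previous, current, all_columns))     # True
```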
If you change a column's data type in a model, for example from a string to a number, Hightouch treats the affected rows as changed during the next sync.
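A small sketch shows why a type change alone registers as a change: under a plain equality comparison, the string `"94107"` and the number `94107` are different values. The helper `changed_columns` and the `zip_code` column are hypothetical, and the equality check is an assumption about how such a comparison could work, not a description of Hightouch's internals.

```python
def changed_columns(previous: dict, current: dict) -> list[str]:
    """Return the sorted names of columns whose values differ between two rows."""
    return sorted(
        col
        for col in previous.keys() | current.keys()
        if previous.get(col) != current.get(col)
    )


# Same logical value, but the column's type changed from string to number.
previous_row = {"id": 1, "zip_code": "94107"}
current_row = {"id": 1, "zip_code": 94107}

print(changed_columns(previous_row, current_row))  # ['zip_code']
```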
If you map new fields in your sync and then run it, Hightouch creates a new diff file for future comparisons.
Hightouch doesn't maintain a historical record of previous diff files. We keep only the most recent diff file and compare every new sync run against it.