Data Integration: The Definitive Guide
Data integration is the process of moving data between databases internal, external, or both. Read this guide to learn about all the integration technologies and understand how data moves through each of them.
January 28, 2021
What is Data Integration?
The core definition of data integration is focused on combining data from disparate sources into a centralized view. Sounds simple right? Since you’re reading this, you already know that such a narrow definition of data integration is reckless.
Before diving into what data integration really means, rest assured that this guide is not meant to convince you that a particular solution or technology is better than another because there are several ways to tackle the approach to integration. Instead, the goal of this guide is to share a comprehensive, unbiased, 360-degree overview of different integration tools and review categories like iPaaS, CDP, ETL, ELT, and Reverse ETL.
Ok, so what the heck is data integration?
To put it simply, data integration is the process of moving data from both internal and external databases to a target system. Here, databases can include production DBs, data warehouses (DWs) as well as third-party tools and systems that generate and store data. At its core, data integration focuses on consolidating disparate data sources into a centralized location.
Modern data integration technologies were originally introduced to address the rapid adoption of the relational database. With the rapid adoption of relational databases, data teams needed a way to avoid manually integrating data and simplify overall integration efforts.
One important thing to note is the fact that all integration products and integration systems use the same underlying technology — APIs. If you are not familiar with APIs, here is an in-depth guide, a video, and a course.
The data integration landscape…
• iPaaS or Integration Platform as a Service: data moves between cloud apps directly with little to no transformation taking place in the iPaaS
• CDP or Customer Data Platform: data moves between cloud apps via a central hub which enables moderate transformation capabilities
• ETL or Extract, Transform and Load: data moves from cloud apps to a data warehouse via a robust transformation layer built into the ETL tool. With ETL data is transformed before it is loaded into the warehouse.
• ELT or Extract, Load, and Transform: data moves from cloud apps to a data warehouse where transformation and data modeling happens using SQL. With ELT data is transformed after being loaded into the warehouse.
• Reverse ETL: data moves from a data warehouse to cloud apps. Typically, the core transformation takes place in the warehouse before the reverse ETL process, but the reverse ETL tool may have a minimal transformation layer to fit data to an external system's schema
iPaaS or Integration Platform as a Service
iPaaS was allegedly coined in 2008 by Boomi, an enterprise iPaaS vendor. Since then, iPaaS has seen wide adoption and has resulted in a proliferation of companies offering iPaaS solutions in various shapes and packages.
Needless to say, the range of integrations and capabilities differ from vendor to vendor, but fundamentally all iPaaS solutions do the same thing: they perform actions based on a trigger. A trigger is essentially an event taking place in a system (ex: email open, new signup, etc.) that is then transmitted to an integration platform (via an API call or Webhook) which then performs one or more predefined actions.
The simplest example is every time you receive an email (an event or trigger), you also receive a Slack message with the contents of that email (action one), the email is then marked as read (action two). Here, your email client delivers this message to the integration platform as soon as a new email lands in your inbox, and then the iPaaS solution takes over.
An interesting thing to keep in mind about iPaaS solutions is that they can also check for changes in system “A” on a schedule (like every hour) and if a change is detected, the specified actions are performed in system “B”.
Taking the same example as above, the integration platform can knock on your inbox every hour and if there is a new email, it can perform the actions, and if not, then it does nothing. Here too, the trigger is a new email being found in your inbox but this time, the iPaaS had to fetch this information from the email client instead of having it delivered.
iPaaS solutions can also be used to move data between internal systems where events take place. However, it’s rather uncommon since integration platforms generally charge based on the number of operations or tasks they perform and it can get pretty expensive pretty quickly.
One of the biggest drivers of the adoption of iPaaS is the fact that they offer a visual interface to build integrations, enabling business teams to take control of their workflow automation needs. In essence, iPaaS brings the simplest of coding concepts into a UI that is easy to maneuver with little to no technical knowledge, which is truly amazing!
Additionally, one of the areas where iPaaS solutions shine is the depth of integrations they offer with third-party SaaS tools. As long as an API endpoint is exposed by a SaaS vendor, an iPaaS tool can pull data from it or push data to it, enabling some really complex integrations that would otherwise require writing and maintaining a lot of code.
That said, like any other technology, iPaaS tools have a set of limitations and are not the ideal solution for all types of integration needs. Event-based integrations particularly fall short when there is a need to move data that does not rely on events. Lastly, since iPaaS solutions are so flexible and free-form, they push a lot of complexity onto the user. Leveraging these iPaaS solutions to their fullest extent means building and maintaining complex workflows with numerous if/then branches and complex steps for each specific action.
CDP or Customer Data Platform
Customer data platforms collect and consolidate customer data from different cloud sources into a unified solution and then send that data to different destinations. Other than moving data, CDPs also enable data collection via proprietary SDKs and APIs. However, I intend to keep this guide focused on data integration and thus will leave data collection for another day and guide.
Like iPaaS solutions, CDPs have also gained steam in the last few years, resulting in horizontal CDP vendors like Segment, mParticle, Lytics, and Tealium as well as vertical CDPs like Amperity (retail and hospitality) and Zaius (ecommerce) focused on solving the needs of specific industries.
In terms of moving data, Customer Data Platforms rely on predefined data models and offer limited or shallow integrations with third-party vendors, making them less than ideal for many data integration use cases.
By definition, a CDP does a lot more than move data between tools. It enables marketing and growth teams to build segments based on user behavior and user traits, and sync these segments to third-party tools (ex: Google Ads or Facebook Ads) to deliver personalized experiences all without relying on engineering or data teams. If you're curious to learn more about CDPs, check out this guide from Data-led Academy.
On the other hand, if you’re already using a Data Warehouse such as Redshift, Snowflake, or BigQuery, you might not even need a CDP — read this in-depth piece that argues why your Data Warehouse should act as your CDP.
ETL or Extract, Transform and Load
ETL is a traditional data integration process with origins dating back to the 1970s. However, it was not until the 1990s that Informatica made ETL commonplace in the enterprise.
Under the ETL paradigm, data is first extracted from first-party databases and third-party/external sources (primarily SaaS tools for sales, marketing, and support), then transformed to meet the needs of business analysts and data scientists, and finally loaded into a data warehouse.
The transformation is particularly resource-intensive and time-consuming which significantly impacts the time it takes between the extraction and the loading of data.
However, due to advancements in warehousing technologies and ecosystem, ETL is being replaced by ELT which is faster and more flexible.
ELT or Extract, Load, and Transform
ELT is the modern approach to ETL which is being fueled by the enterprise data warehouse (i.e. Snowflake, Google BigQuery, Azure Synapse, and Amazon Redshift). These solutions have become extremely fast and reliable, enabling transformation to take place inside the warehouse itself. With data warehouses separating storage and compute, the flexibility and cost benefits that these platforms provide, make data warehousing a no-brainer.
Enterprise data integration solutions like Fivetran, Stitch, and Matillion are leading the new ELT paradigm and warehouse approach. Talend, which is a leading ETL provider, acquired Stitch in 2018 to embrace ELT. Hence, it is fair to say that ELT is the new ETL, but both terms are at play today.
Under the ELT paradigm, data is extracted from source systems and loaded into the warehouse without any transformation taking place. In fact, many modern ELT tools do not offer built-in transformation capabilities, choosing instead to integrate with transformation and data modeling tools like dbt that are designed to handle the transformation layer within the data warehouse. This removes the need to execute transformations elsewhere and lets you create data models and business rules within the warehouse.
ELT is fast, affordable and most importantly, requires no coding, all of which is fueling the shift from ETL to ELT.
The benefits that ELT tools bring to cloud data warehouses are plenty. However, the ability to transform data from numerous sources in the warehouse using tools like dbt makes the data warehouse the true source of truth for all types of data, particularly customer data which is typically scattered across internal and external systems. As a result, a new breed of tools are coming to light and for the lack of a better term, Reverse ETL is born.
Reverse ETL focuses on moving data out of the warehouse and pushing it back into the tools that run business processes (i.e. SaaS tools used for growth, marketing, sales, and support).
Why would you want to move your data out of the warehouse if you just spent so much effort getting it there in the first place?
Data warehouses exist to consolidate and de-silo data for business intelligence and analytics purposes, so analysts can write complex queries. The problem is, data warehouses have in fact, become a larger data silo. To be specific, the information within them is only accessible by technical users who know how to write SQL.
Even worse, the data usually only exists within a dashboard or a report and that means it is not actionable by anyone except business stakeholders who use that information to make high-level business decisions.
Hightouch is a Reverse ETL tool that enables companies to sync customer data from the data warehouse to Sales, Marketing, and analytics tools using just SQL and no scripts.
With Hightouch you can:
Extract data from a warehouse or a database on a regular cadence and load it into sales, marketing, and analytics tools
Trigger a webhook or make an arbitrary API call every time the data changes
Move extracted rows of data to a production database to deliver personalized experiences
So how does this new approach benefit data teams, you ask?
In many cases, data engineers are tasked to build custom integrations and pipelines to various SaaS tools so different business teams can get access to the information. These pipelines are time-consuming to build and brittle. Reverse ETL handles all of this for you. All you have to do is define your data and choose where you want to sync it.
Does this only benefit data teams?
Sales, marketing, and business analytics teams are positively impacted by this new data workflow as well because they are finally able to analyze and act upon the same, consistent, reliable data. This builds trust in the data and harmony amongst the teams.
Check out this customer story to see how CircleCI uses Hightouch to identify PQLs, improve data reliability, and reduce customer churn.
Hopefully, this article helps you get a handle on the data integration landscape and enables you to understand the pros and cons of the different technologies. It should be obvious that there are multiple ways to fulfill your business requirements. What is best for you truly depends on your use cases and business capabilities. I would love to hear from you if you have any feedback or if you would like to learn more about any of these integration technologies. If you're curious to see a Reverse ETL tool in action, you can schedule a Hightouch demo today. or signup for free.