| Audience | Data engineers and analytics teams managing large-scale data lakes |
| Prerequisites | Access to object storage and catalog configuration (Glue, Hive, or Unity Catalog) |
Use data lake sources to query data from your object storage using open table formats without needing to maintain your own query engine.
Learning objectives
After reading this article, you’ll know how to:
- Connect data lakes stored in S3, GCS, or Azure Blob
- Configure supported open table formats and catalogs
- Query and sync data from Iceberg, Delta Lake, or Hudi tables
- Decide when to use data lake sources vs. standard object storage sources
Overview
Modern data lakes separate storage, metadata management, and compute into distinct layers.
Hightouch's data lake sources connect to each of these layers:
| Layer | Description |
|---|---|
| Storage layer | Your object storage bucket (S3, GCS, or Azure Blob) |
| Table format | Open format organizing your data (Iceberg, Delta Lake, or Hudi) |
| Catalog | Metadata layer tracking table schemas and locations (AWS Glue, Hive Metastore, or Databricks Unity Catalog) |
Once connected, Hightouch provides the compute engine to query your data lake tables and sync them to downstream destinations.
Supported configurations
Hightouch supports the following data lake source types:
- S3 Data Lake
- GCS Data Lake
- Azure Blob Data Lake
Each source type supports specific combinations of table formats and catalogs:
| Catalog | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| AWS Glue | ✅ | ❌ | ✅ |
| Hive Metastore | ✅ | ❌ | ✅ |
| Databricks Unity Catalog | ❌ | ✅ | ❌ |
Setup guide
Step 1: Configure storage connection
Provide connection details for your object storage:
S3 Data Lake
- AWS region
- Bucket name
- Authentication credentials (IAM role or access keys)
GCS Data Lake
- Project ID
- Bucket name
- Service account credentials
Azure Blob Data Lake
- Storage account name
- Container name
- Authentication credentials (connection string or SAS token)
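Before saving the source, it can be worth confirming that the credentials you plan to use can actually reach the bucket. The snippet below is a minimal sketch for the S3 case using boto3; the region, bucket name, and prefix are placeholders, and the GCS and Azure checks follow the same pattern with google-cloud-storage and azure-storage-blob.

```python
import boto3

# Placeholder values -- replace with the bucket and region you plan to
# configure as the storage layer of the data lake source.
REGION = "us-east-1"
BUCKET = "my-data-lake-bucket"

s3 = boto3.client("s3", region_name=REGION)

# Confirms the credentials can resolve the bucket's region
# (requires s3:GetBucketLocation).
location = s3.get_bucket_location(Bucket=BUCKET)
print("Bucket region:", location.get("LocationConstraint") or "us-east-1")

# Confirms the credentials can list objects (requires s3:ListBucket).
resp = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```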
Step 2: Select table format
Choose your table format:
- Apache Iceberg - Supports AWS Glue and Hive Metastore catalogs
- Delta Lake - Supports Databricks Unity Catalog only
- Apache Hudi - Supports AWS Glue and Hive Metastore catalogs
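To make the format-plus-catalog pairing concrete, here is a hedged sketch that reads an Iceberg table registered in AWS Glue using pyiceberg, outside of Hightouch. The catalog name, database, and table identifiers are placeholders; once the source is configured, Hightouch performs the equivalent work with its own query engine.

```python
# pip install "pyiceberg[glue]"
from pyiceberg.catalog import load_catalog

# Placeholder catalog name; AWS credentials and region come from the
# standard boto3 environment/configuration.
catalog = load_catalog("glue_catalog", **{"type": "glue"})

# Placeholder "<database>.<table>" identifier in the Glue catalog.
table = catalog.load_table("analytics.events")
print(table.schema())            # column names and types
print(table.current_snapshot())  # latest committed snapshot, if any

# Read a few rows into an Arrow table to confirm the data files are reachable.
rows = table.scan(limit=10).to_arrow()
print(rows)
```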
Step 3: Configure catalog connection
Depending on your selected table format, configure the appropriate catalog:
AWS Glue
- Select or add your AWS account in Cloud Providers
- Enter the Glue database name
- (Optional) Enter a catalog ID if using cross-account access
- (Optional) Specify a table filter to limit visible tables
Required IAM permissions:
- Glue: `glue:GetDatabase`, `glue:GetDatabases`, `glue:GetTable`, `glue:GetTables`, `glue:GetPartition*`, `glue:SearchTables`
- S3: `s3:ListBucket`, `s3:GetObject`, `s3:GetBucketLocation`
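One way to sanity-check these grants before configuring the source is to exercise the corresponding Glue API calls with the same role or access keys. The following boto3 sketch uses a placeholder region and database name.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

DATABASE = "analytics"  # placeholder Glue database name

# glue:GetDatabase / glue:GetTables
glue.get_database(Name=DATABASE)
tables = glue.get_tables(DatabaseName=DATABASE)["TableList"]
print([t["Name"] for t in tables])

# glue:GetTable / glue:GetPartitions for one table
if tables:
    table_name = tables[0]["Name"]
    glue.get_table(DatabaseName=DATABASE, Name=table_name)
    partitions = glue.get_partitions(DatabaseName=DATABASE, TableName=table_name)
    print(len(partitions["Partitions"]), "partitions visible")
```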
Hive Metastore
- Enter the Thrift URI (e.g., `thrift://hms.company.com:9083`)
- Configure Kerberos authentication:
  - Kerberos Principal
  - Kerberos Keytab file
- Enter the database/schema name
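If you want to confirm the metastore is reachable before configuring the source, a client such as pyiceberg can connect over the same Thrift URI. This sketch uses placeholder schema and table names and does not cover Kerberos client configuration, which varies by environment.

```python
# pip install "pyiceberg[hive]"
from pyiceberg.catalog import load_catalog

# The URI matches the example above; the schema and table are placeholders.
catalog = load_catalog(
    "hms",
    **{"type": "hive", "uri": "thrift://hms.company.com:9083"},
)

print(catalog.list_tables("analytics"))       # tables visible in the schema
table = catalog.load_table("analytics.events")
print(table.schema())
```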
Databricks Unity Catalog
- Enter your Databricks workspace host
- Authenticate using token or OAuth
- Select the catalog
- Select the schema
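To verify the workspace host, token, and catalog/schema selection ahead of time, the Databricks Python SDK can list the tables the source will see. The host, token, catalog, and schema values below are placeholders.

```python
# pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Placeholder workspace host and personal access token; OAuth is also supported.
w = WorkspaceClient(
    host="https://my-workspace.cloud.databricks.com",
    token="dapi-placeholder",
)

# Lists the Delta tables registered under the catalog and schema you plan
# to point the source at.
for t in w.tables.list(catalog_name="main", schema_name="default"):
    print(t.full_name, t.table_type)
```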
Step 4: Test and save
After entering all configuration details, test the connection to verify Hightouch can access your data lake tables. You can save the source configuration even if tables aren't immediately available.
Use data lake sources
Once connected, you can:
- Browse available tables in the source
- Create models that query your data lake tables
- Sync data to any Hightouch destination
- Schedule syncs to keep downstream systems updated
Differences from standard object storage sources
Standard S3, GCS, and Azure Blob sources in Hightouch are designed for simple file-based data (CSV, JSON). Data lake sources are purpose-built for querying structured tables using open formats with full schema evolution and ACID transaction support.
If you're only working with CSV or JSON files, use the standard object storage sources. If you're using Iceberg, Delta Lake, or Hudi table formats with a catalog, use data lake sources.
When to use data lake sources
Data lake sources are ideal when you:
- Store data in object storage using open table formats
- Want to query data lakes without managing compute infrastructure
- Need to sync data from Iceberg, Delta Lake, or Hudi tables
- Don't have or don't want to use your own query engine like Trino or Spark
If you already have a Trino, Presto, or Spark cluster configured to query your data lake, you can connect that directly as a source instead.