| Audience | Data engineers and analytics teams managing large-scale data lakes |
| Prerequisites | Access to object storage and catalog configuration (Glue, Hive, or Unity Catalog) |
Use data lake sources to query data from your object storage using open table formats without needing to maintain your own query engine.
Learning objectives
After reading this article, you’ll know how to:
- Connect data lakes stored in S3, GCS, or Azure Blob
- Configure supported open table formats and catalogs
- Query and sync data from Iceberg, Delta Lake, or Hudi tables
- Decide when to use data lake sources vs. standard object storage sources
Overview
Modern data lakes separate storage, metadata management, and compute into distinct layers.
Hightouch's data lake sources connect to each of these layers:
| Layer | Description |
|---|---|
| Storage layer | Your object storage bucket (S3, GCS, or Azure Blob) |
| Table format | Open format organizing your data (Iceberg, Delta Lake, or Hudi) |
| Catalog | Metadata layer tracking table schemas and locations (AWS Glue, Hive Metastore, or Databricks Unity Catalog) |
Once connected, Hightouch provides the compute engine to query your data lake tables and sync them to downstream destinations.
Supported configurations
Hightouch supports the following data lake source types:
- S3 Data Lake
- GCS Data Lake
- Azure Blob Data Lake
Each source type supports specific combinations of table formats and catalogs:
| Catalog | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| AWS Glue | ✅ | ❌ | ✅ |
| Hive Metastore | ✅ | ❌ | ✅ |
| Databricks Unity Catalog | ❌ | ✅ | ❌ |
Setup guide
Step 1: Configure storage connection
Provide connection details for your object storage:
S3 Data Lake
- AWS region
- Bucket name
- Authentication credentials (IAM role or access keys)
GCS Data Lake
- Project ID
- Bucket name
- Service account credentials
Azure Blob Data Lake
- Storage account name
- Container name
- Authentication credentials (connection string or SAS token)
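Before saving the source, it can be worth confirming that the credentials you plan to use can actually reach the bucket. The snippet below is a minimal sketch for the S3 case using boto3; the region, bucket name, and prefix are placeholders, and the GCS and Azure checks follow the same pattern with google-cloud-storage and azure-storage-blob.

```python
import boto3

# Placeholder values -- replace with the bucket and region you plan to
# configure as the storage layer of the data lake source.
REGION = "us-east-1"
BUCKET = "my-data-lake-bucket"

s3 = boto3.client("s3", region_name=REGION)

# Confirms the credentials can resolve the bucket's region
# (requires s3:GetBucketLocation).
location = s3.get_bucket_location(Bucket=BUCKET)
print("Bucket region:", location.get("LocationConstraint") or "us-east-1")

# Confirms the credentials can list objects (requires s3:ListBucket).
resp = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```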
Step 2: Select table format
Choose your table format:
- Apache Iceberg - Supports AWS Glue and Hive Metastore catalogs
- Delta Lake - Supports Databricks Unity Catalog only
- Apache Hudi - Supports AWS Glue and Hive Metastore catalogs
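To make the format-plus-catalog pairing concrete, here is a hedged sketch that reads an Iceberg table registered in AWS Glue using pyiceberg, outside of Hightouch. The catalog name, database, and table identifiers are placeholders; once the source is configured, Hightouch performs the equivalent work with its own query engine.

```python
# pip install "pyiceberg[glue]"
from pyiceberg.catalog import load_catalog

# Placeholder catalog name; AWS credentials and region come from the
# standard boto3 environment/configuration.
catalog = load_catalog("glue_catalog", **{"type": "glue"})

# Placeholder "<database>.<table>" identifier in the Glue catalog.
table = catalog.load_table("analytics.events")
print(table.schema())            # column names and types
print(table.current_snapshot())  # latest committed snapshot, if any

# Read a few rows into an Arrow table to confirm the data files are reachable.
rows = table.scan(limit=10).to_arrow()
print(rows)
```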
Step 3: Configure catalog connection
Depending on your selected table format, configure the appropriate catalog:
AWS Glue
- Select or add your AWS account in Cloud Providers
- Enter the Glue database name
- (Optional) Enter a catalog ID if using cross-account access
- (Optional) Specify a table filter to limit visible tables
Required IAM permissions:
- Glue: `glue:GetDatabase`, `glue:GetDatabases`, `glue:GetTable`, `glue:GetTables`, `glue:GetPartition*`, `glue:SearchTables`
- S3: `s3:ListBucket`, `s3:GetObject`, `s3:GetBucketLocation`
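One way to sanity-check these grants before configuring the source is to exercise the corresponding Glue API calls with the same role or access keys. The following boto3 sketch uses a placeholder region and database name.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

DATABASE = "analytics"  # placeholder Glue database name

# glue:GetDatabase / glue:GetTables
glue.get_database(Name=DATABASE)
tables = glue.get_tables(DatabaseName=DATABASE)["TableList"]
print([t["Name"] for t in tables])

# glue:GetTable / glue:GetPartitions for one table
if tables:
    table_name = tables[0]["Name"]
    glue.get_table(DatabaseName=DATABASE, Name=table_name)
    partitions = glue.get_partitions(DatabaseName=DATABASE, TableName=table_name)
    print(len(partitions["Partitions"]), "partitions visible")
```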
Hive Metastore
- Enter the Thrift URI (e.g., `thrift://hms.company.com:9083`)
- Configure Kerberos authentication:
  - Kerberos Principal
  - Kerberos Keytab file
- Enter the database/schema name
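If you want to confirm the metastore is reachable before configuring the source, a client such as pyiceberg can connect over the same Thrift URI. This sketch uses placeholder schema and table names and does not cover Kerberos client configuration, which varies by environment.

```python
# pip install "pyiceberg[hive]"
from pyiceberg.catalog import load_catalog

# The URI matches the example above; the schema and table are placeholders.
catalog = load_catalog(
    "hms",
    **{"type": "hive", "uri": "thrift://hms.company.com:9083"},
)

print(catalog.list_tables("analytics"))       # tables visible in the schema
table = catalog.load_table("analytics.events")
print(table.schema())
```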
Databricks Unity Catalog
- Enter your Databricks workspace host
- Authenticate using token or OAuth
- Select the catalog
- Select the schema
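To verify the workspace host, token, and catalog/schema selection ahead of time, the Databricks Python SDK can list the tables the source will see. The host, token, catalog, and schema values below are placeholders.

```python
# pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Placeholder workspace host and personal access token; OAuth is also supported.
w = WorkspaceClient(
    host="https://my-workspace.cloud.databricks.com",
    token="dapi-placeholder",
)

# Lists the Delta tables registered under the catalog and schema you plan
# to point the source at.
for t in w.tables.list(catalog_name="main", schema_name="default"):
    print(t.full_name, t.table_type)
```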
Step 4: Test and save
After entering all configuration details, test the connection to verify Hightouch can access your data lake tables. You can save the source configuration even if tables aren't immediately available.
Use data lake sources
Once connected, you can:
- Browse available tables in the source
- Create models that query your data lake tables
- Sync data to any Hightouch destination
- Schedule syncs to keep downstream systems updated
Differences from standard object storage sources
Standard S3, GCS, and Azure Blob sources in Hightouch are designed for simple file-based data (CSV, JSON). Data lake sources are purpose-built for querying structured tables using open formats with full schema evolution and ACID transaction support.
If you're only working with CSV or JSON files, use the standard object storage sources. If you're using Iceberg, Delta Lake, or Hudi table formats with a catalog, use data lake sources.
When to use data lake sources
Data lake sources are ideal when you:
- Store data in object storage using open table formats
- Want to query data lakes without managing compute infrastructure
- Need to sync data from Iceberg, Delta Lake, or Hudi tables
- Don't have or don't want to use your own query engine like Trino or Spark
If you already have a Trino, Presto, or Spark cluster configured to query your data lake, you can connect that directly as a source instead.