
Data Lake Sources

Audience: Data engineers and analytics teams managing large-scale data lakes
Prerequisites: Access to object storage and catalog configuration (Glue, Hive, or Unity Catalog)

Use data lake sources to query data from your object storage using open table formats without needing to maintain your own query engine.


Learning objectives

After reading this article, you’ll know how to:

  • Connect data lakes stored in S3, GCS, or Azure Blob
  • Configure supported open table formats and catalogs
  • Query and sync data from Iceberg, Delta Lake, or Hudi tables
  • Understand when to use data lake sources vs. standard object storage

Overview

Modern data lakes separate storage, metadata management, and compute into distinct layers.

Hightouch's data lake sources connect to these key components:

  • Storage layer: Your object storage bucket (S3, GCS, or Azure Blob)
  • Table format: The open format organizing your data (Iceberg, Delta Lake, or Hudi)
  • Catalog: The metadata layer tracking table schemas and locations (AWS Glue, Hive Metastore, or Databricks Unity Catalog)

Once connected, Hightouch provides the compute engine to query your data lake tables and sync them to downstream destinations.
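
To make these layers concrete, here's a minimal sketch (using the open-source PyIceberg library, independent of Hightouch) of how a query engine touches each one. The catalog name and table identifier are hypothetical placeholders.

```python
from pyiceberg.catalog import load_catalog

# Catalog layer: AWS Glue resolves table schemas and locations
catalog = load_catalog("glue", **{"type": "glue"})  # uses ambient AWS credentials

# Table format layer: Iceberg metadata describes the table's schema and files
table = catalog.load_table("analytics.events")  # hypothetical database.table

# Storage layer: the scan reads the table's data files from your S3 bucket
print(table.scan(limit=10).to_pandas())
```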


Supported configurations

Hightouch supports the following data lake source types:

  • S3 Data Lake
  • GCS Data Lake
  • Azure Blob Data Lake

Each source type supports specific combinations of table formats and catalogs:

  • AWS Glue: Iceberg and Hudi
  • Hive Metastore: Iceberg and Hudi
  • Databricks Unity Catalog: Delta Lake

Setup guide

Step 1: Configure storage connection

Provide connection details for your object storage:

S3 Data Lake

  • AWS region
  • Bucket name
  • Authentication credentials (IAM role or access keys)
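
Before entering these values, you can sanity-check them outside Hightouch. A minimal sketch with boto3, assuming ambient AWS credentials; the bucket name and region are placeholders:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # your AWS region
bucket = "my-data-lake-bucket"                    # hypothetical bucket name

s3.head_bucket(Bucket=bucket)                        # raises if inaccessible
print(s3.get_bucket_location(Bucket=bucket))         # confirms the region
resp = s3.list_objects_v2(Bucket=bucket, MaxKeys=5)  # confirms list access
for obj in resp.get("Contents", []):
    print(obj["Key"])
```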

GCS Data Lake

  • Project ID
  • Bucket name
  • Service account credentials
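
A similar sketch for GCS, assuming the google-cloud-storage package and a service account key supplied via GOOGLE_APPLICATION_CREDENTIALS; the project and bucket names are placeholders:

```python
from google.cloud import storage

# Reads the service account key from GOOGLE_APPLICATION_CREDENTIALS
client = storage.Client(project="my-project-id")  # hypothetical project ID

for blob in client.list_blobs("my-data-lake-bucket", max_results=5):
    print(blob.name)
```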

Azure Blob Data Lake

  • Storage account name
  • Container name
  • Authentication credentials (connection string or SAS token)
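
And for Azure Blob, a sketch assuming the azure-storage-blob package; the connection string and container name are placeholders:

```python
from azure.storage.blob import ContainerClient

client = ContainerClient.from_connection_string(
    conn_str="<your connection string>",  # or build a client from a SAS token
    container_name="my-container",        # hypothetical container name
)

for i, blob in enumerate(client.list_blobs()):
    print(blob.name)
    if i >= 4:  # stop after a handful of blobs
        break
```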

Step 2: Select table format

Choose your table format:

  • Apache Iceberg - Supports AWS Glue and Hive Metastore catalogs
  • Delta Lake - Supports Databricks Unity Catalog only
  • Apache Hudi - Supports AWS Glue and Hive Metastore catalogs

Step 3: Configure catalog connection

Depending on your selected table format, configure the appropriate catalog:

AWS Glue

  1. Select or add your AWS account in Cloud Providers
  2. Enter the Glue database name
  3. (Optional) Enter a catalog ID if using cross-account access
  4. (Optional) Specify a table filter to limit visible tables

Required IAM permissions:

  • Glue: glue:GetDatabase, glue:GetDatabases, glue:GetTable, glue:GetTables, glue:GetPartition*, glue:SearchTables
  • S3: s3:ListBucket, s3:GetObject, s3:GetBucketLocation
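
If you manage the role programmatically, one way (a sketch, not an official Hightouch policy document) to attach these permissions with boto3; the role, policy, and bucket names are hypothetical:

```python
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase", "glue:GetDatabases",
                "glue:GetTable", "glue:GetTables",
                "glue:GetPartition*", "glue:SearchTables",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::my-data-lake-bucket",   # hypothetical bucket
                "arn:aws:s3:::my-data-lake-bucket/*",
            ],
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="hightouch-data-lake-role",      # hypothetical role name
    PolicyName="hightouch-data-lake-access",  # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```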

Hive Metastore

  1. Enter the Thrift URI (e.g., thrift://hms.company.com:9083)
  2. Configure Kerberos authentication:
    • Kerberos Principal
    • Kerberos Keytab file
  3. Enter database/schema name
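
Kerberos issues are easier to debug once you know the endpoint itself is reachable. A standard-library sketch that checks only the host and port, not the credentials:

```python
import socket
from urllib.parse import urlparse

uri = urlparse("thrift://hms.company.com:9083")  # your Thrift URI
with socket.create_connection((uri.hostname, uri.port), timeout=5):
    print(f"Metastore reachable at {uri.hostname}:{uri.port}")
```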

Databricks Unity Catalog

  1. Enter your Databricks workspace host
  2. Authenticate using token or OAuth
  3. Select catalog
  4. Select the schema
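
To confirm your token can see the catalog and schema you plan to select, a sketch using the databricks-sdk package; the host, token, and catalog name are placeholders:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://my-workspace.cloud.databricks.com",  # your workspace host
    token="<personal access token>",
)

for catalog in w.catalogs.list():  # catalogs visible to the token
    print(catalog.name)

for schema in w.schemas.list(catalog_name="main"):  # hypothetical catalog
    print(schema.full_name)
```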

Step 4: Test and save

After entering all configuration details, test the connection to verify Hightouch can access your data lake tables. You can save the source configuration even if tables aren't immediately available.


Use data lake sources

Once connected, you can:

  • Browse available tables in the source
  • Create models that query your data lake tables
  • Sync data to any Hightouch destination
  • Schedule syncs to keep downstream systems updated
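
Before building a model, it can help to spot-check a table's schema and contents directly. A sketch using the open-source deltalake package (independent of Hightouch), assuming AWS credentials in the environment; the table path is hypothetical:

```python
from deltalake import DeltaTable

# Reads AWS credentials from the environment; pass storage_options to override
dt = DeltaTable("s3://my-data-lake-bucket/tables/users")  # hypothetical path

print(dt.schema())            # column names and types to reference in a model
print(dt.to_pandas().head())  # a few sample rows
```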

Differences from standard object storage sources

Standard S3, GCS, and Azure Blob sources in Hightouch are designed for simple file-based data (CSV, JSON). Data lake sources are purpose-built for querying structured tables using open formats with full schema evolution and ACID transaction support.

If you're only working with CSV or JSON files, use the standard object storage sources. If you're using Iceberg, Delta Lake, or Hudi table formats with a catalog, use data lake sources.


When to use data lake sources

Data lake sources are ideal when you:

  • Store data in object storage using open table formats
  • Want to query data lakes without managing compute infrastructure
  • Need to sync data from Iceberg, Delta Lake, or Hudi tables
  • Don't have or don't want to use your own query engine like Trino or Spark

If you already have a Trino, Presto, or Spark cluster configured to query your data lake, you can connect that directly as a source instead.



Last updated: Oct 14, 2025
