Overview
Data integration techniques are the methods used to move data from source systems to destinations where it can be analyzed and used. Choosing the right technique depends on your data volume, latency requirements, and technical constraints.
This guide covers the five main techniques: ETL, ELT, Change Data Capture, API integration, and streaming. Each has specific use cases where it excels.
For foundational concepts, see What Is Data Integration?
ETL (Extract, Transform, Load)
The traditional approach that dominated data integration for decades.
How ETL Works
- Extract - Pull data from source systems (databases, files, APIs)
- Transform - Clean, reshape, and enrich data in a staging area
- Load - Write transformed data to the destination
Source → Extract → Staging → Transform → Load → Warehouse
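One way to sketch the pattern in Python, using a hypothetical CSV export as the source and SQLite standing in for the warehouse; the point is that cleanup happens in the transform step, before anything reaches the destination:

```python
import csv
import sqlite3
from io import StringIO

# Hypothetical source export; in practice this comes from a database, file, or API.
RAW_CSV = """order_id,amount,country
1,19.99,us
2,not_a_number,DE
3,42.50,fr
"""

def extract():
    """Extract: pull raw rows from the source system."""
    return list(csv.DictReader(StringIO(RAW_CSV)))

def transform(rows):
    """Transform: clean and enrich in a staging step, rejecting bad records."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # data quality enforcement: drop bad records before loading
        clean.append((int(row["order_id"]), amount, row["country"].upper()))
    return clean

def load(rows):
    """Load: write only the transformed rows to the destination."""
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract()))
```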
When to Use ETL
- Complex transformations - Business logic that’s easier to implement outside SQL
- Data quality enforcement - Reject bad records before they reach the warehouse
- Legacy systems - When destination systems can’t handle raw data transformations
- Compliance requirements - When data must be masked or filtered before landing
ETL Limitations
- Slower iteration - Changing transformation logic requires modifying the pipeline
- Debugging difficulty - Raw data isn’t preserved, making issue investigation harder
- Higher latency - Transformation adds processing time before data is available
ETL Tools
- Enterprise: Informatica, Talend, IBM DataStage
- Open source: Apache NiFi, Pentaho
- Cloud: AWS Glue, Azure Data Factory
ELT (Extract, Load, Transform)
The modern standard for analytics workloads.
How ELT Works
- Extract - Pull data from source systems
- Load - Write raw data directly to the destination
- Transform - Process data inside the destination using SQL
Source → Extract → Load → Warehouse → Transform → Analytics
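The same pipeline reshaped as ELT, a minimal sketch with SQLite again standing in for a cloud warehouse: raw records load untouched, and the transformation is plain SQL run inside the destination (the step dbt would manage in practice):

```python
import sqlite3

# Raw records land as-is; no cleanup happens before loading.
raw_orders = [
    (1, "19.99", "us"),
    (2, "not_a_number", "DE"),
    (3, "42.50", "fr"),
]

con = sqlite3.connect("warehouse.db")

# Load: write raw data directly to the destination.
con.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: build a clean model with SQL inside the warehouse.
# The raw table is preserved, so the logic can change and be re-run at any time.
con.execute("DROP TABLE IF EXISTS stg_orders")
con.execute("""
    CREATE TABLE stg_orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(country)       AS country
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0
""")
con.commit()
con.close()
```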
When to Use ELT
- Analytics use cases - Most reporting and dashboarding scenarios
- Cloud warehouses - Snowflake, BigQuery, Databricks make in-warehouse transformation cheap
- Iterative development - When transformation requirements evolve frequently
- Data exploration - When analysts need access to raw data for discovery
ELT Advantages
- Raw data preserved - Can always reprocess if transformation logic changes
- Faster time-to-data - Data lands quickly, transformations can run on schedule
- SQL-based - Transformations are written in a familiar language and version-controlled with dbt
- Scalable - Warehouse compute handles heavy transformations
ELT Tools
- Ingestion: Fivetran, Airbyte, Stitch, Hevo
- Transformation: dbt, Dataform, SQLMesh
- Warehouses: Snowflake, BigQuery, Redshift, Databricks
Change Data Capture (CDC)
Captures changes as they happen rather than extracting full datasets.
How CDC Works
CDC monitors source database transaction logs and captures:
- Inserts - New records added
- Updates - Existing records modified
- Deletes - Records removed
Changes stream to the destination incrementally, keeping it synchronized with the source.
Source DB → Transaction Log → CDC Tool → Destination
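A minimal sketch of the consuming side, using a simplified, hypothetical event envelope (tools like Debezium emit richer payloads read from the transaction log); the idea is that applying each change in order keeps the destination mirroring the source:

```python
# Simplified change events as a CDC tool might emit them after reading the log.
change_events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 2, "row": None},
]

destination = {}  # stands in for the destination table, keyed by primary key

def apply(event):
    """Apply a single change event so the destination stays in sync with the source."""
    if event["op"] in ("insert", "update"):
        destination[event["key"]] = event["row"]
    elif event["op"] == "delete":
        destination.pop(event["key"], None)

for event in change_events:
    apply(event)

print(destination)  # {1: {'id': 1, 'status': 'shipped'}}
```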
When to Use CDC
- Near real-time requirements - Data needed within minutes, not hours
- Large source tables - Full extraction would be too slow or resource-intensive
- Operational analytics - Dashboards reflecting current state
- Event-driven architectures - Triggering downstream processes on data changes
CDC Advantages
- Low latency - Minutes instead of hours
- Reduced source load - Reads logs instead of querying tables
- Complete history - Captures every change, not just current state
- Efficient - Only transfers changed data
CDC Challenges
- Complexity - Requires understanding of database internals
- Schema changes - DDL changes need careful handling
- Initial load - First sync still requires full extraction
- Log retention - Source must retain logs long enough for processing
CDC Tools
- Open source: Debezium (Kafka-based)
- Managed: Fivetran CDC, Airbyte CDC, Hevo CDC
- Cloud: AWS DMS, Azure Data Factory CDC, GCP Datastream
API Integration
Pulls data from web APIs, essential for SaaS application data.
How API Integration Works
- Authentication - OAuth, API keys, or tokens
- Request - Call endpoints with parameters
- Pagination - Handle multi-page responses
- Rate limiting - Respect API throttling
- Transformation - Flatten nested JSON structures
API Endpoint → Auth → Request → Parse Response → Handle Pagination → Load
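A minimal sketch with Python and requests against a hypothetical /v1/orders endpoint, showing token auth, cursor pagination, and a basic rate-limit backoff; the response shape is assumed for illustration, and real connectors add retries, incremental state, and schema handling:

```python
import time
import requests

API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def fetch_all_orders():
    """Page through the API, respecting rate limits, and return flattened rows."""
    rows, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
        if resp.status_code == 429:              # rate limited: back off and retry
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        payload = resp.json()                    # response shape assumed for illustration
        for order in payload["data"]:
            # Flatten nested JSON for analytics-friendly loading.
            rows.append({
                "order_id": order["id"],
                "amount": order["total"]["amount"],
                "currency": order["total"]["currency"],
            })
        cursor = payload.get("next_cursor")
        if not cursor:                           # no more pages
            return rows
```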
When to Use API Integration
- SaaS applications - Salesforce, HubSpot, Stripe, Zendesk
- Third-party data - Enrichment services, market data
- No database access - When direct DB connections aren’t available
- Webhooks - Event-driven data from external systems
API Integration Challenges
- Rate limits - APIs restrict request frequency
- API changes - Vendors modify endpoints and schemas
- Pagination - Large datasets require multiple requests
- Authentication complexity - OAuth flows, token refresh
- Nested data - JSON structures need flattening for analytics
API Integration Approaches
| Approach | Pros | Cons |
|---|---|---|
| Managed tools (Fivetran, Airbyte) | Fast setup, maintained connectors | Less control, per-connector costs |
| Custom code | Full control, no per-connector fees | Maintenance burden, error handling |
| iPaaS (Workato, Tray) | Visual building, pre-built connectors | Can get expensive, less flexibility |
API Integration Tools
- Managed: Fivetran, Airbyte, Stitch, Hevo
- Custom: Python + requests, Singer taps
- iPaaS: Workato, Tray.io, Zapier (simple cases)
Streaming Integration
Processes data continuously as it arrives.
How Streaming Works
Data flows through a message broker and is processed in real time:
Source (events) → Producer → Message Broker (Kafka/Kinesis) → Consumer (processing) → Destination
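A minimal producer-side sketch using the kafka-python client, assuming a Kafka broker at localhost:9092 and a hypothetical clickstream topic; a consumer or stream processor (Flink, Kafka Streams) would read the same topic downstream:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Producer: serializes each event as JSON and sends it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",           # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_click(user_id: int, page: str) -> None:
    """Publish one clickstream event; consumers can process it within seconds."""
    event = {"user_id": user_id, "page": page, "ts": time.time()}
    producer.send("clickstream", value=event)     # hypothetical topic name

emit_click(42, "/pricing")
producer.flush()  # block until buffered events are delivered
```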
When to Use Streaming
- Sub-minute latency - Fraud detection, real-time pricing
- Event-driven systems - Microservices communication
- High-volume events - Clickstreams, IoT sensors, logs
- Real-time dashboards - Operational monitoring
When NOT to Use Streaming
Most companies don’t need streaming for analytics:
- Batch is simpler - Less infrastructure, easier debugging
- Batch is cheaper - No always-on processing costs
- Daily data is fine - Most business decisions don't need real-time data
Ask: “What decision requires data fresher than hourly?” If you can’t answer clearly, batch is probably sufficient.
Streaming Challenges
- Complexity - Distributed systems, exactly-once semantics
- Cost - Always-on infrastructure
- Late arrivals - Handling out-of-order events
- Schema evolution - Managing changes in event structures
- Debugging - Harder to troubleshoot than batch
Streaming Tools
- Message brokers: Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
- Stream processing: Apache Flink, Spark Streaming, Kafka Streams, ksqlDB
- Managed: Confluent Cloud, Amazon MSK, Redpanda
Comparing Integration Techniques
| Technique | Latency | Complexity | Best For |
|---|---|---|---|
| ETL | Hours | Medium | Complex transformations, compliance |
| ELT | Hours | Low | Analytics, modern data stack |
| CDC | Minutes | High | Near real-time, large tables |
| API | Hours | Medium | SaaS data, third-party sources |
| Streaming | Seconds | High | Real-time requirements, events |
Decision Framework
Start with ELT for most analytics use cases. It’s the simplest, most flexible approach.
Add CDC when:
- Hourly latency isn’t fast enough
- Source tables are too large for full extraction
- You need change history, not just current state
Add Streaming only when:
- Business genuinely requires sub-minute data
- You have engineering capacity to maintain it
- The ROI justifies the complexity
Use API Integration when:
- Data lives in SaaS applications
- No direct database access available
- You prefer managed tools for standard connectors
Hybrid Approaches
Most mature organizations combine techniques:
Common Combinations
ELT + CDC:
- ELT for SaaS sources (Fivetran → warehouse)
- CDC for operational databases needing fresher data
ELT + Streaming:
- ELT for analytical workloads
- Streaming for operational use cases (alerts, real-time features)
API + CDC:
- API integration for SaaS applications
- CDC for internal databases
Architecture Example
SaaS Apps ──→ Fivetran (API) ──→ Snowflake ──→ dbt ──→ Analytics
Databases ──→ Debezium (CDC) ──→ Snowflake ──→ dbt ──→ Analytics
Events ──→ Kafka ──→ Flink ──→ Real-time Features
Getting Started
1. Assess Your Requirements
- What latency do you actually need?
- How large are your source datasets?
- What technical capacity do you have?
2. Start Simple
Begin with managed ELT tools for most sources. Add complexity only when requirements demand it.
3. Build Incrementally
Don’t implement streaming because it seems modern. Implement it when batch latency genuinely blocks business value.
Related Reading
- What Is Data Integration? - Core concepts and patterns
- What Is Data Engineering? - The discipline that implements integration
- What Is Data Architecture? - The blueprint integration supports
- What Is Data Quality? - Ensuring integrated data is trustworthy
- What Is Data Lineage? - Tracking data through integrations
