Overview

Data integration techniques are the methods used to move data from source systems to destinations where it can be analyzed and used. Choosing the right technique depends on your data volume, latency requirements, and technical constraints.

This guide covers the five main techniques: ETL, ELT, Change Data Capture, API integration, and streaming. Each has specific use cases where it excels.

For foundational concepts, see "What Is Data Integration?"


ETL (Extract, Transform, Load)

The traditional approach that dominated data integration for decades.

How ETL Works

  1. Extract - Pull data from source systems (databases, files, APIs)
  2. Transform - Clean, reshape, and enrich data in a staging area
  3. Load - Write transformed data to the destination

Source → Extract → Staging → Transform → Load → Warehouse
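
A minimal sketch of the pattern in Python, assuming a hypothetical orders.csv source file and using SQLite as a stand-in for the warehouse; only cleaned, transformed records ever reach the destination:

import csv
import sqlite3

def extract(path):
    # Extract: pull raw rows from the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and reshape in the staging step, rejecting bad records
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # data quality enforcement before loading
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount_usd": round(float(row["amount"]), 2),
            "country": row["country"].strip().upper(),
        })
    return cleaned

def load(rows, conn):
    # Load: write only the transformed records to the destination
    conn.executemany(
        "INSERT INTO orders (order_id, amount_usd, country) VALUES (?, ?, ?)",
        [(r["order_id"], r["amount_usd"], r["country"]) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount_usd REAL, country TEXT)")
load(transform(extract("orders.csv")), conn)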

When to Use ETL

  • Complex transformations - Business logic that’s easier to implement outside SQL
  • Data quality enforcement - Reject bad records before they reach the warehouse
  • Legacy systems - When destination systems can’t handle raw data transformations
  • Compliance requirements - When data must be masked or filtered before landing

ETL Limitations

  • Slower iteration - Changing transformation logic requires modifying the pipeline
  • Debugging difficulty - Raw data isn’t preserved, making issue investigation harder
  • Higher latency - Transformation adds processing time before data is available

ETL Tools

  • Enterprise: Informatica, Talend, IBM DataStage
  • Open source: Apache NiFi, Pentaho
  • Cloud: AWS Glue, Azure Data Factory

ELT (Extract, Load, Transform)

The modern standard for analytics workloads.

How ELT Works

  1. Extract - Pull data from source systems
  2. Load - Write raw data directly to the destination
  3. Transform - Process data inside the destination using SQL

Source → Extract → Load → Warehouse → Transform → Analytics
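
The same pipeline reworked as ELT, a minimal sketch again assuming a hypothetical orders.csv source and a SQLite stand-in for the warehouse: raw data lands first, then SQL inside the destination does the cleanup.

import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for a cloud warehouse

# Extract + Load: land the raw rows as-is, with no cleanup on the way in
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount TEXT, country TEXT)")
with open("orders.csv", newline="") as f:
    rows = [(r["order_id"], r["amount"], r["country"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: run SQL inside the destination (the kind of model dbt would manage)
conn.executescript("""
    DROP TABLE IF EXISTS orders_clean;
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)       AS order_id,
           ROUND(CAST(amount AS REAL), 2)  AS amount_usd,
           UPPER(TRIM(country))            AS country
    FROM raw_orders
    WHERE order_id IS NOT NULL AND order_id <> '';
""")
conn.commit()

Because raw_orders is preserved, the transformation can be rewritten and re-run at any time without going back to the source.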

When to Use ELT

  • Analytics use cases - Most reporting and dashboarding scenarios
  • Cloud warehouses - Snowflake, BigQuery, Databricks make in-warehouse transformation cheap
  • Iterative development - When transformation requirements evolve frequently
  • Data exploration - When analysts need access to raw data for discovery

ELT Advantages

  • Raw data preserved - Can always reprocess if transformation logic changes
  • Faster time-to-data - Data lands quickly; transformations can run on their own schedule
  • SQL-based - Transformations in familiar language, version-controlled with dbt
  • Scalable - Warehouse compute handles heavy transformations

ELT Tools

  • Ingestion: Fivetran, Airbyte, Stitch, Hevo
  • Transformation: dbt, Dataform, SQLMesh
  • Warehouses: Snowflake, BigQuery, Redshift, Databricks

Change Data Capture (CDC)

Captures changes as they happen rather than extracting full datasets.

How CDC Works

CDC monitors source database transaction logs and captures:

  • Inserts - New records added
  • Updates - Existing records modified
  • Deletes - Records removed

Changes stream to the destination incrementally, keeping it synchronized with the source.

Source DB → Transaction Log → CDC Tool → Destination
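
A minimal sketch of the apply step, assuming Debezium-style change events (op codes "c", "u", "d") that have already been read off the transaction log; a Python dict stands in for the destination table:

# Hypothetical Debezium-style change events: "c" = insert, "u" = update, "d" = delete
change_events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1}, "after": {"id": 1, "email": "a@new.example.com"}},
    {"op": "d", "before": {"id": 1}, "after": None},
]

destination = {}  # stand-in for the destination table, keyed by primary key

def apply_change(event, table):
    # Keep the destination synchronized by replaying each change incrementally
    if event["op"] in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row                  # upsert the new row image
    elif event["op"] == "d":
        table.pop(event["before"]["id"], None)  # remove deleted records

for event in change_events:
    apply_change(event, destination)

print(destination)  # {} -- the row was inserted, updated, then deleted

In practice the events would arrive from a CDC tool such as Debezium over Kafka rather than a hard-coded list, but the apply logic is the same.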

When to Use CDC

  • Near real-time requirements - Data needed within minutes, not hours
  • Large source tables - Full extraction would be too slow or resource-intensive
  • Operational analytics - Dashboards reflecting current state
  • Event-driven architectures - Triggering downstream processes on data changes

CDC Advantages

  • Low latency - Minutes instead of hours
  • Reduced source load - Reads logs instead of querying tables
  • Complete history - Captures every change, not just current state
  • Efficient - Only transfers changed data

CDC Challenges

  • Complexity - Requires understanding of database internals
  • Schema changes - DDL changes need careful handling
  • Initial load - First sync still requires full extraction
  • Log retention - Source must retain logs long enough for processing

CDC Tools

  • Open source: Debezium (Kafka-based)
  • Managed: Fivetran CDC, Airbyte CDC, Hevo CDC
  • Cloud: AWS DMS, Azure Data Factory CDC, GCP Datastream

API Integration

Pulls data from web APIs, essential for SaaS application data.

How API Integration Works

  1. Authentication - OAuth, API keys, or tokens
  2. Request - Call endpoints with parameters
  3. Pagination - Handle multi-page responses
  4. Rate limiting - Respect API throttling
  5. Transformation - Flatten nested JSON structures

API Endpoint → Auth → Request → Parse Response → Handle Pagination → Load
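
A minimal sketch in Python with requests, assuming a hypothetical endpoint with token auth, cursor-based pagination, and a Retry-After header on rate-limit responses; the URL and field names are illustrative:

import time
import requests

API_URL = "https://api.example.com/v1/contacts"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}  # token auth (assumption)

def fetch_all(url, headers):
    records, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor           # handle multi-page responses
        resp = requests.get(url, headers=headers, params=params)
        if resp.status_code == 429:             # respect rate limiting
            time.sleep(int(resp.headers.get("Retry-After", 30)))
            continue
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])      # response shape is an assumption
        cursor = payload.get("next_cursor")
        if not cursor:
            return records

def flatten(record):
    # Flatten nested JSON for analytics (field names are assumptions)
    return {
        "id": record["id"],
        "email": record.get("properties", {}).get("email"),
    }

rows = [flatten(r) for r in fetch_all(API_URL, HEADERS)]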

When to Use API Integration

  • SaaS applications - Salesforce, HubSpot, Stripe, Zendesk
  • Third-party data - Enrichment services, market data
  • No database access - When direct DB connections aren’t available
  • Webhooks - Event-driven data from external systems

API Integration Challenges

  • Rate limits - APIs restrict request frequency
  • API changes - Vendors modify endpoints and schemas
  • Pagination - Large datasets require multiple requests
  • Authentication complexity - OAuth flows, token refresh
  • Nested data - JSON structures need flattening for analytics

API Integration Approaches

Approach                             Pros                                    Cons
Managed tools (Fivetran, Airbyte)    Fast setup, maintained connectors       Less control, per-connector costs
Custom code                          Full control, no per-connector fees     Maintenance burden, error handling
iPaaS (Workato, Tray)                Visual building, pre-built connectors   Can get expensive, less flexibility

API Integration Tools

  • Managed: Fivetran, Airbyte, Stitch, Hevo
  • Custom: Python + requests, Singer taps
  • iPaaS: Workato, Tray.io, Zapier (simple cases)

Streaming Integration

Processes data continuously as it arrives.

How Streaming Works

Data flows through a message broker and is processed in real time:

Source → Producer → Message Broker → Consumer → Destination
         (events)    (Kafka/Kinesis)  (processing)
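
A minimal producer/consumer sketch, assuming the kafka-python client and a local Kafka broker; the topic name and event shape are illustrative:

import json
from kafka import KafkaProducer, KafkaConsumer  # assumes the kafka-python client

# Producer: emit events to the message broker as they occur
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "url": "/pricing"})
producer.flush()

# Consumer: process each event continuously as it arrives
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    event = message.value
    print(f"user {event['user_id']} viewed {event['url']}")  # write to the destination here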

When to Use Streaming

  • Sub-minute latency - Fraud detection, real-time pricing
  • Event-driven systems - Microservices communication
  • High-volume events - Clickstreams, IoT sensors, logs
  • Real-time dashboards - Operational monitoring

When NOT to Use Streaming

Most companies don’t need streaming for analytics:

  • Batch is simpler - Less infrastructure, easier debugging
  • Batch is cheaper - No always-on processing costs
  • Daily data is fine - Most business decisions don’t need real-time

Ask: “What decision requires data fresher than hourly?” If you can’t answer clearly, batch is probably sufficient.

Streaming Challenges

  • Complexity - Distributed systems, exactly-once semantics
  • Cost - Always-on infrastructure
  • Late arrivals - Handling out-of-order events
  • Schema evolution - Managing changes in event structures
  • Debugging - Harder to troubleshoot than batch

Streaming Tools

  • Message brokers: Apache Kafka, AWS Kinesis, Google Pub/Sub, Azure Event Hubs
  • Stream processing: Apache Flink, Spark Streaming, Kafka Streams, ksqlDB
  • Managed: Confluent Cloud, Amazon MSK, Redpanda

Comparing Integration Techniques

Technique    Latency    Complexity    Best For
ETL          Hours      Medium        Complex transformations, compliance
ELT          Hours      Low           Analytics, modern data stack
CDC          Minutes    High          Near real-time, large tables
API          Hours      Medium        SaaS data, third-party sources
Streaming    Seconds    High          Real-time requirements, events

Decision Framework

Start with ELT for most analytics use cases. It’s the simplest, most flexible approach.

Add CDC when:

  • Hourly latency isn’t fast enough
  • Source tables are too large for full extraction
  • You need change history, not just current state

Add Streaming only when:

  • Business genuinely requires sub-minute data
  • You have engineering capacity to maintain it
  • The ROI justifies the complexity

Use API Integration when:

  • Data lives in SaaS applications
  • No direct database access available
  • Prefer managed tools for standard connectors

Hybrid Approaches

Most mature organizations combine techniques:

Common Combinations

ELT + CDC:

  • ELT for SaaS sources (Fivetran → warehouse)
  • CDC for operational databases needing fresher data

ELT + Streaming:

  • ELT for analytical workloads
  • Streaming for operational use cases (alerts, real-time features)

API + CDC:

  • API integration for SaaS applications
  • CDC for internal databases

Architecture Example

SaaS Apps ──→ Fivetran (API) ──┐
                               ├──→ Snowflake ──→ dbt ──→ Analytics
Databases ──→ Debezium (CDC) ──┘

Events ──→ Kafka ──→ Flink ──→ Real-time Features

Getting Started

1. Assess Your Requirements

  • What latency do you actually need?
  • How large are your source datasets?
  • What technical capacity do you have?

2. Start Simple

Begin with managed ELT tools for most sources. Add complexity only when requirements demand it.

3. Build Incrementally

Don’t implement streaming because it seems modern. Implement it when batch latency genuinely blocks business value.