The Core Concept

Data integration is the process of combining data from different sources into a unified, consistent view. It’s how you turn disconnected systems into a coherent picture of your business.

If your CRM, billing system, and product database each have their own version of “customer,” data integration is what connects them. Without it, every team works with a partial view, reports contradict each other, and simple questions require manual data gathering.

Data integration isn’t optional anymore. It’s the foundation that makes analytics, reporting, and data-driven decisions possible.


Why Data Integration Matters

The Problem It Solves

Most companies have data scattered across:

  • Operational databases - PostgreSQL, MySQL, SQL Server
  • SaaS applications - Salesforce, HubSpot, Stripe, Zendesk
  • Files and spreadsheets - Excel exports, CSVs, Google Sheets
  • APIs and events - Webhooks, message queues, logs
  • Third-party data - Market data, enrichment services

Each system serves its purpose but knows nothing about the others. A customer in Salesforce has a different ID than the same customer in Stripe. Product names don’t match between inventory and e-commerce. Marketing attribution data lives in five different tools.

Data integration bridges these gaps.

What Changes With Good Integration

  • Single source of truth - One definition of customer, product, revenue
  • Self-service analytics - Analysts query unified data instead of gathering it from individual systems
  • Faster decisions - Questions answered in hours, not weeks
  • Trusted numbers - Executives see consistent metrics across reports
  • Automation - Systems can share data without manual intervention

Data Integration Techniques

Data integration techniques are the specific approaches you use to move and combine data. Each technique has trade-offs around latency, complexity, and cost.

Batch Integration

The most common technique. Data moves at scheduled intervals - hourly, daily, or weekly.

How it works:

  • Extract full datasets or incremental changes on a schedule
  • Load into staging, transform, then move to final tables
  • Transformations run after data lands

Best for:

  • Analytics and reporting (most use cases)
  • Non-time-sensitive operational data
  • Cost-sensitive high-volume scenarios

Example: Extract yesterday’s sales from the POS system every morning at 6 AM, transform to match warehouse schema, load into the analytics warehouse.
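
As a rough illustration, here is a minimal sketch of that morning job in Python, assuming pandas and SQLAlchemy; the connection URLs, table names, and column names are placeholders, not a prescribed setup.

  # Batch ETL sketch: pull yesterday's sales, reshape to the warehouse schema, append to staging.
  # Connection URLs, table names, and columns are placeholder assumptions.
  from datetime import date, timedelta

  import pandas as pd
  from sqlalchemy import create_engine

  pos = create_engine("postgresql://user:pass@pos-db:5432/pos")            # source POS system (placeholder)
  warehouse = create_engine("postgresql://user:pass@warehouse:5432/dwh")   # analytics warehouse (placeholder)

  day = (date.today() - timedelta(days=1)).isoformat()

  # Extract: incremental pull limited to yesterday's rows (a real job would use bound parameters).
  sales = pd.read_sql(f"SELECT order_id, store_id, sold_at, amount FROM sales WHERE sold_at::date = '{day}'", pos)

  # Transform: rename columns to match the warehouse schema.
  sales = sales.rename(columns={"sold_at": "sold_at_utc", "amount": "amount_usd"})

  # Load: append to a staging table; downstream models build on it from there.
  sales.to_sql("stg_sales", warehouse, if_exists="append", index=False)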

Real-Time (Streaming) Integration

Data flows continuously as events occur.

How it works:

  • Source systems emit events to a message queue (Kafka, Kinesis)
  • Stream processing reads, transforms, and routes events
  • Consumers receive data within seconds or milliseconds

Best for:

  • Fraud detection and alerting
  • Real-time dashboards with sub-minute latency requirements
  • Event-driven microservices

Trade-off: Real-time is 5-10x more complex and expensive than batch. Most companies don’t need it - before choosing streaming, ask: “What decision changes if data is 1 hour old vs 1 minute old?”
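
For a sense of the moving parts, here is a minimal consumer sketch using the kafka-python client; the broker address, topic name, and the toy routing rule are illustrative assumptions, not a recommended design.

  # Streaming sketch: read order events as they arrive and route suspicious ones for review.
  # Broker address, topic name, and the threshold rule are placeholder assumptions.
  import json

  from kafka import KafkaConsumer  # pip install kafka-python

  consumer = KafkaConsumer(
      "orders",                                 # hypothetical topic the source system emits to
      bootstrap_servers="localhost:9092",
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
      auto_offset_reset="latest",
  )

  for message in consumer:                      # blocks forever, yielding events within seconds of emission
      event = message.value
      if event.get("amount", 0) > 10_000:       # toy fraud rule for illustration
          print("flag for review:", event["order_id"])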

Micro-Batch Integration

A middle ground between batch and streaming.

How it works:

  • Run batch processes on very short intervals (every 5-15 minutes)
  • Simpler than true streaming
  • “Near real-time” without streaming complexity

Best for:

  • Dashboards that need frequent updates but not sub-second
  • When stakeholders say “real-time” but mean “not stale”
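
In code terms, micro-batch is just the batch job from earlier on a short timer; a minimal sketch, where run_incremental_load is a placeholder for that logic.

  # Micro-batch sketch: rerun a small incremental load every 10 minutes.
  import time

  def run_incremental_load():
      """Placeholder: extract rows changed since the last run, load them, advance the high-water mark."""

  while True:
      run_incremental_load()
      time.sleep(10 * 60)   # in practice a scheduler (cron, Airflow) replaces this loop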

Virtualization

Query data in place without moving it.

How it works:

  • A virtualization layer sits between consumers and sources
  • Queries are translated and routed to source systems
  • Data doesn’t move - only queries and results

Best for:

  • Exploration and discovery
  • When data can’t leave source systems (security/compliance)
  • Low-volume, ad-hoc queries

Trade-off: Performance depends on source systems. Not suitable for heavy analytical workloads or complex joins across sources.
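
As a concrete illustration, a federated engine such as Trino can join two sources in one query without copying data; the sketch below uses the trino Python client, and the host, catalog, schema, and table names are assumptions about how the sources were registered.

  # Virtualization sketch: a single query spans a Postgres catalog and a MySQL catalog.
  # Trino pushes work down to each source; only the query and results move.
  import trino  # pip install trino

  conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analyst")
  cur = conn.cursor()
  cur.execute("""
      SELECT c.customer_id, SUM(o.amount) AS lifetime_value
      FROM postgresql.public.customers AS c     -- catalog/schema names are illustrative
      JOIN mysql.shop.orders AS o ON o.customer_email = c.email
      GROUP BY c.customer_id
  """)
  for row in cur.fetchall():
      print(row)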


Data Integration Methods

Methods describe how you extract and move data. Choose based on your sources, latency needs, and scale.

ETL (Extract, Transform, Load)

The traditional approach:

  1. Extract data from source systems
  2. Transform it to match the target structure
  3. Load into the destination (usually a data warehouse)

When to use: When you need to reshape data significantly before loading, or when transformation logic is complex.

Drawbacks: Transformation happens before loading, so raw data isn’t preserved in the destination - changing the logic means re-extracting and reprocessing, which makes it harder to debug and iterate.

ELT (Extract, Load, Transform)

The modern approach:

  1. Extract data from source systems
  2. Load raw data into the warehouse
  3. Transform inside the warehouse

When to use: Most modern analytics use cases. Cloud warehouses make in-warehouse transformation cheap and fast.

Benefits: Raw data is preserved, transformations are versioned and testable, iteration is faster.
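
A minimal sketch of that split, assuming a SQLAlchemy-reachable warehouse; the export file, table names, and SQL are placeholders, and in practice the transform step typically lives in a versioned dbt model rather than inline SQL.

  # ELT sketch: land raw data untouched, then transform with SQL inside the warehouse.
  import pandas as pd
  from sqlalchemy import create_engine, text

  warehouse = create_engine("postgresql://user:pass@warehouse:5432/dwh")   # placeholder URL

  # Extract + Load: the raw export goes in as-is, so the original data is preserved.
  raw = pd.read_json("customers_export.json")                              # placeholder extract
  raw.to_sql("raw_customers", warehouse, if_exists="append", index=False)

  # Transform: runs inside the warehouse as plain SQL, so it is easy to version and test.
  with warehouse.begin() as conn:
      conn.execute(text("DROP TABLE IF EXISTS stg_customers"))
      conn.execute(text("""
          CREATE TABLE stg_customers AS
          SELECT DISTINCT id AS customer_id, LOWER(email) AS email
          FROM raw_customers
          WHERE email IS NOT NULL
      """))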

Change Data Capture (CDC)

Tracks changes in source databases and replicates them:

  • Captures inserts, updates, deletes as they happen
  • Near real-time data synchronization
  • Lower load on source systems than full extraction

When to use: When you need near-real-time data, or when source databases are large and full extraction is slow.

Tools: Debezium, Fivetran CDC, AWS DMS, Airbyte CDC
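
To make the flow concrete, the sketch below consumes Debezium-style change events from Kafka and applies them to a local copy; the topic name and key column are assumptions, while the op/before/after fields follow Debezium's documented change-event envelope.

  # CDC sketch: apply captured inserts, updates, and deletes to a replica.
  # Topic name and primary-key field are placeholders; "op", "before", and "after"
  # follow Debezium's change-event format.
  import json

  from kafka import KafkaConsumer  # pip install kafka-python

  consumer = KafkaConsumer(
      "dbserver1.public.customers",             # hypothetical Debezium topic for one source table
      bootstrap_servers="localhost:9092",
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")) if raw else None,
  )

  replica = {}                                   # stand-in for the destination table

  for message in consumer:
      if message.value is None:                  # tombstone record after a delete
          continue
      change = message.value["payload"]
      op = change["op"]                          # "c" insert, "u" update, "d" delete, "r" snapshot read
      if op in ("c", "u", "r"):
          row = change["after"]
          replica[row["id"]] = row
      elif op == "d":
          replica.pop(change["before"]["id"], None)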

API Integration

Pulls data from web APIs:

  • REST APIs, GraphQL, webhooks
  • Common for SaaS applications
  • Often rate-limited and paginated

When to use: When integrating with SaaS tools that don’t expose database access.

Challenges: Rate limits, API changes, authentication management, pagination handling.
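
A minimal sketch of a paginated, rate-limit-aware pull with requests; the endpoint, auth header, and response fields are hypothetical, since every SaaS API shapes these differently.

  # API extraction sketch: follow pagination cursors and back off when rate-limited.
  # The endpoint URL, auth header, and response shape are placeholder assumptions.
  import time

  import requests

  BASE_URL = "https://api.example.com/v1/customers"       # hypothetical SaaS endpoint
  HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

  def fetch_all():
      records, cursor = [], None
      while True:
          params = {"limit": 100, **({"cursor": cursor} if cursor else {})}
          resp = requests.get(BASE_URL, headers=HEADERS, params=params, timeout=30)
          if resp.status_code == 429:                      # rate limit hit: honor Retry-After, then retry
              time.sleep(int(resp.headers.get("Retry-After", "5")))
              continue
          resp.raise_for_status()
          page = resp.json()
          records.extend(page["data"])                     # assumed field holding the page of results
          cursor = page.get("next_cursor")                 # assumed pagination token
          if not cursor:
              return records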

Streaming Integration

Processes data continuously as it arrives:

  • Event-driven architectures
  • Real-time dashboards and alerting
  • Sub-second latency requirements

When to use: When business requirements genuinely need real-time data. Most companies don’t - batch is simpler and cheaper.

Tools: Kafka, Kinesis, Pub/Sub, Flink
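
On the producing side, the source system publishes each event as it happens; a minimal sketch with kafka-python, where the topic and event fields are placeholders (the consuming side is sketched in the Real-Time section above).

  # Event emission sketch: publish an order event at the moment it is created.
  # Broker address, topic name, and event fields are placeholder assumptions.
  import json
  from datetime import datetime, timezone

  from kafka import KafkaProducer  # pip install kafka-python

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
  )

  def on_order_created(order_id: str, amount: float) -> None:
      producer.send("orders", {
          "order_id": order_id,
          "amount": amount,
          "created_at": datetime.now(timezone.utc).isoformat(),
      })
      producer.flush()   # fine for a sketch; high-throughput producers batch instead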


Types of Data Integration

Understanding the different types of data integration helps you choose the right approach for each use case.

Application Integration

Connects applications so they can share data and functionality.

Examples:

  • CRM updates trigger marketing automation
  • E-commerce order creates shipping label
  • HR system provisions user accounts

Approach: APIs, webhooks, iPaaS tools (Zapier, Workato, Tray.io)
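
As a small illustration of the glue code, here is a Flask webhook receiver that forwards a hypothetical CRM "contact updated" event to a marketing tool; the route, payload fields, and downstream URL are invented for the example.

  # Application integration sketch: a CRM webhook triggers a call to another system.
  # The route, payload fields, and downstream endpoint are placeholder assumptions.
  import requests
  from flask import Flask, request

  app = Flask(__name__)

  @app.route("/webhooks/crm/contact-updated", methods=["POST"])
  def contact_updated():
      contact = request.get_json()
      # Forward the change to a hypothetical marketing automation API.
      requests.post(
          "https://marketing.example.com/api/contacts/sync",
          json={"email": contact["email"], "lifecycle_stage": contact.get("stage")},
          timeout=10,
      )
      return "", 204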

Data Consolidation

Combines data from multiple sources into a single repository.

Examples:

  • Sales data from multiple regions into one warehouse
  • Customer data from CRM, support, and billing into unified view
  • Financial data from subsidiaries into consolidated reporting

Approach: ETL/ELT pipelines, data warehousing

Data Propagation

Distributes data from one source to multiple destinations.

Examples:

  • Master product catalog pushed to all sales channels
  • Price updates propagated to all storefronts
  • Customer segments distributed to marketing tools

Approach: CDC, message queues, reverse ETL

Data Federation

Creates a virtual view across multiple sources without moving data.

Examples:

  • Query across on-premise and cloud databases
  • Unified view for compliance without data duplication
  • Ad-hoc exploration across systems

Approach: Data virtualization, federated query engines

Data Synchronization

Keeps data consistent across multiple systems.

Examples:

  • Inventory levels synced between warehouse and e-commerce
  • Contact updates reflected in both CRM and support tools
  • Bi-directional sync between operational and analytical systems

Approach: CDC, bi-directional replication, sync platforms

Data Model Integration

Reconciles different data models and schemas from multiple sources into a unified structure.

Examples:

  • Mapping customer fields from CRM, ERP, and support systems into one dimensional model
  • Standardizing product hierarchies across different regional systems
  • Aligning different charts of accounts from acquired companies

Approach: Schema mapping, data transformation, canonical data models, master data management

Data model integration is often the hardest part of data integration - not moving the data, but making different representations of the same concepts work together. When one system calls it “customer_id” and another calls it “account_number,” you need data model integration to unify them.
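
One lightweight way to start is an explicit mapping from each source schema onto a canonical model; the source systems and field names below are invented for illustration.

  # Data model integration sketch: map source-specific fields onto one canonical customer shape.
  # The source schemas and field names are placeholder assumptions.
  FIELD_MAP = {
      "crm":     {"customer_id": "customer_id",    "email": "email",           "name": "full_name"},
      "billing": {"customer_id": "account_number", "email": "billing_email",   "name": "account_name"},
      "support": {"customer_id": "requester_id",   "email": "requester_email", "name": "requester_name"},
  }

  def to_canonical(source: str, record: dict) -> dict:
      """Translate one raw record from a given source into the canonical customer shape."""
      return {canonical: record.get(source_field)
              for canonical, source_field in FIELD_MAP[source].items()}

  # The same customer, expressed differently by billing, lands in the unified shape:
  to_canonical("billing", {"account_number": "A-1042", "billing_email": "kim@example.com", "account_name": "Kim Lee"})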


Data Integration Patterns

Data Warehouse Pattern

Centralize everything in one warehouse:

Sources → Ingestion → Warehouse → Transformation → Consumption
  • All data lands in Snowflake, BigQuery, or similar
  • Transformations happen in SQL (often dbt)
  • Single query engine for all analytics

Best for: Analytics-focused organizations, modern data stack adopters.

Data Lake Pattern

Store raw data in object storage, process on demand:

Sources → Object Storage → Processing Engine → Consumption
  • Data lands in S3, GCS, or Azure Blob
  • Spark, Presto, or similar engines process data
  • More flexibility for unstructured data

Best for: Data science workloads, large unstructured datasets, cost-sensitive high-volume scenarios.

Hub and Spoke Pattern

Central integration layer connecting sources to multiple destinations:

           ┌─→ Warehouse
Sources → Hub ─→ Operational DB
           └─→ ML Platform
  • One integration platform routes to multiple consumers
  • Decouples sources from destinations
  • Easier to add new destinations

Best for: Organizations with diverse consumption patterns.

Point-to-Point Pattern

Direct connections between systems:

Source A ─→ Destination A
Source B ─→ Destination B
  • Simple for single integrations
  • Becomes unmaintainable at scale
  • Often how integration starts, rarely how it should stay

Best for: Quick wins, proof of concepts. Migrate to centralized patterns as you scale.


Integration Challenges

Schema Evolution

Sources change without warning:

  • New columns appear
  • Column types change
  • Fields get renamed or removed

Solution: Schema drift detection, alerting, and handling strategies. Tools like Fivetran handle this automatically; custom integrations need explicit handling.
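
In a custom pipeline, the minimal version of that handling is comparing the columns that arrive against the columns you expect; the expected set below is a placeholder contract.

  # Schema drift sketch: compare incoming columns to the expected contract and fail loudly on changes.
  EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}   # placeholder contract

  def check_schema(incoming_columns) -> None:
      incoming = set(incoming_columns)
      added, removed = incoming - EXPECTED_COLUMNS, EXPECTED_COLUMNS - incoming
      if added or removed:
          # In production this would alert or quarantine the load rather than just raise.
          raise ValueError(f"Schema drift: added={sorted(added)}, removed={sorted(removed)}")

  check_schema(["order_id", "customer_id", "amount", "created_at", "discount_code"])   # raises: new column appeared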

Data Quality

Integration surfaces data quality problems:

  • Null values where data should exist
  • Duplicates from multiple systems
  • Inconsistent formats (dates, currencies, identifiers)

Solution: Quality checks at ingestion, monitoring, and clear ownership of source data quality.
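
A minimal sketch of ingestion-time checks with pandas; the column names are placeholders, and frameworks like dbt tests or Great Expectations express the same checks declaratively.

  # Data quality sketch: null, duplicate, and format checks before data moves downstream.
  # Column names and the expected date format are placeholder assumptions.
  import pandas as pd

  def run_quality_checks(df: pd.DataFrame) -> list:
      problems = []
      if df["customer_id"].isna().any():
          problems.append("null customer_id values")
      if df.duplicated(subset=["order_id"]).any():
          problems.append("duplicate order_id rows")
      if pd.to_datetime(df["order_date"], errors="coerce").isna().any():
          problems.append("unparseable order_date values")
      return problems

  run_quality_checks(pd.DataFrame({
      "order_id": [1, 1],
      "customer_id": [10, None],
      "order_date": ["2024-01-05", "not-a-date"],
  }))
  # -> ['null customer_id values', 'duplicate order_id rows', 'unparseable order_date values']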

Identity Resolution

The same entity has different identifiers across systems:

  • Customer email in CRM, customer ID in billing, anonymous ID in analytics
  • Product SKU vs. product name vs. internal ID

Solution: Master data management, matching algorithms, or maintaining mapping tables.
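
A minimal sketch of the mapping-table approach, matching on a normalized shared attribute (email here); real-world matching usually needs fuzzier rules, and the frames and column names are invented.

  # Identity resolution sketch: build a mapping table keyed on normalized email.
  # Source data and column names are placeholder assumptions.
  import pandas as pd

  crm = pd.DataFrame({"crm_id": ["C1", "C2"], "email": ["Kim@Example.com", "lee@example.com"]})
  billing = pd.DataFrame({"billing_id": ["B9", "B7"], "email": ["kim@example.com ", "lee@example.com"]})

  def match_key(emails: pd.Series) -> pd.Series:
      return emails.str.strip().str.lower()

  crm["match_key"] = match_key(crm["email"])
  billing["match_key"] = match_key(billing["email"])

  # The resulting mapping table links each CRM identity to its billing identity.
  id_map = crm[["crm_id", "match_key"]].merge(
      billing[["billing_id", "match_key"]], on="match_key", how="outer"
  )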

Historical Data

Integration often starts with “current state” but analytics needs history:

  • What was the customer status last month?
  • When did this product price change?

Solution: Slowly changing dimensions (SCD), snapshot tables, or event sourcing patterns.
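
The simplest of those options is a snapshot table: append the current state with a snapshot date, and "what was the status last month" becomes a filter. A minimal sketch, with placeholder table names and connection details.

  # Snapshot sketch: capture today's customer state so history stays queryable.
  # Table names and the connection URL are placeholder assumptions.
  from datetime import date

  import pandas as pd
  from sqlalchemy import create_engine

  warehouse = create_engine("postgresql://user:pass@warehouse:5432/dwh")

  current = pd.read_sql("SELECT customer_id, status, plan FROM dim_customers", warehouse)
  current["snapshot_date"] = date.today()

  # Appended daily, this turns "current state" into history:
  #   SELECT status FROM dim_customers_snapshots
  #   WHERE customer_id = 42 AND snapshot_date = DATE '2024-05-01'
  current.to_sql("dim_customers_snapshots", warehouse, if_exists="append", index=False)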

Performance and Cost

Integration at scale is expensive:

  • API rate limits slow extraction
  • Full table scans stress source databases
  • Storage costs grow with data volume

Solution: Incremental extraction, CDC instead of full loads, tiered storage, archival policies.


Build vs Buy

Buy (Managed Tools)

Tools: Fivetran, Airbyte, Stitch, Hevo

Pros:

  • Fast to implement (days, not months)
  • Handles schema changes, retries, monitoring
  • Connectors maintained by vendor
  • Predictable pricing

Cons:

  • Less control over extraction logic
  • Can get expensive at high volume
  • Some sources not supported

When to choose: Standard SaaS sources, need to move fast, don’t have dedicated integration engineers.

Build (Custom Pipelines)

Tools: Airflow + custom code, Dagster, Prefect, Meltano

Pros:

  • Full control over logic
  • Can optimize for specific needs
  • No per-row or per-connector fees

Cons:

  • Significant engineering investment
  • Maintenance burden for API changes
  • Need to build monitoring, alerting, retry logic

When to choose: Custom sources, complex transformation requirements, high volume where managed tool costs don’t make sense.

Hybrid Approach

Most mature organizations use both:

  • Managed tools for standard SaaS connectors
  • Custom pipelines for proprietary databases or complex logic

Integration and Data Architecture

Data integration is a component of your broader data architecture. It’s the “how data moves” part of the blueprint.

Good integration supports:

  • A single source of truth for core entities like customer, product, and revenue
  • Self-service analytics and reporting on unified, trusted data
  • Automation between systems without manual intervention

Poor integration undermines everything downstream. If the foundation is shaky, nothing built on top will be reliable.


Getting Started

1. Inventory Your Sources

List every system that has data you need:

  • What data does it contain?
  • How do you access it? (database, API, export)
  • How often does it change?
  • Who owns it?

2. Define Your Destination

Where will integrated data live?

  • Cloud warehouse (Snowflake, BigQuery) for most analytics use cases
  • Data lake for unstructured or high-volume scenarios
  • Operational database for real-time application needs

3. Start With High-Value Sources

Don’t integrate everything at once. Pick 3-5 sources that:

  • Drive the most business decisions
  • Are requested most often
  • Have the clearest ownership

4. Choose Your Tools

For most companies starting out:

  • Managed ingestion (Fivetran, Airbyte) for SaaS sources
  • dbt for transformation
  • Cloud warehouse (Snowflake, BigQuery) for storage

5. Build Quality In

From day one:

  • Monitor freshness (did data arrive on time?)
  • Check row counts (did we get what we expected?)
  • Validate schemas (did structure change?)
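
A minimal sketch of those three checks run against one warehouse table; the table name, thresholds, and expected columns are placeholder assumptions, and dedicated tooling (dbt source freshness, monitoring platforms) covers the same ground with less code.

  # Day-one monitoring sketch: freshness, volume, and schema checks on a single table.
  # Table name, thresholds, and expected columns are placeholder assumptions.
  import pandas as pd
  from sqlalchemy import create_engine

  warehouse = create_engine("postgresql://user:pass@warehouse:5432/dwh")

  stats = pd.read_sql(
      """
      SELECT COUNT(*) AS row_count,
             EXTRACT(EPOCH FROM (NOW() - MAX(loaded_at))) AS staleness_seconds
      FROM stg_sales
      """,
      warehouse,
  ).iloc[0]

  assert stats["row_count"] > 1000, "fewer rows than expected - the upstream extract may have failed"
  assert stats["staleness_seconds"] < 6 * 3600, "data is stale - nothing loaded in the last 6 hours"

  columns = set(pd.read_sql("SELECT * FROM stg_sales LIMIT 0", warehouse).columns)
  assert {"order_id", "sold_at_utc", "amount_usd"} <= columns, "expected columns are missing - schema changed"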

Frequently Asked Questions

What is data integration?
Data integration is the process of combining data from different sources into a unified, consistent view. It connects disparate systems like CRMs, databases, and SaaS tools so that teams can work with a complete picture rather than fragmented data.

What are the main data integration techniques?
The main data integration techniques are batch integration (scheduled data movement), real-time/streaming integration (continuous data flow), micro-batch (frequent small batches), and virtualization (querying data in place). Batch is the most common; streaming adds complexity and cost.

What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data before loading it into the destination. ELT (Extract, Load, Transform) loads raw data first, then transforms it in the destination. ELT is the modern standard because cloud warehouses make in-warehouse transformation cheap and fast.

What are the types of data integration?
The main types are application integration (connecting apps via APIs), data consolidation (combining into one repository), data propagation (distributing to multiple destinations), data federation (virtual views without moving data), data synchronization (keeping systems in sync), and data model integration (reconciling different schemas into a unified structure).

What is change data capture (CDC)?
Change data capture tracks changes in source databases (inserts, updates, deletes) and replicates them to the destination. It provides near-real-time synchronization with lower source system load than full extraction.

What are common data integration patterns?
Common patterns include data warehouse (centralized analytics), data lake (raw storage with processing), hub-and-spoke (central routing to multiple destinations), and point-to-point (direct connections). Most organizations start with point-to-point and evolve to centralized patterns.

Should I build or buy data integration tools?
Most companies should buy managed tools like Fivetran or Airbyte for standard SaaS connectors - they’re faster to implement and handle maintenance. Build custom pipelines only for proprietary sources, complex logic, or when managed tool costs don’t scale.

What are common data integration challenges?
Common challenges include schema evolution (sources changing without warning), data quality issues surfaced during integration, identity resolution (matching entities across systems), and managing historical data. Good integration practices address these proactively.