The Essentials

Data quality measures how well data serves its intended purpose. High-quality data is accurate, complete, consistent, timely, and trustworthy. Poor-quality data leads to bad decisions, wasted effort, and eroded trust.

Quality isn’t binary - data can be perfect for one use case and useless for another. A dataset with 95% accuracy might be fine for marketing analysis but unacceptable for financial reporting.

The goal isn’t perfect data. It’s data that’s good enough for how you’re using it, with visibility into its limitations.


Why Data Quality Matters

Decision Quality

Bad data leads to bad decisions. If your customer churn numbers are wrong, your retention strategy will be wrong. If revenue figures don’t reconcile, forecasts are unreliable. Garbage in, garbage out - at business scale.

Trust

Once stakeholders lose trust in data, they stop using it. They go back to gut feel, spreadsheets, and asking around. Rebuilding trust takes longer than building it initially.

Efficiency

Poor quality creates rework. Analysts spend time cleaning data instead of analyzing it. Engineers debug data issues instead of building features. Support handles complaints about incorrect reports.

Compliance

Regulations require accurate data. GDPR mandates accurate personal data. SOX requires reliable financial reporting. Healthcare regulations demand accurate patient records. Quality failures become compliance failures.

Cost

Bad data has direct costs: incorrect invoices, missed opportunities, regulatory fines. It has indirect costs too: time spent fixing problems, delayed projects, lost credibility.


Data Quality Dimensions

Accuracy

Does data reflect reality?

  • Customer addresses that actually exist
  • Order amounts that match what was charged
  • Inventory counts that match physical stock

Accuracy is the most fundamental dimension. Inaccurate data is wrong data.

Completeness

Is all required data present?

  • No missing values where data should exist
  • All expected records captured
  • Required fields populated

Completeness means having the data you need, not necessarily having all possible data.

Consistency

Does data agree across systems and over time?

  • Same customer, same email address everywhere
  • Totals that match their components
  • Historical data that doesn’t change unexpectedly

Inconsistency creates confusion and undermines trust.

Timeliness

Is data available when needed?

  • Real-time dashboards actually updated in real time
  • Daily reports available at start of business
  • Alerts triggered before issues escalate

Stale data is often useless data.

Validity

Does data conform to defined formats and rules?

  • Dates in expected formats
  • Values within expected ranges
  • References that point to existing records

Invalid data breaks downstream processing and analysis.

Uniqueness

Is each entity represented once?

  • No duplicate customer records
  • No repeated transactions
  • Clear identity resolution

Duplicates inflate metrics and complicate analysis.


Measuring Data Quality

Quality Metrics

Quantify quality across dimensions:

Accuracy Rate = Correct records / Total records

Completeness Rate = Non-null values / Expected values

Consistency Rate = Matching records across systems / Total records

Timeliness = Time from event to data availability

Validity Rate = Valid records / Total records

Uniqueness Rate = Unique records / Total records

Quality Score

Combine metrics into an overall quality score:

Quality Score = (w1 × Accuracy + w2 × Completeness + w3 × Consistency + ...) / Total Weight

Weight dimensions based on importance for your use case. Financial data might weight accuracy heavily. Marketing data might prioritize completeness.
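
As a minimal sketch, the score is just a weighted average of per-dimension rates. The rates, weights, and the quality_score helper below are illustrative assumptions, not a fixed formula:

# Minimal sketch of a weighted quality score; rates and weights are illustrative.
def quality_score(rates, weights):
    """Weighted average of per-dimension rates, each between 0 and 1."""
    total_weight = sum(weights.values())
    return sum(rates[dim] * weights[dim] for dim in weights) / total_weight

# Example: financial data weights accuracy heavily.
rates = {"accuracy": 0.98, "completeness": 0.95, "consistency": 0.97}
weights = {"accuracy": 0.5, "completeness": 0.25, "consistency": 0.25}
print(f"Quality score: {quality_score(rates, weights):.3f}")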

Quality SLAs

Define acceptable quality levels:

  • “Customer email addresses must be 99% valid”
  • “Daily revenue figures must be available by 6 AM”
  • “Order data must match source systems within 0.1%”

SLAs make quality expectations explicit and measurable.
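
One way to keep SLAs measurable is to evaluate each one as an explicit threshold check. A minimal sketch, where the metric names, measured values, and thresholds are assumed for illustration:

# Sketch: evaluate quality SLAs as explicit threshold checks.
slas = [
    # (metric, measured value, threshold, direction)
    ("valid_email_rate", 0.992, 0.99, "min"),
    ("daily_revenue_delay_hours", 5.5, 6.0, "max"),
    ("order_total_mismatch_pct", 0.04, 0.1, "max"),
]

for metric, measured, threshold, direction in slas:
    ok = measured >= threshold if direction == "min" else measured <= threshold
    print(f"{metric}: {'PASS' if ok else 'FAIL'} (measured={measured}, threshold={threshold})")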


Common Quality Problems

Missing Data

Symptoms: Null values, empty fields, incomplete records

Causes:

  • Optional fields not enforced at source
  • Integration failures losing records
  • Schema mismatches dropping columns

Solutions:

  • Add not-null constraints where appropriate
  • Monitor for completeness drops
  • Validate record counts across systems
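
A simple way to monitor for completeness drops is to track per-column null rates and alert when a required column exceeds an agreed threshold. A sketch using pandas, with an illustrative orders extract and an assumed 1% threshold:

import pandas as pd

# Stand-in for a real extract of the orders table
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, None, 12, 13],
    "total_amount": [50.0, 20.0, None, 75.0],
})

# Fraction of nulls per column; a sudden jump signals a completeness drop
null_rates = orders.isna().mean()

# Alert when a required column exceeds the agreed threshold (assumed 1% here)
required = ["customer_id", "total_amount"]
breaches = null_rates[required][null_rates[required] > 0.01]
if not breaches.empty:
    print("Completeness breach:", breaches.to_dict())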

Duplicate Data

Symptoms: Inflated counts, multiple records for same entity

Causes:

  • Multiple data sources without deduplication
  • Retry logic creating duplicate inserts
  • Lack of unique constraints

Solutions:

  • Implement deduplication logic
  • Use idempotent processing
  • Add unique constraints and indexes
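
A common deduplication pattern keeps one record per business key, preferring the most recent version. A pandas sketch, assuming order_id is the key and updated_at marks recency:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "total_amount": [50.0, 55.0, 20.0, 75.0],
    "updated_at": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:00",
        "2024-01-02 09:00", "2024-01-03 08:00",
    ]),
})

# Keep the most recent record per order_id, dropping earlier duplicate inserts
deduped = (
    orders.sort_values("updated_at")
          .drop_duplicates(subset="order_id", keep="last")
)
print(deduped)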

Inconsistent Data

Symptoms: Same entity with different values across systems

Causes:

  • No master data management
  • Unsynchronized updates
  • Different business rules in different systems

Solutions:

  • Establish single source of truth
  • Implement data synchronization
  • Document and reconcile business rules
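
Reconciliation can start as a join between two systems on a shared key, flagging fields that disagree. A sketch comparing customer emails between a hypothetical CRM and billing extract:

import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3], "email": ["a@x.com", "b@x.com", "c@x.com"]})
billing = pd.DataFrame({"customer_id": [1, 2, 3], "email": ["a@x.com", "b@y.com", "c@x.com"]})

# Join on the shared key and keep rows where the two systems disagree
merged = crm.merge(billing, on="customer_id", suffixes=("_crm", "_billing"))
mismatches = merged[merged["email_crm"] != merged["email_billing"]]
print(mismatches)  # rows to reconcile against the source of truth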

Stale Data

Symptoms: Reports showing old information, delayed alerts

Causes:

  • Pipeline delays or failures
  • Batch processing too infrequent
  • Bottlenecks in data flow

Solutions:

  • Monitor data freshness
  • Alert on pipeline delays
  • Optimize or parallelize processing
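
A freshness monitor only needs the newest event timestamp and an agreed maximum delay. A sketch assuming a one-hour window for the orders table:

from datetime import datetime, timedelta, timezone

MAX_DELAY = timedelta(hours=1)  # assumed freshness window for the orders table

def check_freshness(latest_updated_at):
    """Alert when the newest record is older than the agreed window."""
    age = datetime.now(timezone.utc) - latest_updated_at
    if age > MAX_DELAY:
        print(f"STALE: last update was {age} ago, exceeds {MAX_DELAY}")
    else:
        print(f"OK: last update was {age} ago")

# latest_updated_at would come from e.g. SELECT MAX(updated_at) FROM orders
check_freshness(datetime.now(timezone.utc) - timedelta(minutes=30))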

Invalid Data

Symptoms: Processing failures, incorrect calculations

Causes:

  • No input validation
  • Schema evolution without handling
  • Edge cases not anticipated

Solutions:

  • Validate data at ingestion
  • Handle schema changes gracefully
  • Test with edge cases
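
Validation at ingestion can start with a small set of explicit rules applied to each incoming record. A sketch, where the field names and rules are illustrative stand-ins for a real data contract:

from datetime import datetime

def validate_order(record):
    """Return a list of rule violations for one incoming record."""
    errors = []
    if record.get("order_id") is None:
        errors.append("order_id missing")
    amount = record.get("total_amount")
    if amount is None or amount <= 0:
        errors.append("total_amount must be positive")
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")
    except (TypeError, ValueError):
        errors.append("order_date not in YYYY-MM-DD format")
    return errors

record = {"order_id": 42, "total_amount": -5, "order_date": "2024/01/01"}
print(validate_order(record))  # ['total_amount must be positive', 'order_date not in YYYY-MM-DD format']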

Data Quality Strategy

Prevention

Stop quality problems before they happen:

  • Source validation - Reject bad data at the entry point
  • Schema enforcement - Define and enforce data contracts
  • Input constraints - Limit what values are acceptable

Prevention is cheaper than remediation.
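
Schema enforcement can be as simple as checking each incoming batch against the expected columns and types before loading it. A sketch, where EXPECTED_SCHEMA is an assumed contract for the orders feed:

import pandas as pd

# Assumed contract: expected columns and types for the orders feed
EXPECTED_SCHEMA = {"order_id": "int64", "customer_id": "int64", "total_amount": "float64"}

def enforce_contract(batch):
    missing = set(EXPECTED_SCHEMA) - set(batch.columns)
    if missing:
        raise ValueError(f"Rejected batch: missing columns {missing}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if str(batch[col].dtype) != dtype:
            raise ValueError(f"Rejected batch: {col} is {batch[col].dtype}, expected {dtype}")

batch = pd.DataFrame({"order_id": [1], "customer_id": [10], "total_amount": [50.0]})
enforce_contract(batch)  # passes; a nonconforming batch is rejected before loading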

Detection

Find problems quickly when they occur:

  • Automated testing - Run quality checks on every pipeline run
  • Anomaly detection - Alert on unexpected patterns
  • Monitoring dashboards - Visibility into quality metrics

Early detection limits damage.
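
A lightweight starting point for anomaly detection is flagging a metric that falls far outside its recent history, for example more than three standard deviations from the mean. A sketch with illustrative daily row counts:

import statistics

# Illustrative daily row counts for the last week, plus today's count
history = [10120, 10250, 9980, 10300, 10150, 10210, 10090]
today = 6200

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z = (today - mean) / stdev

if abs(z) > 3:
    print(f"ANOMALY: today's count {today} is {z:.1f} standard deviations from the recent mean")
else:
    print("Row count within expected range")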

Correction

Fix problems when found:

  • Automated remediation - Fix known issues automatically
  • Manual review workflows - Route exceptions to humans
  • Backfill processes - Correct historical data when needed

Have a process before you need it.
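
One practical correction pattern splits incoming records into a clean set and a quarantine set, so known-bad rows can be reviewed or backfilled without blocking the pipeline. A sketch with an assumed positive-amount rule:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "total_amount": [50.0, -10.0, 75.0],
})

# Route rows that fail the rule to a quarantine set for review or later backfill
valid_mask = orders["total_amount"] > 0
clean, quarantine = orders[valid_mask], orders[~valid_mask]

print(f"{len(clean)} clean rows, {len(quarantine)} quarantined rows")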

Communication

Keep stakeholders informed:

  • Quality reports - Regular visibility into data health
  • Incident communication - Alert consumers to known issues
  • Documentation - Clear understanding of data limitations

Transparency builds trust even when problems occur.


Quality Testing

Schema Tests

Verify data structure:

-- Column exists and has correct type
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name = 'customer_id'

Null Tests

Check for unexpected nulls:

-- No null values in required field
SELECT COUNT(*)
FROM orders
WHERE customer_id IS NULL

Uniqueness Tests

Verify no duplicates:

-- Primary key is unique
SELECT order_id, COUNT(*)
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1

Referential Integrity Tests

Check foreign key relationships:

-- All orders have valid customers
SELECT COUNT(*)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL

Range Tests

Verify values within bounds:

-- Order amounts are positive
SELECT COUNT(*)
FROM orders
WHERE total_amount <= 0

Freshness Tests

Check data is current:

-- Data updated within expected window (a returned row means stale data)
SELECT MAX(updated_at)
FROM orders
HAVING MAX(updated_at) < CURRENT_TIMESTAMP - INTERVAL '1 hour'

Tools for Data Quality

dbt Tests

Built-in testing for dbt projects:

# schema.yml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0

Great Expectations

Python-based data validation:

import great_expectations as gx

context = gx.get_context()
# batch_request is assumed to be defined elsewhere; it identifies the orders data to validate
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)

validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("total_amount", min_value=0)
validator.expect_column_values_to_be_unique("order_id")

Soda

SQL-based quality checks:

# checks.yml
checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
  - min(total_amount) >= 0
  - freshness(created_at) < 1h

Monte Carlo, Anomalo, Bigeye

Commercial data observability platforms that provide:

  • Automated anomaly detection
  • Quality monitoring dashboards
  • Incident management
  • Root cause analysis

Quality and Governance

Data quality is a core pillar of data governance:

  • Ownership - Someone must be accountable for each dataset’s quality
  • Standards - Define what “good quality” means for your organization
  • Processes - How quality issues are identified and resolved
  • Metrics - How quality is measured and reported

Quality without governance is reactive firefighting. Governance without quality metrics is paperwork.


Getting Started

1. Define Critical Data Elements

Not all data needs the same quality attention. Identify the data that matters most - what drives key decisions, what has compliance requirements, what powers customer-facing products.

2. Establish Baselines

Measure current quality before trying to improve it. You can’t demonstrate progress without knowing where you started.

3. Implement Basic Tests

Start with simple tests on critical data:

  • Not null on required fields
  • Unique constraints on identifiers
  • Freshness on time-sensitive data

4. Create Visibility

Build a quality dashboard. Make quality metrics visible to data producers and consumers.

5. Assign Ownership

Every dataset needs an owner accountable for quality. Quality without ownership is everyone’s problem and no one’s priority.


Frequently Asked Questions

What is data quality?
Data quality measures how well data serves its intended purpose. It’s assessed across dimensions including accuracy (correctness), completeness (no missing values), consistency (agreement across systems), timeliness (freshness), validity (conforming to rules), and uniqueness (no duplicates).

Why is data quality important?
Poor data quality leads to bad decisions, eroded trust, wasted effort, compliance failures, and direct costs. Once stakeholders lose trust in data, they stop using it - and rebuilding trust takes longer than building it initially.

How do you measure data quality?
Measure quality across dimensions: accuracy rate (correct/total records), completeness rate (non-null/expected values), consistency rate (matching records across systems), timeliness (event to availability time), validity rate, and uniqueness rate. Combine into a weighted quality score based on your priorities.

What tools help with data quality?
dbt provides built-in testing. Great Expectations offers Python-based validation. Soda provides SQL-based checks. Commercial platforms like Monte Carlo, Anomalo, and Bigeye offer automated anomaly detection and quality monitoring dashboards.

How do I improve data quality?
Focus on prevention (validate at source), detection (automated testing and monitoring), correction (remediation processes), and communication (transparency with stakeholders). Start with critical data elements, establish baselines, implement basic tests, and assign clear ownership.