The Databricks DBU Trap for ETL

Your Databricks bill is mostly DBUs (Databricks Units, the metered compute you pay for). A lot of those DBUs run ETL that never needed Spark in the first place.

Databricks is genuinely good at what it’s built for: large-scale distributed processing, ML workloads, data that doesn’t fit on one machine. The cost model assumes that’s the work you’re doing.

Most SME ETL is smaller and simpler: a few million rows a night, a handful of tables joined, landed in a warehouse. Spark spins up a cluster to do work a single SQL query could handle.

Where the DBUs quietly add up:

Clusters sized for peak, running jobs that never hit it.
All-purpose clusters left on for “interactive” work that’s mostly idle.
Job clusters with slow startup, so people keep clusters warm to skip the wait.
Simple transforms written in PySpark because the platform’s there, billed in DBUs like any other cluster job.
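
To see how the items above turn into money, it helps to price a cluster by node count, DBUs per node-hour, and hours kept up. The sketch below does that arithmetic. Every number in it (cluster size, DBUs per node-hour, price per DBU, hours) is a made-up placeholder, not a Databricks list price, and it leaves out the cloud VM charges you pay on top. Swap in the figures from your own contract and usage reports.

```python
def monthly_dbu_cost(nodes, dbu_per_node_hour, hours_per_day, price_per_dbu, days=30):
    """Estimated monthly DBU spend for one cluster (cloud VM cost not included)."""
    dbus = nodes * dbu_per_node_hour * hours_per_day * days
    return dbus * price_per_dbu

# Hypothetical: a 4-node all-purpose cluster kept warm 10 hours a day
warm = monthly_dbu_cost(nodes=4, dbu_per_node_hour=1.0, hours_per_day=10, price_per_dbu=0.50)

# Hypothetical: the same size cluster as a job cluster, 1 hour a night, at a lower assumed rate
nightly = monthly_dbu_cost(nodes=4, dbu_per_node_hour=1.0, hours_per_day=1, price_per_dbu=0.20)

print(f"warm all-purpose: ${warm:,.2f} / month")
print(f"nightly job:      ${nightly:,.2f} / month")
```

The placeholder prices matter less than the shape: hours kept up is a straight multiplier, so a cluster that sits idle most of the day dominates the total.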

The comparison that matters for a simple nightly pipeline:

Spark / DBUs: handles anything, and you pay for a distributed engine whether or not the job needs one.
Warehouse-native SQL (dbt on Snowflake, BigQuery, Postgres): cheaper for set-based transforms, simpler to run, hits a real ceiling once data outgrows the warehouse.
A small managed ELT tool for ingest plus SQL for the rest: least engineering, fine until volume or complexity demands more.

The honest rule I use: if the job fits comfortably in your warehouse and runs on a schedule, do it there.
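
For a sense of what “do it there” looks like, here is a minimal sketch of a join-and-aggregate job written as one SQL statement. It uses Python’s built-in sqlite3 with an in-memory database purely as a stand-in for your warehouse, and the table and column names (orders, customers, revenue_by_region) are hypothetical. In practice the same SELECT would live in a dbt model or a scheduled query.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; tables and rows are made up
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 20.0), (2, 10, 5.0), (3, 11, 7.5);
    INSERT INTO customers VALUES (10, 'north'), (11, 'south');
""")

# The whole transform: join, aggregate, land the result in a reporting table
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
""")

result = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(result)
```

No cluster, no startup wait, and the scheduler only has to run one statement.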

Keep Databricks for what it’s good at. Then look at what your DBUs are actually running, and move the simple, high-frequency jobs to the cheapest thing that does them well.

Pull your top five DBU-consuming jobs. How many are just moving and joining tables?
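
One way to do that pull, sketched under assumptions: export per-run usage from your billing data (for example the system.billing.usage table, if system tables are enabled in your workspace) into rows with a job identifier and a DBU quantity, then sum and rank. The row shape, the field names job_id and dbus, and the sample values below are all hypothetical.

```python
from collections import defaultdict

def top_dbu_jobs(usage_rows, n=5):
    """Sum DBUs per job and return the n largest as (job_id, total_dbus) pairs."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row["job_id"]] += row["dbus"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical export: one row per job run
rows = [
    {"job_id": "nightly_orders_join", "dbus": 40.0},
    {"job_id": "nightly_orders_join", "dbus": 45.0},
    {"job_id": "ml_feature_build", "dbus": 60.0},
    {"job_id": "copy_crm_tables", "dbus": 30.0},
    {"job_id": "copy_crm_tables", "dbus": 35.0},
]

for job_id, dbus in top_dbu_jobs(rows):
    print(f"{job_id}: {dbus:.1f} DBUs")
```

Once you have the ranked list, the question from above applies to each name on it: is this job doing anything a scheduled SQL statement couldn’t?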

Recognise the problem? Let’s talk about it.