Your Databricks bill is mostly DBUs. A lot of those DBUs pay for ETL that never needed Spark in the first place.
Databricks is genuinely good at what it’s built for: large-scale distributed processing, ML workloads, data that doesn’t fit on one machine. The cost model assumes that’s the work you’re doing.
Most SME ETL is smaller and simpler: a few million rows a night, a handful of tables joined, landed in a warehouse. Spark spins up a cluster to do work a single SQL query could handle.
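To make "a single SQL query" concrete, here is a minimal sketch of a nightly join-and-aggregate done as one statement. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for your warehouse. The table and column names (orders, customers, daily_revenue) are hypothetical, not taken from any real pipeline.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse.
# All table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES
        (1, 10, 120.0, '2024-01-01'),
        (2, 11, 80.0, '2024-01-01'),
        (3, 10, 50.0, '2024-01-02');
    INSERT INTO customers VALUES (10, 'north'), (11, 'south');
""")

# The whole nightly transform: join, aggregate, land the result in a table.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT o.order_date, c.region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY o.order_date, c.region
""")

rows = conn.execute(
    "SELECT order_date, region, revenue FROM daily_revenue ORDER BY order_date, region"
).fetchall()
print(rows)
```

In a real setup the same SELECT would live in a dbt model or a scheduled query. The point is that the engine doing the work is already running and nothing has to be provisioned.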
Where the DBUs quietly add up:
- Clusters sized for peak, running jobs that never hit it.
- All-purpose clusters left on for “interactive” work that’s mostly idle.
- Job clusters with slow startup, so people keep clusters warm to skip the wait.
- Simple transforms written in PySpark because the platform’s there, billed at Spark rates.
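The idle-cluster cases above come down to simple arithmetic. A rough sketch, assuming you know a cluster's DBU burn rate, your price per DBU, and how many billed hours were actually busy. The function names and every number in the example are made up for illustration, and the result covers only the DBU portion of the bill.

```python
def idle_share(billed_hours: float, busy_hours: float) -> float:
    """Fraction of billed cluster time in which no job was running."""
    if billed_hours <= 0:
        raise ValueError("billed_hours must be positive")
    if not 0 <= busy_hours <= billed_hours:
        raise ValueError("busy_hours must be between 0 and billed_hours")
    return (billed_hours - busy_hours) / billed_hours


def idle_dbu_cost(
    dbu_per_hour: float,
    price_per_dbu: float,
    billed_hours: float,
    busy_hours: float,
) -> float:
    """DBU spend attributable to idle time (DBU charges only)."""
    share = idle_share(billed_hours, busy_hours)  # also validates the inputs
    return dbu_per_hour * price_per_dbu * billed_hours * share


# Hypothetical all-purpose cluster: 200 billed hours in a month, 50 of them busy.
share = idle_share(billed_hours=200, busy_hours=50)
cost = idle_dbu_cost(dbu_per_hour=4, price_per_dbu=0.5, billed_hours=200, busy_hours=50)
print(f"{share:.0%} idle, {cost:.2f} spent on idle DBUs")
```

Plug in your own rates and hours. If the idle share is high, auto-termination or moving the job off the cluster is worth a look.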
The comparison that matters for a simple nightly pipeline:
- Spark / DBUs: handles anything, and you pay for a distributed engine whether or not the job needs one.
- Warehouse-native SQL (dbt on Snowflake, BigQuery, or Postgres): cheaper for set-based transforms, simpler to run, hits a real ceiling once data outgrows the warehouse.
- A small managed ELT tool for ingest plus SQL for the rest: least engineering, fine until volume or complexity demands more.
The honest rule I use: if the job fits comfortably in your warehouse and runs on a schedule, do it there.
Keep Databricks for what it’s good at. Then look at what your DBUs are actually paying for, and move the simple, high-frequency jobs to the cheapest thing that does them well.
Pull your top five DBU-consuming jobs. How many are just moving and joining tables?
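One way to pull that list, sketched under the assumption that you have exported usage records with one row per job run. The field names job_name and dbus are hypothetical, so map them to whatever your billing export actually calls them.

```python
from collections import defaultdict


def top_dbu_jobs(usage_rows, n=5):
    """Sum DBUs per job and return the n largest as (job_name, total_dbus)."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row["job_name"]] += row["dbus"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical export: one row per job run.
usage_rows = [
    {"job_name": "nightly_orders_join", "dbus": 40.0},
    {"job_name": "nightly_orders_join", "dbus": 35.0},
    {"job_name": "ml_feature_build", "dbus": 60.0},
    {"job_name": "copy_crm_tables", "dbus": 20.0},
    {"job_name": "copy_crm_tables", "dbus": 25.0},
]

for job_name, total in top_dbu_jobs(usage_rows, n=5):
    print(f"{job_name}: {total:.1f} DBUs")
```

Once the list is in front of you, the jobs that only move and join tables tend to be obvious from their names alone.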
