Your pipeline failed at 3am. The on-call engineer is now paid to clean up duplicates.
The pattern I see at most scaleups when an orchestration job fails partway through:
- Half the data is in the warehouse, the other half isn’t
- Retrying the whole job double-inserts the rows that already landed
- The on-call engineer manually runs SELECT DISTINCT queries at 3:30am to figure out what’s safe to re-process
- Recovery takes 4-6 hours
The root cause: the jobs themselves aren’t idempotent. Run them twice and you get a different end state.
Idempotent design means: running the same job twice produces the same end state as running it once. No duplicates. No half-states. Same outcome.
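That definition can be turned into a check. The sketch below is a minimal illustration, not a production test harness: it uses an in-memory SQLite database as a stand-in for the warehouse, and the `orders` table, `append_load`, and `is_idempotent` are hypothetical names made up for this example. It runs a naive INSERT-based job twice and compares the end states.

```python
import sqlite3

def snapshot(conn):
    # Full table contents in a stable order, so two states can be compared.
    return conn.execute(
        "SELECT order_id, amount FROM orders ORDER BY order_id, amount"
    ).fetchall()

def append_load(conn, rows):
    # Naive load: plain INSERT, so every retry adds the same rows again.
    with conn:
        conn.executemany(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows
        )

def is_idempotent(job, rows):
    # Run the job once, record the state, run it again, and compare.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    job(conn, rows)
    once = snapshot(conn)
    job(conn, rows)
    twice = snapshot(conn)
    conn.close()
    return once == twice

batch = [(1, 10.0), (2, 25.5)]
print(is_idempotent(append_load, batch))  # False: the second run doubled the rows
```

Any load function with the same `(conn, rows)` signature can be passed to the same check, which makes it a cheap regression test for a pipeline step.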
How you actually get there:
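- MERGE / UPSERT instead of INSERT: rows are matched on a stable unique key, so a retry updates what already landed instead of inserting it again. Most SQL warehouses support MERGE natively, but several don’t enforce primary keys, so the job has to supply the key itself.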
- Time-window partitions: each run targets a specific date range. Re-running yesterday’s job replaces yesterday’s partition. Doesn’t touch today’s.
- Run IDs in target tables: every row tagged with the run ID that produced it. Re-running a failed run deletes that run’s rows first, then re-inserts them, ideally in the same transaction so a second failure can’t leave the table empty.
- Checkpoint state in a dedicated table: jobs know which steps completed. Restart from the failed step, not the beginning.
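The four techniques above can be sketched in a few lines each. This is an illustration under assumptions, not a reference implementation: SQLite (3.24 or newer, for `ON CONFLICT`) stands in for the warehouse, where the upsert would typically be a MERGE statement, and every table and function name here is hypothetical.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE events (event_date TEXT, user_id INTEGER);
CREATE TABLE exports (run_id TEXT, payload TEXT);
CREATE TABLE checkpoints (run_id TEXT, step TEXT, PRIMARY KEY (run_id, step));
""")

def upsert_customers(rows):
    # Upsert keyed on customer_id: a retry overwrites instead of duplicating.
    with conn:
        conn.executemany(
            "INSERT INTO customers (customer_id, email) VALUES (?, ?) "
            "ON CONFLICT(customer_id) DO UPDATE SET email = excluded.email",
            rows,
        )

def load_partition(event_date, user_ids):
    # Replace one day's partition: delete and insert in a single transaction.
    with conn:
        conn.execute("DELETE FROM events WHERE event_date = ?", (event_date,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id) VALUES (?, ?)",
            [(event_date, user_id) for user_id in user_ids],
        )

def load_run(run_id, payloads):
    # Remove whatever this run wrote earlier, then write its rows again.
    with conn:
        conn.execute("DELETE FROM exports WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO exports (run_id, payload) VALUES (?, ?)",
            [(run_id, payload) for payload in payloads],
        )

def run_step(run_id, step, fn):
    # Skip steps already recorded for this run; record a step only after it succeeds.
    done = conn.execute(
        "SELECT 1 FROM checkpoints WHERE run_id = ? AND step = ?", (run_id, step)
    ).fetchone()
    if done:
        return False
    fn()
    with conn:
        conn.execute(
            "INSERT INTO checkpoints (run_id, step) VALUES (?, ?)", (run_id, step)
        )
    return True

def count(table):
    # Table names are fixed literals from this sketch, never user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Each load runs twice to simulate a retry; the row counts do not change.
for _ in range(2):
    upsert_customers([(1, "a@example.com"), (2, "b@example.com")])
    load_partition("2024-01-01", [10, 11, 11])
    load_run("run-001", ["x", "y"])
print(count("customers"), count("events"), count("exports"))  # 2 3 2

reload_events = lambda: load_partition("2024-01-01", [10, 11, 11])
first = run_step("run-001", "load_events", reload_events)
second = run_step("run-001", "load_events", reload_events)
print(first, second)  # True False
```

One design note: the checkpoint is written after the step commits, so a crash between the two causes the step to run again on restart. That is only safe because the step itself is idempotent, which is why checkpointing complements the other three techniques and does not replace them.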
The migration to idempotent pipelines is usually 2-4 weeks for a scaleup with a dozen production jobs. The first time the team retries a failed job in a click instead of paging the on-call engineer, the investment pays back.
A signal you need this: anyone on your team has “manual cleanup after pipeline failure” in their actual job description.
Is your most critical pipeline safe to re-run without thinking about it?
