Most pipelines are designed for the happy path. The first real outage makes recovery the only thing that matters.

The design choices you skip while everything works become the reason recovery takes days instead of minutes.

Replayability is a design decision you make up front. A rough decision tree I use:

Is the pipeline idempotent? Can you rerun the same input and get the same result, with no double-counting? If no, fix that before anything else. Nothing downstream matters until reruns are safe.

Is the source data still there to replay from? If the source is a queue that’s already drained, land raw events first, before transforming. If it’s a database you can re-read, you have more freedom.

Can you replay a single partition, or only the whole thing? Table-level reprocessing turns a small fix into a giant job. Partition by time or key so you can replay just the broken slice.

Do you know what failed and where? Without run-level logging and clear checkpoints, replay means guessing which rows are bad. Cheap to add up front, painful to retrofit.

Score those four and you know your real recovery posture. The pattern across 2026 is that data engineering keeps shifting toward operations. How the system behaves when it breaks is becoming the job.

The pipelines that feed decisions or money need to recover fast. Know which ones those are before the outage tells you. The rest can wait for morning.

Pick your most important pipeline. If it failed mid-run tonight, could you replay just the broken part, or rerun everything and hope?