Backfills Are Production Deployments

Why data correction jobs need rollout discipline, observability, and rollback thinking.

A backfill can look less serious than a deploy because it often lives outside the request path. It may be a script, a batch job, a replay, a repair command, or a one-off workflow. No user clicks the backfill button. No product launch depends on the job name.

But the effect is real. A backfill changes production truth.

The thesis

Backfills are production deployments because they modify user-visible state under real constraints. They deserve scope, rollout discipline, observability, and rollback thinking proportional to the truth they change.

The code may already exist. The schema may already be migrated. The feature may already be live. None of that makes the correction safe by default.

The production pattern

A system discovers that historical data needs repair. A previous bug wrote incomplete records. A new field must be populated from older facts. A derived table fell behind. An eligibility rule changed. A deduplication pass needs to collapse old duplicates. A search index must be rebuilt after semantics changed.

The team writes a job. It passes on a sample. The job is pointed at production. It runs for longer than expected, touches more records than expected, or produces a dispute nobody expected. Sometimes the backfill is technically correct but operationally disruptive: it overloads a dependency, invalidates caches, floods downstream consumers, or changes a user-facing total in a way support cannot explain.

Using a batch job was not the mistake. The mistake was treating production mutation as cleanup instead of deployment.

The trap

The trap is thinking the risk ended when the code shipped.

Backfills happen after the product has accumulated real state, real consumers, and real expectations. That makes them different from schema migrations and different from ordinary feature deploys. A feature deploy changes future behavior. A backfill changes the past as represented by the system.

That distinction matters. Old records may have been manually corrected. Downstream systems may have learned the old shape. Reports may have been exported. Users may have acted on prior values. Recomputing history can be correct and still surprising.

The second trap is relying on "we can rerun it" as the rollback plan. Rerun is not rollback. If the job writes bad values, triggers downstream effects, or deletes evidence needed for repair, rerunning may only repeat the mistake faster.

The model

I review backfills with six fields.

Scope: which records are eligible, and how is that set proven? The safest backfills start by producing a candidate list, not by scanning and mutating in one motion. Scope should include exclusions: records that are too new, manually locked, already corrected, legally sensitive, or ambiguous.
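A minimal Python sketch of scope as a separate step, under stated assumptions: the `Record` fields, the seven-day minimum age, and the reason strings are all hypothetical and stand in for whatever your data model uses. The point is that the candidate list and the exclusion reasons are both produced before anything is mutated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical fields, chosen only to illustrate exclusions.
    id: int
    created_at: datetime
    manually_locked: bool
    already_corrected: bool


def exclusion_reason(rec: Record, now: datetime, min_age: timedelta) -> Optional[str]:
    """Return why a record is out of scope, or None if it is eligible."""
    if now - rec.created_at < min_age:
        return "too_new"
    if rec.manually_locked:
        return "manually_locked"
    if rec.already_corrected:
        return "already_corrected"
    return None


def build_scope(records, now, min_age=timedelta(days=7)):
    """Split records into a candidate ID list and a map of excluded ID -> reason."""
    candidates, excluded = [], {}
    for rec in records:
        reason = exclusion_reason(rec, now, min_age)
        if reason is None:
            candidates.append(rec.id)
        else:
            excluded[rec.id] = reason
    return candidates, excluded


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    Record(1, now - timedelta(days=30), False, False),
    Record(2, now - timedelta(days=1), False, False),
    Record(3, now - timedelta(days=30), True, False),
    Record(4, now - timedelta(days=30), False, True),
]
candidates, excluded = build_scope(records, now)
print(candidates)  # [1]
print(excluded)
```

Keeping the excluded map around is deliberate: it is the evidence that the exclusions were applied, and it can be reviewed before the first write.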

Dry run: what would change if the job ran? A useful dry run reports counts by class, representative samples, expected deltas, skipped records, and reasons for uncertainty. A dry run that only says "no exceptions" is not enough.
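One way a dry-run report could look, sketched in Python. The correction rule here (deriving a `region` from a `country`) is a made-up example, as are the field names and the class labels; what matters is that the planner classifies every row without writing anything, and that the report counts skips and uncertain rows alongside updates.

```python
from collections import Counter

# Hypothetical mapping used only for this illustration.
REGION_BY_COUNTRY = {"DE": "EU", "FR": "EU", "US": "NA"}


def plan_change(row):
    """Classify a row as update, skip, or uncertain. Never mutates the row."""
    if row.get("region") is not None:
        return ("skip", "region_already_set", row["region"])
    country = row.get("country")
    if country not in REGION_BY_COUNTRY:
        return ("uncertain", "unknown_country", None)
    return ("update", "derived_from_country", REGION_BY_COUNTRY[country])


def dry_run(rows, samples_per_class=2):
    """Report counts by (action, reason) plus a few before/after samples per class."""
    counts = Counter()
    samples = {}
    for row in rows:
        action, reason, after = plan_change(row)
        key = (action, reason)
        counts[key] += 1
        bucket = samples.setdefault(key, [])
        if len(bucket) < samples_per_class:
            bucket.append({"id": row["id"], "before": row.get("region"), "after": after})
    return {"counts": dict(counts), "samples": samples}


rows = [
    {"id": 1, "country": "DE", "region": None},
    {"id": 2, "country": "US", "region": None},
    {"id": 3, "country": "FR", "region": "EU"},
    {"id": 4, "country": "ZZ", "region": None},
]
report = dry_run(rows)
for key, count in report["counts"].items():
    print(key, count, report["samples"][key])
```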

Checkpointing: can the job stop, resume, and avoid double effects? A long-running correction needs progress markers, idempotency, batch sizing, rate limits, and a plan for partial completion. A backfill that must succeed in one uninterrupted run is a fragile deployment.
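A small Python sketch of a resumable loop, assuming records have ascending integer IDs and that the correction is an idempotent "set this value" write. The in-memory `store`, the JSON checkpoint file, and the `max_batches` knob (used here only to simulate an interruption) are illustrative assumptions, not a prescription.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the last fully processed ID, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["last_id"]


def save_checkpoint(path, last_id):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_id": last_id}, f)
    os.replace(tmp, path)


def run_backfill(store, checkpoint_path, batch_size=2, pause_s=0.0, max_batches=None):
    """Process IDs in ascending order, one batch at a time, resuming after the checkpoint."""
    last_id = load_checkpoint(checkpoint_path)
    batches = 0
    while max_batches is None or batches < max_batches:
        # A real job would use a keyset query (WHERE id > last_id ORDER BY id LIMIT n).
        batch = sorted(i for i in store if i > last_id)[:batch_size]
        if not batch:
            break
        for rec_id in batch:
            # Idempotent: sets a value rather than incrementing, so replaying a batch is harmless.
            store[rec_id]["status"] = "corrected"
        last_id = batch[-1]
        save_checkpoint(checkpoint_path, last_id)
        batches += 1
        time.sleep(pause_s)  # crude rate limit between batches
    return last_id


store = {i: {"status": "stale"} for i in range(1, 6)}
path = os.path.join(tempfile.mkdtemp(), "backfill.checkpoint.json")
first = run_backfill(store, path, max_batches=1)  # simulate an interrupted run
second = run_backfill(store, path)                # resume from the checkpoint
print(first, second)  # 2 5
```

The checkpoint is saved only after a batch is fully written. If the process dies mid-batch, the resumed run replays that batch, which is why the write itself has to be idempotent.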

Audit sample: how will humans inspect correctness before, during, and after rollout? I want sampled before-and-after records, edge cases, high-value categories, and examples where the job intentionally does nothing. The "does nothing" cases often catch semantic bugs.
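A sketch of an audit sample that deliberately includes the "does nothing" cases, written in Python. The `category`, `before`, and `after` keys are hypothetical; the idea is to stratify by category and by outcome so that no-op records get their own slots instead of being crowded out by changed ones.

```python
import random
from collections import defaultdict


def audit_sample(changes, per_stratum=2, seed=0):
    """Pick up to `per_stratum` records from each (category, outcome) stratum."""
    rng = random.Random(seed)  # fixed seed so reviewers can reproduce the same sample
    strata = defaultdict(list)
    for change in changes:
        outcome = "no_op" if change["before"] == change["after"] else "changed"
        strata[(change["category"], outcome)].append(change)
    return {
        key: rng.sample(items, min(per_stratum, len(items)))
        for key, items in strata.items()
    }


changes = [
    {"id": 1, "category": "high_value", "before": 10, "after": 12},
    {"id": 2, "category": "high_value", "before": 5, "after": 5},
    {"id": 3, "category": "standard", "before": 1, "after": 2},
    {"id": 4, "category": "standard", "before": 3, "after": 4},
    {"id": 5, "category": "standard", "before": 7, "after": 9},
    {"id": 6, "category": "standard", "before": 8, "after": 8},
]
sample = audit_sample(changes)
for stratum, picked in sample.items():
    print(stratum, [c["id"] for c in picked])
```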

Rollback story: what happens if the job is wrong? The answer may be a restore from captured prior values, a compensating backfill, a dual-written correction table, pause and quarantine, or manual review. The rollback story should be written before the first production batch.
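To make the "restore from captured prior values" option concrete, here is a minimal Python sketch. The dict-backed `store` and the list-backed `log` are stand-ins for a real table and a real correction log, and the conflict rule (skip any record whose current value no longer matches what the backfill wrote) is one reasonable assumption among several.

```python
def apply_with_capture(store, updates, log):
    """Record the prior value of each record before overwriting it."""
    for rec_id, new_value in updates.items():
        log.append({"id": rec_id, "before": store[rec_id], "after": new_value})
        store[rec_id] = new_value


def restore(store, log):
    """Undo logged changes newest-first; skip records modified since the backfill."""
    conflicts = []
    for entry in reversed(log):
        if store[entry["id"]] != entry["after"]:
            # Someone changed the record after the backfill wrote it.
            # Leave it alone and flag it for manual review.
            conflicts.append(entry["id"])
            continue
        store[entry["id"]] = entry["before"]
    return conflicts


store = {1: "a", 2: "b", 3: "c"}
log = []
apply_with_capture(store, {1: "A", 2: "B"}, log)
store[2] = "user-edit"  # a later, legitimate change to record 2
conflicts = restore(store, log)
print(store)      # {1: 'a', 2: 'user-edit', 3: 'c'}
print(conflicts)  # [2]
```

The conflict check is why rerunning is not the same as rolling back: a restore has to respect whatever happened to the record after the bad write.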

User impact: who sees changed truth? A backfill may alter counts, statuses, permissions, balances, recommendations, exports, notifications, or dashboards. If the change is visible, support and product need the explanation before users find the inconsistency.

This model does not make backfills slow by default. It makes risk visible enough to choose the right amount of ceremony.

Where this model breaks

Over-processing tiny cleanup jobs is wasteful. If a job changes a few internal records, has no downstream effects, and can be manually verified, a full rollout plan is unnecessary.

The key is proportionality. I do not need a heavyweight process for every correction. I do need teams to notice when a job crosses one of three boundaries: large scope, decision-bearing data, or irreversible effects.
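The three boundaries can be written down as a tiny check. This Python sketch is only an illustration: the `BackfillProfile` fields and the 10,000-record threshold are assumptions that each team would set for itself.

```python
from dataclasses import dataclass


@dataclass
class BackfillProfile:
    # Hypothetical summary of a planned job.
    record_count: int
    decision_bearing: bool  # does the data drive billing, permissions, eligibility, etc.?
    reversible: bool        # can prior state be restored, including downstream effects?


def boundaries_crossed(profile, large_scope_threshold=10_000):
    """Return which of the three boundaries this job crosses."""
    crossed = []
    if profile.record_count >= large_scope_threshold:
        crossed.append("large_scope")
    if profile.decision_bearing:
        crossed.append("decision_bearing_data")
    if not profile.reversible:
        crossed.append("irreversible_effects")
    return crossed


tiny_cleanup = BackfillProfile(record_count=12, decision_bearing=False, reversible=True)
balance_fix = BackfillProfile(record_count=250_000, decision_bearing=True, reversible=False)
print(boundaries_crossed(tiny_cleanup))  # []
print(boundaries_crossed(balance_fix))
```

An empty list suggests the lightweight path is fine; anything else is a prompt to write the rollout plan.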

There is also a counterpoint around speed. During an active incident, repair may need to happen quickly. Even then, the model can be compressed rather than skipped. Write down the scope, sample, checkpoint, observe, and record what changed. The discipline matters more when the clock is loud.

What I do now

I ask teams to describe backfills as rollout plans, not scripts.

The plan says what will run first, what evidence allows the next batch, what dashboard or query watches progress, what rate limit protects dependencies, and who can stop the job. For broad backfills, I prefer staged execution: small sample, one low-risk segment, one representative segment, then wider rollout.
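Staged execution with an explicit gate between stages might be sketched like this in Python. The stage names follow the sequence above; the `run_batch` callback, the 1% error budget, and the `should_stop` hook are hypothetical placeholders for a real job runner, a real success metric, and a real kill switch.

```python
def staged_rollout(stages, run_batch, error_budget=0.01, should_stop=lambda: False):
    """Run stages in order; halt when a stage exceeds the error budget or is stopped."""
    completed = []
    for name, ids in stages:
        if should_stop():
            return completed, f"stopped before {name}"
        # run_batch returns True on success. Note that a failing stage has still
        # mutated its successful records, so the rollback story must cover it.
        failures = sum(0 if run_batch(rec_id) else 1 for rec_id in ids)
        rate = failures / len(ids) if ids else 0.0
        if rate > error_budget:
            return completed, f"halted at {name}: failure rate {rate:.2%}"
        completed.append(name)
    return completed, "done"


stages = [
    ("small_sample", [1, 2]),
    ("low_risk_segment", [3, 4, 5, 6]),
    ("representative_segment", [7, 8, 9, 10]),
    ("wide_rollout", list(range(11, 21))),
]
# Pretend record 8 fails validation after the write.
completed, status = staged_rollout(stages, run_batch=lambda rec_id: rec_id != 8)
print(completed)  # ['small_sample', 'low_risk_segment']
print(status)     # halted at representative_segment: failure rate 25.00%
```

The evidence that allows the next batch is encoded in the gate, and the person who can stop the job is encoded in `should_stop`. Both are decided before the run, not during it.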

I also want prior values captured when the mutation matters. That can be a correction log, shadow table, versioned record, object snapshot, or event trail. The exact mechanism depends on the system, but the principle is simple: if you cannot explain what changed, you cannot confidently repair it.

For derived data, I separate recompute from reinterpretation. Recompute applies the same meaning more accurately. Reinterpretation changes the meaning of historical facts. The second needs product approval because it can change reports, workflows, and user expectations.

The principal-engineer lens is sequencing. A backfill usually touches code, data, operations, support, and product truth at once. Treating it as "just a job" hides that coordination. Treating it as a deployment gives the organization a way to reason about blast radius.

Closing takeaway

Before running a backfill, ask the same questions you would ask of any production deploy: what will change, how will we know, and how do we recover if our understanding is wrong?