Reconciliation Is Where Correctness Becomes Real

Many systems have a beautiful write path and a neglected recovery path. The API validates the command. The database transaction is tidy. The job is queued. The dashboard moves to green. Then a worker dies, a downstream service partially accepts the request, or a later step is delayed long enough that the original assumption is no longer true.

The write path made a promise. Reconciliation is where the system proves whether it kept it.

The thesis

Correctness is not the moment a command is accepted. Correctness is the ongoing ability to compare desired state with observed state, explain the difference, and move the system toward a valid state.

This is why reconciliation is not cleanup. It is part of the core design. If it is bolted on after incidents, it inherits missing identifiers, vague states, and unclear ownership. Then it becomes a cron job that knows too much and guarantees too little.

The production pattern

A user asks the system to create or change something. The local state records the request. Work fans out. One step succeeds, one step times out, and one step is delayed behind other work. The user sees a page that says the operation is complete, even though the only thing that finished was the initial command handler.

Hours later, support finds that the external resource exists but the local system does not reference it, or the local system shows access granted but the downstream service never applied the permission. Sometimes the reverse is worse: the local system believes a resource was removed, but the external resource is still active.

Every team eventually writes a repair script. The question is whether that script is a panic tool or a designed subsystem.

The model

I use a reconciliation loop with five responsibilities:

Desired state: the durable intent the system owns.
Observed state: the facts collected from databases, queues, external APIs, and logs that are allowed to act as evidence.
Difference: a named classification of how reality diverges from intent.
Repair: a bounded action that moves one difference toward a valid state.
Record: an audit trail of what was found, what was attempted, and what remains uncertain.
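
As a rough illustration, the five responsibilities can be sketched as one small loop over a single object. This is a minimal sketch, not a reference implementation: the type names (Desired, Observed, Record), the difference labels, and the observe and repair callables are all hypothetical, and a real system would track far more than whether a resource exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Desired:
    object_id: str
    should_exist: bool  # durable intent owned by this system


@dataclass(frozen=True)
class Observed:
    object_id: str
    exists: Optional[bool]  # None: the evidence could not be collected


@dataclass(frozen=True)
class Record:
    object_id: str
    difference: str
    attempted: str
    still_uncertain: bool


def diff(desired: Desired, observed: Observed) -> str:
    """Name the divergence instead of returning a bare boolean."""
    if observed.exists is None:
        return "unknown"
    if desired.should_exist and not observed.exists:
        return "missing_external"
    if not desired.should_exist and observed.exists:
        return "extra_external"
    return "in_sync"


def reconcile_one(
    desired: Desired,
    observe: Callable[[str], Observed],
    repair: Callable[[Desired, str], None],
) -> Record:
    difference = diff(desired, observe(desired.object_id))
    if difference in ("in_sync", "unknown"):
        # Nothing to do, or not enough evidence to act safely.
        return Record(desired.object_id, difference, "none", difference == "unknown")
    repair(desired, difference)
    # Re-observe rather than trusting that the repair call worked.
    after = diff(desired, observe(desired.object_id))
    return Record(desired.object_id, difference, "repair", after != "in_sync")
```

The second call to observe is the part that is easy to skip: the record says "uncertain" unless the world was checked again after the repair.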

The difference classification matters. "Broken" is not enough. The system needs to distinguish missing external resource, extra external resource, local record missing external identifier, permission drift, stale owner, duplicate side effect, and unknown after attempted repair. Each class has a different safe action.
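
One way to make those classes concrete is an enum plus a classifier that looks at the local record and whatever external resources were found for it. The sketch below is hypothetical: the Local and External shapes, their field names, and the order of the checks are assumptions chosen for illustration. The "unknown after attempted repair" class is listed but not produced here, because it depends on repair history rather than on a single snapshot.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Drift(Enum):
    NONE = "none"
    MISSING_EXTERNAL = "missing_external_resource"
    EXTRA_EXTERNAL = "extra_external_resource"
    MISSING_EXTERNAL_ID = "local_record_missing_external_identifier"
    PERMISSION_DRIFT = "permission_drift"
    STALE_OWNER = "stale_owner"
    DUPLICATE_SIDE_EFFECT = "duplicate_side_effect"
    UNKNOWN_AFTER_REPAIR = "unknown_after_attempted_repair"  # set by the repair step


@dataclass(frozen=True)
class Local:
    wanted: bool                 # does the desired state say this should exist?
    external_id: Optional[str]   # identifier recorded locally, if any
    owner: str
    permissions: frozenset


@dataclass(frozen=True)
class External:
    id: str
    owner: str
    permissions: frozenset


def classify(local: Local, matches: Sequence[External]) -> Drift:
    """matches: external resources found for this object, by id or request key."""
    if len(matches) > 1:
        return Drift.DUPLICATE_SIDE_EFFECT
    if not matches:
        return Drift.MISSING_EXTERNAL if local.wanted else Drift.NONE
    ext = matches[0]
    if not local.wanted:
        return Drift.EXTRA_EXTERNAL
    if local.external_id is None:
        return Drift.MISSING_EXTERNAL_ID
    if ext.owner != local.owner:
        return Drift.STALE_OWNER
    if ext.permissions != local.permissions:
        return Drift.PERMISSION_DRIFT
    return Drift.NONE
```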

The repair also needs a contract. Some repairs can be automatic. Some should only mark the object for human review. Some should stop because the evidence is contradictory. The loop should not pretend all drift is equal.
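
A repair contract can start as nothing more than an explicit table from drift class to policy. The entries below are hypothetical and the right ones depend on the product; the point of the sketch is only that contradictory evidence stops the loop and that unlisted classes never default to automatic repair.

```python
from enum import Enum


class Policy(Enum):
    AUTO = "auto"      # safe to repair without a human
    REVIEW = "review"  # mark the object for human review
    STOP = "stop"      # evidence is contradictory; change nothing


# Hypothetical policy table, keyed by drift class name.
REPAIR_POLICY = {
    "local_record_missing_external_identifier": Policy.AUTO,
    "missing_external_resource": Policy.AUTO,
    "permission_drift": Policy.REVIEW,
    "extra_external_resource": Policy.REVIEW,
    "duplicate_side_effect": Policy.REVIEW,
}


def decide(drift_class: str, evidence_conflicts: bool) -> Policy:
    if evidence_conflicts:
        return Policy.STOP
    # Unlisted classes fall back to review, never to automatic repair.
    return REPAIR_POLICY.get(drift_class, Policy.REVIEW)
```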

Where this goes wrong

The first failure is treating reconciliation as a batch query plus updates. That works until the query sees stale data, the update races with live traffic, or the repair emits side effects that were never meant to happen twice.
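
The race with live traffic, at least, has a well-known mitigation: make the repair conditional on the row still being at the version the detector saw. The sketch below uses an in-memory SQLite database; the objects table, its version column, and the repair_if_unchanged helper are assumptions for illustration. It does not address duplicate side effects, which need idempotency on the downstream call.

```python
import sqlite3


def repair_if_unchanged(conn, object_id, seen_version, new_status):
    """Apply the repair only if the row is still at the version the detector saw."""
    cur = conn.execute(
        "UPDATE objects SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, object_id, seen_version),
    )
    conn.commit()
    # Zero rows updated means live traffic changed the row first; re-detect.
    return cur.rowcount == 1


# Minimal setup so the sketch can be run as-is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id TEXT PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.execute("INSERT INTO objects VALUES ('a', 'pending', 1)")
conn.commit()
```

A repair that loses the race returns False and changes nothing, which is the behavior a plain batch update lacks.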

The second failure is giving reconciliation no product state. Users and operators need honest states like pending, verifying, repair needed, and blocked on manual review. If every drift condition is hidden behind "processing," the system trains people to ignore status.

The third failure is letting repair code bypass normal ownership rules. A script written during an incident often has broad credentials and no fences. It can fix one class of drift while creating another.

The fourth failure is measuring only how often reconciliation runs. Frequency is not the same as effectiveness. Better questions are: what drift classes exist, how old are they, which ones repeat, which repairs fail, and which objects require human decision?
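
Those questions translate fairly directly into a summary over reconciliation findings. In this sketch, Finding and its fields are hypothetical, and "repeat" is assumed to mean that the same object appears in more than one finding across the history passed in.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Finding:
    object_id: str
    drift_class: str
    first_seen: datetime
    repair_failed: bool
    needs_human: bool


def summarize(findings, now):
    """Answer: which classes exist, how old, which repeat, which fail, which wait."""
    oldest = {}
    for f in findings:
        age = now - f.first_seen
        oldest[f.drift_class] = max(oldest.get(f.drift_class, age), age)
    per_object = Counter(f.object_id for f in findings)
    return {
        "count_by_class": dict(Counter(f.drift_class for f in findings)),
        "oldest_by_class": oldest,
        "repeat_objects": sorted(o for o, n in per_object.items() if n > 1),
        "failed_repairs": sum(f.repair_failed for f in findings),
        "awaiting_human": sum(f.needs_human for f in findings),
    }
```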

There is a counterpoint. Some systems are intentionally disposable. If the desired state can be recreated cheaply and no external obligation exists, full reconciliation may be more machinery than the product needs. But if the system grants access, moves money, sends messages, provisions resources, or deletes user-visible data, reconciliation is not optional in practice.

What I do now

I design the reconciliation path while designing the write path. For every command, I ask what evidence will exist if the worker crashes after each side effect. If the answer is "we can look in logs," the design is not ready. Logs help humans investigate; they should not be the only source of repair truth.
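
One pattern that leaves usable evidence is to write a request key durably before the side effect and to send that key with the call. The sketch below assumes a downstream API that accepts a caller-supplied key and can be queried by it; FakeExternal, start, provision, and recover are all hypothetical stand-ins, and a plain dict plays the role of the durable store.

```python
import uuid


class FakeExternal:
    """Stand-in for a downstream API that accepts a caller-supplied request key."""

    def __init__(self):
        self.by_key = {}

    def create(self, request_key):
        # Idempotent: the same key always maps to the same resource.
        return self.by_key.setdefault(request_key, f"ext-{len(self.by_key) + 1}")

    def find(self, request_key):
        return self.by_key.get(request_key)


def start(store, object_id):
    # Durable intent, written before any side effect happens.
    store[object_id] = {"request_key": str(uuid.uuid4()), "external_id": None}


def provision(store, external, object_id, crash_after_call=False):
    row = store[object_id]
    ext_id = external.create(row["request_key"])
    if crash_after_call:
        raise RuntimeError("worker died after the side effect")
    row["external_id"] = ext_id  # record the evidence locally


def recover(store, external, object_id):
    row = store[object_id]
    if row["external_id"] is None:
        # The request key links the recorded intent to the side effect.
        row["external_id"] = external.find(row["request_key"])
    return row["external_id"]
```

If the worker dies between the call and the local write, the reconciler does not need logs: it looks up the key it already has.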

I keep repairs narrow. A reconciler should usually repair one object or one drift class at a time, with limits on how much it can change in one run. Large blind repair jobs are hard to stop and harder to trust.
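
A narrow repair run can be as simple as a filter plus a budget. In this sketch the finding dictionaries, the repair callable, and the max_repairs parameter are hypothetical; everything outside the chosen drift class or beyond the budget is returned as deferred rather than silently dropped.

```python
def repair_batch(findings, repair, drift_class, max_repairs):
    """Repair at most max_repairs findings of one drift class; defer the rest."""
    repaired, deferred = [], []
    for finding in findings:
        in_scope = finding["drift_class"] == drift_class
        if in_scope and len(repaired) < max_repairs:
            repair(finding)
            repaired.append(finding["object_id"])
        else:
            deferred.append(finding["object_id"])
    return repaired, deferred
```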

I separate detection from action. Detection should be able to run without changing state and produce a useful report. Action should be gated by drift class, ownership, and risk. This makes the system easier to operate during an incident because you can ask "what would it do?" before letting it do anything.
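
In code, the separation can look like a pure detect function that returns a plan and an apply step that runs only what has been explicitly allowed. This is a minimal sketch under stated assumptions: desired and observed are plain dictionaries of object id to existence, the names PlannedAction, detect, and apply_plan are hypothetical, and the gate covers drift class only, not ownership or risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    object_id: str
    drift_class: str
    action: str


def detect(desired, observed):
    """Read two snapshots, change nothing, and return a plan."""
    plan = []
    for object_id in sorted(set(desired) | set(observed)):
        want = desired.get(object_id, False)
        have = observed.get(object_id, False)
        if want and not have:
            plan.append(PlannedAction(object_id, "missing_external_resource", "create"))
        elif have and not want:
            plan.append(PlannedAction(object_id, "extra_external_resource", "delete"))
    return plan


def apply_plan(plan, allowed_classes, execute):
    """Run only the steps whose drift class is allowed; return the skipped ones."""
    skipped = []
    for step in plan:
        if step.drift_class in allowed_classes:
            execute(step)
        else:
            skipped.append(step)
    return skipped
```

Calling detect alone is the "what would it do?" report. Passing an empty set to apply_plan is a dry run that still lists everything it declined to touch.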

I also make reconciliation visible. If a product accepts delayed or partial recovery, users should not have to guess. A precise pending state is better than a false success state, even when it creates uncomfortable conversations about service promises.

For agents, reconciliation is the difference between a transcript that sounds plausible and a system that checks the world. The planner may believe it filed the ticket, changed the config, or notified the user. The runtime still needs to verify the durable state outside the model.
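
For an agent runtime, the check can be a small function that compares each claimed action with what a durable system reports. The claim shape (kind, ref, expected) and the lookup callable are hypothetical here; in practice lookup would query the ticket tracker, config store, or notification log directly rather than reading the transcript.

```python
def verify_claims(claims, lookup):
    """Compare what the agent says it did with what a durable system reports."""
    results = {}
    for claim in claims:
        actual = lookup(claim["kind"], claim["ref"])
        if actual is None:
            results[claim["ref"]] = "not_found"
        elif actual == claim["expected"]:
            results[claim["ref"]] = "verified"
        else:
            results[claim["ref"]] = "mismatch"
    return results
```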

Closing takeaway

The write path expresses intent. Reconciliation earns correctness by checking whether the world actually moved.