Agentic Systems

Idempotency Is Durable Intent

Idempotency works when it preserves a decision across retries, crashes, and uncertain outcomes.

Idempotency is often described as making retries safe. That description is too small. Safe retries are the symptom. The deeper idea is that the system can recognize the same intent after the caller, worker, network, or planner has lost certainty.

When idempotency is treated as a decorator on an endpoint, it usually fails in the place that matters: after the first attempt has crossed a side-effect boundary.

The thesis

Idempotency is durable intent, not duplicate suppression.

The system needs to remember what it agreed to do, under what identity, with what parameters, and what outcome is known so far. If that record exists, a retry can become a question: "what happened to this intent?" If that record does not exist, a retry becomes a second command wearing the same costume.

This distinction matters more as systems become more automated. Agents retry tool calls. Workers restart. Clients refresh pages. Schedulers requeue delayed work. Without durable intent, each recovery path can accidentally create a new side effect.

The production pattern

The usual failure starts with a request that looks harmless. Create an account. Send an invoice. Start an import. Publish a document. Grant access. The caller sets a timeout because every caller eventually sets a timeout. The downstream operation takes longer than expected. The caller retries.

If the downstream system only sees two similar payloads, it has no stable reason to treat them as one operation. The result may be duplicate emails, duplicate workspaces, duplicate charges, duplicate jobs, or two conflicting writes racing toward the same resource.

Teams often patch this with request-body hashing. That helps only when the same bytes always mean the same business intent. In real systems, timestamps, generated names, defaults, auth context, and harmless metadata can change between attempts. Worse, two different business intents can sometimes have the same visible payload from the downstream service's point of view.
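A minimal sketch of both failure modes, assuming a hypothetical invoice payload with made-up field names (`customer`, `amount_cents`, `sent_at`). It hashes a canonical JSON form of the body, which is roughly what request-body hashing amounts to.

```python
import hashlib
import json


def body_hash(payload: dict) -> str:
    """Hash the canonical JSON form of a request body."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# One business intent, two attempts: only the client timestamp differs.
first_attempt = {"customer": "c-17", "amount_cents": 4200, "sent_at": "2024-05-01T10:00:00Z"}
retry_attempt = {"customer": "c-17", "amount_cents": 4200, "sent_at": "2024-05-01T10:00:31Z"}

# Two separate intents (two invoices for the same amount) with identical bodies.
invoice_a = {"customer": "c-17", "amount_cents": 4200}
invoice_b = {"customer": "c-17", "amount_cents": 4200}

# The retry is not recognized as the same intent.
same_intent_matches = body_hash(first_attempt) == body_hash(retry_attempt)

# Two distinct intents are indistinguishable.
different_intents_collide = body_hash(invoice_a) == body_hash(invoice_b)
```

The hash answers "are these the same bytes?", which is a different question from "is this the same decision?".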

The model

I look for four pieces:

  • Operation key: a caller-owned identity for the business intent.
  • Intent record: a durable row or event written before the side effect starts.
  • Parameter binding: the exact command shape accepted for that key.
  • Outcome record: the current known result, including uncertain and rejected states.

The operation key is not a trace id. A trace id identifies an execution path. An operation key identifies the user's or system's intent. One operation may have many attempts and many traces.
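One way the four pieces might fit into a single record, as a sketch rather than a prescribed schema. The class name, field names, and outcome states are illustrative assumptions; the point is that trace ids hang off the operation instead of identifying it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List, Optional


class Outcome(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    REJECTED = "rejected"
    UNKNOWN = "unknown"  # crossed a side-effect boundary, result not confirmed


@dataclass
class IntentRecord:
    operation_key: str                  # caller-owned identity for the business intent
    params: dict                        # parameter binding accepted for this key
    outcome: Outcome = Outcome.PENDING  # current known result
    result: Optional[Any] = None
    attempt_trace_ids: List[str] = field(default_factory=list)  # many traces, one intent


record = IntentRecord("invoice-2024-05-c17", {"customer": "c-17", "amount_cents": 4200})

# Two attempts at the same operation, each with its own trace id.
record.attempt_trace_ids.append("trace-a1")
record.attempt_trace_ids.append("trace-b2")
```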

The parameter binding is what stops accidental key reuse. If the same key arrives with different meaningful parameters, the system should reject it or return a conflict. Quietly accepting the new payload turns the key into a footnote.
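A sketch of that check, assuming the service has decided which fields are meaningful (`MEANINGFUL_FIELDS` below is a hypothetical choice) and using an in-memory dict as a stand-in for a durable table.

```python
import hashlib
import json

# Assumption: only these fields define the business intent; the rest is metadata.
MEANINGFUL_FIELDS = ("customer", "amount_cents")

bindings = {}  # operation key -> binding fingerprint; stand-in for durable storage


class KeyConflict(Exception):
    """The same operation key arrived with different meaningful parameters."""


def binding(params: dict) -> str:
    meaningful = {name: params.get(name) for name in MEANINGFUL_FIELDS}
    canonical = json.dumps(meaningful, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept(operation_key: str, params: dict) -> bool:
    """Return True for a new intent, False for a replay; raise on a mismatch."""
    new_binding = binding(params)
    existing = bindings.get(operation_key)
    if existing is None:
        bindings[operation_key] = new_binding
        return True
    if existing != new_binding:
        raise KeyConflict(operation_key)
    return False
```

A retry whose timestamp changed is still a replay under this rule, while a retry with a different amount is rejected instead of silently replacing the original command.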

The outcome record is what lets the retry return the right answer. If the first attempt succeeded, the second call can return the original result. If the first attempt is still running, it can return pending. If the first attempt is unknown after crossing a boundary, it can return pending verification rather than firing again.
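As a sketch, answering a retry can be a pure lookup over the stored outcome. The state names and response shapes here are assumptions chosen for illustration; what matters is that no branch runs the side effect again.

```python
from enum import Enum


class Outcome(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    REJECTED = "rejected"
    UNKNOWN = "unknown"  # crossed a boundary, result not confirmed


def answer_retry(record: dict) -> dict:
    """Answer a retry from the stored record without executing anything."""
    outcome = record["outcome"]
    if outcome is Outcome.SUCCEEDED:
        return {"status": "succeeded", "result": record["result"]}
    if outcome is Outcome.REJECTED:
        return {"status": "rejected", "reason": record["reason"]}
    if outcome is Outcome.RUNNING:
        return {"status": "pending"}
    # Unknown: report that verification is needed rather than firing again.
    return {"status": "pending_verification"}


done = answer_retry({"outcome": Outcome.SUCCEEDED, "result": "inv-881"})
unsure = answer_retry({"outcome": Outcome.UNKNOWN})
```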

Where this goes wrong

The first common mistake is using a short-lived cache as the source of truth. A cache can reduce load, but it is a poor home for business intent. If the key expires after the downstream system has already applied the side effect, a later retry looks like a new operation and duplicates it.
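A toy illustration of the expiry problem, with a hypothetical `TtlKeyCache` and an explicit `now` argument in place of a real clock so the timeline is easy to follow.

```python
class TtlKeyCache:
    """Remembers a key only until its time-to-live runs out."""

    def __init__(self, ttl_seconds: int):
        self.ttl = ttl_seconds
        self.expiry_by_key = {}

    def first_time(self, key: str, now: int) -> bool:
        expiry = self.expiry_by_key.get(key)
        if expiry is not None and now < expiry:
            return False  # still remembered: treat as a duplicate
        self.expiry_by_key[key] = now + self.ttl
        return True  # never seen, or forgotten after expiry


cache = TtlKeyCache(ttl_seconds=60)
charges = []  # stands in for the downstream side effect


def charge(operation_key: str, now: int) -> None:
    if cache.first_time(operation_key, now):
        charges.append(operation_key)


charge("op-1", now=0)    # first attempt: charged
charge("op-1", now=30)   # retry inside the TTL: suppressed
charge("op-1", now=120)  # retry after expiry: charged a second time
```

The downstream system still holds the first charge at `now=120`, but the cache no longer knows that.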

The second mistake is storing only successful outcomes. Failures, rejections, and unknown states are part of the idempotency contract. If a rejected command can later become accepted under the same key because the earlier rejection was not remembered, the caller cannot reason about the system.

The third mistake is putting idempotency at the wrong layer. An API gateway can notice duplicate HTTP requests, but it usually cannot know whether two requests represent the same durable business intent. The service that owns the side effect needs to participate.

The fourth mistake is making the endpoint idempotent while the internal effects are not. A create call may reuse the same database row but still send a welcome email twice, enqueue two imports, or publish two audit events. Idempotency has to cover the effect graph, not just the first write.

There is a real limit here. Not every operation deserves a full operation ledger. A low-value preference toggle may be better served by last-write-wins semantics. But once an operation crosses into money, access, provisioning, publication, or irreversible communication, durable intent is usually cheaper than cleanup.

What I do now

I ask which component owns the operation key. If the answer is "the client library generates one for every HTTP request," I push back. The key should represent a business decision, not a transport attempt.

I ask where the intent is written. It should be written before the external side effect starts. If the first durable record appears only after the external call returns, a crash in between leaves a side effect with no record of it, and the crash window is still open.
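A minimal sketch of that ordering, using Python's built-in `sqlite3` with an in-memory database as a stand-in for whatever durable store the service actually uses. The table layout and the status values are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE intents (operation_key TEXT PRIMARY KEY, status TEXT NOT NULL)")


def run(operation_key: str, side_effect) -> str:
    try:
        with db:  # commits the intent row before the side effect is attempted
            db.execute("INSERT INTO intents VALUES (?, 'started')", (operation_key,))
    except sqlite3.IntegrityError:
        # The key already exists: report the recorded status, do not re-run.
        row = db.execute(
            "SELECT status FROM intents WHERE operation_key = ?", (operation_key,)
        ).fetchone()
        return row[0]
    side_effect()  # a crash here leaves a 'started' row behind as evidence
    with db:
        db.execute("UPDATE intents SET status = 'done' WHERE operation_key = ?", (operation_key,))
    return "done"


calls = []


def flaky_effect():
    calls.append("charged")
    raise RuntimeError("simulated crash after the effect happened")


try:
    run("op-1", flaky_effect)
except RuntimeError:
    pass

status_on_retry = run("op-1", flaky_effect)
```

On the retry, the `started` row tells the service that the effect may already have happened, so the question becomes verification instead of a second execution.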

I ask what happens when the same key arrives with different parameters. The answer should be explicit. Return the original operation, reject the mismatch, or require a new key. Do not quietly reinterpret old intent.

I ask whether downstream effects use the same identity. If the workflow creates a job, sends a message, and calls an external API, those children need stable identities too. Otherwise the top-level key gives a false sense of safety.
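One possible shape for this, assuming child identities are derived deterministically from the parent operation key and a step name. The `child_key` format and the in-memory `sent` dict (standing in for a durable outbox with a unique constraint) are illustrative choices.

```python
sent = {}  # child key -> payload; stand-in for a durable outbox


def child_key(operation_key: str, step: str) -> str:
    """Derive a stable identity for one downstream effect of an operation."""
    return f"{operation_key}/{step}"


def emit_once(key: str, payload: str) -> bool:
    """Record the effect under its child key; return False if it already exists."""
    if key in sent:
        return False
    sent[key] = payload
    return True


def run_workflow(operation_key: str) -> list:
    return [
        emit_once(child_key(operation_key, "welcome-email"), "Welcome!"),
        emit_once(child_key(operation_key, "import-job"), "start import"),
    ]


first_run = run_workflow("signup-c17")
second_run = run_workflow("signup-c17")  # replay of the same operation
```

Because the child keys depend only on the parent key and the step, a replayed workflow finds its earlier effects instead of creating new ones.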

For agentic systems, I also separate the model's proposed action from the system's accepted intent. The model can suggest "send this email" twice. The runtime should decide whether those suggestions map to an existing operation or a new one.
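A sketch of that separation. The mapping rule below (same task, same tool, same normalized arguments means same intent) is a hypothetical policy, and the proposal shape is invented for illustration; a real runtime would have to choose its own rule deliberately.

```python
import hashlib
import json

accepted = {}  # operation key -> accepted proposal; stand-in for a durable intent store


def operation_key_for(task_id: str, proposal: dict) -> str:
    # Hypothetical rule: within one task, the same tool with the same
    # normalized arguments is treated as the same intent.
    normalized = {
        "tool": proposal["tool"],
        "args": {k: str(v).strip().lower() for k, v in proposal["args"].items()},
    }
    canonical = json.dumps(normalized, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{task_id}:{digest[:16]}"


def accept(task_id: str, proposal: dict):
    """Return (operation key, True) for a new intent, (key, False) for a repeat."""
    key = operation_key_for(task_id, proposal)
    if key in accepted:
        return key, False
    accepted[key] = proposal
    return key, True


first = {"tool": "send_email", "args": {"to": "Ana@example.com", "subject": "Invoice ready"}}
again = {"tool": "send_email", "args": {"to": " ana@example.com ", "subject": "invoice ready"}}

key_1, is_new_1 = accept("task-9", first)
key_2, is_new_2 = accept("task-9", again)  # same suggestion, slightly different text
```

The model's second suggestion resolves to the existing operation, so the runtime can answer with the earlier outcome instead of sending again.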

Closing takeaway

Idempotency is not a retry trick. It is the system remembering what it meant to do after the execution path becomes unreliable.