Ambiguous Outcomes Are the Core Distributed Systems Problem
A clean failure is almost relaxing. The request was rejected, the job never started, the write did not commit, and the caller can make a new decision. The expensive cases are the ones that end in silence. A timeout fires, a connection drops, a worker loses its lease, or a tool call returns nothing after it may already have changed the outside world.
That is the shape I care about most when I review distributed systems and agentic systems: not "did it fail?" but "what do we know about what happened?"
The thesis
The ambiguous outcome is the central reliability problem. In my experience, most production damage comes from treating "unknown" as either "failed" or "succeeded," because that lets the system move forward with a false story.
Retries, queues, workflows, and agents all become dangerous when they erase that distinction. A retry can duplicate a side effect. A success path can skip repair for work that never finished. A human operator can clear a ticket because the dashboard says "timeout" when the payment, email, deployment, or deletion actually happened.
The first design move is to make ambiguity a first-class state.
The production pattern
The common version is simple. A service receives an instruction to perform an external effect. It writes some local state, calls another system, and waits. The downstream system is slow. The caller times out. The network drops before the response arrives. The worker crashes after sending the request but before recording the result.
Now the system has two facts that do not fit together. Locally, it may have no completion record. Externally, the effect may already exist. If the caller repeats the operation without a durable identity, the downstream system sees a second request, not a question about the first one.
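To make the difference between "a second request" and "a question about the first one" concrete, here is a minimal sketch. The `Downstream` class and its `send` method are hypothetical in-memory stand-ins, not any real service's API; the only idea being shown is that a caller-generated key turns a repeat into a lookup.

```python
import uuid


class Downstream:
    """Hypothetical in-memory stand-in for an external system."""

    def __init__(self):
        self.effects = []  # every side effect actually performed
        self.by_key = {}   # operation key -> result recorded for that key

    def send(self, payload, operation_key=None):
        # Without a key, every request looks like brand-new work.
        if operation_key is None:
            self.effects.append(payload)
            return {"effect_id": len(self.effects)}
        # With a key, a repeat is a question about the first request.
        if operation_key not in self.by_key:
            self.effects.append(payload)
            self.by_key[operation_key] = {"effect_id": len(self.effects)}
        return self.by_key[operation_key]


downstream = Downstream()

# Blind repeat: the caller lost the first response and sends again.
downstream.send("welcome email")
downstream.send("welcome email")
effects_after_blind_repeat = len(downstream.effects)

# Keyed repeat: the key is generated once, when the intent is formed.
key = str(uuid.uuid4())
first = downstream.send("invoice email", operation_key=key)
second = downstream.send("invoice email", operation_key=key)
```

The blind repeat leaves two welcome emails behind. The keyed repeat leaves one invoice email, and `second` is simply the stored answer to `first`.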
This is not limited to payments or obvious high-risk actions. It shows up in provisioning, sending notifications, changing access, importing files, scheduling jobs, publishing documents, and invoking agent tools. The side effect may be small, but the uncertainty compounds when later steps depend on it.
The model
I use a small model for every effectful operation:
- Intent: what the system decided to do, recorded before the attempt.
- Attempt: the specific execution of that intent, including the target and parameters.
- Outcome: the evidence collected after the attempt.
- Reconciliation: the later process that checks reality and repairs the local story.
The important outcome states are not just success and failure. They include accepted, rejected, committed, not found, cancelled before start, and unknown after attempt. Unknown after attempt is the state that deserves the most respect, because it is the one where a blind retry can create new damage.
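One way to write the model down is as plain data types. This is a sketch under my own naming assumptions: the field names, the `next_step` helper, and the mapping from outcome to next step are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    COMMITTED = "committed"
    NOT_FOUND = "not_found"
    CANCELLED_BEFORE_START = "cancelled_before_start"
    UNKNOWN_AFTER_ATTEMPT = "unknown_after_attempt"


@dataclass
class Intent:
    operation_key: str  # stable identity, chosen before the attempt
    action: str


@dataclass
class Attempt:
    intent: Intent
    target: str
    parameters: dict
    outcome: Optional[Outcome] = None  # None until evidence is collected


def next_step(outcome):
    """Assumed policy: what the caller may do with each kind of evidence."""
    if outcome is Outcome.COMMITTED:
        return "done"
    if outcome is Outcome.ACCEPTED:
        return "poll"        # accepted is not yet committed
    if outcome is Outcome.UNKNOWN_AFTER_ATTEMPT:
        return "reconcile"   # never a blind retry
    # Rejected, not found, cancelled before start: the evidence says no
    # effect exists, so the caller can make a new decision.
    return "caller_decides"
```

The useful property is that `UNKNOWN_AFTER_ATTEMPT` has its own branch. It cannot fall through into the same handling as a clean rejection.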
Once you name these states, the interface changes. A timeout is not a failed command. It is a missing answer. A lost acknowledgement is not permission to assume no work occurred. A worker crash is not evidence that the external side effect did not happen.
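In code, that means the error-handling layer reports what is known instead of returning a generic error. The sketch below uses Python's built-in exception types; the `run_attempt` name and the state strings are my own, and the mapping is an assumption that fits a direct connection (a proxy or load balancer in the path can blur even the "refused" case).

```python
def run_attempt(call):
    """Run `call` and report what is actually known afterwards."""
    try:
        result = call()
    except ConnectionRefusedError:
        # Assumed: the connection was never established, so no request
        # reached the downstream system.
        return ("not_attempted", None)
    except (TimeoutError, ConnectionResetError) as exc:
        # The request may have been delivered and processed. This is a
        # missing answer, not a failed command.
        return ("unknown_after_attempt", repr(exc))
    # Assumed: a normal return means the downstream confirmed the effect.
    return ("committed", result)


def refused():
    raise ConnectionRefusedError("nothing listening")


def lost_reply():
    raise TimeoutError("no response within deadline")


def ok():
    return {"id": 7}


states = [run_attempt(f)[0] for f in (refused, lost_reply, ok)]
```

Here `states` separates the three cases that a catch-all `except Exception` would merge into one.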
This model also changes product behavior. If a user asks for an action and the system loses certainty, the honest state may be "pending verification" rather than "failed." That is less tidy, but it is closer to the truth.
Where this goes wrong
The first failure is collapsing unknown into failure. This usually happens through a helper library that catches timeout exceptions and returns a generic error. The caller logs the error and retries. The code looks reasonable in a unit test because the fake downstream service either returns success or throws before doing anything. Production systems do not fail that politely.
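A test double that fails less politely is cheap to write. The sketch below is hypothetical (the `CommitThenTimeout` class and `naive_charge` helper are illustrative names): the fake performs the effect first and only then loses the reply, which is the case the usual fake never exercises.

```python
class CommitThenTimeout:
    """Hypothetical test double: performs the effect, then loses the reply."""

    def __init__(self, replies_to_lose=1):
        self.charges = []
        self._replies_to_lose = replies_to_lose

    def charge(self, amount):
        self.charges.append(amount)  # the side effect happens first
        if self._replies_to_lose > 0:
            self._replies_to_lose -= 1
            raise TimeoutError("response lost after commit")
        return {"status": "ok"}


def naive_charge(client, amount, retries=2):
    """The pattern under test: treat a timeout as a failure and retry."""
    for _ in range(retries + 1):
        try:
            return client.charge(amount)
        except TimeoutError:
            continue
    raise RuntimeError("gave up")


fake = CommitThenTimeout()
result = naive_charge(fake, 25)
```

The helper reports success, and `fake.charges` now holds two charges of 25. A fake that throws before doing anything would never surface that.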
The second failure is recording intent too late. If the system calls the external service before writing a durable record, a crash can leave no local evidence that an attempt ever happened. Later, reconciliation has nothing stable to search for.
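A minimal sketch of the ordering, using SQLite from the Python standard library. The table layout, the state names, and the `perform` helper are assumptions for illustration, and an in-memory database is used only so the example runs anywhere; a real system would need storage that survives the crash.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in; not durable across restarts
db.execute(
    "CREATE TABLE attempts ("
    " operation_key TEXT PRIMARY KEY,"
    " action TEXT NOT NULL,"
    " state TEXT NOT NULL)"
)


def perform(key, action, call_downstream):
    # 1. Commit the record before any network call is made.
    with db:
        db.execute(
            "INSERT INTO attempts VALUES (?, ?, 'attempting')", (key, action)
        )
    # 2. Only now touch the outside world.
    try:
        call_downstream(key, action)
    except TimeoutError:
        state = "unknown_after_attempt"
    else:
        state = "committed"
    # 3. Record the evidence. A crash before this line leaves the row in
    #    'attempting', which reconciliation can still find by key.
    with db:
        db.execute(
            "UPDATE attempts SET state = ? WHERE operation_key = ?",
            (state, key),
        )


def state_of(key):
    row = db.execute(
        "SELECT state FROM attempts WHERE operation_key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def lost_reply(key, action):
    raise TimeoutError("no response")


perform("op-1", "send receipt", lost_reply)
```

After this runs, `state_of("op-1")` is `"unknown_after_attempt"`. If the process had died during the call instead, the row would still exist in `"attempting"`, which is the stable thing reconciliation needs.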
The third failure is storing only happy-path results. Teams often keep the successful downstream identifier but discard rejected, timed-out, or uncertain attempts. That makes the most important cases the least debuggable.
The fourth failure is giving ownership of ambiguity to no one. Operators see a pile of timed-out jobs. Product sees users stuck in a vague state. Engineers see logs but no state machine. Everyone agrees the system is "eventually consistent," but no process is responsible for making it consistent.
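The owning process does not have to be elaborate. Here is a sketch of a reconciliation pass, assuming the downstream can answer a lookup by operation key; the `reconcile` function, the `lookup` callable, and the state strings are hypothetical.

```python
def reconcile(attempts, lookup):
    """Resolve unknown attempts by asking the downstream what it has.

    attempts: dict of operation_key -> local state
    lookup:   callable(operation_key) returning True if the effect exists,
              False if it does not, None if the downstream could not answer
    """
    for key, state in list(attempts.items()):
        if state != "unknown_after_attempt":
            continue
        found = lookup(key)
        if found is True:
            attempts[key] = "committed"
        elif found is False:
            # Assumes the lookup is authoritative, i.e. the original
            # request cannot still be in flight.
            attempts[key] = "not_found"
        # None: still unknown; leave it for the next pass.
    return attempts


local = {
    "op-1": "unknown_after_attempt",
    "op-2": "unknown_after_attempt",
    "op-3": "unknown_after_attempt",
    "op-4": "committed",
}
remote = {"op-1": True, "op-2": False, "op-3": None}
reconcile(local, lambda key: remote.get(key))
```

After one pass, `op-1` is committed, `op-2` is known not to exist, and `op-3` stays unknown rather than being guessed at.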
There is a counterpoint. Some operations are cheap, reversible, and isolated. For those, it may be acceptable to treat the unknown as a failure or ask the user to retry manually. But that is a product and operations decision, not a hidden default inside transport error handling.
What I do now
I ask for the ambiguity table before I ask for the retry policy. For each side effect, what can be unknown? How is intent recorded before the attempt? What identity lets a later request ask about the same intent? Where is the evidence stored? Who owns the reconciliation path?
I prefer APIs that accept a caller-generated operation key and return the current state of that operation. I prefer workers that write an attempt record before calling out. I prefer dashboards that show "unknown after attempt" separately from "not attempted." I prefer runbooks that tell an operator how to verify external reality instead of telling them to rerun the job.
For agentic systems, the same rule applies. A model may decide to call a tool, but the durable system around it must remember that decision, the tool input, the attempt identity, and the observed outcome. The transcript is not enough. It is a narrative artifact, not an operations ledger.
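As a sketch of what that ledger could look like, the wrapper below records the decision before the tool runs and the observed outcome after. `ToolLedger`, `call_tool`, and the entry fields are hypothetical names, and the in-memory list stands in for durable storage.

```python
import json
import uuid


class ToolLedger:
    """Hypothetical append-only record of agent tool calls."""

    def __init__(self):
        self.entries = []  # stand-in for durable, append-only storage

    def record(self, kind, attempt_id, **data):
        entry = {"kind": kind, "attempt_id": attempt_id, **data}
        self.entries.append(json.dumps(entry))


def call_tool(ledger, tool, name, tool_input):
    attempt_id = str(uuid.uuid4())
    # Remember the decision and its input before anything runs.
    ledger.record("decision", attempt_id, tool=name, input=tool_input)
    try:
        output = tool(**tool_input)
    except TimeoutError:
        # The tool may already have changed the outside world.
        ledger.record("outcome", attempt_id, state="unknown_after_attempt")
        return attempt_id, None
    ledger.record("outcome", attempt_id, state="committed", output=output)
    return attempt_id, output


def send_message(to, body):
    raise TimeoutError("tool returned nothing")


ledger = ToolLedger()
attempt_id, output = call_tool(
    ledger, send_message, "send_message", {"to": "ops", "body": "deploy done"}
)
last_entry = json.loads(ledger.entries[-1])
```

The ledger ends with an `unknown_after_attempt` entry tied to the same `attempt_id` as the decision, so a later process can verify the effect without rereading the transcript.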
Closing takeaway
Do not design around failure first. Design around uncertainty first. If the system can say "we tried, and we do not yet know what happened" without losing its place, the rest of the reliability work has somewhere honest to stand.