Sagas Without Fantasy Rollbacks
Distributed workflows tempt engineers into comforting diagrams. Step one creates something. Step two updates something. Step three notifies someone. If step three fails, arrows point backward and boxes are labeled rollback.
The diagram is tidy. Production is not. Many real effects cannot be undone. They can only be compensated, explained, hidden, expired, credited, revoked, or repaired by a later action.
The thesis
Sagas are not distributed transactions with nicer branding. They are forward recovery plans for workflows that cross ownership boundaries.
The word rollback is dangerous when it suggests the system can restore the world to the state before the workflow started. In many product systems, that world no longer exists. A user saw a message. An external system created an object. A permission was active for a period of time. A downstream workflow consumed an event.
The serious question is not "how do we undo this?" It is "what valid state can we move to now, and who needs to know?"
The production pattern
A workflow spans several systems. It might reserve capacity, charge or credit an account, create a resource, update permissions, send a message, and mark a local record complete. The early steps succeed. A later step fails or becomes ambiguous.
The team tries to roll back. The reservation can be released, but the message was already sent. The external resource can be disabled, but its identifier still exists. The permission can be revoked, but it was active for a while. The local database can be changed, but another consumer already read the event.
Now the system needs recovery, not denial. Pretending the workflow did not happen creates worse inconsistencies.
The model
I design sagas around these ideas:
- Intent: the durable reason the workflow exists.
- Step record: each action attempted, with its operation identity and outcome.
- Commit boundary: the point after which the product cannot pretend nothing happened.
- Compensation: a forward action that creates a valid follow-up state.
- Visibility: the state shown to the user, operator, or downstream system that explains what happened and what remains to be done.
The commit boundary is the key review point. Some steps are preparatory. Some steps publish reality. Once reality is published, compensation must be treated as a new action, not an eraser.
Compensation should be specific. "Rollback account" is vague. "Issue credit," "revoke access," "archive resource," "send correction," "mark provisioning failed after external create," or "open manual review" are concrete product operations.
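The model above can be sketched as a small data structure. This is a minimal illustration, not a reference design: the names Saga, StepRecord, Outcome, and the publishes flag are hypothetical, and treating an ambiguous publishing step as already published is an assumption made here for safety.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    AMBIGUOUS = "ambiguous"  # e.g. a timeout: the effect may or may not exist


@dataclass(frozen=True)
class StepRecord:
    name: str          # e.g. "create_external_resource"
    operation_id: str  # stable identity, reused on every retry
    outcome: Outcome
    publishes: bool    # True if this step crosses the commit boundary


@dataclass
class Saga:
    intent: str  # the durable reason the workflow exists
    steps: List[StepRecord] = field(default_factory=list)

    def past_commit_boundary(self) -> bool:
        # Assumption: an ambiguous publishing step counts as published,
        # because we cannot prove the effect did not happen.
        return any(
            s.publishes and s.outcome is not Outcome.FAILED for s in self.steps
        )


saga = Saga(intent="provision workspace for a new customer")
saga.steps.append(
    StepRecord("reserve_capacity", "op-1", Outcome.SUCCEEDED, publishes=False)
)
before = saga.past_commit_boundary()
saga.steps.append(
    StepRecord("send_welcome_message", "op-2", Outcome.AMBIGUOUS, publishes=True)
)
after = saga.past_commit_boundary()
```

Once past_commit_boundary() returns True, an abort is no longer an honest option, and the workflow needs one of the named compensations instead.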
Where this goes wrong
The first mistake is designing compensations as mirror images of steps. Create does not always map cleanly to delete. Send does not map to unsend. Grant does not fully map to revoke if time passed and actions were taken during the grant.
The second mistake is not testing compensation paths. Happy-path workflow tests are common. Tests where step four fails after step three commits are less common, which is why those paths tend to run for the first time during an incident.
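A test for that case can be quite small. The sketch below uses a hypothetical run_saga helper and made-up step names. It injects a failure at step four and checks that each earlier step receives its named compensation. Running compensations newest first is an assumption of this sketch, and a failing compensation is not handled here.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[], None]


def run_saga(
    steps: List[Tuple[str, Action]],
    compensations: Dict[str, Tuple[str, Action]],
) -> List[str]:
    """Run steps in order. On failure, apply the named compensation for each
    completed step, newest first, and return a log of what happened."""
    log: List[str] = []
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done in reversed(completed):
                if done in compensations:
                    comp_name, comp = compensations[done]
                    comp()
                    log.append(f"compensated:{done}:{comp_name}")
                else:
                    # No automatic repair is defined for this step.
                    log.append(f"manual_review:{done}")
            return log
        completed.append(name)
        log.append(f"ok:{name}")
    return log


def fail() -> None:
    raise RuntimeError("injected failure")


def noop() -> None:
    pass


def demo() -> List[str]:
    steps = [
        ("reserve_capacity", noop),
        ("create_resource", noop),
        ("send_message", noop),   # step three publishes reality
        ("mark_complete", fail),  # step four fails afterwards
    ]
    compensations = {
        "reserve_capacity": ("release_reservation", noop),
        "create_resource": ("archive_resource", noop),
        "send_message": ("send_correction", noop),
    }
    return run_saga(steps, compensations)
```

The assertion worth writing is on the log: the message step is followed by a correction, not by an attempt to unsend.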
The third mistake is hiding partial states from users and operators. A saga may need states like reserved, externally created, notification failed, compensation pending, compensated, or manual review required. Collapsing all of that into failed makes repair harder.
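One way to keep those states visible is to model them explicitly. The sketch below is illustrative only: the enum values mirror the states named above, while OPERATOR_HINT and its wording are hypothetical placeholders for whatever language the product actually uses.

```python
from enum import Enum


class SagaState(Enum):
    RESERVED = "reserved"
    EXTERNALLY_CREATED = "externally_created"
    NOTIFICATION_FAILED = "notification_failed"
    COMPENSATION_PENDING = "compensation_pending"
    COMPENSATED = "compensated"
    MANUAL_REVIEW_REQUIRED = "manual_review_required"


# Hypothetical operator guidance; real wording belongs to the product.
OPERATOR_HINT = {
    SagaState.RESERVED: "capacity is held; release it if the workflow stops",
    SagaState.EXTERNALLY_CREATED: "external object exists; archive or adopt it",
    SagaState.NOTIFICATION_FAILED: "user was not told; retry or send correction",
    SagaState.COMPENSATION_PENDING: "repair is queued; do not start a second one",
    SagaState.COMPENSATED: "follow-up state reached; nothing further to do",
    SagaState.MANUAL_REVIEW_REQUIRED: "a person must decide the next action",
}


def collapsed_status(state: SagaState) -> str:
    """The lossy view: every partial state is reported as 'failed'."""
    return "failed"


distinct_collapsed = {collapsed_status(s) for s in SagaState}
distinct_hints = {OPERATOR_HINT[s] for s in SagaState}
```

Six states with six different next actions become a single string in the collapsed view, which is the information an operator needs most during repair.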
The fourth mistake is nesting sagas until no one owns the outcome. If each service runs its own workflow and emits generic failure events, the end-to-end owner can lose the thread. Someone needs to own the user-visible intent.
There is a counterpoint. If all changes live inside one database and the product contract is local to that database, use a transaction. Do not build a saga because the term sounds architectural. Sagas are for crossing boundaries where one atomic commit is not available or not appropriate.
What I do now
I ask where the workflow crosses from private preparation into published reality. Before that point, aborting may be possible. After that point, the design needs named compensations and product states.
I write compensation as normal product behavior, not emergency code. If a resource may need to be disabled after partial provisioning, that disable path needs ownership, audit, permissions, and user-facing language. It is not a cleanup script hiding behind the curtain.
I make every step idempotent and every compensation idempotent. Recovery often runs after crashes, timeouts, and operator retries. A compensation that creates duplicate corrections is just another side effect that needs its own repair.
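A common way to get this property is to key each compensation on a stable operation identity. The sketch below assumes a hypothetical CreditLedger. The in-memory dictionary stands in for a durable store with a uniqueness constraint, which a real system would need.

```python
from typing import Dict


class CreditLedger:
    """Hypothetical ledger: at most one credit per operation identity."""

    def __init__(self) -> None:
        # Stand-in for a durable table keyed uniquely by operation_id.
        self._applied: Dict[str, int] = {}

    def issue_credit(self, operation_id: str, amount_cents: int) -> bool:
        """Return True if the credit was newly applied, False on a replay."""
        if operation_id in self._applied:
            return False
        self._applied[operation_id] = amount_cents
        return True

    def total_credited(self) -> int:
        return sum(self._applied.values())


ledger = CreditLedger()
first = ledger.issue_credit("saga-42:credit", 500)
# A crash-recovery pass or an operator retry replays the same compensation.
second = ledger.issue_credit("saga-42:credit", 500)
total = ledger.total_credited()
```

The replay is acknowledged without issuing a second credit, so recovery code can call the compensation again whenever it is unsure whether the first call landed.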
I also separate automatic compensation from manual decision. Some partial states are safe to repair mechanically. Others require human judgment because money, access, customer communication, or compliance context is involved. The workflow should stop honestly instead of inventing certainty.
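That split can be written down as an explicit policy instead of being left to whoever is on call. The sketch below is one possible rule, not a recommendation: PartialState, the touches tags, and the choice to route every sensitive or ambiguous case to a person are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    AUTO_COMPENSATE = "auto_compensate"
    MANUAL_REVIEW = "manual_review"


# Contexts the text names as needing human judgment.
SENSITIVE = frozenset({"money", "access", "customer_communication", "compliance"})


@dataclass(frozen=True)
class PartialState:
    step: str
    touches: FrozenSet[str]  # hypothetical tags describing what the step affected
    outcome_known: bool      # False when the step timed out or is ambiguous


def decide(state: PartialState) -> Decision:
    # Assumption: unknown outcomes stop for review instead of guessing.
    if not state.outcome_known:
        return Decision.MANUAL_REVIEW
    if state.touches & SENSITIVE:
        return Decision.MANUAL_REVIEW
    return Decision.AUTO_COMPENSATE


auto = decide(PartialState("reserve_capacity", frozenset(), True))
money = decide(PartialState("charge_account", frozenset({"money"}), True))
unknown = decide(PartialState("create_resource", frozenset(), False))
```

A real policy would likely be finer grained, for example allowing small automatic credits, but keeping the rule in code makes the stopping points reviewable.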
For agents, this is especially important. If an agent performs a sequence of tool actions, the runtime cannot assume a failed later step means earlier steps disappeared. Agent recovery needs the same step records, commit boundaries, and compensation rules as any other distributed workflow.
Closing takeaway
Do not ask a saga to rewind time. Ask it to move a partially changed world into the next valid state.