Replay Changes Everything
The hardest incident reviews are the ones where the team can describe symptoms but cannot recreate the path. Logs show fragments. Metrics show timing. A dashboard shows a final state that should not exist. Everyone has a theory, and every theory requires imagining what the system must have done.
Replay changes the conversation. It gives the team a way to ask the system what it would do again under controlled conditions.
The thesis
If you cannot replay the important path, you cannot really debug it. You can only inspect its residue.
Replay does not need to mean perfect determinism. Most production systems cannot replay the exact universe. The useful goal is narrower: preserve enough inputs, decisions, configuration, ownership, and side-effect records to reconstruct the path that mattered and test a fix against it.
That is a different design goal from "we have logs." Logs are for observation. Replay is for controlled re-execution.
The production pattern
A workflow produces a bad result. Maybe it sent a duplicate notification, skipped a permission update, chose the wrong fallback, retried an unsafe tool call, or marked a job complete after a stale owner resumed.
The team pulls logs. Some context is missing because log sampling changed. The payload was redacted in a way that removed a key field. The worker version is gone. The configuration flag has since changed. The external API now returns different data. The model prompt or tool schema has been edited. A developer tries to reproduce the issue in a local environment and gets a different result.
Now the incident review is partly a memory exercise. That is not where serious reliability work should live.
The model
I think replay needs four boundaries:
- Input capture: the command, event, message, or task as accepted.
- Decision capture: the versioned code, configuration, policy, prompt, and ownership context that shaped the path.
- Effect capture: the side effects attempted and the durable identities attached to them.
- Replay mode: a controlled execution path that can read captured facts without performing real external mutations.
The effect boundary matters most. A replay that sends real emails, changes real access, or calls production tools is not replay. It is a second incident with better intentions.
Good replay also distinguishes decisions from observations. If the original run observed a downstream state, the replay may need either the captured observation or an explicit choice to query current reality. Both can be useful, but they answer different questions.
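One way to keep that choice explicit is to make the caller name the source of every observation. The sketch below assumes a hypothetical `observe` helper and `ObservationSource` enum; neither comes from an existing library.

```python
from enum import Enum
from typing import Any, Callable, Optional


class ObservationSource(Enum):
    CAPTURED = "captured"  # answers: what did the original run see?
    CURRENT = "current"    # answers: what would a run see today?


def observe(
    key: str,
    captured: dict[str, Any],
    source: ObservationSource,
    query_current: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Return an observation from the explicitly chosen source, never a silent mix."""
    if source is ObservationSource.CAPTURED:
        if key not in captured:
            # Refuse to fall back to live data: that would change the question.
            raise LookupError(f"no captured observation for {key!r}")
        return captured[key]
    if query_current is None:
        raise ValueError("a current-reality query was requested but none was provided")
    return query_current(key)
```

Refusing to fall back from captured to current data is deliberate: a replay that quietly mixes the two can no longer say which question it answered.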
Where this goes wrong
The first mistake is relying on free-form logs as the replay source. Logs are often incomplete, reformatted, sampled, or written for humans. Replay needs structured records with stable meaning.
The second mistake is ignoring versioned context. A workflow run depends on code version, feature flags, policy rules, schemas, prompts, tool definitions, and sometimes model choices. If those are not captured or recoverable, replay may prove only what today's system would do.
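A small sketch of making that drift visible, assuming the versioned context can be flattened into a JSON-serializable dictionary. The helpers `context_fingerprint` and `context_drift` are hypothetical names; the hashing uses only the standard library.

```python
import hashlib
import json
from typing import Any


def context_fingerprint(context: dict[str, Any]) -> str:
    """Stable hash of the context, suitable for storing alongside each run."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def context_drift(captured: dict[str, Any], current: dict[str, Any]) -> dict[str, tuple]:
    """Map each changed key to its (captured, current) pair of values."""
    keys = sorted(set(captured) | set(current))
    return {
        key: (captured.get(key), current.get(key))
        for key in keys
        if captured.get(key) != current.get(key)
    }
```

If the fingerprints differ, the drift report names what changed, so a replay under today's context is labeled as such instead of being mistaken for the original run.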
The third mistake is designing replay after the first serious incident. By then, the missing fields are already missing. You can improve future replay, but the incident that taught the lesson remains partly opaque.
The fourth mistake is confusing requeue with replay. Requeueing asks the live system to process work again. Replay asks what happened, or what would happen under controlled changes. Those are different operations with different risk.
There is a counterpoint. Replay has cost. Capturing payloads can create privacy, retention, and storage concerns. Some data should not be stored, and some effects cannot be simulated perfectly. The answer is not to capture everything forever. It is to identify the decisions that create risk and preserve the minimum evidence needed to inspect them responsibly.
What I do now
I design effectful workflows with replay hooks from the beginning. The system should be able to load an operation by id, inspect accepted input, list attempts, show decisions, and run the decision path in a mode that refuses real side effects.
I keep side effects behind adapters that can record, simulate, or assert. In normal execution, the adapter calls the real system. In replay, it can return captured observations or stop with a clear "external observation required" result. That is less glamorous than clever tracing, but it is much easier to trust.
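A minimal sketch of such an adapter, assuming effects are keyed by a durable `effect_id`. The `EffectAdapter` class, its `mode` strings, and the `ExternalObservationRequired` exception are hypothetical names for illustration, not an existing library's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ExternalObservationRequired(Exception):
    """Raised in replay when no captured result exists for an effect."""


@dataclass
class EffectAdapter:
    mode: str  # "live" or "replay"
    send: Callable[[str, dict[str, Any]], Any]  # the real call, used only in live mode
    captured: dict[str, Any] = field(default_factory=dict)

    def call(self, effect_id: str, request: dict[str, Any]) -> Any:
        if self.mode == "live":
            result = self.send(effect_id, request)
            self.captured[effect_id] = result  # record for later replay
            return result
        # Replay: never touch the real system.
        if effect_id in self.captured:
            return self.captured[effect_id]
        raise ExternalObservationRequired(effect_id)
```

In replay mode the `send` callable is never invoked, so passing one that raises is a cheap way to check that a replay path cannot mutate anything real.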
I capture enough policy context to explain why a branch was taken. That includes ownership version, deadline, selected fallback, refusal reason, and relevant configuration. For agent systems, it includes tool definitions, selected context, planner output, and the runtime's approval decision.
I also turn bad runs into future tests. Once an incident path can be replayed, it can become a regression case that asserts on the decision taken rather than on the exact text produced. The goal is not to freeze every output. The goal is to prevent the same unsafe decision pattern from returning.
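As a sketch, take the stale-owner case: a job marked complete after a stale owner resumed. The captured run below is invented for illustration, and `decide_completion`, `Decision`, and the field names are hypothetical stand-ins for a real decision path.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Decision:
    action: str  # "complete" or "refuse"
    reason: str


def decide_completion(captured: dict[str, Any]) -> Decision:
    """Hypothetical decision path: only the current owner may complete a job."""
    if captured["worker_owner_version"] < captured["current_owner_version"]:
        return Decision(action="refuse", reason="stale_owner")
    return Decision(action="complete", reason="owner_current")


# Illustrative captured facts from a bad run, loaded here as a literal.
INCIDENT_RUN = {
    "operation_id": "op-hypothetical-1",
    "worker_owner_version": 3,
    "current_owner_version": 4,
}


def test_stale_owner_cannot_complete() -> None:
    decision = decide_completion(INCIDENT_RUN)
    # Assert on the decision pattern, not on log lines or message wording.
    assert decision.action == "refuse"
    assert decision.reason == "stale_owner"
```

The test stays valid if the wording of notifications or logs changes, and fails only if the unsafe decision comes back.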
Closing takeaway
Replay turns production memory into engineering material. Without it, debugging depends too much on what the system happened to say while it was failing.