Agentic Systems

Replay Agents

Replay agents reconstruct failed runs from durable traces so engineers can debug causes instead of arguing from summaries.

A replay agent is not a retry button with a nicer explanation.

Retrying asks whether the system can get a better result this time. Replay asks what happened last time. Those are different jobs. In agentic systems, the distinction matters because the planner is unreliable, the tools have real side effects, and the most important failures often happen in the gap between what the model believed and what the system state allowed.

Replay agents should reconstruct failure, not repair it.

The thesis

A replay agent turns an ambiguous run into an ordered account of cause and effect.

It should answer: what did the worker know, which instructions had authority, which capabilities were granted, what state was observed, what tool calls were made, what changed, and where did the run become unrecoverable or unsafe?

It should not decide the fix. It should not silently rerun the task against today's state. It should not patch production because it found a likely cause.

The production pattern

The pattern usually starts with a bad run that is hard to explain.

A worker touched the wrong file. A harness stopped after reporting success. A tool returned success but no artifact. A reviewer approved a plan that later expanded. A queue item was processed twice. The final message is confident, but the resulting state is wrong.

Without replay, the team argues from fragments: the user prompt, the final answer, a few logs, and whatever the current filesystem happens to show. That is guessing with a better vocabulary.

A replay agent consumes the durable record and reconstructs the run. For example, a worker deletes a generated file it thought was stale. The replay agent rebuilds the sequence: the directory listing came from a cached tool response, the lease on the task expired before the delete, the write tool accepted the request anyway, and the final summary omitted the stale read. The useful finding is not "the model made a mistake." The useful finding is "the harness allowed a destructive write after stale observation and expired ownership."
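The finding in that example can be phrased as a check over recorded events. The following is a minimal sketch, assuming a hypothetical `TraceEvent` shape; the field names (`from_cache`, `lease_expires_at`, and so on) are illustrative and do not come from any particular harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    # Hypothetical event shape; every field name here is an assumption.
    seq: int                                  # position in the recorded run
    kind: str                                 # "observe" or "write"
    path: str                                 # resource the event touched
    at: float                                 # timestamp recorded at the tool boundary
    from_cache: bool = False                  # observation came from a cached response
    lease_expires_at: Optional[float] = None  # ownership deadline recorded with the event


def find_unsafe_writes(events: list[TraceEvent]) -> list[str]:
    """Report writes that followed a cached observation or an expired lease."""
    findings = []
    last_observation: dict[str, TraceEvent] = {}
    for event in sorted(events, key=lambda e: e.seq):
        if event.kind == "observe":
            last_observation[event.path] = event
            continue
        if event.kind != "write":
            continue
        seen = last_observation.get(event.path)
        if seen is None:
            findings.append(f"seq {event.seq}: write to {event.path} with no recorded observation")
        elif seen.from_cache:
            findings.append(
                f"seq {event.seq}: write to {event.path} after cached observation at seq {seen.seq}"
            )
        if event.lease_expires_at is not None and event.at > event.lease_expires_at:
            findings.append(f"seq {event.seq}: write to {event.path} after lease expired")
    return findings
```

The function only reads the trace. It names the conditions the harness permitted and leaves the choice of fix to a reviewer.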

That points to an engineering fix.

The model

Replay needs a strict boundary.

Inputs: run identifiers, immutable traces, tool inputs and outputs, state snapshots or checksums, approval records, policy snapshots, model configuration, and final artifacts. For privacy-sensitive systems, the replay agent may receive redacted values, but the redaction must preserve enough structure to explain the run.

Permissions: read-only access to historical records and a sandbox for reconstructing state. It may generate a replay report, a minimal reproduction, or a candidate regression fixture. It may not call production write tools or continue the original task.
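One way to enforce that permission boundary is to route every replay tool call through a gateway with an explicit allowlist. This is a sketch under assumptions: `ReplayToolGateway` and the tool names (`read_trace`, `read_snapshot`, `read_approval`) are hypothetical, and a real harness would supply its own registry.

```python
class ReplayPermissionError(Exception):
    """Raised when a replay session attempts a tool call outside the allowlist."""


class ReplayToolGateway:
    # Hypothetical read-only tool names, used here for illustration only.
    READ_ONLY_TOOLS = frozenset({"read_trace", "read_snapshot", "read_approval"})

    def __init__(self, tools):
        self._tools = dict(tools)
        self.denied = []  # refused calls, kept so they can appear in the replay report

    def call(self, name, **kwargs):
        # Deny by default: anything not explicitly read-only is refused.
        if name not in self.READ_ONLY_TOOLS:
            self.denied.append(name)
            raise ReplayPermissionError(f"{name} is not permitted during replay")
        return self._tools[name](**kwargs)
```

Denying by default means a newly added write tool is blocked in replay until someone deliberately classifies it as read-only.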

Outputs: an ordered timeline, causal hypotheses with evidence, missing evidence if the trace is incomplete, and recommended follow-up categories. A good replay report separates observed facts from inferred causes.
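The separation between observed facts and inferred causes can be made structural rather than stylistic. Below is a rough sketch, assuming hypothetical types (`Observation`, `Hypothesis`, `ReplayReport`) in which every hypothesis must cite trace references that appear in the timeline.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    trace_ref: str   # identifier of the trace record backing this fact
    statement: str


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    evidence: tuple[str, ...]  # trace_refs of the observations that support it


@dataclass
class ReplayReport:
    # All field names are illustrative assumptions, not a fixed schema.
    run_id: str
    timeline: list[Observation] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)
    missing_evidence: list[str] = field(default_factory=list)
    follow_up_categories: list[str] = field(default_factory=list)

    def unsupported_hypotheses(self) -> list[Hypothesis]:
        """Return hypotheses that cite no evidence or cite refs absent from the timeline."""
        known = {obs.trace_ref for obs in self.timeline}
        return [
            h for h in self.hypotheses
            if not h.evidence or not set(h.evidence) <= known
        ]
```

A reviewer can then reject any report where `unsupported_hypotheses()` is non-empty, or ask for those claims to be moved to `missing_evidence`.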

Failure modes: replaying against current state instead of historical state, filling trace gaps with confident guesses, hiding nondeterminism, leaking sensitive context into bug reports, or turning replay into an automatic repair path.

Review path: an engineer or owner reviews the replay report, decides the fix category, and chooses whether to create a regression case, change a policy, alter a tool contract, or repair data. The replay agent can prepare those options. It does not choose among them.

Where this goes wrong

The first failure is calling a rerun a replay. If the same task is executed again with today's files, today's approvals, and today's model, the result may be useful, but it does not explain the original failure. It can even erase the evidence by producing a clean run that makes the bad run look accidental.

The second failure is trace poverty. A replay agent cannot reconstruct a missing tool input, an unrecorded approval, or a state snapshot that was never captured. It may still be able to say "the trace is insufficient," which is valuable, but it should not pretend to know.
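Saying "the trace is insufficient" is easier when the gaps are computed instead of noticed. A minimal sketch follows, assuming each tool-call record is a dictionary and that the field names in `REQUIRED_FIELDS` are the ones a given harness cares about; both are assumptions made for illustration.

```python
# Hypothetical field names; adjust to whatever the harness actually records.
REQUIRED_FIELDS = ("tool_input", "tool_output", "state_checksum", "approval_id")


def find_trace_gaps(records):
    """Return (call_id, missing_fields) for every record with unrecorded evidence.

    Assumption: a value of None means "not recorded", so a tool that can
    legitimately return None would need a distinct sentinel.
    """
    gaps = []
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if record.get(name) is None]
        if missing:
            gaps.append((record.get("call_id", "<unknown>"), missing))
    return gaps
```

The output belongs in the report's missing-evidence section, next to the timeline, so that an incomplete trace is stated rather than papered over.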

The third failure is allowing replay to mutate state. A replay environment should be a controlled world. If a replay agent can write to the system it is investigating, the investigation becomes part of the incident.

The fourth failure is overexplaining the model. The important question is usually not what hidden thought produced the answer. It is which visible decision had insufficient evidence or authority.

What I do now

I design replay before I need it.

That starts with durable identifiers. Every run, approval, tool call, state snapshot, and output artifact gets an identifier that can be linked later. Human-readable logs help, but replay depends on references that survive summarization.

I record tool inputs and outputs at the boundary. If a tool lists files, the replay trace should include the requested path, the observed entries, and enough metadata to know whether the observation was fresh. If a write succeeds, the trace should include what changed. If a tool returns ambiguous success, that ambiguity should be preserved.
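Recording at the boundary can be as simple as wrapping each tool invocation. This sketch assumes the trace is an append-only list of dictionaries; `record_tool_call` and its record fields are hypothetical names, and a production system would write to durable storage instead of memory.

```python
import time
import uuid


def record_tool_call(trace, run_id, tool_name, tool_fn, **tool_input):
    """Call a tool and append its input, outcome, and timing to the trace.

    Assumption: tool arguments do not reuse the names of this function's
    own parameters (trace, run_id, tool_name, tool_fn).
    """
    record = {
        "call_id": str(uuid.uuid4()),   # durable identifier for later linking
        "run_id": run_id,
        "tool": tool_name,
        "input": dict(tool_input),
        "started_at": time.time(),
    }
    try:
        record["output"] = tool_fn(**tool_input)
        record["status"] = "returned"
    except Exception as exc:
        # Preserve the failure instead of dropping the record.
        record["output"] = None
        record["status"] = "raised"
        record["error"] = repr(exc)
        raise
    finally:
        record["finished_at"] = time.time()
        trace.append(record)
    return record["output"]
```

The record is appended whether the tool returns or raises, so a failed call still leaves evidence. The status is "returned" rather than "succeeded" on purpose: whether a returned value means success is something the replay has to establish.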

I also record stop conditions and ownership state. Many serious harness failures are not wrong answers. They are actions taken after the right to act expired.

When a bad run happens, the replay agent produces a timeline before anyone debates remedies. The timeline does not need to be long. It needs to be anchored:

  • approved intent
  • relevant state observed
  • capability granted
  • action taken
  • state changed
  • mismatch detected
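
The six anchors above can double as a completeness check on a draft timeline. A small sketch, assuming each timeline entry is a dictionary with an `anchor` key; the anchor identifiers are my own snake_case renderings of the list items.

```python
# Anchor identifiers are illustrative renderings of the list above.
ANCHORS = (
    "approved_intent",
    "state_observed",
    "capability_granted",
    "action_taken",
    "state_changed",
    "mismatch_detected",
)


def missing_anchors(timeline):
    """Return the anchors that have no entry in the timeline, in list order."""
    present = {entry["anchor"] for entry in timeline}
    return [anchor for anchor in ANCHORS if anchor not in present]
```

A timeline with missing anchors is still worth publishing, provided the gaps are named.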

From there, the follow-up is easier. Some failures become adversarial fixtures. Some become auditor checks. Some become tool contract changes. Some reveal that the harness never captured the evidence required to debug itself.

Closing takeaway

Replay is the difference between "the agent did something weird" and "the harness permitted this action after these facts." A replay agent should make that sentence possible.