Replay Is the Difference Between Debugging and Guessing

When an agent fails once and nobody can reproduce the run, the team usually debates the wrong thing.

Someone blames the prompt. Someone blames the model. Someone says the tool output was weird. Someone says the user asked for too much. All of those may be true, but without replay they are guesses with different costumes.

Replay is the point where agent debugging starts to look like engineering.

The thesis

An agent harness without replay is a demo recorder, not a debugging system. To understand agent failures, the harness must preserve the world, inputs, tool responses, permissions, time, and state transitions well enough to run the same case again.

The model may not make the same choice every time. That is fine. Replay does not mean freezing every token. It means freezing the world around the agent so that changes in behavior can be compared against the same facts.

If the facts keep moving, every investigation becomes folklore.

The production pattern

Imagine an agent that triages fake incident tickets. It can read alerts, query a service catalog, inspect recent deploys, and create an incident note. A bad run closes ticket_442 as duplicate even though it is a new database saturation issue.

The transcript shows something suspicious. The agent called listDeploys(service="checkout") and got an empty list. Later, a human reads the service catalog and sees a deploy that happened ten minutes before the alert. Was the tool stale? Did the agent query the wrong service? Did the deploy appear after the run? Did the harness seed the catalog incorrectly?

Without replay, the team argues.

With replay, the harness has the run bundle:

Initial ticket state.
Service catalog snapshot.
Fake clock value.
Tool request and response log.
Permission grants and denials.
Final state diff.
Agent-visible observations.
Hidden ground truth used by the grader.

Now the failure can be rerun against the same catalog. The team can change one thing at a time: a prompt instruction, a planner setting, a tool schema, or the stale response behavior. The old failure either remains, disappears, or changes shape.

That is debugging.

The model

I want replay to preserve five boundaries.

The first boundary is input. The user request, system instructions, available tools, tool descriptions, and initial context must be captured as they were. If the harness quietly updates tool descriptions between runs, the replay is no longer the same case.

The second boundary is world state. Files, fake database rows, queues, tickets, permissions, feature flags, and clocks should be captured or rebuilt from a versioned seed. It is not enough to save the prompt. The agent acted in a world.

The third boundary is observation. The agent only knows what tools returned. If a fake API returned stale data, that stale response is part of the run. A replay that replaces it with fresh state has erased the bug.

The fourth boundary is side effects. Every write attempt should be recorded, including denied writes and partial writes. If updateTicketStatus returned permission_denied, that denial matters. If writeFile saved the first chunk and failed, the resulting file matters.

The fifth boundary is grading. The expected state diff and policy checks should travel with the replay case. Otherwise, a future team may rerun the case and decide it "looks fine" because they no longer remember what correctness meant.

A replay bundle can be simple:

case: incident-triage-stale-deploy
clock: 2026-04-28T10:15:00Z
seed: incident-world-v4
tool_overrides:
  listDeploys.checkout:
    response_age: 12m
    items: []
permissions:
  ticket.close: denied
  incident_note.create: allowed
expected:
  ticket_442.status: open
  incident_note.created: true
  duplicate_closure_attempted: false

The value is not the format. The value is that the failed world can be summoned again.

Where this goes wrong

Replay goes wrong when teams only save transcripts. A transcript is useful, but it is an observation log, not a world. It cannot tell you whether the database row existed, whether the tool response was stale by design, or whether a write partially succeeded after the last visible message.

It also goes wrong when replay is too coupled to one model. If the case only works by forcing the exact same model output, it becomes brittle. The better replay asks whether the current system handles the same world correctly. The agent may take a different route. The final state contract should still hold.

Another failure is hiding nondeterminism under the rug. Some tools are intentionally flaky in a harness. A search API may return results in a different order. A fake queue may delay visibility. That is acceptable only if the replay records the policy for that nondeterminism. Otherwise, failure becomes impossible to classify.

There is a counterpoint: not every exploratory run deserves full replay machinery. During early prototyping, a lightweight transcript may be enough to learn what task shape is even possible. But the moment a workflow can write to meaningful state, replay stops being optional. You cannot operate a side-effecting system by memory.

What I do now

I give every serious harness case a replay ID. When a run fails, the ID should be enough to reconstruct the world, rerun the agent, and inspect the state diff. If a tool response came from a fake service, the fake service records the response. If it came from a seeded fixture, the fixture version is recorded.

I separate replay from grading. Replay rebuilds the world. Grading decides whether the final state is acceptable. This separation lets me replay an old bad run against a new behavior contract when the product changes, without pretending the old grade was timeless.

I also keep the bad cases close to the harness. A failure discovered during manual review should become a replay fixture quickly. Otherwise, it becomes a story people remember differently.

The most useful replay cases are not exotic. Stale tool output. A denied permission. A write accepted by one tool and invisible to another until later. A generated file touched outside scope. A task stopped after creating a draft but before linking it back to the ticket. These are the ordinary shapes that expose whether the agent system has real control.

Closing takeaway

If you cannot replay the failed world, you are not debugging the agent. You are negotiating with a memory of it.