Agentic Systems

Agent State Should Live Outside the Model

A practical argument for storing task state, approvals, actions, and verification outside the conversation window.

The conversation window is a poor database. It is lossy, hard to query, easy to summarize incorrectly, and not built for concurrency. Yet many agent systems treat the model's current context as if it were durable state.

The thesis

The model can reason over state, but it should not own state. Production agent state belongs in explicit records outside the model.

This is not a philosophical preference. It is an operational requirement. If an agent can write files, change tickets, send messages, or trigger workflows, the system needs to know what task is active, which resources are in scope, what approvals were granted, what actions were proposed, what tools ran, what changed, and what remains unresolved. A transcript can help explain those things. It should not be the only place they exist.

The production pattern

An agent is working on a pull request. It reads review comments, edits two files, runs tests, gets a failure, revises one file, asks the user for approval to push, and then loses context because the session is restarted. The next run receives a summary: "Addressed feedback, tests mostly pass, user approved push."

That summary is dangerous. Which comments were addressed? Which files changed? Which test failed? What exactly did the user approve? Did the approval apply before or after the second edit? Was the branch still at the same head? Did the agent already push once before the interruption?

If the state lives only in the model's conversation, the next run must infer answers from incomplete language. If the state lives outside the model, the next run can load the task record, action log, approval grant, file versions, test results, and unresolved verification failures.

The difference is not cosmetic. It is the difference between resuming work and guessing.

The model

I separate agent state into five records: task, world observations, proposals, execution ledger, and verification ledger.

The task record contains the objective, requester, scope, constraints, deadline, and current status. It answers, "What are we trying to do, and where are the boundaries?" For example: address review feedback on PR 123, modify only files under docs/, do not push without approval, stop if tests fail outside the edited area.
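A task record like the one described above might be sketched as a small dataclass. This is only an illustration: the field names, the `TaskStatus` values, the glob-based scope check, and the `task-1` and `alice` values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from fnmatch import fnmatch
from typing import Optional


class TaskStatus(Enum):
    ACTIVE = "active"
    BLOCKED = "blocked"
    DONE = "done"


@dataclass
class TaskRecord:
    task_id: str
    objective: str
    requester: str
    allowed_paths: list          # glob patterns that define the write scope
    constraints: list            # rules the surrounding system is expected to enforce
    deadline: Optional[datetime] = None
    status: TaskStatus = TaskStatus.ACTIVE

    def in_scope(self, path: str) -> bool:
        """Return True if the path matches at least one allowed pattern.

        This sketch does not normalize paths; a real check should resolve
        ".." segments before matching.
        """
        return any(fnmatch(path, pattern) for pattern in self.allowed_paths)


task = TaskRecord(
    task_id="task-1",            # hypothetical id
    objective="Address review feedback on PR 123",
    requester="alice",           # hypothetical requester
    allowed_paths=["docs/*"],
    constraints=[
        "do not push without approval",
        "stop if tests fail outside the edited area",
    ],
)
```

With a record like this, a write tool can call `task.in_scope(path)` before touching a file instead of trusting the model to remember the boundary.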

World observations capture what the agent has read. They include source, resource version, capture time, and summary. A file read, issue comment, deployment status, policy decision, and test result are different observations. They should not be flattened into one memory sentence.

Proposals capture intended writes before they happen. A proposal says what the agent wants to do, why, and which observations support it. If the proposal changes, it gets a new identity. That makes approval drift visible.
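One way to give a changed proposal a new identity is to derive the id from the proposal's content. The sketch below assumes that approach; the `Proposal` fields and the truncated SHA-256 id are illustrative choices, and the branch name and observation ids are made up.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Proposal:
    task_id: str
    action: str               # e.g. "push_branch"
    target: str               # resource the write would touch
    payload: dict             # what would be written
    rationale: str
    observation_ids: tuple    # observations that support the proposal

    @property
    def proposal_id(self) -> str:
        # Identity is a hash of the full content, so any edit yields a new id.
        body = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]


original = Proposal(
    task_id="task-1",
    action="push_branch",
    target="review-fixes",                    # hypothetical branch name
    payload={"head_sha": "aaa111"},
    rationale="Review comments addressed and tests pass",
    observation_ids=("obs-3", "obs-4"),
)

# A second edit moves the branch head, which makes this a different proposal.
revised = replace(original, payload={"head_sha": "bbb222"})
```

An approval that names `original.proposal_id` does not match `revised.proposal_id`, which is what makes the drift visible to the system rather than leaving it to the model's recollection.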

The execution ledger records tool calls and effects. It stores inputs, outputs, ids, timestamps, errors, and ambiguous outcomes. If a ticket creation timed out, the ledger should not say "failed" unless reconciliation proved no ticket exists. It should say the outcome is unknown.
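The distinction between "failed" and "unknown" can be sketched as follows. The `ToolRejected` exception is a hypothetical convention in which a tool signals that it definitely changed nothing; everything else that goes wrong is treated as ambiguous. An in-memory list stands in for a durable ledger.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable, Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"      # the tool confirmed that nothing was changed
    UNKNOWN = "unknown"    # timeout or crash; the side effect may exist


class ToolRejected(Exception):
    """Hypothetical error a tool raises when it knows no change was made."""


@dataclass
class LedgerEntry:
    tool: str
    inputs: dict
    outcome: Outcome
    output: Any = None
    error: Optional[str] = None
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def run_and_record(
    ledger: list, tool: str, fn: Callable[..., Any], inputs: dict
) -> LedgerEntry:
    """Run a tool and append exactly one entry describing what is known."""
    try:
        output = fn(**inputs)
    except ToolRejected as exc:
        entry = LedgerEntry(tool, inputs, Outcome.FAILED, error=str(exc))
    except Exception as exc:
        # Without proof that nothing happened, record the outcome as unknown.
        entry = LedgerEntry(tool, inputs, Outcome.UNKNOWN, error=repr(exc))
    else:
        entry = LedgerEntry(tool, inputs, Outcome.SUCCEEDED, output=output)
    ledger.append(entry)
    return entry
```

A separate reconciliation step can later move an `UNKNOWN` entry to `SUCCEEDED` or `FAILED` once it has checked the target system.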

The verification ledger records post-action checks. It is where the system stores "read file back and hash matches," "CI job 456 passed," "message id 789 exists in the thread," or "policy check denied production deploy." This gives later runs a factual base.
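As a rough illustration of the "read file back and hash matches" check, a verification tool could return a record like the one below. The `VerificationRecord` fields and the check name are assumptions made for this sketch.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class VerificationRecord:
    check: str
    resource: str
    passed: bool
    detail: str
    checked_at: datetime


def verify_file_hash(path: Path, expected_sha256: str) -> VerificationRecord:
    """Read the file back from disk and compare its hash to the expected one."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return VerificationRecord(
        check="file_hash_matches",
        resource=str(path),
        passed=(actual == expected_sha256),
        detail=f"expected={expected_sha256} actual={actual}",
        checked_at=datetime.now(timezone.utc),
    )
```

The record is what gets stored; the model only sees a view of it. A later run can then rely on `passed` and `detail` rather than on a sentence such as "tests mostly pass."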

Where this goes wrong

The first failure is memory optimism. The agent says, "I remember the user approved this." Maybe it does. Maybe the approval applied to an earlier proposal. Maybe the conversation was summarized and lost the condition. Durable approvals need ids, scopes, and expiry.
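A durable approval with an id, scope, and expiry could look roughly like this. The scope string format, the 30-minute lifetime, and all ids are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ApprovalGrant:
    approval_id: str
    proposal_id: str       # the exact proposal the user was shown
    scope: str             # e.g. "push:review-fixes" (hypothetical format)
    granted_by: str
    expires_at: datetime

    def covers(self, proposal_id: str, scope: str, now: datetime) -> bool:
        """True only for the same proposal, the same scope, and before expiry."""
        return (
            proposal_id == self.proposal_id
            and scope == self.scope
            and now < self.expires_at
        )


granted_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grant = ApprovalGrant(
    approval_id="appr-7",
    proposal_id="prop-a",
    scope="push:review-fixes",
    granted_by="alice",
    expires_at=granted_at + timedelta(minutes=30),
)
```

The check is made by the system at execution time. If the proposal was revised after the grant, or the grant has expired, `covers` returns `False` regardless of what the model believes it remembers.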

The second failure is concurrent work. Two agents pick up the same ticket, both read the same initial state, and both write different fixes. If ownership exists only in their prompts, neither can reliably fence the other out. External state can hold a lease, branch reservation, or task assignment.
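A lease held in external state can be sketched as below. This is a single-process, in-memory stand-in; a real deployment would need a shared store with an atomic compare-and-set. The `LeaseStore` name and the increasing fencing token are assumptions of the sketch, and `now` is passed in explicitly to keep the behavior easy to test.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    task_id: str
    owner: str
    token: int           # fencing token; increases with every grant
    expires_at: float


class LeaseStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._leases = {}
        self._next_token = 1

    def acquire(self, task_id: str, owner: str, ttl: float, now: float) -> Optional[Lease]:
        """Grant the lease unless another owner holds an unexpired one."""
        with self._lock:
            current = self._leases.get(task_id)
            if current and current.expires_at > now and current.owner != owner:
                return None
            lease = Lease(task_id, owner, self._next_token, now + ttl)
            self._next_token += 1
            self._leases[task_id] = lease
            return lease

    def is_current(self, lease: Lease, now: float) -> bool:
        """True if this exact lease is still the active, unexpired one."""
        with self._lock:
            return self._leases.get(lease.task_id) == lease and lease.expires_at > now
```

A write tool can require `is_current(lease, now)` before acting, so an agent whose lease has lapsed is refused by the system rather than by its own prompt.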

The third failure is retry confusion. A tool call fails with a network error. The model decides to try again. Without an execution ledger and idempotency key, the retry may duplicate a side effect. The model cannot reason its way out of an ambiguous outcome it did not record.
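The retry problem can be illustrated with a toy service that deduplicates on an idempotency key. `FakeTicketService` is a stand-in written for this sketch, not a real API, and it assumes the downstream system honors the key; without that support the ledger has to be reconciled before any retry.

```python
import hashlib
import json


def idempotency_key(task_id: str, tool: str, inputs: dict) -> str:
    """Derive a stable key so a retry of the same call reuses the same key."""
    body = json.dumps({"task": task_id, "tool": tool, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class FakeTicketService:
    """Stand-in for a remote API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.tickets = {}
        self.drop_next_response = False

    def create_ticket(self, key: str, title: str) -> dict:
        if key not in self.tickets:
            self.tickets[key] = {"id": len(self.tickets) + 1, "title": title}
        if self.drop_next_response:
            # Simulate the ambiguous case: the write landed, the reply did not.
            self.drop_next_response = False
            raise TimeoutError("response lost after the write was applied")
        return self.tickets[key]


def create_with_retry(service: FakeTicketService, key: str, title: str, attempts: int = 2) -> dict:
    for attempt in range(attempts):
        try:
            return service.create_ticket(key, title)
        except TimeoutError:
            if attempt == attempts - 1:
                raise


service = FakeTicketService()
service.drop_next_response = True
key = idempotency_key("task-1", "create_ticket", {"title": "Fix docs"})
ticket = create_with_retry(service, key, "Fix docs")
```

The first attempt creates the ticket and then times out; the retry carries the same key and gets the existing ticket back, so only one ticket exists.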

The fourth failure is audit theater. A transcript is stored, but it is too long and unstructured to answer simple questions. Which policy allowed this write? Which approval was used? Which verification failed? If answering requires rereading thousands of tokens, the transcript is being used as the state model, and it is the wrong tool for that job.
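With structured records, the audit questions above become queries. The sketch below uses an in-memory SQLite database; the table layout, column names, and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE executions (
        action_id   TEXT PRIMARY KEY,
        task_id     TEXT NOT NULL,
        tool        TEXT NOT NULL,
        approval_id TEXT,
        policy_id   TEXT,
        outcome     TEXT NOT NULL
    );
    CREATE TABLE verifications (
        action_id  TEXT NOT NULL REFERENCES executions(action_id),
        check_name TEXT NOT NULL,
        passed     INTEGER NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO executions VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("act-1", "task-1", "write_file", "appr-7", "policy-docs-write", "succeeded"),
        ("act-2", "task-1", "push_branch", "appr-7", "policy-push", "succeeded"),
    ],
)
conn.executemany(
    "INSERT INTO verifications VALUES (?, ?, ?)",
    [("act-1", "file_hash_matches", 1), ("act-2", "ci_passed", 0)],
)


def failed_checks(conn: sqlite3.Connection, task_id: str) -> list:
    """Return (action, approval, policy, check) for each failed verification."""
    return conn.execute(
        """
        SELECT e.action_id, e.approval_id, e.policy_id, v.check_name
        FROM executions AS e
        JOIN verifications AS v ON v.action_id = e.action_id
        WHERE e.task_id = ? AND v.passed = 0
        ORDER BY e.action_id
        """,
        (task_id,),
    ).fetchall()
```

Here `failed_checks(conn, "task-1")` names the action, the approval it relied on, the policy that allowed it, and the check that failed, without anyone rereading a transcript.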

The counterpoint is that very small agents can keep light state. A local assistant drafting a throwaway note does not need a task database. But as soon as an agent crosses process boundaries, resumes after interruption, or affects shared systems, external state pays for itself quickly.

What I do now

I make the model read state at the beginning of each turn instead of assuming continuity. The prompt can include a concise view, but the source of truth is the task record and ledgers.
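The concise view can be rendered from stored records on every turn. The function below is a minimal sketch that assumes the task and ledger entries are plain dictionaries with the keys shown; a real system would load them from its own store.

```python
def render_state_view(task: dict, ledger: list, max_entries: int = 5) -> str:
    """Build a short, prompt-ready summary from stored records."""
    lines = [
        f"Task {task['task_id']}: {task['objective']} (status: {task['status']})",
        "Recent actions:",
    ]
    for entry in ledger[-max_entries:]:
        lines.append(f"- {entry['tool']} -> {entry['outcome']}")
    unresolved = [e for e in ledger if e["outcome"] == "unknown"]
    if unresolved:
        lines.append(
            f"Unresolved outcomes: {len(unresolved)}; reconcile before retrying."
        )
    return "\n".join(lines)


view = render_state_view(
    {"task_id": "task-1", "objective": "Address review feedback on PR 123", "status": "active"},
    [
        {"tool": "write_file", "outcome": "succeeded"},
        {"tool": "run_tests", "outcome": "succeeded"},
        {"tool": "push_branch", "outcome": "unknown"},
    ],
)
```

The view is disposable. If it is truncated or summarized away, the next turn rebuilds it from the records.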

I design tools to update state as part of execution. A write tool should not only perform the write; it should append to the execution ledger. A verification tool should not only return a result to the model; it should record the observation. If the model crashes, the system still knows what happened.
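One way to make recording part of execution is to wrap each tool so that an entry is written before the side effect and updated afterwards. The `recorded` decorator below is a hypothetical helper; the in-memory list stands in for a durable ledger.

```python
import functools


def recorded(ledger: list, tool_name: str):
    """Wrap a tool so every call leaves a ledger entry, even if it raises."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**inputs):
            entry = {"tool": tool_name, "inputs": dict(inputs), "status": "started"}
            ledger.append(entry)  # written before the side effect is attempted
            try:
                result = fn(**inputs)
            except Exception as exc:
                # The effect may or may not have happened; do not claim failure.
                entry["status"] = "unknown"
                entry["error"] = repr(exc)
                raise
            entry["status"] = "succeeded"
            entry["result"] = result
            return result

        return wrapper

    return decorator


ledger = []
files = {}


@recorded(ledger, "write_file")
def write_file(path: str, content: str) -> int:
    files[path] = content
    return len(content)


write_file(path="docs/guide.md", content="hello")
```

Because the entry is appended first, a crash in the middle of a call leaves a "started" record behind, which tells the next run that something needs to be reconciled.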

I keep approvals and stop conditions outside the model. The model can request approval and explain why. It cannot decide that an old approval still applies. The model can recommend stopping. The system can enforce stop conditions when preconditions fail, budgets expire, or policy denies the action.
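Stop conditions enforced by the system might be sketched as a pure function over recorded facts. The three checks and their parameter names are assumptions chosen to mirror the examples in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopDecision:
    stop: bool
    reasons: tuple


def evaluate_stop(
    *,
    expected_head: str,
    actual_head: str,
    steps_used: int,
    step_budget: int,
    policy_allows: bool,
) -> StopDecision:
    """Decide whether the run must halt, independent of the model's opinion."""
    reasons = []
    if expected_head != actual_head:
        reasons.append("precondition failed: branch head changed")
    if steps_used >= step_budget:
        reasons.append("budget exhausted")
    if not policy_allows:
        reasons.append("policy denied the action")
    return StopDecision(stop=bool(reasons), reasons=tuple(reasons))
```

The orchestrator would call this before every write. The model can still recommend stopping, but it cannot override a `stop=True` decision.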

I also make state compact enough to inspect. Good state is boring: records with ids, timestamps, resources, versions, and outcomes. It should be possible to answer the important questions without reconstructing the whole conversation.

Closing takeaway

Use the model for reasoning, not memory. Durable state is the part of an agent system that survives retries, interruptions, audits, and changed plans.