Agent Observability Should Follow Decisions, Not Tokens

Token logs are useful, but they are not agent observability. They show what text moved through the model. They do not necessarily show why a write happened, which context justified it, which policy allowed it, or why verification was considered good enough.

The thesis

Agent observability should follow decisions. Tokens are evidence, but the production questions are about intent, authority, action, and state.

When an agent changes a file, sends a message, or updates a ticket, the first debugging question is rarely "What was token 8,421?" The question is: why did the system believe this action was appropriate, allowed, and complete? A raw prompt transcript may contain clues, but clues are not an operational model.

If we want to run agents in production, we need traces that make decisions inspectable without replaying the whole conversation in a human's head.

The production pattern

An agent comments on a customer-facing ticket with the wrong deployment time. The transcript shows that an old incident note mentioned a 2 PM maintenance window, a calendar read later showed 4 PM, and the final message used 2 PM. The token log is complete, but the decision is still hard to inspect.

Which context item did the agent rely on? Did it consider the calendar authoritative? Was the calendar read after the incident note? Did the send tool require a freshness check? Did a human approve the final message? Did the approval happen before or after the time changed?

Another example: an agent modifies a configuration file and says tests passed. The trace has model text saying "running tests now" and "the relevant tests pass." But the tool log shows one command timed out, a narrower command passed, and the agent treated the narrower command as enough. The important event is not a token. It is a decision to downgrade verification.

The model

I want observability spans around decisions, not just model calls.

A context selection span records which observations were used and which were ignored. It points to source, capture time, owner, and resource version. This makes stale or low-authority context visible.
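As a minimal sketch of what that span could hold, assuming Python dataclasses: the field names, the authoritative flag, and the dates below are my own illustrations, not a standard schema. The data mirrors the ticket example, where the old incident note was used and the fresher calendar read was ignored.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Observation:
    source: str             # where the observation came from
    owner: str              # who is responsible for that source
    captured_at: datetime   # when the agent read it
    resource_version: str   # version of the resource at read time
    authoritative: bool     # assumed flag: is this the system of record?


@dataclass
class ContextSelectionSpan:
    used: list = field(default_factory=list)
    ignored: list = field(default_factory=list)

    def stale(self, now, max_age):
        """Used observations captured more than max_age before now."""
        return [o for o in self.used if now - o.captured_at > max_age]

    def low_authority(self):
        """Used observations that do not come from a system of record."""
        return [o for o in self.used if not o.authoritative]


# Illustrative data only: timestamps, owners, and versions are made up.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
note = Observation("incident-note", "sre-team", now - timedelta(days=3), "v7", False)
calendar = Observation("calendar", "release-manager", now - timedelta(minutes=5), "v12", True)

span = ContextSelectionSpan(used=[note], ignored=[calendar])
stale_sources = [o.source for o in span.stale(now, timedelta(hours=1))]
weak_sources = [o.source for o in span.low_authority()]
print(stale_sources, weak_sources)
```

With the span in this shape, "the agent relied on a three-day-old, non-authoritative note" is a query result rather than something a reader infers from the transcript.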

A planning span records the proposed action and alternatives considered. This does not require exposing private chain-of-thought. It can record structured reasons: selected action, rejected alternatives, constraints, and confidence category. "Chose to edit worker.yaml because failing job references checkout-worker; rejected deploy restart because no deploy permission" is enough.
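A planning span along these lines might look like the sketch below. The class names and the three-level confidence category are assumptions for illustration; the point is that the record holds structured reasons, not free-form reasoning text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class RejectedAlternative:
    action: str
    reason: str


@dataclass
class PlanningSpan:
    selected_action: str
    reason: str
    confidence: Confidence
    rejected: list = field(default_factory=list)
    constraints: list = field(default_factory=list)

    def summary(self):
        """One-line, human-readable form of the structured reasons."""
        parts = [f"Chose {self.selected_action} because {self.reason}"]
        parts += [f"rejected {r.action} because {r.reason}" for r in self.rejected]
        return "; ".join(parts)


plan = PlanningSpan(
    selected_action="edit worker.yaml",
    reason="failing job references checkout-worker",
    confidence=Confidence.MEDIUM,
    rejected=[RejectedAlternative("deploy restart", "no deploy permission")],
    constraints=["no deploy permission"],
)
print(plan.summary())
```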

A policy span records subject, action, resource, context, decision, and rule result. If policy denies a write, that denial should stay visible in the trace even if the model later argues for the write.
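Here is a rough sketch of a policy check that always leaves a span behind, whether it allows or denies. The allow-list, the agent and action names, and the evaluate function are hypothetical; a real system would call its own policy engine at that point.

```python
from dataclasses import dataclass


@dataclass
class PolicySpan:
    subject: str
    action: str
    resource: str
    context: dict
    decision: str   # "allow" or "deny"
    rule: str       # which rule produced the decision


# Hypothetical allow-list of (subject, action) pairs.
ALLOWED = {("agent-1", "edit_file"), ("agent-1", "comment")}


def evaluate(subject, action, resource, context, trace):
    """Decide, record the decision in the trace, then return it."""
    allowed = (subject, action) in ALLOWED
    trace.append(PolicySpan(
        subject=subject,
        action=action,
        resource=resource,
        context=context,
        decision="allow" if allowed else "deny",
        rule="allow-list match" if allowed else "default deny",
    ))
    return allowed


trace = []
can_edit = evaluate("agent-1", "edit_file", "worker.yaml", {"run": "run-1"}, trace)
can_restart = evaluate("agent-1", "restart_deploy", "checkout-worker", {"run": "run-1"}, trace)
denials = [s for s in trace if s.decision == "deny"]
print(can_edit, can_restart, len(denials))
```

The span is appended before the result is returned, so a denial cannot be lost just because the caller ignores the return value.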

An approval span records proposal id, reviewer, decision, scope, expiry, and version. It should answer whether approval applied to the action that actually ran.
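One way to make that question answerable is to compare the approval against what actually executed. This is a sketch under my own assumptions: the field names, the gap messages, and the example identifiers are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ApprovalSpan:
    proposal_id: str
    reviewer: str
    decision: str         # "approved" or "rejected"
    scope: str            # resource the approval covers
    expires_at: datetime
    version: str          # version of the proposal the reviewer saw


def approval_gaps(approval, proposal_id, resource, version, executed_at):
    """Return reasons the approval does not cover the executed action."""
    gaps = []
    if approval.decision != "approved":
        gaps.append("not approved")
    if approval.proposal_id != proposal_id:
        gaps.append("different proposal")
    if approval.scope != resource:
        gaps.append("out of scope")
    if approval.version != version:
        gaps.append("proposal changed after approval")
    if executed_at > approval.expires_at:
        gaps.append("approval expired")
    return gaps


# Illustrative data: the reviewer approved v1 of the message, but v2 was sent.
approved_at = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
approval = ApprovalSpan(
    proposal_id="proposal-42",
    reviewer="alice",
    decision="approved",
    scope="ticket-123",
    expires_at=approved_at + timedelta(hours=1),
    version="v1",
)
gaps = approval_gaps(
    approval,
    proposal_id="proposal-42",
    resource="ticket-123",
    version="v2",
    executed_at=approved_at + timedelta(minutes=10),
)
print(gaps)
```

An empty list means the approval covered the action that ran. Anything else is a finding worth surfacing in the trace.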

An execution span records tool input, output, duration, side-effect handle, and outcome. Ambiguous outcomes should be labeled as ambiguous, not converted to failure or success because the agent wants a clean story.
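A small sketch of keeping ambiguity explicit, assuming a tool runner that reports an exit code and a timeout flag. Both are my assumptions about what the runner exposes, and the tool name and ticket id are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    AMBIGUOUS = "ambiguous"


def classify(exit_code: Optional[int], timed_out: bool) -> Outcome:
    """A timeout or a missing exit code means we do not know what happened."""
    if timed_out or exit_code is None:
        return Outcome.AMBIGUOUS
    return Outcome.SUCCESS if exit_code == 0 else Outcome.FAILURE


@dataclass(frozen=True)
class ExecutionSpan:
    tool: str
    tool_input: str
    output: str
    duration_s: float
    side_effect_handle: Optional[str]   # e.g. a comment id, if one came back
    outcome: Outcome


# The call timed out, so the comment may or may not have been posted.
span = ExecutionSpan(
    tool="post_comment",
    tool_input="ticket-123: deployment moved to 4 PM",
    output="",
    duration_s=30.0,
    side_effect_handle=None,
    outcome=classify(exit_code=None, timed_out=True),
)
print(span.outcome.value)
```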

A verification span records the post-action check, expected state, observed state, and result. If the check was narrowed or swapped during the run, the trace should record that change as a decision in its own right.
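To make a downgrade visible instead of implied, the span can carry both the planned check and the check that actually ran. This is a sketch: the field names and the passed_downgraded result label are assumptions, and the data follows the configuration-file example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationSpan:
    planned_check: str
    executed_check: str
    expected_state: str
    observed_state: str
    downgrade_reason: str = ""

    @property
    def downgraded(self):
        """True when the check that ran is not the check that was planned."""
        return self.executed_check != self.planned_check

    @property
    def result(self):
        if self.observed_state != self.expected_state:
            return "failed"
        return "passed_downgraded" if self.downgraded else "passed"


span = VerificationSpan(
    planned_check="full test suite",
    executed_check="config tests only",
    expected_state="all tests pass",
    observed_state="all tests pass",
    downgrade_reason="full suite timed out",
)
print(span.result, span.downgrade_reason)
```

A result of passed_downgraded is a different claim from passed, and the trace should not let the agent's summary text merge the two.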

Together, these spans make an agent run debuggable. They also support product metrics that matter: denied writes, stale-context stops, approval changes, ambiguous tool outcomes, verification failures, and repeated scope expansion.
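Once spans are structured, those metrics reduce to counting. The sketch below assumes each span has been flattened to a dict with a kind and a result; the labels in the mapping are hypothetical and would need to match whatever your spans actually emit.

```python
from collections import Counter

# Hypothetical mapping from (span kind, result) to a product metric name.
METRIC_FOR = {
    ("policy", "deny"): "denied_writes",
    ("context", "stale_stop"): "stale_context_stops",
    ("approval", "changed"): "approval_changes",
    ("execution", "ambiguous"): "ambiguous_tool_outcomes",
    ("verification", "failed"): "verification_failures",
    ("planning", "scope_expanded"): "scope_expansions",
}


def decision_metrics(spans):
    """Count the decision events that map to a known metric."""
    counts = Counter()
    for span in spans:
        metric = METRIC_FOR.get((span["kind"], span["result"]))
        if metric is not None:
            counts[metric] += 1
    return counts


spans = [
    {"kind": "policy", "result": "deny"},
    {"kind": "policy", "result": "allow"},
    {"kind": "execution", "result": "ambiguous"},
    {"kind": "planning", "result": "scope_expanded"},
    {"kind": "planning", "result": "scope_expanded"},
]
metrics = decision_metrics(spans)
print(dict(metrics))
```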

Where this goes wrong

The first mistake is logging only prompts and completions. That is often too much text and too little structure. It can also create privacy and retention problems because prompts may contain sensitive data copied from tools.

The second mistake is measuring agent quality with surface metrics. Token count, latency, number of tool calls, and a final message that claims success do not tell you whether the agent made a safe decision. A run with fewer tool calls can be worse if it skipped verification. A run with more stops can be better if it avoided stale writes.

The third mistake is hiding decisions inside broad tools. If complete_ticket reads context, writes code, posts comments, and closes work, the trace may show one tool span. The decision boundaries disappear. Observability cannot recover structure the tool did not expose.
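One way to keep those boundaries is to have the broad tool emit a span per step. The sketch below uses a small context manager of my own; the step names, the in-memory TRACE list, and the ticket id are illustrative, and the step bodies are left as placeholders.

```python
import time
from contextlib import contextmanager

TRACE = []   # in-memory stand-in for a real trace sink


@contextmanager
def span(name, **attrs):
    """Record one step with its outcome and duration, even if it raises."""
    record = {"name": name, **attrs}
    start = time.monotonic()
    try:
        yield record
        record.setdefault("outcome", "success")
    except Exception as exc:
        record["outcome"] = "failure"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        TRACE.append(record)


def complete_ticket(ticket_id):
    # Each step is its own span, so the trace shows four decisions, not one.
    with span("read_context", ticket=ticket_id):
        pass   # placeholder: gather observations
    with span("write_code", ticket=ticket_id):
        pass   # placeholder: apply the change
    with span("post_comment", ticket=ticket_id):
        pass   # placeholder: notify the customer
    with span("close_ticket", ticket=ticket_id):
        pass   # placeholder: update ticket state


complete_ticket("ticket-123")
step_names = [record["name"] for record in TRACE]
print(step_names)
```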

The fourth mistake is treating the transcript as a replay log. A transcript may not include external state, current policy, resource versions, or tool side effects. To replay a decision, you need the observations and tool results as they were, not just the words around them.
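A sketch of what replaying a decision could mean in practice: snapshot the inputs as the agent saw them, then re-run a candidate decision rule against that snapshot. Everything here is an assumption for illustration, including the snapshot fields and the latest_wins rule; the data reuses the deployment-time example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DecisionSnapshot:
    observations: Tuple[Tuple[str, str, int], ...]   # (source, value, read order)
    policy_version: str
    tool_results: Tuple[str, ...]
    decision: str                                    # what the system decided


def replay(snapshot, decide):
    """Re-run decide on the recorded inputs; return (matches, replayed)."""
    replayed = decide(snapshot.observations, snapshot.tool_results)
    return replayed == snapshot.decision, replayed


def latest_wins(observations, tool_results):
    """Hypothetical rule: trust the most recently read observation."""
    return max(observations, key=lambda obs: obs[2])[1]


snapshot = DecisionSnapshot(
    observations=(("incident-note", "2 PM", 1), ("calendar", "4 PM", 2)),
    policy_version="policy-v3",
    tool_results=(),
    decision="2 PM",
)
matches, replayed = replay(snapshot, latest_wins)
print(matches, replayed)
```

Here the replay disagrees with the recorded decision, which tells you the original run did not follow a latest-read-wins rule. A transcript alone could not establish that.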

The counterpoint is that detailed traces cost money and attention. You do not need maximum detail for every low-risk draft. Sampling and tiering are reasonable. But for write-capable agents, the decision record around side effects should be non-negotiable.
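A tiering rule like this can be very small. In the sketch below, the set of write actions and the two trace levels are assumptions; the only fixed behavior is that writes are never sampled away.

```python
import random

# Assumed names for write-capable actions; adjust to your own tool set.
WRITE_ACTIONS = {"edit_file", "send_message", "update_ticket"}


def trace_level(action, sample_rate, rng):
    """Writes always get a full decision record; other actions are sampled."""
    if action in WRITE_ACTIONS:
        return "full"
    return "full" if rng.random() < sample_rate else "summary"


rng = random.Random(0)   # seeded only to make the example repeatable
write_level = trace_level("edit_file", 0.0, rng)
draft_never = trace_level("draft_reply", 0.0, rng)
draft_always = trace_level("draft_reply", 1.0, rng)
print(write_level, draft_never, draft_always)
```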

What I do now

I define the questions observability must answer before choosing what to log: What did the agent know? What did it propose? What allowed it? What changed? How was it verified? Why did it stop or continue?

I use correlation ids across model calls, proposals, approvals, tool calls, and verification checks. If a pull request was opened, I want to trace back from the PR id to the proposal and from the proposal to the observations that justified it.
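The trace-back can be sketched as records that link to the ids that caused them. The record and lineage helpers, the in-memory store, and the example ids are all hypothetical; a real system would persist these links alongside its spans.

```python
import uuid

RECORDS = {}   # in-memory stand-in for a trace store


def record(kind, caused_by=(), **attrs):
    """Store a record that links back to the ids that caused it."""
    record_id = f"{kind}-{uuid.uuid4().hex[:8]}"
    RECORDS[record_id] = {"kind": kind, "caused_by": list(caused_by), **attrs}
    return record_id


def lineage(record_id):
    """Walk caused_by links backwards and return every id reached."""
    seen, stack = [], [record_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(RECORDS[current]["caused_by"])
    return seen


obs_log = record("observation", source="ci-log")
obs_file = record("observation", source="worker.yaml")
proposal = record("proposal", caused_by=[obs_log, obs_file], action="edit worker.yaml")
approval = record("approval", caused_by=[proposal], reviewer="alice")
pr = record("pull_request", caused_by=[approval], pr_id="PR-1")

kinds = sorted(RECORDS[rid]["kind"] for rid in lineage(pr))
print(kinds)
```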

I log structured decisions separately from raw text. The text can be retained with tighter controls or shorter retention. The decision log should remain queryable: show all actions approved by a person, all writes based on stale context, all verification downgrades, all policy denials followed by re-proposals.
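As an illustration of "queryable", here is a sketch using an in-memory SQLite table. The schema, the kind and outcome labels, and the sample rows are assumptions; the two queries correspond to verification downgrades and policy denials followed by re-proposals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE decisions "
    "(run_id TEXT, seq INTEGER, kind TEXT, action TEXT, outcome TEXT)"
)
# Made-up rows for two runs.
conn.executemany(
    "INSERT INTO decisions VALUES (?, ?, ?, ?, ?)",
    [
        ("run-1", 1, "proposal", "restart_deploy", "proposed"),
        ("run-1", 2, "policy", "restart_deploy", "deny"),
        ("run-1", 3, "proposal", "restart_deploy", "proposed"),
        ("run-2", 1, "proposal", "edit_file", "proposed"),
        ("run-2", 2, "policy", "edit_file", "allow"),
        ("run-2", 3, "verification", "edit_file", "downgraded"),
    ],
)

# Policy denials that were followed by a new proposal for the same action.
reproposed = conn.execute(
    """
    SELECT DISTINCT d.run_id, d.action
    FROM decisions AS d
    JOIN decisions AS p
      ON p.run_id = d.run_id AND p.action = d.action AND p.seq > d.seq
    WHERE d.kind = 'policy' AND d.outcome = 'deny' AND p.kind = 'proposal'
    """
).fetchall()

# Runs where verification was downgraded.
downgrades = conn.execute(
    "SELECT run_id, action FROM decisions "
    "WHERE kind = 'verification' AND outcome = 'downgraded'"
).fetchall()

print(reproposed, downgrades)
```

Neither query touches raw prompt text, which is what lets the text sit under tighter access controls or shorter retention.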

I review traces during design, not only after incidents. If a normal successful run cannot explain itself, the system is not ready for harder cases.

Closing takeaway

Observe the decision path. Tokens tell you what the model said; decision traces tell you why the system let reality change.