Auditor Agents

An auditor agent should not be impressed by a fluent explanation.

The worker may have written a clean summary. It may have said it stayed in scope. It may have used the right safety language. None of that is the audit target. The audit target is the trace: what the harness knew, what permission it had, what it did, what changed, and who is responsible for deciding whether that was acceptable.

Auditor agents are useful when they review evidence instead of conversation.

The thesis

An auditor agent is an independent trace reviewer, not a second opinion about the answer.

Its job is to check a run against policy, approvals, provenance, and final state. It should produce findings that a human owner can accept or reject. It should not repair the run. It should not approve exceptions. It should not infer private intent from a polished final message.

The model can help inspect the record, but policy must live outside the model.

The production pattern

The production pattern starts after a harness has done real work.

A worker read a ticket, inspected files, created a patch, ran a command, requested approval, or called a service tool. The final answer says the task was completed within scope. The question for the auditor is narrower: does the trace support that claim?

For example, imagine a documentation agent approved to update three Markdown files. The final diff includes a generated config change because a tool suggested the docs would otherwise be stale. The worker explains that the config edit was harmless. The auditor does not debate harmlessness. It checks the approval record, sees that writes were scoped to docs, sees a write outside that scope, and emits a finding.

The finding is not "bad agent." It is: approval mismatch, file path outside granted write scope, config file touched, human review required before merge.

That level of specificity is the difference between audit and commentary.
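The scope check in the documentation example can be sketched in a few lines of Python. This is a minimal illustration, assuming the approval record reduces to a list of glob patterns and the final diff to a list of changed paths. The function name, the patterns, and the file paths are all hypothetical.

```python
from fnmatch import fnmatchcase

def out_of_scope_writes(changed_paths, approved_globs):
    """Return changed paths that match none of the approved write patterns."""
    return [
        path for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in approved_globs)
    ]

# Hypothetical approval: writes scoped to Markdown files under docs/.
approved = ["docs/*.md"]
changed = [
    "docs/install.md",
    "docs/usage.md",
    "docs/faq.md",
    "config/generated.yaml",  # the write the worker called harmless
]
violations = out_of_scope_writes(changed, approved)
```

The check has no opinion about whether the config edit was harmless. It reports only that a path fell outside the granted scope, which is the part a human owner needs in order to decide.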

The model

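I structure auditor agents around decision points, and I describe each one in five parts: inputs, permissions, outputs, failure modes, and review path.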

Inputs: immutable traces, action records, tool inputs and outputs, approval artifacts, policy snapshots, context provenance, and final state diffs. If the trace is incomplete, the auditor should say which evidence is missing.

Permissions: read-only access to traces and artifacts. The auditor may label a run, create a review item, or recommend a block through a controlled output channel. It may not mutate the work product or grant approval.

Outputs: findings with severity, evidence, violated rule or expected behavior, affected artifact, and recommended review path. A no-finding result should also be explicit: the auditor checked these decision points and found no mismatch in the available trace.

Failure modes: auditing summaries instead of source events, sharing the worker's poisoned context, treating policy as a prompt preference, producing vague concerns, overlooking evidence that was redacted from the trace, and creating so many low-value findings that operators stop reading.

Review path: audit findings go to the owner of the boundary being protected. A security-sensitive capability violation routes differently from a style issue in a generated document. The auditor should not flatten all findings into one generic queue.
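The output contract described above could look something like the following Python sketch. The field names, the severity labels, and the summary wording are assumptions made for illustration, not a fixed schema. The point is that a no-finding result and a missing-evidence result are both explicit.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Finding:
    severity: str      # hypothetical labels, e.g. "block" or "record"
    rule: str          # violated rule or expected behavior
    artifact: str      # affected file, tool, or approval
    event_ids: tuple   # trace events that support the finding
    review_path: str   # owner of the boundary being protected

@dataclass
class AuditResult:
    checked: list                                   # decision points examined
    findings: list = field(default_factory=list)
    missing_evidence: list = field(default_factory=list)

    def summary(self):
        parts = []
        if self.findings:
            parts.append(f"{len(self.findings)} finding(s) require review")
        if self.missing_evidence:
            parts.append("missing evidence: " + ", ".join(self.missing_evidence))
        if not parts:
            # An explicit clean result: say what was checked, not just "ok".
            parts.append("no mismatch found in: " + ", ".join(self.checked))
        return "; ".join(parts)
```

Because each finding carries its own review path, a router can send a capability violation and a style issue to different owners instead of one shared queue.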

Where this goes wrong

The first failure is independence theater. If the auditor receives the worker's final answer and a prompt that says "check this carefully," it is not auditing. It is summarizing a summary. The auditor needs the trace.

The second failure is letting the auditor become the policy engine. A model can identify that a write crossed a boundary. It should not invent the boundary while reviewing the run. The allowed paths, required approvals, tool capabilities, and stop conditions need durable definitions.
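One way to keep the boundary out of the model is to load it as versioned data and pass it into each check. The sketch below assumes a JSON policy snapshot; the keys `version`, `allowed_write_globs`, and `required_approvals`, and the approver names, are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicySnapshot:
    version: str
    allowed_write_globs: tuple
    required_approvals: tuple

def load_policy(raw_json):
    """Parse a policy snapshot; a missing key raises KeyError instead of guessing."""
    data = json.loads(raw_json)
    return PolicySnapshot(
        version=data["version"],
        allowed_write_globs=tuple(data["allowed_write_globs"]),
        required_approvals=tuple(data["required_approvals"]),
    )

def missing_approvals(policy, approvals_in_trace):
    """Return required approvals that do not appear in the trace."""
    return [a for a in policy.required_approvals if a not in approvals_in_trace]

# Hypothetical snapshot, stored alongside the trace it governed.
raw = (
    '{"version": "v1", "allowed_write_globs": ["docs/*.md"], '
    '"required_approvals": ["docs-owner", "security-owner"]}'
)
policy = load_policy(raw)
gaps = missing_approvals(policy, {"docs-owner"})
```

The model may help explain a gap, but the list of required approvals comes from the snapshot, so two audits of the same trace apply the same rule.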

The third failure is missing provenance. If an instruction came from a user, it may carry authority. If it came from a README inside an untrusted repository, it may be data. If the auditor cannot tell those apart, it cannot reliably judge instruction-following behavior.
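A small sketch of that distinction, assuming each instruction in the trace carries a recorded source label. The labels in `AUTHORITATIVE_SOURCES` and the `Instruction` fields are hypothetical; a real harness would define its own.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source labels that are allowed to carry authority.
AUTHORITATIVE_SOURCES = {"user", "operator"}

@dataclass(frozen=True)
class Instruction:
    event_id: str
    text: str
    source: Optional[str]  # None when the trace recorded no provenance

def classify(instruction):
    """Label an instruction as 'authority', 'data', or 'unknown'."""
    if instruction.source is None:
        return "unknown"   # missing provenance is reported, not guessed
    if instruction.source in AUTHORITATIVE_SOURCES:
        return "authority"
    return "data"          # e.g. text read from a file in an untrusted repo
```

The "unknown" label matters as much as the other two. It gives the auditor a way to report missing provenance rather than silently treating the instruction as trusted or untrusted.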

The fourth failure is treating audit as blame. An audit should make clear who owns the decision, not put the worker on trial. A good finding gives the next reviewer enough evidence to decide quickly.

What I do now

I make the audit surface part of the harness contract.

Each important action records intent, capability, precondition, tool call, observed output, state change, and stop-condition check. That is not for the model's benefit alone. It is for the next engineer who has to understand why the harness did what it did.
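As an illustration, the fields listed above map naturally onto a record type that serializes to one trace line per action. This is a sketch under assumed names and types; the `event_id` field and the JSON-lines encoding are choices made here for the example.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ActionRecord:
    event_id: str                 # assumed: a stable identifier per action
    intent: str
    capability: str
    precondition: str
    tool_call: dict
    observed_output: str
    state_change: str
    stop_condition_checked: bool

def to_trace_line(record):
    """Serialize one action as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

# Hypothetical record for a single documentation edit.
record = ActionRecord(
    event_id="evt-001",
    intent="update install instructions",
    capability="write:docs",
    precondition="approval present for docs/*.md",
    tool_call={"name": "write_file", "path": "docs/install.md"},
    observed_output="ok",
    state_change="docs/install.md modified",
    stop_condition_checked=True,
)
line = to_trace_line(record)
```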

Then I keep the first auditor small. It checks a handful of boundaries that matter:

writes outside approved paths
approvals used after expiry
missing provenance for instructions
tool calls after a stop condition
final summaries contradicted by state
policy exceptions without a reviewer
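Two of the checks in that list can be sketched as plain functions over trace events. The event shape here is an assumption: dictionaries with `id`, `type`, `ts` (a numeric timestamp), and an optional `approval_id`. Each finding is a `(rule, event_id)` pair.

```python
def approvals_used_after_expiry(events, approvals):
    """Flag events that relied on an approval after it expired."""
    findings = []
    for event in events:
        approval = approvals.get(event.get("approval_id"))
        if approval is not None and event["ts"] > approval["expires_ts"]:
            findings.append(("approval_expired", event["id"]))
    return findings

def tool_calls_after_stop(events):
    """Flag tool calls that occur after a stop condition was recorded."""
    findings = []
    stopped = False
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["type"] == "stop_condition":
            stopped = True
        elif event["type"] == "tool_call" and stopped:
            findings.append(("tool_call_after_stop", event["id"]))
    return findings

# Hypothetical trace: one approved call, a stop condition, then a late call.
events = [
    {"id": "evt-1", "type": "tool_call", "ts": 10, "approval_id": "ap-1"},
    {"id": "evt-2", "type": "stop_condition", "ts": 20},
    {"id": "evt-3", "type": "tool_call", "ts": 30, "approval_id": "ap-1"},
]
approvals = {"ap-1": {"expires_ts": 25}}
```

Neither function needs a model. Checks this mechanical are a reasonable place to start, leaving the model for the cases that need judgment.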

I prefer concrete severity levels tied to review paths. "Block merge until reviewed by owner" is different from "record for harness tuning." Without that distinction, every audit finding becomes either noise or panic.
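A minimal sketch of tying severity to a review path, using only the two levels named above. The enum and the mapping are hypothetical; the property worth keeping is that a severity with no defined path is an error rather than a silent default.

```python
from enum import Enum

class Severity(Enum):
    BLOCK = "block"
    RECORD = "record"

# Every severity level must map to exactly one review path.
REVIEW_PATHS = {
    Severity.BLOCK: "block merge until reviewed by owner",
    Severity.RECORD: "record for harness tuning",
}

def review_path(severity):
    """Return the review path for a severity, or fail loudly if none exists."""
    try:
        return REVIEW_PATHS[severity]
    except KeyError:
        raise ValueError(f"no review path defined for {severity!r}") from None
```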

I also make the auditor cite event identifiers rather than prose snippets. A human should be able to open the trace, find the event, and decide whether the finding is correct. The auditor can be wrong. The system should make that cheap to discover.
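One cheap guard, sketched here under the assumption that trace events are dictionaries with an `id` key: before a finding reaches a human, confirm that every event identifier it cites exists in the trace.

```python
def unresolved_citations(cited_event_ids, trace_events):
    """Return cited event ids that do not exist in the trace."""
    known = {event["id"] for event in trace_events}
    return [event_id for event_id in cited_event_ids if event_id not in known]

# Hypothetical trace and a finding that cites one real and one invented event.
trace = [{"id": "evt-1"}, {"id": "evt-2"}]
bad = unresolved_citations(["evt-2", "evt-9"], trace)
```

A finding with unresolved citations can be rejected automatically, which catches one class of auditor error before anyone spends time on it.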

Finally, I feed audit misses back into the harness. If the auditor missed a bad run, the trace becomes a regression case. If it produced a weak finding, the output contract gets tightened. The auditor agent is an operational component, not an oracle.
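The regression loop can be sketched as a small runner. It assumes an `audit` callable that takes a trace and returns `(rule, event_id)` pairs, and stored cases with hypothetical keys `name`, `trace`, and `expected_rules`.

```python
def run_regressions(audit, cases):
    """Return names of cases where the auditor's rules differ from expected."""
    failures = []
    for case in cases:
        found = {rule for rule, _event_id in audit(case["trace"])}
        if found != set(case["expected_rules"]):
            failures.append(case["name"])
    return failures

# Toy auditor for illustration: flags any write outside docs/.
def toy_audit(trace):
    return [
        ("write_out_of_scope", event["id"])
        for event in trace
        if event["type"] == "write" and not event["path"].startswith("docs/")
    ]

cases = [
    {
        "name": "config write missed in an earlier run",
        "trace": [{"id": "evt-1", "type": "write", "path": "config/app.yaml"}],
        "expected_rules": ["write_out_of_scope"],
    },
    {
        "name": "clean docs edit",
        "trace": [{"id": "evt-2", "type": "write", "path": "docs/faq.md"}],
        "expected_rules": [],
    },
]
```

Comparing rule sets rather than full findings keeps the cases stable when event identifiers or wording change.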

Closing takeaway

An auditor agent should review decisions, not demeanor. If it cannot point to the event, permission, policy, and state change behind a finding, it has not audited the run.