A Harness Is a Controlled World
The first useful agent harness I trust does not begin with a scoring rubric. It begins with a fake world.
The model gets a workspace, a clock, a filesystem, a few APIs, and a set of permissions. Some of those APIs lie. Some are stale. Some deny writes. Some accept a request and delay the visible state change. The point is not to trick the model. The point is to make the surrounding system deterministic enough that a failure means something.
The thesis
A harness is not a questionnaire for an agent. It is a controlled world where tool effects, state transitions, permissions, and observations can be inspected after the run.
This matters because agents do not only produce answers. They act through tools. They read partial state, form plans, execute steps, revise plans, and sometimes continue after they should stop. If the harness only grades the final message, it misses the part of the system that can harm a real user.
The useful question is not "did the agent say the right thing?" The useful question is "given this world, did the agent leave the system in an acceptable state?"
The production pattern
The recurring pattern is a product workflow wrapped in a model loop.
An agent is asked to clean up a billing workspace. It can list invoices, open customer records, draft refunds, and create support notes. In production, these are real capabilities. In a harness, they should be fake services with real contracts.
The fake invoice API should return invoices with stable identifiers. The fake customer API should expose stale profile data for one customer. The refund API should require an idempotency key and reject refunds above a threshold. The support note tool should allow drafts but not publish without approval.
The seeded world might look like this in plain terms:
customer_17has two duplicate invoices.customer_18has a late invoice that is valid.listInvoicesreturns data generated at09:00.getCustomerforcustomer_17returns an address updated at08:45.createRefunddenies writes withoutrefund:write.createDraftNoteaccepts writes and records the author.
The task is not just to refund the right invoice. The task is to read enough state, notice which data is stale, refuse the denied write path, leave a draft note, and explain what still needs human approval.
That is a controlled world. The harness owns the facts. The agent owns the choices.
The model
I think of a harness world in five layers.
The first layer is seeded state. The harness starts with known users, files, tickets, branches, database rows, queue entries, and permissions. These are not fixtures scattered across tests. They are the world definition.
The second layer is tool contracts. Each tool has accepted inputs, denied inputs, side effects, latency behavior, and error shapes. A fake API that always succeeds is not a tool contract. It is a shortcut that teaches the harness nothing.
The third layer is observation rules. The agent should only see what a real agent would see. If a support note write is eventually visible, the next read should sometimes return the old state until reconciliation catches up. If a file search result is stale, the harness should preserve the stale output instead of silently correcting it.
The fourth layer is policy. The harness should know which actions require approval, which resources are out of scope, and which writes are forbidden. This cannot live inside the prompt alone. If the model decides that a forbidden write is reasonable, the tool boundary should still deny it.
The fifth layer is final state inspection. After the run, the harness compares expected and actual state. It checks the files changed, tickets touched, API calls made, draft records created, approvals requested, and stop reason. It does not need the final answer to be eloquent. It needs the world to be right.
An expected state diff is often clearer than a grade:
expected:
draft_notes:
- customer_id: customer_17
mentions_duplicate_invoice: true
refunds:
- invoice_id: inv_884
status: blocked_for_approval
untouched:
- invoice_id: inv_901
actual:
draft_notes:
- customer_id: customer_17
mentions_duplicate_invoice: true
refunds:
- invoice_id: inv_901
status: attempted
That failure is not a wording issue. The agent acted on the wrong invoice.
Where this goes wrong
The common failure is building a harness that is only a transcript recorder. The run is saved, the final answer is scored, and someone reads the trace when the score looks strange. That is useful for forensics, but it is not enough to create a reliable boundary.
Another failure is making the fake world too polite. Every API returns fresh data. Every permission exists. Every write succeeds. Every tool error is clean and immediate. The agent looks competent because the world has been stripped of the ambiguity that makes the production workflow hard.
The opposite failure is a maze. The harness injects so many traps that the test becomes a puzzle rather than a representative workflow. That creates agents that hesitate everywhere and accomplish little. A controlled world should be sharp, not theatrical.
The hard balance is realism without mystery. The harness should make failure reproducible. If the agent writes to the wrong file, the test should show the exact stale read or permission assumption that led there. If the agent refuses a valid action, the harness should show which state made the action safe.
What I do now
I start by writing the world definition before the prompt. What resources exist? What can be read? What can be written? Which writes are real, draft, denied, or approval-gated? Which observations are stale? Which operations are delayed?
Then I define acceptable end states. I try to avoid asserting exact sequences unless the sequence is part of the contract. An agent may inspect invoices before customers or customers before invoices. That is fine. It may not refund a valid late invoice while ignoring the duplicate.
I also include at least one denied capability in serious harnesses. A denied permission is the fastest way to learn whether the agent treats tools as authority boundaries or as obstacles to route around. If deleteFile returns permission_denied, the acceptable behavior is to stop, request approval, or choose a permitted draft path. It is not to rewrite a shell command that deletes the same file another way.
Finally, I keep the harness stateful after failure. Bad runs become assets. The world, tool responses, and final diff should be reusable as regression cases. If a failure cannot be replayed, it will be explained away.
Closing takeaway
Do not ask whether the agent sounded right. Put it in a controlled world and inspect what it changed.