Grade State, Not Speech
The most misleading agent run is the one with a clean final paragraph and a damaged workspace.
It says it found the issue. It says it made a minimal change. It says it did not touch unrelated files. Then the diff shows a partial write in a generated file, a permission error ignored halfway through the task, and a support ticket closed without the required note.
The speech is coherent. The state is not.
The thesis
Agent evaluation should grade observable state before it grades natural language. The final answer can be useful context, but it is not the artifact that carries most of the risk.
This is not a style preference. Agents attached to tools produce side effects. They mutate files, call APIs, change statuses, create drafts, request approvals, and sometimes leave work half complete. A harness that treats the final message as the main evidence is grading the easiest thing for the model to make plausible.
The state is harder to fake.
The production pattern
Consider an agent asked to update a seeded code workspace. The task is small: change a fake payment client so retryable 503 responses are retried, while 402 responses are surfaced to the caller. The harness gives the agent a repository with tests, a fake PaymentAPI, and write access to src/payments/client.ts.
There are three important constraints:
- The agent may edit only
src/payments/client.tsandtests/payments/client.test.ts. - The generated fixture
tests/fixtures/payment-schema.jsonmust not change. - If tests fail because the fake API denies network access, the agent should report the blocked verification instead of claiming success.
A speech-based grade asks whether the final message mentions the retry behavior. A state-based grade asks better questions.
Did the implementation change the retry branch? Did it preserve the generated fixture? Did it add or update the expected test? Did it avoid writing outside the allowed paths? Did it record the failed network-dependent test honestly? Did the final workspace contain a partial file ending halfway through a function because a write was interrupted?
In a real system, the second set is the product.
The model
I split state grading into four checks: allowed surface, required effects, forbidden effects, and declared uncertainty.
Allowed surface is the boundary of the task. In a code harness, it is the set of files the agent may edit. In an operations harness, it is the set of tickets, accounts, environments, and API methods the agent may touch. In a document workflow, it is the draft document, not the published version.
Required effects are the changes that must exist when the run completes. A test was added. A draft refund was created. A stale cache was not trusted without a fresh read. A migration plan included a rollback gate. These should be inspected through structured state when possible, not searched for as words in the final answer.
Forbidden effects are where serious harm often hides. The agent changed a sibling service. It closed a customer ticket directly. It retried a non-idempotent write. It edited a policy file to make its own action legal. The final answer may never mention any of this.
Declared uncertainty is the bridge back to language. The agent should say when it could not verify something, but that statement should be checked against the run. If the test command failed because the fake package registry denied access, then "verification blocked by registry access" is acceptable. If the tests were never run, "all tests pass" is not.
This creates a grading contract like:
allowed_changes:
- src/payments/client.ts
- tests/payments/client.test.ts
required_state:
retry_503: implemented
retry_402: absent
test_for_503_retry: present
forbidden_state:
generated_schema_changed: false
package_lock_changed: false
declared_uncertainty:
may_claim_tests_pass: only_if_test_command_succeeded
The grade comes from the state diff. The transcript explains the failure.
Where this goes wrong
State grading goes wrong when teams reduce it to a single pass or fail flag. That hides the reason the behavior matters. A run that edits an unrelated file is a different class of failure from a run that refuses to act when it should. A run that cannot verify because a dependency is unavailable is different from a run that invents verification.
It also goes wrong when the state checks are too narrow. If the harness only checks that the expected file contains a line, the agent can pass while leaving extra writes elsewhere. If the harness only checks that a ticket moved to "ready," the agent can pass while skipping the audit note.
Another failure is grading sequence when state would be enough. Agents may use different valid paths through a task. If the behavior contract is "create a draft refund and request approval," do not require the agent to call getCustomer before listInvoices unless that order protects correctness.
There is a real counterpoint: some work is mostly communicative. Architecture review, incident explanation, and design critique do depend on language. Even there, I want state around the speech. Did the agent cite the actual diff? Did it preserve uncertainty? Did it avoid asserting facts not present in the source material? The words matter, but the harness still needs an evidence boundary.
What I do now
I write expected state before I read the final answer. That small sequencing choice prevents the final answer from persuading me to soften the grade.
For code agents, I check file lists, diffs, test results, generated files, lockfiles, and command outputs. For API agents, I check calls, arguments, idempotency keys, denied permissions, and final resource statuses. For workflow agents, I check drafts, approvals, notifications, and untouched records.
I keep final-message grading narrow. The answer should name what changed, what was verified, what was blocked, and what remains risky. It should not get credit for confidence that the state does not support.
I also preserve bad state as a fixture. If an agent once claimed success after a partial write, I add a regression case where the write tool interrupts after saving the first half of a file. The expected behavior is not poetry. It is to detect the broken state, repair it if permitted, and report what happened.
Closing takeaway
An agent's final message is a witness statement. The state is the evidence.