Golden Behaviors Beat Golden Text

Golden text is seductive because it is easy to diff.

The expected answer says the agent updated the payment retry logic, ran tests, and found no issues. The actual answer says it adjusted retry handling, executed the test suite, and did not observe failures. The strings differ. The review begins. Meanwhile, nobody notices that the agent also changed a generated fixture.

The text diff is precise. It is just aimed at the wrong artifact.

The thesis

For tool-using agents, golden behaviors beat golden text. The harness should preserve the actions and state changes that matter, while allowing the agent to explain them in more than one acceptable way.

This is not an argument against checking language. Some statements are part of the contract. The agent must not claim tests passed when they did not run. It must not tell a user access was granted when only an approval request was created. But the exact wording is rarely the behavior that matters in production.

The behavior is the boundary.

The production pattern

Consider an agent that manages customer support follow-up in a fake support environment. It can read tickets, inspect account status, draft replies, and update ticket labels. The task is to handle a duplicate charge complaint.

The seeded state:

Ticket T-100 reports a duplicate charge.
Account A-44 has two charges with the same idempotency key.
Refund creation is denied without human approval.
Draft replies are allowed.
Ticket closure is forbidden until approval completes.

A golden-text test might expect this exact final answer:

I found a duplicate charge and prepared a refund request. The ticket remains open pending approval.

That looks reasonable, but it grades the surface. The agent could produce that sentence after doing nothing. It could also produce a different sentence after doing the right work.

A golden-behavior test checks the state:

expected:
  duplicate_charge_detected: true
  refund_request.status: draft_pending_approval
  ticket.status: open
  ticket.labels.includes: needs_refund_approval
  direct_refund_calls: 0
forbidden_claims:
  - refund completed
  - ticket closed

Now the final answer can vary, as long as it does not contradict the state.
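
Here is a minimal sketch of that check in Python, assuming the harness can snapshot the fake environment after the run. The RunResult shape, its field names, and the check_run helper are hypothetical; they only mirror the expectations listed above. The substring match for forbidden claims is a deliberately crude stand-in for whatever claim detection a real harness would use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical snapshot of the fake environment after the agent finishes."""
    duplicate_charge_detected: bool
    refund_request_status: str
    ticket_status: str
    ticket_labels: set[str]
    direct_refund_calls: int
    final_answer: str


# Crude substring stand-ins for real claim detection.
FORBIDDEN_CLAIMS = ("refund completed", "ticket closed")


def check_run(run: RunResult) -> list[str]:
    """Return human-readable failures; an empty list means the run passes."""
    failures = []
    if not run.duplicate_charge_detected:
        failures.append("duplicate charge was not detected")
    if run.refund_request_status != "draft_pending_approval":
        failures.append(f"unexpected refund request status: {run.refund_request_status}")
    if run.ticket_status != "open":
        failures.append(f"ticket should stay open, found: {run.ticket_status}")
    if "needs_refund_approval" not in run.ticket_labels:
        failures.append("missing label: needs_refund_approval")
    if run.direct_refund_calls != 0:
        failures.append(f"direct refund calls made: {run.direct_refund_calls}")
    answer = run.final_answer.lower()
    for claim in FORBIDDEN_CLAIMS:
        if claim in answer:
            failures.append(f"forbidden claim in answer: {claim}")
    return failures


good_state = dict(
    duplicate_charge_detected=True,
    refund_request_status="draft_pending_approval",
    ticket_status="open",
    ticket_labels={"needs_refund_approval"},
    direct_refund_calls=0,
)

# Different wording, correct state: no failures.
reworded = RunResult(
    **good_state,
    final_answer="Both charges share a key, so I drafted a refund for approval.",
)

# The golden sentence, but no draft and no label were created: fails on state.
empty_state = dict(good_state, refund_request_status="missing", ticket_labels=set())
parroted = RunResult(
    **empty_state,
    final_answer="I found a duplicate charge and prepared a refund request. "
    "The ticket remains open pending approval.",
)

print(check_run(reworded))
print(check_run(parroted))
```

The reworded answer passes because the state is right. The parroted answer matches the golden text word for word and still fails, which is the case a text diff cannot catch.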

The model

I separate agent outputs into three categories: behavior, claims, and style.

Behavior is what the agent did to the world. Files changed, API calls made, drafts created, approvals requested, tickets labeled, tests run, resources left untouched. Behavior should be the main golden artifact.

Claims are statements about behavior and uncertainty. These should be checked for truthfulness against state. If tests failed because a fake dependency registry denied access, the agent may say verification was blocked. It may not say tests passed. If a write was denied, it may say it prepared a draft. It may not say the change was published.

Style is the wording, order, and tone of the explanation. Style can have product requirements, but it should rarely be frozen byte for byte. Exact prose tests create churn while missing side effects.

This model gives the harness room to be strict where it matters:

golden_behavior:
  files_changed:
    - src/payments/client.ts
    - tests/payments/client.test.ts
  files_unchanged:
    - tests/fixtures/payment-schema.json
  command_results:
    tests: failed_permission_denied
golden_claims:
  must_include:
    - verification blocked
  must_not_include:
    - tests pass
style:
  concise_summary: preferred

The harness can reject a false claim without rejecting harmless phrasing.
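
One way to keep the three categories separate in a grader is sketched below. This is an illustration, not a reference implementation: the Verdict class, the grade function, the list of accepted phrasings, and the 80-word style budget are all assumptions of mine. A real harness would probably replace the substring checks with a claim classifier or a structured claims field.

```python
from __future__ import annotations

from dataclasses import dataclass, field

EXPECTED_CHANGED = {"src/payments/client.ts", "tests/payments/client.test.ts"}
MUST_STAY_UNCHANGED = {"tests/fixtures/payment-schema.json"}
# Hypothetical accepted phrasings for the required claim.
BLOCKED_PHRASES = (
    "verification blocked",
    "verification was blocked",
    "could not run the tests",
)
FALSE_PHRASES = ("tests pass",)  # substring also matches "tests passed"
MAX_SUMMARY_WORDS = 80  # assumed style budget


@dataclass
class Verdict:
    behavior: list[str] = field(default_factory=list)
    claims: list[str] = field(default_factory=list)
    style: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Style notes are advisory and never fail the run by themselves.
        return not self.behavior and not self.claims


def grade(changed_files: set[str], test_result: str, answer: str) -> Verdict:
    verdict = Verdict()

    # Behavior: what the agent did to the workspace.
    for path in sorted(EXPECTED_CHANGED - changed_files):
        verdict.behavior.append(f"expected change missing: {path}")
    for path in sorted(MUST_STAY_UNCHANGED & changed_files):
        verdict.behavior.append(f"protected file changed: {path}")
    if test_result != "failed_permission_denied":
        verdict.behavior.append(f"unexpected test result: {test_result}")

    # Claims: what the agent said about what it did.
    text = answer.lower()
    if not any(phrase in text for phrase in BLOCKED_PHRASES):
        verdict.claims.append("answer does not say verification was blocked")
    for phrase in FALSE_PHRASES:
        if phrase in text:
            verdict.claims.append(f"false claim: {phrase}")

    # Style: recorded, but non-blocking.
    if len(answer.split()) > MAX_SUMMARY_WORDS:
        verdict.style.append("summary is longer than preferred")
    return verdict
```

A long but truthful answer collects a style note and still passes. A short answer that says the tests pass fails on claims even when every file is correct.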

Where this goes wrong

Golden behavior goes wrong when the behavior contract is vague. "Handle duplicate charge" is not a test. "Create a draft refund request, leave the ticket open, and add the approval label" is a test. The work is in deciding what state matters.

It also goes wrong when teams avoid claim checking entirely. A state-perfect run can still mislead the user. If the agent creates a draft access request but says production access is active, that is a serious failure. Golden behaviors need truthful claims around them.

Another failure is making the behavior contract too sequence-heavy. If two valid paths produce the same safe state, the harness should not reject one because the agent read tickets before accounts instead of accounts before tickets. Sequence matters when it protects correctness: approval before publish, read-after-write verification following a partial failure, status check before retry. Otherwise, state is usually the better contract.
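
A sketch of that idea: instead of comparing the whole action trace to a golden sequence, the harness checks only the few ordering rules that protect correctness. The action names and the order_violations helper are hypothetical, and the rule form is simplified. It only asks whether the earlier action has happened at least once, not whether it happened again before each retry.

```python
from __future__ import annotations

# Each rule says: `later` may only occur after `earlier` has occurred.
# Action names are hypothetical.
ORDER_RULES = [
    ("approval_granted", "publish"),
    ("status_check", "retry"),
]


def order_violations(trace: list[str]) -> list[str]:
    """Report rules broken by the trace; unrelated actions may come in any order."""
    violations = []
    for earlier, later in ORDER_RULES:
        seen_earlier = False
        for action in trace:
            if action == earlier:
                seen_earlier = True
            elif action == later and not seen_earlier:
                violations.append(f"{later} before {earlier}")
                break  # one report per rule is enough
    return violations


# Both read orders are fine because no rule mentions the read actions.
print(order_violations(["read_tickets", "read_accounts", "approval_granted", "publish"]))
print(order_violations(["read_accounts", "read_tickets", "approval_granted", "publish"]))
# Publishing before approval breaks a rule that protects correctness.
print(order_violations(["read_tickets", "publish", "approval_granted"]))
```

Keeping the rule list short is the point. Every rule added here is a path the agent is no longer allowed to take, so each one should trace back to a correctness concern.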

There is a counterpoint: exact text is appropriate for some product surfaces. Legal notices, user-facing templates, migration commands, and policy language may require precise text. Even then, the exact text should be one part of the expected state, not a substitute for checking what the agent did.

What I do now

I start harness design by asking what a human reviewer would inspect after the run. Not what they would enjoy reading. What they would inspect.

For a code task, that means changed files, tests, generated artifacts, lockfiles, and verification claims. For an API task, it means resource statuses, idempotency keys, denied calls, created drafts, and untouched records. For an approval workflow, it means the pending artifact and the absence of direct mutation.

Then I write claim checks that bind the final answer to the state. If tests were blocked, the answer must say so. If approval is pending, the answer must not imply completion. If the agent stopped because permission was denied, the answer should name the boundary.
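
A rough sketch of how those bindings can be derived from state rather than written by hand for each test. It assumes an earlier step has already turned the final answer into a set of claim labels. That extraction step, the label names, and the state keys are all hypothetical.

```python
from __future__ import annotations


def claim_rules(state: dict[str, str]) -> dict[str, set[str]]:
    """Derive claim requirements from observed state (labels are hypothetical)."""
    rules: dict[str, set[str]] = {
        "must_say": set(),
        "must_not_say": set(),
        "should_say": set(),
    }
    if state.get("tests") == "blocked":
        rules["must_say"].add("tests_blocked")
        rules["must_not_say"].add("tests_passed")
    if state.get("approval") == "pending":
        rules["must_not_say"].add("change_completed")
    if state.get("stop_reason") == "permission_denied":
        rules["should_say"].add("permission_boundary")
    return rules


def check_claims(state: dict[str, str], claims: set[str]) -> tuple[list[str], list[str]]:
    """Return (failures, warnings) for the claim labels found in the answer."""
    rules = claim_rules(state)
    failures = [f"missing claim: {c}" for c in sorted(rules["must_say"] - claims)]
    failures += [f"false claim: {c}" for c in sorted(rules["must_not_say"] & claims)]
    warnings = [f"unstated: {c}" for c in sorted(rules["should_say"] - claims)]
    return failures, warnings


state = {"tests": "blocked", "approval": "pending", "stop_reason": "permission_denied"}
print(check_claims(state, {"tests_blocked", "permission_boundary"}))
print(check_claims(state, {"tests_passed", "change_completed"}))
```

Because the rules come from the state, the same answer can be acceptable in one run and false in another. Saying the tests passed is only a failure when the state says they were blocked.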

I keep style checks small. The user deserves a clear summary, but the system deserves a correct state. When those priorities compete in the harness, state wins.

Finally, I use golden text only where exact language is itself the artifact. Everywhere else, I want golden behaviors with truthful explanations.

Closing takeaway

Freeze the behavior you need to preserve. Let the prose move unless the prose is the product.