Adversary Agents

An adversary agent is not a villain character in the harness.

That framing is entertaining and usually unhelpful. The useful adversary is a fixture generator with a narrow job: produce hostile inputs that make the harness prove its boundaries. It does not get production credentials. It does not attack live systems. It does not decide whether a failure is acceptable.

It creates test material that a reviewer can inspect, keep, and replay.

The thesis

An adversary agent should manufacture bad conditions, not perform bad actions.

The difference matters. A production harness should never need a model with permission to behave maliciously against real assets. What it needs is a steady supply of uncomfortable fixtures: poisoned documents, malformed tool outputs, conflicting instructions, stale approvals, misleading filenames, duplicate identifiers, and user requests that are almost valid but cross an important line.

The adversary's job is to make the refusal path, capability checks, and state assertions work under pressure.

The production pattern

The pattern shows up after a harness starts passing the happy path too easily.

The worker can read a ticket, inspect code, propose a patch, run a test, and summarize the result. The demo looks good. Then a real run includes a dependency file that says "ignore previous instructions," a fixture with a path like ../../secrets, a tool response that omits a required field, or a stale approval that looks current in the conversation but is expired in the control plane.

If the harness only tests cooperative inputs, it will ship a system that sounds careful while accepting hostile state.

An adversary agent is given the public contract of the harness, a safe fixture directory, and a list of boundaries to pressure. For example, in a code-review harness it might create:

a pull request description that asks the agent to approve its own changes
a test fixture containing prompt-injection text in a README
a filename that looks like a config file but is outside the allowed path
a tool response where the status is success but the artifact is missing
a user request that asks for a write after a read-only approval

The adversary creates malicious fixtures. The harness consumes them in a controlled test. A human or deterministic checker decides whether the behavior was acceptable.

The model

The adversary agent needs a contract as tight as any other harness component.

Inputs: the capability model, tool schemas, approval rules, known bad runs, refusal requirements, and sanitized examples of real work. It should not receive production secrets or private incident details.

Permissions: write access only to a fixture workspace or test-data branch. It may generate files, fake tool responses, synthetic tickets, and adversarial prompts. It may not call production tools, mutate real queues, or approve its own fixtures.

Outputs: fixtures plus expected assertions. The expected assertions are as important as the bad input. "The worker must refuse to write," "the supervisor must require approval," "the trace must record context provenance," and "the final state must remain unchanged" are testable expectations.

Failure modes: theatrical attacks that do not resemble the harness contract, fixtures so bizarre that nobody keeps them, overfitting to the current prompt, leaking sensitive examples into tests, and confusing refusal speech with refusal behavior.

Review path: accepted fixtures become regression cases. Rejected fixtures are either simplified, discarded, or turned into a note about an unsupported threat model. The adversary does not get to declare victory.

Where this goes wrong

The common mistake is treating adversary work as a jailbreak contest. The output becomes a collection of clever strings instead of a pressure test for the system boundary.

For production harnesses, the interesting failures are often dull. A write tool accepts a relative path. A worker treats a retrieved document as a user instruction. A stale approval remains in the prompt after it has expired in the control plane. A model promises it did not change state, but the trace shows a tool call did.

Another mistake is giving the adversary too much knowledge. If it sees private implementation details, it may create fixtures that prove only that the test has memorized the system. Useful adversarial fixtures should be derived from public contracts, prior failures, and explicit threat categories.

A third mistake is scoring text instead of state. A harness that says "I cannot do that" and then calls the write tool has failed. A harness that refuses tersely and leaves state unchanged has passed, even if the refusal message is not elegant.

What I do now

I start adversary work with a small taxonomy.

Instruction confusion: untrusted content tries to act like a higher-priority instruction.

Capability confusion: a request asks the worker to use a tool outside the approved permission.

State confusion: stale, partial, or contradictory data makes the correct next step ambiguous.

Boundary confusion: paths, identifiers, tenants, environments, or branches are made to look similar enough to invite a mistake.

For each category, I ask the adversary to create a fixture and an expected state assertion. The fixture is not done until it says what must remain true after the run. That one requirement filters out a lot of noise.

I also keep adversary outputs reviewable. A malicious fixture should be small enough that an engineer can understand it in a minute. If it requires a long explanation, it is probably testing the adversary's imagination more than the harness.

The strongest adversarial suites are built from bad runs. When a production-like harness fails, the replay and audit traces should feed the adversary. The adversary then creates a smaller fixture that captures the failure without carrying private history.

Closing takeaway

An adversary agent earns its place when it turns vague fear into durable fixtures. If the fixture does not produce a clear state assertion, it is not yet a test.