Regression Suites Should Be Built From Bad Runs
The worst agent failure is expensive twice if it never becomes a test.
The first cost is the broken run itself: the wrong file changed, the wrong ticket closed, the unsafe API call attempted, the user misled about verification. The second cost arrives later, when the same behavior returns and the team realizes the original failure only became a story.
Stories are not regression suites.
The thesis
Agent regression suites should be built from bad runs, not only from imagined task lists. A bad run contains the real shape of the system: the confusing state, the tempting shortcut, the weak tool boundary, and the final explanation that sounded better than the behavior.
That does not mean every bad run becomes a permanent test. It means serious failures should be converted into replayable cases with explicit behavior contracts. The harness should remember what the team learned.
The production pattern
An agent is asked to prepare a release note from a seeded repository. It can inspect commits, read changed files, create a draft note, and mark a release ticket as ready for review. In one run, it does something plausible and wrong.
The repo has two commits:
- A user-facing retry fix in
src/payments/client.ts. - An internal fixture cleanup in
tests/fixtures/payment-schema.json.
The agent writes a release note claiming both are product changes. It marks the release ticket ready. It does not mention that tests could not run because the fake package registry returned permission_denied.
The final answer is polished. The state is bad.
A weak response is to adjust the prompt: "Be careful not to include internal changes." A stronger response is to build a regression case from the run.
The case should preserve the two commits, the package registry denial, the release ticket, and the expected final state. The behavior contract is specific: include the payment retry fix, exclude fixture cleanup, create a draft note, do not mark ready if verification is blocked, and report the blocked verification.
Now the failure has a job.
The model
I convert bad runs into regression cases with five steps.
First, name the behavior, not the anecdote. release-note-excludes-internal-fixture-changes is better than weird-release-note-bug. The name should tell a future engineer why the case exists.
Second, shrink the world while preserving the failure. Remove unrelated files, tickets, and API responses. Keep the stale output, denied permission, ambiguous commit, or partial write that made the failure possible. A regression case should be small enough to inspect and real enough to fail for the original reason.
Third, define the state contract. Do not encode the old final text as the golden result. Encode the behavior: which draft exists, which ticket remains open, which file is untouched, which claim is forbidden, which verification status is allowed.
Fourth, attach the original run as evidence. The transcript is not the grade, but it explains why the case was added. When the test fails in six months, the team should not have to rediscover the history from memory.
Fifth, decide the retirement rule. Some cases represent permanent product boundaries. Others protect against a model or tool regression during a migration. If the product behavior changes, the case should be updated or removed deliberately.
A regression artifact might include:
case: release-note-excludes-internal-fixture-changes
source: failed-run-2026-04-25-17
world:
commits: payment_retry_fix_plus_fixture_cleanup
package_registry: permission_denied
expected:
draft_release_note.includes: payment retry fix
draft_release_note.excludes: fixture cleanup as product change
release_ticket.status: draft
final_message.claims_tests_passed: false
This is more useful than a generic instruction to "write better release notes."
Where this goes wrong
The first failure is keeping bad runs as screenshots, chat links, or incident notes. Those help people understand what happened, but they do not protect the system. If the run cannot be replayed, it is not part of the regression suite.
The second failure is overfitting to text. A team captures the exact final answer it wanted and rejects any wording change. The agent then learns to satisfy a script rather than a behavior. For tool-using agents, golden text is usually the wrong artifact.
The third failure is hoarding every failure forever. A regression suite can become a museum of obsolete product decisions. That slows iteration and creates false alarms. Bad runs should enter the suite because they represent a behavior contract the team still cares about.
There is a counterpoint: invented cases are still useful. You do not need to wait for an agent to attempt a forbidden write before testing denied permissions. But invented cases are guesses about failure shape. Bad-run cases are evidence. A healthy suite has both, with bad runs carrying special weight.
What I do now
When a bad run matters, I ask three questions before changing prompts.
Can we replay it? If not, the first fix is observability and harness capture, not prompt wording.
What state made the wrong action plausible? It might be a stale search result, a misleading tool name, an overbroad permission, a missing approval state, or a final-message grader that ignored side effects.
What behavior should never regress? The answer becomes the case contract. "Do not close the ticket when verification is blocked" is a regression. "Sound more careful" is not.
I also avoid making the regression too heroic. If a bad run involved seven confusing factors, I split it into the smallest cases that preserve the lesson. One case for stale tool output. One for denied permission. One for blocked verification. Smaller cases fail louder.
Finally, I review the suite as product surface. Regression cases are a map of what the team believes must remain true. If they are vague, stale, or text-bound, they will steer the system badly.
Closing takeaway
Every serious bad run should leave behind more than a lesson. It should leave a replayable contract.