The New Skill Is Reviewing Machine Work

Machine-generated code often arrives in a shape that looks reviewable before it has earned trust. It is formatted. It names things plausibly. It may include tests. That polish can lower the reviewer's guard.

The new senior skill is not accepting or rejecting AI output by vibe. It is reviewing machine work as a compressed bundle of assumptions.

The thesis

Review of AI-generated code should focus less on style and more on intent, contracts, failure modes, and evidence.

Style still matters, but it is rarely where the expensive bugs hide. The expensive bugs hide in places where the generated code misunderstood the shape of the system.

The production pattern

A human patch usually carries social context. You can ask why the author chose a path, what they considered, and which constraints shaped the implementation.

Machine work arrives without real ownership. The diff may be coherent, but its explanation is not evidence. A model can give a confident reason after the fact, whether or not that reason matches what the code does. The reviewer has to reconstruct whether the code actually satisfies the intended change.

This changes the review posture. I do not ask first, "Is this code clean?" I ask, "What claim is this code making about the system?"

The model

My review rubric has five passes.

The first pass is intent. I write down the behavior change in plain language, then check whether the diff implements that behavior or merely resembles a solution.

The second pass is contract. I look for public APIs, persisted data, event formats, permissions, retries, timeouts, and operator workflows. Generated code often changes contracts accidentally because the prompt did not name them.
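One way I make the contract pass concrete is a check that pins a serialized shape, so an accidental rename shows up as a failure instead of a quiet diff line. The sketch below is illustrative only: `order_created_event`, `contract_violations`, and every field name are hypothetical, not taken from a real system.

```python
import json
from typing import List

# Hypothetical contract: the exact keys downstream consumers are assumed to rely on.
EXPECTED_KEYS = {"event", "order_id", "total_cents", "currency"}


def order_created_event(order_id: str, total_cents: int, currency: str = "USD") -> str:
    """Serialize a hypothetical order-created event as JSON."""
    payload = {
        "event": "order.created",
        "order_id": order_id,
        "total_cents": total_cents,
        "currency": currency,
    }
    return json.dumps(payload, sort_keys=True)


def contract_violations(serialized: str) -> List[str]:
    """Describe how a payload's keys differ from the pinned contract."""
    keys = set(json.loads(serialized))
    problems = []
    for missing in sorted(EXPECTED_KEYS - keys):
        problems.append(f"missing field: {missing}")
    for extra in sorted(keys - EXPECTED_KEYS):
        problems.append(f"unexpected field: {extra}")
    return problems
```

A generated patch that renames `order_id` to `orderId` would produce two violations here, which is the kind of change a prompt rarely names and a reviewer easily skims past.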

The third pass is failure modes. I check empty input, malformed input, partial state, duplicate events, concurrent calls, dependency failure, and rollback. Agents are good at happy paths and common tests. Production rarely fails on the common path.
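Two of those cases, malformed input and duplicate events, can be sketched in a few lines. This is a minimal in-memory example under stated assumptions: the `Ledger` class and the `id` and `amount_cents` fields are hypothetical, and it does not address concurrency, persistence, or rollback.

```python
class Ledger:
    """Hypothetical in-memory ledger that applies each credit event at most once."""

    def __init__(self):
        self.balance_cents = 0
        self._seen_event_ids = set()

    def apply(self, event: dict) -> bool:
        """Apply a credit event. Return True if applied, False if it was a duplicate."""
        event_id = event.get("id")
        amount = event.get("amount_cents")
        # Malformed input: fail loudly instead of guessing at intent.
        if not isinstance(event_id, str) or not event_id:
            raise ValueError("event needs a non-empty string 'id'")
        if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
            raise ValueError("'amount_cents' must be a positive integer")
        # Duplicate delivery: acknowledge it without changing state.
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        self.balance_cents += amount
        return True
```

When I review a generated handler, I look for the equivalent of these two branches. If neither exists, the happy-path tests tell me very little.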

The fourth pass is test quality. I do not give much credit for tests that mirror the implementation. I want tests that pin the requirement, protect the bug class, and fail for the previous behavior.
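The difference between a test that mirrors the implementation and one that pins the requirement is easier to see side by side. The example below is a sketch with invented names: `apply_discount_previous` stands in for the old behavior, and the rule "a total is never negative" is an assumed requirement.

```python
def apply_discount_previous(total_cents: int, discount_cents: int) -> int:
    """Stand-in for the previous (buggy) behavior: the total could go negative."""
    return total_cents - discount_cents


def apply_discount(total_cents: int, discount_cents: int) -> int:
    """Fixed behavior: a discount never pushes the total below zero."""
    return max(total_cents - discount_cents, 0)


def mirrors_implementation(fn) -> bool:
    """Weak check: restates the arithmetic, so it passes for buggy and fixed code alike."""
    return fn(1000, 300) == 1000 - 300


def pins_requirement(fn) -> bool:
    """Stronger check: states the rule on the input where the bug lived."""
    return fn(300, 1000) == 0
```

The weak check passes against both versions, so it protects nothing. The stronger check fails against `apply_discount_previous`, which is the property I want from a regression test.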

The fifth pass is maintainability. I ask whether a future engineer can debug this at 2 a.m. without knowing the prompt that produced it.

The compact checklist:

Does the diff solve the stated problem, not a nearby problem?
Did any external contract change by accident?
Are edge cases tested as requirements, not implementation details?
Are new abstractions smaller than the complexity they remove?
Is the operational behavior clear under failure and rollback?
Would I be willing to own this code without mentioning the agent?

Where this goes wrong

One failure mode is excessive suspicion. If every generated patch gets treated as untrusted garbage, the team loses the benefit. Human code also contains wrong assumptions, cargo-cult patterns, and overfitted tests.

The better distinction is not human versus machine. It is whether the change has accountable intent and sufficient evidence.

Another failure mode is reviewing only the final diff. For larger changes, the useful review artifact may include the task framing, files intentionally left untouched, rejected alternatives, and verification output. Without that context, the reviewer is forced to infer too much.

What I do now

For small changes, I review machine output like any other patch, but I spend extra attention on contracts and tests.

For larger changes, I ask for a review note with four sections:

Intended behavior
Files and contracts touched
Verification performed
Known risks or assumptions

If that note is vague, I assume the implementation is vague too.
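The four sections can be treated as a small structure so that an empty section is caught before a human reads the note. This is a rough sketch, not a standard format: the `ReviewNote` class and its field names are my own illustration, and an empty section is only the crudest form of vagueness.

```python
from dataclasses import dataclass, field
from typing import List


def _has_content(items: List[str]) -> bool:
    """True if at least one entry is more than whitespace."""
    return any(item.strip() for item in items)


@dataclass
class ReviewNote:
    """Hypothetical structure for the four-section review note."""

    intended_behavior: str = ""
    files_and_contracts_touched: List[str] = field(default_factory=list)
    verification_performed: List[str] = field(default_factory=list)
    known_risks_or_assumptions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, in the order the note lists them."""
        filled = {
            "Intended behavior": bool(self.intended_behavior.strip()),
            "Files and contracts touched": _has_content(self.files_and_contracts_touched),
            "Verification performed": _has_content(self.verification_performed),
            "Known risks or assumptions": _has_content(self.known_risks_or_assumptions),
        }
        return [name for name, ok in filled.items() if not ok]
```

Under this sketch, a note with no known risks should say so explicitly, for example with a "none known" entry, so that silence is never mistaken for a considered answer.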

I also prefer agents to produce narrower patches. Reviewing one precise behavior change is productive. Reviewing a diff that mixes cleanup, feature work, test rewrites, and dependency churn is a tax on everyone.
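A crude way to spot a mixed patch is to bucket the changed paths and count the buckets. The sketch below assumes simple path heuristics that I made up for illustration; `categorize`, `breadth_report`, `is_too_broad`, and the `max_categories` threshold are all hypothetical, and a real repository would need its own rules. It also cannot tell cleanup from feature work inside source files, which still takes a human read.

```python
from typing import Dict, List

# Assumed lockfile and manifest names; adjust for the ecosystem in use.
DEPENDENCY_FILES = ("requirements.txt", "package-lock.json", "go.sum", "cargo.lock")


def categorize(path: str) -> str:
    """Bucket a changed path using rough, hypothetical heuristics."""
    name = path.lower()
    if name.endswith(DEPENDENCY_FILES):
        return "dependencies"
    if "/tests/" in f"/{name}" or name.split("/")[-1].startswith("test_"):
        return "tests"
    if name.endswith((".md", ".rst")):
        return "docs"
    return "source"


def breadth_report(changed_paths: List[str]) -> Dict[str, List[str]]:
    """Group changed paths by category."""
    report: Dict[str, List[str]] = {}
    for path in changed_paths:
        report.setdefault(categorize(path), []).append(path)
    return report


def is_too_broad(changed_paths: List[str], max_categories: int = 2) -> bool:
    """Flag a patch that touches more categories than the chosen threshold."""
    return len(breadth_report(changed_paths)) > max_categories
```

A bug fix that touches one source file and its test stays under the default threshold. Add a lockfile change and the patch gets flagged, which is a prompt to ask for a split rather than a verdict on the code.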

The review debt test

The question I keep coming back to is whether the generated work has reduced review debt or moved it somewhere harder to see.

A good machine-assisted patch should leave a reviewer with fewer unknowns than a hand-written patch of the same size. It should name the boundary it touched, show the checks it ran, and make the risky assumptions easy to attack. If the patch is large but the explanation is small, the debt has not disappeared. It has been transferred to the reviewer.

I use three signals to decide whether to send the work back.

First, does the patch contain unexplained breadth? If a focused bug fix also rewrites helpers, renames variables, changes fixtures, and adjusts formatting, I assume the agent optimized for local coherence instead of reviewability.

Second, does the patch make a system claim without a system test? A generated change can look correct inside one function while violating behavior at the API, job, or workflow boundary.

Third, can the author defend the result without appealing to the tool? "The agent produced it" is not a reason. The responsible engineer needs to explain why the change is the right one.

That is the real shift in skill. The reviewer is not just checking code quality. The reviewer is checking whether ownership survived automation.

Closing takeaway

Review AI-generated code by finding the assumptions inside the diff, then demanding evidence for the ones that could hurt production.