The Minimum Viable Process

Process has a bad reputation with engineers because it is usually added as a tax on everyone after a failure that a few specific controls would have prevented.

The useful question is not whether a team needs more process. It is which observed failure class deserves a lightweight control.

The thesis

The minimum viable process is the smallest repeatable mechanism that prevents a known failure mode without slowing unrelated work. Good process is specific risk control, not generalized caution.

That makes process an engineering design problem. It should have a trigger, an owner, an artifact, an escalation path, and a retirement condition. If those are missing, the process is likely to become habit instead of leverage.

The production pattern

A team gets burned. A migration surprises another group. A dependency change breaks a contract. A risky deploy has no rollback owner. A product decision reaches implementation before the operational cost is understood. Everyone agrees the failure should not repeat.

The common response is to add a broad gate: more approvals, more required documents, more mandatory meetings, more fields in a template. The next few risky changes improve because people are alert. Then the process remains after the memory fades. Small changes pay the same coordination cost as large ones. Engineers route around the gate. Reviewers skim. The organization keeps the ceremony and loses the safety.

Minimum viable process starts from the opposite direction. It asks what exact class of mistake happened and what minimal mechanism would catch that class next time.

The model

I use five fields before adding process.

Trigger: When does this process apply? The trigger must be observable. "Important change" is too vague. Better triggers include schema changes, cross-service contract changes, user-visible migrations, irreversible data updates, new external dependencies, new operational ownership, or changes above a defined blast-radius band.

Owner: Who is accountable for the process working? A process with no owner becomes folklore. The owner maintains the template, handles exceptions, collects feedback, and removes stale steps.

Artifact: What durable thing does the process produce? It might be a decision record, migration checklist, rollback plan, dependency review, launch note, or operational readiness review. Meetings can help, but the artifact is what survives.

Escalation: What happens when the process reveals disagreement or missing evidence? If escalation is unclear, teams either block each other informally or pretend concerns are resolved.

Retirement: When should the process be removed, narrowed, or automated? Every process should have a way to die.

These fields force process to declare its purpose. They also make it easier to say no to process that is emotionally satisfying but operationally vague.
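One way to make the five fields concrete is to treat a process proposal as a small record and refuse it while any field is blank. The sketch below is illustrative only: the class name ProcessProposal, the function missing_fields, and the list of vague trigger phrasings are assumptions made for this example, not part of any existing tool.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProcessProposal:
    """One proposed control, described by the five fields."""
    trigger: str      # observable condition, e.g. "schema change"
    owner: str        # person or team accountable for the process working
    artifact: str     # durable output, e.g. "rollback plan"
    escalation: str   # where disagreement or missing evidence goes
    retirement: str   # condition for removing, narrowing, or automating


# Hypothetical examples of trigger phrasings too vague to observe.
VAGUE_TRIGGERS = {"important change", "risky change", "big change"}


def missing_fields(proposal: ProcessProposal) -> list[str]:
    """Return names of fields that are blank or, for the trigger, vague."""
    problems = [f.name for f in fields(proposal)
                if not getattr(proposal, f.name).strip()]
    is_vague = proposal.trigger.strip().lower() in VAGUE_TRIGGERS
    if is_vague and "trigger" not in problems:
        problems.append("trigger")
    return problems
```

A proposal with an empty retirement field comes back as ["retirement"], which turns "this process has no way to die" into a specific objection rather than a general unease.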

Where this goes wrong

The counterpoint is that some environments need heavier controls. Regulated domains, privacy-sensitive systems, financial correctness, and high-blast-radius infrastructure cannot rely only on informal judgment. Minimum viable process does not mean minimal responsibility. It means the control should match the risk.

Another failure mode is adding process only after pain. Some risks are predictable before the first incident. If a change can corrupt durable data, expose private information, or make recovery impossible, waiting for an observed failure is irresponsible. In those cases, the "known failure mode" can come from industry patterns and first principles, not only local history.

The model also fails when teams use it to avoid coordination. A process that is too small to protect the boundary is just politeness. If multiple groups are affected, the trigger and owner must reflect that reality.

There is a subtler trap: process can become a substitute for architecture. If every change needs five humans to remember the right thing, the system may be missing safer defaults. Good process should often point toward tooling, stronger interfaces, or simpler ownership.

What I do now

When someone proposes a new gate, I ask what failure it prevents. If the answer is a story, I ask for the failure class. Was it missing ownership, missing rollback, unknown dependency, unclear decision authority, insufficient testing, weak observability, or surprise cost? The failure class determines the control.

I prefer process that sits close to the work. A migration checklist near the migration tool is better than a distant policy page. A required decision record for cross-boundary contracts is better than a recurring meeting that people attend without a decision to make.

I also make exceptions explicit. If a team can bypass the process, the bypass should name the reason and owner. That keeps urgency from becoming invisible precedent.
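A minimal sketch of an explicit bypass, assuming the exceptions are kept in a simple in-memory log: the names Bypass and record_bypass are hypothetical, and a real team would more likely store this next to the change itself. The only rule it enforces is the one stated above: no bypass without a reason and an owner.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bypass:
    process: str   # which process was skipped
    reason: str    # why the bypass was needed
    owner: str     # who accepted the risk
    on: date       # when the bypass happened


def record_bypass(log: list, process: str, reason: str,
                  owner: str, on: date) -> Bypass:
    """Append a bypass to the log, refusing entries with no reason or owner."""
    if not reason.strip() or not owner.strip():
        raise ValueError("a bypass must name a reason and an owner")
    entry = Bypass(process, reason.strip(), owner.strip(), on)
    log.append(entry)
    return entry
```

Because every entry carries a reason, the log can later be read as evidence: repeated bypasses for the same reason suggest the trigger is too broad.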

For principal engineers, the hard part is pricing the organization-wide tax. A process that costs ten minutes per change across hundreds of changes may be more expensive than the rare failure it prevents. Conversely, a process that slows a few high-risk changes may be cheap insurance. The cost is not only meeting time. It is delay, context switching, review fatigue, and the quiet training of engineers to wait for permission.
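The pricing can be roughed out with simple arithmetic. The sketch below reads the ten minutes as a per-change cost and compares the yearly time tax with the expected yearly cost of the failure. All function names and every number in the usage note are made up for illustration, and the calculation counts direct time only, so it understates the delay, context switching, and review fatigue described above.

```python
def annual_process_cost_hours(minutes_per_change: float,
                              changes_per_year: int) -> float:
    """Direct time spent on the process per year, in engineer-hours."""
    return minutes_per_change * changes_per_year / 60


def expected_failure_cost_hours(failures_per_year: float,
                                hours_per_failure: float) -> float:
    """Expected yearly cost of the failure the process is meant to prevent."""
    return failures_per_year * hours_per_failure


def process_pays_off(minutes_per_change: float, changes_per_year: int,
                     failures_per_year: float, hours_per_failure: float,
                     prevention_rate: float = 1.0) -> bool:
    """True if the avoided failure cost exceeds the time tax.

    prevention_rate is the assumed fraction of failures the process catches.
    """
    tax = annual_process_cost_hours(minutes_per_change, changes_per_year)
    avoided = prevention_rate * expected_failure_cost_hours(
        failures_per_year, hours_per_failure)
    return avoided > tax
```

With hypothetical inputs, ten minutes on each of 600 changes is 100 hours a year, which does not pay off against a failure expected once every two years at 40 hours (20 hours a year). Narrowing the trigger to 30 high-risk changes drops the tax to 5 hours, and the same control becomes cheap insurance.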

I review processes like code. Is the trigger still right? Are reviewers adding value? Has the artifact been used during a later incident or design review? Can tooling remove a step? Should the process narrow to a smaller class of changes? If nobody can answer, the process is probably running on memory rather than evidence.
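One of these questions, whether the artifact has been used since it was written, can be checked mechanically if the team records the last time it was consulted. This is a sketch under that assumption; the function name needs_review and the 180-day default are arbitrary choices for illustration, not a recommended threshold.

```python
from datetime import date, timedelta
from typing import Optional


def needs_review(last_artifact_use: Optional[date], today: date,
                 max_idle_days: int = 180) -> bool:
    """Flag a process whose artifact was never used or has gone unused too long."""
    if last_artifact_use is None:
        return True  # no recorded use: the process is running on memory
    return today - last_artifact_use > timedelta(days=max_idle_days)
```

A flag here is a prompt to ask the remaining questions, not a verdict that the process should be removed.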

Closing takeaway

Add process only when you can name the failure class, the trigger, the owner, the artifact, the escalation path, and the condition that should retire it.