Engineering

Good Distributed Systems Are Designed Around Boredom

Why predictable recovery paths matter more than clever steady-state architecture.

Many distributed systems look impressive in the happy path. Requests flow through clean diagrams, workers scale, queues absorb bursts, and dashboards show reassuring curves. The real design is revealed when something ordinary goes wrong.

The best systems make recovery boring.

The thesis

Good distributed systems are designed around boredom: predictable failure modes, rehearsed recovery, and obvious ownership matter more than clever steady-state architecture.

Cleverness is not the enemy. Unrehearsed cleverness is.

The production pattern

A system handles normal traffic well. Then a dependency slows down, a deploy partially rolls out, a worker pool falls behind, a schema change meets old code, or a retry loop amplifies pressure. None of this is exotic. These are routine facts of production.

Yet the response often depends on improvisation. Someone remembers a command. Someone knows which queue can be drained. Someone knows which flag is safe. Someone knows which metric is lying. The system works because a few people carry operational folklore.

That is not reliability. That is luck with tenure.

The model

I review distributed systems through a boredom lens.

First, failure shape. What happens when each dependency is slow, unavailable, inconsistent, or returns unexpected data? The answer should be specific. "It retries" is not enough.
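
One way to make "specific" checkable is to write the answers down as data and flag the gaps. The sketch below is only an illustration: the dependency name `pricing-service`, the behavior strings, and the helper `missing_behaviors` are all hypothetical, not part of any real system.

```python
from enum import Enum


class FailureMode(Enum):
    SLOW = "slow"
    UNAVAILABLE = "unavailable"
    INCONSISTENT = "inconsistent"
    UNEXPECTED_DATA = "unexpected_data"


# Hypothetical table: one named behavior per dependency and failure mode.
BEHAVIORS = {
    "pricing-service": {
        FailureMode.SLOW: "time out after 300 ms and serve the cached price",
        FailureMode.UNAVAILABLE: "serve the cached price and mark it stale",
        FailureMode.INCONSISTENT: "prefer the ledger value and log the mismatch",
        FailureMode.UNEXPECTED_DATA: "reject the response and alert the owner",
    },
}


def missing_behaviors(table):
    """Return {dependency: [failure modes with no named behavior]}."""
    gaps = {}
    for dependency, modes in table.items():
        absent = [mode for mode in FailureMode if mode not in modes]
        if absent:
            gaps[dependency] = absent
    return gaps
```

A check like this could run in a design review or in CI. An empty result does not prove the behaviors are right, only that someone had to name them.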

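Second, pressure behavior. Under stress, does the system shed load, queue work, back off, degrade features, or amplify the problem? Many systems fail because every component independently tries to be helpful: each layer retries on its own schedule, and the combined retries multiply load on the dependency that is already struggling.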
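
As a minimal sketch of shedding load rather than amplifying it, assume a single in-process intake with a fixed capacity. The class name `BoundedIntake` and its methods are mine, chosen for illustration; a real system would more likely put this limit at a broker or gateway.

```python
import queue


class BoundedIntake:
    """Accepts work up to a fixed capacity, then sheds instead of queueing forever."""

    def __init__(self, capacity: int) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self.shed_count = 0  # expose how much work was refused

    def submit(self, job) -> bool:
        """Return True if the job was queued, False if it was shed."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            self.shed_count += 1
            return False

    def depth(self) -> int:
        return self._queue.qsize()
```

The design choice is that the caller gets an immediate, explicit refusal. That is less helpful in the moment and much easier to reason about than an unbounded backlog.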

Third, recovery path. Can an operator understand the current state and choose a safe action? If recovery requires reading code during an incident, the design is unfinished.

Fourth, reconciliation. After partial failure, how does the system return to truth? Which source wins? What is safe to replay? What must be manually reviewed?
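
Those questions can be answered in code before an incident forces them. The sketch below assumes, purely for illustration, that one store is the declared source of truth and that both stores fit in dictionaries. The names `ReconcilePlan` and `reconcile` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    replay: list = field(default_factory=list)     # missing downstream, safe to re-send
    overwrite: list = field(default_factory=list)  # downstream differs, source wins
    review: list = field(default_factory=list)     # unknown to the source, a human decides


def reconcile(source: dict, downstream: dict) -> ReconcilePlan:
    """Compare the source of truth against a downstream copy and classify each key."""
    plan = ReconcilePlan()
    for key, value in source.items():
        if key not in downstream:
            plan.replay.append(key)
        elif downstream[key] != value:
            plan.overwrite.append(key)
    for key in downstream:
        if key not in source:
            plan.review.append(key)
    return plan
```

The function only produces a plan. Keeping classification separate from execution lets an operator read the plan before anything is replayed or overwritten.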

Fifth, ownership. Who can act without asking for permission from three other groups? Boring recovery needs authority as much as tooling.

My checklist:

  • Known failure modes have named behaviors
  • Retries have limits, jitter, and idempotent effects
  • Queues expose age, size, and poison patterns
  • Operators can pause, drain, replay, and disable safely
  • Runbooks explain decision points, not just commands
  • Recovery has been tested outside a real incident
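
The retry item is the easiest one to show concretely. This is a sketch under stated assumptions: `TransientError` and `call_with_retry` are hypothetical names, the operation is assumed to accept an idempotency key and to deduplicate on it, and the delays use capped exponential backoff with full jitter.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.2, max_delay=5.0,
                    sleep=time.sleep):
    """Call operation(idempotency_key) with a hard attempt limit and jittered backoff."""
    # One key for all attempts, so the receiver can drop duplicates.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise  # limit reached: surface the failure instead of looping
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))  # full jitter spreads out retry storms
```

The `sleep` parameter exists so the backoff can be exercised in a test without waiting. Errors other than `TransientError` are not retried at all.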

I separate boring design into detection, containment, correction, and learning.

Detection answers the first question quickly: what is happening, and how bad is it? Containment prevents helpful components from making the situation worse. Correction gives operators a safe path back to a known state. Learning turns the new fact into a changed limit, test, alert, or operating rule.

Most brittle systems skip one of these. They detect a problem but cannot contain it. They contain it but have no repair path. They repair it manually but never encode what was learned. The next incident then starts from almost the same place, just with more tired people.

The boring version is less theatrical. It says, "When this dependency slows, we stop accepting new work for this path, keep existing work durable, expose backlog age, and resume only after reconciliation catches up." That is not glamorous architecture, but it is architecture people can operate.
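
That sentence translates almost directly into a small state machine. The sketch below is illustrative only: `PathGate` is a hypothetical name, the in-memory deque stands in for a durable queue, and "reconciliation has caught up" is simplified to "the backlog is empty".

```python
import time
from collections import deque


class PathGate:
    """Pauses intake for one path when its dependency slows; resumes once caught up."""

    def __init__(self, slow_threshold_s: float, clock=time.monotonic) -> None:
        self.slow_threshold_s = slow_threshold_s
        self.clock = clock
        self.paused = False
        self.backlog = deque()  # (enqueued_at, job); stand-in for durable storage

    def record_dependency_latency(self, seconds: float) -> None:
        if seconds > self.slow_threshold_s:
            self.paused = True  # stop accepting new work for this path

    def accept(self, job) -> bool:
        if self.paused:
            return False
        self.backlog.append((self.clock(), job))
        return True

    def backlog_age_s(self) -> float:
        """Age of the oldest queued job, the number an operator wants to see."""
        if not self.backlog:
            return 0.0
        return self.clock() - self.backlog[0][0]

    def complete_oldest(self):
        return self.backlog.popleft()[1]

    def resume(self) -> bool:
        """Reopen intake only when the existing work has been worked off."""
        if self.backlog:
            return False
        self.paused = False
        return True
```

Note that pausing never discards queued work, and `resume` refuses rather than guesses. Both choices favor a dull recovery over a fast one.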

I also want the boring path to have authority attached to it. A pause button is useless if no one knows who may press it. A replay tool is dangerous if the operator cannot tell which effects are repeatable. Recovery design is partly permission design: the people carrying the pager need enough authority to protect the system without inventing a governance process during stress.
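
Both halves of that point, repeatability and permission, can be made explicit in tooling. In this sketch the `Effect` type, the `on-call` role, and both helper functions are hypothetical. The point is only that the tool, not the operator's memory, carries the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    name: str
    repeatable: bool  # declared by the owning team, not inferred at incident time


# Hypothetical policy: roles allowed to pause without further approval.
PAUSE_ALLOWED_ROLES = frozenset({"on-call"})


def plan_replay(effects):
    """Split effects into those safe to replay and those held for manual review."""
    safe = [e.name for e in effects if e.repeatable]
    held = [e.name for e in effects if not e.repeatable]
    return safe, held


def may_pause(role: str) -> bool:
    return role in PAUSE_ALLOWED_ROLES
```

A replay tool built this way can print the held list before doing anything, which turns "is this safe?" from a judgment call into a lookup.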

Where this goes wrong

Designing for boredom can become over-engineering. Some systems do not need elaborate recovery controls because their blast radius is small, their data is reproducible, or downtime is acceptable. Reliability work must be proportional to consequence.

There is also a product counterpoint. Sometimes a temporary manual recovery path is the right tradeoff while product fit is still uncertain. Automating every edge case early can waste effort and slow learning.

The key is honesty about risk. "We accept manual repair for this workflow because volume is low and the business can tolerate delay" is a responsible statement. "We will figure it out if it happens" is not a design.

What I do now

In design reviews, I ask teams to describe the most boring incident they can imagine. Not the dramatic outage. The mundane one: a delayed queue, a duplicate event, a partial deploy, an expired credential, or a stuck worker.

Then I ask what the newest on-call engineer would do. If the answer depends on hidden history, we write that history down or simplify the mechanism.

Principal engineers should care deeply about the ordinary failure path because that is where operational cost accumulates. A system that needs brilliance during routine failure will eventually exhaust the people who operate it.

Closing takeaway

Judge a distributed system by how dull its recovery is. If routine failure requires heroics, the architecture is not done.