Consensus Is a Tax: Pay It Only for the Right Decisions

A decision frame for when global agreement is worth the coordination cost.

Consensus is often discussed as if it were a badge of seriousness. A system needs correctness, so someone suggests global agreement. A control plane needs safety, so someone reaches for a protocol. A distributed workflow has edge cases, so the design starts drifting toward a single agreed order.

Sometimes that is right. Sometimes it is a very expensive way to avoid naming the actual decision.

The thesis

Consensus is a tax on latency, availability, operations, and cognitive load. Pay it only for decisions that truly need global agreement before progress.

Using consensus is not the mistake. The mistake is using it before identifying the specific fact that must be agreed upon. Global agreement is powerful because it narrows reality to a single answer that every participant accepts. That power should be spent on the smallest decision that needs it.

The production pattern

A distributed system has multiple nodes, workers, regions, partitions, or replicas. Most operations can proceed locally most of the time. Then a safety question appears.

Who is the leader? Which members are in the cluster? Which command happened first? Is this identifier unique? Is this lease still valid? Has this state crossed a point of no return?

These questions feel similar because they all involve coordination, but they are not the same decision. Treating them as one generic "consistency problem" leads to two bad designs.

One design overpays. It sends too much work through the consensus path, turning ordinary operations into quorum-dependent operations. The system becomes correct but slow, fragile under partial failure, and hard to operate.

The other design underpays. It avoids consensus with timestamps, caches, retries, local checks, or human procedure, even though the invariant requires a single answer. The system looks available until two actors make incompatible decisions and no local repair can fully undo them.

The trap

The trap is arguing about mechanisms before naming the decision.

Teams ask whether they need a consensus protocol, distributed lock, primary node, coordinator, quorum write, or strongly consistent store. Those are implementation shapes. The design review should start one level above: what decision must not have two winners?

Without that sentence, consensus spreads. A leader election mechanism becomes a place to put ordering. A membership store becomes a configuration database. A uniqueness check becomes part of the hot request path. A lock becomes a workflow engine.

The opposite trap is hand-waving coordination away. "Eventually consistent" is not a spell. If two systems can both believe they own the same resource, both process an irreversible action, or both advance exclusive state, the coordination still exists. It has merely moved into cleanup scripts, support queues, or unexplained product behavior.

The model

I use six decision categories when deciding whether consensus is worth the tax.

Leader election decides which actor is allowed to coordinate a class of work. The agreed fact is not the work itself. It is the identity and term of the coordinator. Keep this narrow, or every leader-dependent action inherits election risk.
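A minimal sketch of how narrow that agreed fact can be. The `LeadershipRegister` class below is a hypothetical in-memory stand-in for whatever consensus-backed register a real system would use, and the names `Leadership` and `try_claim` are assumptions for illustration. The only thing that goes through the agreement path is the pair of leader identity and term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Leadership:
    leader_id: str
    term: int


class LeadershipRegister:
    """In-memory stand-in for a consensus-backed register (hypothetical).

    It stores exactly one fact: who leads, and in which term.
    """

    def __init__(self) -> None:
        self._current: Optional[Leadership] = None

    def current(self) -> Optional[Leadership]:
        return self._current

    def try_claim(self, candidate_id: str, expected_term: int) -> Optional[Leadership]:
        """Claim leadership only if the term is still the one the candidate observed."""
        observed = self._current.term if self._current else 0
        if observed != expected_term:
            return None  # another candidate already won a newer term
        self._current = Leadership(candidate_id, expected_term + 1)
        return self._current


register = LeadershipRegister()
first = register.try_claim("node-a", expected_term=0)   # wins term 1
second = register.try_claim("node-b", expected_term=0)  # stale view, loses
```

The work the leader coordinates never touches the register. Only a change of coordinator does, which keeps election risk from leaking into every leader-dependent action.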

Membership decides who is part of the group making decisions. This matters because quorums mean little if the system disagrees about voters. Membership changes deserve careful sequencing because they modify the shape of agreement itself.
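A small worked example of why sequencing matters, assuming simple majority quorums. The helper names are hypothetical. If a three-voter group switches to a five-voter group in a single step, a majority of the old group and a majority of the new group can share no voter, so each side could decide independently.

```python
from itertools import combinations


def majorities(voters):
    """All subsets of voters that form a strict majority."""
    voters = sorted(voters)
    need = len(voters) // 2 + 1
    return [
        set(combo)
        for size in range(need, len(voters) + 1)
        for combo in combinations(voters, size)
    ]


def disjoint_quorums(old, new):
    """Pairs of (old majority, new majority) that share no voter."""
    return [(a, b) for a in majorities(old) for b in majorities(new) if not (a & b)]


old_voters = {"a", "b", "c"}
new_voters = {"a", "b", "c", "d", "e"}

# Jumping from 3 to 5 voters at once leaves room for two independent majorities.
conflicts = disjoint_quorums(old_voters, new_voters)

# Adding one voter at a time keeps every old majority overlapping every new one.
single_step = disjoint_quorums(old_voters, {"a", "b", "c", "d"})
```

Here `conflicts` includes the pair `{a, b}` and `{c, d, e}`, while `single_step` is empty. This is only an overlap check, not a membership-change protocol.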

Ordering decides the sequence of commands or events when different actors might observe different orders. Ordering is expensive and often overused. Many operations need causal or per-entity order, not one global sequence for everything.
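A minimal sketch of per-entity ordering, assuming each entity's counter lives with whichever shard owns that entity. `PerEntitySequencer` and the cart identifiers are hypothetical names. Commands for one entity get a strict order, and commands for different entities are never compared.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PerEntitySequencer:
    """Assigns an independent, gap-free sequence to each entity (hypothetical helper)."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = defaultdict(int)

    def assign(self, entity_id: str) -> int:
        self._last[entity_id] += 1
        return self._last[entity_id]


sequencer = PerEntitySequencer()
log: List[Tuple[str, int, str]] = []
commands = [("cart-1", "add"), ("cart-2", "add"), ("cart-1", "checkout")]
for entity_id, command in commands:
    log.append((entity_id, sequencer.assign(entity_id), command))
```

Nothing in `log` says whether the `cart-2` command happened before or after the `cart-1` checkout. If no invariant depends on that answer, a global sequence would be paying for an order nobody reads.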

Uniqueness decides that only one actor can claim a name, slot, resource, version, or identity. Some uniqueness can be scoped by shard, tenant, region, or time window. Global uniqueness should be treated as a product requirement, not a default database preference.
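A minimal sketch of scoped uniqueness, assuming the scope is a tenant. `ScopedNameRegistry` is a hypothetical in-memory stand-in. In a real system each scope could be served by its own shard, so agreement is only needed among the claims inside one scope.

```python
from typing import Dict, Tuple


class ScopedNameRegistry:
    """Enforces one owner per (scope, name) pair, not per name globally."""

    def __init__(self) -> None:
        self._owners: Dict[Tuple[str, str], str] = {}

    def claim(self, scope: str, name: str, owner: str) -> bool:
        """Return True if owner holds the name within this scope after the call."""
        existing = self._owners.setdefault((scope, name), owner)
        return existing == owner


registry = ScopedNameRegistry()
first_claim = registry.claim("tenant-a", "reports", "user-1")   # wins
same_scope = registry.claim("tenant-a", "reports", "user-2")    # loses
other_scope = registry.claim("tenant-b", "reports", "user-2")   # wins, different scope
```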

Fencing decides whether an old owner is still allowed to act. This is the part many lock designs miss. A lock without a fencing token can reduce duplicate work while still allowing a stale actor to write after losing authority.
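A minimal sketch of a fencing check, assuming the lock service hands out a token that increases every time ownership changes. The tokens are hard-coded integers here and `FencedStore` is a hypothetical name. The key point is that the protected resource itself rejects any write carrying a token older than the newest one it has seen.

```python
from typing import Optional


class FencedStore:
    """A resource that refuses writes from owners with stale tokens."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # the writer lost authority since acquiring this token
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
# Owner A acquired token 1, then stalled before writing.
# Owner B acquired token 2 in the meantime and writes first.
b_ok = store.write(2, "from-b")
# Owner A wakes up, still believing it holds the lock.
a_ok = store.write(1, "from-a")
```

Without the token comparison, owner A's late write would silently overwrite owner B's, even though the lock itself behaved as designed.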

Irreversible state decides whether a transition has crossed a boundary that cannot be safely replayed, duplicated, or contradicted. Money movement, destructive deletion, external commitment, and one-way workflow transitions often live here.
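A minimal sketch of a one-way transition table, using a hypothetical payment lifecycle as the example. The state names and the `transition` function are assumptions for illustration. Replaying the same transition is a no-op, and a terminal state cannot be left or contradicted.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    CANCELLED = "cancelled"


# Terminal states map to an empty set: nothing may follow them.
ALLOWED = {
    PaymentState.PENDING: {PaymentState.AUTHORIZED, PaymentState.CANCELLED},
    PaymentState.AUTHORIZED: {PaymentState.CAPTURED, PaymentState.CANCELLED},
    PaymentState.CAPTURED: set(),
    PaymentState.CANCELLED: set(),
}


def transition(current: PaymentState, target: PaymentState) -> PaymentState:
    """Return the new state, or raise if the move would contradict history."""
    if target == current:
        return current  # a replayed request changes nothing
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = transition(PaymentState.PENDING, PaymentState.AUTHORIZED)
state = transition(state, PaymentState.CAPTURED)
replayed = transition(state, PaymentState.CAPTURED)
```

The table only describes which moves are legal. Deciding which actor gets to make the move into `CAPTURED` is the part that may need agreement.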

The useful question is: which one of these decisions needs agreement, and can the scope be smaller?

Where this model breaks

Avoiding consensus can create worse hidden coordination. If the invariant is real, pretending it is local does not make the system simpler. It moves complexity into reconciliation, support queues, manual operations, and confusing edge states.

There are also mature systems where using a strongly consistent service for a narrow control-plane decision is simpler than building a bespoke alternative. Paying the tax through a well-understood component can be cheaper than inventing a half-consistent one.

The model also breaks when teams obsess over minimizing consensus and ignore user expectations. A product that promises exclusive ownership, immediate revocation, exact ordering, or irreversible commitment may need strong coordination. A weaker system may be technically elegant and product-wrong.

The goal is not to avoid consensus. The goal is to spend it only where disagreement would be more expensive.

What I do now

When a design mentions consensus, I ask for the decision sentence.

  • Decision: what fact must the system agree on?
  • Scope: is agreement global, regional, per tenant, per resource, or per workflow?
  • Timing: must agreement happen before user-visible success, or can it be deferred until finalization?
  • Failure behavior: what happens when quorum is unavailable?
  • Fencing: how are stale actors prevented from acting after authority changes?
  • Escape hatch: how is a bad decision repaired, audited, or overridden?
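
One way to make the checklist above hard to skip is to treat it as a record that a design document must fill in. This is a sketch only: the `DecisionSentence` class and `unanswered` helper are hypothetical, not part of any tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DecisionSentence:
    decision: str = ""
    scope: str = ""
    timing: str = ""
    failure_behavior: str = ""
    fencing: str = ""
    escape_hatch: str = ""


def unanswered(record: DecisionSentence) -> List[str]:
    """Names of the checklist questions the design has not answered yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


draft = DecisionSentence(
    decision="Which worker owns shard 7?",
    scope="per resource",
)
missing = unanswered(draft)
```

A draft that names the decision and scope but leaves the other four blank is a signal that the mechanism was chosen before the decision was understood.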

I try to keep consensus out of high-volume data paths unless the invariant demands it. I prefer scoping agreement by entity, tenant, partition, or control-plane operation. I also ask whether a weaker model would be honest: reservation instead of ownership, pending instead of final, approximate instead of exact, local order instead of global order.
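
A minimal sketch of "reservation instead of ownership" and "pending instead of final", assuming time is passed in explicitly. The `Reservation` class and its fields are hypothetical. The reservation can be granted cheaply, and only the `confirm` step would need to sit on the agreement path.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    resource: str
    holder: str
    expires_at: float
    final: bool = False

    def confirm(self, now: float) -> bool:
        """Promote the reservation to final ownership if it has not expired."""
        if self.final:
            return True
        if now >= self.expires_at:
            return False  # the hold lapsed; the holder must reserve again
        self.final = True
        return True


on_time = Reservation("seat-12", "alice", expires_at=100.0)
confirmed = on_time.confirm(now=50.0)

too_late = Reservation("seat-13", "bob", expires_at=100.0)
rejected = too_late.confirm(now=150.0)
```

The weaker model is honest only if the product shows the user a hold, not a purchase, until `confirm` succeeds.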

When consensus is required, I want operational clarity. Who owns the quorum system? How is membership changed? What does partial failure look like to users? What metrics prove the agreement path is healthy? How do operators distinguish slow consensus from a broken dependency?

The principal-engineer lens is to make coordination visible. Consensus is not only an algorithmic choice. It is an organizational commitment to operate a narrow source of truth under failure.

Closing takeaway

Use consensus only after naming the exact decision that cannot have two winners, then make the agreement scope as small as the invariant allows.