Your System Is Already Partitioned

A practical taxonomy of everyday micro-partitions and the design habits that make them survivable.

Partition tolerance sounds like a property reserved for dramatic network failures. In production, partitions are usually smaller and more ordinary. One process is slow, one cache is stale, one owner is unavailable, and one deploy wave is ahead of another.

The thesis

Your system is already partitioned. The question is whether the design admits it.

Many architectures talk about partitions as rare events: a region isolated, a network cable cut, a database leader unavailable. Those events matter, but they are not the only partitions that shape user experience. Slow DNS, garbage collection pauses, noisy neighbors, deploy skew, stale caches, rate limits, retry queues, and ownership gaps all split the system into parts that temporarily disagree about time, state, or responsibility.

The practical problem is not whether a partition can happen. It is whether the system can keep making honest decisions while some parts see a different world.

The production pattern

The common pattern starts with a service that is healthy by its own dashboard. Error rates look fine. CPU is acceptable. The database is reachable. Yet users see inconsistent behavior. One request reads old configuration. Another reaches a new code path. A retry lands on a worker with a warmer cache. A scheduled job uses credentials that an interactive path has already rotated. A message consumer pauses long enough that ordering assumptions become fiction.

Nobody calls this a partition because packets still move. That naming gap is expensive. Engineers keep looking for a single outage while the actual failure is disagreement between components that are all partially alive.

Distributed systems do not need total failure to become distributed. They only need independent clocks, caches, queues, owners, deploys, and dependencies.

The model

I classify everyday partitions into four types: transport, time, state, and ownership.

Transport partitions are about reachability and delivery. Requests time out, DNS resolves slowly, connections pool badly, retries saturate a dependency, or one path can call a service while another cannot. The design question is what callers are allowed to assume when the answer is missing.

Time partitions happen when components observe the world at different speeds. A job is delayed, a cache refresh lags, a deployment rolls gradually, or an event arrives after the user has moved on. The design question is whether old information can still trigger new effects.
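
One way to keep old information from triggering new effects is to attach a validity window to each event and check it at the point of use. The sketch below is illustrative only: `Event`, `valid_for`, and `handle` are hypothetical names, and it assumes the producer and consumer clocks are close enough to compare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    created_at: float  # seconds on the producer's clock (assumed comparable to ours)
    valid_for: float   # how long this event may still trigger effects, in seconds


def should_apply(event: Event, now: float) -> bool:
    """Return True only while the event is inside its validity window."""
    return now - event.created_at <= event.valid_for


def handle(event: Event, now: float) -> str:
    if not should_apply(event, now):
        # Too old to act on: report it so it can be reconciled, do not apply it.
        return "expired"
    return "applied"
```

The check runs where the effect would happen, not where the event was produced, because the delay is only visible on the consuming side.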

State partitions appear when replicas, caches, indexes, ledgers, and search views disagree. The design question is which state is authoritative for which decision, and how conflicts are repaired.
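
A versioned read can be sketched as a store that returns a version with every value and rejects writes based on an outdated one. This is a single-process illustration under stated assumptions: `Store`, `Versioned`, and `StaleWrite` are hypothetical names, and a real store would need to make the check-and-write step atomic.

```python
from dataclasses import dataclass


class StaleWrite(Exception):
    """The caller decided on a version that is no longer current."""


@dataclass(frozen=True)
class Versioned:
    value: str
    version: int


class Store:
    def __init__(self) -> None:
        self._data: dict[str, Versioned] = {}

    def read(self, key: str) -> Versioned:
        # Missing keys read as version 0 so the first write has something to expect.
        return self._data.get(key, Versioned(value="", version=0))

    def write(self, key: str, value: str, expected_version: int) -> Versioned:
        current = self.read(key)
        if current.version != expected_version:
            # Someone else changed the record after the caller read it.
            raise StaleWrite(
                f"{key}: expected v{expected_version}, found v{current.version}"
            )
        updated = Versioned(value=value, version=current.version + 1)
        self._data[key] = updated
        return updated
```

The rejected write is the useful part: it turns a silent disagreement between two views into an explicit conflict that something has to repair.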

Ownership partitions happen when no single group owns the whole outcome. A platform owns the API, another group owns data quality, another owns support operations, and nobody owns reconciliation. The design question is who acts when the system is technically up but the user promise is broken.

This taxonomy matters because each partition type needs a different response. A circuit breaker helps with transport. Deadlines help with time. Versioned reads help with state. Clear runbooks and ownership contracts help with ownership gaps.

Where this goes wrong

The first mistake is measuring only hard availability. A dependency can be reachable and still too slow for the product promise. A cache can respond quickly and still be too stale for the decision being made. A queue can be draining and still violate the user's expectation of freshness.

The second mistake is allowing retries to erase evidence. Retries can bridge a transport gap, but they can also cross time and state partitions in surprising ways. A retry after a timeout might run against newer configuration, a different schema version, or a changed entitlement.
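
A retry loop can keep that evidence by pinning the configuration version seen on the first attempt and refusing to continue once it changes. The sketch below is a minimal illustration, not a recommended library: `call_with_pinned_version`, `VersionChanged`, and the `current_version` callback are hypothetical, and it assumes the caller has some cheap way to read the active version.

```python
from typing import Callable


class VersionChanged(Exception):
    """A retry would have run against a different config version."""

    def __init__(self, pinned: int, observed: int, attempt: int) -> None:
        super().__init__(
            f"config moved from v{pinned} to v{observed} before attempt {attempt}"
        )
        self.pinned = pinned
        self.observed = observed
        self.attempt = attempt


def call_with_pinned_version(
    operation: Callable[[], str],
    current_version: Callable[[], int],
    max_attempts: int = 3,
) -> str:
    pinned = current_version()  # the world as seen by the first attempt
    for attempt in range(1, max_attempts + 1):
        if attempt > 1:
            observed = current_version()
            if observed != pinned:
                # Stop and surface the change instead of retrying across it.
                raise VersionChanged(pinned, observed, attempt)
        try:
            return operation()
        except TimeoutError:
            continue  # transport gap only: safe to try again at the same version
    raise TimeoutError(f"no answer after {max_attempts} attempts at v{pinned}")
```

Whether the caller then restarts the whole operation or gives up is a product decision. The point of the sketch is only that the change is reported, not absorbed.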

The third mistake is assuming deploy skew is harmless because the rollout is short. Short is not the same as safe. If old and new versions interpret state differently, even a brief overlap can create durable confusion.

The counterpoint is that not every micro-partition deserves heavy machinery. Some stale reads are acceptable. Some asynchronous updates are part of the product. Some internal dashboards can lag without consequence. The discipline is to decide explicitly which promises require a single view of truth and which can tolerate temporary disagreement.

What I do now

I ask reviewers to name the partition they are designing for. "What if the dependency is down?" is too narrow. I ask what happens if it is slow, if the cache is old, if an event is late, if old and new versions are deployed at the same time, if a retry crosses a state change, and if the owning group is unavailable.

I prefer designs that make staleness visible. Read models should know when they were computed. Cached decisions should carry freshness. Asynchronous workflows should expose whether work is queued, delayed, blocked, or reconciled.
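
As a rough sketch of what carrying freshness can look like, a cached value can travel with the time it was computed so that each caller applies its own tolerance. `Fresh`, `computed_at`, and `is_usable` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fresh:
    value: object
    computed_at: float  # when the source of truth was last consulted, in seconds

    def age(self, now: float) -> float:
        return now - self.computed_at

    def is_usable(self, now: float, max_age: float) -> bool:
        """Let each decision state how much staleness it can tolerate."""
        return self.age(now) <= max_age
```

The same cached entitlement can then be good enough for a dashboard and too old for a billing decision:

```python
entitlement = Fresh(value=True, computed_at=1000.0)
entitlement.is_usable(now=1005.0, max_age=60.0)  # dashboard: acceptable
entitlement.is_usable(now=1005.0, max_age=2.0)   # billing: too stale
```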

I treat version compatibility as partition tolerance. During a rollout, old and new code are different participants in the same distributed system. They need to share durable state safely, tolerate unknown fields, and avoid assuming that every caller changed at once.
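
Tolerating unknown fields usually means two things: defaulting fields that older writers never set, and writing back fields that this version does not understand. A minimal sketch, assuming records are stored as JSON objects; the `Account` type and its field names are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Any

KNOWN_FIELDS = {"id", "plan"}


@dataclass
class Account:
    id: str
    plan: str = "free"  # default covers records written before the field existed
    extra: dict[str, Any] = field(default_factory=dict)  # fields we do not understand

    @classmethod
    def from_json(cls, raw: str) -> "Account":
        data = json.loads(raw)
        return cls(
            id=data["id"],
            plan=data.get("plan", "free"),
            extra={k: v for k, v in data.items() if k not in KNOWN_FIELDS},
        )

    def to_json(self) -> str:
        # Write unknown fields back so older code does not erase newer data.
        return json.dumps({**self.extra, "id": self.id, "plan": self.plan})
```

Without the `extra` round trip, an old worker that reads and rewrites a record during a rollout would silently drop whatever the new version added.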

I also look for ownership partitions in design docs. If a user-facing promise crosses services, somebody must own the promise, not just a component. A principal engineer has to notice when the architecture is technically decomposed but operationally orphaned.

Finally, I prefer graceful refusal over dishonest progress. If the system cannot make a correct decision because the relevant view is stale or unreachable, it should say so, defer the decision, or restrict the operation.
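
In code, graceful refusal tends to show up as a third outcome next to allow and deny. The sketch below is one hypothetical shape for it: `decide_refund`, its parameters, and the five-second threshold are assumptions for illustration, not a prescription.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    DEFER = "defer"  # the system cannot decide honestly right now


def decide_refund(
    balance: Optional[float],  # None means the ledger view was unreachable
    balance_age: float,        # seconds since the balance was read
    amount: float,
    max_age: float = 5.0,      # assumed tolerance for this decision
) -> Decision:
    if balance is None or balance_age > max_age:
        # Missing or stale view: refuse to guess in either direction.
        return Decision.DEFER
    return Decision.ALLOW if balance >= amount else Decision.DENY
```

Callers have to handle `DEFER` explicitly, which is the design choice: the uncertainty reaches the product surface instead of being collapsed into a confident answer.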

Closing takeaway

Design for the partitions you already have: missing transport, mismatched time, divergent state, and split ownership.