Engineering

The Queue Is Not the Architecture

How to reason about async workflows by ownership, ordering, failure semantics, and reconciliation.

Adding a queue can make a system feel more scalable, more resilient, and more modern. It can also hide the hardest questions behind a comforting rectangle labeled "async."

The queue is a mechanism. It is not the architecture.

The thesis

Async architecture is defined by ownership, ordering, failure semantics, and reconciliation. The queue only transports work between those decisions.

If those decisions are unclear, the queue will preserve the ambiguity and deliver it later.

The production pattern

A synchronous workflow becomes too slow or too coupled. A team introduces a queue so the caller can return quickly and workers can process later. This may be the right move.

Then reality arrives. Messages are duplicated. Some are delayed. Some arrive after related state has changed. A downstream dependency rejects a subset of work. Operators need to replay messages, but replay creates side effects. Product managers ask why the user sees "complete" when downstream work is still pending.

The queue did not create all of these problems. It made the existing semantics impossible to ignore.

The model

I use four questions for async workflow design.

First, ownership: who owns the work after enqueue? The caller, the worker, or a workflow owner? If ownership transfers, what does the caller promise the user?

Second, ordering: which operations must happen in sequence, and which merely tend to arrive that way? Ordering is expensive. Do not buy it globally when only one entity needs it locally.
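One common way to buy ordering locally is to route every message for an entity to the same ordered lane. The sketch below is illustrative only: `partition_for`, `events_for`, the four-lane count, and the order events are assumptions, and the in-memory lists stand in for whatever partitions or message groups a real broker offers.

```python
import hashlib
from collections import defaultdict


def partition_for(entity_id: str, partitions: int) -> int:
    """Map an entity to a stable partition so its messages share one ordered lane."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


# Hypothetical events as (entity_id, event) pairs, in the order they were produced.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]

# Each lane is a FIFO list; different lanes could be drained by different workers.
lanes = defaultdict(list)
for entity_id, event in events:
    lanes[partition_for(entity_id, 4)].append((entity_id, event))


def events_for(entity_id: str) -> list:
    """Read back one entity's events from its lane, in lane order."""
    lane = lanes[partition_for(entity_id, 4)]
    return [event for eid, event in lane if eid == entity_id]
```

Order is preserved within each entity, while nothing is promised across entities. That is usually the cheaper guarantee, and often the only one the workflow needs.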

Third, failure semantics: what happens after retry exhaustion, poison messages, dependency failure, or invalid input? A dead-letter queue is a parking lot, not a resolution strategy.

Fourth, reconciliation: how does the system detect and repair divergence between intended state and completed work? Async systems need a way to ask, "What should be true now?"
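A reconciliation pass can be as small as a diff between two durable views. This is a minimal sketch, assuming intended and completed state can each be read as a mapping from entity ID to outcome; the `reconcile` function and the order data are hypothetical.

```python
def reconcile(intended: dict, completed: dict) -> dict:
    """Compare what should be true with what was actually done."""
    # Work that should exist but never completed.
    missing = sorted(k for k in intended if k not in completed)
    # Work that completed with a different outcome than intended.
    mismatched = sorted(
        k for k in intended if k in completed and completed[k] != intended[k]
    )
    # Completed work that no durable record asked for.
    orphaned = sorted(k for k in completed if k not in intended)
    return {"missing": missing, "mismatched": mismatched, "orphaned": orphaned}


# Hypothetical snapshot of the two views.
intended = {"order-1": "shipped", "order-2": "refunded", "order-3": "shipped"}
completed = {"order-1": "shipped", "order-2": "shipped", "order-4": "shipped"}

report = reconcile(intended, completed)
```

Each bucket implies a different repair: missing work can be re-enqueued, mismatched work usually needs review, and orphaned work points at a producer or replay bug.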

My async checklist:

  • Message describes intent, not just implementation command
  • Consumer is safe against duplicate delivery
  • User-visible state accounts for pending and failed work
  • Retry policy matches downstream failure modes
  • Poison messages have an owner and review path
  • Reconciliation job can rebuild truth from durable state
  • Dashboards show age and stuck work, not only message count
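The duplicate-delivery item on the checklist can be sketched as a consumer that remembers which message IDs it has already applied. Everything here is an assumption for illustration: the `IdempotentConsumer` class, the `id` and `amount` fields, and the in-memory set, which stands in for a durable store.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by a producer-assigned message ID."""

    def __init__(self):
        # Stand-in for a durable store. In a real system, recording the ID and
        # applying the effect would need to happen in the same transaction.
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.processed_ids:
            return False
        self.balance += message["amount"]
        self.processed_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
first = consumer.handle({"id": "m-1", "amount": 10})
duplicate = consumer.handle({"id": "m-1", "amount": 10})
```

The second delivery is acknowledged without repeating the side effect, which is what makes replay safe.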

The sharper design artifact is usually a workflow contract. It states who may enqueue work, what durable record proves the work should exist, which state transitions are legal, what makes work obsolete, and which side effects are allowed after delay.

That contract should include time. A message that is correct for five seconds may be wrong after five hours. Inventory may change, permissions may be revoked, pricing may expire, or a user may cancel the intent. Async design needs freshness checks, cancellation semantics, and a way to turn stale work into a visible terminal state.
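A freshness check can run at the moment a worker picks the work up, before any side effect. The sketch below assumes each unit of work records when it was created and how long it stays valid; the `Work` type, the `decide` function, and the state names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Work:
    work_id: str
    created_at: datetime
    max_age: timedelta
    cancelled: bool = False
    state: str = "pending"


def decide(work: Work, now: datetime) -> str:
    """Choose the next state at pickup time, before any side effect runs."""
    if work.cancelled:
        work.state = "cancelled"   # the user withdrew the intent
    elif now - work.created_at > work.max_age:
        work.state = "expired"     # stale work becomes a visible terminal state
    else:
        work.state = "running"
    return work.state


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = Work("w-1", created, max_age=timedelta(minutes=5))
stale = Work("w-2", created, max_age=timedelta(minutes=5))

fresh_state = decide(fresh, created + timedelta(seconds=5))
stale_state = decide(stale, created + timedelta(hours=5))
```

Expired and cancelled work ends in a named state rather than disappearing, so the user-facing status and the dashboards can both account for it.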

I am also cautious with retry optimism. Retrying is useful when the failure is transient and the effect is idempotent. Retrying is dangerous when the downstream system is rejecting invalid state, when pressure is already high, or when each attempt emits a new side effect. A retry policy is a production decision, not a library default.
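One way to make that decision explicit is to classify failures before retrying. This is a sketch under stated assumptions: `TransientError`, `PermanentError`, and `run_with_retry` are hypothetical names, the operation is assumed to be idempotent, and the backoff numbers are placeholders rather than recommendations.

```python
import time


class TransientError(Exception):
    """The downstream system may succeed if asked again later."""


class PermanentError(Exception):
    """The downstream system rejected the request; retrying will not help."""


def run_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                   sleep=time.sleep):
    """Retry only transient failures, with capped exponential backoff.

    Returns (outcome, detail, attempts). The operation must be idempotent.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return ("done", operation(), attempt)
        except PermanentError as exc:
            # Invalid state is not retried; it goes straight to review.
            return ("rejected", str(exc), attempt)
        except TransientError as exc:
            if attempt == max_attempts:
                return ("exhausted", str(exc), attempt)
            # Back off so retries do not add pressure to a struggling dependency.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

The three outcomes map to different owners: "done" is finished, "rejected" needs someone to fix the input or state, and "exhausted" is a dependency problem that a dead-letter path should surface with context.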

One useful review move is to read the message schema as if it were the only surviving artifact. Could a new owner understand the intent, scope, freshness, and repair path? If not, the queue is carrying private context from the producer's code. That context will disappear exactly when replay, backfill, or incident response needs it most.
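As an illustration of a schema that passes this test, the envelope below carries intent, scope, freshness, and repair path as explicit fields. The `WorkMessage` type and every field name and value are assumptions made for the example, not a standard format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkMessage:
    message_id: str     # stable ID, reused on redelivery so consumers can deduplicate
    intent: str         # what should become true, e.g. "refund_order"
    entity_id: str      # scope: the one entity this work applies to
    source_record: str  # durable record proving the work should exist
    requested_at: str   # ISO 8601 timestamp of the original request
    valid_until: str    # freshness: after this time the work is stale
    owner: str          # team or role that reviews the message if it fails
    on_failure: str     # repair path, stated in the message itself


def encode(message: WorkMessage) -> str:
    """Serialize to JSON so the message is readable without the producer's code."""
    return json.dumps(asdict(message), sort_keys=True)


def decode(raw: str) -> WorkMessage:
    return WorkMessage(**json.loads(raw))


# Hypothetical example message.
message = WorkMessage(
    message_id="m-42",
    intent="refund_order",
    entity_id="order-1",
    source_record="refund_requests/981",
    requested_at="2024-01-01T12:00:00Z",
    valid_until="2024-01-02T12:00:00Z",
    owner="payments-oncall",
    on_failure="mark_refund_failed_and_notify",
)
raw = encode(message)
```

Someone reading `raw` during an incident can tell what was intended, for which entity, until when, and who to ask, without opening the producer's repository.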

Where this goes wrong

The counterpoint is that queues are often exactly the right primitive. They absorb bursts, decouple availability, smooth load, and let work happen outside request latency. Avoiding queues because they introduce complexity can leave a system brittle in other ways.

The failure is not using a queue. The failure is treating "put it on a queue" as the design. A queue without semantics is just delayed coupling.

There is also a scale trap. Teams sometimes import heavy workflow machinery when a simple table of pending work would be clearer and easier to operate. The best async design may be boring storage plus a careful worker.
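To make the boring option concrete, here is a minimal sketch of a pending-work table using Python's built-in `sqlite3` module. The table layout and the `enqueue`, `claim_next`, and `finish` helpers are assumptions, and the claim logic is written for a single worker; several concurrent workers would need proper locking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_work (
        id       INTEGER PRIMARY KEY,
        kind     TEXT NOT NULL,
        payload  TEXT NOT NULL,
        state    TEXT NOT NULL DEFAULT 'pending',
        attempts INTEGER NOT NULL DEFAULT 0
    )
""")


def enqueue(kind: str, payload: str) -> int:
    """Record the work durably; the row itself proves the work should exist."""
    with conn:
        cur = conn.execute(
            "INSERT INTO pending_work (kind, payload) VALUES (?, ?)",
            (kind, payload),
        )
    return cur.lastrowid


def claim_next():
    """Claim the oldest pending row, or return None if nothing is pending."""
    with conn:
        row = conn.execute(
            "SELECT id, kind, payload FROM pending_work "
            "WHERE state = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE pending_work SET state = 'running', attempts = attempts + 1 "
            "WHERE id = ? AND state = 'pending'",
            (row[0],),
        )
    return row


def finish(work_id: int, ok: bool) -> None:
    """Move the row to a terminal state that operators can query."""
    with conn:
        conn.execute(
            "UPDATE pending_work SET state = ? WHERE id = ?",
            ("done" if ok else "failed", work_id),
        )
```

Because state lives in a table, stuck and failed work can be found with an ordinary query, and a reconciliation job has a durable record to read from.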

What I do now

When reviewing an async proposal, I ask for the state machine before the queue technology. What states can the work occupy? Which transitions are user-visible? Which transitions can be retried? Which require human review? Which can be reconstructed?
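The answers to those questions can be written down as a small table of legal transitions. The sketch below is one possible shape, not a prescribed one; the state names, the `TRANSITIONS` table, and the `transition` helper are all hypothetical.

```python
# Legal transitions for one unit of work. Terminal states map to an empty set.
TRANSITIONS = {
    "pending": {"running", "cancelled", "expired"},
    "running": {"done", "retry_wait", "needs_review"},
    "retry_wait": {"running", "needs_review", "cancelled"},
    "needs_review": {"pending", "cancelled"},  # a human requeues or cancels
    "done": set(),
    "cancelled": set(),
    "expired": set(),
}

# States the user is allowed to see; the rest are operational detail.
USER_VISIBLE = {"pending", "done", "cancelled", "expired", "needs_review"}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the contract."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = transition("pending", "running")
state = transition(state, "retry_wait")
state = transition(state, "running")
state = transition(state, "done")
```

A table like this is cheap to review, and it makes illegal moves, such as reviving finished work during a replay, fail loudly instead of silently.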

I also ask what happens if the message is delivered tomorrow. This simple question reveals hidden assumptions about freshness, authorization, inventory, pricing, or user intent. If delayed execution would be wrong, the message needs enough context or validation to notice.

The principal-engineer lens is semantics. Async systems are not reliable because they use queues. They are reliable when every participant understands what the queued work means and how truth is restored when delivery is messy.

Closing takeaway

Before choosing a queue, design the ownership, ordering, failure, and reconciliation rules. The transport should serve those rules, not substitute for them.