Queues Are Not Reliability

Queues are useful enough that they attract wishful thinking. A team has a timing problem, a burst problem, or a dependency problem, and the answer becomes "put it on a queue." That may be the right move. It is not the end of the reliability design.

In some failure modes, a queue preserves a message better than a synchronous call does. It does not preserve intent, enforce idempotency, repair partial effects, or decide whether stale work should still run.

The thesis

Queues buy decoupling. They do not buy correctness.

The dangerous belief is that moving work into a queue turns an unreliable operation into a reliable one. In reality, it changes where the hard states appear. Instead of one request timing out, you now have enqueue uncertainty, duplicate delivery, delayed processing, poison messages, partial consumer effects, and ambiguous acknowledgement.

Those are solvable problems, but the queue does not solve them for you.

The production pattern

A service accepts a command and enqueues work. The API returns quickly, so the system feels healthier. During normal traffic, this is a good improvement. Producers are less coupled to worker latency. Workers can be scaled independently. Bursts are easier to absorb.

Then a downstream service slows down. Consumers start timing out after doing some of the work. Messages are retried. A few messages fail on every attempt and move to a dead-letter queue. New messages keep arriving behind old messages that may no longer be useful. Some users retry from the UI because the product says only "processing."

The queue did not create these problems, but it made them durable. That is useful only if the surrounding system knows what to do with durable uncertainty.

The model

I break queued work into a lifecycle:

Accept: record the intent that should produce work.
Enqueue: place a durable reference to that intent on the queue.
Reserve: let a consumer claim an attempt.
Process: perform reads, writes, and external effects.
Commit: record the known outcome.
Acknowledge: remove or advance the message only after the commit rules are satisfied.

Each edge has a failure mode. The producer can crash after recording intent but before enqueueing. The queue can deliver more than once. The consumer can crash after an external effect but before commit. The acknowledgement can fail after commit. The message can be delayed so long that the original intent should be rechecked.
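The first of those edges, a producer crashing between Accept and Enqueue, can be sketched with an outbox-style table. This is a minimal illustration under stated assumptions, not a prescribed implementation: the table names, the `accept` and `relay` helpers, and the `publish` callback are hypothetical, and an in-memory SQLite database stands in for whatever store holds the intent record.

```python
import sqlite3

def open_store():
    # Hypothetical schema: one table for intents, one for pending publishes.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE intents (id TEXT PRIMARY KEY, body TEXT, status TEXT)")
    db.execute("CREATE TABLE outbox (intent_id TEXT PRIMARY KEY, published INTEGER DEFAULT 0)")
    return db

def accept(db, intent_id, body):
    # Accept: the intent and the obligation to enqueue it commit together,
    # so a crash cannot leave an intent that nobody will ever publish.
    with db:
        db.execute("INSERT INTO intents VALUES (?, ?, 'pending')", (intent_id, body))
        db.execute("INSERT INTO outbox (intent_id) VALUES (?)", (intent_id,))

def relay(db, publish):
    # Enqueue: publish every row not yet marked. Safe to rerun after a crash.
    # If publish succeeds but the mark fails, the next run publishes again,
    # which is why consumers must still tolerate duplicates.
    rows = db.execute(
        "SELECT intent_id FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for (intent_id,) in rows:
        publish(intent_id)  # may raise; the row then stays unpublished
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE intent_id = ?", (intent_id,))
    return len(rows)
```

Note that the message is only the intent identifier. The sketch trades a possible duplicate publish for never losing an accepted intent.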

The queue is one component in that lifecycle. Treating it as the lifecycle is the mistake.

Where this goes wrong

The first mistake is putting complete business commands in messages without a durable source of intent. If the message is the only record, a bad publish, accidental deletion, or malformed payload can erase the system's memory of what was supposed to happen.

The second mistake is assuming consumers see each message exactly once. Many queues are designed around at-least-once delivery. Even when a system offers stronger delivery guarantees under normal conditions, consumer code should still handle duplicate attempts because retries and operator actions will happen.

The third mistake is letting poison messages become a junk drawer. A dead-letter queue with no owner, alert, replay policy, or product state is not recovery. It is delayed abandonment.

The fourth mistake is ignoring ordering and freshness. Work that arrives first is not always work that should complete first. A stale recalculation, old notification, or superseded agent action may need to be skipped, collapsed, or revalidated before execution.
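One way to picture skipping and collapsing stale work is to version the intent and compare versions before running. The sketch below assumes each message carries a `key` and a `version`; both fields, and the `should_run` and `collapse` helpers, are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    key: str      # hypothetical: the entity the work targets
    version: int  # hypothetical: version of the intent when it was enqueued

def should_run(msg, current_versions):
    """Run only if the message reflects the latest known intent for its key."""
    latest = current_versions.get(msg.key)
    return latest is not None and msg.version >= latest

def collapse(batch):
    """Keep only the newest message per key, in first-seen key order."""
    newest = {}
    for msg in batch:
        kept = newest.get(msg.key)
        if kept is None or msg.version > kept.version:
            newest[msg.key] = msg
    return list(newest.values())
```

Collapsing helps within a batch. The `should_run` check still matters at execution time, because a newer intent may have been recorded after the batch was read.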

The fifth mistake is using queue depth as the only health signal. Depth tells you about pressure. It does not tell you which business intents are blocked, which users are affected, which kinds of drift are growing, or whether consumers are making partial progress.
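As a rough sketch of signals that complement depth, the report below is computed from intent records rather than from the queue. The `Intent` fields, the status values, and the `health` function are assumptions made for this example; a real system would choose its own dimensions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Intent:
    id: str
    user: str
    kind: str
    status: str         # assumed values: "pending", "in_progress", "done"
    accepted_at: float  # seconds, on the same clock as `now`

def health(intents, now):
    """Summarize unfinished intents by age, kind, and affected users."""
    unfinished = [i for i in intents if i.status != "done"]
    return {
        "open_intents": len(unfinished),
        "oldest_age_s": max((now - i.accepted_at for i in unfinished), default=0.0),
        "blocked_by_kind": dict(Counter(i.kind for i in unfinished)),
        "affected_users": len({i.user for i in unfinished}),
    }
```

The age of the oldest unfinished intent often says more than depth does: a short queue with one intent stuck for an hour is a different situation from a long queue that is draining steadily.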

There is a counterpoint. Queues are excellent tools for smoothing bursts and isolating dependencies. The critique is not "avoid queues." It is "do not let the queue carry design responsibility that belongs to the workflow."

What I do now

I start with the intent record. The message should usually carry an identifier, not be the only durable description of the work. A consumer can then reload current state, check whether the work is still desired, and decide what attempt to make.

I make consumers idempotent at the effect boundary. If a message is delivered twice, the second attempt should find the existing operation, current owner, or completed result. If the consumer cannot do that, the queue's retry policy is unsafe.
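A minimal sketch of that check, assuming a ledger keyed by an operation key derived from the intent: the `effects` table, the `apply_once` helper, and the key format are hypothetical, and in-memory SQLite stands in for the real store.

```python
import sqlite3

def open_ledger():
    # Hypothetical ledger of completed effects, keyed by operation.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE effects (op_key TEXT PRIMARY KEY, result TEXT)")
    return db

def apply_once(db, op_key, effect):
    """Return (result, ran). Runs `effect` only if op_key has no recorded result."""
    row = db.execute(
        "SELECT result FROM effects WHERE op_key = ?", (op_key,)
    ).fetchone()
    if row is not None:
        return row[0], False  # duplicate delivery: reuse the recorded outcome
    # A crash after the effect but before the insert would rerun the effect,
    # so the key is passed downstream to let that side deduplicate as well.
    result = effect(op_key)
    with db:
        db.execute("INSERT INTO effects (op_key, result) VALUES (?, ?)", (op_key, result))
    return result, True
```

The ledger alone does not close the gap between an external effect and its commit. It narrows it, and the shared key gives the downstream system what it needs to close the rest.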

I give dead-letter queues explicit ownership. Someone should know what classes of messages go there, what review cadence exists, what can be replayed automatically, and what requires a product or operations decision.
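Ownership can be written down as data rather than left as tribal knowledge. The sketch below is illustrative only: the failure classes, team names, review intervals, and the `triage` function are all invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DlqPolicy:
    owner: str
    review_hours: int  # how often the owner reviews this class
    auto_replay: bool  # whether replay needs no human decision

# Hypothetical failure classes and owners.
POLICIES = {
    "downstream_timeout": DlqPolicy("payments-oncall", 4, True),
    "malformed_payload": DlqPolicy("platform-team", 24, False),
}
UNOWNED = DlqPolicy("unassigned", 0, False)

def triage(failure_class):
    """Return (action, owner) for a dead-lettered message's failure class."""
    policy = POLICIES.get(failure_class, UNOWNED)
    if policy is UNOWNED:
        return "alert", policy.owner  # an unknown class is itself an incident
    if policy.auto_replay:
        return "replay", policy.owner
    return "review", policy.owner
```

The useful property is the default branch: a message class that nobody has claimed raises an alert instead of quietly accumulating.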

I also design replay before the incident. Replaying a message should not be the same as pretending time has not passed. The consumer should recheck current desired state and stop if the work has been cancelled, superseded, or already completed.
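A replay path can make that recheck explicit by returning a decision and a reason before anything runs. This is a sketch under assumptions: the status values, the optional `deadline` field, and the `replay_decision` function are hypothetical, and the clock is passed in so the behavior is easy to test.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Intent:
    status: str                # assumed: "pending", "cancelled", "superseded", "completed"
    deadline: Optional[float]  # assumed field; None means the work never expires

def replay_decision(intent, now):
    """Decide whether a replayed message should still run, with a reason if not."""
    if intent is None:
        return "skip:unknown_intent"  # the message outlived its intent record
    if intent.status != "pending":
        return f"skip:{intent.status}"
    if intent.deadline is not None and now > intent.deadline:
        return "skip:expired"
    return "run"
```

Recording the skip reason matters as much as the skip itself, since it lets an operator tell expected staleness apart from a replay that went wrong.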

For agent runtimes, queues can be useful for tool work, evaluation, and long actions. But the queue should hold work for a governed action lifecycle, not raw model intention. The runtime still needs ownership, idempotency, deadlines, and reconciliation.

Closing takeaway

A queue can keep work from disappearing. It cannot tell you whether that work is still correct, safe, or owned.