Caches Are Product Decisions With Expiration Dates

A cache review frame for stale reads, invalidation ownership, and stampede risk.

A cache usually enters a design review as an optimization. The read path is slow. The source of truth is expensive. The team wants lower latency or lower load. A small box appears between the caller and the database, and everyone relaxes because the word "cache" sounds like implementation detail.

That is often the first mistake.

The thesis

A cache is a product decision because it answers a product question: how wrong is the system allowed to be, and for how long?

Once a cached value can outlive the fact it represents, the system has made a promise about stale reads. That promise may be harmless, useful, or dangerous. The point is that it exists whether the team names it or not.

At the principal-engineer level, cache review is not just latency review. It is correctness review, ownership review, cost review, and failure review under a shorter name.

The production pattern

The pattern starts innocently. A service repeatedly reads data that changes less often than it is requested. The first cache is local. Then it becomes shared. Then another team reads through it. Then a background job depends on it. Then an incident review discovers that the cached value was technically old but operationally decisive.

The cache did exactly what it was configured to do. The design failed because nobody owned the meaning of "old."

I keep seeing the same shape. A dashboard shows a high hit rate, so the cache is declared healthy. Latency improves, so the feature is declared better. Load drops, so the database owner is grateful. Meanwhile, the important question sits unasked: what happens when the cache returns an answer that was true five minutes ago but false now?

For some domains, that answer is fine. A product catalog, a public profile snippet, or a static policy description can tolerate bounded staleness. For other domains, stale state changes behavior. Permissions, balances, eligibility, workflow state, inventory, fraud decisions, and safety controls are not all cacheable in the same way.

The cache is not just between caller and storage. It is between a user and a promise.

The trap

The trap is reviewing caches as if they have one dimension: hit rate. Hit rate matters, but it can hide the damage. A cache that is right most of the time can still be wrong at the moments that matter most.

Another trap is treating time-to-live as an engineering constant. A thirty-second TTL sounds small in a config file. It is large if the user revoked access one second ago. It is tiny if the source system changes weekly. Time has product meaning only in context.

Invalidation also gets hand-waved. Teams say "we will invalidate on write" without naming the owner of every write. Then a second writer appears. Then a backfill appears. Then a repair script appears. Then a replica lag path appears. Each one changes the truth without necessarily touching the cache.

The hardest cache bugs are not always stale values. Sometimes they are stampedes. The cache expires, many callers miss at once, and the system moves from protecting the origin to attacking it. A cache that improves normal traffic can make recovery traffic worse.

The model

I use a five-part cache review model.

Cacheable truth asks what fact is being cached and who is allowed to define it. Is it derived from one source or many? Does it include permissions, policy, inventory, lifecycle state, or money-like value? Is the cached object a full truth or a convenience projection?

Invalidation owner asks who knows when the cached fact stops being true. A TTL is not ownership. Ownership means a write path, event path, repair path, migration path, or manual operation has a named mechanism to make old answers disappear or become harmless.
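One way to make that ownership concrete is to give invalidation a single named entry point and force every writer to identify itself. The sketch below is a minimal, single-threaded illustration under that assumption; `OwnedCache`, `record_write`, `write_role`, and the in-memory `origin` dict are all hypothetical names, not part of any real library.

```python
from typing import Callable, Dict, List, Tuple


class OwnedCache:
    """Read-through cache with one named entry point for invalidation."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._values: Dict[str, str] = {}
        # Audit trail of (key, source) so a review can see who invalidates.
        self.invalidations: List[Tuple[str, str]] = []

    def get(self, key: str) -> str:
        if key not in self._values:
            self._values[key] = self._load(key)
        return self._values[key]

    def record_write(self, key: str, source: str) -> None:
        """Drop the cached answer; `source` names the writer responsible."""
        self._values.pop(key, None)
        self.invalidations.append((key, source))


# Hypothetical source of truth and write path, for illustration only.
origin: Dict[str, str] = {"user:1": "admin"}
cache = OwnedCache(load=lambda key: origin[key])


def write_role(key: str, role: str, source: str) -> None:
    """Single write path: handlers, backfills, and repair scripts call this."""
    origin[key] = role
    cache.record_write(key, source)
```

A writer that mutates `origin` directly still leaves the old answer in the cache. The sketch does not prevent that; it only gives the review something to check each new writer against.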

Stale-read impact asks what happens when the answer is old. Does the user see a cosmetic delay, receive wrong advice, get blocked incorrectly, get access they should not have, or trigger irreversible work? This turns cache review from "is staleness possible?" into "is staleness acceptable?"

Warmup path asks what happens when the cache is empty. Cold starts, deploys, failovers, region shifts, and key churn all create empty-cache behavior. If the warmup path cannot survive production traffic, the cache is not an optimization. It is a hidden dependency.

Stampede control asks how the system behaves when many keys expire or many callers miss together. Request coalescing, jittered expiration, stale-while-revalidate, per-key limits, and origin budgets are not decoration. They are the difference between a cache protecting the system and a cache synchronizing load.
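Two of those controls are small enough to sketch. The code below is an illustrative, thread-based version of jittered expiration and request coalescing, assuming an in-process cache; `jittered_ttl`, `SingleFlight`, and `_Call` are hypothetical names, and a production version would also need timeouts and an origin budget.

```python
import random
import threading
from typing import Any, Callable, Dict, Optional


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Shift a TTL by up to +/- `spread` so keys written together expire apart."""
    return base_seconds * (1.0 + random.uniform(-spread, spread))


class _Call:
    """State shared by the leader and the followers waiting on one load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None
        self.waiters = 0  # followers coalesced onto this load


class SingleFlight:
    """Coalesce concurrent loads of the same key into one origin call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, load: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.waiters += 1

        if leader:
            try:
                call.value = load()
            except Exception as exc:
                call.error = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()  # follower: wait for the leader's result

        if call.error is not None:
            raise call.error
        return call.value
```

On a miss, callers would wrap the origin read as `flight.do(key, lambda: origin_read(key))` and store the result with `jittered_ttl(300)` rather than a fixed 300. Coalescing bounds the concurrent origin calls per key to one. Jitter keeps keys that were written together from expiring together.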

The checklist is simple: truth, owner, impact, warmup, stampede. If a design cannot answer those five, it is not ready for broad use.

Where this model breaks

Some caches really are local implementation details. Request-scoped memoization, pure function caching inside one process, compiled templates, static lookup tables packaged with a release, and tiny caches over immutable values do not need a governance ceremony.

The model also breaks when the cost of formal review exceeds the blast radius. A local cache in a command-line tool should not carry the same process as a cross-service cache on an authorization path.

The counterpoint is important because over-treating every cache as a distributed systems problem makes teams avoid useful simplicity. The distinction I care about is whether the cached answer can be observed as product behavior by another person, service, or workflow. If yes, the stale-read promise deserves to be named.

Another limit: not all stale reads are bad. Sometimes bounded staleness is the right product decision. A system can choose availability and speed over immediate consistency. The failure is not choosing staleness. The failure is pretending no choice was made.

What I do now

When reviewing a cache, I ask for the stale-read contract in plain language. "This value may be up to five minutes old, and that is acceptable because it only affects display ordering" is a design statement. "TTL equals 300" is not.
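One way to keep the plain-language statement next to the number is to make the TTL a field of the contract instead of a free-standing constant. This is a sketch of that idea, not a prescribed format; `StaleReadContract`, its field names, and the example owner are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleReadContract:
    """A TTL that cannot be declared without its justification and owner."""

    name: str
    max_age_seconds: float
    impact_if_stale: str
    owner: str

    def __post_init__(self) -> None:
        if self.max_age_seconds <= 0:
            raise ValueError("max_age_seconds must be positive")
        if not self.impact_if_stale.strip() or not self.owner.strip():
            raise ValueError("a contract needs a stated impact and an owner")

    def allows(self, age_seconds: float) -> bool:
        """True if a value of this age is still within the promise."""
        return 0 <= age_seconds <= self.max_age_seconds

    def describe(self) -> str:
        return (
            f"{self.name} may be up to {self.max_age_seconds:g} seconds old; "
            f"if stale, {self.impact_if_stale}. Owner: {self.owner}."
        )


# Hypothetical example mirroring the display-ordering statement above.
ordering = StaleReadContract(
    name="Display ordering",
    max_age_seconds=300,
    impact_if_stale="items may be shown in an outdated order",
    owner="hypothetical-catalog-team",
)
```

The cache configuration can then read `ordering.max_age_seconds` for its TTL, so changing the number means editing the sentence that justifies it.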

I ask for invalidation ownership before implementation detail. If the cache depends on write events, I want to know which writes publish them, how missed events are repaired, and whether backfills can bypass the path. If the cache depends only on expiration, I want the stale-read impact to justify that choice.

I ask teams to test cold behavior. Disable the cache, clear a key range, or run a deploy path that starts empty. A cache that only works when already warm is borrowing reliability from yesterday.
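A cold-behavior check can start as a small test that clears the cache, replays a traffic sample, and counts what reaches the origin. The sketch below assumes a simple in-process read-through cache; `CountingOrigin`, `ReadThroughCache`, and `cold_origin_calls` are hypothetical names for illustration.

```python
from typing import Callable, Dict, Iterable


class CountingOrigin:
    """Stand-in for the source of truth that counts how often it is read."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)
        self.calls = 0

    def load(self, key: str) -> str:
        self.calls += 1
        return self._data[key]


class ReadThroughCache:
    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._values: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._values:
            self._values[key] = self._load(key)
        return self._values[key]

    def clear(self) -> None:
        self._values.clear()


def cold_origin_calls(
    cache: ReadThroughCache, origin: CountingOrigin, keys: Iterable[str]
) -> int:
    """Empty the cache, replay `keys`, and return the origin reads incurred."""
    cache.clear()
    before = origin.calls
    for key in keys:
        cache.get(key)
    return origin.calls - before
```

The test compares the returned count against what the origin can absorb during a deploy or failover, so that limit has to be written down as a number.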

I also ask for failure-mode observability: hit rate, miss rate, origin load, refresh errors, evictions, age of served values where possible, and whether the origin stays protected during refresh. Hit rate alone is a vanity signal unless paired with correctness and origin health.
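Age of served values is the least common of those signals, so here is a rough sketch of how it could be recorded. It assumes an in-process TTL cache with an injectable clock; `AgeTrackingCache` and its attributes are hypothetical, and a real service would export a histogram instead of keeping a list in memory.

```python
import time
from typing import Callable, Dict, List, Tuple


class AgeTrackingCache:
    """TTL cache that records how old each served value was."""

    def __init__(
        self,
        load: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, stored_at)
        self.hits = 0
        self.misses = 0
        self.served_ages: List[float] = []

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            self.hits += 1
            self.served_ages.append(now - entry[1])
            return entry[0]
        # Miss or expired: reload and serve a fresh value (age zero).
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, now)
        self.served_ages.append(0.0)
        return value

    def max_served_age(self) -> float:
        return max(self.served_ages, default=0.0)
```

Set against the stale-read contract, the maximum or a high percentile of `served_ages` shows whether users are seeing answers close to the promised limit. Hit rate cannot show that.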

The principal-engineer lens is that caches redistribute responsibility. They move load away from one system while moving correctness questions into another. That is a good trade when explicit. It is a dangerous trade when hidden in a helper library or a config change.

Closing takeaway

Treat every shared cache as a promise: this stale answer is acceptable for this long, under this owner, with this recovery path.