Engineering

The Reliability Work Nobody Wants to Fund

A framing for turning invisible reliability labor into risk reduction leaders can evaluate.

Some of the most valuable reliability work is hard to fund because nothing new appears on the product surface. The system becomes easier to operate, safer to change, and less likely to fail in boring ways. That value is real, but it often stays invisible until the absence of the work becomes expensive.

Reliability needs a better business frame than "we should clean this up."

The thesis

Reliability work gets funded when it is expressed as risk reduction with a clear owner, consequence, and decision point. It gets ignored when it is presented as engineering hygiene.

The principal engineer's job is not to complain that leaders do not value reliability. It is to make the risk legible enough to compare against other work.

The production pattern

A team knows a system is fragile. Deployments require careful sequencing. Alerts are noisy. A critical runbook is tribal knowledge. Tests miss integration behavior. A queue can build up silently. A fallback exists but has not been tested in a long time.

Everyone agrees these are not ideal. But the roadmap is full. Product work has named outcomes. Reliability work has unease.

Unease loses.

The work gets deferred until an incident turns it into an emergency. Then the organization funds a rushed version under worse conditions.

The model

I use a reliability funding frame with five parts.

First, name the failure mode. Avoid vague labels like "tech debt" or "fragility." Say what can happen: delayed recovery, data inconsistency, user-visible downtime, unsafe deploy, unbounded retry pressure, or inability to diagnose.

Second, describe consequence in ranges, not invented precision. Use scale bands: minutes of recovery time, classes of workflows affected, operational load, or support burden.

Third, identify the trigger. What makes the risk more likely? Growth, product change, dependency behavior, staffing changes, more frequent deploys, or accumulated complexity.

Fourth, propose the smallest risk-reducing slice. Reliability programs fail when they are sold as a vague bucket. A slice might be alert ownership, rollback testing, reconciliation, load shedding, or removing a dangerous manual step.

Fifth, define evidence of reduction. What will be true after the work that is not true now?

My template adds a sixth field for ownership:

  • Risk: specific failure mode
  • Consequence: who or what is affected
  • Trigger: why now
  • Slice: smallest useful mitigation
  • Evidence: how we know risk dropped
  • Owner: who maintains the improvement
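For teams that track proposals in code or a structured tracker, the template above can be sketched as a small record with a completeness check. This is a minimal illustration, not a prescribed tool: the class name `RiskProposal`, the field `mitigation_slice` (standing in for "Slice"), and the example values are all assumptions made for the sketch.

```python
from dataclasses import dataclass, fields


@dataclass
class RiskProposal:
    """One reliability funding request, mirroring the template fields."""
    risk: str              # specific failure mode
    consequence: str       # who or what is affected, stated as a range or band
    trigger: str           # why now
    mitigation_slice: str  # smallest useful mitigation ("Slice" in the template)
    evidence: str          # how we know risk dropped
    owner: str             # who maintains the improvement

    def missing_fields(self) -> list[str]:
        """Return the names of fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical example values, not taken from a real system.
proposal = RiskProposal(
    risk="settlement queue backs up silently",
    consequence="delayed settlement for one workflow class; recovery in hours, not minutes",
    trigger="deploy frequency is increasing",
    mitigation_slice="alert on queue age with a named on-call owner",
    evidence="",
    owner="payments on-call rotation",
)

print(proposal.missing_fields())
```

A proposal with a blank field is a prompt for more thinking, not an automatic rejection. In this example the evidence of reduction has not been written down yet, which is the gap most likely to weaken the request.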

I also rank reliability work by the kind of uncertainty it removes.

  • Diagnostic uncertainty means we do not know quickly what is wrong.
  • Recovery uncertainty means we know what is wrong but do not know how to return safely.
  • Capacity uncertainty means normal growth can push the system past limits without warning.
  • Change uncertainty means deploys and migrations may create failures we cannot isolate.
  • Ownership uncertainty means the right response requires negotiation during stress.

These categories help avoid generic reliability backlogs. A dashboard-heavy proposal may reduce diagnostic uncertainty while leaving recovery unchanged. A rollback drill may reduce recovery uncertainty without addressing noisy alerts. A staffing or ownership change may reduce risk more than another technical control.

The strongest funding requests connect a specific slice to a specific uncertainty. "Add tracing" is weak. "Reduce diagnostic uncertainty for delayed settlement by linking account state, queue age, and downstream response classes in one on-call view" is much easier to judge.
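One way to check that a backlog connects slices to uncertainties is to tag each slice with the category it targets and look for categories with no coverage. The sketch below assumes a hypothetical backlog; the names `Uncertainty`, `coverage`, and `uncovered` are illustrative, and the backlog entries are invented for the example.

```python
from collections import Counter
from enum import Enum


class Uncertainty(Enum):
    DIAGNOSTIC = "diagnostic"
    RECOVERY = "recovery"
    CAPACITY = "capacity"
    CHANGE = "change"
    OWNERSHIP = "ownership"


# Hypothetical backlog: (slice, uncertainty it is meant to reduce).
backlog = [
    ("one on-call view linking account state, queue age, and downstream responses",
     Uncertainty.DIAGNOSTIC),
    ("dashboard for retry volume", Uncertainty.DIAGNOSTIC),
    ("rollback drill for the settlement service", Uncertainty.RECOVERY),
]


def coverage(items):
    """Count proposed slices per uncertainty category, including zeros."""
    counts = Counter(kind for _, kind in items)
    return {kind: counts[kind] for kind in Uncertainty}


def uncovered(items):
    """Return the categories that no proposed slice addresses."""
    return [kind for kind, n in coverage(items).items() if n == 0]


print([kind.value for kind in uncovered(backlog)])
```

A count like this does not say which work matters most. It only makes visible when a backlog leans heavily on one category, such as diagnostics, while leaving others untouched.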

I also try to put a retirement date on reliability bets. If a mitigation does not reduce incidents, shorten recovery, or change operator behavior after a reasonable window, it should be revised or removed. Reliability work deserves rigor after funding too. Otherwise it becomes another permanent surface area that someone has to maintain.
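The retirement-date idea can be sketched as a simple review rule. This is an assumption-laden illustration: the `Mitigation` record, the three boolean signals, and the `review` function are hypothetical names, and what counts as a "reasonable window" is left to whoever sets `review_by`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Mitigation:
    name: str
    review_by: date                  # the retirement date for this bet
    reduced_incidents: bool
    shortened_recovery: bool
    changed_operator_behavior: bool


def review(m: Mitigation, today: date) -> str:
    """Return 'keep', 'pending', or 'revise or remove' for a mitigation."""
    # Any one observed effect is enough to keep the mitigation.
    if any((m.reduced_incidents, m.shortened_recovery, m.changed_operator_behavior)):
        return "keep"
    # No effect yet, but the review window has not closed.
    if today < m.review_by:
        return "pending"
    # Window closed with no observed effect.
    return "revise or remove"


# Hypothetical mitigation with no observed effect so far.
alert = Mitigation("queue age alert", date(2024, 6, 30), False, False, False)

print(review(alert, date(2024, 5, 1)))
print(review(alert, date(2024, 7, 1)))
```

The three signals are judgment calls recorded by people, not measurements the code can verify. The point of the rule is only that the review happens on a date chosen in advance.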

Where this goes wrong

The counterpoint is that engineers can overstate reliability risk to get preferred work funded. That damages trust. Not every annoyance is a material risk, and not every cleanup deserves roadmap space.

Reliability work also has opportunity cost. A young product may reasonably accept some operational discomfort to learn faster. The responsible move is to make that tradeoff explicit, not to shame it.

There is another failure mode: funding one-time reliability pushes without assigning ongoing ownership. A cleanup sprint can improve the system briefly, but reliability decays if the operating model stays the same.

What I do now

I avoid asking for "time to fix reliability." I ask for specific risk decisions.

For example: "We can spend a short cycle reducing recovery uncertainty for this workflow, or we can accept that the next failure may require manual diagnosis by a small set of people." That is a decision leaders can engage with. It connects engineering work to consequence without drama.

Principal engineers have to translate invisible labor into visible risk movement. The work is not more virtuous because it is hidden. It becomes fundable when its value can be inspected.

Closing takeaway

If reliability work cannot name the failure mode it reduces and the evidence that risk went down, it is not ready to compete for roadmap attention.