Incident Reviews Should Change Roadmaps, Not Just Runbooks

Many incident reviews produce better documents than decisions. The timeline is accurate, the contributing factors are thoughtful, the tone is blameless, and the action items look reasonable. Then the roadmap continues as if nothing important was learned.

That is a missed opportunity.

The thesis

An incident review should change the roadmap when it reveals that the organization was wrong about risk, ownership, or operating cost.

Runbook improvements matter, but they are often the smallest outcome. The larger question is whether the incident exposed a prioritization error.

The production pattern

After an incident, teams tend to generate local fixes: add an alert, improve a dashboard, update a runbook, patch a bug, add a test, clarify an escalation path. These can be useful.

But some incidents reveal a broader mismatch. A supposedly non-critical workflow is more central than expected. A dependency has become a single point of failure. A manual process is no longer acceptable at current scale. A team owns a system in name but lacks authority to operate it. A product promise depends on behavior nobody has budgeted to harden.

If those insights do not affect planning, the review has documented learning without spending it.

The model

I separate incident outcomes into four classes.

First, local correction. A defect, alert, or runbook gap can be fixed inside the owning area without changing priorities elsewhere.

Second, operating-model correction. The system behavior is acceptable, but ownership, escalation, permissions, or on-call readiness is wrong.

Third, architecture correction. The incident exposed coupling, missing isolation, unsafe state transitions, inadequate rollback, or an unbounded failure mode.

Fourth, strategy correction. The roadmap assumes speed, cost, or reliability that the current system cannot support without investment.

The review should explicitly classify its findings. The action list should match the class.
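One way to make "the action list should match the class" checkable is a small sketch like the one below. Everything in it is hypothetical: the names OutcomeClass, Finding, EXPECTED_ACTIONS, and mismatched_findings, and the action kinds assigned to each class, are illustrative assumptions rather than part of any real tool. A team would substitute its own vocabulary.

```python
from dataclasses import dataclass, field
from enum import Enum


class OutcomeClass(Enum):
    LOCAL = "local correction"
    OPERATING_MODEL = "operating-model correction"
    ARCHITECTURE = "architecture correction"
    STRATEGY = "strategy correction"


# Hypothetical mapping: the kinds of action each class is expected to produce.
EXPECTED_ACTIONS = {
    OutcomeClass.LOCAL: {"bug fix", "alert", "runbook update", "test"},
    OutcomeClass.OPERATING_MODEL: {
        "ownership change", "escalation rule", "permission change", "on-call training",
    },
    OutcomeClass.ARCHITECTURE: {"isolation work", "rollback mechanism", "failure-mode bound"},
    OutcomeClass.STRATEGY: {"roadmap tradeoff"},
}


@dataclass
class Finding:
    summary: str
    outcome_class: OutcomeClass
    action_kinds: list[str] = field(default_factory=list)


def mismatched_findings(findings):
    """Return findings whose actions include none of the kinds expected for their class."""
    return [
        f for f in findings
        if not EXPECTED_ACTIONS[f.outcome_class] & set(f.action_kinds)
    ]


findings = [
    Finding("Alert threshold was stale", OutcomeClass.LOCAL, ["alert"]),
    Finding(
        "Dependency has become a single point of failure",
        OutcomeClass.ARCHITECTURE,
        ["runbook update"],
    ),
]

flagged = mismatched_findings(findings)
for f in flagged:
    print(f"{f.outcome_class.value}: {f.summary}")
```

The second finding is flagged because an architecture-class finding was answered only with a runbook update. The check is deliberately crude: it detects a mismatch between class and action, not whether the classification itself is right.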

My incident review checklist:

What did we learn that we did not believe before?
Which assumption in the roadmap did this contradict?
Did ownership match authority during response?
Did the system fail safely or require improvisation?
Which action item reduces recurrence, impact, or recovery time?
Which action item deserves roadmap tradeoff discussion?

I also separate actions by funding path.

Immediate containment belongs inside the incident follow-up window: disable an unsafe path, lower a dangerous limit, patch the defect, or add the alert that would have shortened response. Operating-model changes need an accountable owner: permission cleanup, escalation rules, runbook review, or on-call training. Roadmap changes need prioritization language: what planned work becomes riskier if the system stays as it is?

This separation prevents two common failures. The first is pretending a roadmap problem can be solved with a checklist item. The second is escalating every small cleanup into a planning debate. A review earns influence when it is precise about which kind of decision it is asking for.
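The routing can be sketched in a few lines. This is a minimal illustration under stated assumptions: FundingPath, ActionItem, expected_path, and misrouted are hypothetical names, and the two boolean fields are a simplification of judgments a reviewer would make by hand.

```python
from dataclasses import dataclass
from enum import Enum


class FundingPath(Enum):
    FOLLOW_UP = "incident follow-up window"
    OWNER = "accountable owner"
    ROADMAP = "roadmap prioritization"


@dataclass
class ActionItem:
    title: str
    changes_operating_model: bool    # permissions, escalation, runbook review, on-call training
    puts_planned_work_at_risk: bool  # planned work gets riskier if the system stays as it is
    filed_under: FundingPath         # where the review currently placed the item


def expected_path(item: ActionItem) -> FundingPath:
    # Roadmap risk takes precedence: a checklist item cannot settle a prioritization question.
    if item.puts_planned_work_at_risk:
        return FundingPath.ROADMAP
    if item.changes_operating_model:
        return FundingPath.OWNER
    return FundingPath.FOLLOW_UP


def misrouted(items):
    """Pair each misfiled item's title with the path it should have taken."""
    return [
        (i.title, expected_path(i))
        for i in items
        if i.filed_under is not expected_path(i)
    ]


items = [
    ActionItem("Add the alert that would have shortened response",
               False, False, FundingPath.FOLLOW_UP),
    ActionItem("Replace manual process that no longer fits current scale",
               False, True, FundingPath.FOLLOW_UP),
    ActionItem("Fix label on dashboard",
               False, False, FundingPath.ROADMAP),
]

result = misrouted(items)
for title, path in result:
    print(f"{title} -> {path.value}")
```

The second item is the first failure (a roadmap problem filed as a follow-up chore) and the third is the second failure (a small cleanup escalated into planning). The first item is filed correctly and is not reported.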

I like action items that name the bet they retire. "Add test" is weak. "Prevent a deploy from accepting a state transition the previous version cannot read" is better. It connects the work to the production lesson.

The review should also record the decision if leaders choose not to change the roadmap. That is not a failure in itself. Sometimes the right call is to accept the risk. But writing the acceptance down prevents the next incident review from rediscovering the same tradeoff as if it were new information.

Where this goes wrong

The counterpoint is that not every incident should redirect planning. Some are routine defects. Some are rare combinations with low consequence. Some are best handled with a targeted fix and no ceremony.

Overreacting to every incident creates roadmap thrash and rewards recency bias. A principal engineer has to distinguish signal from noise. The question is not "Did something bad happen?" The question is "Did this event reveal that our risk model was wrong?"

There is also a cultural risk. If every incident becomes a demand for roadmap change, teams may become defensive. The review must stay focused on system learning, not political leverage.

What I do now

At the end of an incident review, I like to add a short section called "Planning implications." Sometimes it says, "None." That is acceptable. But forcing the question prevents important discoveries from being trapped inside the incident process.

If there is a planning implication, I want it written as a tradeoff: "Continue feature work and accept this recovery risk," or "Spend capacity to reduce this class of failure before expanding the workflow." Leaders can choose either, but they should not unknowingly choose the first.
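For teams that generate review documents from structured data, the section could be rendered with something like the sketch below. The PlanningImplication type, its field names, and the output layout are assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanningImplication:
    accept_option: str              # continue as planned and accept the risk
    invest_option: str              # spend capacity to reduce the risk first
    decision: Optional[str] = None  # filled in once leaders choose


def render_planning_implications(implications):
    """Render the section as plain text; an empty list yields an explicit 'None.'"""
    lines = ["Planning implications"]
    if not implications:
        lines.append("None.")
        return "\n".join(lines)
    for n, imp in enumerate(implications, start=1):
        lines.append(f"{n}. Option A: {imp.accept_option}")
        lines.append(f"   Option B: {imp.invest_option}")
        # An undecided tradeoff stays visible instead of silently defaulting to Option A.
        lines.append(f"   Decision: {imp.decision or 'UNDECIDED'}")
    return "\n".join(lines)


section = render_planning_implications([
    PlanningImplication(
        accept_option="Continue feature work and accept this recovery risk",
        invest_option="Spend capacity to reduce this class of failure before expanding the workflow",
    )
])
print(section)
```

Printing UNDECIDED when no decision is recorded is the point of the sketch: the section always appears, and an unmade choice is visible rather than implied.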

The principal-engineer lens here is incentives: teams do what roadmaps reward. If incident reviews only create local chores, the organization may never fund the structural work its incidents keep requesting.

Closing takeaway

Do not close an incident review until you have asked whether the roadmap still reflects what production just taught you.