The Art of the Technical Pre-Mortem

A practical pre-mortem format for finding failure modes before design confidence becomes expensive.

The technical risks most worth knowing about are often obvious only after the failure. A pre-mortem is a way to make some of that hindsight available while the design is still cheap to change.

I do not use pre-mortems to make teams pessimistic. I use them to make confidence more honest.

The thesis

A technical pre-mortem works when it treats failure as a design input, not as a mood. The goal is to produce failure modes with named owners before momentum hardens around the happy path.

The non-obvious part is that the pre-mortem is not mainly about imagination. It is about permission. Many engineers can see weak spots before launch, but they soften the language because the project already has sponsors, dates, and social gravity. A good pre-mortem gives the team a formal moment to say the uncomfortable thing early.

The production pattern

A project begins with a reasonable design. The diagram is clean. The rollout has stages. The interfaces look stable enough. People ask good questions, but the conversation stays anchored on how the system should work.

Then reality adds texture. A dependency is slower than expected. A migration needs repair tooling. A fallback path produces different semantics. A queue hides retries that violate an assumption. A team that was expected to provide support has other priorities. None of this is surprising in hindsight. Most of it was probably knowable.

The problem is that design conversations often reward coherence. Pre-mortems reward stress. They ask the team to temporarily assume the project failed and then work backward from the evidence.

That inversion changes the quality of the conversation. Instead of "Is this safe?" the team asks "What would we be embarrassed to learn after launch?" Instead of "Do we have observability?" the team asks "Which failure would page us with no clear owner?"

The model

I use a six-part pre-mortem format.

Cast: Include the people who will build, operate, integrate, support, and approve the work. A pre-mortem made only of designers will miss operational risk. A pre-mortem made only of operators may miss product constraints.

Prompt: Start with a concrete sentence: "It is several weeks after launch, and this effort is considered a failure." Then ask each person to list why. The prompt should name the scope of the effort; it should not refer to a private incident or a real account.

Failure inventory: Collect failures without debating them first. Group them by class: correctness, migration, dependency, performance, cost, security, privacy, usability, operability, ownership, and adoption.

Mitigations: For each serious failure, decide whether to prevent it, detect it, limit its blast radius, rehearse recovery, or explicitly accept it. Not every risk deserves prevention. Some risks need a dashboard and a rollback.

Owners: Every mitigation and accepted risk needs an owner. Ownership means authority to change the plan, not just a name in a document.

Revisit triggers: Define what will reopen the decision. A trigger might be a missed load test, an unresolved dependency, a failed migration rehearsal, a support burden, or a cost signal. Without triggers, the pre-mortem becomes a snapshot instead of a control loop.
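
One way to keep the format from decaying into a snapshot is to record it as data. The following is a minimal sketch, not a prescribed tool: the names FailureMode, open_gaps, and entries_to_reopen are hypothetical, as is the owner "team-a", and attaching revisit triggers to individual entries is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Response(Enum):
    # The five responses named in the Mitigations step.
    PREVENT = "prevent"
    DETECT = "detect"
    LIMIT_BLAST_RADIUS = "limit blast radius"
    REHEARSE_RECOVERY = "rehearse recovery"
    ACCEPT = "accept"


@dataclass
class FailureMode:
    description: str
    failure_class: str                     # e.g. "migration", "ownership"
    response: Optional[Response] = None    # chosen after the inventory is collected
    owner: Optional[str] = None            # someone with authority to change the plan
    revisit_triggers: List[str] = field(default_factory=list)


def open_gaps(register):
    """Return (description, problem) pairs for entries missing a response or an owner."""
    gaps = []
    for fm in register:
        if fm.response is None:
            gaps.append((fm.description, "no response chosen"))
        if fm.owner is None:
            gaps.append((fm.description, "no owner"))
    return gaps


def entries_to_reopen(register, observed_signals):
    """Return entries whose revisit triggers match a signal that has been observed."""
    observed = set(observed_signals)
    return [fm for fm in register if observed & set(fm.revisit_triggers)]


# Hypothetical register with two entries.
register = [
    FailureMode("Nobody owns data repair after migration", "ownership",
                Response.REHEARSE_RECOVERY, owner=None,
                revisit_triggers=["failed migration rehearsal"]),
    FailureMode("Fallback path returns different semantics", "correctness",
                Response.DETECT, owner="team-a",
                revisit_triggers=["missed load test"]),
]

print(open_gaps(register))
print([fm.description for fm in entries_to_reopen(register, ["missed load test"])])
```

Accepted risks go through the same checks as mitigations, which matches the rule that an accepted risk still needs an owner.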

Where this goes wrong

The counterpoint is that pre-mortems can become theater. A team can generate a dramatic list of scary outcomes and still change nothing. That is worse than skipping the exercise because it creates the appearance of diligence without buying down risk.

A pre-mortem also fails when it is used as a veto ceremony. If senior engineers use it to ambush a team, the team will learn to bring safer, vaguer documents. The exercise should make risk speakable, not make people defensive.

Another failure mode is treating every risk as equal. Serious engineering judgment includes choosing which risks to accept. If the team tries to eliminate every possible failure, the project slows down and the risk moves elsewhere: missed opportunity, complexity, or an unmaintainable mitigation layer.

There are cases where a pre-mortem is too much. A small reversible change with limited blast radius may only need a checklist. The ceremony should scale with irreversibility, data risk, cross-boundary coordination, and operational cost.
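
The scaling rule can be made explicit with a rough scoring heuristic. This is only a sketch: the function name review_depth, the 0 to 2 rating scale, the thresholds, and the intermediate "short pre-mortem" tier are all assumptions for illustration, and a team would tune or replace them.

```python
def review_depth(irreversibility, data_risk, cross_boundary, operational_cost):
    """Suggest how much ceremony a change deserves.

    Each factor is rated 0 (low), 1 (medium), or 2 (high).
    Thresholds are illustrative assumptions, not calibrated values.
    """
    ratings = (irreversibility, data_risk, cross_boundary, operational_cost)
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("ratings must be 0, 1, or 2")
    score = sum(ratings)
    if score <= 1:
        # Small, reversible, contained change.
        return "checklist"
    if score <= 4 and max(ratings) < 2:
        # Several medium concerns, but nothing rated high.
        return "short pre-mortem"
    # Any single high rating, or a high total, gets the full format.
    return "full pre-mortem"


print(review_depth(0, 0, 0, 0))
print(review_depth(1, 1, 1, 0))
print(review_depth(2, 0, 0, 0))
```

In this sketch a single high rating is enough to trigger the full format, on the assumption that one irreversible or data-affecting step can dominate the overall risk.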

What I do now

I run pre-mortems early enough that the output can still change sequencing. The most valuable mitigations often affect order: build repair tooling before migration, validate a dependency before committing to an interface, rehearse rollback before widening rollout, or assign operational ownership before launch.

I keep the tone factual. "This might fail because nobody owns data repair" is useful. "This design is reckless" is usually not.

I also separate failure discovery from decision making. First collect the inventory. Then rank. Then choose responses. Mixing those steps causes people to defend ideas before the risk surface is visible.
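
The separation can be sketched as three functions that each do one phase and nothing else. This is a minimal illustration under stated assumptions: the names Item, collect, rank, and choose are hypothetical, and ranking by severity times likelihood is a common convention assumed here, not something this format requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    description: str
    severity: Optional[int] = None     # 1 (minor) to 5 (severe), set during ranking
    likelihood: Optional[int] = None   # 1 (rare) to 5 (expected), set during ranking
    response: Optional[str] = None     # set only in the final phase


def collect(raw_notes):
    """Phase 1: record everything and drop exact duplicates. No scoring, no debate."""
    seen, items = set(), []
    for note in raw_notes:
        key = note.strip().lower()
        if key and key not in seen:
            seen.add(key)
            items.append(Item(note.strip()))
    return items


def rank(items, scores):
    """Phase 2: apply the scores the group agreed on, highest product first."""
    for item in items:
        item.severity, item.likelihood = scores[item.description]
    return sorted(items, key=lambda i: i.severity * i.likelihood, reverse=True)


def choose(ranked, responses):
    """Phase 3: every ranked item gets an explicit response, including 'accept'."""
    for item in ranked:
        if item.description not in responses:
            raise ValueError(f"no response chosen for: {item.description}")
        item.response = responses[item.description]
    return ranked


# Hypothetical session notes; the second note duplicates the first.
notes = ["Nobody owns data repair", "nobody owns data repair ",
         "Dependency slower than expected"]
items = collect(notes)
ranked = rank(items, {"Nobody owns data repair": (5, 3),
                      "Dependency slower than expected": (3, 4)})
decided = choose(ranked, {"Nobody owns data repair": "rehearse recovery",
                          "Dependency slower than expected": "detect"})
print([(i.description, i.response) for i in decided])
```

Because collect never sees scores and choose refuses to run with a missing response, nobody can defend or dismiss an item before the full inventory exists.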

For principal engineers, the highest-leverage part is often not naming the clever failure. It is finding the risk that has no owner because it sits between teams, phases, or incentives. Boundary risks rarely announce themselves as technical complexity. They show up as "we assumed someone would handle that."

After the pre-mortem, I want the design document or delivery plan to change. If no artifact changes, the exercise probably did not matter. The output should become a smaller scope, a changed sequence, a new rehearsal, a dashboard, a fallback, an explicit accepted risk, or a decision record.

Closing takeaway

Run a technical pre-mortem when the cost of being wrong is high enough that hindsight should not be your first honest review.