
Engineering

Startup Engineering Is the Art of Choosing Which Fires Stay Lit

A startup operating model for separating existential fires from noisy but survivable ones.

In a startup, "put out every fire" is not a strategy. It is a way to exhaust the people who know where the system is weakest.

The hard part is not noticing problems. There are always too many: brittle workflows, missing tests, manual deploy steps, unclear ownership, noisy alerts, data cleanup, performance cliffs, awkward onboarding, half-finished abstractions. A serious engineer can make an infinite list by lunch.

The principal move is deciding which fires must be extinguished, which must be contained, and which can stay lit for now.

The thesis

Startup engineering quality comes from disciplined fire selection, not from pretending urgency removes the need for engineering judgment.

This cuts against two common instincts. One says speed excuses everything. The other says quality requires cleaning everything before moving. Both are lazy in different directions. Startups need an operating model for selective negligence.

The production pattern

Early systems often grow around learning, not elegance. Product direction shifts. Integrations appear before boundaries are mature. Manual operations fill gaps. One engineer knows too much. A script becomes a workflow. A prototype becomes a dependency.

Some of this is rational. Premature architecture can be more damaging than temporary mess. But mess compounds. Eventually every new feature pays a tax to the least understood part of the system.

The challenge is that not all fires create equal risk. A broken internal dashboard may be annoying but survivable. A data correctness issue in a user-visible workflow may be existential. A slow build may hurt morale but not block discovery yet. A missing rollback on a critical path may turn one bad release into an organization-level distraction.

The model

I sort fires into four classes.

Existential fires threaten the organization or product promise. These include security exposure, data loss risk, billing correctness, trust-damaging reliability, or anything that can stop learning from the market. These get direct attention.

Compounding fires make every future change slower or riskier. Examples include unclear ownership of a core path, no migration strategy for a changing data model, or a release process that only one person understands. These may not explode today, but they tax every week.

Contained fires are ugly but bounded. They have known owners, known blast radius, and a path to manual recovery. These can stay lit if the organization is honest about them.

Decorative fires are problems engineers dislike but users and the business do not currently feel. Some cleanup lives here. Not forever, but for now.

For each fire, I ask:

  • What is the blast radius?
  • Is the risk growing with usage, headcount, or product surface?
  • Who owns detection and recovery?
  • What decision would this block later?
  • What is the cheapest containment move?

The answer often changes the work. Instead of "rewrite the subsystem", the right move may be "add an owner, a runbook, a reconciliation check, and a kill switch."
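As a rough sketch, the four classes and the questions above can be written down as a small triage function. The `Assessment` fields, the `classify` name, and the order of the checks are assumptions made for illustration, not a rule. In particular, treating a fire that is felt but unbounded as compounding is only one possible reading, and real triage still needs judgment.

```python
from dataclasses import dataclass
from enum import Enum


class FireClass(Enum):
    EXISTENTIAL = "existential"
    COMPOUNDING = "compounding"
    CONTAINED = "contained"
    DECORATIVE = "decorative"


@dataclass(frozen=True)
class Assessment:
    """Hypothetical yes/no answers to the triage questions for one fire."""
    threatens_product_promise: bool   # security, data loss, billing, trust
    risk_grows_over_time: bool        # with usage, headcount, or product surface
    felt_by_users_or_business: bool
    has_owner: bool                   # someone owns detection and recovery
    blast_radius_known: bool
    manual_recovery_exists: bool


def classify(a: Assessment) -> FireClass:
    # Checks run from most to least severe; the ordering is an assumption.
    if a.threatens_product_promise:
        return FireClass.EXISTENTIAL
    if a.risk_grows_over_time:
        return FireClass.COMPOUNDING
    if not a.felt_by_users_or_business:
        return FireClass.DECORATIVE
    if a.has_owner and a.blast_radius_known and a.manual_recovery_exists:
        return FireClass.CONTAINED
    # Assumption: a felt fire with no owner, unknown blast radius, or no
    # recovery path is treated as compounding until it is contained.
    return FireClass.COMPOUNDING


# Illustrative example: an annoying but bounded internal dashboard.
broken_dashboard = Assessment(
    threatens_product_promise=False,
    risk_grows_over_time=False,
    felt_by_users_or_business=True,
    has_owner=True,
    blast_radius_known=True,
    manual_recovery_exists=True,
)
print(classify(broken_dashboard).value)  # contained
```

The useful part is less the function than the forced answers: a fire with no owner or no recovery path cannot be filed as contained.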

Where this goes wrong

Fire selection can become an excuse for permanent neglect. If every problem is labeled survivable, the organization slowly trains itself to accept bad operating conditions. Engineers burn out when leadership celebrates speed while quietly depending on heroic recovery.

It can also go wrong in the opposite direction. A principal engineer with large-organization habits may overclassify risks as existential because they know how bad systems can get. Startup judgment requires respecting the stage of the business. Some risks are real and still not the next right investment.

The counterpoint is that values matter. Even under pressure, some standards should not be traded away: user trust, data handling, security basics, and clear ownership for critical paths. A startup can move fast without being careless about irreversible damage.

What I do now

I keep an explicit fire ledger for ambiguous environments.

Each entry includes the fire, class, owner, current containment, trigger for escalation, and next review date. This avoids two bad outcomes: forgotten risk and performative panic.
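A ledger entry with those six fields can be as small as a dataclass. This is a minimal sketch: the `LedgerEntry` name, the field names, and the sample entry (including its owner and date) are hypothetical, and a shared document or spreadsheet would serve just as well.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LedgerEntry:
    fire: str                # what is burning, in one sentence
    fire_class: str          # existential, compounding, contained, or decorative
    owner: str               # who owns detection and recovery
    containment: str         # what currently limits the damage
    escalation_trigger: str  # the observable event that reclassifies the fire
    next_review: date        # when this entry must be looked at again


# Hypothetical entry, for illustration only.
ledger = [
    LedgerEntry(
        fire="Release process only one person understands",
        fire_class="compounding",
        owner="platform lead",
        containment="Runbook drafted; a second engineer shadows each release",
        escalation_trigger="A release slips because that person is unavailable",
        next_review=date(2030, 1, 15),
    ),
]

for entry in ledger:
    print(f"{entry.next_review.isoformat()}  [{entry.fire_class}]  {entry.fire}")
```

Making every field required is a deliberate choice in this sketch: an entry cannot be recorded without an owner, a trigger, and a review date.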

I push for containment before perfection. If a workflow is risky, can we reduce blast radius? If a component is owned by memory, can we document recovery? If a migration is too large, can we create a reversible first slice? If an alert is noisy, can we define the one signal that matters?
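One cheap containment move for a risky workflow is the kill switch mentioned earlier. The sketch below assumes the switch is an environment variable named `KILL_<NAME>`; the helper names and the variable convention are made up for illustration, and a feature-flag service or a config row would play the same role.

```python
import os


def kill_switch_on(name, env=os.environ):
    """Return True if the (hypothetical) KILL_<NAME> variable is set to a truthy value."""
    value = env.get(f"KILL_{name.upper()}", "")
    return value.strip().lower() in {"1", "true", "on"}


def run_guarded(name, action, fallback, env=os.environ):
    """Run `action` unless the kill switch for `name` is on; otherwise run `fallback`."""
    if kill_switch_on(name, env):
        return fallback()
    return action()


# Illustrative usage: skip a risky sync and queue the work for manual handling.
result = run_guarded(
    "invoice_sync",
    action=lambda: "synced",
    fallback=lambda: "queued for manual review",
    env={"KILL_INVOICE_SYNC": "1"},
)
print(result)  # queued for manual review
```

The switch does not fix the workflow. It bounds the damage while the real fix waits its turn, which is the point of containment before perfection.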

I also make the cost of leaving a fire lit visible. The organization can choose to accept the risk, but it should not accidentally normalize it.

The review cadence matters. A contained fire that is never reviewed again is not contained. It is abandoned with better vocabulary. I prefer short, dated reviews over elaborate tracking that nobody trusts.
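A dated review is easy to enforce mechanically. As a minimal sketch, assuming each tracked fire carries a next review date, the check below lists the ones whose date has passed. The `Review` type, the `overdue` function, and the sample data are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Review(NamedTuple):
    fire: str
    next_review: date


def overdue(reviews, today):
    """Return reviews whose date is strictly before `today`, most overdue first."""
    late = [r for r in reviews if r.next_review < today]
    return sorted(late, key=lambda r: r.next_review)


# Hypothetical data, for illustration only.
reviews = [
    Review("Noisy paging alert", date(2030, 3, 1)),
    Review("Manual deploy step", date(2030, 1, 10)),
    Review("Single-owner release process", date(2030, 2, 5)),
]

for r in overdue(reviews, today=date(2030, 2, 20)):
    print(f"OVERDUE since {r.next_review.isoformat()}: {r.fire}")
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test and to run against any date.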

Closing takeaway

In startup engineering, maturity is not fixing everything. It is knowing which fires can burn, which need containment, and which must never be allowed to spread.