Why Most Migrations Fail Before Code Starts
The code phase of a migration gets the most attention because it is visible. Branches move, dashboards change, compatibility layers appear, and people can point to progress. But many migrations are already in trouble before the first implementation pull request lands.
The failure usually begins in the plan.
The thesis
Most migrations fail before code starts because they are framed as technical replacement projects instead of as coordinated risk retirement.
The target architecture matters, but the migration architecture matters more. A good end state cannot rescue a bad sequence.
The production pattern
A migration often starts with a reasonable observation: the current system is costly, fragile, limiting, or no longer aligned with the product. A better shape is proposed. People agree in principle. The roadmap gets a line item.
Then the hidden problems surface.
The old and new systems must coexist longer than expected. Ownership is split between people building the replacement and people maintaining the old path. Rollback is discussed as a hope, not a design. Product commitments keep landing on both sides. Data semantics are "almost the same." A long tail of consumers appears late. The migration becomes a second system, not a bridge.
At that point the technical work may be solid, but the program is carrying unpriced risk.
The model
I evaluate migrations across six dimensions.
First, ownership. Who owns the old system until the final consumer is gone? Who owns the migration path? Who can stop the rollout?
Second, compatibility. What must both systems understand at the same time? Which contracts are stable, and which are being translated?
Third, sequencing. What can move independently? What must move together? Which step creates the first irreversible commitment?
Fourth, observability. How will we know the new path is correct before it is the only path? What comparisons are meaningful rather than comforting?
Fifth, rollback. Can we go back after writes have happened, or only before traffic shifts? What data repair would rollback require?
Sixth, exit criteria. What specific condition lets us delete the old path? Without deletion, the migration is not complete.
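The observability question can be made concrete with a shadow comparison: read the same keys through both paths and record every disagreement instead of only a pass rate. The following is a minimal sketch, not a prescribed tool. The names `shadow_compare`, `read_old`, `read_new`, and `normalize` are hypothetical, and it assumes both paths can be read side by side for the same key.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Hashable, Iterable, List, Tuple


@dataclass
class ComparisonReport:
    """Result of reading the same keys through the old and new paths."""
    total: int = 0
    mismatches: List[Tuple[Hashable, Any, Any]] = field(default_factory=list)

    @property
    def mismatch_rate(self) -> float:
        # Zero keys compared is reported as 0.0, not as a division error.
        return len(self.mismatches) / self.total if self.total else 0.0


def shadow_compare(
    keys: Iterable[Hashable],
    read_old: Callable[[Hashable], Any],   # hypothetical reader for the old path
    read_new: Callable[[Hashable], Any],   # hypothetical reader for the new path
    normalize: Callable[[Any], Any] = lambda value: value,
) -> ComparisonReport:
    """Compare both paths key by key and keep the disagreeing values."""
    report = ComparisonReport()
    for key in keys:
        old_value = normalize(read_old(key))
        new_value = normalize(read_new(key))
        report.total += 1
        if old_value != new_value:
            # Keep the raw evidence so "almost the same" can be inspected.
            report.mismatches.append((key, old_value, new_value))
    return report


# Usage with two in-memory stand-ins for the real systems.
old_store = {"a": 1, "b": 2, "c": 3}
new_store = {"a": 1, "b": 20, "c": 3}
report = shadow_compare(old_store.keys(), old_store.get, new_store.get)
```

Keeping the mismatching values, not just a count, is what separates a meaningful comparison from a comforting one: a low mismatch rate can still hide a systematic disagreement about what the data means.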
My migration readiness checklist:
- Named owner for old path and new path
- Written coexistence period
- Consumer inventory with unknowns called out
- Rollout stages tied to evidence
- Rollback plan for each irreversible step
- Deletion criteria agreed before build starts
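One way to keep the checklist from becoming a formality is to treat it as a gate that returns the missing items. This is an illustrative sketch only. `MigrationPlan`, its field names, and `readiness_gaps` are assumptions made up for the example, not an established format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MigrationPlan:
    """Hypothetical plan record; each field mirrors one checklist item."""
    old_path_owner: str = ""
    new_path_owner: str = ""
    coexistence_period: str = ""
    known_consumers: List[str] = field(default_factory=list)
    unknown_consumers_noted: bool = False
    # Each rollout stage is (stage name, evidence required to advance).
    rollout_stages: List[Tuple[str, str]] = field(default_factory=list)
    # Maps each irreversible step to its rollback plan ("" means none written).
    irreversible_steps: Dict[str, str] = field(default_factory=dict)
    deletion_criteria: str = ""


def readiness_gaps(plan: MigrationPlan) -> List[str]:
    """Return the checklist items the plan does not yet satisfy."""
    gaps = []
    if not plan.old_path_owner or not plan.new_path_owner:
        gaps.append("named owner for old path and new path")
    if not plan.coexistence_period:
        gaps.append("written coexistence period")
    if not plan.known_consumers or not plan.unknown_consumers_noted:
        gaps.append("consumer inventory with unknowns called out")
    if not plan.rollout_stages or any(not ev for _, ev in plan.rollout_stages):
        gaps.append("rollout stages tied to evidence")
    if any(not rollback for rollback in plan.irreversible_steps.values()):
        gaps.append("rollback plan for each irreversible step")
    if not plan.deletion_criteria:
        gaps.append("deletion criteria agreed before build starts")
    return gaps
```

An empty list means the plan clears the gate. A non-empty list names exactly what is missing, which is more useful in review than a yes or no.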
I also classify migration risk before approving the plan.
Semantic risk means the old and new paths disagree about what the data means. Operational risk means the team cannot safely run both paths under load. Adoption risk means consumers will not move when the platform is ready. Control-plane risk means flags, routing, or permissions can put the system into a mixed state no one intended. Deletion risk means the old path survives because no one can prove it is unused.
Each category needs a different mitigation. Semantic risk needs comparison and reconciliation. Operational risk needs load testing, dashboards, and rollback drills. Adoption risk needs owner escalation and a visible dependency list. Control-plane risk needs guardrails and simple states. Deletion risk needs instrumentation from the start.
This taxonomy changes planning because "migration risk" stops being a single vague concern. It becomes a set of work items that can be sequenced, owned, and traded off.
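As a rough sketch of what "a set of work items" can look like, the five categories and their mitigations can be written down as data and expanded into items that each need an owner. The `Risk` enum, `WorkItem`, and `work_items` are hypothetical names for illustration. The mitigation lists simply restate the ones given above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional


class Risk(Enum):
    SEMANTIC = "semantic"
    OPERATIONAL = "operational"
    ADOPTION = "adoption"
    CONTROL_PLANE = "control-plane"
    DELETION = "deletion"


# Mitigations per category, taken from the taxonomy described above.
MITIGATIONS: Dict[Risk, List[str]] = {
    Risk.SEMANTIC: ["comparison", "reconciliation"],
    Risk.OPERATIONAL: ["load testing", "dashboards", "rollback drills"],
    Risk.ADOPTION: ["owner escalation", "visible dependency list"],
    Risk.CONTROL_PLANE: ["guardrails", "simple states"],
    Risk.DELETION: ["instrumentation from the start"],
}


@dataclass
class WorkItem:
    risk: Risk
    mitigation: str
    owner: str = "UNASSIGNED"


def work_items(
    risks: Iterable[Risk],
    owners: Optional[Dict[Risk, str]] = None,
) -> List[WorkItem]:
    """Expand classified risks into one work item per mitigation."""
    owners = owners or {}
    return [
        WorkItem(risk, mitigation, owners.get(risk, "UNASSIGNED"))
        for risk in risks
        for mitigation in MITIGATIONS[risk]
    ]


# Usage: a plan classified as carrying semantic and deletion risk.
items = work_items([Risk.SEMANTIC, Risk.DELETION], {Risk.SEMANTIC: "data-team"})
unowned = [item for item in items if item.owner == "UNASSIGNED"]
```

Any item still marked `UNASSIGNED` is a visible gap in the plan, which is the point of splitting the single concern into categories.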
Where this goes wrong
The model can be too heavy for small migrations. If the change is local, reversible, and has few consumers, a lightweight checklist is enough. Not every replacement needs a program.
The other counterpoint is morale. Teams can get trapped in migration planning that feels like distrust. The framing matters: the goal is not to prove the team might fail. The goal is to make success less dependent on heroic coordination.
There is also a real cost to running old and new paths together. Dual operation can create complexity, slow feature work, and confuse ownership. Sometimes the right answer is a sharper cutover with a stronger rollback boundary. The point is to choose between extended coexistence and a sharp cutover deliberately, not to drift into either one.
What I do now
Before supporting a migration, I ask for the timeline of risk, not just the timeline of work. When does risk increase? When does it decrease? Which milestone actually retires an operational burden?
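A timeline of risk can be as simple as a running total per milestone. The sketch below is one possible way to draw it, under loud assumptions: the `Milestone` fields, the function names, and the numbers are all invented for illustration, and the risk scores are whatever relative points a team agrees on, not a measured quantity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Milestone:
    name: str
    risk_added: int     # team-estimated points of new risk this step introduces
    risk_retired: int   # team-estimated points of existing risk this step removes


def risk_timeline(milestones: List[Milestone], baseline: int = 0) -> List[Tuple[str, int]]:
    """Return the running risk level after each milestone."""
    level = baseline
    timeline = []
    for milestone in milestones:
        level += milestone.risk_added - milestone.risk_retired
        timeline.append((milestone.name, level))
    return timeline


def peak(timeline: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return the milestone after which carried risk is highest."""
    return max(timeline, key=lambda entry: entry[1])


# Usage with made-up scores: risk rises through the middle and only
# drops below the starting level once the old path is deleted.
plan = [
    Milestone("dual writes on", risk_added=3, risk_retired=0),
    Milestone("reads shifted", risk_added=2, risk_retired=1),
    Milestone("old path deleted", risk_added=0, risk_retired=6),
]
timeline = risk_timeline(plan, baseline=5)
```

Even with rough scores, the shape is informative: it shows which milestone carries the peak, and whether any milestone before deletion actually leaves the team with less risk than it started with.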
I also ask teams to write the deletion plan at the start. This changes the conversation. A migration plan that cannot explain deletion usually has not understood its consumers. A migration plan that cannot explain rollback usually has not understood its write semantics.
The principal-engineer lens is sequencing. Many migrations are technically correct in the final state and dangerous in the middle. The middle is where users live, where teams operate, and where priorities change.
Closing takeaway
Do not ask only whether the destination architecture is better. Ask whether every step toward it reduces more risk than it introduces.