Deletion Is the Hardest Feature in Distributed Data Systems
Deletion looks simple when data has one home. Remove a row, clear an object, mark a record, and move on. Most production systems stopped having one home a long time ago.
The hard part is not deleting a value. The hard part is deleting every consequence of that value.
The thesis
Deletion is hard because modern systems multiply data into indexes, streams, caches, search documents, backups, analytics tables, feature stores, logs, exports, and human workflows.
A delete request is therefore not a database operation. It is a distributed product promise. The system is saying that data will either disappear, become inaccessible, or be retained for a named reason under a named policy.
The principal-engineer question is not "did we delete the row?" It is "can we prove what happened to every derived copy?"
The production pattern
The common pattern starts with a product requirement: users, operators, or policy owners need data removed. A team adds a deleted flag or a hard delete in the primary store. The application stops showing the record. The happy-path demo passes.
Then someone asks about search. Then analytics. Then cached API responses. Then message topics. Then offline jobs. Then backups. Then downstream systems that created their own derived records. Then logs that contain identifiers or payload snippets. Then dashboards that aggregated the value into a count.
No one was careless. The system was doing what distributed data systems do: preserving, copying, denormalizing, indexing, and replaying information so the product could be fast and reliable.
Deletion is the bill for that convenience.
The problem gets worse when the delete has to be durable across time. A stream consumer may be down. A compaction job may run later. A warehouse table may be rebuilt from old input. A restore from backup may accidentally resurrect data that was deleted yesterday. A cache may serve old content after the primary store has complied.
If deletion is treated as a local mutation, the product promise becomes fiction.
The trap
The trap is using "soft delete" as a complete answer. Soft deletion is often useful, but it only answers one question: should normal application reads hide this record? It does not answer whether the data should remain in derived stores, backups, logs, exports, or training inputs. It does not answer who can undelete it. It does not answer when physical removal occurs.
Another trap is treating hard deletion as morally cleaner. A hard delete from the primary database can make recovery, audit, reconciliation, and propagation harder. It may remove the very marker downstream systems need in order to delete their own copies.
The deepest trap is forgetting that deletion competes with other promises. Reliability wants backups. Analytics wants history. Audit wants evidence. Debugging wants logs. Product wants undo. Compliance may require retention or removal depending on data class and jurisdiction. A serious deletion design has to reconcile those promises instead of letting the primary database pretend the conflict is gone.
The model
I use a six-part deletion model.
Logical deletion defines what the product should stop treating as active. This may be a deleted flag, a tombstone, a disabled state, a revoked relationship, or a lifecycle transition. The important part is that normal reads and business workflows agree on the meaning.
Physical deletion defines when bytes should be removed from each store. Physical deletion may happen immediately, after a retention period, after compaction, or never for a legally retained class. The policy has to be explicit per data type and store.
Propagation defines how every derived copy learns about the deletion. Events, tombstones, change-data capture, periodic sweeps, rebuild rules, and cache invalidation all count. Propagation needs retry behavior because the systems that need the message will not always be healthy when it is sent.
Retention defines what must remain and for how long. Retention is not an excuse to keep everything forever. It is a product and policy decision that names purpose, duration, access controls, and removal schedule.
Audit constraints define what evidence must survive after deletion. Sometimes the system must keep proof that an action occurred without keeping the original payload. That may require redaction, tokenization, aggregation, or separate audit records with narrower fields.
Proof of deletion defines how the organization knows the promise was met. Proof can be job completion records, sampled checks, per-store deletion status, tombstone age, export inventories, cache purge logs, or restore tests that verify deleted data is not resurrected.
The short version, in the same order: state, bytes, copies, time, evidence, proof.
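One way to make the six parts concrete is to write them down as data, one entry per data class and store. The following is a minimal sketch, not a definitive schema: the names DeletionPolicy, PhysicalRemoval, and policy_gaps, along with every field value in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List, Optional, Tuple


class PhysicalRemoval(Enum):
    IMMEDIATE = "immediate"
    AFTER_RETENTION = "after_retention"
    ON_COMPACTION = "on_compaction"
    NEVER = "never"  # for example, a legally retained class


@dataclass(frozen=True)
class DeletionPolicy:
    """Hypothetical per-data-class, per-store policy record."""
    data_class: str
    store: str
    logical_marker: str                # state: what "deleted" means to reads
    physical_removal: PhysicalRemoval  # bytes: when they are removed
    propagation: str                   # copies: how this store learns of deletes
    retention: Optional[timedelta]     # time: how long anything is kept
    retention_purpose: Optional[str]   # time: why it is kept
    audit_fields: Tuple[str, ...]      # evidence: what survives deletion
    proof: str                         # proof: how completion is verified


def policy_gaps(policy: DeletionPolicy) -> List[str]:
    """Return the parts of the promise this policy leaves unnamed."""
    gaps = []
    if policy.retention is not None and not policy.retention_purpose:
        gaps.append("retention has no named purpose")
    if (policy.physical_removal is PhysicalRemoval.AFTER_RETENTION
            and policy.retention is None):
        gaps.append("removal waits on a retention period that is not set")
    if policy.physical_removal is PhysicalRemoval.NEVER and not policy.retention_purpose:
        gaps.append("data is kept forever without a named reason")
    if not policy.propagation:
        gaps.append("no propagation mechanism named")
    if not policy.proof:
        gaps.append("no proof of deletion named")
    return gaps


search_policy = DeletionPolicy(
    data_class="private_identifier",
    store="search_index",
    logical_marker="tombstone",
    physical_removal=PhysicalRemoval.AFTER_RETENTION,
    propagation="tombstone event",
    retention=timedelta(days=30),
    retention_purpose="undo window",
    audit_fields=("record_id", "deleted_at"),
    proof="per-store deletion status",
)
print(policy_gaps(search_policy))  # prints []
```

The point of the check is not the specific rules. It is that a missing answer becomes a visible gap in review instead of an unstated assumption.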
Where this model breaks
Not all data needs hard deletion. Sometimes retention is the correct product promise. Financial records, security evidence, operational audit trails, and safety-relevant history may need to remain under controlled access. Pretending every delete should erase all traces can create worse failures than retaining a narrow, justified record.
The model also breaks if applied with the same weight to every field. A public preference, a derived aggregate, a private identifier, and a sensitive payload do not deserve the same machinery. The right unit of design is not always a table. It is the data class and the promises attached to it.
There is another counterpoint: absolute proof can be impossible in some environments. Backups, third-party exports, human downloads, and historical aggregates may impose limits. A mature design names those limits clearly. It does not hide them behind vague wording.
What I do now
When deletion appears in a design review, I ask for a data-copy inventory. Not a perfect enterprise map. A practical list: primary stores, secondary indexes, caches, queues, event streams, logs, analytics tables, exports, backups, and derived features. If a team cannot draw where the data goes, it cannot promise deletion.
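The inventory does not need tooling to be useful, but even a small script can show which copies have no known deletion path. This is a rough sketch under assumed names: CopyLocation, unaccounted_copies, and the sample entries are hypothetical, not taken from any real system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CopyLocation:
    """One place a data class can end up (hypothetical shape)."""
    name: str
    kind: str                     # "primary", "cache", "stream", "backup", ...
    deletion_path: Optional[str]  # how a delete reaches it; None means unknown


def unaccounted_copies(inventory: List[CopyLocation]) -> List[str]:
    """Names of copies for which nobody has named a deletion path."""
    return [loc.name for loc in inventory if not loc.deletion_path]


inventory = [
    CopyLocation("orders_db", "primary", "hard delete after tombstone"),
    CopyLocation("orders_search", "search index", "tombstone consumer"),
    CopyLocation("api_cache", "cache", "purge on tombstone, plus TTL"),
    CopyLocation("nightly_backup", "backup", None),
    CopyLocation("csv_exports", "export", None),
]
print(unaccounted_copies(inventory))  # prints ['nightly_backup', 'csv_exports']
```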
I prefer tombstones for distributed propagation when downstream systems need to react. A tombstone is not glamorous, but it gives consumers a durable fact to process: this identity existed, and now it must be removed or hidden according to policy.
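A small in-memory sketch of that idea follows. It assumes a hypothetical TombstoneLog that keeps each tombstone pending per consumer until the consumer applies it without error, and a stand-in SearchIndex whose delete is idempotent. A real system would use a durable log or queue; none of these names refer to an actual library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tombstone:
    entity_id: str
    deleted_at: float  # epoch seconds


class SearchIndex:
    """Stand-in downstream store that can be temporarily unhealthy."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.healthy = True

    def apply(self, tombstone: Tombstone) -> None:
        if not self.healthy:
            raise ConnectionError("search index unavailable")
        # Idempotent: removing an already-absent document is not an error.
        self.docs.pop(tombstone.entity_id, None)


class TombstoneLog:
    """Keeps tombstones pending per consumer until they are applied."""

    def __init__(self, consumers: Dict[str, Callable[[Tombstone], None]]) -> None:
        self.consumers = consumers
        self.pending: Dict[str, List[Tombstone]] = {name: [] for name in consumers}

    def publish(self, tombstone: Tombstone) -> None:
        for queue in self.pending.values():
            queue.append(tombstone)

    def deliver(self) -> None:
        """One delivery pass; failures stay pending for the next pass."""
        for name, queue in list(self.pending.items()):
            still_pending = []
            for tombstone in queue:
                try:
                    self.consumers[name](tombstone)
                except Exception:
                    still_pending.append(tombstone)
            self.pending[name] = still_pending

    def outstanding(self) -> Dict[str, int]:
        return {name: len(queue) for name, queue in self.pending.items()}


index = SearchIndex()
index.docs["user-1"] = "profile text"
log = TombstoneLog({"search": index.apply})

index.healthy = False
log.publish(Tombstone("user-1", deleted_at=1_700_000_000.0))
log.deliver()
print(log.outstanding())  # prints {'search': 1}

index.healthy = True
log.deliver()
print(log.outstanding(), "user-1" in index.docs)  # prints {'search': 0} False
```

The outstanding count doubles as a simple piece of evidence: a non-zero value for a consumer means the deletion has not yet reached that store.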
I ask for restore behavior. If last week's backup is restored into a recovery environment, does deleted data come back? If yes, what reconciliation pass removes it again before the system becomes authoritative? Backups are part of deletion design, not an exception to it.
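A reconciliation pass can be as simple as replaying retained tombstones against the restored data before it is promoted. The sketch below assumes two things that are not guaranteed in every system: tombstones are stored outside the backup being restored, and entity identifiers are never reused. The function name reconcile_restore and the record shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Tombstone:
    entity_id: str
    deleted_at: float  # epoch seconds


def reconcile_restore(
    restored: Dict[str, dict],
    tombstones: Iterable[Tombstone],
) -> List[str]:
    """Remove resurrected records from a restored snapshot, in place.

    Returns the ids that were removed so the pass leaves evidence behind.
    Assumes ids are never reused after deletion.
    """
    removed = []
    for tombstone in tombstones:
        if tombstone.entity_id in restored:
            del restored[tombstone.entity_id]
            removed.append(tombstone.entity_id)
    return sorted(removed)


# Snapshot taken before "user-2" was deleted, so the restore brings it back.
restored_snapshot = {
    "user-1": {"name": "kept"},
    "user-2": {"name": "deleted yesterday"},
}
retained_tombstones = [
    Tombstone("user-2", deleted_at=1_700_000_500.0),
    Tombstone("user-9", deleted_at=1_700_000_600.0),  # never in this snapshot
]

print(reconcile_restore(restored_snapshot, retained_tombstones))  # prints ['user-2']
print(sorted(restored_snapshot))  # prints ['user-1']
```

This only works if tombstones outlive the oldest backup that could still be restored, which makes tombstone retention part of the backup policy.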
I also ask teams to separate product visibility from retained evidence. Hiding a record from users, removing a payload from search, and keeping a minimal audit trace are different operations. Combining them into one boolean creates confusion.
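As an illustration of keeping those operations apart, the sketch below models visibility, search presence, and the audit trace as separate fields with separate functions. All names here (RecordState, hide_from_users, remove_from_search, erase_payload_keep_trace) are hypothetical, and the audit fields shown are one possible minimal set rather than a recommendation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecordState:
    """Three separate facts about a record instead of one is_deleted flag."""
    visible_to_users: bool = True
    indexed_in_search: bool = True
    payload: Optional[str] = None
    audit_trace: Optional[Dict[str, str]] = None


def hide_from_users(state: RecordState) -> None:
    state.visible_to_users = False


def remove_from_search(state: RecordState) -> None:
    state.indexed_in_search = False


def erase_payload_keep_trace(
    state: RecordState, record_id: str, actor: str, deleted_at: str
) -> None:
    # Keep proof that the action occurred, without the original payload.
    state.audit_trace = {
        "record_id": record_id,
        "action": "deleted",
        "actor": actor,
        "at": deleted_at,
    }
    state.payload = None


state = RecordState(payload="sensitive text")
hide_from_users(state)
print(state.visible_to_users, state.indexed_in_search)  # prints False True

remove_from_search(state)
erase_payload_keep_trace(state, "rec-42", "user-7", "2024-01-01T00:00:00Z")
print(state.payload, sorted(state.audit_trace))
# prints None ['action', 'actor', 'at', 'record_id']
```

The intermediate output is the useful part: a record can be hidden from users while still present in search, and a single boolean cannot represent that state.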
The principal-engineer lens is promise management. Deletion sits at the boundary between user trust, operational safety, legal constraints, and system evolution. It is expensive because the system has been optimized to remember.
Closing takeaway
Do not approve deletion until you can name where the data lives, how removal propagates, what is retained, and what evidence proves the promise.