Engineering

Why Data Quality Problems Look Like Engineering Problems

A taxonomy for separating data truth, pipeline behavior, model symptoms, and product expectations.

Data quality problems rarely announce themselves as data quality problems. They show up as flaky tests, bad recommendations, strange model behavior, broken dashboards, slow investigations, and arguments about whose system is wrong.

That is why senior engineers need a sharper taxonomy than "the data is bad."

The thesis

Many data quality problems look like engineering problems because the system has not separated truth, capture, pipeline, interpretation, and expectation.

Until those layers are separated, teams debug symptoms instead of causes.

The production pattern

A downstream system behaves incorrectly. The immediate suspect is the component closest to the visible failure: the model, the API, the dashboard, the job, the cache.

Sometimes that component is guilty. Often it is faithfully processing data that changed meaning upstream, arrived late, was duplicated, lost a field, violated an implicit invariant, or never meant what the downstream consumer thought it meant.

The engineering failure is not merely that the data was imperfect. It is that the contracts around the data were too weak to make the imperfection visible.

The model

I sort data quality failures into five buckets.

Truth problems are disagreements about the real-world meaning of a field or label. The system may be working, but people disagree about what "active," "complete," "safe," or "eligible" means.

Capture problems happen when the source records the wrong thing, omits context, samples poorly, or allows inconsistent entry.

Pipeline problems happen when transport, transformation, deduplication, ordering, backfill, or schema evolution changes the data.

Interpretation problems happen when a consumer applies the wrong semantics, joins at the wrong grain, treats missing as false, or assumes freshness that does not exist.
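
As an illustration of joining at the wrong grain, here is a minimal sketch in plain Python. The `orders` and `line_items` records and all function names are hypothetical; the point is only that a value defined once per order gets repeated once per line item after the join, so summing it inflates the result.

```python
from collections import defaultdict

# Hypothetical data: one row per order, and one row per line item.
orders = [
    {"order_id": 1, "order_total": 100.0},
    {"order_id": 2, "order_total": 40.0},
]
line_items = [
    {"order_id": 1, "sku": "A"},
    {"order_id": 1, "sku": "B"},
    {"order_id": 2, "sku": "C"},
]


def revenue_wrong_grain(orders, line_items):
    """Join orders to line items, then sum order_total.

    The joined rows are at line-item grain, so each order_total is
    repeated once per item and the sum is inflated.
    """
    totals = {o["order_id"]: o["order_total"] for o in orders}
    joined = [{**item, "order_total": totals[item["order_id"]]} for item in line_items]
    return sum(row["order_total"] for row in joined)


def revenue_order_grain(orders):
    """Sum order_total at the grain where it is defined: one row per order."""
    return sum(o["order_total"] for o in orders)


def fan_out(line_items):
    """Return join keys that appear more than once, i.e. keys that multiply rows."""
    counts = defaultdict(int)
    for item in line_items:
        counts[item["order_id"]] += 1
    return {key: n for key, n in counts.items() if n > 1}


print(revenue_wrong_grain(orders, line_items))  # 240.0, order 1 counted twice
print(revenue_order_grain(orders))              # 140.0
print(fan_out(line_items))                      # {1: 2}
```

Neither function is broken in the engineering sense. The pipeline delivered correct rows; the consumer summed them at a grain the field was never defined for.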

Expectation problems happen when the product wants certainty, completeness, or latency that the data supply chain cannot provide.

The diagnostic checklist:

  • Is the disputed field measuring the thing people think it measures?
  • Where is the first point the bad value exists?
  • Did the schema change, or did the meaning change without schema movement?
  • Is the consumer relying on ordering, freshness, or completeness?
  • Are missing, unknown, false, and not-applicable distinct states?
  • Which owner can change the contract rather than patch the symptom?
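
The question about missing, unknown, false, and not-applicable can be made concrete with an explicit state type. This is a sketch under assumptions: the `Eligibility` enum, the accepted string spellings, and the `applies` flag are all hypothetical, and a real source system may draw these lines differently.

```python
from enum import Enum
from typing import Optional


class Eligibility(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"        # a value was recorded, but it is not a usable answer
    MISSING = "missing"        # nothing was recorded, or the value was lost
    NOT_APPLICABLE = "n/a"     # the question does not apply to this record


def parse_eligibility(raw: Optional[str], applies: bool = True) -> Eligibility:
    """Map a raw source value to one explicit state instead of a bare bool."""
    if not applies:
        return Eligibility.NOT_APPLICABLE
    if raw is None or not raw.strip():
        return Eligibility.MISSING
    text = raw.strip().lower()
    if text in {"true", "t", "1", "yes"}:
        return Eligibility.TRUE
    if text in {"false", "f", "0", "no"}:
        return Eligibility.FALSE
    return Eligibility.UNKNOWN


def can_grant(state: Eligibility) -> bool:
    """Only an explicit TRUE grants; every other state must be handled on purpose."""
    return state is Eligibility.TRUE
```

A consumer that writes `bool(raw)` or `raw == "true"` collapses four of these states into one. Keeping them distinct forces the downstream code to decide what missing means rather than inherit a default.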

Where this goes wrong

The counterpoint is that not every data issue deserves a platform program. Some problems are local bugs with local fixes. A broken parser, missed null check, or incorrect join can be repaired without creating a governance process.

The failure is swinging between extremes. One extreme treats every data incident as a one-off. The other turns every field into a committee decision.

The principal-engineer move is to find the contract that would have made the failure cheaper to detect.

What I do now

I try to move data discussions from blame to contracts.

For critical data, I want named owners, semantic definitions, freshness expectations, allowed null states, backfill rules, quality checks, and consumer-facing change processes.
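
One way to make that list reviewable is to write it down as a small record per field. The sketch below is illustrative only: `FieldContract`, its attribute names, and the example values are hypothetical, not a standard or an existing library.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FieldContract:
    name: str
    owner: str                                   # named owner who can change the contract
    definition: str                              # semantic definition in plain language
    max_staleness: timedelta                     # freshness expectation
    allowed_null_states: frozenset = frozenset() # e.g. {"unknown", "not_applicable"}
    backfill_rule: str = "none"                  # how historical values may be rewritten
    change_notice: timedelta = timedelta(0)      # notice owed to consumers before a change

    def gaps(self) -> list:
        """Return the parts of the contract that are still undefined."""
        problems = []
        if not self.owner.strip():
            problems.append("no named owner")
        if not self.definition.strip():
            problems.append("no semantic definition")
        if self.max_staleness <= timedelta(0):
            problems.append("no freshness expectation")
        return problems


# Hypothetical example values.
eligible = FieldContract(
    name="eligible",
    owner="enrollment-team",
    definition="Customer passed all checks as of the last source refresh.",
    max_staleness=timedelta(hours=24),
    allowed_null_states=frozenset({"unknown", "not_applicable"}),
    backfill_rule="append corrections, never overwrite history",
    change_notice=timedelta(days=14),
)
```

The value of a record like this is less the code than the review it forces: an empty `owner` or `definition` becomes a visible gap instead of an assumption.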

For ML systems, I want data quality checks tied to model behavior. A model regression may be caused by input distribution drift, label drift, retrieval drift, or product expectation drift. Each calls for a different fix.
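
For input distribution drift specifically, one common check is the population stability index (PSI), which compares the binned distribution of a feature at training time against what the model sees now. This is a minimal sketch, not a recommendation of PSI over other drift measures; the bin edges, the smoothing constant, and any alert threshold are assumptions to tune per feature.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Share of values in each bin defined by sorted `edges`.

    There are len(edges) + 1 bins. Empty bins are floored at `eps`
    so the logarithm in PSI stays defined.
    """
    if not values:
        raise ValueError("cannot compute shares of an empty sample")
    counts = [0] * (len(edges) + 1)
    for v in values:
        index = sum(1 for edge in edges if v >= edge)  # number of edges at or below v
        counts[index] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float]) -> float:
    """Population stability index: sum over bins of (a - e) * ln(a / e)."""
    e_shares = bin_shares(expected, edges)
    a_shares = bin_shares(actual, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))
```

Every term in the sum is non-negative, so identical binned distributions score exactly zero and the score grows as the two samples diverge. A check like this only covers the input side; label drift and retrieval drift need their own measurements.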

For analytics, I want dashboards to expose uncertainty where possible. A clean chart built on ambiguous semantics creates false confidence.

Choosing the right fix

The fix depends on which bucket is actually failing.

For truth problems, code is rarely the first answer. The organization needs a definition people can live with, plus a migration path for the old meaning. If "active" means one thing to billing and another to product analytics, a cleaner pipeline will only move the disagreement faster.

For capture problems, the fix is usually closer to source validation, workflow design, or required context. Downstream consumers cannot reliably infer data that was never recorded.

For pipeline problems, I want mechanical guarantees: schema checks, idempotent backfills, ordering rules, replay tests, and alerts tied to freshness or volume bands. These are classic engineering controls.
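
Three of those controls fit in a few lines each. The sketch below uses an in-memory dict as a stand-in for a table; the function names, the `id` key, and the 20% tolerance are hypothetical choices for illustration.

```python
def upsert(table: dict, rows: list, key: str = "id") -> dict:
    """Idempotent write: rows are keyed, so replaying a batch changes nothing."""
    for row in rows:
        table[row[key]] = row
    return table


def missing_columns(row: dict, required: set) -> set:
    """Schema check: which required columns are absent from this row?"""
    return required - set(row)


def volume_in_band(count: int, expected: int, tolerance: float = 0.2) -> bool:
    """Volume check: is the row count within +/- tolerance of the expected count?"""
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    return low <= count <= high


# Replay test: applying the same batch twice must equal applying it once.
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
table = upsert({}, batch)
snapshot = dict(table)
upsert(table, batch)
assert table == snapshot
```

An append-only write would fail the replay test by duplicating both rows, which is exactly the kind of failure that later surfaces as a "wrong" dashboard or model.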

For interpretation problems, the fix is consumer education plus stronger contracts. A field should advertise whether it is nullable, delayed, approximate, sampled, or scoped to a particular grain.

For expectation problems, the answer may be product negotiation. Some data cannot be perfectly fresh, complete, and cheap at the same time. Pretending otherwise creates brittle systems and angry stakeholders.

The mistake is applying the most familiar fix to every bucket. Engineers reach for pipeline hardening. Analysts reach for definitions. Product teams reach for UI copy. Each can be right, but only after the failure layer is named.

I also try to write the expected failure in observable terms. "The data is bad" does not help an operator. "The eligibility field can lag the source system by a full refresh window" gives the consumer something to design around.
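
A statement like that can be attached to the value itself. The following sketch assumes a hypothetical 24-hour refresh window and a hypothetical `read_eligibility` helper; the idea is that the consumer receives the lag alongside the value and can tell expected lag from a breach of the stated window.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the source refreshes once every 24 hours.
REFRESH_WINDOW = timedelta(hours=24)


def read_eligibility(value: bool, source_updated_at: datetime, now: datetime,
                     window: timedelta = REFRESH_WINDOW) -> dict:
    """Return the value together with its observed lag behind the source.

    Lag up to `window` is the documented, expected behavior.
    Lag beyond `window` means the freshness expectation was violated.
    """
    lag = now - source_updated_at
    return {
        "value": value,
        "lag_seconds": lag.total_seconds(),
        "within_window": lag <= window,
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = read_eligibility(True, datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc), now)
stale = read_eligibility(True, datetime(2023, 12, 31, 12, 0, tzinfo=timezone.utc), now)
```

Here `recent` is six hours behind and inside the window, while `stale` is two days behind and outside it. The consumer can then choose to block, warn, or proceed, instead of discovering the lag during an incident.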

Closing takeaway

When data quality looks like an engineering bug, separate truth, capture, pipeline, interpretation, and expectation before choosing the fix.