Streams Need Contracts, Not Just Consumers
Event streams often look healthy while their contracts are failing. Producers publish. Consumers consume. Lag is acceptable. Dashboards are green. Then a replay, schema change, late event, or new consumer exposes the real problem: nobody agreed what the stream meant.
The failure is social before it is technical.
The thesis
Streams need explicit contracts because consumers depend on meaning, ordering, retention, replay, and failure behavior, not just bytes arriving from a producer.
Without that contract, every consumer builds its own interpretation. The stream becomes a shared dependency with private semantics.
The production pattern
A team emits events because it wants decoupling. Another team subscribes because polling is expensive or too slow. A third consumer appears for analytics, search, notifications, billing, machine learning, audit, or operational automation.
At first, this feels like a good architecture. Producers do not call every consumer. Consumers can move independently. New use cases can attach to the stream without asking for request-path changes.
Then the stream matures. A consumer needs to replay from last month. Another assumes events are ordered by user. A producer changes when an event fires. A late event arrives after a derived state has already moved forward. A poison event blocks a partition. A retention setting is shortened to reduce cost. An enum grows. A field keeps the same name but changes meaning.
The system did not become coupled because it used events. It became coupled because the event contract was implicit.
The trap
The trap is treating consumers as proof of contract.
If three consumers are running, it is tempting to believe the stream is understood. In reality, each consumer may rely on a different undocumented property. One treats the event as a fact. Another treats it as a hint. Others assume exactly-once effects, assume replay is safe, or work only because current producers happen to emit in a particular order.
The second trap is using event names as semantics. A name like UserUpdated or OrderChanged sounds descriptive, but it does not say what changed, when the change becomes true, whether the event is complete, whether missing fields mean unknown or unchanged, or whether consumers can rebuild state from the stream.
The third trap is believing loose coupling means no coordination. Streams reduce synchronous coordination. They do not remove semantic coordination.
The model
I use a stream contract checklist with six fields.
Event name: what business fact occurred? The name should describe a meaningful fact, not a producer implementation detail. If the event means "profile email verified," do not make consumers infer that from a generic update event.
Semantic version: how can meaning evolve? Versioning is not only schema shape. A field can keep the same type and change meaning. The contract should say which changes are additive, which require a new event, and how long old semantics remain supported.
Ordering key: what ordering, if any, is promised? Per-user ordering, per-resource ordering, partition ordering, and global ordering are different promises with different costs. If no ordering is promised, say so. If consumers must handle reordering, make that explicit.
Retention: how long can consumers rely on the stream for recovery, replay, or bootstrap? Retention is a product and operational promise. A stream retained for a few days cannot be the only recovery path for a consumer that may need historical rebuilds.
Late arrival: what should consumers do when events arrive after related state has advanced? Late data is not an edge case in distributed systems. The contract should say whether consumers reconcile, ignore, compensate, or emit a correction.
Poison event handling: what happens when one event cannot be processed? The answer should avoid blocking unrelated work forever. Dead-letter handling, quarantine, skip-with-audit, retry limits, and owner escalation belong in the contract, not in each consumer's private panic path.
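As an illustration of the last item, here is a minimal sketch of a retry limit combined with quarantine. The function name process_batch, the event shape (a dict with an "id" key), and the default of three attempts are all assumptions made for this example, not part of any particular broker's API.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]  # hypothetical event shape: a dict with at least an "id" key


def process_batch(
    events: List[Event],
    handler: Callable[[Event], None],
    max_attempts: int = 3,  # hypothetical retry limit; the contract should name the real one
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Process events in order and quarantine any event that keeps failing.

    Returns the ids that were processed and a list of dead-letter records.
    """
    processed: List[str] = []
    dead_letters: List[Dict[str, Any]] = []
    for event in events:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(event)
            except Exception as exc:  # broad on purpose: any failure counts as an attempt
                last_error = exc
            else:
                processed.append(event["id"])
                break
        else:
            # Every attempt failed: record the event for its owner and move on,
            # so unrelated events behind it are not blocked.
            dead_letters.append(
                {"event": event, "error": repr(last_error), "attempts": max_attempts}
            )
    return processed, dead_letters
```

Skipping a failed event keeps the partition moving, but it also means later events for the same key are processed without it. Whether that is acceptable is exactly the kind of decision the contract should record.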
This checklist is not about bureaucracy. It makes the stream safe to share across teams because it names the assumptions that would otherwise become hidden dependencies.
Where this model breaks
Some streams should stay loose. Internal telemetry, short-lived diagnostics, local metrics, and exploratory analytics may not deserve a strong semantic contract. The cost of formalizing every event can slow useful learning.
The boundary I use is replay and consequence. If consumers need replay to restore state, or if the event drives user-visible, financial, permission, operational, or compliance decisions, the stream needs a real contract. If the event is disposable observation, a lighter standard is fine.
There is also a risk of overfitting contracts to today's consumers. A stream contract should describe producer promises, not every consumer's desired convenience. Otherwise the producer becomes a custom integration service wearing an event-stream costume.
What I do now
When reviewing an event stream, I ask whether a new consumer could build safely from the contract alone. If the answer requires reading producer code, asking a long-tenured engineer, or copying another consumer's behavior, the contract is not real yet.
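Part of that question can be checked mechanically if the contract is recorded as data. The sketch below is one possible shape, assuming the six checklist fields are stored as free-text answers. The StreamContract class and the unanswered helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class StreamContract:
    """Hypothetical record of the six checklist answers. None means "not answered yet"."""

    event_name: Optional[str] = None
    semantic_version: Optional[str] = None
    ordering: Optional[str] = None  # an explicit "none promised" counts as an answer
    retention: Optional[str] = None
    late_arrival: Optional[str] = None
    poison_event_handling: Optional[str] = None


def unanswered(contract: StreamContract) -> List[str]:
    """Return the names of checklist fields that are missing or blank."""
    return [
        f.name
        for f in fields(contract)
        if not (getattr(contract, f.name) or "").strip()
    ]


draft = StreamContract(
    event_name="ProfileEmailVerified",
    semantic_version="1",
    ordering="none promised",
)
print(unanswered(draft))  # ['retention', 'late_arrival', 'poison_event_handling']
```

A check like this only shows that each question has an answer. Whether the answers are true still takes a human review.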
I also distinguish fact events from notification events. A fact event says something became true and can often support replay. A notification event says something happened that may be useful to react to, but it may not be enough to reconstruct state. Mixing these creates fragile consumers.
For important streams, I want replay tested before it is needed. The middle of an incident is the wrong time to discover that old events no longer parse, retention is too short, ordering assumptions were false, or side effects are not idempotent.
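A minimal sketch of one such test follows. It covers only the idempotency question, not parsing or retention. The BalanceProjection consumer, its apply and snapshot methods, and the event fields are all hypothetical, invented for this example.

```python
from typing import Any, Callable, Dict, List, Set

Event = Dict[str, Any]


class BalanceProjection:
    """Hypothetical consumer that builds account balances from deposit events."""

    def __init__(self) -> None:
        self._balances: Dict[str, int] = {}
        self._seen_ids: Set[str] = set()

    def apply(self, event: Event) -> None:
        # Deduplicate by event id so a replayed event has no second effect.
        if event["id"] in self._seen_ids:
            return
        self._seen_ids.add(event["id"])
        account = event["account"]
        self._balances[account] = self._balances.get(account, 0) + event["amount"]

    def snapshot(self) -> Dict[str, int]:
        return dict(self._balances)


def replay_is_safe(make_consumer: Callable[[], Any], log: List[Event]) -> bool:
    """Compare state after one pass with state after the same log is delivered twice."""
    once = make_consumer()
    for event in log:
        once.apply(event)
    replayed = make_consumer()
    for event in list(log) + list(log):
        replayed.apply(event)
    return once.snapshot() == replayed.snapshot()


LOG = [
    {"id": "e1", "account": "a", "amount": 10},
    {"id": "e2", "account": "b", "amount": 5},
    {"id": "e3", "account": "a", "amount": 7},
]
```

Calling replay_is_safe(BalanceProjection, LOG) returns True here. A consumer that added amounts without checking event ids would fail the same check, which is the failure worth finding before an incident.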
I ask producers to publish negative promises as well. No global ordering. Unknown enum values may appear. Events can arrive late. Retention is not a long-term archive. Consumers must deduplicate by event id. These statements feel defensive, but they stop consumers from building on guarantees the stream never made.
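On the consumer side, three of those negative promises might translate into code like the sketch below. It assumes each event carries an id and a per-order version number that only increases. Those fields, the OrderStatus enum, and the OrderView class are hypothetical. Ignoring late events is only one of the possible policies, and the contract should say which one applies.

```python
from enum import Enum
from typing import Any, Dict, Set


class OrderStatus(Enum):
    PLACED = "placed"
    SHIPPED = "shipped"
    UNKNOWN = "unknown"  # catch-all for values the producer adds later


def parse_status(raw: str) -> OrderStatus:
    """Map an unrecognized enum value to UNKNOWN instead of failing."""
    try:
        return OrderStatus(raw)
    except ValueError:
        return OrderStatus.UNKNOWN


class OrderView:
    """Hypothetical consumer that tracks the latest known status of each order."""

    def __init__(self) -> None:
        self.status: Dict[str, OrderStatus] = {}
        self._version: Dict[str, int] = {}  # highest version applied per order
        self._seen_ids: Set[str] = set()

    def apply(self, event: Dict[str, Any]) -> str:
        """Return what happened: "applied", "duplicate", or "late"."""
        if event["id"] in self._seen_ids:
            return "duplicate"  # the contract said: deduplicate by event id
        self._seen_ids.add(event["id"])
        order_id = event["order_id"]
        if event["version"] <= self._version.get(order_id, 0):
            return "late"  # state already moved past this event; ignored here
        self._version[order_id] = event["version"]
        self.status[order_id] = parse_status(event["status"])
        return "applied"
```

Returning the outcome instead of silently dropping the event makes it easy to count duplicates and late arrivals, which is useful evidence when the contract is next reviewed.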
From a principal engineer's seat, the goal is organizational decoupling. Events can let teams move independently only if the shared contract is stronger than tribal memory. Otherwise, the architecture has merely moved coupling from request paths into ambiguity.
Closing takeaway
Do not approve a production stream because consumers exist. Approve it when the contract names meaning, versioning, ordering, retention, late data, and poison-event behavior.