What Your SLA Actually Promises
An SLA is easy to write as a target and hard to operate as a promise. The gap appears when the service is technically up, users are blocked, and every dashboard can defend itself.
The thesis
An SLA is not an aspiration. It is a promise about failure shape, measurement, ownership, and cost.
Availability language often compresses too much. A service can respond while returning degraded data. It can meet an uptime target while a critical workflow fails for one segment of users. It can be healthy from the provider's view and unusable from the caller's view. It can recover quickly and still leave behind duplicated work or unprocessed requests.
The principal question is not "what number did we commit to?" It is "who bears the damage when the system partially fails?"
The production pattern
The problem often starts when a service boundary becomes important before its promise becomes precise. Internal users depend on it. Product workflows assume it. Dashboards track response codes and latency. A contract says the service should be available. Then a failure arrives with an inconvenient shape.
The service responds, but slowly enough that callers time out. One endpoint works while another fails. Writes succeed, but reads lag. The control plane works, but the data plane is impaired. A dependency returns cached success for actions that have not actually completed. A retry storm amplifies errors downstream. The measurement says the service mostly met its target. Users disagree because the promise they cared about was never measured.
This is how SLAs become arguments instead of contracts.
The model
I use four questions to make an SLA concrete: what fails, who measures, who owns, and who pays.
What fails: define the user-visible operation, not only the component. Is the promise about login, search, payment initiation, report generation, message delivery, administrative action, or background processing? Different operations have different failure shapes.
Who measures: decide where measurement happens. Provider-side metrics are necessary but incomplete. Caller-side success, synthetic checks, business events, and reconciliation outcomes may tell a different story. Both sides must agree to trust the measurement before the incident, not negotiate it during one.
Who owns: name the owner for partial failure. If the service is up but stale, who decides whether to pause callers? If one region is impaired, who communicates status? If old work is delayed, who clears it?
Who pays: define consequences. Cost can mean credits, engineering time, support load, manual repair, a degraded product experience, or a delayed roadmap. Someone pays when the promise is wrong, even if the contract is silent.
This model turns an SLA from a percentage into an operating agreement.
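One way to keep the four questions honest is to write each promise down as a record and check which questions it leaves blank. The following is a minimal sketch, not a standard format: the SlaPromise class, its field names, and the sample values are all hypothetical and chosen only to mirror the four questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaPromise:
    """One promise, written as answers to the four questions (hypothetical shape)."""
    operation: str               # what fails: the user-visible operation
    authoritative_signal: str    # who measures: the signal that drives action
    supporting_signals: tuple    # who measures: other signals kept for context
    partial_failure_owner: str   # who owns: the decision maker during partial failure
    consequences: tuple          # who pays: credits, repair work, support load, ...

    def unanswered(self):
        """Return the questions this promise leaves without an answer."""
        checks = [
            ("what fails", bool(self.operation.strip())),
            ("who measures", bool(self.authoritative_signal.strip())),
            ("who owns", bool(self.partial_failure_owner.strip())),
            ("who pays", len(self.consequences) > 0),
        ]
        return [question for question, answered in checks if not answered]


# Illustrative draft: measurement is defined, ownership and cost are not.
draft = SlaPromise(
    operation="payment initiation",
    authoritative_signal="caller-side success rate",
    supporting_signals=("provider status codes", "synthetic checks"),
    partial_failure_owner="",
    consequences=(),
)
print(draft.unanswered())  # ['who owns', 'who pays']
```

A draft that still returns a non-empty list is a target, not yet an operating agreement.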
Where this goes wrong
The first mistake is equating uptime with usefulness. A dependency that returns responses while violating freshness, correctness, or completion semantics may satisfy a narrow uptime metric and still break the product.
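The difference between uptime and usefulness can be made visible by scoring the same probe results two ways. This is a sketch under stated assumptions: the Probe record, its three flags, and the sample counts are invented for illustration and do not come from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Result of one check against the dependency (hypothetical shape)."""
    responded: bool  # a non-error response came back
    fresh: bool      # the data was within the agreed freshness window
    complete: bool   # the requested action actually finished


def uptime_ratio(probes):
    """Share of probes that got any response: the narrow uptime view."""
    if not probes:
        raise ValueError("no probes to score")
    return sum(p.responded for p in probes) / len(probes)


def useful_ratio(probes):
    """Share of probes that responded with fresh, completed results."""
    if not probes:
        raise ValueError("no probes to score")
    useful = sum(p.responded and p.fresh and p.complete for p in probes)
    return useful / len(probes)


# Illustrative sample: 90 good, 8 responded with stale data, 2 failed outright.
probes = (
    [Probe(True, True, True)] * 90
    + [Probe(True, False, True)] * 8
    + [Probe(False, False, False)] * 2
)
print(uptime_ratio(probes))  # 0.98
print(useful_ratio(probes))  # 0.9
```

Both numbers describe the same window. Which one the SLA names decides whether the stale responses count as failure.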
The second mistake is using averages for promises that users experience at the tail. A workflow with many service calls amplifies small error rates and slow paths. The user does not experience the average component. The user experiences the composed path.
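The arithmetic behind the composed path is short enough to run. The sketch below assumes every call in the workflow must succeed and that failures are independent, which real systems often violate; the function names and the sample rates are illustrative only.

```python
def composed_success(per_call_success, calls):
    """Probability that all calls succeed, assuming independent failures."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return per_call_success ** calls


def chance_of_slow_path(tail_fraction, calls):
    """Probability that at least one call lands in its slow tail."""
    if not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be between 0 and 1")
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return 1.0 - (1.0 - tail_fraction) ** calls


# Ten calls, each 99.9% successful: the workflow is closer to 99.0%.
print(f"{composed_success(0.999, 10):.4f}")    # 0.9900
# Ten calls, each slow 1% of the time: about 1 in 10 workflows hits a slow call.
print(f"{chance_of_slow_path(0.01, 10):.4f}")  # 0.0956
```

Under these assumptions, a component that looks fine at its own p99 can still make the workflow slow for roughly one user in ten.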
The third mistake is measuring what is easy rather than what is promised. Health checks, status codes, and host metrics are useful, but they rarely capture partial failure, stale data, permission drift, queue delay, or reconciliation backlog.
The counterpoint is that SLAs cannot describe every edge case. Overly detailed contracts become unreadable and create false precision. Some systems need simple internal targets to move quickly. The answer is not exhaustive legal language. The answer is a small set of promises tied to the operations that actually matter.
What I do now
I start with the workflow, not the service. What user or internal actor is depending on the boundary? What action do they need to complete? What does success mean after all asynchronous work settles? This keeps the promise attached to reality.
I separate availability, latency, freshness, correctness, and recovery. A service can satisfy one and violate another. If freshness matters, name it. If recovery matters, measure backlog drain and repair. If correctness matters, include reconciliation or domain checks.
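Keeping the dimensions separate is easier when each one has its own observed value and limit, so a pass on one cannot hide a miss on another. This is a minimal sketch: the Dimension record, the metric names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One independently promised property of the service (hypothetical shape)."""
    name: str
    observed: float
    limit: float
    higher_is_better: bool

    def met(self):
        """True when the observed value is on the right side of the limit."""
        if self.higher_is_better:
            return self.observed >= self.limit
        return self.observed <= self.limit


def violated(dimensions):
    """Names of the dimensions that missed their limit, in input order."""
    return [d.name for d in dimensions if not d.met()]


# Illustrative window: available and fast, but stale and slow to recover.
report = [
    Dimension("availability", 0.9995, 0.999, higher_is_better=True),
    Dimension("latency_p99_ms", 420.0, 500.0, higher_is_better=False),
    Dimension("freshness_lag_s", 900.0, 300.0, higher_is_better=False),
    Dimension("correctness_mismatch_rate", 0.0, 0.001, higher_is_better=False),
    Dimension("recovery_backlog_drain_min", 45.0, 30.0, higher_is_better=False),
]
print(violated(report))  # ['freshness_lag_s', 'recovery_backlog_drain_min']
```

A single availability number would report this window as healthy.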
I ask for both provider-side and consumer-side signals. Provider dashboards explain capacity and health. Consumer signals explain whether the promise is experienced. When these disagree, the SLA should tell people which signal drives action.
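The rule for disagreeing signals can be written down ahead of time. The sketch below assumes the agreement names the consumer-side signal as authoritative by default; the Action names, the decide function, and the thresholds are hypothetical, and a real agreement might choose differently.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    INVESTIGATE = "investigate the measurement gap"
    DECLARE_INCIDENT = "declare an incident"


def decide(provider_success, consumer_success, target, authoritative="consumer"):
    """Pick an action from two success rates and the agreed authoritative signal."""
    if authoritative not in ("consumer", "provider"):
        raise ValueError("authoritative must be 'consumer' or 'provider'")
    provider_ok = provider_success >= target
    consumer_ok = consumer_success >= target
    if provider_ok and consumer_ok:
        return Action.NONE
    if not provider_ok and not consumer_ok:
        return Action.DECLARE_INCIDENT
    # The signals disagree: the authoritative one decides whether this is an incident.
    authoritative_ok = consumer_ok if authoritative == "consumer" else provider_ok
    return Action.INVESTIGATE if authoritative_ok else Action.DECLARE_INCIDENT


# Provider dashboards look healthy, callers are failing: the consumer signal wins.
print(decide(0.9995, 0.97, target=0.999).value)  # declare an incident
# Provider metrics look bad, callers are fine: worth a look, not an incident.
print(decide(0.97, 0.9995, target=0.999).value)  # investigate the measurement gap
```

The point of the default is that nobody has to argue about it while users are blocked.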
I also inspect ownership before accepting a promise. An SLA without incident ownership is a statement of hope. Someone must own communication, mitigation, repair, and post-failure cleanup.
Finally, I prefer promises with explicit exclusions and degraded modes. If batch work can lag during dependency recovery, say so. If a read model can be stale, show its freshness. If a fallback returns partial results, label them. Honest degradation protects trust better than hidden failure.
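Labeling degradation can be as simple as making the fallback carry its own freshness and completeness. This is a sketch only: the LabeledResult type, its fields, and the sample sources are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LabeledResult:
    """A response that states how fresh and how complete it is (hypothetical shape)."""
    items: tuple
    as_of: datetime              # when the underlying data was last known current
    partial: bool = False
    missing_sources: tuple = ()

    def describe(self, now, max_age):
        """Summarize degradation for the caller instead of hiding it."""
        notes = []
        age = now - self.as_of
        if age > max_age:
            overdue = int((age - max_age).total_seconds())
            notes.append(f"stale by {overdue}s")
        if self.partial:
            notes.append("partial: missing " + ", ".join(self.missing_sources))
        return "; ".join(notes) or "complete and fresh"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fallback = LabeledResult(
    items=("invoice-1", "invoice-2"),
    as_of=now - timedelta(seconds=90),
    partial=True,
    missing_sources=("billing",),
)
print(fallback.describe(now, max_age=timedelta(seconds=60)))
# stale by 30s; partial: missing billing
```

The caller can then decide whether a stale, partial answer is good enough for its workflow, which is a decision the provider cannot make on its behalf.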
Closing takeaway
An SLA should tell the organization what user operation is protected, how failure is measured, who acts during partial failure, and who pays when the promise is wrong.