Back to archive

Projects & Notes

The Cost Model Every AI Product Needs

A unit-economics lens for token cost, latency, quality, and human review in AI products.

The Cost Model Every AI Product Needs

Most AI product conversations start with capability. Can the system answer, summarize, classify, search, write, route, or act? That is the right first question for a prototype and the wrong only question for a product.

The product starts becoming real when every successful action has a visible cost shape. Tokens are only one line item. Latency, retries, context assembly, human review, evals, logging, abuse controls, and user trust all become part of the unit economics.

The thesis

An AI feature without a cost model is not merely financially incomplete. It is architecturally incomplete.

If the team cannot say what one useful unit of work costs, what makes it vary, and what quality bar justifies that cost, the system will drift toward either waste or disappointment. The principal-engineer job is to turn "the model is expensive" into a design surface the team can reason about.

The production pattern

The pattern I keep seeing is a feature that works in a demo because the load, context, users, and failure cases are polite. Then real use arrives.

Inputs get longer. Users ask follow-up questions. Retrieval pulls too much context. The product retries after ambiguous failures. A background job fans out across many records. Human review becomes the hidden queue. Support asks for auditability. Finance asks why the cost curve is not proportional to revenue.

None of these are exotic failures. They are normal product pressure.

The model

I like to model an AI feature as five ledgers:

  1. Inference ledger: model calls, input tokens, output tokens, retries, tool calls, streaming behavior, and model tier.
  2. Context ledger: retrieval, ranking, embedding, cache misses, serialization, prompt assembly, and stale-context repair.
  3. Quality ledger: eval runs, sampling, red-team cases, reviewer time, escalation paths, and post-release monitoring.
  4. Latency ledger: queue time, network time, model time, tool time, user wait tolerance, and timeout recovery.
  5. Trust ledger: citations, audit logs, undo paths, permission checks, data retention, and user-visible uncertainty.

The useful unit depends on the product. It might be one resolved request, one reviewed document, one accepted suggestion, one completed workflow, or one automated decision that survives audit. The unit must be tied to value, not merely to an API call.

For each unit, I want the team to know:

  • What is the average path?
  • What is the expensive path?
  • What input or user behavior creates the expensive path?
  • Which cost is capped by architecture and which cost is capped only by hope?
  • Which quality failures increase cost through review, support, or rework?
  • What would make this unit unprofitable or operationally embarrassing?

That last question is intentionally blunt. It forces the design discussion out of benchmark theater and into product reality.

Where this goes wrong

The common mistake is treating token cost as the cost model. Token cost matters, but it can be the easiest part to optimize and the least important part to understand.

A cheaper model that produces uncertain output may increase human review. Aggressive context trimming may lower inference cost while raising support cost. Caching can reduce spend while returning stale or permission-leaking answers if invalidation is not designed carefully. A slow but accurate path may be acceptable for batch work and fatal in an interactive workflow.

The opposite mistake is overbuilding controls before the product has evidence of use. Some early products do not need elaborate accounting infrastructure. They need coarse limits, sampled traces, and a weekly review of expensive requests. Precision should follow materiality.

What I do now

Before approving a serious AI feature design, I ask for a one-page cost model:

  • The user-visible unit of work.
  • The expected model and tool-call path.
  • The expensive path and why it happens.
  • The quality gate that decides whether the work was useful.
  • The latency budget and recovery behavior.
  • The human review or support burden.
  • The first three controls the team will add when usage grows.

I also ask the team to name one decision they would change if cost doubled and one decision they would change if latency doubled. If the answer is "nothing," the design probably has hidden assumptions.

At principal level, the goal is not to make every feature cheap. Some features deserve expensive reasoning. The goal is to make cost intentional, inspectable, and connected to value.

The operating review

The review I want is not a spreadsheet detached from architecture. I want a live design artifact that changes implementation choices.

Start by separating fixed cost, variable cost, and failure cost. Fixed cost includes evaluation sets, observability, prompt maintenance, policy work, and operational dashboards. Variable cost includes inference, retrieval, tool calls, storage, review minutes, and support load. Failure cost includes rework, refunds, escalations, lost trust, and cleanup after bad automation.

Then identify the control point for each cost. Some costs can be capped with quotas, smaller context windows, model routing, or batching. Some can only be influenced by product design: narrower user intent, clearer scope selection, better defaults, or fewer ambiguous states. Some costs require business choices, such as whether a premium workflow deserves a slower and more expensive verification path.

The riskiest design is the one where the expensive path is also the invisible path. A user asks a broad question, retrieval expands, the model retries, review escalates, and nobody sees the combined unit until the bill or queue arrives. I prefer to make expensive paths named product modes. If deep analysis costs more latency and more review, call it a deeper pass and design it as one.

Closing takeaway

Do not ask what the model call costs. Ask what a useful, trusted, recoverable unit of work costs, and design around that number.