The Model Should Not Be the Policy Engine

Models are good at interpreting messy requests. They can summarize intent, classify risk, notice missing context, and explain why a rule may apply. That does not make them the right place to enforce policy.

The thesis

The model can advise on policy, but policy decisions that gate real tools should be enforced outside the model.

This is especially important when agents have side effects. A prompt can say, "Never deploy to production without approval." A model can repeat that rule. It can even refuse most unsafe requests. But if the deployment tool accepts the call anyway, the real policy boundary is the model's next token. That is not a boundary I want protecting production.

Policy needs a deterministic home with explicit inputs, auditable decisions, and authority over tool execution.

The production pattern

Take an agent connected to a repository, an issue tracker, a messaging system, and a deployment tool. The system prompt lists the rules: do not write outside assigned files, do not message external users, do not deploy on Fridays, require approval for production, refuse secrets, and stop on failed verification.

This works in demos because the model usually follows the rules. It fails in production because the rules collide with task pressure and ambiguous context. The user says the deploy is urgent. The runbook says Friday freezes have exceptions. A previous summary says approval was granted. A tool description says production deploys are allowed for hotfixes. The model now has to arbitrate policy from a pile of language.

Even when the model makes the right call, the system cannot easily prove why. Which rule applied? Which principal had authority? Which resource was in scope? Was the approval current? Was the action denied or merely discouraged?

Those questions need policy decisions, not policy-flavored prose.

The policy model

I separate policy into subject, action, resource, context, and decision.

The subject is who or what is acting: agent id, user id, workspace, role, and delegated identity. The action is the requested operation: read, write, send, merge, deploy, delete, execute. The resource is the concrete target: file path, branch, ticket, account, environment, recipient, command, or service. The context includes time, approval id, proposal id, resource version, data classification, incident state, and risk tier.

The policy engine returns allow, deny, require approval, require narrower scope, or require verification. It should also return reasons and obligations. An obligation might be "approval must come from on-call," "write only allowed on branch agent/*," "external email must use approved draft id," or "command must run with network disabled."
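The shape described above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine: the names Request, Decision, Effect, and evaluate are hypothetical, and the two rules inside evaluate are toy examples built from the obligations quoted above.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatch


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    REQUIRE_NARROWER_SCOPE = "require_narrower_scope"
    REQUIRE_VERIFICATION = "require_verification"


@dataclass
class Request:
    subject: dict   # who is acting, e.g. agent id, user id, role
    action: str     # requested operation, e.g. "write", "deploy"
    resource: dict  # concrete target, e.g. branch, path, environment
    context: dict   # e.g. approval id, risk tier, incident state


@dataclass(frozen=True)
class Decision:
    effect: Effect
    reasons: tuple = ()
    obligations: tuple = ()


def evaluate(req: Request) -> Decision:
    """Toy rules for illustration only; a real rule set is an assumption left open."""
    if req.action == "deploy" and req.resource.get("environment") == "production":
        if not req.context.get("approval_id"):
            return Decision(
                Effect.REQUIRE_APPROVAL,
                reasons=("production deploy has no approval",),
                obligations=("approval must come from on-call",),
            )
        return Decision(Effect.ALLOW, reasons=("approval present",))
    if req.action == "write":
        if fnmatch(req.resource.get("branch", ""), "agent/*"):
            return Decision(Effect.ALLOW, reasons=("branch is agent-owned",))
        return Decision(
            Effect.REQUIRE_NARROWER_SCOPE,
            reasons=("branch outside agent/*",),
            obligations=("write only allowed on branch agent/*",),
        )
    # Default deny: anything no rule explicitly allows is refused.
    return Decision(Effect.DENY, reasons=("no rule allows this action",))
```

The point of the sketch is that every outcome carries a reason and, where relevant, an obligation, so the decision can be logged and shown to the agent without the agent having produced it.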

The model still has a role. It can map natural language into a proposed action. It can explain to the user why approval is needed. It can ask for missing information. It can suggest a safer alternative. But it does not get to decide that a denied action is allowed because the wording changed.

This policy model also keeps policy close to tools. A write tool should check policy at execution time, not only during planning. If the resource changed or approval expired, the tool should deny the call even if the model expected success.
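Here is a rough sketch of an execution-time check, assuming an approval records the resource version it was granted against and an expiry time. Approval, PolicyDenied, guarded_write, and the in-memory store are all hypothetical names for illustration.

```python
import time
from dataclasses import dataclass


class PolicyDenied(Exception):
    """Raised when the tool refuses a call at execution time."""


@dataclass
class Approval:
    resource_id: str
    resource_version: int  # version the approver actually reviewed
    expires_at: float      # Unix timestamp


def guarded_write(store, resource_id, new_content, approval, now=None):
    """Write only if the approval still matches the current state of the resource.

    `store` maps resource id to a (version, content) tuple.
    """
    now = time.time() if now is None else now
    current_version, _ = store[resource_id]
    if approval.resource_id != resource_id:
        raise PolicyDenied("approval is for a different resource")
    if now >= approval.expires_at:
        raise PolicyDenied("approval expired")
    if approval.resource_version != current_version:
        raise PolicyDenied("resource changed since approval")
    store[resource_id] = (current_version + 1, new_content)
    return current_version + 1
```

Because the write bumps the version, the same approval cannot be replayed for a second write: the second call fails the version check regardless of what the plan expected.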

Where this goes wrong

The first failure is prompt-only enforcement. The model is told not to do something, but the tool is fully capable of doing it. A prompt injection, stale summary, or confused plan can move the model across the line.

The second failure is model-mediated exceptions. The policy says production deploys need approval. The model reads an old note saying "approval granted for hotfixes" and decides this case qualifies. If exception logic matters, it belongs in policy with explicit inputs.

The third failure is silent policy bypass through broad tools. A shell tool can modify files, call network services, read secrets, and run deploy scripts. If policy is attached only to named high-level tools, the agent may reach the same effect through a lower-level capability.
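One way to narrow this gap is to classify a shell command into a policy action before it runs, and to refuse anything that cannot be classified. The sketch below is deliberately crude and rests on assumptions: the COMMAND_ACTIONS table is a hypothetical mapping, and rejecting every command that contains shell metacharacters is a blunt stand-in for real shell parsing.

```python
import shlex

# Hypothetical mapping from executable name to the policy action it implies.
COMMAND_ACTIONS = {
    "cat": "read",
    "ls": "read",
    "git": "write",
    "kubectl": "deploy",
    "curl": "network",
}

# Characters that can chain, redirect, or substitute into other effects.
SHELL_METACHARACTERS = ";|&><`$"


def classify_command(command: str) -> str:
    """Return the policy action for a shell command, or "unclassified"."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return "unclassified"
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. an unclosed quote
        return "unclassified"
    if not argv:
        return "unclassified"
    return COMMAND_ACTIONS.get(argv[0], "unclassified")


def shell_allowed(command: str, allowed_actions: set) -> bool:
    """Allow a command only if it maps to an action the caller may perform."""
    action = classify_command(command)
    return action != "unclassified" and action in allowed_actions
```

With this in place, an agent that is only allowed to read cannot reach a deploy by spelling it as a command, and a script such as ./deploy.sh is refused because nothing maps it to a known action.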

The fourth failure is no record of denial. If the model refuses in text, that refusal may never enter a policy log. Later, another run asks differently and succeeds. Denials should be structured events so repeated attempts are visible.
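A structured denial can be as small as a record with a subject, action, resource, and reason. The following sketch assumes an in-memory log; DenialEvent and DenialLog are hypothetical names, and a real system would write to durable storage instead.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DenialEvent:
    subject: str
    action: str
    resource: str
    reason: str
    timestamp: float


class DenialLog:
    def __init__(self):
        self.events = []

    def record(self, event: DenialEvent) -> str:
        """Store the event and return the JSON line that would be persisted."""
        self.events.append(event)
        return json.dumps(asdict(event), sort_keys=True)

    def repeated(self, threshold: int = 2) -> dict:
        """Return (subject, action, resource) keys denied at least `threshold` times."""
        counts = Counter((e.subject, e.action, e.resource) for e in self.events)
        return {key: n for key, n in counts.items() if n >= threshold}
```

Grouping by subject, action, and resource rather than by the wording of the request is what makes a rephrased retry show up as the same attempt.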

The counterpoint is that deterministic policy cannot understand every messy human request. That is true. The model is useful before the policy decision, where it turns messy intent into structured proposals. But once a proposal exists, enforcement should be explicit. If the proposal cannot be structured, that itself is a reason to stop.

What I do now

I put policy checks on the tool path. The write does not happen unless policy allows it or returns a specific approval requirement that has been satisfied. This applies even when the model has already asked politely in conversation: a polite request is not an approval.

I keep policy inputs small and inspectable. A policy engine should not need the whole prompt. It needs the subject, action, resource, proposal, approval, context references, and relevant classifications. If the rule depends on a fact, that fact should have provenance.
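Provenance can be made concrete by attaching a source to each fact and letting policy ignore facts from sources it does not trust. This is a sketch under assumptions: Fact, TRUSTED_SOURCES, and the source names are hypothetical, and the set of trusted systems would differ per deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    name: str
    value: object
    source: str  # where the fact came from, e.g. "approval-service"


# Hypothetical systems of record that policy is willing to believe.
TRUSTED_SOURCES = frozenset({"approval-service", "identity-provider", "scheduler"})


def trusted_facts(facts) -> dict:
    """Keep only facts whose provenance is a trusted system of record."""
    return {f.name: f.value for f in facts if f.source in TRUSTED_SOURCES}


def has_valid_approval(facts) -> bool:
    """An approval counts only if a trusted source asserted it."""
    return trusted_facts(facts).get("approval_granted") is True
```

Under this scheme, "approval granted" found in a conversation summary is simply not an input: the same claim only counts when the approval system asserts it.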

I make policy results visible to the agent but not controlled by it. The agent can read "denied because resource outside owned files" and propose a narrower action. It cannot override the denial by arguing.

I also test policy with adversarial plans. Try to write through a different tool. Try to reuse expired approval. Try to change the resource after approval. Try to downgrade a production action into a generic command. A policy boundary that only works for friendly plans is not a boundary.
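These adversarial cases fit naturally into a table-driven test. The sketch below uses a toy is_allowed function and plan dictionaries that are entirely hypothetical; in this toy policy the generic shell tool is never allowed, which is an assumption made to keep the example short.

```python
NOW = 1_000  # fixed clock so the test is deterministic


def is_allowed(plan: dict, now: int = NOW) -> bool:
    """Toy policy over a structured plan; stands in for the real engine."""
    if plan["tool"] == "shell":
        return False  # generic commands may not carry deploy effects here
    if plan["environment"] != "production":
        return True
    approval = plan.get("approval")
    if approval is None:
        return False
    if approval["expires_at"] <= now:
        return False
    # The approval must name exactly the resource being acted on.
    return approval["resource"] == plan["resource"]


FRIENDLY = {
    "tool": "deploy",
    "environment": "production",
    "resource": "svc-a@v2",
    "approval": {"resource": "svc-a@v2", "expires_at": 2_000},
}

ADVERSARIAL_PLANS = {
    "expired approval": {**FRIENDLY, "approval": {"resource": "svc-a@v2", "expires_at": 500}},
    "resource changed after approval": {**FRIENDLY, "resource": "svc-a@v3"},
    "missing approval": {**FRIENDLY, "approval": None},
    "downgraded to generic command": {
        "tool": "shell",
        "environment": "unspecified",
        "resource": "./deploy.sh",
    },
}


def wrongly_allowed(plans: dict, now: int = NOW) -> list:
    """Return the names of adversarial plans that the policy let through."""
    return [name for name, plan in plans.items() if is_allowed(plan, now)]
```

The friendly plan should pass and wrongly_allowed(ADVERSARIAL_PLANS) should come back empty; any name in that list is a hole in the boundary.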

Closing takeaway

Use the model to interpret intent and explain constraints. Use policy outside the model to decide what tools may actually do.