Tools Are Capabilities, Not Functions

The first design mistake in many agent systems is easy to miss. A tool is described as a function: name, arguments, return value. That description is enough to call the tool, but it hides the more important question: what authority did we just give away?

The thesis

An agent tool is a delegated capability. Its signature matters less than its scope, permissions, side effects, and enforcement boundary.

If a model can call send_email, it does not merely have a function. It has access to an identity, a delivery channel, recipients, content, attachments, and the social authority of the sender. If it can call run_command, it may have access to the filesystem, environment variables, network, credentials, and any write path the process can reach. A short tool description does not reduce that authority.

Production design starts by asking what the tool allows, not how the function is shaped.

The production pattern

A team exposes a convenient tool to an internal agent: update_ticket(ticket_id, body). It looks harmless. The agent can keep work moving by updating issue descriptions and comments.

Later, the tool grows. It can add labels. Then it can close tickets. Then it can assign owners. Then it can trigger automations attached to status changes. The name still sounds like a function, but the capability has changed from "write text" to "alter workflow state." A model that once had permission to add a factual note now has permission to change ownership and close work.

The same pattern appears with repo tools. edit_file(path, content) sounds precise, but it is only precise if the path is scoped, the branch is controlled, the write is reviewed, and the execution identity cannot bypass repository policy. Otherwise the function is a general filesystem mutation API with a friendly name.

Shell tools make this more obvious. A command runner with a safe-looking prompt can still run from the wrong directory, read a secret from the environment, generate artifacts outside the workspace, or start a long-lived process. The tool signature is not the boundary. The process sandbox, working directory, allowed command prefixes, network policy, timeout, and write roots are the boundary.
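As a rough illustration of where part of that boundary can live, here is a minimal Python sketch of a command runner that checks an allowed-prefix list and pins the working directory, environment, and timeout. CommandPolicy, check_command, and run_command are hypothetical names, not part of any particular framework. Network policy and write roots are not covered here; those would need an OS-level sandbox.

```python
import shlex
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CommandPolicy:
    """Hypothetical policy object; the field names are illustrative."""
    workdir: Path
    allowed_prefixes: tuple[tuple[str, ...], ...]
    timeout_seconds: float
    env: dict[str, str]  # explicit environment; nothing is inherited


class CommandRejected(Exception):
    """Raised when a command does not match any allowed prefix."""


def check_command(policy: CommandPolicy, command: str) -> list[str]:
    """Return the parsed argv if it starts with an allowed prefix, else raise."""
    argv = shlex.split(command)
    for prefix in policy.allowed_prefixes:
        if tuple(argv[: len(prefix)]) == prefix:
            return argv
    raise CommandRejected(f"command not allowed: {argv[:2]}")


def run_command(policy: CommandPolicy, command: str) -> subprocess.CompletedProcess:
    argv = check_command(policy, command)
    # No shell, fixed working directory, explicit environment, hard timeout.
    return subprocess.run(
        argv,
        cwd=policy.workdir,
        env=policy.env,
        timeout=policy.timeout_seconds,
        capture_output=True,
        text=True,
        check=False,
    )


# Example policy: only "git status" and "git diff" are allowed.
policy = CommandPolicy(
    workdir=Path("."),
    allowed_prefixes=(("git", "status"), ("git", "diff")),
    timeout_seconds=10.0,
    env={"PATH": "/usr/bin:/bin"},
)
```

Because the command is parsed into an argument list and never handed to a shell, a string such as "git status; rm -rf /" does not match any prefix and is rejected before anything runs.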

The model

I describe every agent tool with a capability envelope.

The envelope names the subject, action, resource, scope, effect class, preconditions, approval requirements, and evidence requirements. The subject is the actor: this agent, on behalf of this user, in this workspace. The action is what can happen: read, propose, write, delete, send, deploy, merge, or execute. The resource is the concrete target: a file tree, repository, issue tracker project, calendar, queue, account, or service. The scope narrows it: branch, directory, label set, recipient domain, environment, time window, or maximum count.
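One way to make the envelope concrete is a small data structure. The sketch below is an assumption about shape, not a standard: Action, Subject, CapabilityEnvelope, and the example values are all illustrative names.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ = "read"
    PROPOSE = "propose"
    WRITE = "write"
    DELETE = "delete"
    SEND = "send"
    DEPLOY = "deploy"
    MERGE = "merge"
    EXECUTE = "execute"


@dataclass(frozen=True)
class Subject:
    agent_id: str       # this agent
    on_behalf_of: str   # this user
    workspace: str      # in this workspace


@dataclass(frozen=True)
class CapabilityEnvelope:
    subject: Subject
    action: Action
    resource: str                       # concrete target, e.g. a repository
    scope: dict[str, str]               # narrowing: branch, directory, ...
    effect_class: str
    preconditions: tuple[str, ...] = ()
    requires_approval: bool = False
    evidence: tuple[str, ...] = ()

    def permits(self, action: Action, resource: str) -> bool:
        """True only for the exact action and resource this envelope names."""
        return action is self.action and resource == self.resource


# Hypothetical example: a docs agent may write to one branch of one repo.
envelope = CapabilityEnvelope(
    subject=Subject("docs-agent", "alice", "docs-workspace"),
    action=Action.WRITE,
    resource="repo:docs-site",
    scope={"branch": "agent/docs", "directory": "docs/"},
    effect_class="reversible_write",
    preconditions=("blob hash unchanged",),
    evidence=("commit id", "diff"),
)
```

The permits check is deliberately strict: a write envelope says nothing about merge, and an envelope for one repository says nothing about another.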

The effect class matters. A read capability carries different risk than a write capability. A reversible write is different from an external send. A production deploy is different from a staging config change. A delete is different from a comment. The same code path can still require different policy depending on the effect it produces.
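A sketch of how one code path might map to different policy, assuming a hypothetical config-change tool. The effect classes, the policy fields, and the choice of which class needs approval are illustrative, not a recommendation for any specific system.

```python
from dataclasses import dataclass
from enum import Enum


class EffectClass(Enum):
    READ = "read"
    REVERSIBLE_WRITE = "reversible_write"
    EXTERNAL_SEND = "external_send"
    PRODUCTION_CHANGE = "production_change"


@dataclass(frozen=True)
class Policy:
    needs_approval: bool
    needs_precondition: bool
    needs_evidence: bool


# Illustrative policy table; a real system would tune these per resource.
POLICY_BY_EFFECT = {
    EffectClass.READ: Policy(False, False, False),
    EffectClass.REVERSIBLE_WRITE: Policy(False, True, True),
    EffectClass.EXTERNAL_SEND: Policy(True, False, True),
    EffectClass.PRODUCTION_CHANGE: Policy(True, True, True),
}


def classify_config_change(environment: str) -> EffectClass:
    """Same tool, same code path; the target decides the effect class.

    Anything that is not explicitly staging is treated as production,
    so an unknown environment fails closed.
    """
    if environment == "staging":
        return EffectClass.REVERSIBLE_WRITE
    return EffectClass.PRODUCTION_CHANGE


def policy_for_config_change(environment: str) -> Policy:
    return POLICY_BY_EFFECT[classify_config_change(environment)]
```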

Preconditions keep the capability attached to reality. "Write this file only if the current blob hash is still abc123" is stronger than "write this file." "Close this ticket only if it is still assigned to the requesting user and has no open blocker label" is stronger than "close ticket 42."
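The hash precondition can be sketched as a compare-then-write helper. This is a minimal illustration under stated assumptions: write_if_unchanged and PreconditionFailed are hypothetical names, SHA-256 of the file bytes stands in for whatever blob hash the real store uses, and a local check like this still leaves a small window between the read and the write that a server-side compare-and-swap would close.

```python
import hashlib
from pathlib import Path


class PreconditionFailed(Exception):
    """Raised when the resource changed since the agent last saw it."""


def blob_hash(data: bytes) -> str:
    # Stand-in for the store's own content hash.
    return hashlib.sha256(data).hexdigest()


def write_if_unchanged(path: Path, expected_hash: str, new_content: bytes) -> str:
    """Write new_content only if the file still matches expected_hash.

    Returns the hash of the new content so the caller has a version to cite.
    """
    actual = blob_hash(path.read_bytes())
    if actual != expected_hash:
        raise PreconditionFailed(
            f"{path} changed: expected {expected_hash[:8]}, found {actual[:8]}"
        )
    path.write_bytes(new_content)
    return blob_hash(new_content)
```

If another writer touched the file after the agent read it, the call fails instead of silently overwriting the newer content, and the agent has to re-read before trying again.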

Evidence requirements define how the system knows what happened. A successful tool call should return more than "ok." It should return a resource id, version, diff, receipt, job id, or verification handle. Without that, the agent can narrate success without giving the rest of the system anything to reconcile.
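As a sketch of what "more than ok" could look like, the example below returns a receipt with a resource id, a version, and a diff, and lets another component reconcile it later. The in-memory dict stands in for a real ticket store, and Receipt, update_ticket_body, and verify are hypothetical names.

```python
import difflib
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    tool: str
    resource_id: str
    version: str  # short content hash of the state after the write
    diff: str     # what actually changed


def _version(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def update_ticket_body(store: dict[str, str], ticket_id: str, body: str) -> Receipt:
    """Write the new body and return evidence of what changed."""
    before = store.get(ticket_id, "")
    store[ticket_id] = body
    diff = "\n".join(
        difflib.unified_diff(before.splitlines(), body.splitlines(), lineterm="")
    )
    return Receipt("update_ticket_body", ticket_id, _version(body), diff)


def verify(store: dict[str, str], receipt: Receipt) -> bool:
    """Check that the stored state still matches what the receipt claims."""
    current = store.get(receipt.resource_id)
    return current is not None and _version(current) == receipt.version
```

The receipt gives the rest of the system something to check independently of the agent's own account of what it did.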

Where this goes wrong

One failure mode is tool bundling. A tool called complete_task reads context, edits files, commits changes, pushes a branch, updates the ticket, and posts a summary. That may be convenient for a demo, but it erases all the boundaries that matter in production. Which part needed approval? Which part failed? Which part should be retried? Which part was allowed in a read-only session?

Another failure is relying on the model to obey the tool description. "Only use this for staging" is not an enforcement boundary. It is an instruction. The tool itself must reject production resources unless production is in scope. The caller should not be able to smuggle a production target through an argument just because the model was persuaded that it is safe.
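A minimal sketch of moving that rule out of the description and into the tool. The grant is bound when the tool is constructed for a session, so the model only supplies the target and the settings and cannot widen the scope through an argument. EnvironmentGrant, make_apply_config, and the returned dictionary are illustrative assumptions; the actual change is stubbed out.

```python
from dataclasses import dataclass
from typing import Callable


class OutOfScope(Exception):
    """Raised when a call targets a resource outside the session's grant."""


@dataclass(frozen=True)
class EnvironmentGrant:
    allowed: frozenset[str]


def make_apply_config(grant: EnvironmentGrant) -> Callable[[str, dict], dict]:
    """Build the tool the model sees; the grant is fixed by the session."""

    def apply_config(environment: str, settings: dict) -> dict:
        if environment not in grant.allowed:
            raise OutOfScope(f"environment {environment!r} is not in scope")
        # Stand-in for the real change; only reports what would be applied.
        return {"environment": environment, "applied_keys": sorted(settings)}

    return apply_config


# A staging-only session: production is rejected no matter what the model argues.
apply_config = make_apply_config(EnvironmentGrant(frozenset({"staging"})))
```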

Approval drift is a third failure. A human approves "update the docs page." The agent then discovers a generated index needs refreshing, a deployment needs running, and a notification should be sent. If the system treats the first approval as permission for the expanded plan, the tool boundary has moved without a new grant.
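One way to catch drift is to bind an approval to the specific steps the human saw and compare any later plan against it. The sketch below assumes a plan can be represented as (action, resource) pairs; Approval and unapproved_steps are hypothetical names.

```python
from dataclasses import dataclass

Step = tuple[str, str]  # (action, resource)


@dataclass(frozen=True)
class Approval:
    approved_by: str
    steps: frozenset[Step]


def unapproved_steps(approval: Approval, plan: list[Step]) -> list[Step]:
    """Return the steps in the plan that the approval does not cover."""
    return [step for step in plan if step not in approval.steps]


# The human approved exactly one write.
approval = Approval("alice", frozenset({("write", "docs/page.md")}))

# The agent's expanded plan adds a deploy and a notification.
plan = [
    ("write", "docs/page.md"),
    ("deploy", "docs-site"),
    ("send", "team-notifications"),
]
```

Any non-empty result means the plan has outgrown the grant and needs a new approval before the extra steps run.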

The counterpoint is that overly narrow tools can make agents clumsy. If every tiny operation requires a separate round trip, the planner spends more effort negotiating the tool surface than doing useful work. The answer is not thousands of microscopic functions. The answer is capabilities that are narrow around authority and broad enough around ordinary mechanics. A single write capability scoped to one directory can cover creating, renaming, and reformatting files there, while merging or deploying stays a separate grant.

What I do now

I name tools after the authority they grant. read_repo_file is different from write_owned_file. propose_pull_request is different from merge_pull_request. draft_email is different from send_email. The names should make the effect class hard to miss.

I split planning tools from effect tools. A planning tool can compute a diff, search issues, or summarize logs. An effect tool mutates shared state. The agent can use planning tools freely within context limits. Effect tools require capability checks, preconditions, and evidence.
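The split can be sketched as a registry that tags each tool and gates only the effect tools. This is an illustrative shape, not a specific framework's API: ToolKind, register, call_tool, and the in-memory files dict are all assumptions, and the grant check is reduced to a set of tool names.

```python
import difflib
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ToolKind(Enum):
    PLANNING = "planning"  # computes or reads; no shared state changes
    EFFECT = "effect"      # mutates shared state


@dataclass(frozen=True)
class Tool:
    name: str
    kind: ToolKind
    fn: Callable[..., object]


REGISTRY: dict[str, Tool] = {}


class NotGranted(Exception):
    """Raised when an effect tool is called without a matching grant."""


def register(name: str, kind: ToolKind):
    def wrap(fn):
        REGISTRY[name] = Tool(name, kind, fn)
        return fn
    return wrap


def call_tool(name: str, granted: set[str], **kwargs):
    tool = REGISTRY[name]
    # Planning tools run freely; effect tools need an explicit grant.
    if tool.kind is ToolKind.EFFECT and name not in granted:
        raise NotGranted(name)
    return tool.fn(**kwargs)


@register("compute_diff", ToolKind.PLANNING)
def compute_diff(old: str, new: str) -> list[str]:
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))


@register("write_owned_file", ToolKind.EFFECT)
def write_owned_file(files: dict[str, str], path: str, content: str) -> str:
    files[path] = content
    return path
```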

I avoid generic write surfaces unless the environment provides a real sandbox. If a tool accepts arbitrary paths, commands, URLs, or payloads, it needs external constraints. Allowed roots, command prefixes, resource allowlists, output limits, and deadlines are part of the tool contract.
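For the allowed-roots part of that contract, a minimal sketch using pathlib: resolve the requested path against the root and reject anything that lands outside it. resolve_within and PathOutsideRoot are hypothetical names, and Path.is_relative_to requires Python 3.9 or later. This check complements a sandbox; it does not replace one.

```python
from pathlib import Path


class PathOutsideRoot(Exception):
    """Raised when a requested path escapes the allowed root."""


def resolve_within(root: Path, requested: str) -> Path:
    """Return the absolute path for `requested`, or raise if it leaves `root`.

    Resolving first collapses ".." segments and follows existing symlinks,
    so the comparison is made on the real target rather than the spelling.
    """
    root_resolved = root.resolve()
    candidate = (root_resolved / requested).resolve()
    if not candidate.is_relative_to(root_resolved):
        raise PathOutsideRoot(f"{requested!r} is outside {root_resolved}")
    return candidate
```

An absolute path in the request replaces the root when joined, so it fails the same check as a "../" traversal does.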

I also review tools when product behavior changes. If closing a ticket starts triggering customer email, the capability changed even if the function signature did not. If a repo branch gains auto-deploy behavior, pushing to that branch becomes a deploy capability. Tool reviews need to follow side effects, not just code.

Closing takeaway

Do not ask, "Can the model call this function?" Ask, "What authority does this capability delegate, and where is that authority enforced?"