# Keep tool-calling agents on a short leash

> The tool surface, eval loop, and refusal rules I want before an agent can touch a real repo.

When we started, the loudest voices in the room wanted the architecture diagram to look impressive. Kafka, a service mesh, CQRS, a half-dozen microservices before we had a single paying customer. I made an unpopular call: we would ship the most boring thing that could possibly work, and we would not add a component until product evidence made it earn its place.

Four years later that "boring" stack was still the core of the platform — every order, every ledger write, every background job. This is the part nobody tells you up front: **restraint is an architecture**, and it is far harder to hold than it is to add a queue.

That same lesson applies to agents. The impressive demo is easy: give the model a terminal, a browser, a repo, and a vague goal. The useful system is much smaller. It knows what it is allowed to touch, what it must ask before doing, and how to prove that the answer got better instead of louder.

The agent I trust is not autonomous in the science-fiction sense. It is autonomous in the boring CI sense: scoped, logged, retryable, and easy to stop.

There is another part that matters more every year: the world around the agent has to be
code-defined. Repo rules, Terraform modules, Ansible playbooks, CI gates, generated SDKs, and
runbooks are all instruction surfaces. A web console setting is not.

## Shrink the tool surface first

The first security decision is not the prompt. It is the list of tools.

If an agent is reviewing a pull request, it does not need write access. If it is fixing a test, it does not need your deployment secrets. If it is updating dependencies, it should not edit the fetch client, the auth fallback, or anything else you already know is outside the blast radius.

I keep the surface small enough that the tool list itself reads like an API contract:

```md title="agent-tools.md"
allowed:

- read files
- search with rg
- run bun test / lint / typecheck
- apply patch inside claimed files

ask first:

- delete files
- change migrations
- edit auth, billing, or transport clients
- install packages

never:

- print secrets
- rewrite generated clients
- mask a failing check
```

This is not there to make the agent polite. It is there to make the behavior inspectable. When something goes wrong, I want to know whether the contract was too wide, the instruction was ambiguous, or the model ignored the boundary.

## Make the environment inspectable

The best prompt still fails when the system state lives in places the agent cannot inspect.

I would rather ask an agent to change a Terraform module than click through GitHub settings. I would
rather ask it to update an Ansible role than describe a Mac setup from memory. Code-defined
operations give the agent something concrete to read, patch, plan, test, and explain.

```mermaid
flowchart LR
  A[repo rules] --> E[agent]
  B[terraform desired state] --> E
  C[ansible roles] --> E
  D[ci gates + tests] --> E
  E --> F[small patch]
  F --> G[plan/check]
  G --> H[human review]
```

<Decision title="Code is the agent's operating surface">
  If a setting matters enough for an agent or teammate to change it safely, it should live in a
  place they can inspect before acting.
</Decision>

## Let the repo reject bad work

Agents produce a lot of text. Repos produce facts.

The useful loop is simple: inspect, patch, run the repo's own checks, read the result, patch again. I do not want a "looks good" summary before the tests run. I want the exact command, the exact result, and a patch that makes the behavior pass for the right reason.

The best agents I have used are almost boring here. They do not celebrate. They run `bun test`, `bun run lint`, `bun run typecheck`, and `bun run build`, then they report what changed. If they cannot run a gate, they say why. That habit matters more than clever prompt wording.

```mermaid
flowchart LR
  A[scope] --> B[read]
  B --> C[patch]
  C --> D[run checks]
  D -->|fail| B
  D -->|pass| E[explain diff]
  E --> F[human review]
```

<Principle title="Repo facts beat agent confidence">
  The agent can propose a change, but the repository decides whether the change is real. A summary
  without a command, a failing check, or a inspected diff is just another model output.
</Principle>

## Evals are the harness, not the ceremony

For code agents, I like small evals that look like real work:

- take a failing test and fix only the bug;
- update a generated contract without adding legacy aliases;
- remove hollow tests without deleting meaningful coverage;
- migrate one route pattern and prove every caller moved with it.

Each eval should have a cheap oracle. Did the test fail before and pass after? Did the diff avoid forbidden files? Did it add a compatibility shim when the instruction said "make a clean cut"? If the answer needs a meeting, the eval is too vague.

The transcript is part of the artifact. I want to see where the agent searched, which files it ignored, and what it chose not to change. Silence is where expensive mistakes hide.

<Tradeoff title="Autonomy costs review surface">
  More tool access can make the agent look faster in a demo, but it also creates more paths a human
  has to audit after the fact. I would rather give an agent a smaller job and a cleaner proof trail.
</Tradeoff>

## The leash is what makes it useful

The point is not to make the model small. The point is to make the work bounded enough that the
model can be useful without becoming a new production surface with no operating contract.

Autonomy without a leash is a demo. Autonomy with a tight tool surface, code-defined surroundings, a
real verification loop, and a boring audit trail is an engineering tool. That is the version I will
let near a repo.