# LLMs in production: tools, retrieval, evals

> Where each pattern earns its place, where it does not, and how I keep cost, latency, and trust honest.

The demo version of an LLM feature is a text box and a confident answer. The production version is a budget, a fallback, a pile of ugly inputs, and a dashboard that tells you when the model quietly got worse.

I like LLMs in production when the job is narrow enough to inspect. Tool-calling, RAG, and evals are useful, but they are not a strategy by themselves. Each one is a piece you earn by first naming the quality bar it has to meet.

## Tool-calling needs a boring permission model

The model should not "use tools." It should use a small set of typed operations with boring names and obvious blast radius.

Good tool surfaces read like internal APIs:

```ts title="tools.ts"
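// Every tool declares its blast radius up front.
const tools = {
  searchDocs: { readonly: true },
  inspectOrder: { readonly: true },
  draftReply: { readonly: true }, // drafts text only, sends nothing
  createTicket: { requiresApproval: true }, // side effect: needs sign-off before it runs
} as const;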
```

The model can propose a ticket. A human or a separate policy step approves the side effect. That line keeps "assistant" from becoming "unreviewed production actor."
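A minimal sketch of that line, assuming the model's output has already been parsed into a tool name plus input. The names `ToolPolicy`, `ToolCall`, `GateDecision`, and `gate` are hypothetical, not part of any framework. The gate never runs anything itself: it only says whether a proposed call may execute, must wait for approval, or gets dropped.

```ts
// Hypothetical policy shape mirroring the flags in tools.ts.
type ToolPolicy = { readonly?: boolean; requiresApproval?: boolean };

const policies: Record<string, ToolPolicy> = {
  searchDocs: { readonly: true },
  inspectOrder: { readonly: true },
  draftReply: { readonly: true },
  createTicket: { requiresApproval: true },
};

type ToolCall = { tool: string; input: unknown };

type GateDecision =
  | { kind: "execute"; call: ToolCall }
  | { kind: "needs_approval"; call: ToolCall }
  | { kind: "reject"; call: ToolCall; reason: string };

// Decide what happens to a call the model proposed. Nothing is executed here.
function gate(call: ToolCall): GateDecision {
  // Own-property check so names like "toString" never resolve to a policy.
  const known = Object.prototype.hasOwnProperty.call(policies, call.tool);
  const policy = known ? policies[call.tool] : undefined;
  if (!policy) {
    return { kind: "reject", call, reason: `unknown tool: ${call.tool}` };
  }
  if (policy.requiresApproval) {
    return { kind: "needs_approval", call };
  }
  if (policy.readonly) {
    return { kind: "execute", call };
  }
  // A tool with no flag at all is treated as unsafe by default.
  return { kind: "reject", call, reason: `no permission flag on ${call.tool}` };
}
```

Rejecting unknown and unflagged tools by default is a design choice in this sketch: a tool has to opt in to being callable, not opt out.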

I also want every tool call logged with the input shape, output shape, latency, cost, and decision path. Not because logs are fun. Because the first serious bug will be a question of "what did it know, and when did it know it?"
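One way to get that record, sketched under a few assumptions: `callWithLog`, `shapeOf`, and `ToolLogEntry` are hypothetical names, the cost is supplied by the caller, and the tool is synchronous to keep the example short (a real tool would usually be async and awaited). Logging the shape of a value instead of the value itself keeps user data out of the log while still answering "what did it know?"

```ts
// Summarize structure without recording raw values.
function shapeOf(value: unknown): unknown {
  if (value === null) return "null";
  if (Array.isArray(value)) return value.length > 0 ? [shapeOf(value[0])] : [];
  if (typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(obj)) out[key] = shapeOf(obj[key]);
    return out;
  }
  return typeof value;
}

type ToolLogEntry = {
  tool: string;
  inputShape: unknown;
  outputShape: unknown;
  latencyMs: number;
  costUsd: number;
  decisionPath: string[]; // why the call was made, in order
  ok: boolean;
};

type LogOptions = {
  decisionPath: string[];
  costUsd: number;
  log: (entry: ToolLogEntry) => void;
  now?: () => number; // injectable clock, mainly for tests
};

// Run a tool and emit exactly one log entry, whether it returns or throws.
function callWithLog<I, O>(tool: string, input: I, run: (input: I) => O, opts: LogOptions): O {
  const now = opts.now ?? (() => Date.now());
  const start = now();
  let output: O | undefined;
  let ok = false;
  try {
    output = run(input);
    ok = true;
    return output;
  } finally {
    opts.log({
      tool,
      inputShape: shapeOf(input),
      outputShape: shapeOf(output),
      latencyMs: now() - start,
      costUsd: opts.costUsd,
      decisionPath: opts.decisionPath,
      ok,
    });
  }
}
```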

<Decision title="Treat tools like production APIs">
  A tool call needs a typed input, a permission model, logs, a latency budget, and an owner. If that
  sounds heavy, the model probably should not be allowed to call it yet.
</Decision>

## RAG is retrieval engineering

RAG does not fix bad knowledge. It exposes it.

The hard work is outside the model:

- chunking documents around the way humans ask questions;
- keeping source metadata attached until the final answer;
- deleting stale content instead of embedding it forever;
- testing retrieval misses, not only happy-path answers;
- showing citations when the answer depends on a specific document.
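A rough sketch of two items from that list: metadata that travels with every chunk, and citations built from it. The `Chunk` fields, `selectEvidence`, and `citationLines` are hypothetical names, and the score and age thresholds are placeholders you would tune against your own corpus. The query-time age filter here is only a backstop; it does not replace deleting stale content at the source.

```ts
type Chunk = {
  text: string;
  sourceId: string;
  title: string;
  updatedAt: string; // ISO 8601 date
  score: number; // retriever similarity, higher is better
};

type EvidenceOptions = { now: Date; maxAgeDays: number; minScore: number };

const DAY_MS = 24 * 60 * 60 * 1000;

// Keep only hits that are both relevant enough and fresh enough, best first.
function selectEvidence(hits: Chunk[], opts: EvidenceOptions): Chunk[] {
  const cutoff = opts.now.getTime() - opts.maxAgeDays * DAY_MS;
  return hits
    .filter((h) => h.score >= opts.minScore && Date.parse(h.updatedAt) >= cutoff)
    .sort((a, b) => b.score - a.score);
}

// One citation line per source document, in evidence order.
function citationLines(evidence: Chunk[]): string[] {
  const seen: Record<string, boolean> = {};
  const lines: string[] = [];
  for (const chunk of evidence) {
    if (seen[chunk.sourceId]) continue;
    seen[chunk.sourceId] = true;
    lines.push(`${chunk.title} (${chunk.sourceId}, updated ${chunk.updatedAt})`);
  }
  return lines;
}
```

An empty result from `selectEvidence` is a retrieval miss, and it is worth a test case of its own: the interesting question is what the product does next, not what the model would have said.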

The cheapest RAG improvement is often not a better model. It is a smaller corpus with less junk in it.

<Principle title="Retrieval quality is product quality">
  If the corpus is stale, duplicated, or missing source metadata, the model will turn that mess into
  a confident answer. Clean retrieval beats a larger prompt most of the time.
</Principle>

## Evals should look like real support work

Most AI evals start too clean. Production inputs are not clean.

I want eval cases that include partial data, outdated docs, ambiguous user intent, similar entities, missing permissions, slow tools, and prompts that try to drag the model outside the product boundary. A model that passes a friendly benchmark can still fail the exact Tuesday afternoon support question that made you build the feature.

The eval should record:

- expected tool calls;
- forbidden tool calls;
- required citations;
- acceptable answer shape;
- latency and cost budget;
- whether fallback is better than guessing.
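Those fields map fairly directly onto a record plus a checker. This is a sketch, not a framework: `EvalCase`, `RunResult`, `checkRun`, and the sample case are hypothetical, and it assumes the harness can already observe which tools ran, which sources were cited, and what the run cost.

```ts
type AnswerShape = "answer" | "clarifying_question" | "fallback";

type EvalCase = {
  name: string;
  input: string;
  expectedTools: string[];
  forbiddenTools: string[];
  requiredCitations: string[];
  acceptableShapes: AnswerShape[];
  maxLatencyMs: number;
  maxCostUsd: number;
  fallbackBeatsGuess: boolean; // true when declining is an acceptable outcome
};

type RunResult = {
  toolsCalled: string[];
  citations: string[];
  shape: AnswerShape;
  latencyMs: number;
  costUsd: number;
};

const has = (list: string[], item: string): boolean => list.indexOf(item) !== -1;

// Return every failed expectation; an empty array means the run passed.
function checkRun(c: EvalCase, r: RunResult): string[] {
  const failures: string[] = [];
  const fellBack = c.fallbackBeatsGuess && r.shape === "fallback";
  for (const t of c.expectedTools) {
    if (!has(r.toolsCalled, t)) failures.push(`missing tool call: ${t}`);
  }
  for (const t of c.forbiddenTools) {
    if (has(r.toolsCalled, t)) failures.push(`forbidden tool call: ${t}`);
  }
  if (!fellBack) {
    // A sanctioned fallback has nothing to cite and no answer shape to match.
    for (const s of c.requiredCitations) {
      if (!has(r.citations, s)) failures.push(`missing citation: ${s}`);
    }
    if (!has(c.acceptableShapes, r.shape)) failures.push(`unexpected answer shape: ${r.shape}`);
  }
  if (r.latencyMs > c.maxLatencyMs) failures.push(`latency over budget: ${r.latencyMs}ms`);
  if (r.costUsd > c.maxCostUsd) failures.push(`cost over budget: $${r.costUsd}`);
  return failures;
}

// Illustrative case only; the IDs and budgets are made up.
const staleRefundDocs: EvalCase = {
  name: "refund question with an outdated doc in the corpus",
  input: "Can I still get a refund after 45 days?",
  expectedTools: ["searchDocs"],
  forbiddenTools: ["createTicket"],
  requiredCitations: ["refund-policy-2024"],
  acceptableShapes: ["answer"],
  maxLatencyMs: 4000,
  maxCostUsd: 0.02,
  fallbackBeatsGuess: true,
};
```

Returning a list of failures instead of a single pass/fail flag is deliberate here: when a case regresses, the first thing I want to know is which expectation broke.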

```mermaid
flowchart TD
  A[user question] --> B[retrieve]
  B --> C{enough evidence?}
  C -->|yes| D[draft with citations]
  C -->|no| E[ask or fallback]
  D --> F[evaluate answer]
  E --> F
```

## Keep the boring fallback

Every LLM path needs a non-LLM path. A queue, a form, a rule-based answer, a human review lane, a link to the raw record. Something.

This is less exciting than "AI-native." It is also what lets you ship. The system can be useful before it is perfect because uncertainty has a product path instead of a dead end.
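In code, "a product path instead of a dead end" can be as small as a routing function that runs before any drafting happens. The sketch below follows the flowchart's "enough evidence?" branch; `Route`, `routeQuestion`, the lane names, and the thresholds are all hypothetical.

```ts
type Evidence = { sourceId: string; score: number };

type Route =
  | { kind: "draft"; cite: string[] }
  | { kind: "ask"; question: string }
  | { kind: "fallback"; lane: "human_review" | "form" | "raw_record" };

type RouteOptions = {
  minScore: number; // below this, a hit does not count as evidence
  minSources: number; // how many strong hits make "enough"
  ambiguous: boolean; // set upstream when the user's intent is unclear
};

// Pick a path for the question. Every branch ends somewhere useful.
function routeQuestion(evidence: Evidence[], opts: RouteOptions): Route {
  const strong = evidence.filter((e) => e.score >= opts.minScore);
  if (strong.length >= opts.minSources) {
    return { kind: "draft", cite: strong.map((e) => e.sourceId) };
  }
  if (opts.ambiguous) {
    return { kind: "ask", question: "Which order or product is this about?" };
  }
  // Not enough evidence and nothing to clarify: hand off without guessing.
  return { kind: "fallback", lane: "human_review" };
}
```

Because `Route` is a discriminated union, the caller has to handle all three outcomes explicitly, which makes it harder to ship a path where the model quietly answers without evidence.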

The model earns its place when it reduces work without hiding uncertainty. If it cannot cite, cannot ask, cannot decline, or cannot fall back, it is not ready for production. It is still a demo with nicer styling.