LLMs in production: tools, retrieval, evals

The demo version of an LLM feature is a text box and a confident answer. The production version is a budget, a fallback, a pile of ugly inputs, and a dashboard that tells you when the model quietly got worse.

I like LLMs in production when the job is narrow enough to inspect. Tool-calling, RAG, and evals are useful. They are not a strategy by themselves. They are pieces you earn by first naming the quality bar the feature has to meet.

Tool-calling needs a boring permission model

The model should not "use tools." It should use a small set of typed operations with boring names and obvious blast radius.

Good tool surfaces read like internal APIs:

tools.ts

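const tools = {
  // Read-only: the model may call these directly.
  searchDocs: { readonly: true },
  inspectOrder: { readonly: true },
  draftReply: { readonly: true },
  // Side effect: a human or policy step approves it first.
  createTicket: { requiresApproval: true },
} as const;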

The model can propose a ticket. A human or a separate policy step approves the side effect. That line keeps "assistant" from becoming "unreviewed production actor."
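A minimal sketch of that approval line, assuming a hypothetical `gateToolCall` helper and a `Decision` type that are not part of any real SDK. The point is only that a proposed call is classified before anything executes:

```typescript
// Hypothetical policy table; mirrors the tool list in tools.ts.
type ToolPolicy = { readonly: true } | { requiresApproval: true };

const toolPolicies = new Map<string, ToolPolicy>([
  ["searchDocs", { readonly: true }],
  ["inspectOrder", { readonly: true }],
  ["draftReply", { readonly: true }],
  ["createTicket", { requiresApproval: true }],
]);

type ToolCall = { tool: string; args: Record<string, unknown> };

type Decision =
  | { status: "run"; call: ToolCall }
  | { status: "pending_approval"; call: ToolCall }
  | { status: "rejected"; reason: string };

// Classify a model-proposed call. Nothing is executed here.
function gateToolCall(call: ToolCall, approved = false): Decision {
  const policy = toolPolicies.get(call.tool);
  if (policy === undefined) {
    // Unknown tools are rejected rather than guessed at.
    return { status: "rejected", reason: `unknown tool: ${call.tool}` };
  }
  if ("readonly" in policy) {
    return { status: "run", call };
  }
  // Side-effecting tools wait until a human or policy step says yes.
  return approved
    ? { status: "run", call }
    : { status: "pending_approval", call };
}
```

Calling `gateToolCall({ tool: "createTicket", args: {} })` yields a `pending_approval` decision; the same call with `approved` set to `true` is allowed to run.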

I also want every tool call logged with the input shape, output shape, latency, cost, and decision path. Not because logs are fun. Because the first serious bug will be a question of "what did it know, and when did it know it?"
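One way to capture those fields is a thin wrapper around each tool. This is a sketch under assumptions: `ToolCallLog`, `shapeOf`, and `loggedCall` are hypothetical names, the log is an in-memory array, and the tool is assumed to report its own cost.

```typescript
// Hypothetical log record; field names are illustrative.
interface ToolCallLog {
  tool: string;
  inputShape: string[]; // top-level keys only, not raw values
  outputShape: string[];
  latencyMs: number;
  costUsd: number;
  decisionPath: string[];
}

const toolLog: ToolCallLog[] = [];

// Sorted top-level keys of an object, or the primitive type name.
function shapeOf(value: unknown): string[] {
  return value !== null && typeof value === "object"
    ? Object.keys(value as object).sort()
    : [typeof value];
}

async function loggedCall<I, O>(
  tool: string,
  input: I,
  run: (input: I) => Promise<{ output: O; costUsd: number }>,
  decisionPath: string[],
): Promise<O> {
  const start = Date.now();
  const record = (outputShape: string[], costUsd: number, step: string) => {
    toolLog.push({
      tool,
      inputShape: shapeOf(input),
      outputShape,
      latencyMs: Date.now() - start,
      costUsd,
      decisionPath: [...decisionPath, step],
    });
  };
  try {
    const { output, costUsd } = await run(input);
    record(shapeOf(output), costUsd, "executed");
    return output;
  } catch (err) {
    // Failures are logged too, then rethrown for the caller to handle.
    record(["error"], 0, "failed");
    throw err;
  }
}
```

Logging shapes rather than raw payloads is a design choice in this sketch: it keeps the record useful for debugging without copying customer data into the log.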

RAG is retrieval engineering

RAG does not fix bad knowledge. It exposes it.

The hard work is outside the model:

chunking documents around the way humans ask questions;
keeping source metadata attached until the final answer;
deleting stale content instead of embedding it forever;
testing retrieval misses, not only happy-path answers;
showing citations when the answer depends on a specific document.
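Several of those items fit in a small sketch. Everything here is an assumption for illustration: the `Chunk` shape, the helper names, and especially the toy word-overlap scorer, which only stands in for whatever embedding search you actually run.

```typescript
// Hypothetical chunk shape: source metadata travels with the text.
interface Chunk {
  text: string;
  sourceId: string;
  title: string;
  updatedAt: string; // ISO date
}

// Drop content older than the cutoff before it is indexed.
function dropStale(chunks: Chunk[], cutoffIso: string): Chunk[] {
  const cutoff = Date.parse(cutoffIso);
  return chunks.filter((c) => Date.parse(c.updatedAt) >= cutoff);
}

// Toy scorer: counts query words that appear in the chunk text.
function score(query: string, text: string): number {
  const words = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  return query.toLowerCase().split(/\W+/).filter((w) => words.has(w)).length;
}

// Returns at most k chunks; an empty result is a retrieval miss.
function retrieve(query: string, chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, s: score(query, chunk.text) }))
    .filter((r) => r.s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((r) => r.chunk);
}

// Citations come from the retrieved chunks, never from the model's text.
function citationsFor(chunks: Chunk[]): string[] {
  return Array.from(new Set(chunks.map((c) => `${c.title} (${c.sourceId})`)));
}
```

An empty array from `retrieve` is worth its own test case: it is the signal that the answer should be a question or a fallback rather than a guess.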

The cheapest RAG improvement is often not a better model. It is a smaller corpus with less junk in it.

Evals should look like real support work

Most AI evals start too clean. Production inputs are not clean.

I want eval cases that include partial data, outdated docs, ambiguous user intent, similar entities, missing permissions, slow tools, and prompts that try to drag the model outside the product boundary. A model that passes a friendly benchmark can still fail the exact Tuesday afternoon support question that made you build the feature.

The eval should record:

expected tool calls;
forbidden tool calls;
required citations;
acceptable answer shape;
latency and cost budget;
whether fallback is better than guessing.
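That checklist maps fairly directly onto a record type and a checker. The sketch below assumes hypothetical `EvalCase` and `RunResult` shapes, and uses a regular expression as a crude stand-in for a real answer-shape check:

```typescript
// Hypothetical eval schema mirroring the checklist above.
interface EvalCase {
  name: string;
  expectedTools: string[];
  forbiddenTools: string[];
  requiredCitations: string[];
  answerShape: RegExp; // avoid the `g` flag: test() would become stateful
  maxLatencyMs: number;
  maxCostUsd: number;
  fallbackPreferred: boolean; // true when declining beats guessing
}

interface RunResult {
  toolsCalled: string[];
  citations: string[];
  answer: string;
  latencyMs: number;
  costUsd: number;
  usedFallback: boolean;
}

// Returns a list of failures; an empty list means the case passed.
function checkCase(c: EvalCase, r: RunResult): string[] {
  const failures: string[] = [];
  for (const t of c.forbiddenTools) {
    if (r.toolsCalled.includes(t)) failures.push(`forbidden tool called: ${t}`);
  }
  if (r.latencyMs > c.maxLatencyMs) failures.push("latency over budget");
  if (r.costUsd > c.maxCostUsd) failures.push("cost over budget");

  if (c.fallbackPreferred) {
    // When fallback is the right call, answer content is not graded.
    if (!r.usedFallback) failures.push("guessed where fallback was expected");
    return failures;
  }
  for (const t of c.expectedTools) {
    if (!r.toolsCalled.includes(t)) failures.push(`missing tool call: ${t}`);
  }
  for (const s of c.requiredCitations) {
    if (!r.citations.includes(s)) failures.push(`missing citation: ${s}`);
  }
  if (!c.answerShape.test(r.answer)) failures.push("answer shape mismatch");
  return failures;
}
```

Returning every failure instead of stopping at the first makes a regression easier to read: one run can show that the model both skipped a citation and blew the latency budget.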

flowchart TD
  A[user question] --> B[retrieve]
  B --> C{enough evidence?}
  C -->|yes| D[draft with citations]
  C -->|no| E[ask or fallback]
  D --> F[evaluate answer]
  E --> F

Keep the boring fallback

Every LLM path needs a non-LLM path. A queue, a form, a rule-based answer, a human review lane, a link to the raw record. Something.

This is less exciting than "AI-native." It is also what lets you ship. The system can be useful before it is perfect because uncertainty has a product path instead of a dead end.
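As a sketch of what "a product path instead of a dead end" can mean in code, assume a hypothetical `respond` function where `draft` stands in for the model call and the fallback lanes are made-up names:

```typescript
// Hypothetical shapes; `draft` stands in for the model call.
interface Evidence {
  sourceId: string;
  text: string;
}

type Outcome =
  | { kind: "answer"; text: string; citations: string[] }
  | { kind: "fallback"; lane: "human_review" | "form"; reason: string };

function respond(
  question: string,
  evidence: Evidence[],
  draft: (question: string, evidence: Evidence[]) => string,
  minEvidence = 1,
): Outcome {
  // Not enough to cite: route to the non-LLM lane instead of guessing.
  if (evidence.length < minEvidence) {
    return { kind: "fallback", lane: "human_review", reason: "not enough evidence" };
  }
  try {
    return {
      kind: "answer",
      text: draft(question, evidence),
      citations: evidence.map((e) => e.sourceId),
    };
  } catch {
    // A failed model call lands in the same lane, not in an error page.
    return { kind: "fallback", lane: "human_review", reason: "model call failed" };
  }
}
```

Because `Outcome` is a union, every caller has to handle the fallback case explicitly; the type makes the boring path hard to forget.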

The model earns its place when it reduces work without hiding uncertainty. If it cannot cite, cannot ask, cannot decline, or cannot fall back, it is not ready for production. It is still a demo with nicer styling.