The most useful architecture arguments I have been in did not start with a tool. They started with a boring question: what has to stay true after this ships?
A marketplace, a travel platform, a farm operations product, and an AI workflow all exercise different parts of the system. Buyers retry payments. Suppliers upload broken data. Workers redeliver the same job. Mobile devices disappear offline. A model gives a confident answer to a bad input. The shape changes, but the operating problem is familiar: the system has to keep making progress when the happy path is gone.
I like simple systems because they leave fewer places for responsibility to hide. Simple does not mean small. It means a new engineer can follow the business event, find the side effect, understand the retry, and verify the deploy without first learning a private mythology of components.
Start With The Business Transaction
The first boundary is usually not a service. It is the business transaction.
For a commerce system, that might be an order moving from quoted to paid. For a travel platform, it might be a booking changing state after a supplier response. For an AI product, it might be the durable record that says which input, model version, tools, and outputs produced a customer-visible answer.
The implementation can vary. The rule is the same: the business fact and the record of work that must happen next should be written together, or you have made the most important part of the system probabilistic.
```sql
-- The state change and the work record commit together, or not at all.
BEGIN;

UPDATE orders
SET status = 'paid'
WHERE id = $1;

-- json_object(key, value) is the SQLite/MySQL form; PostgreSQL spells it json_build_object.
INSERT INTO outbox (topic, payload)
VALUES ('order.paid', json_object('id', $1));

COMMIT;
```
The important part is not the table name. It is the guarantee. The state change and the work record are one decision. If the transaction fails, nothing escapes. If the worker crashes, the work is still visible. If the job is delivered twice, the system has to prove that twice means once.
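To make that guarantee concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The schema, the helper names (`new_db`, `mark_paid`, `order_status`, `outbox_rows`), the quoted-to-paid guard, and the `fail_before_commit` switch are all assumptions made for illustration; a real system would have its own tables and failure modes.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are assumptions for this sketch.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL
);
"""


def new_db() -> sqlite3.Connection:
    """In-memory database with two quoted orders, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO orders (id, status) VALUES (1, 'quoted'), (2, 'quoted')")
    conn.commit()
    return conn


def mark_paid(conn: sqlite3.Connection, order_id: int, fail_before_commit: bool = False) -> None:
    """Write the business fact and the follow-up work record as one decision."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        cur = conn.execute(
            "UPDATE orders SET status = 'paid' WHERE id = ? AND status = 'quoted'",
            (order_id,),
        )
        if cur.rowcount != 1:
            raise LookupError(f"order {order_id} is not in the 'quoted' state")
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.paid", json.dumps({"id": order_id})),
        )
        if fail_before_commit:  # simulated crash, only to show the rollback
            raise RuntimeError("simulated crash before commit")


def order_status(conn: sqlite3.Connection, order_id: int) -> str:
    return conn.execute("SELECT status FROM orders WHERE id = ?", (order_id,)).fetchone()[0]


def outbox_rows(conn: sqlite3.Connection) -> list:
    rows = conn.execute("SELECT topic, payload FROM outbox ORDER BY id").fetchall()
    return [(topic, json.loads(payload)) for topic, payload in rows]
```

With this sketch, `mark_paid(conn, 1, fail_before_commit=True)` leaves the order quoted and the outbox empty, while a clean call leaves the order paid with exactly one `order.paid` row waiting for a worker.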
Design Duplicate Delivery As Normal Weather
Distributed systems do not ask for permission before becoming distributed. A browser times out. A payment provider retries a webhook. A supplier feed sends the same file twice. A queue redelivers after a deploy. An AI agent runs the same tool call after a stream interruption.
If duplicate delivery is treated as a surprise, the team pays for that mistake every week. If it is treated as normal weather, the system gets calmer.
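```sql
CREATE TABLE idempotency_keys (
  key text PRIMARY KEY,          -- supplied by the caller, stable across retries
  request_hash text NOT NULL,    -- exposes key reuse with a different request
  response json,                 -- stored result, returned again on replay
  created_at timestamp NOT NULL DEFAULT current_timestamp
);
```
The table is not the architecture. The architecture is the rule: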
- every externally retried mutation carries an idempotency key;
- every worker job has a stable business key;
- every webhook handler can accept the same event again;
- every replay path is expected to hit existing state;
- every mismatch is visible enough to debug without guessing.
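One way those rules can look in code, sketched in Python with the standard-library `sqlite3` module. The names `run_once`, `request_hash`, `IdempotencyMismatch`, and `new_db` are hypothetical, and the choice to hash a canonical JSON body with SHA-256 is an assumption of this sketch rather than a requirement.

```python
import hashlib
import json
import sqlite3

# Same shape as the idempotency_keys table, using SQLite column types.
SCHEMA = """
CREATE TABLE idempotency_keys (
    key TEXT PRIMARY KEY,
    request_hash TEXT NOT NULL,
    response TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


class IdempotencyMismatch(Exception):
    """Same key, different request: surface it instead of guessing."""


def new_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def request_hash(body: dict) -> str:
    # Canonical JSON so key order in the body does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_once(conn: sqlite3.Connection, key: str, body: dict, handler):
    """Run handler(body) at most once per key; replays get the stored response."""
    digest = request_hash(body)
    with conn:  # the lookup and the insert share one transaction
        row = conn.execute(
            "SELECT request_hash, response FROM idempotency_keys WHERE key = ?",
            (key,),
        ).fetchone()
        if row is not None:
            stored_hash, stored_response = row
            if stored_hash != digest:
                raise IdempotencyMismatch(f"key {key!r} reused with a different request")
            return json.loads(stored_response)
        response = handler(body)
        conn.execute(
            "INSERT INTO idempotency_keys (key, request_hash, response) VALUES (?, ?, ?)",
            (key, digest, json.dumps(response)),
        )
        return response
```

The sketch only holds if the handler's own writes happen in the same transaction as the key row. Under concurrent requests the primary key rejects the second insert, and that conflict still has to be handled. Neither concern is solved here.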
Add Components After The Pain Is Real
Queues, search engines, streams, caches, vector stores, and services can all be correct. The mistake is adopting them before the team can say what constraint they remove.
I do not care whether the answer is a queue, a table, a file, a worker, a managed service, or a short script if the ownership is clear. I do care when the system grows a component because the diagram felt too plain.
- Step 1. Name the constraint: latency, isolation, data shape, cost, team ownership, compliance, or scale.
- Step 2. Prove the constraint with product evidence, not vibes.
- Step 3. Pick the smallest component that removes that constraint.
- Step 4. Write the release check before the migration starts.
- Step 5. Delete the old path when the new one is boring enough to keep.
For search, the constraint might be relevance and ranking. For reporting, it might be read shape and query isolation. For AI workflows, it might be evaluation speed, tool-call auditability, or model cost control. Those are real reasons. "Modern architecture" is not.
The best component is the one that makes the next operational question easier to answer. Who owns this? What happens if it is down? Can we replay the work? How do we know it is falling behind? What is the smallest way to turn it off?
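The "falling behind" question can often be answered with one query. Here is a minimal Python sketch against a hypothetical outbox table. The `created_at` and `processed_at` columns (epoch seconds), the `OutboxLag` type, and the thresholds are assumptions for illustration; real limits should come from the product's own evidence.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical outbox shape; timestamps are epoch seconds for simplicity.
SCHEMA = """
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    created_at REAL NOT NULL,
    processed_at REAL
);
"""


@dataclass(frozen=True)
class OutboxLag:
    pending: int               # rows no worker has finished yet
    oldest_age_seconds: float  # how long the oldest pending row has waited


def new_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def outbox_lag(conn: sqlite3.Connection, now_epoch: float) -> OutboxLag:
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE processed_at IS NULL"
    ).fetchone()
    age = 0.0 if oldest is None else max(0.0, now_epoch - oldest)
    return OutboxLag(pending=pending, oldest_age_seconds=age)


def is_falling_behind(lag: OutboxLag, max_pending: int = 1000, max_age_seconds: float = 60.0) -> bool:
    # Illustrative thresholds only.
    return lag.pending > max_pending or lag.oldest_age_seconds > max_age_seconds
```

The age of the oldest pending row tends to be a more honest signal than the count, because a small backlog that never drains is still an outage for the customer waiting on it.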
The strongest platform improvements I have seen were not only new modules or cleaner diagrams. They were better release loops: code-first API contracts, generated clients, integration tests that exercised the real workflow, canaries, feature flags, and enough backward compatibility that old clients did not become a hidden outage.
Make The Team Part Of The Design
Architecture is not only runtime behavior. It is also the shape of the team that has to understand it on an ordinary Tuesday.
Every extra component creates documentation, alerts, dashboards, permissions, deploy rules, test fixtures, cost conversations, operating playbooks, and onboarding weight. Sometimes that cost is worth paying. Often it is just hidden because the first version worked on a laptop.
The healthier test is this: can a new engineer debug a real customer issue by following one business event through the system?
If the answer is yes, the architecture is probably earning its keep. If the answer is no, the team will invent unofficial maps, private Slack lore, and fragile senior-engineer rituals. That is where reliability goes to disappear.
The Boring Thing Is Not The Lazy Thing
There is a version of "boring architecture" that becomes an excuse to avoid hard work. That is not what I mean.
Boring is not anti-ambition. It is ambition pointed at the product instead of the diagram. It is the discipline to spend complexity only where the product has earned it, and to keep the rest of the system readable enough that people can improve it without fear.
Good engineering is not a stack. It is a habit of making the next decision easier: smaller changes, faster feedback, clearer recovery paths, honest measurements, and fewer surprises for the people who inherit the system after launch.