Why the router is five tiers — Automatos

When we started building the router, we had four tiers. Override at the top — for the rare case the operator says "do this, my way, now" — then a cache, then a rule engine, then the LLM at the bottom for everything that didn't fit. It was clean. We could draw it on a napkin. We did, several times.

This note is about the week, in March, when we tore that drawing up and added a fifth tier. It's not a long story, but it changed the shape of the platform — and, looking at the numbers since, probably the unit economics for every operator who'll ever run on it.¹

02 · The four-tier model
Why four felt right

The first version of routing/engine.py looked exactly like the napkin:

↓ RequestEnvelope

T0    override    user-set
T1    cache       redis · exact key
T2    rules       pattern match
T3    llm         classify · expensive

↓ RoutingDecision (logged)

The principle was simple: cheapest competent answer wins. The cache caught exact repeats. The rule engine caught everything we'd seen often enough to write a regex for. Anything else fell to the LLM, which was costly — but, we reasoned, the kind of decision that deserved the LLM.
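As a rough illustration of "cheapest competent answer wins", the cascade can be sketched as an ordered list of tiers where the first one to return a route ends the walk. This is a minimal sketch, not the contents of routing/engine.py: the `RoutingDecision` fields, the tier function names, the route names, and the in-memory dict standing in for Redis are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoutingDecision:
    route: str  # hypothetical route name, e.g. "billing.refund"
    tier: str   # which tier answered, e.g. "T1"


# A tier takes the request text and returns a route, or None to pass it down.
Tier = Callable[[str], Optional[str]]


def route(request: str, tiers: list[tuple[str, Tier]]) -> RoutingDecision:
    """Walk the tiers in order; the first tier that answers wins."""
    for name, tier in tiers:
        target = tier(request)
        if target is not None:
            return RoutingDecision(route=target, tier=name)
    raise LookupError("no tier produced a decision")


# Toy stand-ins for each tier (all hypothetical).
overrides: dict[str, str] = {}                                  # T0: operator-set
cache = {"refund this order": "billing.refund"}                 # T1: exact key
rules = [(re.compile(r"\bcancel\b", re.I), "billing.cancel")]   # T2: patterns


def t0_override(request: str) -> Optional[str]:
    return overrides.get(request)


def t1_cache(request: str) -> Optional[str]:
    return cache.get(request)


def t2_rules(request: str) -> Optional[str]:
    for pattern, target in rules:
        if pattern.search(request):
            return target
    return None


def t3_llm(request: str) -> Optional[str]:
    # Placeholder for the expensive classification call.
    return "llm.classified"


TIERS = [("T0", t0_override), ("T1", t1_cache), ("T2", t2_rules), ("T3", t3_llm)]
```

With these toy tiers, an exact repeat stops at T1, a request containing "cancel" stops at T2, and a reworded refund request falls through to T3.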

It worked. For two months it worked beautifully. Cache hit rate was a respectable 41%. Rule hit rate was another 22%. The LLM saw 37% of traffic. We were happy with that mix and we wrote it on the whiteboard so we'd remember.

03 · The cache numbers
What the logs said

And then, around week 8, we instrumented the cache misses. Not the misses themselves — those we already counted — but the shape of what was missing.²

Bucket	% of misses	Mean cosine to nearest cache key
Identical intent, different wording	31%	0.94
Genuinely novel	42%	0.61
Pattern fits a rule we hadn't written yet	22%	—
Other	5%	—

That first row is the one that broke us. Almost a third of the requests that missed the cache had a sibling in the cache that meant the same thing. They just didn't share an exact key. "Refund this order" and "please refund #4438" and "can we issue a refund on Maria's last purchase" are, for routing purposes, the same request. The cache didn't know that.

04 · Why four was wrong
The mistake we'd been making

The cache was answering the wrong question. It was asking "have I seen this exact string before" when the question that mattered was "have I seen this exact intent before."
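To make the "exact string" problem concrete, here is a small sketch of how an exact-key cache treats the three refund phrasings quoted earlier. The key scheme (a SHA-256 digest of the lowercased, whitespace-normalised text) is an assumption for illustration; the note does not say how the real keys are derived.

```python
import hashlib


def exact_key(request: str) -> str:
    """Hypothetical cache key: SHA-256 of the normalised request text."""
    normalised = " ".join(request.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


requests = [
    "Refund this order",
    "please refund #4438",
    "can we issue a refund on Maria's last purchase",
]

# One intent, three phrasings, three different keys: every one is a miss.
distinct_keys = len({exact_key(r) for r in requests})
```

Normalising case and whitespace only rescues trivial variants. Any change in wording still produces a new key.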

You can't fix that by writing more rules. The intents are too varied. You can't fix it inside the LLM tier either — by the time you've sent a request to the LLM, you've spent the money. What we needed was something that sat between the rule engine and the LLM, asking a cheaper question: is this semantically close to anything I've routed before?

05 · The fifth tier
How T2.5 was built

It took eight days. Most of that was deciding the threshold. The mechanics were simple: embed the request, search the cache by cosine similarity, accept any hit at or above 0.95. We tried 0.92, 0.94, 0.95 and 0.97 in shadow mode against a week of pilot traffic.

0.92 — too generous. False positives at 11%. The cache started routing "refund this order" to the same path as "cancel this subscription". Correct in spirit, wrong in fact.
0.94 — better. False positives at 4%. Acceptable for low-stakes routing, not for billing.
0.95 — false positives at 0.7%. We could live with that. Hit rate dropped a little.
0.97 — only marginal hits. Most of what we wanted to catch fell below the line.
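A shadow-mode comparison like the one above can be sketched as a replay over logged requests. This sketch rests on assumptions the note does not spell out: that each shadow record carries the similarity to its nearest cached request, the route that would have been reused, and the route the LLM actually chose, and that "false positive" means an accepted hit whose reused route disagrees with the LLM. The records below are made up for the example and are not pilot data.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    similarity: float   # cosine to the nearest cached request
    cached_route: str   # route the semantic tier would have reused
    llm_route: str      # route the LLM chose, treated here as ground truth


def sweep(records: list[ShadowRecord], thresholds: list[float]) -> dict:
    """Return {threshold: (hit_rate, false_positive_rate)}."""
    results = {}
    for t in thresholds:
        hits = [r for r in records if r.similarity >= t]
        wrong = [r for r in hits if r.cached_route != r.llm_route]
        hit_rate = len(hits) / len(records) if records else 0.0
        fp_rate = len(wrong) / len(hits) if hits else 0.0
        results[t] = (hit_rate, fp_rate)
    return results


# Made-up records, only to show the shape of the calculation.
records = [
    ShadowRecord(0.98, "billing.refund", "billing.refund"),
    ShadowRecord(0.96, "billing.refund", "billing.refund"),
    ShadowRecord(0.93, "billing.refund", "billing.cancel"),
    ShadowRecord(0.60, "billing.refund", "support.other"),
]
results = sweep(records, [0.92, 0.95])
```

Lowering the threshold raises the hit rate and, on this toy data, lets in the refund/cancel confusion, which is the trade-off the list above describes.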

0.95 it was. We wrote it into the engine, tucked it between rules and LLM, and called it T2.5. The half-tier number was a joke that stuck.
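The lookup itself can be sketched in a few lines. This is an illustrative version under stated assumptions, not the engine's code: embeddings are plain tuples of floats produced elsewhere (the embedding model is not named in this note), the cache is an in-memory list of `(embedding, route)` pairs, and the search is a linear scan where a real deployment would more likely use a vector index.

```python
import math
from typing import Optional, Sequence

THRESHOLD = 0.95  # the value chosen in the note


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(
    query_vec: Sequence[float],
    cache: list[tuple[Sequence[float], str]],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the route of the nearest cached embedding if it clears the threshold."""
    best_route, best_sim = None, -1.0
    for vec, cached_route in cache:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_route, best_sim = cached_route, sim
    return best_route if best_sim >= threshold else None


# Toy two-dimensional "embeddings", purely for illustration.
toy_cache = [((1.0, 0.0), "billing.refund"), ((0.0, 1.0), "billing.cancel")]
near = semantic_lookup((0.99, 0.05), toy_cache)  # close to the refund vector
far = semantic_lookup((0.7, 0.7), toy_cache)     # halfway between the two
```

Returning `None` below the threshold is what lets the request fall through to the LLM tier unchanged.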

↓ RequestEnvelope

T0      override    user
T1      cache       exact key · ~5ms
T2      rules       pattern · ~8ms
T2.5    semantic    cosine ≥ 0.95 · ~22ms
T3      llm         classify · ~900ms

↓ RoutingDecision (logged · tier-tagged)

06 · The numbers, after
What changed

We let the new engine run for two weeks. Then we counted.

Metric	Before T2.5	After T2.5	Δ
Cache + semantic hit rate	41%	68%	+27pp
Rule hit rate	22%	21%	−1pp
LLM hit rate	37%	11%	−26pp
Mean cost / decision	$0.0091	$0.0027	−70%
Mean latency / decision	340ms	95ms	−72%

The cost number is the one that matters. Roughly 70% of the cost per decision, gone, in step with LLM traffic falling from 37% to 11%. Most of the latency, gone. And, this is the part we didn't expect, the false positive rate held at the 0.7% we'd seen in shadow mode. The threshold was right.
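For anyone repeating the exercise on their own traffic, the relative changes in the table can be recomputed from the before and after columns. The helper name is ours. The inputs are the figures from the table.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from before to after (negative means a reduction)."""
    return (after - before) / before * 100.0


cost_change = relative_change(0.0091, 0.0027)   # mean cost per decision, USD
latency_change = relative_change(340, 95)       # mean latency per decision, ms
llm_share_change = relative_change(37, 11)      # share of traffic reaching the LLM

print(f"cost {cost_change:.1f}%, latency {latency_change:.1f}%, "
      f"LLM share {llm_share_change:.1f}%")
```

Note that the Δ column mixes units: the hit-rate rows are in percentage points, while the cost and latency rows are relative changes.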

07 · What it taught us
The takeaway

Two things, mostly.

First, that the architecture you draw on a napkin is the one to ship — and then, eight weeks in, to second-guess. The four-tier router was not wrong. It was right for the data we had on the napkin. It was wrong for the data we had after eight weeks of operators using it. Those are different problems.

Second, that "semantic" doesn't have to mean "send it to the LLM." A cosine search against your own cache, at 22ms, is semantic. It's just cheap semantic. The LLM tier is for genuine novelty, not for paraphrase.

The router today still has five tiers. We've talked about adding a sixth — a tiny model, sub-50ms, that would sit between T2.5 and T3 — and we've decided, for now, against. The shape of what's left in T3 is genuinely novel enough to warrant the bigger model. We're not going to add a tier just because we can.³

— Yann · Lisbon · 09 May 2026

Footnotes

¹ All numbers in this note are from our internal pilot — 11 operators, ~140k routing decisions over the two-week comparison window. Aggregated; no per-operator data.
² The instrumentation lives in orchestrator/observability/routing_telemetry.py if you're self-hosting and want to do the same exercise.
³ The "tier you don't add" is a recurring theme. See FN.07 for more on why we resist new tiers, and FN.11 for the mission engine equivalent.
