Automatos·ai
FN. 12 Architecture
Index   ·   Volume 01
09 May 2026 · 14 min read

Why the router is five tiers, and not four.

The semantic tier was a late addition. We almost shipped without it. This is the story of the week we changed our minds — and what the cache hit rate looked like before and after.

Subject · Architecture · Routing
Yann L. Co-founder · routing
Fig. 12: Router cutaway, the morning we shipped T2.5 (Plate 012). Tiers: T0 Override → T1 Cache → T2 Rules → T2.5 Semantic (cosine ≥ 0.95) → T3 LLM. Cheapest competent answer wins.

When we started building the router, we had four tiers. Override at the top — for the rare case the operator says "do this, my way, now" — then a cache, then a rule engine, then the LLM at the bottom for everything that didn't fit. It was clean. We could draw it on a napkin. We did, several times.

This note is about the week, in March, when we tore that drawing up and added a fifth tier. It's not a long story, but it changed the shape of the platform — and, looking at the numbers since, probably the unit economics for every operator who'll ever run on it.¹

02 · The four-tier model
Why four felt right

The first version of routing/engine.py looked exactly like the napkin:

RequestEnvelope
  T0    override   user-set
  T1    cache      redis · exact key
  T2    rules      pattern match
  T3    llm        classify · expensive
RoutingDecision    logged

The principle was simple: cheapest competent answer wins. The cache caught exact repeats. The rule engine caught everything we'd seen often enough to write a regex for. Anything else fell to the LLM, which was costly — but, we reasoned, the kind of decision that deserved the LLM.
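A minimal sketch of that cascade, to make "cheapest competent answer wins" concrete. The note names RequestEnvelope and RoutingDecision, but the fields, the tier functions, and the toy routes below are assumptions made for illustration, not the contents of routing/engine.py.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical shapes: only the class names come from the note.
@dataclass(frozen=True)
class RequestEnvelope:
    text: str
    override: Optional[str] = None  # operator-set route, if any


@dataclass(frozen=True)
class RoutingDecision:
    route: str
    tier: str


# A tier returns a route name, or None to pass the request down.
Tier = Callable[[RequestEnvelope], Optional[str]]


def route(request: RequestEnvelope, tiers: list[tuple[str, Tier]]) -> RoutingDecision:
    """Walk the tiers from cheapest to most expensive; the first answer wins."""
    for name, tier in tiers:
        answer = tier(request)
        if answer is not None:
            return RoutingDecision(route=answer, tier=name)
    raise LookupError("no tier produced a route")


# Toy tiers, for illustration only.
EXACT_CACHE = {"refund this order": "refunds"}


def t0_override(req: RequestEnvelope) -> Optional[str]:
    return req.override


def t1_cache(req: RequestEnvelope) -> Optional[str]:
    return EXACT_CACHE.get(req.text)


def t2_rules(req: RequestEnvelope) -> Optional[str]:
    return "billing" if req.text.startswith("invoice") else None


def t3_llm(req: RequestEnvelope) -> Optional[str]:
    return "general"  # stand-in for the expensive classifier


FOUR_TIERS = [("T0", t0_override), ("T1", t1_cache), ("T2", t2_rules), ("T3", t3_llm)]
```

With these toy tiers, an exact repeat stops at T1, while a paraphrase of the same request falls all the way through to T3.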

It worked. For two months it worked beautifully. Cache hit rate was a respectable 41%. Rule hit rate was another 22%. The LLM saw 37% of traffic. We were happy with that mix and we wrote it on the whiteboard so we'd remember.

03 · The cache numbers
What the logs said

And then, around week 8, we instrumented the cache misses. Not the misses themselves — those we already counted — but the shape of what was missing.²

| Bucket | % of misses | Mean cosine to nearest cache key |
|---|---|---|
| Identical intent, different wording | 31% | 0.94 |
| Genuinely novel | 42% | 0.61 |
| Pattern fits a rule we hadn't written yet | 22% | |
| Other | 5% | |

That first row is the one that broke us. Almost a third of the things we were sending to the LLM had a sibling in the cache that meant the same thing — they just didn't share an exact key. "Refund this order" and "please refund #4438" and "can we issue a refund on Maria's last purchase" are, for routing purposes, the same request. The cache didn't know that.
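To see why an exact-key cache cannot help here, consider a hypothetical key function that hashes the normalised request text. The normalisation and the hash choice are assumptions for this sketch; the note only says the cache was keyed on an exact match.

```python
import hashlib


def exact_key(text: str) -> str:
    """Hypothetical cache key: hash of the lower-cased, whitespace-collapsed text."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


phrasings = [
    "Refund this order",
    "please refund #4438",
    "can we issue a refund on Maria's last purchase",
]

# One intent, three distinct keys: every phrasing after the first is a miss.
keys = {exact_key(p) for p in phrasings}
```

Normalisation absorbs differences in case and spacing, but no amount of string clean-up maps these three requests to the same key.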

04 · Why four was wrong
The mistake we'd been making

The cache was answering the wrong question. It was asking "have I seen this exact string before" when the question that mattered was "have I seen this intent before."

You can't fix that by writing more rules. The intents are too varied. You can't fix it inside the LLM tier either — by the time you've sent a request to the LLM, you've spent the money. What we needed was something that sat between the rule engine and the LLM, asking a cheaper question: is this semantically close to anything I've routed before?

05 · The fifth tier
How T2.5 was built

It took eight days. Most of that was deciding the threshold. The mechanics were simple: embed the request, search the cache by cosine similarity, accept any hit at or above 0.95. We tried 0.92, 0.94, 0.95, and 0.97 in shadow mode against a week of pilot traffic.

0.95 it was. We wrote it into the engine, tucked it between rules and LLM, and called it T2.5. The half-tier number was a joke that stuck.
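The mechanics can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: the embeddings are taken as given (the note does not name an embedding model), the cache is a plain list of (vector, route) pairs, and the lookup is a linear scan where a production system would more likely use a vector index. Only the 0.95 threshold comes from the note.

```python
import math
from typing import Optional, Sequence

THRESHOLD = 0.95  # from the note: accept cosine >= 0.95

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(
    query: Vector,
    cache: Sequence[tuple[Vector, str]],
    threshold: float = THRESHOLD,
) -> Optional[str]:
    """Return the route of the nearest cached entry, or None if it is not close enough."""
    best_route: Optional[str] = None
    best_score = -1.0
    for vec, cached_route in cache:
        score = cosine(query, vec)
        if score > best_score:
            best_route, best_score = cached_route, score
    return best_route if best_score >= threshold else None


# Toy two-dimensional "embeddings", purely illustrative.
toy_cache = [([1.0, 0.0], "refunds"), ([0.0, 1.0], "shipping")]
near_hit = semantic_lookup([0.99, 0.10], toy_cache)   # close to "refunds"
no_hit = semantic_lookup([0.70, 0.70], toy_cache)     # halfway between both
```

A request that returns None here is the one that falls through to T3.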

RequestEnvelope
  T0    override   user
  T1    cache      exact key · ~5ms
  T2    rules      pattern · ~8ms
  T2.5  semantic   cosine ≥ 0.95 · ~22ms
  T3    llm        classify · ~900ms
RoutingDecision    logged · tier-tagged

06 · The numbers, after
What changed

We let the new engine run for two weeks. Then we counted.

| Metric | Before T2.5 | After T2.5 | Δ |
|---|---|---|---|
| Cache + semantic hit rate | 41% | 68% | +27pp |
| Rule hit rate | 22% | 21% | −1pp |
| LLM hit rate | 37% | 11% | −26pp |
| Mean cost / decision | $0.0091 | $0.0027 | −70% |
| Mean latency / decision | 340ms | 95ms | −72% |
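The Δ column mixes two kinds of change: percentage-point differences for the hit rates, and relative change for cost and latency. A short check of the arithmetic, using only the figures in the table (the helper names are ours):

```python
def pp_delta(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


def rel_change(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


hit_rate_deltas = {
    "cache + semantic": pp_delta(41, 68),
    "rules": pp_delta(22, 21),
    "llm": pp_delta(37, 11),
}
cost_change = rel_change(0.0091, 0.0027)   # about -70.3
latency_change = rel_change(340, 95)       # about -72.1
```

The three traffic shares sum to 100% both before (41 + 22 + 37) and after (68 + 21 + 11), so the point deltas net to zero.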

The cost number is the one that matters. Roughly 70% of the cost per decision, gone, in line with the LLM's share of traffic falling from 37% to 11%. Most of the latency, gone. And, this is the part we didn't expect, the false positive rate held at the 0.7% we'd seen in shadow mode. The threshold was right.
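For anyone repeating the shadow-mode exercise, a threshold sweep might look like the sketch below. It rests on assumptions the note does not spell out: each shadow sample records the cosine to the nearest cache entry and whether that entry's route agreed with the route the LLM chose, and "false positive rate" means the share of accepted hits that disagreed. The sample data is invented for illustration.

```python
from typing import Iterable


def sweep(
    samples: Iterable[tuple[float, bool]],
    thresholds: Iterable[float],
) -> dict[float, tuple[float, float]]:
    """For each threshold, return (hit rate, false positive rate among accepted hits)."""
    rows = list(samples)
    results: dict[float, tuple[float, float]] = {}
    for t in thresholds:
        # Keep the agreement flag of every sample the tier would have accepted.
        accepted = [agrees for score, agrees in rows if score >= t]
        hit_rate = len(accepted) / len(rows) if rows else 0.0
        fp_rate = accepted.count(False) / len(accepted) if accepted else 0.0
        results[t] = (hit_rate, fp_rate)
    return results


# Invented shadow samples: (cosine to nearest cache entry, route agreed with LLM).
shadow = [
    (0.99, True),
    (0.96, True),
    (0.95, True),
    (0.93, False),
    (0.93, True),
    (0.60, False),
]
results = sweep(shadow, [0.92, 0.95, 0.97])
```

On this toy data, lowering the threshold to 0.92 buys more hits at the price of a wrong route, and raising it to 0.97 gives up hits without improving accuracy, which is the trade-off the sweep is meant to expose.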

07 · What it taught us
The takeaway

Two things, mostly.

First, that the architecture you draw on a napkin is the one to ship — and then, eight weeks in, to second-guess. The four-tier router was not wrong. It was right for the data we had on the napkin. It was wrong for the data we had after eight weeks of operators using it. Those are different problems.

Second, that "semantic" doesn't have to mean "send it to the LLM." A cosine search against your own cache, at 22ms, is semantic. It's just cheap semantic. The LLM tier is for genuine novelty, not for paraphrase.

The router today still has five tiers. We've talked about adding a sixth — a tiny model, sub-50ms, that would sit between T2.5 and T3 — and we've decided, for now, against. The shape of what's left in T3 is genuinely novel enough to warrant the bigger model. We're not going to add a tier just because we can.³

— Yann · Lisbon · 09 May 2026

Footnotes
  ¹ All numbers in this note are from our internal pilot: 11 operators, ~140k routing decisions over the two-week comparison window. Aggregated; no per-operator data.
  ² The instrumentation lives in orchestrator/observability/routing_telemetry.py if you're self-hosting and want to do the same exercise.
  ³ The "tier you don't add" is a recurring theme. See FN.07 for more on why we resist new tiers, and FN.11 for the mission engine equivalent.
