When we started building the router, we had four tiers. Override at the top — for the rare case the operator says "do this, my way, now" — then a cache, then a rule engine, then the LLM at the bottom for everything that didn't fit. It was clean. We could draw it on a napkin. We did, several times.
This note is about the week, in March, when we tore that drawing up and added a fifth tier. It's not a long story, but it changed the shape of the platform — and, looking at the numbers since, probably the unit economics for every operator who'll ever run on it.¹
02 · The four-tier model: Why four felt right
The first version of routing/engine.py looked exactly like the napkin:
The principle was simple: cheapest competent answer wins. The cache caught exact repeats. The rule engine caught everything we'd seen often enough to write a regex for. Anything else fell to the LLM, which was costly — but, we reasoned, the kind of decision that deserved the LLM.
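The cascade described above can be sketched as follows. This is a minimal illustration of "cheapest competent answer wins", not the original routing/engine.py: the class name `FourTierRouter`, the `Decision` type, the override table keyed by request string, and the choice to cache only LLM answers are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    route: str  # where the request should go
    tier: str   # which tier answered: override, cache, rule, or llm


class FourTierRouter:
    """Hypothetical four-tier cascade: override, exact cache, rules, LLM."""

    def __init__(self, llm: Callable[[str], str]):
        self.overrides: dict[str, str] = {}            # operator-pinned routes
        self.cache: dict[str, str] = {}                # exact-string cache
        self.rules: list[tuple[re.Pattern, str]] = []  # (regex, route) pairs
        self.llm = llm                                 # costly fallback

    def route(self, request: str) -> Decision:
        # Try each tier from cheapest to most expensive; first answer wins.
        if request in self.overrides:
            return Decision(self.overrides[request], "override")
        if request in self.cache:
            return Decision(self.cache[request], "cache")
        for pattern, target in self.rules:
            if pattern.search(request):
                return Decision(target, "rule")
        target = self.llm(request)
        self.cache[request] = target  # assumption: only LLM answers are cached
        return Decision(target, "llm")


# Usage with a stand-in LLM that always answers "support".
router = FourTierRouter(llm=lambda request: "support")
router.rules.append((re.compile(r"\brefund\b", re.IGNORECASE), "billing"))
first = router.route("where is my parcel")   # falls through to the LLM
second = router.route("where is my parcel")  # exact repeat, served by the cache
```

Note that the cache key is the raw request string, so any rewording of a request falls through to the later tiers.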
It worked. For two months it worked beautifully. Cache hit rate was a respectable 41%. Rule hit rate was another 22%. The LLM saw 37% of traffic. We were happy with that mix and we wrote it on the whiteboard so we'd remember.
03 · The cache numbers: What the logs said
And then, around week 8, we instrumented the cache misses. Not the misses themselves — those we already counted — but the shape of what was missing.²
| Bucket | % of misses | Mean cosine to nearest cache key |
|---|---|---|
| Identical intent, different wording | 31% | 0.94 |
| Genuinely novel | 42% | 0.61 |
| Pattern fits a rule we hadn't written yet | 22% | — |
| Other | 5% | — |
That first row is the one that broke us. Almost a third of the requests that missed the cache had a sibling in it that meant the same thing; they just didn't share an exact key. "Refund this order" and "please refund #4438" and "can we issue a refund on Maria's last purchase" are, for routing purposes, the same request. The cache didn't know that.
04 · Why four was wrong: The mistake we'd been making
The cache was answering the wrong question. It was asking "have I seen this exact string before" when the question that mattered was "have I seen this exact intent before."
You can't fix that by writing more rules. The intents are too varied. You can't fix it inside the LLM tier either — by the time you've sent a request to the LLM, you've spent the money. What we needed was something that sat between the rule engine and the LLM, asking a cheaper question: is this semantically close to anything I've routed before?
05 · The fifth tier: How T2.5 was built
It took eight days. Most of that was deciding the threshold. The mechanics were simple — embed the request, search the cache by cosine similarity, accept any hit above 0.95. We tried 0.92, 0.94, 0.95, 0.97 in shadow mode against a week of pilot traffic.
- 0.92 — too generous. False positives at 11%. The cache started routing "refund this order" to the same path as "cancel this subscription". Correct in spirit, wrong in fact.
- 0.94 — better. False positives at 4%. Acceptable for low-stakes routing, not for billing.
- 0.95 — false positives at 0.7%. We could live with that. Hit rate dropped a little.
- 0.97 — only marginal hits. Most of what we wanted to catch fell below the line.
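A shadow-mode sweep of this kind can be sketched in a few lines. The sample below is made up for illustration and is not the pilot traffic; the definitions are also assumptions: hit rate is taken as the share of requests whose nearest cache key scores above the threshold, and false positive rate as the share of those hits whose intent does not actually match.

```python
def sweep(samples, thresholds):
    """Return {threshold: (hit_rate, false_positive_rate)}.

    samples: list of (cosine_to_nearest_key, same_intent) pairs, where
    same_intent is a label (human or offline) saying whether the nearest
    cache key really means the same thing as the request.
    """
    results = {}
    for t in thresholds:
        # A hit is any request whose nearest key scores strictly above t.
        hits = [same for sim, same in samples if sim > t]
        hit_rate = len(hits) / len(samples) if samples else 0.0
        fp_rate = hits.count(False) / len(hits) if hits else 0.0
        results[t] = (hit_rate, fp_rate)
    return results


# Made-up labeled sample, for illustration only.
samples = [
    (0.98, True), (0.96, True), (0.955, True), (0.945, False),
    (0.93, True), (0.925, False), (0.80, False), (0.61, False),
]

for threshold, (hit_rate, fp_rate) in sweep(samples, [0.92, 0.94, 0.95, 0.97]).items():
    print(f"{threshold:.2f}  hit rate {hit_rate:.1%}  false positives {fp_rate:.1%}")
```

The trade-off is visible even on toy data: lowering the threshold buys hits and pays for them in false positives.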
0.95 it was. We wrote it into the engine, tucked it between rules and LLM, and called it T2.5. The half-tier number was a joke that stuck.
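The mechanics of such a tier (embed the request, search by cosine similarity, accept anything above the threshold) can be sketched as below. This is an illustration under stated assumptions, not the engine's code: `SemanticCache`, its methods, and the injected `embed` function are hypothetical names, the vectors are hand-made stand-ins for a real embedding model, and the linear scan stands in for whatever index a production system would use.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical T2.5 tier: nearest-neighbour lookup over past routes."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # assumption: caller supplies the embedder
        self.threshold = threshold  # 0.95, the value chosen in shadow mode
        self.entries: list[tuple[Vector, str]] = []

    def add(self, request: str, route: str) -> None:
        self.entries.append((self.embed(request), route))

    def lookup(self, request: str) -> Optional[str]:
        """Return the route of the closest past request, or None on a miss."""
        if not self.entries:
            return None
        query = self.embed(request)
        score, route = max(
            ((cosine(query, vec), route) for vec, route in self.entries),
            key=lambda pair: pair[0],
        )
        return route if score > self.threshold else None


# Hand-made toy vectors standing in for real embeddings.
toy_vectors = {
    "refund this order": [1.0, 0.0, 0.1],
    "please refund #4438": [0.98, 0.05, 0.12],
    "cancel this subscription": [0.6, 0.8, 0.0],
}
cache = SemanticCache(embed=toy_vectors.__getitem__)
cache.add("refund this order", "billing.refund")

paraphrase = cache.lookup("please refund #4438")      # close enough: a hit
different = cache.lookup("cancel this subscription")  # too far: falls to the LLM
```

A `None` from `lookup` is the signal to fall through to the LLM tier, so a miss costs only the embedding and the search.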
06 · The numbers, after: What changed
We let the new engine run for two weeks. Then we counted.
| Metric | Before T2.5 | After T2.5 | Δ |
|---|---|---|---|
| Cache + semantic hit rate | 41% | 68% | +27pp |
| Rule hit rate | 22% | 21% | −1pp |
| LLM hit rate | 37% | 11% | −26pp |
| Mean cost / decision | $0.0091 | $0.0027 | −70% |
| Mean latency / decision | 340ms | 95ms | −72% |
The cost number is the one that matters. Roughly 70% of the LLM bill, gone. Most of the latency, gone. And, the part we didn't expect, the false positive rate held at the 0.7% we'd seen in shadow mode. The threshold was right.
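The deltas in the table can be checked with a few lines of arithmetic. The figures are the ones in the table; the helper name `pct_change` is only for this example.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


cost = pct_change(0.0091, 0.0027)  # mean cost per decision, in dollars
latency = pct_change(340, 95)      # mean latency per decision, in ms
llm_share = pct_change(37, 11)     # LLM share of traffic, in percent

# The three tiers in the table should account for all traffic, before and after.
before_total = 41 + 22 + 37
after_total = 68 + 21 + 11

print(f"cost {cost:.1f}%  latency {latency:.1f}%  LLM share {llm_share:.1f}%")
print(f"tier shares sum to {before_total}% before and {after_total}% after")
```

The relative drop in LLM traffic and the drop in mean cost come out almost identical, which is what you would expect if the LLM tier dominates the cost of a decision.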
07 · What it taught us: The takeaway
Two things, mostly.
First, that the architecture you draw on a napkin is the one to ship — and then, eight weeks in, to second-guess. The four-tier router was not wrong. It was right for the data we had on the napkin. It was wrong for the data we had after eight weeks of operators using it. Those are different problems.
Second, that "semantic" doesn't have to mean "send it to the LLM." A cosine search against your own cache, at 22ms, is semantic. It's just cheap semantic. The LLM tier is for genuine novelty, not for paraphrase.
The router today still has five tiers. We've talked about adding a sixth — a tiny model, sub-50ms, that would sit between T2.5 and T3 — and we've decided, for now, against. The shape of what's left in T3 is genuinely novel enough to warrant the bigger model. We're not going to add a tier just because we can.³
— Yann · Lisbon · 09 May 2026