arXiv:wm.2602.0001 · [cs.LG] · cs.NI · cs.DC v1

MatrixEdge auto-tune v2: per-route TTL adaptation, online.

The MatrixEdge sub-team¹

¹ Matrix Systems Innovations Ltd, Cardiff CF10

Abstract

Auto-tune v2 is the reinforcement loop that selects, per route, per hour, per region, the TTL applied to MatrixEdge cache entries under an operator-declared feasibility set. We describe what changed from v1, the loop end-to-end, the operator-visible decision deltas, and — at length — the things v2 deliberately will not do. The point of the release is not "more caching"; it is the operator knowing why a TTL was the value it was.

What changed in v2.

v1 of the auto-tune loop, shipped in mid-2025, was a per-route bandit. It chose between a small fixed set of TTL candidates per route on a thirty-minute cadence, with a hard-coded objective function. v2 generalises three things at once. The TTL is now continuous; the cadence is per route, not platform-wide; and the objective function is an operator-declared three-dimensional target — origin cost, p99 tail latency, hit ratio — under a feasibility set the operator writes in the route record.

The continuous-TTL change is the largest in effect, the smallest in code. Previously, the gap between candidate TTLs of, say, 30 s and 60 s could be a 12 % swing in hit ratio without the bandit ever choosing 42 s, which would have been the right answer. The continuous version closes that gap. The per-route cadence change is what stopped a high-traffic noisy route from forcing the whole platform onto its rhythm.
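The gap between fixed candidates can be illustrated with a minimal sketch. The loss curve `toy_loss` is made up for illustration, with its minimum placed at 42 s to match the example above; it is not the loop's objective, and ternary search is only one simple way to search a continuous range.

```python
def toy_loss(ttl_s: float) -> float:
    """Hypothetical convex loss with its minimum at 42 s (illustration only)."""
    return (ttl_s - 42.0) ** 2


def best_of_candidates(candidates, loss):
    """v1-style choice: pick the best TTL from a small fixed set."""
    return min(candidates, key=loss)


def ternary_search(loss, lo, hi, tol=1e-3):
    """Continuous choice: minimise a unimodal loss over the interval [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss(m1) < loss(m2):
            hi = m2  # the minimum cannot lie to the right of m2
        else:
            lo = m1  # the minimum cannot lie to the left of m1
    return (lo + hi) / 2.0


discrete_ttl = best_of_candidates([30.0, 60.0], toy_loss)
continuous_ttl = ternary_search(toy_loss, 30.0, 60.0)
```

With candidates of 30 s and 60 s the fixed-set choice can only land on one of the two endpoints, while the continuous search converges on the interior minimum.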

The loop, end to end.

The loop reads the edge-decision column of the WebMatrix data model — the same column the observability surface reads — and writes its decisions back into the same column. The choice of TTL per (route, region, hour) minimises α·cost + β·latency_p99 − γ·hit-ratio, subject to a feasibility set the operator declares once: maximum TTL, hard cap on stale-while-revalidate, and refuse-cache predicates (for example, "no cache when the Authorization header is present").

$$T^{*}_{r,R,h} = \arg\min_{T \in F_r} \bigl[\, \alpha \cdot \mathrm{cost}(T) + \beta \cdot \mathrm{p99}(T) - \gamma \cdot \mathrm{hit}(T) \,\bigr] \tag{1}$$
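Equation (1) can be sketched in a few lines, under loud assumptions: the estimators `toy_cost`, `toy_p99` and `toy_hit` are invented curves, the feasibility set is reduced to a maximum TTL only, and a one-second grid search stands in for whatever optimiser the loop actually uses. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Estimator = Callable[[float], float]  # maps a TTL in seconds to an estimated metric


@dataclass(frozen=True)
class Weights:
    alpha: float  # weight on origin cost
    beta: float   # weight on p99 tail latency
    gamma: float  # weight on hit ratio (subtracted: a higher hit ratio is better)


def objective(ttl_s: float, w: Weights, cost: Estimator, p99: Estimator, hit: Estimator) -> float:
    """The weighted sum minimised in equation (1)."""
    return w.alpha * cost(ttl_s) + w.beta * p99(ttl_s) - w.gamma * hit(ttl_s)


def choose_ttl(w: Weights, cost: Estimator, p99: Estimator, hit: Estimator,
               max_ttl_s: float, step_s: float = 1.0) -> float:
    """Grid search over [0, max_ttl_s]; candidates never exceed the declared maximum."""
    steps = int(max_ttl_s // step_s)
    grid = [i * step_s for i in range(steps + 1)]
    return min(grid, key=lambda t: objective(t, w, cost, p99, hit))


# Made-up estimator shapes, for illustration only.
def toy_hit(t: float) -> float:
    return t / (t + 20.0)


def toy_cost(t: float) -> float:
    return 1.0 - toy_hit(t)


def toy_p99(t: float) -> float:
    return 0.001 * (t - 40.0) ** 2


w = Weights(alpha=1.0, beta=1.0, gamma=1.0)
interior = choose_ttl(w, toy_cost, toy_p99, toy_hit, max_ttl_s=300.0)
capped = choose_ttl(w, toy_cost, toy_p99, toy_hit, max_ttl_s=30.0)
```

With these toy curves the unconstrained optimum sits in the interior of the range; once the declared maximum drops below it, the choice is pinned to the maximum, which is the feasibility set doing its job.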

The decision is reversible per route, at any time, by the operator. The decision record contains a human-readable reason field, written by the loop and not by us, that explains which observation drove the choice. The reason field is what the on-call reads when a TTL has changed unexpectedly.
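One possible shape for such a record, as a sketch only: the field names, the `revert` helper and the sample reason text are all hypothetical and are not taken from the WebMatrix data model.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Decision:
    """Hypothetical decision record for one (route, region, hour)."""
    route: str
    region: str
    hour: int
    ttl_s: float
    previous_ttl_s: Optional[float]
    reason: str  # human-readable, written by the loop


def revert(decision: Decision, operator: str) -> Decision:
    """Return a new record that restores the previous TTL on operator request."""
    if decision.previous_ttl_s is None:
        raise ValueError("no earlier TTL to revert to")
    return replace(
        decision,
        ttl_s=decision.previous_ttl_s,
        previous_ttl_s=decision.ttl_s,
        reason=f"reverted by operator {operator}",
    )


d = Decision(
    route="/api/catalog",
    region="EU-WEST",
    hour=3,
    ttl_s=42.0,
    previous_ttl_s=30.0,
    reason="hit ratio flat above 42 s; origin cost still falling",  # made-up example text
)
r = revert(d, "on-call")
```

Keeping the record immutable and producing a new one on revert means the original decision and its reason stay readable after the operator overrides it.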

What the loop cannot do.

Explicit non-capabilities

The loop cannot pick a TTL outside the feasibility set. It cannot exceed your max-TTL by one second under any circumstance.
The loop cannot cache a request that fails a refuse-cache predicate. The predicate is the gate; the loop is downstream of it.
The loop cannot, of itself, retire a stale cache entry. Eviction is the cache's job; the loop only decides how long a new entry is allowed to live.
The loop cannot learn across customers. Each tenant's loop is trained on each tenant's traffic, in each tenant's tenancy, against each tenant's α, β, γ.
The loop cannot promote itself onto a route the operator has not enabled. Auto-tune is opt-in per route, by default off.
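The ordering of these guarantees can be sketched as a single gate function. This is an assumption-laden illustration: `RouteRecord`, `effective_ttl`, `has_authorization` and the `static_ttl_s` fallback are hypothetical names, and only the max-TTL part of the feasibility set is modelled.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

Headers = Mapping[str, str]
Predicate = Callable[[Headers], bool]


@dataclass(frozen=True)
class RouteRecord:
    """Hypothetical slice of a route record."""
    auto_tune_enabled: bool = False      # opt-in per route, off by default
    max_ttl_s: float = 0.0               # operator-declared maximum TTL
    refuse_cache: Sequence[Predicate] = ()


def has_authorization(headers: Headers) -> bool:
    """Refuse-cache predicate: true when an Authorization header is present."""
    return any(name.lower() == "authorization" for name in headers)


def effective_ttl(route: RouteRecord, headers: Headers,
                  proposed_ttl_s: float, static_ttl_s: float) -> Optional[float]:
    """Return the TTL to apply, or None when the request must not be cached."""
    # The predicate is the gate; the loop's proposal is only considered after it.
    if any(pred(headers) for pred in route.refuse_cache):
        return None
    # Auto-tune never applies to a route the operator has not enabled.
    if not route.auto_tune_enabled:
        return static_ttl_s
    # The proposal is clamped into [0, max_ttl_s]; it can never exceed the maximum.
    return min(max(proposed_ttl_s, 0.0), route.max_ttl_s)


route = RouteRecord(auto_tune_enabled=True, max_ttl_s=60.0, refuse_cache=(has_authorization,))
refused = effective_ttl(route, {"Authorization": "Bearer x"}, 30.0, 10.0)
clamped = effective_ttl(route, {}, 61.0, 10.0)
not_enabled = effective_ttl(RouteRecord(), {}, 61.0, 10.0)
```

Eviction and per-tenant isolation are deliberately absent from the sketch: the first belongs to the cache, and the second is a property of where the loop runs rather than of this function.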

Why the dashboard is public.

The platform's marketing-grade numbers are the operator-facing numbers: per-route-class calibration plots, refreshed weekly; per-region rolling 28-day p99 distributions; and decision-delta logs, with the reason field, whenever the loop changes a TTL by more than 50 % relative to the previous hour. We publish these because the operator at 03:14 needs to be able to read them; if they are good enough for the operator, they are good enough for everyone.

We do not publish customer cache keys, customer traffic patterns, customer hit ratios, or aggregate customer numbers in marketing copy. The line between "the methodology is public" and "the customer is public" is the line we draw.
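The decision-delta threshold can be read as a relative change against the previous hour's TTL. The sketch below assumes that reading; the function name and the handling of a zero previous TTL are illustrative choices, not documented behaviour.

```python
def is_decision_delta(previous_ttl_s: float, new_ttl_s: float, threshold: float = 0.5) -> bool:
    """True when the TTL moved by more than `threshold` relative to the previous hour."""
    if previous_ttl_s <= 0.0:
        # Assumption: with no positive baseline, any change at all is worth logging.
        return new_ttl_s != previous_ttl_s
    return abs(new_ttl_s - previous_ttl_s) / previous_ttl_s > threshold


logged_up = is_decision_delta(40.0, 61.0)    # 52.5 % increase
logged_down = is_decision_delta(40.0, 19.0)  # 52.5 % decrease
at_boundary = is_decision_delta(40.0, 60.0)  # exactly 50 %, not "more than"
```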

Calibration discipline.

The loop is calibrated against a rolling 28-day window per region. When a region falls outside the window (a new PoP, a major traffic shift, a known incident), the loop is suspended on that region until the window closes. v2 has had two such suspensions in its life so far. Both are in the public calibration log, with date, region, lane class, duration and root cause. The shorter was four hours; the longer was eleven days, following the Frankfurt PoP coming online in November.

Event	Region	Started	Closed	Root cause
SUS-2511-A	EU-CENTRAL	22 Nov 2025	03 Dec 2025	New PoP (Frankfurt) — calibration window re-open
SUS-2603-A	EU-WEST	14 Mar 2026, 09:00	14 Mar 2026, 13:00	Acquirer-eu-2 incident; traffic shape unrepresentative

Table 1. Suspension events in the life of auto-tune v2. Both are in the public calibration log with the full decision record around them.
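The durations quoted in the text follow directly from the Started and Closed columns. A small check, assuming midnight for SUS-2511-A since the table gives no times of day for that event:

```python
from datetime import datetime

# (started, closed) per event, copied from Table 1.
# Assumption: SUS-2511-A has no times in the table, so midnight is used for both ends.
suspensions = {
    "SUS-2511-A": (datetime(2025, 11, 22), datetime(2025, 12, 3)),
    "SUS-2603-A": (datetime(2026, 3, 14, 9, 0), datetime(2026, 3, 14, 13, 0)),
}

durations = {
    event: closed - started
    for event, (started, closed) in suspensions.items()
}

longer_days = durations["SUS-2511-A"].days
shorter_hours = durations["SUS-2603-A"].total_seconds() / 3600.0
```

This gives eleven days for the EU-CENTRAL event and four hours for the EU-WEST event, matching the figures in the text.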

Want auto-tune on a route?

Open an engineering call. The first conversation is held against your route record, your declared α, β, γ and the routes the loop should be turned on for first. The note we write same-day says which routes auto-tune is clearly worth it for, which it is clearly not, and the calibration window your region needs.
