Observability with reasoning: from log dumps to incident paragraphs.
Matrix Systems Innovations Ltd, Cardiff CF10
Abstract
A walk-through of how MatrixObserve takes a P1 — twelve services, four regions, two upstream dependencies — and produces a one-paragraph explanation in the time it takes the on-call to pour coffee. We describe the structural join that lets the inference layer cite evidence rather than guess, the monthly public eval, and the two material misses MatrixObserve has had in the last six months. We also report the headline MTTE reduction from 28 minutes to 6 minutes (median, n = 14 engagements) and what that number does and does not mean.
The problem MatrixObserve is solving.
A P1 incident at 03:14 used to begin with the on-call opening four browser tabs. The logs tab. The traces tab. The RUM dashboard. The synthetic check page. The on-call's job, for the next twenty-eight minutes, was to do the join the platforms had failed to do: which log lines belonged to which span, which RUM beacon came from which trace, which synthetic check went red in the right region at the right time. The reasoning was the easy part; the join was the long part.
MatrixObserve takes the position that the join is the problem. It is, in particular, the problem that the inference layer cannot solve well: any model asked to reason over four uncorrelated dashboards will reason badly, because the relationships are not in the data. They are in the operator's head. They have to be in the data first.
The graph, before reasoning.
The first six months of the MatrixObserve work were spent on the join. Logs, traces, RUM beacons and synthetic checks land on the same data model, the same envelope, the same time base. The join is structural — a span is linked to its log lines through the trace-id; a RUM beacon is linked to the trace whose route it was issued from; a synthetic check is linked to the route and region it was probing. There is no heuristic correlation pass; the relationships are written into the data at emit-time, by the agents and the beacons, on a schema that is documented and signed.
This is the boring part of the release. It is also the reason the rest works.
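A minimal sketch of what a structural join looks like, assuming record shapes of our own invention for illustration: the class names, field names and `build_graph` are hypothetical, not the MatrixObserve schema. The point it shows is that every link is an equality on a key already written into the record, with no correlation pass.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    route: str
    region: str


@dataclass(frozen=True)
class LogLine:
    trace_id: str  # assumed to be written by the agent at emit time
    message: str


@dataclass(frozen=True)
class RumBeacon:
    trace_id: str  # the trace whose route issued the beacon
    route: str


@dataclass(frozen=True)
class SyntheticCheck:
    route: str
    region: str
    ok: bool


@dataclass
class JoinedTrace:
    spans: list = field(default_factory=list)
    logs: list = field(default_factory=list)
    beacons: list = field(default_factory=list)
    checks: list = field(default_factory=list)


def build_graph(spans, logs, beacons, checks):
    """Join only on keys already present in the records; no heuristics."""
    graph = defaultdict(JoinedTrace)
    for span in spans:
        graph[span.trace_id].spans.append(span)
    for line in logs:
        graph[line.trace_id].logs.append(line)
    for beacon in beacons:
        graph[beacon.trace_id].beacons.append(beacon)
    # A synthetic check attaches to every trace that has a span
    # on the route and region the check was probing.
    for check in checks:
        for joined in graph.values():
            if any(s.route == check.route and s.region == check.region
                   for s in joined.spans):
                joined.checks.append(check)
    return dict(graph)
```

Because the keys are exact, a record with no matching key simply stays unlinked; nothing is attached by proximity in time.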
The reasoning, with discipline.
At incident time, the inference layer reads the joined graph and produces a paragraph. The paragraph has three named parts: the symptom (what the user saw, drawn from RUM and synthetic), the proximate cause (the span that failed, drawn from traces and logs), and the upstream cause (the dependency, region or rollout that explains the proximate cause, drawn from the dependency-graph and the rollout-history columns).
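The three-part shape can be sketched as a small data structure. This is an illustration under our own assumptions: `Part`, `ALLOWED_SOURCES`, `is_valid` and `render` are hypothetical names, and the source labels are shorthand for the evidence kinds named above.

```python
from dataclasses import dataclass

# Which evidence kinds each named part may draw on, per the description above.
ALLOWED_SOURCES = {
    "symptom": {"rum", "synthetic"},
    "proximate_cause": {"trace", "log"},
    "upstream_cause": {"dependency_graph", "rollout_history"},
}
PART_ORDER = ("symptom", "proximate_cause", "upstream_cause")


@dataclass(frozen=True)
class Part:
    name: str
    text: str
    sources: tuple  # evidence kinds this sentence was drawn from


def is_valid(part):
    """A part must cite at least one source, and only sources allowed for it."""
    return (
        part.name in ALLOWED_SOURCES
        and bool(part.sources)
        and set(part.sources) <= ALLOWED_SOURCES[part.name]
    )


def render(parts):
    """Emit the paragraph in fixed order: symptom, proximate, upstream."""
    by_name = {p.name: p for p in parts}
    if set(by_name) != set(PART_ORDER) or not all(is_valid(p) for p in parts):
        raise ValueError("paragraph needs all three parts, each from its own sources")
    return " ".join(by_name[name].text for name in PART_ORDER)
```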
The inference layer is disciplined in three ways. It cannot cite evidence outside the joined graph; if it does not have a link to a span, log or beacon, it does not assert. It is required to flag low-confidence assertions explicitly, with the confidence score on the assertion. And it has a documented "be silent" mode: when the graph does not contain enough evidence to produce a paragraph at the rubric's threshold, the output is the graph alone, not a hallucinated paragraph. We would rather be silent than wrong.
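The three rules can be sketched as a gate in front of the paragraph. This is a sketch under stated assumptions, not the production logic: `Assertion`, `compose` and both threshold values are hypothetical, and the real thresholds come from the rubric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    text: str
    evidence_ids: tuple  # ids of spans, logs or beacons in the joined graph
    confidence: float    # 0.0 to 1.0


def compose(assertions, graph_ids, silent_below=0.5, flag_below=0.8) -> Optional[str]:
    """Return a paragraph, or None to signal 'be silent' (show the graph alone).

    silent_below and flag_below are illustrative values only.
    """
    # Rule 1: an assertion must cite evidence, all of it inside the joined graph.
    cited = [
        a for a in assertions
        if a.evidence_ids and set(a.evidence_ids) <= graph_ids
    ]
    # Rule 3: if anything had to be dropped, or confidence is too low, stay silent.
    if (
        not cited
        or len(cited) < len(assertions)
        or any(a.confidence < silent_below for a in cited)
    ):
        return None
    # Rule 2: low-confidence assertions carry their score explicitly.
    sentences = []
    for a in cited:
        if a.confidence < flag_below:
            sentences.append(f"{a.text} [low confidence: {a.confidence:.2f}]")
        else:
            sentences.append(a.text)
    return " ".join(sentences)
```

A caller that receives `None` would render the joined graph with no paragraph on top of it.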
The eval, monthly, public.
The inference layer is evaluated against a public incident set every month. The eval set is twenty incidents drawn from the public post-mortem corpus across four kinds of system — web checkout, streaming media, B2B API, internal platform — with the ground-truth proximate and upstream causes annotated by humans and re-validated each quarter. The rubric scores three things: did the paragraph identify the right proximate cause, the right upstream cause, and were both assertions properly cited.
The eval set, the rubric and the per-incident scoresheet are published on the same day the eval runs. Two of the last six months' evals contain a material miss — a paragraph whose upstream cause was wrong. The first is a misattribution to a database failover that did not, in fact, fail over; the second is a misattribution to a deployment rollout that had been complete for three hours by the time of the incident. Both are in the eval log, with the model version at the time and the change shipped in response.
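A per-incident scoresheet row can be sketched as three booleans. The equal weighting, the `Verdict` type and the helper names are our assumptions for illustration; the published rubric is the authority on how the three checks are actually scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    incident_id: str
    proximate_correct: bool
    upstream_correct: bool
    both_cited: bool


def score(v):
    """One point per rubric check, 0 to 3 (equal weights are an assumption)."""
    return int(v.proximate_correct) + int(v.upstream_correct) + int(v.both_cited)


def is_material_miss(v):
    """A material miss, as used above: the upstream cause was wrong."""
    return not v.upstream_correct


def summarise(verdicts):
    """Total score and the incidents that count as material misses."""
    return {
        "total": sum(score(v) for v in verdicts),
        "max": 3 * len(verdicts),
        "material_misses": [v.incident_id for v in verdicts if is_material_miss(v)],
    }
```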
What MatrixObserve is bad at.
Honest negatives
- Novel failure modes. An incident whose root cause is the first of its kind in your environment has no precedent in the dependency graph; the inference layer will say so rather than guess.
- Hardware-only incidents with no telemetry. A NIC flap on a rack you do not instrument is invisible to MatrixObserve. We will not invent telemetry that does not exist.
- Vendor incidents the vendor has not posted. The vendor-feed integration is fast (median 3 min from vendor post to incident card), but it is not faster than the vendor.
- Long-tail concurrency bugs. A concurrency bug that fires at a 0.01% rate, too rarely to register in the graph, is, by design, not what MatrixObserve catches.
The 28 → 6 number.
The headline number for the release is that MTTE for a four-region P1 with twelve services and two upstream dependencies fell from 28 minutes (industry baseline, operator-confirmed against their last twelve P1s pre-WebMatrix) to 6 minutes (MatrixObserve, in production, median over fourteen engagements). The number is a median, not a mean. The spread is large: the worst MatrixObserve MTTE in the sample is 19 minutes and the best is 90 seconds.
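The arithmetic behind the headline, and why the median is quoted rather than the mean, can be sketched as follows. The per-engagement values are not published here, so the fourteen-value sample below is made up purely for illustration; it is chosen only to match the stated best (90 seconds), worst (19 minutes) and median (6 minutes).

```python
import statistics

baseline_min = 28.0        # from the text above
observed_median_min = 6.0  # from the text above

# Relative reduction implied by the two headline figures.
relative_reduction = (baseline_min - observed_median_min) / baseline_min

# HYPOTHETICAL sample, in minutes: NOT the real engagement data.
hypothetical_sample = [1.5, 2, 3, 4, 5, 5.5, 6, 6, 7, 8, 10, 12, 15, 19]

# With an even sample size the median is the average of the two middle values.
median = statistics.median(hypothetical_sample)
mean = statistics.mean(hypothetical_sample)

print(f"relative reduction: {relative_reduction:.1%}")
print(f"median: {median} min, mean: {mean:.2f} min")
```

In a right-skewed sample like this one, a few slow incidents pull the mean above the median, which is why the two should not be read interchangeably.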
The reduction is concentrated in the join step. Once the joined graph exists, the step in which a human writes the explanation is fast; it was the join that took twenty minutes. The number is therefore not a claim that the model is faster than a human at reasoning; it is a claim that the joined graph is faster than four browser tabs at supporting the human's reasoning.