arXiv:wm.2603.0005 · [cs.NI] · cs.SE · cs.LG v1

MatrixObserve.

The MatrixObserve sub-team¹

¹ Matrix Systems Innovations Ltd, Cardiff CF10

Abstract

MatrixObserve joins logs, traces, RUM and synthetic check results into a single graph on the WebMatrix data model and runs an inference layer over the graph at incident time. The output is a one-paragraph explanation in English — symptom, proximate cause, upstream cause — rather than a wall of correlated lines. We describe the join, the inference discipline, the public monthly eval, and the conditions under which the inference layer is deliberately silent.

One model. Four telemetry sources.

Modern observability has a join problem, not a collection problem. The agents emit. The collectors collect. The data lands. The dashboards render. What goes missing is the relationship between the line in the log, the span in the trace, the user beacon in RUM, and the synthetic check that went red two regions away — which is the relationship that, if a person could read it quickly, would tell them what happened.

MatrixObserve takes the position that the join is the product. Logs, traces, RUM and synthetic checks all land on the same data model, with the same envelope and the same time base. The join is structural, not heuristic: a span is linked to the log lines emitted under its context; a RUM beacon is linked to the trace whose route it was issued from; a synthetic check is linked to the route and region it was probing. The graph is, at the moment a page fires, already joined.
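A minimal sketch of what a structural join of this kind could look like, assuming each record carries the identifiers it would need (trace and span IDs, route, region). The class names, field names and `build_graph` function are hypothetical illustrations, not the WebMatrix data model; linking a synthetic check to the spans that share its route and region is a simplification of "linked to the route and region it was probing".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    route: str
    region: str


@dataclass(frozen=True)
class LogLine:
    trace_id: str
    span_id: str
    message: str


@dataclass(frozen=True)
class RumBeacon:
    trace_id: str
    route: str


@dataclass(frozen=True)
class SyntheticCheck:
    route: str
    region: str
    status: int


def build_graph(spans, logs, beacons, checks):
    """Return an undirected adjacency map; every edge comes from a shared key."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    # Index spans once by each key the other record types join on.
    by_span = {(s.trace_id, s.span_id): s for s in spans}
    by_trace = defaultdict(list)
    by_route_region = defaultdict(list)
    for s in spans:
        by_trace[s.trace_id].append(s)
        by_route_region[(s.route, s.region)].append(s)

    # Log line -> the span whose context it was emitted under.
    for log in logs:
        span = by_span.get((log.trace_id, log.span_id))
        if span is not None:
            link(log, span)

    # RUM beacon -> spans of its trace that serve the beacon's route.
    for beacon in beacons:
        for span in by_trace.get(beacon.trace_id, ()):
            if span.route == beacon.route:
                link(beacon, span)

    # Synthetic check -> spans on the route and region it probes (simplified).
    for check in checks:
        for span in by_route_region.get((check.route, check.region), ()):
            link(check, span)

    return graph
```

No edge in this sketch depends on timestamps or text similarity: a record with no matching key stays out of the graph rather than being attached by guesswork, which is the sense in which the join is structural rather than heuristic.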

From dumped lines to one paragraph.

An incident at 03:14 used to look like four browser tabs: logs, traces, RUM, the synthetic dashboard. The on-call's job was to do, by eye, the join that the platforms had failed to do. MatrixObserve does the join. The inference layer reads the joined graph and produces a paragraph. The paragraph names the symptom (what the user saw), the proximate cause (the span that failed and how), and the upstream cause (the dependency, the region, the rollout that explains the proximate cause).

Proposition 5.1 (MTTE: mean time to explanation). For a four-region P1 with twelve services and two upstream dependencies, MTTE under MatrixObserve falls from μ = 28 min (industry baseline · operator-confirmed) to μ = 6 min (MatrixObserve · production, n = 14 engagements, median). The reduction is concentrated in the join step, not in the writing step.
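Taking the two reported figures at face value, the size of the reduction works out as follows. The comparison is indicative only: the 28 min figure is an industry baseline and the 6 min figure is a median over n = 14 engagements, so the two were not measured on the same population.

```latex
\Delta_{\mathrm{MTTE}} = 28\ \text{min} - 6\ \text{min} = 22\ \text{min},
\qquad
\frac{28 - 6}{28} = \frac{11}{14} \approx 0.79
```

That is roughly a 79% reduction, or a little under a fivefold speed-up (28 / 6 ≈ 4.7).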

The paragraph is not a substitute for the graph. The graph is below the paragraph; the on-call can click into any phrase and land in the exact span, log line or beacon the inference layer cited. The paragraph is what gets posted in the channel; the graph is what gets posted in the post-mortem.

incident · INC-2026-04-14-0314 · P1 · 03:14 UTC

SYMPTOM
Customers in EU-WEST and EU-CENTRAL see 502 on /checkout/confirm at a sustained rate of 4.1% since 03:09 UTC. RUM beacons confirm the user-side error.

PROXIMATE CAUSE
Spans of name payment-router.dispatch are returning ECONNREFUSED at 03:09:47 UTC for upstream acquirer-eu-2; TCP RST observed; circuit-breaker open since 03:09:51. Healthy alternative acquirer-eu-1 is degraded (p99 latency 1.4× baseline).

UPSTREAM CAUSE
Synthetic checks against acquirer-eu-2's status page show 503 from 03:08:33 UTC onward. Vendor incident ID ACQ-EU2-26-04-14-A posted at 03:11:02 UTC. The WebMatrix vendor-feed has matched the incident; suggested action is to fail checkout traffic to acquirer-eu-1 with relaxed rate-cap until ACQ resolves.

trace · /trace/0e9a8b · log · /logs/payment-router/03:09 · vendor · ACQ-EU2

Figure 1. A representative incident card. The three paragraphs are MatrixObserve's output; the links on the bottom line are the exact citations from the graph. The card is posted to the on-call channel as a single message; the graph is what the post-mortem author opens, one click below.
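One way to picture the card in Figure 1 as data is three claims, each carrying the citation it links to. The sketch below is an assumption about shape only: the `Claim` and `Explanation` names are hypothetical, the claim texts are shortened from the figure, and the pairing of each claim with one particular link is chosen for illustration, since the card lists its citations on a single line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str      # sentence shown to the on-call
    citation: str  # link into the graph that backs the sentence


@dataclass(frozen=True)
class Explanation:
    symptom: Claim
    proximate_cause: Claim
    upstream_cause: Claim

    def render(self) -> str:
        """Format the three claims as one channel message, one line per claim."""
        sections = [
            ("SYMPTOM", self.symptom),
            ("PROXIMATE CAUSE", self.proximate_cause),
            ("UPSTREAM CAUSE", self.upstream_cause),
        ]
        return "\n".join(
            f"{label}: {claim.text} [{claim.citation}]" for label, claim in sections
        )

    def citations(self) -> list[str]:
        """Return the graph links in the order the claims appear."""
        return [
            self.symptom.citation,
            self.proximate_cause.citation,
            self.upstream_cause.citation,
        ]


# Values shortened from Figure 1; the claim-to-link pairing is illustrative.
card = Explanation(
    symptom=Claim(
        "502 on /checkout/confirm in EU-WEST and EU-CENTRAL", "/trace/0e9a8b"
    ),
    proximate_cause=Claim(
        "payment-router.dispatch returns ECONNREFUSED for acquirer-eu-2",
        "/logs/payment-router/03:09",
    ),
    upstream_cause=Claim("acquirer-eu-2 status page returns 503", "ACQ-EU2"),
)
print(card.render())
```

Keeping the citation on the claim itself, rather than in a separate list, means a sentence cannot be rendered without the link that supports it.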

How MatrixObserve gets stood up.

For most customers MatrixObserve is a four-hour stand-up. Point your existing OTLP collectors at the WebMatrix endpoint; install the RUM beacon (one signed script tag); point up to six synthetic checks at the routes you care most about. The graph is populated in about an hour. The inference layer becomes useful at about the four-hour mark, once the routes have produced enough span volume that the graph has structure to reason over.

We do not require ripping out the existing observability platform. MatrixObserve runs alongside; the same OTLP traffic can fan out to both. The customers who eventually retire the old platform do so because they stop opening it, not because we asked them to.

The eval is monthly and public.

The inference layer is evaluated against a public incident set every month, with the eval set, the rubric and the per-incident scoresheet published on the same day the eval runs. Two of the eval incidents in the last six months are ones MatrixObserve materially got wrong — it produced a paragraph whose upstream cause was misattributed. Both are in the eval log, with the reason, the model version at the time, and the change shipped in response. We do not curate the eval to make the platform look better than it is.

An engineering call against your trace volume.

The most useful conversation about MatrixObserve is held against your existing OTLP traffic, your incident history, and one current open question. Forty-five minutes; written note same day. The note tells you what we think MatrixObserve would have done at your last three P1s and where, honestly, it would not have helped.
