Engineering · May 2026 · 9 min read

Predictive vs reactive monitoring: why thresholds keep failing you

Almost every monitoring setup in production today is reactive by design. You pick a number — 90% disk, 75% heap, a queue depth of 10,000 — and you wait for a metric to cross it. When the alert finally fires, the condition is already happening. You are not preventing the incident. You are being notified that it has begun.

That model made sense in an era when collecting metrics was the hard part. It isn't anymore. Agents are cheap, time-series databases are fast, and dashboards are everywhere. The hard part today is the judgement layer on top: knowing which of a thousand trends matters, what it means for the service, and how much time you have before it turns into a page. Static thresholds answer none of those questions.

A threshold is a lagging indicator wearing a costume

Consider the most common alert in the world: disk usage over 90%. On its own, that number tells you almost nothing about urgency. A volume filling at 100 MB/hour with 50 GB free gives you roughly three weeks. The same 90% reading, with the same 50 GB free, on a volume filling at 4 GB/hour gives you about twelve hours. Identical alert, two completely different situations, and a static alarm cannot tell them apart, because it only knows the level, never the rate.

This is the core defect. A threshold collapses a moving system into a single instantaneous reading. It throws away the slope, which is exactly the information that predicts when you'll hit the wall.
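
To make the level-versus-rate point concrete, here is a minimal Python sketch. The function name `hours_until_full` and the two sample volumes are illustrative assumptions, not taken from any particular tool. Both volumes sit at the same 90% reading and differ only in fill rate, which is assumed to stay constant.

```python
def hours_until_full(free_gb: float, fill_rate_gb_per_hour: float) -> float:
    """Hours until free space runs out, assuming a constant fill rate."""
    if fill_rate_gb_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never hits the wall
    return free_gb / fill_rate_gb_per_hour


# Hypothetical volumes: both are 1,000 GB at 90% used (100 GB free),
# so a level-based alert sees no difference between them.
slow = hours_until_full(free_gb=100.0, fill_rate_gb_per_hour=0.5)
fast = hours_until_full(free_gb=100.0, fill_rate_gb_per_hour=40.0)

print(f"slow volume: {slow:.1f} h to full")
print(f"fast volume: {fast:.1f} h to full")
```

The only input that separates the two cases is the slope, which is the value a static threshold never sees.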

Why teams set thresholds too low — and too high

Because a threshold can't reason about rate, engineers compensate by guessing. Set it conservatively (say 70% disk) and you get paged constantly for volumes that are perfectly stable at 72% — classic alert fatigue, the thing that trains an on-call rotation to swipe away notifications without reading them. Set it aggressively (95%) and you preserve sleep right up until the night something fills fast and you get nine minutes of warning instead of nine hours.

There is no static number that is simultaneously quiet enough to ignore at rest and early enough to act on under load. You are picking a point on a bad trade-off curve. Forecasting removes the trade-off, because it keys off the trend rather than the level.
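
One way to picture keying off the trend is an alert rule that compares projected time-to-full against the lead time the team needs, instead of comparing the level against a fixed percentage. The sketch below is an illustration under stated assumptions: `Volume`, `static_alert`, `forecast_alert`, the sample volumes, and the 12-hour lead time are all hypothetical, and the fill rate is assumed to be constant and measured elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    size_gb: float
    used_gb: float
    fill_rate_gb_per_hour: float  # assumed to be measured elsewhere


def static_alert(v: Volume, threshold_pct: float) -> bool:
    """Classic rule: fire when the level crosses a fixed percentage."""
    return 100.0 * v.used_gb / v.size_gb >= threshold_pct


def forecast_alert(v: Volume, lead_time_hours: float) -> bool:
    """Trend rule: fire when projected time-to-full drops below the needed lead time."""
    if v.fill_rate_gb_per_hour <= 0:
        return False  # not growing, so there is nothing to project
    hours_left = (v.size_gb - v.used_gb) / v.fill_rate_gb_per_hour
    return hours_left <= lead_time_hours


# Hypothetical volumes: one stable at 72%, one at 60% but filling fast.
stable = Volume("archive", size_gb=1000, used_gb=720, fill_rate_gb_per_hour=0.0)
filling = Volume("ingest", size_gb=1000, used_gb=600, fill_rate_gb_per_hour=50.0)

for v in (stable, filling):
    print(v.name, static_alert(v, 70), static_alert(v, 95), forecast_alert(v, 12))
```

In this made-up case the 70% rule pages for the stable volume, neither static rule notices the fast-filling one, and the trend rule flags only the volume that is actually eight hours from full.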

What forecasting actually computes

The mechanics are not exotic. Fit the recent trajectory of a metric, account for its normal daily and weekly seasonality, and project it forward to the point where it crosses a meaningful boundary. The output is not "disk is at 90%" — it's "this volume reaches 100% at roughly 02:00 tomorrow, give or take an hour." That single reframing changes the whole conversation:

  • Free space or raise a watermark before Elasticsearch flips indices to read-only.
  • Set an eviction policy before a noeviction Redis starts rejecting writes.
  • Scale a pipeline before the persistent queue fills and Logstash stalls its inputs with back-pressure.

The same approach works for any saturating resource: JVM old-gen occupancy, Redis memory against maxmemory, inode counts, connection pools. Anything that fills, leaks, or trends toward a limit is forecastable, and the limit is usually the thing that actually causes the outage.
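
The projection step can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a description of any product's method: it fits a straight line by ordinary least squares and ignores the daily and weekly seasonality that a real forecaster would account for. The names `fit_line` and `projected_crossing` and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two samples with distinct timestamps")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_crossing(samples, limit):
    """samples: list of (datetime, value) pairs.

    Returns the datetime at which the fitted line reaches `limit`,
    or None if the trend is flat or falling.
    """
    t0 = samples[0][0]
    hours = [(t - t0).total_seconds() / 3600 for t, _ in samples]
    values = [v for _, v in samples]
    slope, intercept = fit_line(hours, values)
    if slope <= 0:
        return None
    return t0 + timedelta(hours=(limit - intercept) / slope)


# Hypothetical readings: disk usage in percent, sampled hourly,
# rising two points per hour from 70%.
start = datetime(2026, 5, 1, 12, 0)
samples = [(start + timedelta(hours=h), 70.0 + 2.0 * h) for h in range(6)]
eta = projected_crossing(samples, limit=100.0)
print("projected to reach 100% at", eta)
```

The same function works for any metric with a known ceiling, such as heap occupancy or inode counts, as long as the limit is expressed in the same unit as the samples.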

The gap the industry left open

Most monitoring tools optimised for the wrong half of the problem. They got very good at storing and drawing metrics, and assumed a human would supply the intelligence. So you end up with forty dashboards that show you everything and decide nothing. At 2am, a tired engineer is expected to hold heap, GC time, search latency and shard allocation in their head simultaneously and infer the cause. That's not a tooling success; it's the tool outsourcing its hardest job back to you.

Predictive monitoring closes that gap by doing the correlation and the projection up front, and by leading with a conclusion instead of a canvas. The dashboard is still there when you want to verify — but the default surface is a sentence, not a grid of charts.

Lead time has real economic value

Hours of warning are the difference between a calm fix during business hours and a frantic one mid-outage. With lead time, remediation is planned, reviewed, and reversible. Without it, you're improvising under pressure with a customer-facing clock running, which is exactly when the worst mistakes get made. Prevention isn't just nicer for on-call; it's also cheaper than detection, because the incident that never reaches production has no MTTR, no status page, and no postmortem.

Where prediction honestly doesn't help

Forecasting is for things that trend. It will not predict a kernel panic, a bad deploy, a network partition, or a cable someone unplugged — those are step changes with no runway. A serious monitoring system needs both: forecasting for the slow-burn saturation failures that make up a surprising share of real outages, and fast reactive detection for the sudden faults. Anyone selling you prediction as a replacement for detection is overselling. The point is to stop using reactive tools for problems that were visible hours in advance, if only something had been doing the projection.

The shift in one line

Reactive monitoring measures how fast you find out. Predictive monitoring measures how often you never have to. The first is a property of your tooling; the second is a property of your night's sleep. We think the second is the better metric — and it's the one your on-call rotation will thank you for.
