Predicting Elasticsearch heap pressure before nodes drop out
The worst Elasticsearch incidents rarely announce themselves as "out of memory." They show up as a node that stopped responding to the master, shards going unassigned, search latency spiking, and a cascade as the cluster reroutes load onto the survivors — which are now also under pressure. Underneath almost all of it is the JVM heap. If you can see heap trouble coming, you can prevent the cascade. Here's how to read the signals that actually lead.
Heap used is the wrong number to alert on
Total heap usage sawtooths constantly — it climbs as objects allocate, then drops when garbage collection reclaims them. Alerting on "heap > 75%" catches the top of every normal GC sawtooth and pages you for a cluster that's perfectly healthy. The number that actually predicts trouble is old-generation occupancy after a major GC. If old-gen is still high right after a collection ran, the JVM couldn't reclaim it — that memory is live, the runway is shrinking, and the next collections will run more often for less benefit.
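As a rough illustration of reading the floor of the sawtooth rather than its peaks, here is a minimal Python sketch. It assumes you already sample old-gen occupancy at a fixed interval and treats each local minimum as a post-GC reading. That is an approximation: a stricter version would take the reading when the old-gen collection count increments. The function names and sample values are invented for the example.

```python
from typing import List


def post_gc_floors(samples: List[float]) -> List[float]:
    """Return readings that look like they were taken right after a collection.

    A reading counts as post-GC when it is a local minimum of the sawtooth:
    lower than the sample before it and no higher than the sample after it.
    """
    floors = []
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        if cur < prev and cur <= nxt:
            floors.append(cur)
    return floors


def floor_is_rising(floors: List[float], min_points: int = 3) -> bool:
    """True when each of the last `min_points` floors is higher than the one before."""
    recent = floors[-min_points:]
    if len(recent) < min_points:
        return False
    return all(b > a for a, b in zip(recent, recent[1:]))


# Old-gen occupancy as a fraction of old-gen max (illustrative values only).
healthy = [0.40, 0.55, 0.70, 0.41, 0.56, 0.71, 0.40, 0.57, 0.72, 0.41, 0.50]
leaking = [0.40, 0.55, 0.70, 0.48, 0.62, 0.76, 0.57, 0.70, 0.82, 0.66, 0.75]

print(floor_is_rising(post_gc_floors(healthy)))
print(floor_is_rising(post_gc_floors(leaking)))
```

Both series peak above 70%, so a raw "heap > 75%" style threshold treats them much the same. Only the second has a floor that climbs after every collection.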
GC time is the leading indicator
Watch time spent in garbage collection per interval, especially old-gen (CMS or G1 old) collections. The death spiral is recognisable long before a node drops: collections get more frequent, each one reclaims less, and the JVM starts spending a meaningful fraction of wall-clock time in stop-the-world pauses instead of serving requests. Rising GC time with flat or falling reclaimed memory is the single clearest "this node is heading for trouble" signal Elasticsearch emits — and it's almost never on the default dashboard.
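One way to turn that into a number is to difference the cumulative GC counters between two samples. The sketch below assumes snapshots shaped like the old-gen collector entry in the node stats API (`collection_count`, `collection_time_in_millis`). The snapshot values and the function name are made up for illustration.

```python
from typing import Optional


def old_gc_fraction(prev: dict, cur: dict, interval_s: float) -> Optional[float]:
    """Fraction of wall-clock time spent in old-gen GC between two snapshots.

    `prev` and `cur` are cumulative counters for one collector, taken
    `interval_s` seconds apart. Returns None if the counter went backwards,
    which usually means the node restarted between samples.
    """
    delta_ms = cur["collection_time_in_millis"] - prev["collection_time_in_millis"]
    if delta_ms < 0 or interval_s <= 0:
        return None
    return delta_ms / (interval_s * 1000.0)


# Made-up snapshots, one per minute.
snapshots = [
    {"collection_count": 120, "collection_time_in_millis": 48_000},
    {"collection_count": 122, "collection_time_in_millis": 49_200},
    {"collection_count": 126, "collection_time_in_millis": 52_800},
    {"collection_count": 133, "collection_time_in_millis": 61_800},
]
fractions = [old_gc_fraction(a, b, 60.0) for a, b in zip(snapshots, snapshots[1:])]
print(fractions)
```

In this invented series the node goes from spending about 2% of each minute in old-gen collection to about 15%. Plotted next to the post-GC floor, that is the shape of the spiral described above.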
Circuit breakers are the last warning before errors
Elasticsearch's circuit breakers (parent, fielddata, request) trip to reject operations that would push heap over a limit, turning what would have been an OutOfMemoryError into a handled error. Tripped breakers are not a nuisance to silence: they're the cluster telling you it's already protecting itself. A rising trip count is a sign you're operating at the edge of your heap budget, and a forecast of when fielddata or request memory will cross its breaker limit is far more useful than the rejection error after the fact.
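A small sketch of tracking breaker headroom and new trips between two polls. It assumes dictionaries shaped like the per-breaker entries in the node stats breaker section (`limit_size_in_bytes`, `estimated_size_in_bytes`, `tripped`). The numbers and the helper name are hypothetical.

```python
def breaker_report(prev: dict, cur: dict) -> dict:
    """Summarise each breaker as utilisation of its limit plus trips since `prev`."""
    report = {}
    for name, stats in cur.items():
        limit = stats["limit_size_in_bytes"]
        used = stats["estimated_size_in_bytes"]
        earlier_trips = prev.get(name, {}).get("tripped", 0)
        report[name] = {
            "utilisation": used / limit if limit else 0.0,
            "new_trips": stats["tripped"] - earlier_trips,
        }
    return report


# Illustrative values only, not taken from a real cluster.
before = {
    "parent": {"limit_size_in_bytes": 8_000_000_000, "estimated_size_in_bytes": 6_000_000_000, "tripped": 2},
    "fielddata": {"limit_size_in_bytes": 3_200_000_000, "estimated_size_in_bytes": 1_600_000_000, "tripped": 0},
}
after = {
    "parent": {"limit_size_in_bytes": 8_000_000_000, "estimated_size_in_bytes": 6_800_000_000, "tripped": 5},
    "fielddata": {"limit_size_in_bytes": 3_200_000_000, "estimated_size_in_bytes": 2_400_000_000, "tripped": 0},
}

for name, row in breaker_report(before, after).items():
    print(name, round(row["utilisation"], 2), row["new_trips"])
```

Tracking utilisation as well as trips matters because the trip counter only moves once requests are already being rejected, while utilisation shows the approach.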
What's eating the heap (and why it creeps)
Heap pressure usually isn't a single event — it's accumulation. The common drivers:
- Shard count. Every shard and every Lucene segment carries fixed heap overhead. A cluster that's fine today can creep into trouble purely from index growth and over-sharding, with no change in query load.
- Fielddata. Aggregating or sorting on text fields (which requires fielddata to be enabled on the field, since it is off by default) loads fielddata into heap, where it stays until evicted. One expensive aggregation pattern can dominate old-gen.
- Large or bulk requests. Oversized bulk indexing and deep pagination spike request memory and are a frequent trigger for parent-breaker trips.
Because these creep, they're textbook forecasting targets: the trend is smooth and the limit is fixed, so projecting old-gen occupancy forward gives real lead time.
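To make "projecting forward" concrete, here is a minimal sketch that fits a straight line to post-GC old-gen readings and estimates when it reaches a ceiling. Ordinary least squares is a stand-in chosen for brevity, not necessarily what a production forecaster would use. The 0.90 ceiling and the readings are assumptions for the example.

```python
from typing import Optional, Sequence


def hours_until(times_h: Sequence[float], values: Sequence[float], ceiling: float) -> Optional[float]:
    """Estimate hours from the last sample until a linear trend reaches `ceiling`.

    Returns None when there is too little data or the trend is flat or falling.
    """
    n = len(times_h)
    if n < 2 or n != len(values):
        return None
    mean_t = sum(times_h) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times_h)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_h, values)) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_cross = (ceiling - intercept) / slope
    # Clamp to zero if the trend line is already past the ceiling.
    return max(0.0, t_cross - times_h[-1])


# Hourly post-GC old-gen occupancy as a fraction of old-gen max (illustrative).
times = [0, 1, 2, 3, 4]
floors = [0.50, 0.54, 0.58, 0.62, 0.66]
print(hours_until(times, floors, ceiling=0.90))  # roughly 6 hours at this slope
```

Real occupancy is noisier than this, so a sensible version would fit over a longer window and report a range instead of a single number.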
From "node left the cluster" to hours of warning
Tie it together and the leading signals form a clear early-warning chain: old-gen occupancy after GC trending up, GC time per interval rising, breaker trips beginning to appear. Each is visible well before a node actually drops out. Forecasting old-gen occupancy against the heap ceiling converts that chain into a sentence with a clock on it, such as "node es-data-03 is on track to sustained heap pressure within the hour; GC time is up 3× and the orders index shards on it will go unassigned if it drops", instead of a 3am page that says a node is simply gone.
The remediation you buy with lead time
Warning is only valuable if you can act on it. With hours instead of minutes you can clear fielddata, kill or throttle the offending aggregation, reduce shard count on bloated indices, rebalance, or scale a data node — all calmly, before the cluster starts shedding nodes and rerouting shards under load. That's the whole argument for predictive monitoring on Elasticsearch: heap crises are slow, legible, and forecastable, which means most of them never have to become incidents at all.
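As one example of a low-risk first step, clearing fielddata for a single index goes through the clear cache API with `fielddata=true`. The sketch below only builds the request and does not send it. The base URL is an assumption, and the `orders` index name is borrowed from the example alert above.

```python
import urllib.request


def clear_fielddata_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) a clear-cache request limited to fielddata."""
    url = f"{base_url.rstrip('/')}/{index}/_cache/clear?fielddata=true"
    return urllib.request.Request(url, method="POST")


# Hypothetical local cluster address.
req = clear_fielddata_request("http://localhost:9200", "orders")
print(req.get_method(), req.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=10)
```

Clearing the cache buys time but does not remove the cause: if the same aggregation keeps running, the fielddata will simply load again.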