How to monitor a Redis cluster (and forecast OOM before it drops writes)
Redis fails quietly. CPU looks fine, the dashboard is green, latency is flat, and then writes start
bouncing with "OOM command not allowed" errors because used memory crossed maxmemory on a node
nobody was watching. A Redis cluster is more than the sum of its nodes, and the metrics that predict its
failures are not the ones most dashboards put front and centre. Here's what to actually watch, and why.
Memory is the metric — but not the way you think
The number that matters is not "memory used" in absolute terms. It's used_memory as a
trajectory against maxmemory, combined with the eviction policy that decides what
happens when the two meet. Three values together tell the whole story:
- used_memory and its rate of change: is it climbing, and how fast?
- maxmemory: the ceiling it's heading for.
- maxmemory-policy: what Redis does at the ceiling.
That last one is decisive and routinely overlooked. With an allkeys-lru or
volatile-ttl policy, hitting the ceiling means evictions: degraded hit rate, but writes
survive, as long as there is something eligible to evict (the volatile-* policies only evict keys
that have a TTL set). With noeviction, the default, hitting the ceiling means write commands start
failing. Same memory curve, completely different incident. A monitor that reports memory without
reading the policy can't tell you which one you're about to have.
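As a rough sketch of reading the policy alongside the ceiling, the function below classifies what a node will do when used_memory reaches maxmemory. It assumes a dict shaped like the output of INFO memory (the maxmemory and maxmemory_policy fields); the function name and the returned labels are illustrative, not part of any library.

```python
def ceiling_outcome(info: dict) -> str:
    """Classify what happens when used_memory reaches maxmemory.

    `info` is assumed to look like parsed INFO memory output.
    The returned labels are hypothetical names chosen for this sketch.
    """
    if info.get("maxmemory", 0) == 0:
        # maxmemory 0 means Redis itself enforces no ceiling
        return "unbounded"
    policy = info.get("maxmemory_policy", "noeviction")
    if policy == "noeviction":
        return "writes_fail"
    if policy.startswith("volatile-"):
        # only keys with a TTL are candidates; with none, writes fail too
        return "evicts_ttl_keys"
    return "evicts_any_key"


node = {"maxmemory": 4_000_000_000, "maxmemory_policy": "noeviction"}
print(ceiling_outcome(node))
```

Two nodes with the same memory curve can return different labels here, which is exactly why the policy belongs in the alert text.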
Forecasting time-to-OOM
Once you treat memory as a trajectory, the useful output is a time, not a percentage. If
used_memory is climbing 42 MB/min and there's 1.6 GB of headroom, you have roughly 38 minutes —
and if the policy is noeviction, that's 38 minutes until writes fail. That forecast is what
turns a memory chart into an alert worth acting on: enough lead time to raise the ceiling, fix the eviction
policy, or hunt down the runaway key growth before anything breaks.
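The arithmetic behind that forecast can be sketched as a simple linear extrapolation. The function name is hypothetical, and the used and ceiling values below are assumptions picked only so that the headroom matches the 1.6 GB in the example; a real forecaster would likely smooth the rate over several samples.

```python
from typing import Optional

MB = 1_000_000  # decimal megabytes, an assumption for this example


def minutes_to_ceiling(used: float, ceiling: float,
                       rate_per_min: float) -> Optional[float]:
    """Linear time-to-ceiling estimate; None if memory is not growing."""
    if rate_per_min <= 0:
        return None
    headroom = max(ceiling - used, 0.0)
    return headroom / rate_per_min


# 1.6 GB of headroom, climbing at 42 MB/min
eta = minutes_to_ceiling(used=2_400 * MB, ceiling=4_000 * MB,
                         rate_per_min=42 * MB)
print(f"{eta:.0f} minutes to maxmemory")  # prints "38 minutes to maxmemory"
```

Returning None for a flat or falling trend keeps the caller from alerting on a ceiling the node is not actually approaching.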
Fragmentation: the memory you're using but can't see
used_memory is the logical size of your data. The process can hold considerably more resident
memory than that (used_memory_rss), and the ratio between the two, mem_fragmentation_ratio, is a quiet
killer. A ratio drifting well above 1.5 means the allocator is holding memory your data isn't using, which
brings the real OOM point forward without used_memory ever looking alarming. Watching fragmentation drift
is how you avoid being surprised by an OOM that "shouldn't" have happened yet.
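A minimal sketch of the ratio check, assuming a dict shaped like parsed INFO memory output with used_memory and used_memory_rss in bytes. The function names are illustrative, and the 1.5 threshold is simply the figure from the paragraph above rather than a universal limit.

```python
from typing import Optional


def fragmentation_ratio(info: dict) -> Optional[float]:
    """Resident memory divided by logical data size; None if no data."""
    used = info["used_memory"]
    if used <= 0:
        return None
    return info["used_memory_rss"] / used


def is_fragmented(info: dict, threshold: float = 1.5) -> bool:
    """True when the ratio exceeds the (assumed) alerting threshold."""
    ratio = fragmentation_ratio(info)
    return ratio is not None and ratio > threshold


# Sample values are made up for illustration.
sample = {"used_memory": 2_000_000_000, "used_memory_rss": 3_400_000_000}
print(fragmentation_ratio(sample), is_fragmented(sample))
```

On very small datasets the ratio can read high without meaning much, so it may be worth ignoring nodes below some minimum used_memory.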
Hit rate, evictions, and a skewed keyspace
Cache hit rate, keyspace_hits / (keyspace_hits + keyspace_misses), is your efficiency signal. A falling
hit rate alongside rising evictions usually means the working set has outgrown memory, or that a hot key
pattern is thrashing the cache. In a cluster, an eviction storm on a single node is frequently the
first visible symptom of a skewed keyspace: hash slots aren't evenly loaded, one shard is doing more than
its share, and it'll be the first to OOM. That's a cluster-level diagnosis you can only make by watching
every node at once.
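The two checks above can be sketched as follows: a per-node hit rate from the keyspace_hits and keyspace_misses counters, and a crude skew test that flags nodes whose used_memory sits well above the cluster median. The node names, the sample figures, and the 1.5x factor are all assumptions for illustration.

```python
from statistics import median
from typing import Dict, List, Optional


def hit_rate(info: dict) -> Optional[float]:
    """keyspace_hits / (keyspace_hits + keyspace_misses); None with no lookups."""
    hits, misses = info["keyspace_hits"], info["keyspace_misses"]
    total = hits + misses
    return hits / total if total else None


def skewed_nodes(used_by_node: Dict[str, int], factor: float = 1.5) -> List[str]:
    """Nodes whose used_memory exceeds `factor` times the cluster median."""
    mid = median(used_by_node.values())
    return sorted(n for n, used in used_by_node.items() if used > factor * mid)


# Hypothetical cluster snapshot (used_memory in MB for readability).
cluster = {"cache-01": 100, "cache-02": 400, "cache-03": 110}
print(skewed_nodes(cluster))  # ['cache-02']
print(hit_rate({"keyspace_hits": 900, "keyspace_misses": 100}))
```

Comparing against the median rather than the mean keeps one overloaded shard from hiding itself by dragging the average up.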
Replication lag and persistence
For replicated setups, track the offset gap between primary and replica
(master_repl_offset on the primary vs the replica's processed offset, reported as slave_repl_offset
on the replica). Lag that grows means a replica that
won't be current if you fail over to it: a data-loss risk hiding behind a healthy-looking primary. On the
persistence side, watch rdb_last_bgsave_status and aof_last_bgrewrite_status: a
failing background save is both a durability problem and, because the forked child shares memory
copy-on-write and every page modified during the save gets duplicated, a memory
spike risk on a box that's already tight.
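A small sketch of both checks, assuming dicts shaped like parsed INFO replication and INFO persistence output. The function names are hypothetical; the field names are the ones Redis reports.

```python
from typing import List


def replica_lag_bytes(primary_info: dict, replica_info: dict) -> int:
    """Bytes of replication stream the replica has not yet processed."""
    gap = primary_info["master_repl_offset"] - replica_info["slave_repl_offset"]
    return max(gap, 0)  # samples taken at different instants can cross


def persistence_problems(info: dict) -> List[str]:
    """Names of persistence status fields that are not 'ok'."""
    fields = ("rdb_last_bgsave_status", "aof_last_bgrewrite_status")
    return [f for f in fields if info.get(f, "ok") != "ok"]


# Sample values are made up for illustration.
primary = {"master_repl_offset": 5_000_000}
replica = {"slave_repl_offset": 4_750_000}
print(replica_lag_bytes(primary, replica))  # 250000
print(persistence_problems({"rdb_last_bgsave_status": "err",
                            "aof_last_bgrewrite_status": "ok"}))
```

A single lag reading says little on its own; it is the gap growing across successive samples that signals a replica falling behind.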
Connections trending toward the limit
connected_clients climbing toward maxclients is a slow-burn failure that static
thresholds catch late. Forecasting the trend gives you time to find the connection leak — usually an
application pool that isn't releasing — before new clients start getting refused.
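One way to forecast the trend is a least-squares slope over recent connected_clients samples, extrapolated to maxclients. This is a sketch under assumptions: the sample series is invented, the function names are hypothetical, and 10000 is used because it is the Redis default for maxclients.

```python
from typing import Optional, Sequence


def slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def minutes_to_maxclients(minutes: Sequence[float], clients: Sequence[float],
                          maxclients: int = 10_000) -> Optional[float]:
    """Minutes until the fitted trend reaches maxclients; None if not rising."""
    rate = slope(minutes, clients)
    if rate <= 0:
        return None
    return max(maxclients - clients[-1], 0) / rate


# Five one-minute samples of connected_clients (illustrative values).
eta = minutes_to_maxclients([0, 1, 2, 3, 4], [9000, 9050, 9100, 9150, 9200])
print(eta)  # 16.0
```

Fitting a slope across several samples, rather than differencing the last two, makes the estimate less jumpy when clients connect in bursts.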
Watching the whole cluster from one agent
Clusters fail at the seams: the node nobody was watching, the metric that only matters in aggregate. You don't want to install and maintain an agent per node, or hand-write a config that enumerates them. With Foreseer, one agent scrapes every node you list, reports cluster-level health by default, and lets you drill into any single node's memory, evictions, hit rate, clients and persistence on demand, each with its own time range and aggregation. You install one agent; because every node is tracked independently, each node counts as one service against your plan.
The payoff is simple: instead of a green dashboard and a 3am surprise, you get "node cache-02
reaches maxmemory in ~38 minutes under a noeviction policy — writes will start
failing; raise the ceiling or set an eviction policy now." That's the difference between watching Redis and
actually staying ahead of it.
See it on your own infrastructure
One line to install. Your first insight lands within minutes.