Redis fails quietly. CPU looks fine, the dashboard is green, latency is flat, and then writes start bouncing with "OOM command not allowed when used memory > 'maxmemory'" because memory crossed maxmemory on a node nobody was watching. A Redis cluster is more than the sum of its nodes, and the metrics that predict its failures are not the ones most dashboards put front and centre. Here's what to actually watch, and why.

Memory is the metric — but not the way you think

The number that matters is not "memory used" in absolute terms. It's used_memory as a trajectory against maxmemory, combined with the eviction policy that decides what happens when the two meet. Three values together tell the whole story:

used_memory and its rate of change — is it climbing, and how fast?
maxmemory — the ceiling it's heading for.
maxmemory-policy — what Redis does at the ceiling.

That last one is decisive and routinely overlooked. With an allkeys-lru or volatile-ttl policy, hitting the ceiling means evictions: degraded hit rate, but writes survive. (The volatile-* policies only evict keys that carry a TTL, so if nothing is evictable they fail writes just like noeviction.) With noeviction, the default, hitting the ceiling means write commands start failing. Same memory curve, completely different incident. A monitor that reports memory without reading the policy can't tell you which one you're about to have.
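As a rough illustration of reading the policy alongside the numbers, here is a minimal sketch that parses the text of INFO memory and says which kind of incident the ceiling would produce. The helper names parse_info and ceiling_outcome are made up for this example; the field names (maxmemory, maxmemory_policy) are the ones Redis reports.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def ceiling_outcome(info: dict[str, str]) -> str:
    """Describe what happens when used_memory reaches maxmemory."""
    maxmemory = int(info.get("maxmemory", "0"))
    policy = info.get("maxmemory_policy", "noeviction")
    if maxmemory == 0:
        return "no limit set"  # maxmemory 0 means Redis enforces no ceiling
    if policy == "noeviction":
        return "writes fail"
    if policy.startswith("volatile-"):
        return "evicts keys with a TTL; writes fail if none are evictable"
    return "evicts keys"
```

Feeding it the INFO output from each node gives one line per node that a memory chart alone can't: the same used_memory curve maps to "evicts keys" on one node and "writes fail" on another.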

Forecasting time-to-OOM

Once you treat memory as a trajectory, the useful output is a time, not a percentage. If used_memory is climbing 42 MB/min and there's 1.6 GB of headroom, you have roughly 38 minutes — and if the policy is noeviction, that's 38 minutes until writes fail. That forecast is what turns a memory chart into an alert worth acting on: enough lead time to raise the ceiling, fix the eviction policy, or hunt down the runaway key growth before anything breaks.
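The arithmetic behind that forecast is headroom divided by growth rate. A minimal sketch, assuming you already collect (minute, used_memory) samples and that growth is roughly linear over the window; the function name and the least-squares fit are illustrative choices, not a description of how any particular tool computes it.

```python
def minutes_to_ceiling(samples, maxmemory):
    """Estimate minutes until used_memory reaches maxmemory.

    samples: list of (minute, used_memory_bytes), oldest first.
    Returns None when there is no ceiling, too little data, or no growth.
    """
    if len(samples) < 2 or maxmemory <= 0:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Least-squares slope: bytes of growth per minute.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking: no forecast
    headroom = maxmemory - samples[-1][1]
    return max(headroom, 0) / slope
```

With samples climbing 42 MB per minute and 1.6 GB of headroom, this returns about 38, matching the figure above. A straight-line fit is the simplest model that works; bursty workloads may need a shorter window or a more robust estimator.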

Fragmentation: the memory you're using but can't see

used_memory is what Redis has requested from its allocator, which is close to the logical size of your data. The process's resident memory (used_memory_rss) can be considerably higher, and the ratio between them, mem_fragmentation_ratio, is a quiet killer. A ratio drifting well above 1.5 means the process is holding memory your data isn't using. Because maxmemory is enforced against used_memory rather than resident memory, that brings the real OOM point, the one where the host itself runs out, forward without used_memory ever looking alarming. Watching fragmentation drift is how you avoid being surprised by an OOM that "shouldn't" have happened yet.
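One way to make that concrete is to ask what resident memory would be if used_memory reached maxmemory at today's ratio. A rough sketch, assuming the ratio stays constant (it often won't) and using a hypothetical function name; used_memory and used_memory_rss are the fields from INFO memory.

```python
def projected_rss_at_ceiling(used_memory, used_memory_rss, maxmemory):
    """Return (fragmentation_ratio, projected_rss_bytes).

    Assumes the current RSS-to-used ratio still holds when used_memory
    reaches maxmemory, which is only a first approximation.
    """
    if used_memory <= 0:
        return None
    ratio = used_memory_rss / used_memory
    # Integer arithmetic keeps the byte count exact.
    projected = maxmemory * used_memory_rss // used_memory
    return ratio, projected
```

If the projected figure exceeds the memory actually available to the process on the host or in the container, the kernel's limit is the one you'll hit first, not maxmemory.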

Hit rate, evictions, and a skewed keyspace

Cache hit rate — keyspace_hits over hits-plus-misses — is your efficiency signal. A falling hit rate alongside rising evictions usually means the working set has outgrown memory, or that a hot key pattern is thrashing the cache. In a cluster, an eviction storm on a single node is frequently the first visible symptom of a skewed keyspace: hash slots aren't evenly loaded, one shard is doing more than its share, and it'll be the first to OOM. That's a cluster-level diagnosis you can only make by watching every node at once.
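A small sketch of both signals, under some assumptions: keyspace_hits, keyspace_misses and evicted_keys are cumulative counters, so the inputs here are deltas over a sampling interval that you compute yourself, and the "3x the median" outlier rule is an arbitrary illustrative threshold.

```python
import statistics


def hit_rate(hits, misses):
    """Hit rate over an interval, from deltas of keyspace_hits/keyspace_misses."""
    total = hits + misses
    return hits / total if total else None


def eviction_outliers(evicted_by_node, factor=3.0):
    """Nodes whose evicted_keys delta exceeds `factor` times the cluster median.

    evicted_by_node: {node_name: evictions during the interval}.
    """
    if not evicted_by_node:
        return []
    median = statistics.median(evicted_by_node.values())
    return sorted(
        node for node, evicted in evicted_by_node.items()
        if evicted > 0 and evicted > factor * median
    )
```

For example, eviction_outliers({"cache-01": 10, "cache-02": 900, "cache-03": 12}) singles out cache-02, which is the node to check for hot keys or overloaded hash slots.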

Replication lag and persistence

For replicated setups, track the offset gap between primary and replica (master_repl_offset vs the replica's processed offset). Lag that grows means a replica that won't be current if you fail over to it: a data-loss risk hiding behind a healthy-looking primary. On the persistence side, watch rdb_last_bgsave_status and aof_last_bgrewrite_status. A failing background save is a durability problem, and background saves are also a memory spike risk on a box that's already tight: the forked child shares pages with the parent copy-on-write, so every page written while the save runs gets duplicated.
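The offset gap can be read from the primary alone, since its INFO replication output lists each attached replica's acknowledged offset on a slaveN line. A minimal sketch that parses that text; the function name is hypothetical and the sample format follows what a primary reports.

```python
import re


def replica_offset_gaps(replication_info: str) -> dict[str, int]:
    """Map 'ip:port' of each replica to bytes behind master_repl_offset."""
    master_offset = None
    replicas = {}
    for line in replication_info.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif re.fullmatch(r"slave\d+", key):
            # value looks like: ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            replicas[f"{attrs['ip']}:{attrs['port']}"] = int(attrs["offset"])
    if master_offset is None:
        return {}
    return {addr: master_offset - offset for addr, offset in replicas.items()}
```

A gap that stays near zero is healthy; what matters for failover is a gap that keeps widening across successive samples.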

Connections trending toward the limit

connected_clients climbing toward maxclients is a slow-burn failure that static thresholds catch late. Forecasting the trend gives you time to find the connection leak — usually an application pool that isn't releasing — before new clients start getting refused.
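The same headroom-over-rate idea applies to connections. A deliberately simple sketch using just two samples of connected_clients; the function name is illustrative, and maxclients is assumed to come from your configuration (for example CONFIG GET maxclients).

```python
def minutes_to_maxclients(earlier, later, maxclients):
    """Estimate minutes until connected_clients reaches maxclients.

    earlier, later: (minute, connected_clients) samples.
    Returns None if the samples are out of order or the count isn't growing.
    """
    (t0, c0), (t1, c1) = earlier, later
    if t1 <= t0 or c1 <= c0:
        return None
    rate = (c1 - c0) / (t1 - t0)  # new connections per minute
    return max(maxclients - c1, 0) / rate
```

Going from 4,000 to 4,500 clients in ten minutes against a limit of 10,000 gives 110 minutes. Pairing this with the rejected_connections counter from INFO stats confirms whether clients are already being turned away.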

Watching the whole cluster from one agent

Clusters fail at the seams: the node nobody was watching, the metric that only matters in aggregate. You don't want to install and maintain an agent per node, or hand-write a config that enumerates them. With Foreseer, one agent scrapes every node you list, reports cluster-level health by default, and lets you drill into any single node's memory, evictions, hit rate, clients and persistence on demand — each with its own time range and aggregation. You install one agent, and because every node is tracked independently, each counts as a service against your plan.

The payoff is simple: instead of a green dashboard and a 3am surprise, you get "node cache-02 reaches maxmemory in ~38 minutes under a noeviction policy — writes will start failing; raise the ceiling or set an eviction policy now." That's the difference between watching Redis and actually staying ahead of it.

How to monitor a Redis cluster (and forecast OOM before it drops writes)