Digression: Zipf's Law and the Limits of Caching

Caching is the first instinct of many developers when services are under heavy load. And as a rule, it helps — dramatically, in fact. A search service that queries the database with every request might manage 200 requests per second. The same service, with a cache for frequent search queries, might manage 1,000. A fivefold increase, without changing a single line of business logic.

That sounds like a scaling solution. But it isn’t. The factor of five is borrowed — it doesn’t belong to the system, but to the traffic pattern — and traffic patterns change.

Why caching works so well

The explanation for why caching is often a good idea is provided by a law formulated by the Harvard linguist George Kingsley Zipf in 1949. Zipf investigated the frequency of words in texts and found a remarkably regular pattern: the most frequent word occurs twice as often as the second most frequent, three times as often as the third most frequent, and so on.

In general: the kth most frequently requested item is requested with a frequency proportional to $\frac{1}{k^\alpha}$, where $\alpha$ is typically close to 1 (measured values range from 0.6 to 1.2, depending on the domain).

Curiously, this pattern appears in all sorts of contexts — for example, in the distribution of search queries or page views. Breslau et al. demonstrated in 1999 that web traffic follows a Zipf distribution. The consequence: a tiny fraction of all possible queries accounts for the majority of traffic.

An example: An online shop receives 500,000 different search queries per day. The top 5,000 (i.e. 1%) generate 400,000 of these. A cache containing these 5,000 entries — a few megabytes of storage — reduces the database load by 80%. Not least, it also cuts infrastructure costs.
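The arithmetic can be checked with a small simulation. The sketch below draws requests from a Zipf distribution and measures how much traffic the most popular 1% of items receives; the item count, sample size and exponent are illustrative, not the shop’s real numbers:

```python
import random
from collections import Counter

random.seed(1)
N = 50_000        # distinct possible queries (illustrative, smaller than the example)
ALPHA = 1.0       # Zipf exponent; measured values are typically 0.6-1.2

# Probability of the k-th most popular query is proportional to 1/k^alpha.
weights = [1 / k**ALPHA for k in range(1, N + 1)]
requests = random.choices(range(N), weights=weights, k=200_000)

counts = Counter(requests)
top_1_percent = {item for item, _ in counts.most_common(N // 100)}
covered = sum(counts[i] for i in top_1_percent)
print(f"Top 1% of items receive {covered / len(requests):.0%} of all requests")
```

With $\alpha = 1$ the top 1% already covers well over half of all requests; exponents above 1 concentrate traffic even further.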

If the database is the bottleneck and handles 200 queries per second, the calculation looks like this: With an 80% hit rate, only 20% of requests reach the database. Instead of 200 queries per second, the system now handles 1,000 — because only one in five queries requires the database. This is the formula behind it:

\[\text{Capacity}_\text{eff} = \frac{\text{Capacity}_\text{DB}}{1 - \text{Hit-Rate}}\]

Impressive. But the crucial term in this formula isn’t the database capacity — you control that. It’s the hit rate. And you don’t have control over that.
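As a sanity check, the formula fits in a few lines (a sketch; the function name is mine):

```python
def effective_capacity(db_capacity_rps: float, hit_rate: float) -> float:
    """Requests per second the system can serve when only misses reach the database."""
    if not 0 <= hit_rate < 1:
        raise ValueError("hit rate must be in [0, 1)")
    return db_capacity_rps / (1 - hit_rate)

print(effective_capacity(200, 0.80))   # ~1000 rps: the factor of five from the example
print(effective_capacity(200, 0.50))   # ~400 rps: the same system if the hit rate drops
```

The lever cuts both ways: falling from a 95% to an 80% hit rate quadruples the load on the database, from 5% to 20% of requests.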

The factor is borrowed

Hit rates of 80–95% are common in practice — but they are not the result of good implementation. They are a consequence of the fact that real users behave in a Zipf-distributed manner: many search for the same things, few search for rare items. The system’s capacity therefore does not depend on how much traffic comes in, but what kind of traffic.

That is the difference between performance optimisation and scaling. Performance optimisation says: under normal conditions, the system is faster. Scaling says: double the resources, double the capacity — no matter what comes.

A cached system is optimised for traffic distribution, not traffic volume. And that makes it fragile.

The bot that brought the shop to its knees

I once experienced how a single bot brought our online shop to its knees — not through volume, but through its access pattern. The bot wasn’t malicious; it belonged to a price comparison site: it systematically retrieved every variant of a product. Every colour, every size, every combination. At a high frequency, but only once per variant.

From the cache’s perspective, this was disastrous. Normal customers search for ‘Nike Air Max’ — a cache hit after the first request. The bot searched for ‘Nike Air Max’ — and then clicked on all sizes, colours and variants before moving on to the next product. Combinations that simply did not follow a Zipf distribution — and therefore resulted in near-total cache misses. Every request went straight to the database.

But it got worse. The shop used an LRU cache with a limited size. The bot requests displaced the popular entries, so that once the bot had finished its run, the cache was ineffective for ‘normal traffic’. Response times increased, the queues filled up, and the shop became unusable. Not because there was too much traffic, but because the wrong traffic was coming in.
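The displacement effect is easy to reproduce. The sketch below warms a small LRU cache with Zipf-distributed traffic, runs a bot-style scan (every key exactly once), and measures the hit rate before and after; all sizes are illustrative:

```python
import random
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key) -> bool:
        """True on a hit; on a miss, fills the entry (evicting the oldest if full)."""
        if key in self.data:
            self.data.move_to_end(key)
            return True
        self.data[key] = None
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)
        return False

def hit_rate(cache, keys):
    return sum(cache.get(k) for k in keys) / len(keys)

random.seed(42)
weights = [1 / k for k in range(1, 20_001)]       # Zipf-like popularity, 20,000 items

def zipf_traffic(n):
    return random.choices(range(20_000), weights=weights, k=n)

cache = LRUCache(capacity=1_000)
hit_rate(cache, zipf_traffic(50_000))             # warm the cache
warm = hit_rate(cache, zipf_traffic(5_000))
scan = hit_rate(cache, [f"variant-{i}" for i in range(5_000)])  # bot: each key once
after = hit_rate(cache, zipf_traffic(5_000))
print(f"warm {warm:.0%}, bot scan {scan:.0%}, after scan {after:.0%}")
```

The scan itself scores zero hits, and because it is five times the cache size, it evicts every popular entry on the way through: normal traffic afterwards starts against a cold cache.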

The Long Tail

The bot story is an extreme case, but it illustrates a fundamental problem. Even without bots, the remaining 20% of requests are spread across a huge number of different queries. In a shop with a million products, most products are rarely or never searched for. These queries — the so-called Long Tail — cannot be cached. Not because the cache is too small, but because caching isn’t worth it: an entry queried once an hour takes up memory without ever generating a second hit.

And it is precisely these long-tail queries that are the costly ones. No cache hit means a full database query, and often a more complex one: faceted searches, rare filter combinations, full-text searches for obscure terms. Their response times also vary more widely than those of popular queries, which obediently come from the cache.

In terms of Kingman’s formula: the cache makes the average faster — but it makes the distribution wider. The fast cache hits (30 milliseconds) and the slow database queries (300 milliseconds) create precisely the bimodal mix that leads to disproportionately long waiting times under high utilisation. The variability term V increases, and with it the sensitivity to load. The range between “running fine” and “crashing” becomes narrower.

Anyone looking only at the median and P90 thinks caching is magic — the vast majority of requests arrive in 30 rather than 300 milliseconds. But the P99 tells the truth: the slowest 1% of requests are the long-tail queries that, under high load, quickly turn 300 milliseconds into 3 seconds. Capacity planning becomes more difficult because the actual load depends on which requests are arriving at any given moment — not just how many.
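A toy latency model makes the gap between median and P99 visible. The sketch assumes cache hits around 30 ms and database misses around 300 ms at a 95% hit rate; all numbers are illustrative:

```python
import random
import statistics

random.seed(7)
HIT_RATE = 0.95                      # illustrative; 80-95% is the common range

# Hypothetical latency model: cache hits ~30 ms, database misses ~300 ms.
latencies = [
    random.gauss(30, 5) if random.random() < HIT_RATE else random.gauss(300, 50)
    for _ in range(100_000)
]

def pct(data, p):
    """Crude percentile: the value below which p% of the samples fall."""
    return sorted(data)[int(p / 100 * len(data))]

p50, p90, p99 = pct(latencies, 50), pct(latencies, 90), pct(latencies, 99)
print(f"mean {statistics.mean(latencies):6.1f} ms")
print(f"P50  {p50:6.1f} ms   P90 {p90:6.1f} ms   P99 {p99:6.1f} ms")
```

The median and P90 sit near the 30 ms of a cache hit, the P99 lands in database territory, and the mean falls between the two modes and describes neither.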

Cache Stampede

There is a second effect that can make caching dangerous under load. When a popular cache entry expires — say, for a product currently being heavily promoted — and a hundred requests for that exact entry arrive at the same time, all hundred hit the database. For a brief moment, the load explodes as if there were no cache.

This is the cache stampede, also known as the Thundering Herd. Under low load, this is no problem — the database can handle the brief burst. Under high load, this burst may be enough to push the system past the saturation point. And because response times then increase for everyone, it takes longer to refill the cache — a self-reinforcing feedback loop. At its core, this is a backpressure problem: the cache fails as a protective mechanism, and the system has no second line of defence.

Countermeasures do exist: cache locks (only one thread fills the entry, the others wait), stochastic pre-refresh (the entry is refreshed before it expires), or staggered TTLs. All solvable, but only if you know the problem exists. The price, however, is a massive increase in complexity, an additional source of error, and system behaviour that is difficult to predict.

When the cache becomes a crutch

There is a scenario that teams usually only discover when it is too late: a system that relies on warm caches to handle its load cannot start with empty caches.

At OTTO, this was one of the starting points for the new development in 2010. The legacy system was so heavily reliant on the cache that a restart under production load was impossible — the database would not have survived the traffic without cache protection. A simple restart became a planning problem: when is there enough of a break in traffic to allow the cache to warm up again? ‘Startup capability under load’ was therefore one of the most important requirements for the new system — not because it was an elegant architectural idea, but because the lack of this capability had caused months of pain.

The cold-start problem is the litmus test: if the cache fails or is empty and the system therefore stops working, then the cache is no longer an optimisation — it is infrastructure. And infrastructure must be operated differently from a best-effort speedup. A system that cannot start without a cache is a system whose capacity does not depend on its resources, but on its state.

‘There are only two hard things…’

Phil Karlton is credited with the quote: ‘There are only two hard things in computer science: cache invalidation and naming things.’ Cache invalidation is so difficult because it is a consistency problem.

A cached value is a copy. As soon as the original changes — a price is updated, a product sells out — the copy is incorrect. The question is: how long is it allowed to be incorrect?

With a TTL of five minutes, the shop displays an incorrect price for up to five minutes. With event-driven invalidation (the database change triggers an event that clears the cache), the inconsistency is shorter — but never zero. And the infrastructure for event-driven invalidation comes at its own cost: messaging systems, eventual consistency, additional sources of error.
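A minimal TTL cache makes the staleness window concrete (a sketch; the deliberately tiny TTL is only for the demonstration):

```python
import time

class TTLCache:
    """Minimal TTL cache: a cached value may be stale for up to ttl seconds."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}                    # key -> (value, expiry timestamp)

    def get(self, key, load):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is None or entry[1] <= now:
            value = load(key)              # missing or expired: reload from the source
            self.store[key] = (value, now + self.ttl)
            return value
        return entry[0]

prices = {"sku-1": 100}
cache = TTLCache(ttl_seconds=0.1)          # 100 ms instead of five minutes
cache.get("sku-1", prices.get)             # loads 100 and caches it
prices["sku-1"] = 90                       # the original changes ...
stale = cache.get("sku-1", prices.get)     # ... but the copy still says 100
time.sleep(0.11)
fresh = cache.get("sku-1", prices.get)     # TTL elapsed: reloads 90
print(stale, fresh)
```

Within the TTL, the cache serves the old price with no way of knowing it is wrong; shortening the TTL shrinks the window but raises the miss rate.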

The more caching layers a system has — browser cache, CDN, application cache, database cache — the more places there are where outdated data can get stuck. And the harder it becomes to reliably propagate a change all the way to the user.

At its core, this is an Amdahl problem. Every active invalidation — whether via event, API call or broadcast — is a serial fraction: a coordination effort that grows with every additional cache layer and every additional consumer. TTL-based invalidation avoids this coupling, but at the cost of inconsistency. Event-based invalidation reduces inconsistency, but increases the serial fraction. You cannot have both at the same time — this is the same trade-off described by PACELC.

Performance tool, not a scaling principle

Caching makes a system faster and cheaper to run. That is valuable — no question. But it does not make a system more scalable, because the capacity gained is not based on resources, but on an assumption about user behaviour.

The capacity of a cached system is a function of the hit rate. The hit rate is a function of the traffic pattern. And you do not control the traffic pattern. A bot with the ‘wrong’ access pattern, a cold cache after a deployment, a cache stampede at the wrong time — any of these scenarios can reduce effective capacity to a fraction, without the volume having changed at all.

True scalability means: capacity is a function of resources — double the resources, double the capacity. This requires architectural measures: functional decomposition, data partitioning, decoupling of services. That is exactly what this series is about.

Caching is a powerful first lever — but anyone who builds their system on it is building on sand.


Sources

  • Breslau et al. (1999) — Web Caching and Zipf-like Distributions: Evidence and Implications. IEEE INFOCOM 1999.
  • Zipf (1949) — Human Behavior and the Principle of Least Effort. Addison-Wesley.
