Kingman’s Formula
Photo: Denys Nevozhai @ Unsplash - Unsplash licence
An online shop’s search function was, until recently, a problem child: under load, it crashed because the connection pool to the database was far too small. The team found the error, increased the pool from ten to fifty connections, and passed the load tests. Problem solved.
Two months later.
The shop is growing. There are more products in the catalogue and more customers. Traffic has more than tripled to 350 requests per second, and the growing volume of data is making every database access slightly slower. Database accesses that took 70 milliseconds two months ago now take 100 milliseconds. The connection pool utilisation is around 70%.
For the time being, the team has stopped sending out newsletters, as there would not be enough capacity in reserve to absorb the extra traffic. The dashboard shows an average response time of 120 milliseconds – slightly up, but query optimisation is on the roadmap. The next measures are already planned; an important feature needs to be implemented first.
Then the first customers get in touch to say that the search is ‘incredibly slow’. The team looks at the dashboard: 120 milliseconds, as expected. So it’s the customer’s problem, not the service’s. But then the complaints pile up, and someone gets the idea to take a closer look at the logs.
Why the average is misleading
Using averages in monitoring can be misleading. It takes a broad distribution and turns it into a harmless figure. 120 milliseconds sounds like a service that has just become slightly slower. But the mean smooths everything out, treating the fast cache hits at 30 milliseconds the same as the expensive queries at 400. This is not representative of either case.
Further reading: Percentiles
A percentile describes the threshold below which a certain proportion of all measured values lie. p95 = 350 ms means that 95% of requests are faster, but one in twenty takes longer.
Whilst the mean smooths out outliers, percentiles make them visible, which is why p95 and p99 are the more important values in monitoring: they show what the slowest users experience.
Only the histogram of response times reveals what is happening: most requests are fast, taking 30–50 ms and involving cache hits that never touch the database. However, a long tail extends to 500 milliseconds and beyond, caused by wildcard searches and faceted queries on a dataset that has grown significantly over the past two months. The median is 80 milliseconds, the mean is 120 milliseconds, and the p95 is 350 milliseconds. One in twenty queries is almost three times slower than the mean suggests.
If you only look at the mean, you won’t notice the problem until customers start calling.
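The effect is easy to reproduce. Here is a minimal sketch with synthetic response times – the exact numbers are made up, but the bimodal shape (fast cache hits, a long tail of expensive queries) matches the scenario:

```python
import random
import statistics

random.seed(42)

# Synthetic response times: ~80% fast cache hits (30-50 ms),
# ~20% slow wildcard/facet queries with a long tail (illustrative numbers).
samples = [random.uniform(30, 50) if random.random() < 0.8
           else random.uniform(150, 500)
           for _ in range(10_000)]

samples.sort()
mean = statistics.mean(samples)
median = statistics.median(samples)
p95 = samples[int(0.95 * len(samples))]  # value below which 95% of samples lie

print(f"mean   {mean:6.0f} ms")    # smooths the tail away
print(f"median {median:6.0f} ms")  # the typical request
print(f"p95    {p95:6.0f} ms")     # what the slowest 5% experience
```

The mean lands well above the median and far below the p95 – a single average hides both the typical request and the tail.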
Five per cent sounds like a small figure. However, a user doesn’t make just one request – they search, filter and scroll. With ten page views per session, the probability of experiencing at least one slow request is:
\[1 - 0.95^{10} \approx 40\%\]
‘One in twenty requests’ becomes ‘almost every second user’. No wonder the complaints are piling up.
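As a back-of-the-envelope check, assuming the requests in a session are independent (a simplification):

```python
# Probability of at least one slow request per session,
# assuming requests are independent (a simplification).
p_slow = 0.05     # one in twenty requests is slower than the p95
page_views = 10   # requests per session

p_at_least_one = 1 - (1 - p_slow) ** page_views
print(f"{p_at_least_one:.0%}")  # -> 40%
```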
And it gets worse. Slow requests aren’t just a problem for the users who trigger them; they affect everyone. The search service has a pool of fifty database connections. At low utilisation, most of them are free. A slow request of 400 milliseconds instead of 30 occupies its connection for thirteen times longer, but there are enough others available. The fluctuation is absorbed.
At 70% utilisation, the picture changes: around 35 of the 50 connections are constantly in use. When several slow requests arrive simultaneously – and at 350 requests per second, it is not a question of if but when – they hold their connections thirteen times longer than a cache hit. The fifteen free connections dwindle quickly. New requests can no longer find a free connection and begin to queue. Among these queued requests, there is a high probability of another slow one – and the queue continues to grow. The fluctuations can no longer resolve themselves; they accumulate.
Every cluster of slow requests creates a wave that affects all subsequent requests.
The result is a curve that has already appeared in the last post: long and flat, then steep – the hockey stick. The question is: can this effect be quantified?
The formula behind the hockey stick
In 1961 – the same year that John Little proved his queuing formula – the British mathematician John Kingman formulated an approximation for precisely this phenomenon. Kingman asked: How long is the waiting time in a queue if the service times are not constant?
Digression: Strictly speaking – and mathematicians tend to be strict – Kingman’s formula does not apply to our case because he describes a single service station, whereas we have multiple service stations due to the pool. However, the formula is nevertheless suitable for pools with many connections, as explained in the digression: G/G/1 vs. G/G/c – Kingman and multiple service stations.
His answer breaks down the waiting time into three factors:
\[\text{Waiting time} \approx V \times \frac{\rho}{1 - \rho} \times S\]

- S: the average service time. In our search service, this is 120 milliseconds, 100 of which are spent in the database and the rest on CPU processing.
- ρ/(1−ρ): the utilisation factor (ρ, ‘rho’, is the proportion of capacity in use). At 50% utilisation, this factor is 1. At 80%, it is already 4. At 90%, it is 9. At 95%: 19. This factor explains why the curve does not grow linearly, but explodes.
- V: the variability of service times, a measure of how much they deviate from their mean. V = 0 means all requests take exactly the same amount of time: no fluctuations, no waiting beyond the service time itself. V = 1 corresponds to exponentially distributed times, the classic worst case. For the search service, with its mix of cache hits and wildcard searches, V is around 0.5.
The total response time, from request to response, is the sum of the service time and the waiting time:
\[\begin{align}\text{Response time} &= S + \text{Waiting time} \\ &= S \times \left(1 + V \times \frac{\rho}{1 - \rho}\right)\end{align}\]
Or as a multiple of the service time:
\[\frac{\text{Response time}}{S} = 1 + V \times \frac{\rho}{1 - \rho}\]
The factor of 1 represents the pure service time, i.e. the time every request takes even when the queue is empty. Anything above that comes from waiting.
Let’s work this out for the search service with variability V = 0.5 and a service time of S = 120 ms.
| Utilisation | ρ/(1−ρ) | Waiting time | Response time |
|---|---|---|---|
| 50% | 1 | 60 ms | 180 ms |
| 80% | 4 | 240 ms | 360 ms |
| 90% | 9 | 540 ms | 660 ms |
| 95% | 19 | 1,140 ms | 1,260 ms |
At 50% load, it is unremarkable. At 80%, the response time has tripled. At 95%, it takes over a second for a service that can process individual requests in 120 milliseconds. The waiting time is almost ten times longer than the service time.
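The table translates directly into a few lines of code. A sketch – the function name is mine, the numbers are from the scenario:

```python
def kingman_response_time(s: float, rho: float, v: float) -> float:
    """Kingman's approximation: response time = S * (1 + V * rho / (1 - rho)).

    s   -- average service time in milliseconds
    rho -- utilisation, 0 <= rho < 1
    v   -- variability of service times (0 = constant, 1 = exponential)
    """
    if not 0 <= rho < 1:
        raise ValueError("utilisation must lie in [0, 1)")
    return s * (1 + v * rho / (1 - rho))

# The search service: S = 120 ms, V = 0.5
for rho in (0.5, 0.8, 0.9, 0.95):
    print(f"{rho:.0%} utilisation: {kingman_response_time(120, rho, 0.5):.0f} ms")
```

Note how everything is harmless until ρ approaches 1, where the 1/(1−ρ) term takes over.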
If someone had asked the team two months ago whether they had a variability problem, the answer would have been: ‘A what?’
It’s not a sudden collapse, but a gradual increase that becomes noticeable from 80% utilisation and uncontrollable from 90%.
The table alone does not show this, but variability determines how steep the curve becomes. At V = 0.1 — a service with highly uniform service times — the response time at 95% utilisation increases to almost three times the service time. Unpleasant, but manageable. At V = 0.5, however, it is over ten times the service time. The difference between a homogeneous and a heterogeneous workload is not gradual; it is the difference between a stable service and one that collapses under load.
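A quick check of that claim with the same formula (numbers from the scenario):

```python
s, rho = 120, 0.95   # service time in ms, utilisation

for v in (0.1, 0.5):  # homogeneous vs. heterogeneous workload
    response = s * (1 + v * rho / (1 - rho))
    print(f"V = {v}: {response:.0f} ms ({response / s:.1f}x the service time)")
```

At V = 0.1 this gives roughly 350 ms (2.9× the service time); at V = 0.5, about 1,260 ms (10.5×) – the gap the text describes.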
Phantom traffic jams
Anyone who has ever been stuck in a traffic jam on a motorway, despite there being no roadworks or accidents, has experienced Kingman without realising it.
Traffic researcher Martin Treiber describes this phenomenon in Traffic Flow Dynamics: on a busy motorway, even a minor disruption, such as a driver braking briefly because they have dropped their mobile phone, can trigger a chain reaction. The driver behind brakes harder, and so does the next driver. The wave of braking propagates backwards, and kilometres further on, cars come to a standstill, never knowing what triggered it. Phantom traffic jams.
The parallel is no coincidence. Both traffic flow and queues are subject to the same mathematics: high volume plus fluctuations equals disproportionate delays.
Therefore, phantom traffic jams are rarer on motorways with speed limits than on motorways without. This is not because the speed limit increases throughput, but because it reduces speed variance. When everyone drives at a similar speed, the fluctuations become smaller and the braking waves weaker. In Kingman’s terms, a speed limit is a reduction in V; in the view of many motorists, it is a restriction of freedom, but that is another matter.
But back to the search service. It is not throughput that is the problem – according to Little’s Law, that is sufficient. The problem lies in the variation in service times. The combination of 30-millisecond cache hits and 400-millisecond wildcard searches creates phantom bottlenecks in the queue.
What to do?
Kingman’s formula has three factors. Each can be addressed.
Reduce variability (reduce V)
This is the most effective lever, yet it is also the most frequently overlooked. When service times vary less, waiting times decrease disproportionately. For the search service, there are several approaches:
- Split workloads by distributing cache-supported standard searches and complex faceted queries across different service instances, or even different services. Each individual service then has a more homogeneous load distribution, resulting in a lower V and more stable response times. In the language of scalability, this is functional decomposition (Y-axis).
- Timeouts as variability limiters. A timeout of 500 milliseconds trims the long tail of the distribution. Not every request is answered – but those that are no longer spend minutes queuing behind a monster query. Combined with a circuit breaker, this protects the service from cascading delays. Unfortunately, this isn’t always an option.
- Expand caching selectively. Cache not just the top 100 but the top 5,000 search queries. The long tail becomes shorter, the distribution narrower. However, caching has its own limitations.
Digression: Zipf’s Law and the Limits of Caching – why even a large cache eventually stops helping, and what Zipf’s Law has to do with it.
Reduce service time (reduce S)
Little’s Law drives this lever: faster queries, better indexes and fewer database accesses per request.
Every millisecond saved has a dual effect: one direct effect on service time and one multiplicative effect on waiting time. In our scenario, if the team reduces the average service time from 120 to 80 milliseconds, for example through better indexes on the growing dataset, the response time at 80% utilisation drops from 360 to 240 ms.
This is a reduction of one third, even though the service time has only decreased by 40 ms, because Kingman amplifies the effect of optimisations.
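The arithmetic behind those numbers, as a sketch (values from the scenario):

```python
v, rho = 0.5, 0.8   # variability and utilisation from the scenario

for s in (120, 80):  # average service time in ms, before and after better indexes
    response = s * (1 + v * rho / (1 - rho))
    print(f"S = {s:3d} ms -> response time {response:.0f} ms")
```

The 40 ms saved in service time takes another 80 ms of waiting time with it.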
Reduce utilisation (reduce ρ)
The most obvious lever is to provide more capacity than is mathematically necessary, but this is also the most expensive option. Kingman provides the rationale for not taking 95% utilisation as the target value, but rather staying deliberately below 80%. The difference between 80% and 95% utilisation with V = 0.5 is a factor of 3.5 in response time. Capacity buffers are not a luxury; they are the cheapest insurance against the hockey stick.
To determine the optimal capacity, both formulas from this series are required. Little’s law determines the size of the pool, i.e. how many connections are needed for this throughput. Kingman’s formula tells you how much additional buffer is required to prevent response times from increasing significantly under load.
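A sizing sketch that combines both formulas – the 70% target utilisation is my assumption, the other numbers come from the scenario, and (as the digression above notes) applying Kingman to a pool is itself an approximation:

```python
# Little's law sizes the working set, Kingman prices the headroom.
arrival_rate = 350    # requests per second
db_time_ms = 100      # milliseconds each request holds a connection

# Little: L = lambda * W -> connections busy on average
in_use = arrival_rate * db_time_ms / 1000        # = 35

target_rho = 0.70                                # deliberate buffer below the hockey stick
pool_size = round(in_use / target_rho)           # ~= 50 connections

# Kingman at the target utilisation: S = 120 ms, V = 0.5
s, v = 120, 0.5
response = s * (1 + v * target_rho / (1 - target_rho))
print(f"pool: {pool_size} connections, expected response ~{response:.0f} ms")
```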
In monitoring, Kingman reveals itself before it causes problems. If a system is far enough from its saturation point, response times will remain roughly stable as the load increases, or even decrease as caches warm up. If, on the other hand, response times rise with the load over the course of the day, this is a warning sign that the system is operating in a range where Kingman strikes. You don’t need to know the utilisation rate for this; the trend in response times under increasing load reveals how close you are to saturation.
Digression: Backpressure – When Systems Learn to Say No. If reducing V or reducing S does not help quickly enough, the only option left is to teach the system to reject requests. Motorway slip roads with traffic lights work on the same principle.
But all three levers – V, S and ρ – describe the behaviour of a single instance, a single service. The obvious response to a system at its limit is to increase parallelism. More threads, more cores and, eventually, more instances. Intuitively, one might expect that doubling the resources would double the throughput. Reality is somewhat less cooperative, however. Why this is the case, which serial dependencies eat into the theoretically possible gain and where the hard mathematical limit lies, is the subject of the next part of this series on Amdahl’s Law.
Sources
- Kingman (1961) – The single server queue in heavy traffic. Mathematical Proceedings of the Cambridge Philosophical Society, 57(4).
- Hopp & Spearman (2011) – Factory Physics. 3rd ed. Waveland Press. Chap. 8 (Variability, VUT equation).
- Treiber & Kesting (2013) – Traffic Flow Dynamics: Data, Models and Simulation. Springer. Chapters 10–12.
- Little (1961) – A Proof for the Queuing Formula: L = λW. Operations Research, 9(3), 383–387.