The Capacity of a Single Instance: Little's Law in Practice

On a Tuesday afternoon, the search service is running smoothly: 100 requests per second, response times under 100 milliseconds. Then the marketing team sends the weekly newsletter. Within three minutes, traffic climbs to 500 requests per second. Response times explode: first to 800 milliseconds, then to several seconds.

The load balancer marks instances as unreachable, first one, then all: users see error pages. A look at the logs: no exceptions, no slow queries, no network problems. But timeouts. Lots of timeouts.

This reveals a scalability limit — within a single instance. Before thinking about additional servers, a more fundamental question arises: what capacity does this one instance actually have? And why did it collapse at a point that nobody anticipated?

The answer lies in a 1961 tool from the field of queuing theory.

Three quantities that describe every system

Every service that receives and answers requests can be understood as a queuing system and characterised by three values:

Throughput $\lambda$ (lambda): The average number of requests completed per second. When a system collapses under load, dropping or queuing requests, it processes fewer requests than arrive. Throughput describes what actually gets through, not what is requested.

Response time W: The average time a request spends in the system, from arrival to completed response. For the search service in our example, W (for waiting) was 90 milliseconds, of which about 70 milliseconds were spent on database access.

Concurrency L: The average number of requests in the system at any given moment. If this value rises — say because more requests per second suddenly arrive — pools and queues fill up, and the system can collapse.

These three quantities are not independent. They are connected by a formula that John Little proved in 1961.

Little’s Law: one formula, three quantities

The intuition behind Little’s Law can be understood using the example of a supermarket checkout — a typical queuing system. Customers arrive, wait in line, pay, and leave the shop. The throughput $\lambda$ is limited by the cashier, who has to scan every item and take payment. How many customers are at the checkout on average?

The answer depends on two things: how many customers are served per minute and how long each spends at the checkout on average. At two customers per minute and an average total time of three minutes, there are on average six customers at the checkout at any given moment.

Little proved in 1961 that exactly this relationship holds for every stable queuing system, regardless of how arrivals are distributed, how many checkouts are open, or what priority rules apply:

\[L = \lambda \times W\] \[\text{Concurrent requests} = \text{Throughput} \times \text{Response time}\]

Staggered request bars: At any point in time, L requests are simultaneously in the system

The term “stable” means in this context that the system is able to keep up with the arrival rate — that it is not overloaded. If that is not the case, more customers wait from minute to minute, response times rise, and the equation no longer holds. Time to open another checkout.

Applying Little’s Law to the search service: at 100 requests per second and 90 ms response time, 9 requests are in the system simultaneously:

\[L = 100 \text{ req/s} \times 0.09 \text{ s} = 9 \text{ req}\]

Nine concurrent requests. The thread pool with 200 slots is far from its limit. And at the newsletter peak?

\[L = 500 \text{ req/s} \times 0.09 \text{ s} = 45 \text{ req}\]

45 concurrent requests. With a pool of 200 threads, that sounds harmless — and it is, because the thread pool wasn’t the bottleneck at all: the problem was one level deeper.
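The two calculations above can be reproduced in a few lines of Java; the numbers are the article’s example values, not measurements:

```java
/** Little's Law, L = lambda * W, applied to the article's example numbers. */
public class LittlesLaw {

    /** Average number of requests in the system: L = lambda * W. */
    static double concurrency(double throughputPerSecond, double responseTimeSeconds) {
        return throughputPerSecond * responseTimeSeconds;
    }

    public static void main(String[] args) {
        // Normal load: 100 req/s at 90 ms -> 9 concurrent requests
        System.out.println(concurrency(100, 0.09));
        // Newsletter peak: 500 req/s at 90 ms -> 45 concurrent requests
        System.out.println(concurrency(500, 0.09));
    }
}
```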

One service, multiple queuing systems

A web service is a chain of queuing systems. Every request passes through multiple stations, and each one can become a bottleneck: Request → Thread Pool → Connection Pool → Database.

A service as a chain of queuing systems: Users, Thread Pool, Service, Connection Pool, Database

Little’s Law can be applied to each of these stations individually — regardless of whether the station has one or fifty parallel servers (why this works is explained in the digression on G/G/1 vs. G/G/c). Rearranging the formula to $\lambda_{\max} = L / W$ (maximum throughput = pool size divided by response time), you can identify the bottleneck:

  • For the thread pool: 200 threads divided by 0.09 s response time = roughly 2,200 requests per second. Far beyond demand.
  • For the connection pool: 10 connections divided by 0.07 s response time (the database claims the lion’s share of the 90 milliseconds) gives roughly 143 requests per second.
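The same rearrangement can be expressed as a small helper; the pool sizes and hold times are the example’s values, not general recommendations:

```java
/** Bottleneck check: rearranging Little's Law to lambda_max = L / W per station. */
public class BottleneckCheck {

    /** Maximum sustainable throughput of a pool: pool size divided by time each slot is held. */
    static double maxThroughput(int poolSize, double holdTimeSeconds) {
        return poolSize / holdTimeSeconds;
    }

    public static void main(String[] args) {
        // Thread pool: 200 threads, each held ~90 ms per request -> roughly 2,200 req/s
        System.out.printf("thread pool:     %.0f req/s%n", maxThroughput(200, 0.09));
        // Connection pool: 10 connections, each held ~70 ms per request -> roughly 143 req/s
        System.out.printf("connection pool: %.0f req/s%n", maxThroughput(10, 0.07));
    }
}
```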

There was the bottleneck. The connection pool was configured to 10 connections, the framework default. A number nobody had questioned because it was sufficient at 100 requests per second. At the newsletter peak, the service needed 35 concurrent database connections ($L = 500 \text{ req/s} \times 0.07 \text{ s}$), so the connection pool was undersized.

Once all connections were taken, threads blocked waiting for a free connection — and in doing so held their place in the thread pool, until that too was exhausted and requests were rejected. The average response time grew in parallel from 90 milliseconds to several seconds.

The system became unstable and was no longer able to serve all incoming requests. A single miscalibrated number — and the service collapsed not at ten thousand concurrent users, but at five times the normal traffic.

The solution — and its limits

The obvious question: why not just increase the connection pool? For a single instance, that is exactly the right answer. 35 connections at peak, plus 30% headroom: a pool of 50 connections solves the problem.

Why not 200 or 500 connections then? Because the database cannot serve an unlimited number of connections simultaneously — beyond a relatively low threshold, throughput drops rather than rises. As always, there is an optimal range: large enough that threads don’t have to wait, small enough that the database doesn’t slow down under coordination overhead.

Deep dive: Pool sizing in practice

Brett Wooldridge, the author of the connection pool framework HikariCP, argues in a widely cited article that the optimal pool is much smaller than you’d intuitively expect: often close to 2 × number of CPU cores + number of disks. The reason: more connections create more context switches and lock contention on the database. The Tomcat default of 200 threads was never the problem in our scenario; the HikariCP default of 10 connections was.
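Wooldridge’s rule of thumb can be sketched as follows; the method name and the example hardware figures are illustrative, not part of HikariCP’s API:

```java
/** Sketch of the pool-sizing rule of thumb: connections ~ 2 * cores + disks. */
public class PoolSizing {

    static int recommendedPoolSize(int cpuCores, int disks) {
        return 2 * cpuCores + disks;
    }

    public static void main(String[] args) {
        // A hypothetical 8-core machine with one disk -> a pool of 17 connections,
        // far below the Tomcat thread default of 200.
        System.out.println(recommendedPoolSize(8, 1));
    }
}
```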

The real architectural limit only becomes visible when multiple instances share the same database, each bringing its own connection pool. Why it is almost always the database that stops scaling first is the topic of Part 4 of this series.

What response time contains — and why thread pools get large

The response time W contains everything: CPU computation, waiting for the database, external calls, and also waiting for a free connection from the pool. In most web services, the waiting dominates so strongly that actual computation time becomes practically irrelevant. For the search service: 70 out of 90 milliseconds for the database. 78% of the time, the thread is blocked and waiting. In practice, it can even be 90% or more. This makes thread pools large, often surprisingly large: the threads are idle most of the time yet still occupy their slots.

“Thread pool” is a flattering name in that regard. “Thread waiting room” would be more honest.

Breakdown of a request: 20ms CPU work vs. 70ms waiting for the database

Deep dive: Reactive frameworks and virtual threads

Reactive frameworks like Spring WebFlux or Vert.x and — since Java 21 — virtual threads address this problem directly: they decouple processing capacity from the number of blocked OS threads. This changes the optimisation approach, but not the fundamental statement of Little’s Law: the response time remains the response time, and $L = \lambda \times W$ still holds. And it changes nothing about the connection pool: whether an OS thread or a virtual thread is waiting for a database connection, the resource is needed either way.

Little’s Law gives the minimum pool size for steady-state operation. The startup phase after a deployment (filling the connection pool, warming caches, etc.) is not included. Those buffers come on top.

The saturation point

Little’s Law assumes that the system is stable: the processing rate can keep up with the arrival rate. In the stable range, concurrency grows proportionally with throughput. Everything is in equilibrium.

Once the load exceeds capacity, this behaviour reverses. Using the checkout example: if more customers arrive per minute than can be served, the queue grows, and with it the waiting time — indefinitely.

This transition is called the saturation point.
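The transition can be made visible with a toy simulation: below capacity the queue drains every tick, above it the backlog grows without bound. The rates are made-up example values:

```java
/** Toy discrete-time queue: the backlog grows once arrivals exceed service capacity. */
public class SaturationDemo {

    /** Backlog after n ticks when 'arrivals' jobs arrive and up to 'capacity' are served per tick. */
    static long backlogAfter(int ticks, int arrivals, int capacity) {
        long backlog = 0;
        for (int t = 0; t < ticks; t++) {
            backlog += arrivals;                      // new work arrives
            backlog -= Math.min(backlog, capacity);   // serve as much as capacity allows
        }
        return backlog;
    }

    public static void main(String[] args) {
        // Below capacity: the queue drains every tick
        System.out.println(backlogAfter(60, 100, 143));  // prints 0
        // Above capacity: the backlog grows linearly, "indefinitely"
        System.out.println(backlogAfter(60, 500, 143));  // 60 * (500 - 143)
    }
}
```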

Digression: Backpressure — when systems learn to say no

A system at the saturation point has two options: accept all requests and slow down for everyone — or deliberately say no, so the rest stays fast.

Why the system still doesn’t run smoothly

Anyone expecting the system to run smoothly right up to the capacity limit and then fail with a clean break will be disappointed in practice. A system runs stably at 70% utilisation, at 80% response times begin to rise, at 90% it breaks down. Flat for a long time, then steep — like a hockey stick.

Hockey-stick curve: Response time stays flat for a long time, then rises steeply

Why does a system that, according to Little’s Law, still has capacity, tip over? The short answer: not every request takes the same amount of time. The role this variability plays is the topic of the next post.

From formula to practice

Little’s Law provides the calibration for our pools: how large do they need to be configured so that the expected traffic can be served?

The formula also works in the other direction. I once saw a team running twenty instances for what was essentially a trivial service: it did little more than read data from a database and return it as JSON. A quick back-of-the-envelope calculation with Little’s Law would have shown that three instances were sufficient, and that the performance problem the team was trying to solve with twenty instances originated somewhere else entirely.
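Such a back-of-the-envelope calculation looks roughly like this; the traffic and capacity figures are hypothetical placeholders, not the team’s actual numbers:

```java
/** Rough instance count: how many instances are needed to sustain a target throughput. */
public class InstanceEstimate {

    /** Instances needed so the target load fits within per-instance capacity, with headroom. */
    static int instancesNeeded(double targetReqPerSec, double perInstanceCapacity, double headroom) {
        return (int) Math.ceil(targetReqPerSec * (1 + headroom) / perInstanceCapacity);
    }

    public static void main(String[] args) {
        // Hypothetical: 300 req/s target, ~140 req/s per instance, 30% headroom
        System.out.println(instancesNeeded(300, 140, 0.30));  // 3 instances, not 20
    }
}
```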

Little’s Law is therefore not just a planning tool, but also a wonderful diagnostic tool for unnecessary resource consumption and inappropriate scaling.

The formula is only as good as its inputs, though. The response time is not a fixed value: it depends on database utilisation and cache fill levels. After a deployment, when the caches are cold, W looks different than two hours later. And peak throughput? Newsletter peaks can be predicted; viral social media posts cannot.

That’s why you can’t get around load tests. Little’s Law tells you where to start. The load test shows what happens in practice. And if the measured values deviate significantly from the calculation, that is not a reason to blindly crank up the pool sizes. It is a signal that you’ve overlooked something important: a hidden dependency, a lock in the database, a wait time that was invisible under low load.

The limits of a single instance — and what comes next

Little’s Law answers the first question of scalability: how much can a single instance handle? Within an instance, there are two levers:

  1. Reduce service time — faster code, fewer blocking calls, caching frequently used results, algorithm optimisation.

  2. Increase capacity — enlarge pools, more threads, more database connections.

Both levers have limits. The response time has a minimum determined by the implementation and the database. Pools cannot grow indefinitely because the underlying resource — the database — doesn’t scale with them.

Once these levers are exhausted, the next step follows: a second instance, a load balancer distributing the load. This increases throughput, but not without limits. All instances share the database. Each brings its own connection pool, and the database must serve them all. Beyond a certain point, the application is no longer the bottleneck — the layer beneath is. More on this in Parts 4 and 6 of this series.

Three instances with 10 connections each converge on one database: 30 connections in total

But there is a subtler limit still: serial fractions in the code — locks, synchronised blocks, sequential dependencies — cap the achievable parallelism, no matter how many cores or instances are available. The more work is done in parallel, the more these serial fractions make themselves felt. We’ll come to that in Part 3 on Amdahl’s Law.

