However, that won’t work. Other things in the system will be fighting for disk bandwidth, and the daemon spends a little time on tasks other than actually writing to disk, so we can’t quite keep the disk at 100% anyway. Even when we’re overloaded, with updates backing up in the buffer, the measured disk load will stay below 100%… so our flow control would never actually throttle us at all!
Ah, but what of that buffer? Surely we can measure the size of it!
But the buffer size isn’t a measure of load. It’s a measure of overload. That’s more what we want to measure anyway, which is good; but there’s a subtle issue. What’s the correct buffer length? At what point do we throttle more, and at what point do we throttle less?
What matters is the rate of change of the buffer length. If it’s growing, we need to throttle more. If it’s shrinking, we can take on more load. But there are other considerations too. We only have a fixed amount of physical memory, and we don’t want the buffer to outgrow it, or we’ll start to swap (and then everything falls apart fast, as swapping takes up disk bandwidth, making us even more overloaded). Also, the longer the buffer is, the more updates we need to request and replay if the server is forced to reboot without warning, which makes recovery take longer.
So what we’ve done is to choose a soft limit for the buffer’s size. By default, this is one sixteenth of the server’s physical memory, but we let that be overridden; it’s impossible to choose a default that’s right for everyone. We then express the buffer size as a percentage of the soft limit, to normalise for different servers having different soft limits, and that’s what the server broadcasts as its load level.
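To make the arithmetic concrete, here’s a minimal sketch of that calculation. The function names are made up for illustration and aren’t taken from the real daemon; the only things that come from the description above are the one-sixteenth default and the conversion to a percentage of the soft limit.

```python
def default_soft_limit(physical_memory_bytes: int) -> int:
    """Default soft limit: one sixteenth of physical memory (overridable)."""
    return physical_memory_bytes // 16


def load_level(buffer_bytes: int, soft_limit_bytes: int) -> float:
    """Buffer usage as a percentage of the soft limit.

    The result can exceed 100, because the limit is only a soft one.
    """
    return 100.0 * buffer_bytes / soft_limit_bytes


# Hypothetical server with 16 GiB of RAM and 512 MiB of buffered updates.
soft_limit = default_soft_limit(16 * 2**30)      # 1 GiB
print(load_level(512 * 2**20, soft_limit))       # 50.0
```

Because the value is normalised, a small server and a large one reporting the same load level are equally far into their own safety margins.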
It is just a soft limit – a sudden spurt of work can cause servers to go over this limit before the flow control mechanism can kick in. But that’s OK – we have a good safety margin before we need to start swapping, and even then, going over the limit causes the source of the load to be throttled back very hard to let the servers catch up; as the servers drain away the backlog the clients slowly speed up until an equilibrium is reached.
The maths happens at the other end. Our throttling mechanism is simple: we introduce a small delay before sending each update, by sleeping. We take the worst load level in the entire system, as mentioned before, and convert it to a delay using a monotonically increasing, superlinear function. The delay stays at zero until the load reaches a few percent, then rises smoothly at an ever-increasing pace until it is capped at a maximum of one second, at around 95% load. Since we normally deal in thousands of updates per second, no server should ever get that far behind unless something has gone wrong; but if one does, every client is limited to at most one update per second.
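Here’s a rough sketch of what such a curve could look like. To be clear, the exact function isn’t spelled out here, so the quadratic shape, the 5% threshold, and every name below are assumptions chosen purely for illustration; only the overall behaviour (zero at low load, rising faster and faster, capped at one second around 95%) comes from the description above.

```python
import time


def throttle_delay(load_percent, threshold=5.0, cap_at=95.0, max_delay=1.0):
    """Map a load level (percent) to a sleep time in seconds.

    Assumed shape: zero up to `threshold`, quadratic in between,
    capped at `max_delay` from `cap_at` upwards.
    """
    if load_percent <= threshold:
        return 0.0
    if load_percent >= cap_at:
        return max_delay
    fraction = (load_percent - threshold) / (cap_at - threshold)
    return max_delay * fraction ** 2


def worst_load_delay(server_loads):
    """Delay implied by the most heavily loaded server in the system."""
    return throttle_delay(max(server_loads, default=0.0))


def send_throttled(send, update, server_loads):
    """Sleep for the current delay, then hand the update to `send`.

    `send` is a hypothetical callback standing in for the real transport.
    """
    time.sleep(worst_load_delay(server_loads))
    send(update)
```

With these assumed parameters, a worst-case load of 50% gives a delay of a quarter of a second, while anything at or beyond 95% gives the full second. Any curve with the same general shape would do; the quadratic is just the simplest one that’s flat at the bottom and steep at the top.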
If the load level is rising, then the throttle level will rise to combat it; if it is falling, then the throttle level will fall to allow more work in. It will therefore settle out at some level, where the throttle level correctly controls the load to keep the buffer usage roughly constant. As the nature of the workload changes, that correct throttle level may change, but it will still settle down. And because of the rising nature of the curve, the buffer usage will be kept as low as possible, and not allowed above the soft limit except when a large burst of traffic arrives before the flow control mechanism can damp it. That’s why we call it a soft limit.
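If you’d like to watch that settling happen, here’s a toy simulation. Everything in it is invented for the sake of the example: the quadratic delay curve, the ten clients, the drain rate, and the idea that each client needs a fixed `work_time` per update on top of its sleep. It’s a sketch of the feedback loop, not a model of the real system.

```python
def throttle_delay(load_percent, threshold=5.0, cap_at=95.0, max_delay=1.0):
    """Assumed quadratic delay curve, in seconds (illustrative only)."""
    fraction = (load_percent - threshold) / (cap_at - threshold)
    fraction = min(max(fraction, 0.0), 1.0)
    return max_delay * fraction ** 2


def simulate(clients=10, work_time=0.001, drain_rate=5000.0,
             soft_limit=100_000, steps=1000, dt=0.01):
    """Return the load level (percent) after each time step.

    Each client sends one update every `work_time + delay` seconds;
    the server drains `drain_rate` updates per second from its buffer.
    """
    buffered = 0.0
    history = []
    for _ in range(steps):
        load = 100.0 * buffered / soft_limit
        delay = throttle_delay(load)
        arrival_rate = clients / (work_time + delay)
        buffered = max(0.0, buffered + (arrival_rate - drain_rate) * dt)
        history.append(100.0 * buffered / soft_limit)
    return history


history = simulate()
print(round(history[0], 2), round(history[-1], 2))
```

With these made-up numbers the clients initially offer twice what the server can drain, so the buffer grows steadily until the load passes the threshold; the delay then kicks in and the load should level off a little under 8% of the soft limit, which is the point where the throttled arrival rate matches the drain rate. Change `drain_rate` or `clients` and it settles somewhere else, but it still settles.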
