Clearly, the answer is to try and reduce the load-averaging period as best you can, and to try and reduce the feedback delay, and to have a gentler throttling mechanism. Certainly, just having a boolean throttle is a bad idea; it’s better if you can feed back some number to the load source to tell it how much to throttle, and increase the number if the load level is too high, and decrease it if it isn’t too high; that way, you will smoothly find the correct throttle level to obtain the desired load level. But even then, you must be careful of delays in the system, or it will still oscillate.
Measuring the load
How do you measure the load level of a distributed system? In our product, the updates are multicast to all the servers. Each server will have their own load level. They might all be the same – as they’re being sent the same work. Or they may differ – as they may not all be using identical hardware, and there may be other processes afoot on the servers. As a replicated database, we have a choice between running as fast as the slowest server can go (so all the servers get to keep up) or choosing some ‘average’ load level and accepting that slower servers might get overloaded, and then booted from the system for failing to keep up to date; we decided that booting slow servers out is a bad idea, as at best it will annoy people, and at worst, it will cause all the load to be concentrated onto fewer servers, perhaps causing cascading failures.
So our approach is simple; every server multicasts its load level periodically, and tracks the load level multicasts of others. The overall load level of the system is considered to be the maximum of all those observed.
The local load level is being computed on the fly, as the load changes (more on what we count as the ‘load level’ later). We multicast it once per second if it’s changed since the last multicast, but we don’t want to introduce a second’s lag into the feedback loop – so we also send it as soon as it’s changed if it’s larger than the current largest load level seen by any other server. This means that if we’re the most loaded server in the system, we tell everyone about it promptly. If we’re not the most loaded server, then our load level changes are only of interest to other servers if we become the most loaded server in future. The once-a-second periodic updates are really just to catch cases the fast-path algorithm misses, and to help new servers quickly find out the state of the system.
But what do we actually mean by ‘measuring the load level’? Ideally, we want the disk on each server to be 100% utilised. If we measure the percentage utilisation, we can reduce the throttle level if it’s below 100%, and increase it if it’s at 100% – but it can never get above 100%, so we’ll end up throttling the system back as soon as it reaches 100%, seeing it drop below 100%, and then letting it go a bit faster again. We’ll run at slightly less than 100% load, but we can’t really tell if we are running at capacity or over capacity unless we step back a little bit to make sure.
