Flow control in a distributed system

As a replicated database, the core of our technology is multicasting update operations to a number of servers efficiently.

Most of the difficulty is making sure they all get there, even though networks may come and go, servers come and go, and so on. However, that’s not what I’m going to talk about today!

We’re going to look at the opposite problem – stopping too many updates coming through the system at once.

The Problem

Each server runs a daemon which listens for these multicasts, and performs the updates contained therein. However, there are only small fixed-size buffers for pending multicasts; while this daemon is busy applying an update to disk (which can block on locks and disk bandwidth and so on), more multicasts might arrive, so it’s very important we satisfy them promptly.

The Solution

The first step, of course, is to have a buffer. The daemon is split into two threads, one of which listens for multicasts and puts them into the buffer; the other of which applies them. The buffer grows when updates come streaming in faster than they can be applied, and shrinks when they are coming in slower than the rate at which they can be applied.

That means that a spurt of updates won’t cause us to miss a message; they’ll just go into the queue and be drained out in due course. But it won’t protect us from a sustained overload. After all, how can you benchmark a database with our architecture, if you can emit updates as fast as you want, with no regard for the actual ability to write to disk? Clearly, we needed a flow control mechanism.

Now, flow control seems like a simple idea. Figure out if things are arriving too quickly at the far end, and then send back some kind of message to ask the source to slow down. But it’s easy to get it disastrously wrong, without the problem being obvious. As usual, it all comes down to maths.

Pages: 1 2 3 4 5 6

Leave a Reply