More Measuring
So the flow control system really takes two factors into account – it controls the rate that updates are generated to match the rate at which they can currently be processed by the slowest server in the system, and it makes sure that no server’s buffer overflows too outrageously. But we can add more – we can make each server report the worst of several metrics, as long as they can be scaled into a percentage. As well as accounting for memory usage, for example, we also have a fixed-sized data structure that’s part of our work queue: it’s a binary heap used to sort pending updates by timestamp. The capacity of the heap is quite generous, as it doesn’t use much memory per potential entry, but nonetheless, if it overlfows, we have to go into an emergency mode where we block on receiving more updates from the network until we’ve written some to disk, which may result in us losing a message and thereby losing consistency and dropping out of the system. So, again, 100% utilisation of the heap is a soft limit, but we throw that in as well; I find it hard to imagine a case that would threaten it (a large number of tiny but time-consuming updates, so the heap fills up without using much actual memory in the buffer?), but it’s nice to put a mechanism to restrict it in, just in case; it gives peace of mind.
Also, one consequence of a large backlog of work in the buffer is that the state on disk falls further and further behind the current state of the system. As I have mentioned before, we use a clever trick to hide the replication delay. However, the primary server that holds the new state of a record can’t be counted on to keep it forever, so we need to get the local persistent state updated within a reasonable timeframe. Indeed, to avoid certain pathological cases, we set a timeout of sixty seconds on the primary server copies of records. So we’d really like all updates to be flushed to disk, where possible, within those sixty seconds; since the flow control mechanism can only really provide soft limits, let’s halve that to thirty seconds. For system profiling, we already store wall-time-stamps (as Unix seconds-since-epoch values) in updates as they pass through the various stages of the system – issued by a client (and possibly queued locally to aggregate packets), multicast out, received by the server, transferred from the initial receive queue to the update buffer, and then performed. So when we perform an update, we can easily calculate the total time delay from initial issue to final writing to disk on this server. Of course, this will vary tremendously between updates, as each follows their own path through the network, so it needs some smoothing; we use an exponentially-decaying moving average for this, like a Unix load average. That average, scaled to be a percentage of half the primary-server timeout, produces yet another ‘server capacity fraction’ that can be introduced into the flow control algorithm; so if a server becomes overloaded in a way that makes it lag behind in time without the buffer growing much, we’ll catch that case and slow things down.
