Resilience

One of GenieDB’s design goals is resilience. Databases are one of the most critical points in application deployments. You can add more Web servers, and more load balancers, and more caching proxies, and just spread the incoming requests between them. If they break, you can just replace them, as they have no state beyond caches, and identical configurations.

But few applications can run for long with their database down, and since the database is updated as well as read from, you can’t normally just run lots of copies of it, as they all need updating.

Which, of course, is why we decided to write a replicated database!

Sharding has its place, but you still need some replication somewhere if you want a fault-tolerant system; your data has to be stored in more than one place if you want to keep it despite disk failures, and to be able to get at it despite network failures.

Our product provides full replication: every node has a complete copy of the entire database – and, most importantly, all the updates are idempotent, meaning they can be applied more than once (and you still get the same effect); and they are timestamped, so no matter what order two updates to the same record are applied, the system can tell which of the two is the most recent.

The effect of this is that we can take a bunch of updates, and give them to each server in a totally different order. This has a number of benefits:

  1. We can easily cope with the fact that messages take different amounts of time to arrive from different servers due to network topology
  2. We don’t need to actually put the updates in order before applying them to disk. A global order exists – on the timestamps – but we ‘sort’ the updates into order on a per-record basis. This means we need no complex buffering algorithm, nor a single central point of synchronisation that becomes a bottleneck and a single point of failure (master/slave databases, I’m looking at you).
  3. We can deal elegantly with network partitions; servers on both sides of the break can continue to operate, and when the split heals, they can send each other the updates they missed. As soon as everyone’s seen every update, we’re back in synch.
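To make the order-independence concrete, here is a minimal sketch of a last-writer-wins store in Python. It is an illustration only, not GenieDB's actual code: the `apply_update` function, the `(key, timestamp, value)` update shape and the integer timestamps are all assumptions, and it assumes no two updates to the same record share a timestamp.

```python
import itertools

def apply_update(store, update):
    """Apply one (key, timestamp, value) update, keeping the newest per record.

    Hypothetical helper: assumes timestamps are unique per record.
    """
    key, ts, value = update
    current = store.get(key)
    if current is None or ts > current[0]:
        store[key] = (ts, value)
    return store

updates = [
    ("user:1", 10, "alice"),
    ("user:1", 12, "alice2"),
    ("user:2", 11, "bob"),
]

# Deliver the updates in every possible order, with one update repeated,
# and collect the resulting state of each simulated server.
results = []
for order in itertools.permutations(updates):
    store = {}
    for u in order + (order[0],):  # the duplicate exercises idempotence
        apply_update(store, u)
    results.append(store)

# True if every delivery order produced the same final state.
all_same = all(r == results[0] for r in results)
```

Because each record keeps only the update with the highest timestamp, a duplicate or late delivery is simply ignored, which is why no global sort or central sequencer is needed.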

So with that established, let’s look at how GenieDB databases deal with different kinds of failure.

A server develops a ‘soft’ software or hardware fault

Let’s say one of your servers runs out of disk space, or the disk starts to fail, or a bug somewhere causes something to fail.

If the failure is noticed in a client process, which only reads from the local data, then it’s assumed that the problem is local to the client process – perhaps it’s out of RAM. So the error is logged, and an error is returned from our code into the client.

If the failure is noticed by the daemon that performs updates, it can be one of two things. It’s either a real problem with that node – disk problems, for example – or something that just affects the daemon; a bug in it, perhaps. Either way, the critical thing is that the attempted operation failed, so the daemon is no longer able to keep this node up to date. The daemon logs the error (if possible), writes a crash snapshot file to disk for analysis (if possible), and then self-terminates.

Any connected clients will notice that the daemon has terminated as soon as they call into our library, at which point, they get an immediate error return.

How clients deal with the error return is up to them – some client apps might like to continue in some degraded mode without access to their database; some might prefer to abort and take that server out of the server pool. In order to support this, as well as starting to return errors to the application, as the daemon dies, it attempts to run a user-supplied “shutdown notification” script. However, in some failure modes, this script may not be run (if the machine is out of RAM, for example, or suffering really bad disk failures), so don’t rely on it!

Two things can happen now; either the server is repaired, or it’s replaced.

Server is repaired

If the node comes back up, with the problem resolved, then the daemon immediately connects to the network to receive multicasts. From this, it finds the current timestamp of the system by listening to updates as they stream in; it never has long to wait, since it also broadcasts a request for another server that’s in synch, and the reply will contain a timestamp.

If it can find another server, it then knows what timestamp the system is up to now – which, combined with the timestamp it finds on disk as the “last confirmed timestamp” from before it died, allows it to work out the timestamp interval it’s missed; it asks the server that responded to send it all updates in that timestamp interval (plus some leeway either side to allow for variances in message ordering, as it does no harm to receive the same update twice).
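The interval calculation might look something like the sketch below. The function name `replay_interval`, the integer timestamps and the leeway of 5 units are assumptions made for illustration; the real leeway value is not stated here.

```python
def replay_interval(last_confirmed, current, leeway=5):
    """Return the (start, end) timestamp range to request from a peer.

    last_confirmed: the "last confirmed timestamp" found on local disk.
    current: the system timestamp reported by the peer that replied.
    leeway: hypothetical padding on both sides; receiving an update
            twice is harmless because updates are idempotent.
    """
    start = max(0, last_confirmed - leeway)  # never ask for time before zero
    end = current + leeway
    return start, end
```

For example, a node that was confirmed up to timestamp 100 and learns that the system is now at 250 would ask for everything in the range `replay_interval(100, 250)`, that is, 95 to 255.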

While this is happening, it’s also listening for new updates that come from the network, and applying them; this means it doesn’t need to chase an ever-moving target to get back into synch. It also avoids a race condition, where a record is updated while an update to it is in flight to the recovering server. No matter what order the replayed update and the new update arrive in, the latest of the two will ‘win’, the same as on every other server.

When it’s received them all, it updates its status to ‘ready’, meaning that it can answer replay requests from other servers, and that clients can now connect. And it also runs a user-supplied notification script, just in case the server needs to be registered into a server pool or something. Either way, no data is served from the server until it knows it’s synchronised with the rest of the system.

But what if we can’t see another server? If we don’t get a reply to our request within five seconds, we must be the only active server we can see. In which case, the best we can do is to just continue with the data we have.

Server is replaced

A new server introduced to the system starts with no data, and believing that it is correct up to timestamp zero. Which is true, as all databases are empty at timestamp zero; no updates have happened yet (and inserts are a kind of update).

And given that… everything else happens just as above. It looks at the network, finds out what timestamp the system is up to, and asks another server for all the updates from time zero to now, which is everything.

Except it’s a bit cleverer than that; the ‘last modified index’ that servers keep in order to satisfy requests for all updates in a time period really is just an index on the last-modified field of every record. So when a record is updated, its previous last-modified timestamp is removed from the index, and it’s re-inserted with its new timestamp. This means that every record only appears once in the index, at the point of its most recent update.

So when a recovering server requests reminding, it doesn’t receive the entire update log of the entire system in the time period – it just receives one update per modified record. If it asks for updates since time zero, it gets a snapshot of the entire database, rather than the entire history of it.
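A rough sketch of such an index, using a sorted list in Python, shows why a replay returns one entry per record rather than the full history. The class `LastModifiedIndex` and its methods are hypothetical names invented for this illustration; a real implementation would use an on-disk index rather than an in-memory list.

```python
import bisect

class LastModifiedIndex:
    """Illustrative in-memory index on each record's last-modified time."""

    def __init__(self):
        self._by_time = []  # sorted list of (timestamp, key)
        self._records = {}  # key -> (timestamp, value)

    def update(self, key, ts, value):
        old = self._records.get(key)
        if old is not None:
            if ts <= old[0]:
                return  # stale or duplicate update: nothing to do
            # Remove the record's previous position in the index.
            i = bisect.bisect_left(self._by_time, (old[0], key))
            del self._by_time[i]
        # Re-insert the record at its new last-modified time.
        bisect.insort(self._by_time, (ts, key))
        self._records[key] = (ts, value)

    def updates_between(self, start, end):
        """Return (key, timestamp, value) for records last modified in [start, end]."""
        lo = bisect.bisect_left(self._by_time, (start,))
        out = []
        for ts, key in self._by_time[lo:]:
            if ts > end:
                break
            out.append((key, ts, self._records[key][1]))
        return out

idx = LastModifiedIndex()
idx.update("a", 1, "v1")
idx.update("b", 2, "v1")
idx.update("a", 3, "v2")
idx.update("a", 5, "v3")

snapshot = idx.updates_between(0, 5)  # "since time zero": one entry per record
recent = idx.updates_between(3, 5)    # only records modified in the window
```

Record `a` was written three times, yet the request from time zero returns it once, with its latest value.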

The GenieDB daemon process on a server dies unexpectedly

Perhaps it gets a segmentation violation or a bus error; either way, it dies, without us being able to do a clean shutdown. The notification script won’t be run, and at first, clients won’t notice – they’ll continue reading data that becomes increasingly out of date!

But, they won’t – because we have our consistency buffer technique that is already papering over the fact that servers are updated asynchronously anyway. However, that only covers us for up to sixty seconds; on a fast local network, the average replication delay to disk is a couple of seconds under high load, so we have a generous leeway. But we take correctness seriously, so we track (in shared memory) the wall-clock timestamp at which the daemon last updated the disk. Every time one of our API functions is called, we check to see if that’s more than thirty seconds ago; and if so, we consider that something is Very Wrong, and consider the node to be down, and so report an error to the client.
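The check made on each API call could be sketched as follows. The names `check_daemon_alive` and `NodeDownError` are made up for this example, and the shared-memory read is replaced by a plain argument; only the thirty-second threshold comes from the description above.

```python
import time

STALE_AFTER_SECONDS = 30  # threshold described in the text

class NodeDownError(Exception):
    """Hypothetical error returned to the client when the node looks dead."""

def check_daemon_alive(last_disk_update, now=None):
    """Raise NodeDownError if the daemon has not touched the disk recently.

    last_disk_update: wall-clock time (seconds) the daemon last updated
                      the disk; in the real system this would be read
                      from shared memory.
    now: current wall-clock time, injectable for testing.
    """
    if now is None:
        now = time.time()
    if now - last_disk_update > STALE_AFTER_SECONDS:
        raise NodeDownError("daemon has not updated the disk for over 30 seconds")
```

A client library would call something like this at the top of every API function, so a dead daemon is detected on the next call rather than by a background watchdog.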

A server dies, hard

What if a server just… dies? The power is pulled out, the kernel panics, etc. It has no warning of this; it just goes offline.

This is why it’s crucial to not rely on the shutdown-notification script to be run – because servers can just die like this, without warning, or any opportunity to tell anyone.

However, all is not lost. The reliable multicast system that GenieDB servers use to talk to each other will quickly notice that the server isn’t responding, and announce that it has disappeared. This causes two things to happen:

  • All the GenieDB monitoring stations in the system will report that the server has disappeared in their logs, and remove the server from the list of active servers exported to your monitoring system (…you run a server cluster; you have a monitoring system, right?)
  • All the servers will record the timestamp at which they last heard from that server.

The first action means that you will know about it, so you can investigate, and make sure the server isn’t in any pools or anything any more, if necessary; and the second action makes more sense when we look at the next failure case.

What happens next depends on whether the server is replaced or repaired; the recovery actions are the same as for a soft failure.

A network failure isolates one or more servers

The system can’t tell the difference between a server that disappears due to catastrophic failure, and one that disappears due to a network failure; either way, the server just stops responding.

So when a network failure occurs, the servers on the other side of the failure proceed as for the previous case: monitoring stations note the server as gone, and database servers keep a note of when they last heard from it.

The fun part is, this is happening on both sides of the network split. But all the servers on both sides continue to operate as before; the only problem is, their state is diverging, as updates applied on one side aren’t appearing on the other. It could just be a single server isolated, of course – perhaps by the network cable to that server alone being pulled out.

When the network failure is repaired, though, something interesting happens. In all the cases above, where a server has disappeared, all the other servers have recorded the timestamp it disappeared at. And here’s why: as well as the recovery that occurs when a server finds itself starting up, where it explicitly requests the changes it missed, there’s a second mechanism. When a server we’ve seen disappear reappears, every server notes the timestamp at which it last heard from that server before it disappeared, and the timestamp at which it came back online; the servers all tell that server what time period they’ve not seen it for, and ask it to send them all the updates it’s seen in that time period.

So in the case of a network partition, this means that both sides of the partition send the other side what updates they’ve missed.
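The bookkeeping each server does for its peers might be sketched like this. `PeerTracker` and its method names are hypothetical, and the membership events that the reliable multicast layer would deliver are modelled here as plain method calls.

```python
class PeerTracker:
    """Illustrative per-server record of when each peer was last heard from."""

    def __init__(self):
        self.last_seen = {}  # peer -> timestamp of the last message heard
        self.missing = {}    # peer -> timestamp at which it disappeared

    def peer_lost(self, peer):
        """Called when the multicast layer announces that a peer has gone."""
        if peer in self.last_seen:
            self.missing[peer] = self.last_seen[peer]

    def heard_from(self, peer, ts):
        """Record a message from a peer.

        Returns a (start, end) interval to request from the peer if it has
        just reappeared after being lost, otherwise None.
        """
        request = None
        if peer in self.missing:
            request = (self.missing.pop(peer), ts)
        self.last_seen[peer] = ts
        return request

tracker = PeerTracker()
tracker.heard_from("server-b", 100)
tracker.peer_lost("server-b")
gap = tracker.heard_from("server-b", 160)  # server-b is back
```

Here `gap` is the interval during which `server-b` was out of sight, and it is the range of updates this server would ask `server-b` to send. During a partition, the servers on each side run the same logic about the other side, so the exchange is symmetric.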

In the simple case of a server going down and coming back up, of course, that server performs the normal recovery process, and it will receive requests from other servers for what it’s seen while it was away. But since it was down, it won’t have seen anything, so there will be nothing to send. The other servers can’t tell if the server was down for the duration, or just isolated, so they ask anyway.

Conclusion

In summary, any server can continue to operate on its own, even if the entire rest of the system is down, or isolated by network failures. And when the servers rejoin, they’ll all tell each other what they’ve missed.

You could even have two totally isolated clusters, with a single server (perhaps a laptop) that travels between them. When it joins one cluster, that cluster tells it what it’s missed, and it tells the cluster what it knows that the cluster doesn’t. Then when it goes to the other cluster, it will pass on all the first cluster’s updates, and gather the second cluster’s updates. So when it returns to the first cluster, it will pass them on.
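The travelling-laptop scenario can be simulated in a few lines. This is a toy model under stated assumptions: each cluster is collapsed into a single dictionary of `key -> (timestamp, value)`, the `exchange` function is a hypothetical stand-in for the two-way catch-up described above, and no new updates arrive while the laptop is travelling.

```python
def exchange(a, b):
    """Two-way catch-up: both stores end up with the newest version of each record."""
    for key in set(a) | set(b):
        candidates = [store[key] for store in (a, b) if key in store]
        newest = max(candidates, key=lambda rec: rec[0])  # highest timestamp wins
        a[key] = newest
        b[key] = newest

cluster_one = {"x": (1, "one")}
cluster_two = {"x": (2, "two"), "y": (3, "only-two")}
laptop = {}

exchange(laptop, cluster_one)  # laptop joins the first cluster
exchange(laptop, cluster_two)  # carries its updates to the second cluster
exchange(laptop, cluster_one)  # and brings the second cluster's updates back
```

After the round trip, all three stores hold the same data, even though the two clusters never communicated directly.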

Of course, you may remember that we mentioned before that the replication lag hiding trick depends on the servers in a cluster being able to see each other. But even if they can’t, it only fails a bit – the illusion of zero replication lag fails for records whose primary server cannot be reached from the server that’s dealing with them, but the records can still be reached; it’s just that changes to them within the past few seconds might not be seen. As the network falls apart around us, the quality of service drops, but the system is still working and every record is available.

And all of this is automatic. Sure, servers that die need human attention to free up disk space, replace hardware, reboot them, and the like. But apart from that, we’ve automated pretty much everything we can. The design principle is simple – each server must track enough information about what’s going on with other servers that it can catch up on any updates it misses, automatically.

And anybody who’s had to resynchronise their MySQL slaves after promoting one to be a new master after the master has failed will appreciate how beautiful a thing that is…
