Replication Lag

The key idea in a replicated database is that you store multiple copies of your data, spread over several servers. The advantage is that when you want to read some data, you can pick a server (ideally, the very server your code is running on has its own local replica, so you can bypass the network), and do the reads there. This means that you can handle more reads per second by just adding more servers, and you can have copies of the data that are near to all the places you want to read data; this reduces latency, and can reduce bandwidth usage over your backbone if you have distributed sites compared to all your application servers talking to a centralised database server.

However, it has a cost: when you do a write, you need to update all the replicas. The naive approach is to require a writer to write to every replica independently; even worse is to have a “master” server that does it for them, as many popular replication systems do, since it then becomes a bottleneck. More advanced implementations implement a multicast system of some kind, ranging from actual IP multicast to diffusion over spanning trees.

But however you do it, replicating changes to a number of servers isn’t an atomic operation. When you issue that write, there will be a time delay before every server has it; and if there’s a network partition in progress, or a server is offline, that time delay could extend to days or weeks. So “synchronous replication” – where the write request does not return to the issuer as “completed” until positive acknowledgement has been received from every replica – is rare in practice; it makes writes very slow. On the other hand, “asynchronous replication” involves returning writes to the issuer as “completed” as soon as they’ve been received by a master server, or the multicast process initiated; this makes writes return quickly, improving the throughput of bulk writes tremendously. But it has a cost: programmers have to be careful.

A normal web page to allow a user to update their details will issue SQL queries to select the user’s details, then populate a form with them; if the user submits the form, we then issue an SQL query to update the user’s details, then continue as normal – fetching the data back from the database to redisplay it. This pattern avoids duplicating code, as the same logic is used to fetch the user’s details in each case; and it ensures that we will always show the user the actual data as it is in the database – so any trimming to length limits or other SQL-layer data transformations are accurately reflected.

But in a system with replication lag, this leads to problems. The user changes their details, then when they get the page back, it still shows their old details – as the change to their details had not yet replicated to the server the details were pulled back from to regenerate the page! This may not happen in testing, where the system is underloaded so replication is fast; but when the system becomes busy, replication lags can stretch to hours. The fun part is that the user may then think they did something wrong, and keep resubmitting their change until it “sticks”; this means that as the load increases, the rate at which updates come in increases by a factor proportional to how overloaded the system already is – creating a disastrous feedback loop.

Generally, applications need to be written to hide replication delays; when displaying data to a user, the application must manually account for recent changes the user has made, on top of what actually comes back from the database. This complicates development, hampering progress, and introduces plenty of scope for intermittent bugs that upset customers.

Which is why we developed proprietary technology at GenieDB to hide replication lag within the database itself, meaning that in normal usage, as soon as a write operation has returned to the application, subsequent reads (even from other servers) will return the latest information. Of course, in situations with network and server failures, this isn’t always possible, so you can still end up with some replication lag in extreme circumstances (alas, we can’t change the laws of physics), but except in emergencies, GenieDB hides replication lag from the application. We want to make it easy for applications to scale to replicated storage!

So it was with some amusement that, as we started setting up company profiles on popular professional and social network sites, we saw signs of replication lag. As staff members updated their personal profiles to state that they were part of the company, we’d go and refresh the company’s page to see if they had “done it right”, and lo, they would not appear yet.

So have we done something wrong in adding ourselves to the company? Or is it just replication lag? Should we assume the former and try again (risking breaking a working setup that just hasn’t replicated yet, and/or further loading the site’s database with unnecessary updates)? Or leave it for half an hour and see if the company page catches up?

As users of a site, we shouldn’t be having to think like this. And, of course, few users are replicated database professionals, so most of them won’t think like this; they’ll just think the site is broken if they encounter the artefacts of replication lag.

Leave a Reply