Most replicated databases provide ‘eventual consistency’.
The reason for this is simple: replication, by definition, means there are multiple copies of the data lying about. And whenever something is changed, that change has to be applied to all the copies. And since doing that takes time, there will inevitably be situations where some of the copies have performed a given update, and some haven’t. This is known as ‘inconsistency’ – the replicas not being all the same. Applications have to be written to expect this; if you issue an update to the database, you can’t expect to get the new data back if you select it immediately. If the user submits a form that changes how the next page is displayed, that page can’t just read the database to generate itself (as it might do when being visited *not* from the form); the page has to manually take account of the change to reflect it, as the database might not. And users get frustrated if they change something, but it doesn’t seem to ‘work’. They submit the change again and again, putting more load on your system.
Eventual consistency means that there may be inconsistency while updates are ‘in flight’, but once this (usually short) period is over, unless more updates arrive, the replicas will become consistent again; and any given change will ‘eventually’ happen everywhere.
It’s possible to avoid this; using transactions and a two-phase commit protocol or more advanced systems like Paxos, one can arrange that all transactions are globally serialized (or an illusion thereof), so no node can ever see an inconsistent database state. However, doing so puts load on the system due to the extra messages interchanged and the extra temporary storage required for in-progress transactions while they are being agreed, and harms application latency due to the extra synchronisation required before doing anything, and can harm throughput due to making things wait on locks, or retrying transactions due to optimistic concurrency control detecting a conflict – and it cannot be enforced at the same time as the system is running with a network partition; if a network link fails you have to either abandon consistency or bring all or part of the system to a halt until communication is restored. There can be no globally consistent state without a way of communicating globally!
So strictly consistent replication is only used for systems that need absolute consistency enough to justify the costs.
However, here at GenieDB, we’ve discovered a way to get strict consistency cheaply for most operations, while the network is fully connected; and when network links fail, we fall back to eventual consistency in a graceful manner. By “most operations”, I mean:
- Any SELECT of a record by its primary key will consistently return the most recent state of that record, even if that is the deletion of the record
- Any other SELECTs – by non-primary key columns, by range of primary keys, or whole table scans – will consistently return the most recent state of the records it finds (even if they are deleted), but may not see newly created records
The way it works is quite simple: whenever we update a record (including deletions), we hash the primary key to choose a ‘primary server’ for that record; and we synchronously send the new state of the record to that server. We start an asynchronous ‘eventual’ replication of the change to the nodes in the background, and return control to MySQL.
When we fetch a record by primary key, we ask the primary server for that record first to see if there’s a recent version; if so, we use that rather than reading from our local disk. If it’s down or unreachable, we just use our local copy after all – which might be inconsistent, but is better than nothing.
So once a write has returned as complete, a read will always return the new data, no matter if replication is still in progress.
When the database is read without knowing a specific primary key – those table scans and secondary-index scans – we consult our local disk to find eligible primary keys, then look them up on their primary servers. If the records have been deleted, or the secondary index field changed so they are no longer eligible for the query, they’re skipped. Otherwise, we return their consistent current values from the primary servers.
Distributing records by hashing is ordinarily used as a cache, such as memcache, to allow an application to retrieve information quickly without needing to send out relatively expensive SQL queries. As our core engine is a lot faster than a MySQL database, all this talking to primary servers actually harms our performance due to the network latency; so we allow the application to turn the consistency algorithm off on a per-operation basis, for situations where eventual consistency is good enough.
What we provide isn’t quite strict consistency; it’s strictly consistent for all but a few specific ‘tricky’ kinds of query, and even then, it’s almost strictly consistent; and even in the face of multiple network failures it can never be worse than eventual consistency. But that covers a lot of cases. Enough that, for many applications, they can be written to assume strict consistency, and the odds of a network outage being kept low enough that the cost of what amounts to a few outdated selects per year is negligible compare to the cost of engineering the application to cope with eventual consistency.
This kind of thing is the core of what we’re doing here at GenieDB – database technologies such as replication offer seductive advantages, but usually at some unpleasant cost. We’re finding ways to reduce those costs, to make the advantages of these technologies more widely available and easier to use. And where we can’t remove them all, we make them optional on as fine-grained a level as we can, so developers can pick and choose throughout their application!

[...] they won’t – because we have our consistency buffer technique that is already papering over the fact that servers are updated asynchronously anyway. However, [...]
Great post, I bet a lot of work and research went into this article.