Standard MySQL can be configured so that a single master server is clustered with a number of read-only slave servers. To enable this master-slave replication, the master's transaction logs are shipped to the slaves (log shipping), a form of asynchronous replication. Under this configuration the data on a slave always trails the master, a condition referred to as slave lag or replication lag. The extent of the lag depends on workload, network bandwidth, and network latency. Database reads can be served from the slaves, provided the application has been designed to tolerate the lag and the resulting staleness of data (eventual consistency), which can at times be variable and opaque. MySQL master-slave replication offers the possibility of promoting a slave to become the new master should the master fail, but this is very painful to do in practice: the cluster must stop taking ANY writes while it waits for all existing slaves to catch up and apply the failed master's logs, effectively quiescing the cluster. Only then can a new master be assigned, and all clients must reconnect to it for writes. The situation is further complicated if the master server fails leaving different slaves at different binary-log coordinates.
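As a concrete sketch of the setup described above, a slave is attached to the master by pointing it at the master's binary-log coordinates, and its replication lag is then visible in its status output (host name and credentials here are placeholders, not values from any real deployment):

```sql
-- On the slave (hypothetical host and credentials):
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',  -- binary-log coordinates taken
  MASTER_LOG_POS  = 4;                   -- from the master's snapshot
START SLAVE;

-- The slave lag discussed above appears as Seconds_Behind_Master:
SHOW SLAVE STATUS\G
```

Note that `Seconds_Behind_Master` is exactly the variable, sometimes opaque staleness the application must be designed to tolerate.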
Two or more MySQL nodes can be set up in a "daisy-chain" or circular pattern, each able to accept reads and writes. This replication pattern also depends on log shipping, and so is another form of asynchronous replication. Allowing writes at more than one node introduces a new potential problem: conflicting updates to the same record. If conflicting changes happen on two nodes, the database server has no automated way to resolve the differences and data consistency is lost. The challenges of conflict resolution have significant implications at both the database and application layers, and the scope of this problem has been well documented; a particularly interesting read is the post on the Facebook Engineering Blog. Facebook ultimately re-architected its application and load balancer and created a custom version of MySQL to overcome these inherent challenges. Beyond conflict resolution, the circular approach to asynchronous replication has a compounding effect on the already onerous replication lag, since each change must travel around the ring. Furthermore, loss of any single server is highly problematic: MySQL circular replication offers no recovery from a single failed node other than manual reconfiguration and restart.
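A common, if only partial, mitigation for write conflicts in a circular topology is to partition the auto-increment key space so that nodes cannot generate colliding primary keys. This avoids one class of conflict but does nothing for concurrent updates to the same existing row; a sketch of the relevant `my.cnf` settings for a two-node ring (values are illustrative):

```ini
# my.cnf on node 1 (node 2 would use server-id = 2, auto_increment_offset = 2)
[mysqld]
server-id                = 1
log-slave-updates              # required so writes propagate around the ring
auto_increment_increment = 2   # step by the number of writable nodes
auto_increment_offset    = 1   # this node's position in the sequence
```

With these settings node 1 generates keys 1, 3, 5, … and node 2 generates 2, 4, 6, …, so inserts never collide on the primary key even when both nodes accept writes.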
Synchronous master-master replication offers a strict guarantee that replication occurs in concert across all nodes (i.e., without replication lag). With its reliance on log shipping, stock MySQL does not support synchronous replication of any kind, but third-party solutions have recently become available with varying sets of features. In these implementations, any change on a particular server is "certified" against all (or a majority) of the other servers in the cluster; only when certification succeeds does the client's transaction commit. One of the challenges with synchronous replication is the concomitant latency penalty, which hinders overall system throughput, especially in a multi-availability-zone or multi-region deployment. Furthermore, failure of any node can cause certification failures, substantial rollback of transactions, or, even worse, deadlock.
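The latency penalty follows directly from the certification step: a commit cannot return until the required set of peers has replied, so commit latency is bounded below by the slowest required round trip. A minimal sketch (not any vendor's implementation; the function name and RTT figures are assumptions for illustration):

```python
def commit_latency_ms(peer_rtts_ms, quorum):
    """Certification completes once `quorum` peers have replied,
    so commit latency is the quorum-th fastest round trip."""
    return sorted(peer_rtts_ms)[quorum - 1]

# Single availability zone: low, uniform round-trip times.
local = [1.2, 1.5, 1.4]
# Multi-region deployment: one peer is ~80 ms away.
multi_region = [1.2, 1.5, 80.0]

print(commit_latency_ms(local, quorum=3))         # 1.5
print(commit_latency_ms(multi_region, quorum=3))  # 80.0 -- the distant node gates every commit
print(commit_latency_ms(multi_region, quorum=2))  # 1.5  -- majority certification hides it
```

The last two lines show why requiring certification from all nodes is so punishing across regions, and why majority-based schemes soften (but do not eliminate) the penalty.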
GenieDB has written a whitepaper specifically focused on multi-region, geo-distributed deployment of MySQL for high availability, such that an outage in one geographical area (an AWS outage, Hurricane Sandy, etc.) does not impact application availability. GenieDB addresses the conflict-resolution challenge with a modified Lamport time-stamping approach that also enables automated healing after a failure, or after a period in which nodes are disconnected, continue to receive updates, and are then reconnected. The whitepaper is available at http://www.geniedb.com/beyond-failover.
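To illustrate the general idea behind Lamport-style timestamping (this is a generic sketch, not GenieDB's actual algorithm, which is described in its whitepaper): each version of a record carries a logical clock plus a node identifier, and every node applies the same deterministic comparison, so previously disconnected nodes converge on the same winner as soon as they exchange updates.

```python
def resolve(a, b):
    """Pick the winning version of a record deterministically.

    Each version is (lamport_clock, node_id, value). The node id
    breaks clock ties, so every node reaches the same decision
    without coordination -- which is what allows disconnected
    nodes to heal automatically once they reconnect.
    """
    return max(a, b)  # tuple order compares clock first, then node id

# Two nodes updated the same record while partitioned:
v1 = (7, "node-a", "alice@old.example")
v2 = (9, "node-b", "alice@new.example")
print(resolve(v1, v2))  # (9, 'node-b', 'alice@new.example')

# Equal clocks: the node id decides, identically everywhere.
print(resolve((5, "node-a", "x"), (5, "node-b", "y")))  # (5, 'node-b', 'y')
```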