Conclusion
The end result is a surprisingly resilient system. By tracking all the ‘how far behind am I?’ metrics we can think of in the server, and feeding them back into flow control, we try to prevent any server from getting too far behind, no matter how you measure it. This fits nicely with our automatic recovery when a server fails and later returns to the system, or when the network is partitioned by link failures; in general, we have built a system that runs itself as much as possible.
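As a rough illustration of the idea of feeding lag metrics back into flow control, here is a minimal sketch in Python. It is not our actual implementation: the metric names, the limits, and the linear throttling policy are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class LagMetrics:
    """Hypothetical 'how far behind am I?' measurements for one server."""
    queued_messages: int    # updates received but not yet applied
    seconds_behind: float   # age of the oldest unapplied update
    queued_bytes: int       # size of the unapplied backlog


# Hypothetical limits: the point at which a server counts as too far behind.
LIMITS = LagMetrics(queued_messages=1000,
                    seconds_behind=5.0,
                    queued_bytes=64 * 1024 * 1024)


def worst_lag_ratio(m, limits=LIMITS):
    """Return the largest metric/limit ratio, so the worst metric wins."""
    return max(m.queued_messages / limits.queued_messages,
               m.seconds_behind / limits.seconds_behind,
               m.queued_bytes / limits.queued_bytes)


def allowed_write_rate(max_rate, servers):
    """Scale the permitted write rate by the most-lagged server.

    Assumed policy: full rate while every server is under half its
    limits, then a linear ramp down to zero as the worst one reaches them.
    """
    worst = max((worst_lag_ratio(m) for m in servers), default=0.0)
    if worst <= 0.5:
        return max_rate
    if worst >= 1.0:
        return 0.0
    return max_rate * (1.0 - worst) / 0.5
```

Taking the maximum across both metrics and servers is what makes the throttle respond to lag "no matter how you measure it": a server that is fine by message count but far behind in seconds still slows the writers down.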
This contrasts starkly with our most widespread competitor, MySQL replication. MySQL replication slaves are notorious for silently losing synchronisation, after which they must be manually re-synchronised with the master. When the system is overloaded with too many writes, the logs just get bigger and bigger as the slaves lag further and further behind, until something somewhere snaps (be it disk space, or your users’ frustrations with increasingly outdated data). MySQL replication users tend to have to augment it with monitoring systems to detect lagging or desynchronised slaves, and to develop their own tools to re-synchronise them, along with their own means of shedding load when writes outpace replication.
This is work they should not have to be doing!
