GenieDB is building a database with global distribution as its core thesis. It is no secret that customers demand near-instantaneous and highly reliable service, and that they are becoming more globally dispersed than ever before. We believe that data custodianship must ultimately be moved to the “edge of the web,” where it can be dynamically managed in order to improve user experience, optimize network/hardware utilization, and reduce TCO. A database and application stack hosted in a single datacenter runs afoul of this fundamental thesis in a number of ways. In this article we will focus on the issue of improving response time for users even when they are globally distributed. Latency between a user and a distant datacenter is simply a matter of physics: how long it takes to transmit a packet between the two locations. No amount of application tuning can overcome this obstacle.
The obvious solution is to have multiple copies of the database and application stack located closer to the various user bases. At the database level this is addressed by replication: one can set up a cluster of geographically distributed servers where a data change on one node replicates to all others. This solution raises four questions:
- What does it take to set up and maintain such an infrastructure? (Reliability & TCO)
- Do you allow writes to happen on every node in the cluster? (Multi-Master)
- If yes, how do you make sure that changes on one node don’t stomp over changes on another? (Transactions)
- How does the system respond to node, network or Cloud provider failures? (Ultra-Availability)
GenieDB is built from the ground up to be cognizant of the demands and vagaries of a globally distributed cluster; put another way, it is built to be deployed on high-latency, frequently unreliable networks. GenieDB was therefore designed around the basic assumptions that global networks are slow and unreliable and that servers will eventually fail. Some highlights of our architecture follow, many of which we will explore in later posts:
- Full copies of the entire database exist at each and every node
- Every node is a full read/write master (masterless or peer-to-peer)
- GenieDB operates as a MySQL storage engine
- For persistent data storage, GenieDB currently employs BDB (Berkeley DB)
- GenieDB’s Cloud Fabric layer is an advanced messaging and cache system which immediately replicates and synchronizes updates globally
- An enhanced form of Lamport timestamping is used for conflict resolution, among other advancements (see the sketch after this list)
- Heartbeats and data synchronization are monitored for automated detection and handling of network partitions and failures
- Advanced Data Custodianship to localize data based on usage patterns and network status
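To make the conflict-resolution point above a little more concrete, here is a minimal sketch of how basic Lamport timestamps can be compared to decide between two concurrent updates to the same row. The counter values, node names, and tie-breaking rule are assumptions for illustration only; GenieDB’s enhanced scheme will be covered in a later post.

```sh
#!/bin/bash
# Toy illustration of basic Lamport-timestamp conflict resolution (not GenieDB's
# enhanced version). Each update carries a (counter, node-id) pair; the higher
# counter wins, and the node id breaks ties deterministically on every node.

a_counter=17; a_node="ec2-east"   # hypothetical update A
b_counter=17; b_node="ec2-west"   # hypothetical concurrent update B

if   [ "$a_counter" -gt "$b_counter" ]; then echo "update A wins"
elif [ "$b_counter" -gt "$a_counter" ]; then echo "update B wins"
elif [[ "$a_node" > "$b_node" ]];       then echo "update A wins (tie broken on node id)"
else                                         echo "update B wins (tie broken on node id)"
fi
```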
In the remainder of this post, we will cover the basic issue of how we reduce latency in a widely distributed system.
Setting up each node in the cluster is a simple matter of installing the GenieDB software (cloudfabric) on each machine using standard Debian or CentOS utilities. The new node is then made part of the cluster by adding it to a VPN, which ensures that all traffic between nodes is encrypted and kept secure.
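For example, installation on the two supported distributions might look like this; the package name below is a placeholder, not a published GenieDB package:

```sh
# Hypothetical package name -- substitute whatever your GenieDB distribution ships.
sudo apt-get install cloudfabric   # Debian/Ubuntu
sudo yum install cloudfabric       # CentOS
```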
We currently use the open source solution “tinc” to create a VPN between the various nodes. Creating the VPN is a completely separate aspect of setting up the cluster; any other means of connecting the nodes can be used as well, as long as it supports packet broadcasting. We provide scripts that make setting up a VPN with tinc very easy, and eventually we will also provide a ‘Management Console’ UI which wraps this functionality into a point-and-click interface.
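For readers who have not used tinc before, a minimal configuration for one node of our assumed setup might look roughly like the following. The network name, node names, and the 10.0.0.0/24 addresses are illustrative assumptions; switch mode is used because it forwards broadcast packets between nodes.

```sh
# /etc/tinc/geniedb/tinc.conf on the EC2 East node (names are illustrative):
#   Name = ec2east
#   Mode = switch            # behaves like an Ethernet switch, so broadcasts are forwarded
#   ConnectTo = ec2west
#   ConnectTo = layeredtech

# /etc/tinc/geniedb/tinc-up -- assign this node's private address to the VPN interface:
#   ip addr add 10.0.0.1/24 dev $INTERFACE
#   ip link set $INTERFACE up

# Generate this node's keypair, exchange the host files under
# /etc/tinc/geniedb/hosts/ between all nodes, then start the daemon:
sudo tincd -n geniedb -K
sudo tincd -n geniedb
```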
With the VPN set up, cloudfabric is configured with the private network address and port details. That is all the configuration that is needed; once the server is started (/etc/init.d/cloudfabric start), data stored in GenieDB on any other node in the cluster begins streaming in automatically.
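As a rough illustration (the actual configuration file format and key names are not covered in this post, so the ones below are hypothetical), pointing the engine at its VPN address and starting it is all that is required:

```sh
# Hypothetical configuration keys -- the real cloudfabric settings may be named differently.
#   listen_address = 10.0.0.1   # this node's private VPN address
#   listen_port    = 4567       # port the cluster nodes talk on

# Start the service; data already stored on other cluster nodes begins streaming in.
sudo /etc/init.d/cloudfabric start
```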
We created a test bed using the steps above with three nodes: one on EC2 West Coast, another on EC2 East Coast, and one in our Cloud partner LayeredTech’s Kansas data center.
- 90ms ping time between EC2 East and West instances.
- 45ms ping time from EC2 East and West to the LayeredTech datacenter.
We compared the performance of replication under this setup against MySQL circular replication. We ran a test that inserted a range of rows (1 to 10,000) on node #1 and measured the time it took before the data was replicated to nodes #2 and #3. We used the following method to compute the lag (a rough sketch of the polling loop follows the list):
- Sync’d the clocks on all servers using NTP.
- Started insertion on node #1 using a script that, when done, sends the finish timestamp to the other nodes.
- Nodes #2 and #3 fetch a count of rows every 10ms, which bounds the average measurement error to about 5ms.
- When the count matches the expected count, each node computes the difference between its own clock and the node #1 timestamp it received.
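The polling loop running on nodes #2 and #3 can be sketched roughly as follows. The table name, expected row count, and the mechanism by which node #1’s finish timestamp arrives are assumptions for illustration; in practice a persistent database connection would be needed to keep the polling interval anywhere near 10ms.

```sh
#!/bin/bash
# Sketch of the lag measurement on node #2 / #3. Table name, expected count and
# the file holding node #1's finish timestamp are illustrative assumptions.

EXPECTED=10000
FINISH_TS_MS=$(cat /tmp/node1_finish_ts)     # epoch milliseconds, sent over by node #1's script

while true; do
    count=$(mysql -N -e "SELECT COUNT(*) FROM test.replication_test")
    if [ "$count" -ge "$EXPECTED" ]; then
        now_ms=$(date +%s%3N)                # local clock, NTP-synchronized with node #1
        echo "replication lag: $((now_ms - FINISH_TS_MS)) ms"
        break
    fi
    sleep 0.01                               # poll roughly every 10 ms
done
```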
This measures the efficiency of the replication process. In our tests:
- The replication lag on node #2 for GenieDB was lower than MySQL’s and remained stable, while MySQL’s lag increased as we inserted more rows.
- The replication lag at node #3 was the same as at node #2 for GenieDB, while for MySQL circular replication it was cumulative and hence higher.