Archive for September, 2009

Into the Cloud

Monday, September 28th, 2009

Cloud environments offer enticing cost savings for Web businesses, but they present some interesting challenges. One of these challenges is that clouds tend to provide large numbers of relatively lightweight virtual servers, which may potentially fail; high availability is meant to happen in your application, by designing it so that it can tolerate virtual servers coming and going.

This is great for Web servers, and with a little cleverness, it even works well for very simple key-value data stores. However, it’s not a great fit for the database layer; traditional database products rely on a small number of powerful servers, where one or more may be single points of failure, so would usually be implemented on highly redundant hardware.

Here at GenieDB, we’ve been spending the last few years busily implementing a high-level database designed to transparently operate on a cluster of unreliable small servers… So, we’ve been looking at how we can help in cloud environments.

Our existing core technology is already a perfect match; we already support zero-downtime dynamic reconfiguration of clusters (planned and unplanned), with access to the data via independent MySQL instances on each server. We provide the software to be run on each server, and provide instructions on how to seamlessly add and remove servers from a cluster without downtime.

But many users won’t have an existing cluster management infrastructure in place, and configuring each node manually is no way to run a cluster. So as well as documenting the low-level operations on nodes, we’ve written something we call Cluster Tool, which handles bulk operations on clusters. Given some software installation tarballs and a configuration file listing the servers in your cluster and how you’d like them configured, Cluster Tool will ssh into those servers, install our software, configure it, and correctly follow the processes required. It files away a copy of your configuration, so when it’s presented with a new cluster configuration (which may have added, changed, and removed servers), it can decide what servers need uninstalling, which need upgrading, which need reconfiguring and which need installing, to correctly migrate the cluster to its new state. We ship Cluster Tool with its source code, so you can use it as-is, read the source as a reference implementation alongside the low-level management documentation, modify it to fit into your existing cluster management practices, or just wrap it in your own software that feeds it new cluster configurations.

With these foundations in place, automating cloud deployments was easy. We produced Cloud Tool, which uses a pluggable backend to manage virtual servers from various cloud providers. To grow a cluster, it requests more servers, then adds them to the Cluster Tool configuration, and asks Cluster Tool to add them to the cluster. When they are no longer needed, it asks Cluster Tool to remove them from the cluster, then it requests that the servers be de-provisioned.

It’s as simple as that! And like Cluster Tool, it’s provided as source code, and only uses documented interfaces in the software its built on top of (in this case, Cluster Tool itself), so you can build upon it in your own ways, such as running hybrid clusters based on privately-owned hardware that expand out onto cloud servers during peak load.

Replication Lag

Friday, September 4th, 2009

The key idea in a replicated database is that you store multiple copies of your data, spread over several servers. The advantage is that when you want to read some data, you can pick a server (ideally, the very server your code is running on has its own local replica, so you can bypass the network), and do the reads there. This means that you can handle more reads per second by just adding more servers, and you can have copies of the data that are near to all the places you want to read data; this reduces latency, and can reduce bandwidth usage over your backbone if you have distributed sites compared to all your application servers talking to a centralised database server.

However, it has a cost: when you do a write, you need to update all the replicas. The naive approach is to require a writer to write to every replica independently; even worse is to have a “master” server that does it for them, as many popular replication systems do, since it then becomes a bottleneck. More advanced implementations implement a multicast system of some kind, ranging from actual IP multicast to diffusion over spanning trees.

But however you do it, replicating changes to a number of servers isn’t an atomic operation. When you issue that write, there will be a time delay before every server has it; and if there’s a network partition in progress, or a server is offline, that time delay could extend to days or weeks. So “synchronous replication” – where the write request does not return to the issuer as “completed” until positive acknowledgement has been received from every replica – is rare in practice; it makes writes very slow. On the other hand, “asynchronous replication” involves returning writes to the issuer as “completed” as soon as they’ve been received by a master server, or the multicast process initiated; this makes writes return quickly, improving the throughput of bulk writes tremendously. But it has a cost: programmers have to be careful.

A normal web page to allow a user to update their details will issue SQL queries to select the user’s details, then populate a form with them; if the user submits the form, we then issue an SQL query to update the user’s details, then continue as normal – fetching the data back from the database to redisplay it. This pattern avoids duplicating code, as the same logic is used to fetch the user’s details in each case; and it ensures that we will always show the user the actual data as it is in the database – so any trimming to length limits or other SQL-layer data transformations are accurately reflected.

But in a system with replication lag, this leads to problems. The user changes their details, then when they get the page back, it still shows their old details – as the change to their details had not yet replicated to the server the details were pulled back from to regenerate the page! This may not happen in testing, where the system is underloaded so replication is fast; but when the system becomes busy, replication lags can stretch to hours. The fun part is that the user may then think they did something wrong, and keep resubmitting their change until it “sticks”; this means that as the load increases, the rate at which updates come in increases by a factor proportional to how overloaded the system already is – creating a disastrous feedback loop.

Generally, applications need to be written to hide replication delays; when displaying data to a user, the application must manually account for recent changes the user has made, on top of what actually comes back from the database. This complicates development, hampering progress, and introduces plenty of scope for intermittent bugs that upset customers.

Which is why we developed proprietary technology at GenieDB to hide replication lag within the database itself, meaning that in normal usage, as soon as a write operation has returned to the application, subsequent reads (even from other servers) will return the latest information. Of course, in situations with network and server failures, this isn’t always possible, so you can still end up with some replication lag in extreme circumstances (alas, we can’t change the laws of physics), but except in emergencies, GenieDB hides replication lag from the application. We want to make it easy for applications to scale to replicated storage!

So it was with some amusement that, as we started setting up company profiles on popular professional and social network sites, we saw signs of replication lag. As staff members updated their personal profiles to state that they were part of the company, we’d go and refresh the company’s page to see if they had “done it right”, and lo, they would not appear yet.

So have we done something wrong in adding ourselves to the company? Or is it just replication lag? Should we assume the former and try again (risking breaking a working setup that just hasn’t replicated yet, and/or further loading the site’s database with unnecessary updates)? Or leave it for half an hour and see if the company page catches up?

As users of a site, we shouldn’t be having to think like this. And, of course, few users are replicated database professionals, so most of them won’t think like this; they’ll just think the site is broken if they encounter the artefacts of replication lag.