Archive for January, 2010

Profiling and debugging a UNIX daemon

Monday, January 25th, 2010

There’s a number of tricks people use when profiling and debugging UNIX daemons.

The problem with a daemon is that it’s not generally connected to a user interface. It sits there in the background accepting requests through some communications medium, and sending responses back. By its very design, it’s meant to get out of the way and be unobtrusive.

So when you are trying to find out what a daemon is doing, in order to test it or to diagnose a problem, tricks are required.

The first trick is logging; make your daemon open up a file, or a channel to something like syslog, and log all sorts of internal events. This lets you extract a trace of events that occurred, in order, which is very useful; however, this can be complicated when your daemon has multiple worker processes or threads, as their outputs are interleaved. Also, it depends upon you putting logging commands in at the points you need them. If a fault occurs in a part of the system you’ve not instrumented in the way you need yet, you need to recompile it with more instrumentation, then attempt to recreate the fault. Also, all that logging takes up system resources, such as time and disk bandwidth; in the worst case, the logging can end up hogging disk bandwidth, and then the log writes start to block on available buffer space, which inserts large erratic delays into your daemon, possibly hiding the real issues or creating new ones.

As we are a distributed system, logging to files would be particularly annoying for us, as there would be a separate file on each server; correlating events between them would become unpleasant, and we’d need to drag them back to a central point for analysis in the first place. However, we have a multicast communication system in place anyway, so we use it for logging. Client and server processes all log into a special multicast group; then one or more monitoring nodes can run a special monitor that registers to receive the multicasts and spools them to disk in order (while also maintaining a ’scoreboard’ file listing all known servers, and their last-reported statistics). The multicasting is much lower-latency than a disk write, so the costs of logging are low, and unlikely to introduce long delays due to blocking on sparse disk bandwidth; the disk bandwidth used is on a dedicated machine. Also, we only log very interesting events by default; debug logging has to be turned on on a per-server basis, by enabling debug log flags.

The second trick is attaching gdb (or some other debugger) to a running daemon process, or to a core file from a crash. This is particularly excellent when your daemon has hung for some reason; once gdb is attached you can see what the current state of execution of all the daemon’s threads are, including all local and global variables. This very quickly tells you the nature of a deadlock; and it doesn’t require any planning ahead, beyond leaving debugging symbols on, which is cheap and easy. However, often the cause of a problem is far from where it’s detected; a pointer error can send a data structure awry and it only crashes some time later, in a perfectly valid and totally unrelated bit of code. A log file might be better at telling you where things went awry. gdb can also continue the daemon running, stopping when certain conditions are met, such as a variable changing in a certain way or a certain part of the code being reached, which can be similarly useful to logging – if you get gdb attached when the system is in a good state and you know a way to repeat the fault to get it into a bad state, you can watch what happens as the fault develops. However, if you don’t have a reproducible test case, this doesn’t help.

We use both tricks at GenieDB – but there’s another trick up our sleeve. Our daemon and all the GenieDB clients on a given server communicate via a shared memory segment; the primary function of this is so that all GenieDB processes on a node can share a sequence number counter, but it’s also used by the daemon to communicate global state to the clients easily; the list of tables and indices is there, so clients can access it easily, and read out information like the numbers of rows in each table without needing to communicate with the daemon process. However, we also use it to expose private state of the daemon (and instances of our client library inside client processes) for debugging purposes.

Many of the more interesting global variables of the daemon, for example, are placed into the shared memory area. We have a private tool that dumps them out, so we can check the internal state of the daemon without needing to attach gdb (and stopping the daemon in the process); this means we can run the tool under watch and eyeball them in another window while we run load tests.

But we can do more than that. Because updating a variable in shared memory is cheap, we can afford to do it lots – more often than we could justify logging statements. So we have a region in the shared memory for each client process and the daemon; and at all the interesting flow control points in the daemon and client library, we update an integer therein with the current line number, ORed with a few high bits to say which file it’s in. All of this is done with a macro, so we can leave it out of production builds.

Not only does this mean we can get a quick summary of the state of execution of all interesting processes, without interrupting them, in a single command; it also means we can sample this on a timer interrupt and accumulate a histogram, to tell where time is being spent when doing performance work.

But because updating and sampling these state variables is cheap, we’ve been inventive about what else we can track. At the core of the daemon is a thread that reads commands from a queue, and performs them; so we sample a high-resolution timer when reading from the queue then when processing a job. By taking differences, we know what proportion of the thread’s time is spent waiting for work, versus doing it. Taking a moving average of this gives us a ‘load percentage’, telling us how close to capacity this thread is, which is a good metric of how busy the node is, which goes in the shared memory segment. Along with the queue lengths.

This is a rather special queue, of course. As processing jobs from it is our overall bottleneck, we try and avoid putting jobs into it in the first place, if we can. Most of these jobs are updates to records on disk, it being a database. So the queue is indexed, meaning that when an update comes in from a client, we can consult the index to see if this update can be merged with an existing one. Perhaps two updates to the same record come along at once; the older of the two will be dropped. Perhaps a table is deleted while some updates to it are still in the queue. Perhaps two updates are to records that are stored together, in which case the two updates can be rolled into one larger one. Needless to say, we’re interested in how well this works, so we have counters for each case, in the shared memory segment; and our introspection tool lists them all, along with a computed percentage of all write operations successfully avoided. We record the timestamp at which various periodic internal processes occur, so we can see if they are overdue. We expose the various metrics the daemon uses to calculate how big its work backlog is to feed into the flow control system (the subject of a future blog post!), so we can see exactly what’s holding things up. And the list goes on.

Daemons like MySQL offer profiling and status-monitoring counters too, which you can get at via vendor-specific SQL queries. But all of that requires infrastructure; when we add a new counter, we declare it in the struct, make sure it’s initialised on system startup, and add a line to the introspection tool to print it out. We have to be careful about concurrent access to the shared state, but only if the counter is larger than a machine word (which the x86 platform guarantees can be atomically read and updated, even on SMP systems). We can afford to sample that a thousand times a second, if we want high-resolution profiling, while putting only a small load on the system. Try that with `show innodb status`!

Adding performance counters and the like is a painless task for us, so we do it a lot. This makes the internal state of the system highly transparent; not only can we hunt down problems and tune for performance more easily, we can quickly eyeball the numbers and see if everything is operating correctly during routine development. Currently, we’re not publishing the introspection tool for customer use; we expose the more interesting “overall load level” metrics via the aforementioned distributed cluster monitoring tool, which gathers them all at monitoring workstations once per second. But we’ll expose the underlying introspection tool, once we’ve documented it (which is the hard part) – advanced users doing advanced tuning will benefit from fine-grained metrics just as much as us developers do!

Eventual Consistency

Wednesday, January 13th, 2010

Most replicated databases provide ‘eventual consistency’.

The reason for this is simple: replication, by definition, means there are multiple copies of the data lying about. And whenever something is changed, that change has to be applied to all the copies. And since doing that takes time, there will inevitably be situations where some of the copies have performed a given update, and some haven’t. This is known as ‘inconsistency’ – the replicas not being all the same. Applications have to be written to expect this; if you issue an update to the database, you can’t expect to get the new data back if you select it immediately. If the user submits a form that changes how the next page is displayed, that page can’t just read the database to generate itself (as it might do when being visited *not* from the form); the page has to manually take account of the change to reflect it, as the database might not. And users get frustrated if they change something, but it doesn’t seem to ‘work’. They submit the change again and again, putting more load on your system.

Eventual consistency means that there may be inconsistency while updates are ‘in flight’, but once this (usually short) period is over, unless more updates arrive, the replicas will become consistent again; and any given change will ‘eventually’ happen everywhere.

It’s possible to avoid this; using transactions and a two-phase commit protocol or more advanced systems like Paxos, one can arrange that all transactions are globally serialized (or an illusion thereof), so no node can ever see an inconsistent database state. However, doing so puts load on the system due to the extra messages interchanged and the extra temporary storage required for in-progress transactions while they are being agreed, and harms application latency due to the extra synchronisation required before doing anything, and can harm throughput due to making things wait on locks, or retrying transactions due to optimistic concurrency control detecting a conflict – and it cannot be enforced at the same time as the system is running with a network partition; if a network link fails you have to either abandon consistency or bring all or part of the system to a halt until communication is restored. There can be no globally consistent state without a way of communicating globally!

So strictly consistent replication is only used for systems that need absolute consistency enough to justify the costs.

However, here at GenieDB, we’ve discovered a way to get strict consistency cheaply for most operations, while the network is fully connected; and when network links fail, we fall back to eventual consistency in a graceful manner. By “most operations”, I mean:

  • Any SELECT of a record by its primary key will consistently return the most recent state of that record, even if that is the deletion of the record
  • Any other SELECTs – by non-primary key columns, by range of primary keys, or whole table scans – will consistently return the most recent state of the records it finds (even if they are deleted), but may not see newly created records

The way it works is quite simple: whenever we update a record (including deletions), we hash the primary key to choose a ‘primary server’ for that record; and we synchronously send the new state of the record to that server. We start an asynchronous ‘eventual’ replication of the change to the nodes in the background, and return control to MySQL.

When we fetch a record by primary key, we ask the primary server for that record first to see if there’s a recent version; if so, we use that rather than reading from our local disk. If it’s down or unreachable, we just use our local copy after all – which might be inconsistent, but is better than nothing.

So once a write has returned as complete, a read will always return the new data, no matter if replication is still in progress.

When the database is read without knowing a specific primary key – those table scans and secondary-index scans – we consult our local disk to find eligible primary keys, then look them up on their primary servers. If the records have been deleted, or the secondary index field changed so they are no longer eligible for the query, they’re skipped. Otherwise, we return their consistent current values from the primary servers.

Distributing records by hashing is ordinarily used as a cache, such as memcache, to allow an application to retrieve information quickly without needing to send out relatively expensive SQL queries. As our core engine is a lot faster than a MySQL database, all this talking to primary servers actually harms our performance due to the network latency; so we allow the application to turn the consistency algorithm off on a per-operation basis, for situations where eventual consistency is good enough.

What we provide isn’t quite strict consistency; it’s strictly consistent for all but a few specific ‘tricky’ kinds of query, and even then, it’s almost strictly consistent; and even in the face of multiple network failures it can never be worse than eventual consistency. But that covers a lot of cases. Enough that, for many applications, they can be written to assume strict consistency, and the odds of a network outage being kept low enough that the cost of what amounts to a few outdated selects per year is negligible compare to the cost of engineering the application to cope with eventual consistency.

This kind of thing is the core of what we’re doing here at GenieDB – database technologies such as replication offer seductive advantages, but usually at some unpleasant cost. We’re finding ways to reduce those costs, to make the advantages of these technologies more widely available and easier to use. And where we can’t remove them all, we make them optional on as fine-grained a level as we can, so developers can pick and choose throughout their application!