There’s a number of tricks people use when profiling and debugging UNIX daemons.
The problem with a daemon is that it’s not generally connected to a user interface. It sits there in the background accepting requests through some communications medium, and sending responses back. By its very design, it’s meant to get out of the way and be unobtrusive.
So when you are trying to find out what a daemon is doing, in order to test it or to diagnose a problem, tricks are required.
The first trick is logging; make your daemon open up a file, or a channel to something like syslog, and log all sorts of internal events. This lets you extract a trace of events that occurred, in order, which is very useful; however, this can be complicated when your daemon has multiple worker processes or threads, as their outputs are interleaved. Also, it depends upon you putting logging commands in at the points you need them. If a fault occurs in a part of the system you’ve not instrumented in the way you need yet, you need to recompile it with more instrumentation, then attempt to recreate the fault. Also, all that logging takes up system resources, such as time and disk bandwidth; in the worst case, the logging can end up hogging disk bandwidth, and then the log writes start to block on available buffer space, which inserts large erratic delays into your daemon, possibly hiding the real issues or creating new ones.
As we are a distributed system, logging to files would be particularly annoying for us, as there would be a separate file on each server; correlating events between them would become unpleasant, and we’d need to drag them back to a central point for analysis in the first place. However, we have a multicast communication system in place anyway, so we use it for logging. Client and server processes all log into a special multicast group; then one or more monitoring nodes can run a special monitor that registers to receive the multicasts and spools them to disk in order (while also maintaining a ’scoreboard’ file listing all known servers, and their last-reported statistics). The multicasting is much lower-latency than a disk write, so the costs of logging are low, and unlikely to introduce long delays due to blocking on sparse disk bandwidth; the disk bandwidth used is on a dedicated machine. Also, we only log very interesting events by default; debug logging has to be turned on on a per-server basis, by enabling debug log flags.
The second trick is attaching gdb (or some other debugger) to a running daemon process, or to a core file from a crash. This is particularly excellent when your daemon has hung for some reason; once gdb is attached you can see what the current state of execution of all the daemon’s threads are, including all local and global variables. This very quickly tells you the nature of a deadlock; and it doesn’t require any planning ahead, beyond leaving debugging symbols on, which is cheap and easy. However, often the cause of a problem is far from where it’s detected; a pointer error can send a data structure awry and it only crashes some time later, in a perfectly valid and totally unrelated bit of code. A log file might be better at telling you where things went awry. gdb can also continue the daemon running, stopping when certain conditions are met, such as a variable changing in a certain way or a certain part of the code being reached, which can be similarly useful to logging – if you get gdb attached when the system is in a good state and you know a way to repeat the fault to get it into a bad state, you can watch what happens as the fault develops. However, if you don’t have a reproducible test case, this doesn’t help.
We use both tricks at GenieDB – but there’s another trick up our sleeve. Our daemon and all the GenieDB clients on a given server communicate via a shared memory segment; the primary function of this is so that all GenieDB processes on a node can share a sequence number counter, but it’s also used by the daemon to communicate global state to the clients easily; the list of tables and indices is there, so clients can access it easily, and read out information like the numbers of rows in each table without needing to communicate with the daemon process. However, we also use it to expose private state of the daemon (and instances of our client library inside client processes) for debugging purposes.
Many of the more interesting global variables of the daemon, for example, are placed into the shared memory area. We have a private tool that dumps them out, so we can check the internal state of the daemon without needing to attach gdb (and stopping the daemon in the process); this means we can run the tool under watch and eyeball them in another window while we run load tests.
But we can do more than that. Because updating a variable in shared memory is cheap, we can afford to do it lots – more often than we could justify logging statements. So we have a region in the shared memory for each client process and the daemon; and at all the interesting flow control points in the daemon and client library, we update an integer therein with the current line number, ORed with a few high bits to say which file it’s in. All of this is done with a macro, so we can leave it out of production builds.
Not only does this mean we can get a quick summary of the state of execution of all interesting processes, without interrupting them, in a single command; it also means we can sample this on a timer interrupt and accumulate a histogram, to tell where time is being spent when doing performance work.
But because updating and sampling these state variables is cheap, we’ve been inventive about what else we can track. At the core of the daemon is a thread that reads commands from a queue, and performs them; so we sample a high-resolution timer when reading from the queue then when processing a job. By taking differences, we know what proportion of the thread’s time is spent waiting for work, versus doing it. Taking a moving average of this gives us a ‘load percentage’, telling us how close to capacity this thread is, which is a good metric of how busy the node is, which goes in the shared memory segment. Along with the queue lengths.
This is a rather special queue, of course. As processing jobs from it is our overall bottleneck, we try and avoid putting jobs into it in the first place, if we can. Most of these jobs are updates to records on disk, it being a database. So the queue is indexed, meaning that when an update comes in from a client, we can consult the index to see if this update can be merged with an existing one. Perhaps two updates to the same record come along at once; the older of the two will be dropped. Perhaps a table is deleted while some updates to it are still in the queue. Perhaps two updates are to records that are stored together, in which case the two updates can be rolled into one larger one. Needless to say, we’re interested in how well this works, so we have counters for each case, in the shared memory segment; and our introspection tool lists them all, along with a computed percentage of all write operations successfully avoided. We record the timestamp at which various periodic internal processes occur, so we can see if they are overdue. We expose the various metrics the daemon uses to calculate how big its work backlog is to feed into the flow control system (the subject of a future blog post!), so we can see exactly what’s holding things up. And the list goes on.
Daemons like MySQL offer profiling and status-monitoring counters too, which you can get at via vendor-specific SQL queries. But all of that requires infrastructure; when we add a new counter, we declare it in the struct, make sure it’s initialised on system startup, and add a line to the introspection tool to print it out. We have to be careful about concurrent access to the shared state, but only if the counter is larger than a machine word (which the x86 platform guarantees can be atomically read and updated, even on SMP systems). We can afford to sample that a thousand times a second, if we want high-resolution profiling, while putting only a small load on the system. Try that with `show innodb status`!
Adding performance counters and the like is a painless task for us, so we do it a lot. This makes the internal state of the system highly transparent; not only can we hunt down problems and tune for performance more easily, we can quickly eyeball the numbers and see if everything is operating correctly during routine development. Currently, we’re not publishing the introspection tool for customer use; we expose the more interesting “overall load level” metrics via the aforementioned distributed cluster monitoring tool, which gathers them all at monitoring workstations once per second. But we’ll expose the underlying introspection tool, once we’ve documented it (which is the hard part) – advanced users doing advanced tuning will benefit from fine-grained metrics just as much as us developers do!