It seems almost fashionable to write about concurrency bugs these days. Books like Coders at Work and Beautiful Code contain harrowing accounts of Real Programmers finding subtle concurrency bugs in vast code bases.
Concurrency bugs are difficult to find, because they tend to involve the interaction between potentially widely-separated bits of code in unpredictable ways – and often only when several independent evens happen to coincide within narrow timing windows.
As we’re writing a distributed database here at GenieDB, we’re the kinds of people who are wise in the art of concurrency bugs, so we generally avoid them in the first place by only introducing concurrency where it is beneficial to do so. Obviously, the nodes in a distributed system execute concurrently, so we’ve kept the protocols for interacting between them simple and easy to analyse, and done our best to minimise state dependencies between them. This means that the ordering and timing of events should be unable to reveal unpleasant surprises. Within our software on each node, we’ve introduced concurrency only where we have to, in order to avoid blocking or to maximise performance. The interfaces between threads is very carefully marked, with the minimum of shared state; and we carefully reason about, and document, the expectations for managing that shared state, to avoid slip-ups in future.
However, this does not make us smugly immune to concurrency bugs; we fixed one yesterday that had been puzzling us for several days.
One of our daemons, when run within the test cluster harness, was sometimes dying while starting up. No core dump, no error message, nothing – it just disappeared midway through initialisation. We’d never had this problem before; it was only happening in the new Solaris port. Annoyingly, it was very timing-sensitive; running it under tools like truss or gdb made it always start up successfully, so we couldn’t see what was going wrong. Some of the logging we put in for tracing had to be taken out, as the problem never surfaced with it in place. It would always start up flawlessly from the command prompt; the problem only happened when it was run through our cluster testing harness, which ssh-ed into a group of servers and started the daemons on each.
But through a slow (and infuriating) process of deduction, we realised it was dying shortly after daemonising. In fact, the call to our daemonise function didn’t seem to return. Some careful logging of the return value of fork() led us to the horrifying conclusion that fork() would return in the parent – with a PID for the child – but it would never return in the child.
At which point it clicked. The process of daemonisation under UNIX is well documented in various places. The general procedure is to fork off a child process that becomes the actual daemon, then to terminate the parent. This means that the command that starts the daemon terminates, causing the shell or other process that started it to continue, while the daemon child process continues in the background. The child process then takes steps to isolate itself from the context of the parent (setsid(), a second fork, changing directory to the root, etc).
However, this standard process has a slight bug; for whatever reason, as it was timing sensitive, we noticed it on Solaris – but it was, presumably, just as present on Linux, the BSDs, and elsewhere.
Our daemon parent process was being invoked via ssh; so as soon as it terminated, the ssh connection terminated. If you are running a process in an ssh session and you close that session from your end – by killing the ssh process, losing the network connection, etc – then the ssh daemon on the server will send a “HUP” signal to the “process group” of the shell session. HUP is short for “hangup”, from the days of modem-driven dial in sessions; it is intended to tell a process that its connection to the user using it has gone away. By default, processes spawned from your ssh shell are all in the same process group; and by default, the HUP signal kills the processes; so if your ssh connection dies, then your processes on the server are killed off. Which is all well and good.
One of the steps performed in the child process of a daemon, immediately after it has been created by fork(), is to split away from the parent process group. This is so the daemon will not be killed off if the shell that spawned it happens to be terminated by a connection being closed; as a daemon, it does not interact with the user any further, and it should be a part of the system as a whole rather than the starting user’s session. However, it cannot form a new process group until after it’s forked, as the child process needs to exist before it can be assigned to a new group. So setsid() occurs immediately after fork() has returned in the child process.
What we were seeing is that the parent process forked and then terminated, and so the ssh daemon closed the session down, and thereby sent a HUP signal to the entire process group; but this all happened before the child process had ever had a chance to run. So it didn’t get a chance to break away from the process group. It did return from the fork() call – straight into the pending HUP signal, which kills it, despite the very next line of code being a setsid() that would have saved it.
Once we realised this might be the case, we made the parent process defined a handler for HUP signals that just prints a message, before it fork()s (so the handler is inherited by the child). Sure enough, the child process often received a HUP signal before getting to setsid(), immediately upon returning from fork(). So merely ignoring the signal, rather than letting it proceed to kill the process, made our problem go away.
So it seems the standard approach to daemonisation has a minor bug; there’s a time window where the parent process has terminated, but the child is not yet entirely isolated from the parent’s process group. Therfore, the child will be killed if the session terminates during this time interval. We only saw it because we were starting a daemon from ssh, which is a rare but not unheard of thing (particularly in automatically managed clusters). We suspect that the problem may have gone unnoticed in daemons that interpret the HUP signal as a request to re-read configuration files. If such a daemon happens to have installed their HUP signal handler before forking, then a signal arriving during daemonisation would just the “reload the configuration” flag to be set, causing a spurious but harmless re-reading of the configuration as soon as the daemon starts.
So here we are, developing a distributed highly-parallel database; and we find a concurrency bug in a standard practice that has existed within Unix for ages. Concurrency bugs can arise in the least expected of places…