Lock timeouts in BDB

When somebody reports a bug in software you’re responsible for, the worst thing that can possibly happen is to find that it’s somebody else’s fault.

If it’s your fault, you see, then you can fix it! And since it’s a problem in code you wrote, it’s usually not too hard for you to figure out what’s wrong in the first place.

But if it’s not immediately obvious what could be going wrong, and the trail of diagnosis seems to be leading into a third-party component you use, then a dull, throbbing headache sets in. Bugs in third-party components are often hard to trace (as you’re not familiar with their code, even if you have access to the source), and hard to fix.

So we were delighted when we approached Oracle with the suspicion that the deadlocks in our software might be Berkeley DB’s fault, and they promptly replied with some suggestions as to what to try – and when we subsequently produced an isolated test case that demonstrated the problem, they came back to us the next working day with a diagnosis of the problem and a patch!

The issue was simple: we’re using lock timeouts as part of our deadlock-handling strategy. We want locks to time out in the core server, but never in clients, as they may not be able to restart their transactions. So we set a lock timeout of a few seconds in the server (which never performs operations lasting more than a second or so anyway), and set a timeout of zero (meaning “never time out”) in the clients.
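
To make that concrete, here is a minimal sketch of the per-process choice. It is not our actual code: the role_t enum, the function name and the three-second value are made up for illustration (all we have said is "a few seconds"). The one thing taken from Berkeley DB is that DB_ENV->set_timeout takes its timeout in microseconds, with the DB_SET_LOCK_TIMEOUT flag selecting the lock timeout.

```c
#include <stdint.h>

/* Hypothetical process roles, named here only for illustration. */
typedef enum { ROLE_SERVER, ROLE_CLIENT } role_t;

/* Illustrative value: the server uses "a few seconds"; 3 is an assumption. */
#define SERVER_LOCK_TIMEOUT_SECS 3u

/*
 * Return the lock timeout for a role in microseconds, the unit
 * Berkeley DB uses for timeouts. Zero means "never time out".
 */
static uint32_t lock_timeout_usec(role_t role)
{
    if (role == ROLE_SERVER)
        return SERVER_LOCK_TIMEOUT_SECS * 1000000u;
    return 0; /* clients wait indefinitely */
}

/*
 * With an open environment handle (db.h included), this would be
 * applied along the lines of:
 *
 *     dbenv->set_timeout(dbenv, lock_timeout_usec(role), DB_SET_LOCK_TIMEOUT);
 */
```

The helper is kept free of db.h so that it compiles on its own; in a real program the value would be handed straight to the environment handle when it is set up.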

But we were seeing deadlocked systems, with the server’s lock failing to time out. The fact that db_stat showed the lock sitting there with an expired timeout pointed the finger of blame at Berkeley DB itself.

So we produced a small standalone application that used Berkeley DB in exactly the same way we did at the point where it got stuck, and lo, managed to reproduce it.

Sending this test app to Oracle, we were amazed to have an explanation of the problem, plus a patch, within about three hours!
