A deadlock occurs when a cycle appears in the waits-for graph. That is, draw a graph with an edge from every process to the lock it's waiting for, and an edge from every lock to the process holding it (so the lock is conceptually waiting for the process to release it); the system is deadlock-free exactly as long as that graph remains acyclic.

A deadlock detector monitors this graph at runtime and takes some kind of remedial/recovery action if a cycle appears.

Deadlock detectors are common in the database world, where they're used to trigger transaction aborts to cope with sets of queries that inherently need to acquire locks in conflicting order.

We tend not to use deadlock detectors in kernels; instead we favor writing kernels with carefully structured locking hierarchies so deadlocks don't occur.

However, because it's easy to get those locking hierarchies wrong, a deadlock detector can still be a useful diagnostic/debugging tool for a kernel. There's no need to get into transaction rollbacks or anything; one can just print the lock cycle and panic.

There's a perception that deadlock detectors are inherently complicated and making one safely concurrent is hard. This turns out not to be true. For ordinary mutexes in a kernel, you only ever wait for one lock at a time, and a lock is only ever held by one process at a time. This makes the graph quite degenerate: not only are there no cycles, but the graph doesn't even branch. Checking for a deadlock just means, when acquiring a lock, searching forward through a single chain of edges to see if you find yourself.
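The chain walk described above can be sketched in a few lines. This is a hypothetical illustration, not NetBSD or OS/161 code: the type and field names (struct proc, struct lock, p_waiting_for, l_holder, would_deadlock) are made up, and the check is assumed to run under the detector's global lock just before a process blocks.

```c
/* Sketch of the degenerate-graph deadlock check: each process waits for
 * at most one lock, and each lock has at most one holder, so the
 * waits-for graph is a simple chain. All names here are illustrative. */

#include <assert.h>
#include <stddef.h>

struct lock;

struct proc {
	struct lock *p_waiting_for;	/* lock this process is blocked on, or NULL */
};

struct lock {
	struct proc *l_holder;		/* process holding the lock, or NULL */
};

/*
 * Called just before process 'p' blocks on lock 'l': follow the single
 * chain of edges forward and see whether we reach ourselves.
 */
static int
would_deadlock(struct proc *p, struct lock *l)
{
	struct proc *holder = l->l_holder;

	while (holder != NULL) {
		if (holder == p)
			return 1;	/* cycle: we'd be waiting on ourselves */
		if (holder->p_waiting_for == NULL)
			return 0;	/* chain ends: the holder is runnable */
		holder = holder->p_waiting_for->l_holder;
	}
	return 0;			/* lock is free */
}
```

For example, if process A holds L1 and blocks on L2 while process B holds L2, then B attempting to acquire L1 walks the chain L1 → A → L2 → B and finds itself.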

dholland@ deployed such a deadlock detector in the OS/161 teaching OS some years back; it was quite successful and came in at under 300 lines of code.

The first step of this project is to implement such a deadlock detector for NetBSD as a kernel compile-time option. You are welcome to (should) crib from the OS/161 implementation, which is called "hangman" and can easily be found in the OS/161 code. (See https://www.os161.org.) The primary difficulty lies in understanding and modifying NetBSD's locking primitives, which are considerably more complicated than OS/161's. This much on its own will be quite useful.

The second step is then to extend the deadlock detector to handle reader/writer locks. This makes things more complicated: multiple readers can hold such a lock at once, so the graph can now branch at reader/writer locks. That requires extending the per-lock data structure and using a more complex search that avoids repeating work. However, neither of these points is especially difficult; again, the primary difficulty is dealing with the low-level concurrent innards of the system.
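The branching search might look something like the following sketch. Again, all names are hypothetical, the fixed-size reader array stands in for whatever per-lock reader tracking the real implementation would need, and the generation-counter scheme for marking visited processes (so branches aren't re-explored) is just one possible approach, assumed to run under the detector's global lock.

```c
/* Sketch of the extended check for reader/writer locks. The waits-for
 * graph can branch at a lock held by multiple readers, so we do a
 * depth-first search, marking visited processes with a generation
 * counter to avoid repeating work. All names are illustrative. */

#include <assert.h>
#include <stddef.h>

#define MAXREADERS 8	/* placeholder; real code needs dynamic tracking */

struct rwlock;

struct proc {
	struct rwlock *p_waiting_for;	/* lock blocked on, or NULL */
	unsigned p_seen;		/* == search_generation if visited */
};

struct rwlock {
	struct proc *rw_writer;		/* exclusive holder, or NULL */
	struct proc *rw_readers[MAXREADERS];
	int rw_nreaders;
};

static unsigned search_generation;	/* protected by the global lock */

static int lock_leads_to(struct rwlock *, struct proc *);

static int
holder_leads_to(struct proc *holder, struct proc *self)
{
	if (holder == self)
		return 1;		/* found a cycle back to ourselves */
	if (holder->p_seen == search_generation)
		return 0;		/* branch already explored */
	holder->p_seen = search_generation;
	if (holder->p_waiting_for == NULL)
		return 0;		/* this holder is runnable */
	return lock_leads_to(holder->p_waiting_for, self);
}

static int
lock_leads_to(struct rwlock *l, struct proc *self)
{
	int i;

	/* The graph branches here: follow every holder of the lock. */
	if (l->rw_writer != NULL && holder_leads_to(l->rw_writer, self))
		return 1;
	for (i = 0; i < l->rw_nreaders; i++)
		if (holder_leads_to(l->rw_readers[i], self))
			return 1;
	return 0;
}

static int
would_deadlock(struct proc *p, struct rwlock *l)
{
	search_generation++;		/* invalidate all old p_seen marks */
	return lock_leads_to(l, p);
}
```

For instance, if A and B both hold L1 as readers, A blocks waiting to write L2, and the writer holding L2 then tries to acquire L1, the search branches at L1's two readers and finds the cycle through A.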

Note that deadlocks involving condition variables are not handled and not expected to be handled; that is a substantially harder problem, beginning with identifying exactly what a process blocked on a condition variable is actually waiting for.

Also note that the OS/161 deadlock detector uses a single global lock to protect itself, and you should do the same.

(Given enough CPUs scaling will still eventually become a problem. It should be possible to switch to atomic ops in at least some places. However, get it running first. Note that the performance overhead only has to beat LOCKDEBUG to look reasonable, and that's a very low bar.)