Since migrating to Redis for data storage and sharing and to Docker for running SWISH, the system has been very unstable. After fixing some issues with the usage of Redis, one issue remained: frequent deadlocks. I finally managed to figure out that the root cause was a lock ordering issue in transaction/1 during the commit phase.
The health checking system should now also handle possible future deadlocks, reducing the downtime to about 1 minute in case of a deadlock.
Hopefully normal operation is restored. Still more work needs to be done to make multiple copies of SWISH act as backup for each other. That may take some time as there are some more pressing issues to take care of. For now we have an independent backup at https://swish2.swi-prolog.org and (thus) a fast recovery option should the old server running the primary instance die.
P.S. Is anyone aware of static analysis tools (for C) that can find lock ordering issues? In this case we had one thread locking L1 and then L2 and another locking L2 and then L1. Static analysis should be able to find that.
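For readers who have not run into this before, here is a minimal sketch of the usual remedy for the L1/L2 pattern: always take the two locks in one global order. It is an illustration only, not the actual fix made in transaction/1. The names `L1`, `L2`, `lock_pair`, `unlock_pair`, `worker_a`, `worker_b` and `run_workers` are hypothetical, and ordering by address is just one possible convention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two locks called L1 and L2 above. */
static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;

/* Acquire both mutexes in a fixed global order (lowest address first),
 * regardless of the order in which the caller names them. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Unlock order does not matter for deadlock avoidance. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

/* Names the locks as L1, L2. */
static void *worker_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        lock_pair(&L1, &L2);
        shared_counter++;
        unlock_pair(&L1, &L2);
    }
    return NULL;
}

/* Names the locks as L2, L1; lock_pair still acquires them in the same order. */
static void *worker_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        lock_pair(&L2, &L1);
        shared_counter++;
        unlock_pair(&L2, &L1);
    }
    return NULL;
}

/* Run both workers concurrently and return the final counter value. */
static int run_workers(void)
{
    pthread_t ta, tb;
    int rc;

    shared_counter = 0;
    rc = pthread_create(&ta, NULL, worker_a, NULL);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, worker_b, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return shared_counter;
}
```

If `worker_b` called `pthread_mutex_lock(&L2)` followed by `pthread_mutex_lock(&L1)` directly, the two threads could each end up holding one lock and waiting for the other. Build with `-pthread`.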
It could certainly help to try and run the codebase using TSAN. If I understand the sanitizer family correctly though, they are runtime checks. If we get into this situation we’ll deadlock anyway, and it isn’t that hard to find the cause of a deadlock using gdb. It was one of the first times I had to deal with an ordering issue though. Most times it is simply an attempt to grab the same lock twice in the same thread. Then a simple stack trace is enough. Now we have two threads. That is still easy: find an arbitrary deadlocked thread, find the mutex it is waiting for, and print its details, which tells us which thread is holding it. Dump the stack of that thread, find that it is trying to lock another mutex, dump the details thereof, find the thread holding that one and print its backtrace. Judging from my experience with the address sanitizer, TSAN will probably do that work for me.
What I’m after though is the case where we have a function locking L1 and calling a function that locks L2 inside the locked region, and another function locking L2 and calling a function that locks L1 inside the locked region. Static analysis should be able to find that these two functions should not run concurrently.
Yes, they are runtime checks, but they don’t require an actual deadlock to fire; I think TSAN finds potential deadlocks. You need to add annotations to make it useful (e.g., GUARDED_BY(mutex)). It’s been a long time since I’ve used it, so I don’t remember the details. I found this documentation, and there’s probably better stuff but Google search failed me: Thread Safety Analysis (Clang 16.0.0git documentation)
My recollection is that TSAN is intended for unit tests; I don’t know how well it runs on a full “production” system.
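To make the "potential deadlock" idea concrete, here is a toy sketch of how a runtime checker can flag a lock order inversion without any thread ever blocking: it records which lock was taken while which other lock was held, and reports when a new acquisition would close a cycle in that graph. This is not how TSAN is implemented. All names (`note_acquire`, `note_release`, `MAX_LOCKS`) are hypothetical, locks are plain integer ids, and the bookkeeping is deliberately single-threaded, with the "threads" simulated by calling the functions in sequence.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 8

/* edge[a][b] is true if lock b was ever acquired while lock a was held. */
static bool edge[MAX_LOCKS][MAX_LOCKS];

/* Locks currently held by the (simulated) running thread. */
static int held[MAX_LOCKS];
static int nheld = 0;

/* Depth-first search: is there a recorded path from `from` to `to`? */
static bool reaches(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_LOCKS; next++) {
        if (edge[from][next] && !seen[next] && reaches(next, to, seen))
            return true;
    }
    return false;
}

/* Record that lock `id` is being acquired. Returns true if this
 * acquisition closes a cycle in the lock order graph, i.e. some earlier
 * code path took these locks in the opposite order. */
static bool note_acquire(int id)
{
    bool inversion = false;

    assert(id >= 0 && id < MAX_LOCKS && nheld < MAX_LOCKS);
    for (int i = 0; i < nheld; i++) {
        bool seen[MAX_LOCKS] = { false };

        /* A path id -> held[i] plus the new edge held[i] -> id is a cycle. */
        if (reaches(id, held[i], seen))
            inversion = true;
        edge[held[i]][id] = true;
    }
    held[nheld++] = id;
    return inversion;
}

/* Record that lock `id` is released (removes its most recent entry). */
static void note_release(int id)
{
    for (int i = nheld - 1; i >= 0; i--) {
        if (held[i] == id) {
            for (int j = i; j < nheld - 1; j++)
                held[j] = held[j + 1];
            nheld--;
            return;
        }
    }
}
```

Replaying the scenario from the first post, with ids 0 and 1 standing in for L1 and L2: acquiring 0 then 1 and releasing both records the edge 0 -> 1; a later acquisition of 1 followed by 0 makes `note_acquire(0)` return true, even though nothing ever blocked.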