Difficult to reproduce problems while running tests

It’s one of those. Just yesterday I was able to make the swipl:thread tests hang on my computer several times in a row. After building the latest master 5a7a135246e0c62237464bfacd74e9a1f6aed60c I did:

$ ( cd build && ctest -R "thread" )
Test project /home/boris/install/swipl-devel/build
    Start 26: swipl:thread
^C

and this would consume three of the four cores on my computer (top shows a swipl process at ~300% CPU) for many minutes before killing it manually. I can no longer reproduce this.

While I was just building and running all tests with ctest -j 4, before I noticed thread hanging, I also had language_server:language_server fail once; I could not reproduce that even once.

I use the following cmake command:

cmake -DSWIPL_PACKAGES_JAVA=OFF -DCMAKE_BUILD_TYPE=PGO -G Ninja ..

Does the PGO introduce some non-determinism to the compilation? (I did re-compile everything between yesterday, then the tests were hanging, and today.)

I will now add --output-on-failure to ctest in my build script and hopefully next time I have something slightly more useful to report. Sorry for the noise, posting this in the hope that someone else with a similar problem realize that they are not alone.

I would be really surprised. Some of the thread tests do have varying run times. Multiple minutes requires rather slow hardware though. I have these environment settings to control ctest:

export CTEST_PARALLEL_LEVEL=16
export CTEST_OUTPUT_ON_FAILURE=y

That sets the defaults, so just typing ctest does the job. The best parallel level depends a lot on your hardware. If you report, please include details (hardware, OS, compiler used and of course the Prolog version).

2 Likes

I have also noticed this problem (with 8.3.28) and with earlier version 8.3.7, as I reported here:

Ann: SWI-Prolog 8.3.7 - #5 by swi.

In between 8.3.7 and 8.3.28 I did not notice the problem.

This one is hard to reproduce, as it often is with thread deadlock problems.

Seems to be the same or similar. Since I cannot reproduce it easily there is nothing to be done at the moment. :frowning:

Actually I can reproduce it; the trick was to delete the build directory and build from scratch. I don’t have time now but I will sift through the linked thread and try to make a better bug report later this week. Thank you @swi for bringing my attention to the thread, I would have missed it.

1 Like

Reproduces for me too…thanks

In general doesn’t seem to reproduce. Both the PPA builds and optionally the ones on dev.swi-prolog.org are build from scratch and test fine. Same applies to my release builds on MacOS and linux. Can you give details (hardwarde, OS, etc)?

In my case it is an Intel I7 4 cores (8 with hyperthreading) 24 GB RAM, gcc 11.1, glibc 2.33-5; running archlinux at the moment.

I think I see a pattern here.

My processor is Intel(R) Core™ i5-6200U CPU @ 2.30GHz.
OS is also (updated) Archlinux.
GCC is also 11.1.0
I do have gperftools installed which should provide tcmalloc. Indeed, I see:

?- current_prolog_flag(malloc, X).
X = tcmalloc.

SWI-Prolog version reported as threaded, 64 bits, version 8.3.28-8-g5a7a13524

I haven’t had time to read through the linked thread carefully, yet.

Does your system also provide gcc-10? If so, could you test with that version?

Second step is to run this to get the commandline arguments

ctest -V -R thread

Now run the test command without the “-q”. That prints the scripts that are executed by this test. Find the one hanging and then do

src/swipl ../src/Tests/thread/test_XYZ.pl
?- test_XYZ.

Now if some test hangs you can use ^C to find out which.

Archlinux is a “rolling release” distro. I can however build gcc-10 from source I guess. I will do the other things you suggest when I get my hands on my laptop (where this is happening).

Another thing that I will try is to install archlinux on a virtual machine in VirtualBox (on a different host) and see if the same problem reproduces.

Not sure that is worth the trouble. The most likely cause is unexpected instruction reordering that causes one of the lock-free synchronizations to malfunction. After identifying where this happens we simply have to add one of the memory synchronization calls to make it clear to the compiler it cannot change the order.

That can be related to the compiler and the optimization options (including PGO). It must disappear when compiled without optimization.

Ubuntu 20.04 doesn’t have gcc-11. I’ll try running a newer version in a Docker container.

edit: tried on Fedora 34 in a Docker comtainer running on top of Ubuntu 20.04, gcc 11.1.1. AMD 3950X. PGO build. Works fine. Tested by running the command below for a while with various cpu-list arguments to taskset to force different numbers of active cores (1, 2, 4, and all 32). Bugs like this tend reproduce better with fewer cores, causing threads to preempt at arbitrary points.

taskset -c 0 bash -c "while ctest -V -R thread; do true; done"

I think the CPU may matter.

Running with gcc 11.1, pgo Intel I7, I get the following:

/tmp/swipl-devel/build.release/src/swipl "-f" "none" "--no-packs" "--on-error=status"  "/tmp/swipl-devel/src/test.pl" "--no-core" "thread"
% SWI-Prolog test suite.
% To run all tests run ?- test.
% 
Running scripts from thread (agc2.pl)[0.005sec].(agc3.pl)[0.006sec].(agc.pl)[0.071sec].(pooltest.pl)[0.002sec].(qpattern.pl)[0.003sec].(queue_create.pl)[0.062sec].(queue_gc.pl)........[0.011sec].(queue_max_size.pl)[0.023sec].(queue_race.pl)[0.101sec].(queue_resource.pl)[0.002sec].(queue_select.pl).[0.001sec].(queue_send.pl)[0.011sec].(queue_timeout.pl)....[0.124sec].(test_agc_callback.pl)[1.103sec].(test_agc_deadlock.pl)^C

so it looks like it’s hanging on test_agc_deadlock.pl

digging deeper:

$ src/swipl ../src/Tests/thread/test_agc_deadlock.pl 
1 ?- test_agc_deadlock.
^CAction (h for help) ? C-stack trace labeled "INT":
  [0] save_backtrace() at :? [0x7f8df5deb924]
  [1] PL_interrupt() at ??:? [0x7f8df5d97952]
  [2] dispatch_signal() at :? [0x7f8df5e36423]
  [3] handleSignals___LD() at :? [0x7f8df5e12a69]
  [4] pl_thread_join2_va() at pl-thread.c:? [0x7f8df5da8d74]
  [5] PL_next_solution___LD() at :? [0x7f8df5df4efe]
  [6] query_loop() at :? [0x7f8df5e309b8]
  [7] prologToplevel() at :? [0x7f8df5e3084b]
  [8] PL_toplevel() at ??:? [0x7f8df5e455eb]
  [9] src/swipl(+0x10b1) [0x558a089340b1]
  [10] __libc_start_main() at ??:? [0x7f8df5b8ab25]
  [11] src/swipl(+0x10ee) [0x558a089340ee]
Action (h for help) ? goals
     [14] thread_join(<thread>(5,0x558a0b1fd4a0), _61092)
     [12] test_agc_deadlock:'__aux_maplist/2_join+0'([<thread>(5,0x558a0b1fd4a0), <thread>(6,0x558a0b292aa0), <thread>(7,0x558a0b292a50), <thread>(8,0x558a0b292960)])
Action (h for help) ? 

I installed gcc10, and tried it:

1 ?- test_agc_deadlock.
^CAction (h for help) ? C-stack trace labeled "INT":
  [0] save_backtrace() at :? [0x7effc674e2c9]
  [1] PL_interrupt() at ??:? [0x7effc66f9cab]
  [2] dispatch_signal() at :? [0x7effc679b2fb]
  [3] handleSignals___LD() at :? [0x7effc6776b59]
  [4] pl_thread_join2_va() at pl-thread.c:? [0x7effc670b161]
  [5] PL_next_solution___LD() at :? [0x7effc6758682]
  [6] query_loop() at :? [0x7effc67958ca]
  [7] prologToplevel() at :? [0x7effc679575b]
  [8] PL_toplevel() at ??:? [0x7effc67aa3db]
  [9] src/swipl(+0x10b1) [0x55a701c770b1]
  [10] __libc_start_main() at ??:? [0x7effc64ecb25]
  [11] src/swipl(+0x10ee) [0x55a701c770ee]
Action (h for help) ? goals
     [12] thread_join(receiver)
     [11] test_agc_deadlock:test_agc_deadlock(4)
Action (h for help) ? goals
     [12] thread_join(receiver)
     [11] test_agc_deadlock:test_agc_deadlock(4)
Action (h for help) ? abort
% Execution Aborted
2 ?- current_prolog_flag(c_cc,V).
V = '/usr/sbin/gcc-10'.

Still hangs with gcc10, but backtrace is a little different.

Tried it again with gcc 11.1 (to see if I get the same backtrace):

1 ?- current_prolog_flag(c_cc,V).
V = '/usr/sbin/cc'.  % this is gcc 11.1

2 ?- test_agc_deadlock.
^CAction (h for help) ? C-stack trace labeled "INT":
  [0] save_backtrace() at :? [0x7f4fe0a2ea14]
  [1] PL_interrupt() at ??:? [0x7f4fe09da952]
  [2] dispatch_signal() at :? [0x7f4fe0a794b3]
  [3] handleSignals___LD() at :? [0x7f4fe0a55b59]
  [4] pl_thread_join2_va() at pl-thread.c:? [0x7f4fe09ebd77]
  [5] PL_next_solution___LD() at :? [0x7f4fe0a37fee]
  [6] query_loop() at :? [0x7f4fe0a73a48]
  [7] prologToplevel() at :? [0x7f4fe0a738db]
  [8] PL_toplevel() at ??:? [0x7f4fe0a8869b]
  [9] src/swipl(+0x10b1) [0x55de9ce620b1]
  [10] __libc_start_main() at ??:? [0x7f4fe07cdb25]
  [11] src/swipl(+0x10ee) [0x55de9ce620ee]
Action (h for help) ? goals
     [12] thread_join(receiver)
     [11] test_agc_deadlock:test_agc_deadlock(4)

gcc11.1 backtrace now matches gcc10 backtrace (perhaps because I interrupted too early the first time). Still hangs.

Hope this helps a little.

Thanks. Is this hanging every time? I’m now running a loop using

swipl test_agc_deadlock.pl
?- between(1, 100000, X), test_agc_deadlock, writeln(X), fail.

Just passed 300 iterations …

How may cores do you have (?- current_prolog_flag(cpu_count, Count).)?

edit: No luck. Nor on my dev machine (AMD3950X, Ubuntu 20.04, gcc 9.3), tried with several different combinations of concurrency, neither on my good old Intel NUC (Ubuntu 20.10, gcc 10.3, i7-5557U cpu (dual core hyper threaded).

Yes, hangs every time.

3 ?- current_prolog_flag(cpu_count, Count).
Count = 8.

I think you’ll be able to reproduce it if you try it on archlinux.

If only it would be easier to setup an archlinux installation. I tried to install it on a virtual machine yesterday in a hurry (at work during the break) and I simply ran out of time. I will now try to download a ready image for VirtualBox, install SWI-Prolog there and see if the problem reproduces…

PS: I am now noticing that archlinux gcc 11.1.0 is already marked as out of date. There is a bunch of regressions in 11.1.0 it seems.

PPS: Does not reproduce in the virtual machine. It might be processor-specific after all.

I think the most likely candidates that affect this are the C compiler (and flags used), the C runtime library and the CPU (including number of cores).

Note that there is a Docker image for arch linux. That is typically the simplest way to setup a test environment for a different Linux distro. It does (of course) run on the kernel of the host system. That is rarely an issue.

I was able to reproduce it inside a docker container (within archlinux):

$ docker run -it archlinux
root@3ba8877f973d /]# cd /tmp
[root@3ba8877f973d tmp]# pacman -Sy gcc ninja cmake pkgconfig 
...
[root@3ba8877f973d tmp]# curl https://www.swi-prolog.org/download/devel/src/swipl-8.3.28.tar.gz -o  s.tgz
[root@3ba8877f973d tmp]# cd swipl-8.3.28/
[root@3ba8877f973d swipl-8.3.28]# mkdir build
[root@3ba8877f973d swipl-8.3.28]# cd build
[root@3ba8877f973d build]# cmake -G Ninja -DCMAKE_BUILD_TYPE=PGO -DINSTALL_DOCUMENTATION=OFF .. && ninja -j 8 
...
[root@3ba8877f973d build]# src/swipl ../src/Tests/thread/test_agc_deadlock.pl

[root@3ba8877f973d build]# src/swipl ../src/Tests/thread/test_agc_deadlock.pl
Welcome to SWI-Prolog (threaded, 64 bits, version 8.3.28)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

    CMake built from "/tmp/swipl-8.3.28/build"

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- test_agc_deadlock.
^CAction (h for help) ? C-stack trace labeled "INT":
  [0] save_backtrace() at :? [0x7fd33a3279d4]
  [1] PL_interrupt() at ??:? [0x7fd33a2d37e0]
  [2] dispatch_signal() at :? [0x7fd33a3723f3]
  [3] handleSignals___LD() at :? [0x7fd33a34eb89]
  [4] pl_thread_join2_va() at pl-thread.c:? [0x7fd33a2e4bef]
  [5] PL_next_solution___LD() at :? [0x7fd33a330fbe]
  [6] query_loop() at :? [0x7fd33a36c988]
  [7] prologToplevel() at :? [0x7fd33a36c81b]
  [8] PL_toplevel() at ??:? [0x7fd33a3812fb]
  [9] src/swipl(+0x10b1) [0x55e9e64360b1]
  [10] __libc_start_main() at ??:? [0x7fd33a0c1b25]
  [11] src/swipl(+0x10ee) [0x55e9e64360ee]
Action (h for help) ? goals
     [15] thread_join(<thread>(5,0x55e9e77fcce0), _12292)
     [13] apply:maplist_([<thread>(5,0x55e9e77fcce0), <thread>(6,0x55e9e77fd220), <thread>(7,0x55e9e77fd760), <thread>(8,0x55e9e7809890)], test_agc_deadlock:join)
1 Like

Thanks. This indeed reproduces. So it is not the Linux kernel and not the CPU. According to my current -non-authoritative- theory, this means either the gcc version or glibc matters. Now all works fine on a Fedora 34 docker image on the same host. Both run glibc 2.33. Fedora uses gcc 11.1.1 and Arch 11.1.0.

Considering

I’m tempted to wait …

I don’t think it has to do with gcc 11.1, since as you see in my previous post it also reproduces with gcc 10.

You can try it in the docker archlinux image by installing gcc10 with pacman -Sy gcc10 and then using that to build with CC=/usr/bin/gcc-10 cmake -G ninja .. <etc.>

Ok. Compiled for debugging, which reveals all atom-creating threads hang in reserveAtom(). I managed to make it work using the patch below.

[root@cd3e54ac4605 src]# git diff
diff --git a/src/pl-atom.c b/src/pl-atom.c
index 9fae71203..a0f569606 100644
--- a/src/pl-atom.c
+++ b/src/pl-atom.c
@@ -472,7 +472,7 @@ reserveAtom(void)
 #endif /*O_ATOMGC*/
 
   for(;;)
-  { index = GD->atoms.highest;
+  { index = __atomic_load_n(&GD->atoms.highest, __ATOMIC_ACQUIRE);
     idx = MSB(index);
     assert(index >= 0);

Now I’m not completely happy. What I do not get is that no matter what you change, after an incremental build it works fine. It breaks after a clean build. Using ninja clean before rebuilding rather than starting completely from scratch works too (i.e., causes the bug). @dmchurch, do you have any idea why this could be?

It looks a lot like a C compiler error. If I use gdb and step at instruction level it loops executing a non-conditional jump to itself. That smells for me. The only reasoning I could understand is that if gcc decides there is no need to load GD->atoms.highest from memory and subsequently not to do any of the other loads it can infer that if this loop doesn’t break out the first time it never will and thus it can just as well do something silly.

I suspect @dmchurch may be able to give more insightful comment I’m having a hard time reading assembler and with all the inlining that takes place there are few anchors that tell me where I am :frowning:

1 Like