Segfault happening some time after 9.3.0

Hi everyone,

Some of my code started segfaulting after I upgraded my SWI version. The first query below is with SWI-Prolog 9.3.0 and it terminates correctly. The second is with SWI 9.3.7 and it segfaults:

Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.0)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

1 ?- [load_headless].
Global stack limit 2,147,483,648
Table space 4,294,967,296
true.

2 ?- _T = move/2, time(learn(_T,_Ps)), length(_Ps,N).
% 44,465,502 inferences, 2.656 CPU in 7.557 seconds (35% CPU, 16739954 Lips)
N = 26.
Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.7)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- [load_headless].
Global stack limit 2,147,483,648
Table space 4,294,967,296
true.

?- _T = move/2, time(learn(_T,_Ps)), length(_Ps,N).

ERROR: Received fatal signal 11 (segv)
Time: Wed Jul 24 18:47:26 2024
Inferences: 16441910
Thread: 1 (main)
C-stack trace labeled "crash":
  [0] Sset_exception() at ??:? [0x7f4dcf4c89bc]
  [1] PL_scan_options() at ??:? [0x7f4dcf44b7ca]
  [2] __sigaction() at ??:? [0x7f4dcf172520]
  [3] PL_thread_self() at ??:? [0x7f4dcf48102c]
  [4] PL_thread_self() at ??:? [0x7f4dcf487c36]
  [5] PL_thread_self() at ??:? [0x7f4dcf482b5c]
  [6] PL_thread_self() at ??:? [0x7f4dcf48a0c9]
  [7] PL_exception() at ??:? [0x7f4dcf4586b0]
  [8] PL_is_acyclic() at ??:? [0x7f4dcf4aa086]
  [9] PL_is_acyclic() at ??:? [0x7f4dcf4a9f11]
  [10] PL_toplevel() at ??:? [0x7f4dcf4c032f]
  [11] swipl(+0x1105) [0x55640cdf4105]
  [12] __libc_init_first() at ??:? [0x7f4dcf159d90]
  [13] __libc_start_main() at ??:? [0x7f4dcf159e40]
  [14] swipl(+0x114e) [0x55640cdf414e]


PROLOG STACK:
Segmentation fault

Note that the first bit of output above, where the program terminates, is from Windows PowerShell, whereas the second one, which segfaults, is from WSL (running Ubuntu 22.04 Jammy). My query also crashes on Windows with SWI 9.3.7, but I don’t get any output there, so I could only capture the segfault on WSL.

I also tried the latest daily build for Windows (swipl-w64-2024-07-24.exe) and I still get the crash.

The program that segfaults is large and it will be very difficult to create a simple example, but I’m hoping there will be something that can be found from the version numbers.

I made a small file that causes a segfault on SWI 9.3.7:

segfault.pl (1.8 KB)

I’m pasting the query and the error below. Note the stack and table limits: with smaller limits I only get an error, not a segfault. The segfault happens when I increase the stack and table limits:

Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.7)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- [segfault].
Global stack limit 17,179,869,184
Table space 34,359,738,368
true.

?- between(1,13,_K), findall(J,between(1,_K,J),_Ss), toh(s(_Ss,[],[]),s([],[],_Ss)), format('~w disks: ~w~n',[_K,_Ss]), fail ; true.
1 disks: [1]
2 disks: [1,2]
3 disks: [1,2,3]
4 disks: [1,2,3,4]
5 disks: [1,2,3,4,5]
6 disks: [1,2,3,4,5,6]
7 disks: [1,2,3,4,5,6,7]
8 disks: [1,2,3,4,5,6,7,8]
9 disks: [1,2,3,4,5,6,7,8,9]
10 disks: [1,2,3,4,5,6,7,8,9,10]
11 disks: [1,2,3,4,5,6,7,8,9,10,11]
12 disks: [1,2,3,4,5,6,7,8,9,10,11,12]

ERROR: Received fatal signal 11 (segv)
Time: Wed Jul 24 19:25:47 2024
Inferences: 134392642
Thread: 1 (main)
C-stack trace labeled "crash":
  [0] Sset_exception() at ??:? [0x7fee579659bc]
  [1] PL_scan_options() at ??:? [0x7fee578e87ca]
  [2] __sigaction() at ??:? [0x7fee5760f520]
  [3] PL_is_dict() at ??:? [0x7fee57958e8d]
  [4] PL_is_dict() at ??:? [0x7fee57958ef1]
  [5] PL_is_dict() at ??:? [0x7fee57958ef1]
  [6] PL_is_dict() at ??:? [0x7fee57958ef1]
  [7] PL_is_dict() at ??:? [0x7fee57958ef1]
  [8] PL_is_dict() at ??:? [0x7fee57958ef1]
  [9] PL_is_dict() at ??:? [0x7fee57958ef1]
  [10] PL_is_dict() at ??:? [0x7fee57958ef1]
  [11] PL_is_dict() at ??:? [0x7fee57958ef1]
  [12] PL_is_dict() at ??:? [0x7fee57958ef1]
  [13] PL_is_dict() at ??:? [0x7fee57958ef1]
  [14] PL_is_dict() at ??:? [0x7fee57958ef1]
  [15] PL_is_dict() at ??:? [0x7fee57958ef1]
  [16] PL_is_dict() at ??:? [0x7fee57958ef1]
  [17] PL_is_dict() at ??:? [0x7fee57958ef1]
  [18] PL_is_dict() at ??:? [0x7fee57958ef1]
  [19] PL_is_dict() at ??:? [0x7fee57958ef1]
  [20] PL_is_dict() at ??:? [0x7fee57958ef1]
  [21] PL_is_dict() at ??:? [0x7fee57958ef1]
  [22] PL_is_dict() at ??:? [0x7fee57958ef1]
  [23] PL_is_dict() at ??:? [0x7fee57958ef1]
  [24] PL_is_dict() at ??:? [0x7fee57958ef1]
  [25] PL_is_dict() at ??:? [0x7fee57958ef1]
  [26] PL_is_dict() at ??:? [0x7fee57958ef1]
  [27] PL_is_dict() at ??:? [0x7fee57958ef1]
  [28] PL_is_dict() at ??:? [0x7fee57958ef1]
  [29] PL_is_dict() at ??:? [0x7fee57958ef1]
  [30] PL_is_dict() at ??:? [0x7fee57958ef1]
  [31] PL_is_dict() at ??:? [0x7fee57958ef1]
  [32] PL_is_dict() at ??:? [0x7fee57958ef1]
  [33] PL_is_dict() at ??:? [0x7fee57958ef1]
  [34] PL_is_dict() at ??:? [0x7fee57958ef1]
  [35] PL_is_dict() at ??:? [0x7fee57958ef1]
  [36] PL_is_dict() at ??:? [0x7fee57958ef1]
  [37] PL_is_dict() at ??:? [0x7fee57958ef1]
  [38] PL_is_dict() at ??:? [0x7fee57958ef1]
  [39] PL_is_dict() at ??:? [0x7fee57958ef1]
  [40] PL_is_dict() at ??:? [0x7fee57958ef1]
  [41] PL_is_dict() at ??:? [0x7fee57958ef1]
  [42] PL_is_dict() at ??:? [0x7fee57958ef1]
  [43] PL_is_dict() at ??:? [0x7fee57958ef1]
  [44] PL_is_dict() at ??:? [0x7fee57958ef1]
  [45] PL_is_dict() at ??:? [0x7fee57958ef1]
  [46] PL_is_dict() at ??:? [0x7fee57958ef1]
  [47] PL_is_dict() at ??:? [0x7fee57958ef1]
  [48] PL_is_dict() at ??:? [0x7fee57958ef1]
  [49] PL_is_dict() at ??:? [0x7fee57958ef1]
  [50] PL_is_dict() at ??:? [0x7fee57958ef1]
  [51] PL_is_dict() at ??:? [0x7fee57958ef1]
  [52] PL_is_dict() at ??:? [0x7fee57958ef1]
  [53] PL_is_dict() at ??:? [0x7fee57958ef1]
  [54] PL_is_dict() at ??:? [0x7fee57958ef1]
  [55] PL_is_dict() at ??:? [0x7fee57958ef1]
  [56] PL_is_dict() at ??:? [0x7fee57958ef1]
  [57] PL_is_dict() at ??:? [0x7fee57958ef1]
  [58] PL_is_dict() at ??:? [0x7fee57958ef1]
  [59] PL_is_dict() at ??:? [0x7fee57958ef1]
  [60] PL_is_dict() at ??:? [0x7fee57958ef1]
  [61] PL_is_dict() at ??:? [0x7fee57958ef1]
  [62] PL_is_dict() at ??:? [0x7fee57958ef1]
  [63] PL_is_dict() at ??:? [0x7fee57958ef1]
  [64] PL_is_dict() at ??:? [0x7fee57958ef1]
  [65] PL_is_dict() at ??:? [0x7fee57958ef1]
  [66] PL_is_dict() at ??:? [0x7fee57958ef1]
  [67] PL_is_dict() at ??:? [0x7fee57958ef1]
  [68] PL_is_dict() at ??:? [0x7fee57958ef1]
  [69] PL_is_dict() at ??:? [0x7fee57958ef1]
  [70] PL_is_dict() at ??:? [0x7fee57958ef1]
  [71] PL_is_dict() at ??:? [0x7fee57958ef1]
  [72] PL_is_dict() at ??:? [0x7fee57958ef1]
  [73] PL_is_dict() at ??:? [0x7fee57958ef1]
  [74] PL_is_dict() at ??:? [0x7fee57958ef1]
  [75] PL_is_dict() at ??:? [0x7fee57958ef1]
  [76] PL_is_dict() at ??:? [0x7fee57958ef1]
  [77] PL_is_dict() at ??:? [0x7fee57958ef1]
  [78] PL_is_dict() at ??:? [0x7fee57958ef1]
  [79] PL_is_dict() at ??:? [0x7fee57958ef1]
  [80] PL_is_dict() at ??:? [0x7fee57958ef1]
  [81] PL_is_dict() at ??:? [0x7fee57958ef1]
  [82] PL_is_dict() at ??:? [0x7fee57958ef1]
  [83] PL_is_dict() at ??:? [0x7fee57958ef1]
  [84] PL_is_dict() at ??:? [0x7fee57958ef1]
  [85] PL_is_dict() at ??:? [0x7fee57958ef1]
  [86] PL_is_dict() at ??:? [0x7fee57958ef1]
  [87] PL_is_dict() at ??:? [0x7fee57958ef1]
  [88] PL_is_dict() at ??:? [0x7fee57958ef1]
  [89] PL_is_dict() at ??:? [0x7fee57958ef1]
  [90] PL_is_dict() at ??:? [0x7fee57958ef1]
  [91] PL_is_dict() at ??:? [0x7fee57958ef1]
  [92] PL_is_dict() at ??:? [0x7fee57958ef1]
  [93] PL_is_dict() at ??:? [0x7fee57958ef1]
  [94] PL_is_dict() at ??:? [0x7fee57958ef1]
  [95] PL_is_dict() at ??:? [0x7fee57958ef1]
  [96] PL_is_dict() at ??:? [0x7fee57958ef1]
  [97] PL_is_dict() at ??:? [0x7fee57958ef1]
  [98] PL_is_dict() at ??:? [0x7fee57958ef1]
  [99] PL_is_dict() at ??:? [0x7fee57958ef1]


PROLOG STACK:
Segmentation fault

This is Windows WSL running Ubuntu 22.04 Jammy.

But note that this time I also got a crash (without output) with version 9.3.0 on Windows, so I don’t think it has to do with the version after all.

Thanks. That reproduces. With a version compiled from source (having all debug symbols), we see the culprit is a C stack overflow in the mutually recursive functions merge_children() and merge_one_component().

The quick workaround is to use ulimit -s <size> to raise the C stack limit. The test passed for me with:

ulimit -s 100000
swipl segfault.pl
?- between(1,13,_K), findall(J,between(1,_K,J),_Ss), toh(s(_Ss,[],[]),s([],[],_Ss)), format('~w disks: ~w~n',[_K,_Ss]), fail ; true.
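
As a rough sketch of the same workaround, the limit can be raised in a subshell so that the calling shell keeps its original setting. The helper name with_stack_limit is made up for illustration and is not part of SWI-Prolog; only the standard ulimit and exec shell builtins are used. In bash, a plain ulimit -s <size> sets both the soft and the hard limit of the current shell, which is one reason to confine the change to a subshell and to touch only the soft limit.

```shell
#!/bin/sh
# Hypothetical helper (not part of SWI-Prolog): run a command with a
# different soft C-stack limit without changing the calling shell.
# Usage: with_stack_limit SIZE_IN_KIB COMMAND [ARGS...]
with_stack_limit() {
    limit_kib=$1
    shift
    (
        # The change only affects this subshell and its children.
        # Raising the soft limit fails if it would exceed the hard limit.
        ulimit -S -s "$limit_kib" || exit 1
        exec "$@"
    )
}

# Show the current soft and hard stack limits (in KiB, or "unlimited").
echo "soft: $(ulimit -S -s)  hard: $(ulimit -H -s)"
```

With this defined, the run above would become: with_stack_limit 100000 swipl segfault.pl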

A real fix will take some more time. I never thought this would lead to a stack overflow in a real program :frowning:

Thank you Jan! This works for the toh/2 program (a Towers of Hanoi puzzle). It doesn’t work for the larger program, but I can still run that with SWI 9.3.0, so I’m fine for now. I’ll wait patiently for the final fix.

That’s interesting: a mutual recursion causing a stack overflow. It seems some of the work I do with SWI-Prolog and ILP is pushing some of the limits of SWI, in particular with respect to tabling. I keep running out of table space and I’ve had to develop some strategies to avoid that.

I’m curious. I don’t know anything about the C implementation of SWI-Prolog, but is it the same on Windows and Linux, or are there significant differences between the two platforms?

Also, does it look like the two segfault instances I reported above (the one in my original post and the one with the reproducible file) have the same root cause, or is it not possible to tell from the error output in my original post?

The final fix won’t help any more than raising the stack limit using ulimit -s does. It seems the larger program suffers from something else. Note that the 9.3.x series had some serious redesign of the core Prolog data representation, which caused some regressions. The last (known) issue from that was fixed in 9.3.8. Could you try that? Or better, the Git version compiled from source, so you also get proper backtraces for crashes.

The C code is all the same except for I/O, thread synchronization and other OS related code.

Can you share the larger program? Size doesn’t matter too much. You can use direct mail or a PM from the forum.

I tried with 9.3.8 built from GitHub on WSL. Here’s the output:

Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.8-22-g990b0b1fc)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

    CMake built from "/home/yegoblynqueenne/swipl-devel/build"

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- [load_headless].
Global stack limit 17,179,869,184
Table space 34,359,738,368
true.

?- _T = move/2, time(learn(_T,_Ps)), length(_Ps,N).

ERROR: Received fatal signal 11 (segv)
Time: Thu Jul 25 06:51:36 2024
Inferences: 10877672
Thread: 1 (main)
C-stack trace labeled "crash":
  [0] save_backtrace() at /home/yegoblynqueenne/swipl-devel/src/os/pl-cstack.c:337 [0x7f12713c8040]
  [1] sigCrashHandler() at /home/yegoblynqueenne/swipl-devel/src/os/pl-cstack.c:937 [0x7f127135d38e]
  [2] __sigaction() at ??:? [0x7f12710ae520]
  [3] MurmurHashAligned2() at /home/yegoblynqueenne/swipl-devel/src/pl-hash.c:150 [0x7f127138a731]
  [4] COMPARE_AND_SWAP_UINT() at /home/yegoblynqueenne/swipl-devel/src/pl-inline.h:254 [0x7f127138ec3c]
  [5] trie_intern_indirect___LD() at /home/yegoblynqueenne/swipl-devel/src/pl-trie.c:673 [0x7f127138c0fb]
  [6] pl_tbl_wkl_add_answer4_va() at /home/yegoblynqueenne/swipl-devel/src/pl-tabling.c:3607 [0x7f1271390191]
  [7] PL_next_solution___LD() at /home/yegoblynqueenne/swipl-devel/src/pl-vmi.c:4574 [0x7f127136917a]
  [8] query_loop() at /home/yegoblynqueenne/swipl-devel/src/pl-pro.c:147 [0x7f12713aa024]
  [9] prologToplevel() at /home/yegoblynqueenne/swipl-devel/src/pl-pro.c:594 [0x7f12713a9eb1]
  [10] PL_toplevel() at /home/yegoblynqueenne/swipl-devel/src/pl-fli.c:4953 [0x7f12713bfb4f]
  [11] /home/yegoblynqueenne/swipl-devel/build/src/swipl(+0x1105) [0x557634227105]
  [12] __libc_init_first() at ??:? [0x7f1271095d90]
  [13] __libc_start_main() at ??:? [0x7f1271095e40]
  [14] /home/yegoblynqueenne/swipl-devel/build/src/swipl(+0x1145) [0x557634227145]


PROLOG STACK:
    [172] Segmentation fault

I can also try on Fedora Linux if that’s likely to give a different result, but I guess not?

I’m preparing to share the larger program with you. It is quite large and it’s my development version so I have to slim it down a bit. I’ll send you a PM when it’s ready.

And thanks!

Probably you get the same.

Thanks. If the stack is correct this might not be very hard to fix. Looking forward to the code.

Oh, the stack: yes, sorry, the last output was with ulimit -s 100000 as you suggested.

That doesn’t matter here. As you can see, the stack is shallow. This is a real segv resulting from an illegal memory access, possibly due to some unexpected data type. Tabling has been tested quite a bit using the very extensive XSB tabling test suite, but XSB has no strings, big integers or rational numbers. The stack suggests something goes wrong with one of these.

Note that in C, stack overflows also result in a segv error as C does not check whether there is any space on the stack, so eventually it tries to write outside the stack in protected memory. At least, that is how it works on most Unix systems. Windows gives a distinct error.
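
To illustrate the point from the shell side: on Unix-like systems, a process killed by a signal is reported with exit status 128 plus the signal number, so any SIGSEGV (11) shows up as 139, whether it came from a stack overflow or from some other illegal memory access. The sketch below uses a made-up helper name, exit_reason, and a child shell that sends SIGSEGV to itself as a stand-in for a crashing program. It is only an illustration of how the status is encoded, not a way to tell the two causes apart.

```shell
#!/bin/sh
# Hypothetical helper: turn a shell exit status into a readable reason.
# Shells report "killed by signal N" as status 128+N, so SIGSEGV (11)
# appears as 139.
exit_reason() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l maps such an exit status back to the signal name.
        echo "killed by signal $(kill -l "$status")"
    else
        echo "exited with status $status"
    fi
}

# Demo: a child shell that sends SIGSEGV to itself stands in for a crash.
status=0
sh -c 'kill -s SEGV $$' || status=$?
exit_reason "$status"
```

The exit status is the same in both cases, which is why a backtrace from a build with debug symbols is needed to see whether the stack is deep (overflow) or shallow (bad memory access).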

I see, sorry, I thought you meant setting the stack with ulimit.

Say, how do I send you a .zip file with the code? I can’t directly upload .zip files. I can upload it as a fake .pl file (or some other accepted file format) say myfile.zip.pl but is that kosher?

Otherwise, where do I find your email?

Please do not attach it here. We’ll quickly run out of space. If it is small enough to send by mail, please send to bugs@swi-prolog.org. Otherwise use some file transfer site like WeTransfer (there are many of them).

OK, I sent you an email at bugs@swi-prolog.org. Please let me know if there are any problems with it.

For the record this issue is now resolved. Jan pushed a fix last night, at his usual super-human speed. And just in time for dinner!

Thank you Jan!