Segfault: tipc paxos

swi · November 3, 2025, 2:36am

I am getting a segfault with the following program:

:- use_module(library(tipc/tipc_paxos)).
:- use_module(library(paxos)).

main :-
   tipc_initialize,
   paxos_set(key,val1),
   paxos_get(key,V),
   format('Value is ~w~n',V).

Error:

Welcome to SWI-Prolog (threaded, 64 bits, version 9.3.33-23-g5cc962ee2)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

    CMake built from "/tmp/swipl-devel/b"

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

102 ?- main.

ERROR: Received fatal signal 11 (segv)
Time: Sun Nov  2 21:14:20 2025
Inferences: 603629
Thread: 1 (main)
C-stack trace labeled "crash":
  [0] save_backtrace() at /tmp/swipl-devel/src/os/pl-cstack.c:337 [0x7f733e78ca77]
  [1] sigCrashHandler() at /tmp/swipl-devel/src/os/pl-cstack.c:937 [0x7f733e78cbd8]
  [2] __sigaction() at ??:? [0x7f733e42b540]
  [3] Sfileno() at /tmp/swipl-devel/src/os/pl-stream.c:3718 [0x7f733e770a70]
  [4] input_on_stream() at /tmp/swipl-devel/src/pl-fli.c:5356 [0x7f733e74fcc9]
  [5] wait_socket() at /tmp/swipl-devel/packages/clib/nonblockio.c:473 (discriminator 1) [0x7f733d40b515]
  [6] pl_tipc_receive_subscr_event() at /tmp/swipl-devel/packages/tipc/tipc.c:858 (discriminator 1) [0x7f733d407690]
  [7] PL_next_solution_guarded___LD() at /tmp/swipl-devel/src/pl-vmi.c:4378 (discriminator 2) [0x7f733e63e262]
  [8] PL_next_solution___LD() at /tmp/swipl-devel/src/pl-wam.c:3573 [0x7f733e63336a]
  [9] callProlog() at /tmp/swipl-devel/src/pl-pro.c:535 [0x7f733e69a3e1]
  [10] pl_with_mutex() at /tmp/swipl-devel/src/pl-mutex.c:769 [0x7f733e72e0c4]
  [11] PL_next_solution_guarded___LD() at /tmp/swipl-devel/src/pl-vmi.c:4378 (discriminator 2) [0x7f733e63e262]
  [12] PL_next_solution___LD() at /tmp/swipl-devel/src/pl-wam.c:3573 [0x7f733e63336a]
  [13] query_loop() at /tmp/swipl-devel/src/pl-pro.c:179 (discriminator 1) [0x7f733e699cee]
  [14] prologToplevel() at /tmp/swipl-devel/src/pl-pro.c:660 [0x7f733e69a79d]
  [15] PL_toplevel() at /tmp/swipl-devel/src/pl-fli.c:4990 [0x7f733e74f3dc]
  [16] src/swipl(+0x10a2) [0x55737d3d00a2]
  [17] __libc_init_first() at ??:? [0x7f733e414675]
  [18] __libc_start_main() at ??:? [0x7f733e414729]
  [19] src/swipl(+0x10f5) [0x55737d3d00f5]


PROLOG STACK (without arguments):
  [21] tipc:tipc_receive_subscr_event/2 <foreign>
  [20] tipc:tipc_service_exists/2 [PC=63 in clause 1]
  [18] tipc:tipc_stack_initialize/0 [PC=14 in clause 1]
  [17] system:<meta-call>/1 [PC=2 in clause -1]
  [16] system:\+/1 [PC=6 in clause 1]
  [15] system:$c_call_prolog/0 [PC=0 in top query clause]
  [14] system:with_mutex/2 <foreign>
  [12] main/0 [PC=3 in clause 1]
  [11] $toplevel:toplevel_call/1 [PC=3 in clause 1]
  [10] $toplevel:stop_backtrace/2 [PC=4 in clause 1]


PROLOG STACK (with arguments; may crash if data is corrupted):
fish: Job 1, 'src/swipl /tmp/p.pl' terminated by signal SIGSEGV (Address boundary error)

I tried the following:

Ran the same code on 9.2.9: everything OK, no segfault.
After this, I re-ran on 9.3.33-23 and it was OK, no segfault.

So it may be hard to debug as it seems to be intermittent, but I hope the report helps to solve it.

EDIT 1: This may help: sporadically I also get the following error if I exit SWI-Prolog on a node and start it again (on 9.3.33-23):

% uml(M) :- use_module(library(M)).
35 ?- uml(tipc/tipc_paxos).
true.

36 ?- uml(paxos).
true.

37 ?- tipc_initialize.
true.

38 ?- paxos_get(key,V).
false.

39 ?- paxos_get(key,V).
false.

39 ?- paxos_get(key,V).
ERROR: open/3: Not enough resources: max_files (Too many open files)
   Call: (29) true ? abort

jan · November 3, 2025, 8:51am

Hard to say. You may try to run under AddressSanitizer. If some memory issue is involved, that may help. Easiest build (from the root of the sources):

mkdir build.asan
cd build.asan
../scripts/configure
<agree>
ninja

Now use src/swipl[-win] to do your work. No need for installing. That is how I manage versions compiled with different options. The suffix of the build directory is picked up by scripts/configure to add CMake options and set environment variables. See the script for the supported suffixes.

swi · November 3, 2025, 10:20am

Built and ran it under AddressSanitizer, no problem. It works fine. One of those difficult intermittent problems. By the way, do you recommend using paxos? It seems quite useful with tipc.

jan · November 3, 2025, 11:40am

Ran fine here too. Question is, when the intermittent problem occurs, will AddressSanitizer point at the problem? It might. Good news is that its reports are typically detailed enough to find the culprit.

I don’t know. It was a fun project, but it never really materialized. It probably runs fine under normal circumstances, but I doubt it survives all the scenarios it should survive as nodes fail and are added. My original intent was to use it for horizontal scaling of SWISH. One of the problems is of course that one cannot use TIPC in most places. Eventually I settled on a Redis cluster to provide the data sharing.

swi · November 3, 2025, 10:48pm

Hopefully I will run into the problem again under the address sanitizer.

Yes, I was doing some tests, and it doesn’t work well with a failing and rejoining node. I disconnected the node (by exiting swipl), then rejoined it, and paxos_get/2 no longer works. It fails silently, and eventually I get a Not enough resources: max_files (Too many open files) error.

I find tipc_linda much more reliable. It can be used as a key-value store with client nodes connecting and failing, but of course all the data is stored in the linda server, so there is no fault tolerance for the linda server itself. With tipc it would not be hard to run several linda servers and replicate the data, but that is not currently in the code.

I may end up using Valkey (the open-source fork of Redis), but I was trying to keep everything in SWI-Prolog: one less component to manage.

jan · November 4, 2025, 7:56am

Basic fail and rejoin was working AFAIK. The max_files seems suspicious. Does the machine have such tight limits? You could have a look at /proc//fd to see whether it has a high number of open files. If so, there is probably something wrong in the tipc binding.
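
Both checks can be done from a shell. This is a minimal sketch, assuming Linux with /proc mounted; count_fds is a made-up helper name, and you would pass it the PID of the running swipl process.

```shell
#!/bin/sh
# Count the open file descriptors of a process via /proc (Linux only).
# count_fds is a hypothetical helper name, not part of SWI-Prolog.
count_fds() {
    pid="${1:-$$}"                 # default: the current shell
    ls "/proc/$pid/fd" | wc -l
}

# Per-process limit on open files for the current shell.
ulimit -n

# Descriptors currently open in this shell; pass the swipl PID as the
# first argument to inspect the Prolog process instead.
count_fds
```

Running count_fds with the swipl PID before and after a failing paxos_get/2 should show whether descriptors accumulate.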

swi · November 4, 2025, 11:37am

The number of open fds grows each time I run paxos_get/2 on the failed/rejoined machine:

161 ?- process_id(PID), ls('/'/proc/PID/fd).
% 0    11   14   17   2    22   25   28   30   33   36   39   41   44   47   5    52   6    9
% 1    12   15   18   20   23   26   29   31   34   37   4    42   45   48   50   53   7
% 10   13   16   19   21   24   27   3    32   35   38   40   43   46   49   51   54   8
PID = 1523079.

162 ?- paxos_get(k,V).
false.

163 ?- process_id(PID), ls('/'/proc/PID/fd).
% 0     103   11    17    22    28    33    39    44    5     55    60    66    71    77    82    88    93    99
% 1     104   12    18    23    29    34    4     45    50    56    61    67    72    78    83    89    94
% 10    105   13    19    24    3     35    40    46    51    57    62    68    73    79    84    9     95
% 100   106   14    2     25    30    36    41    47    52    58    63    69    74    8     85    90    96
% 101   107   15    20    26    31    37    42    48    53    59    64    7     75    80    86    91    97
% 102   108   16    21    27    32    38    43    49    54    6     65    70    76    81    87    92    98
PID = 1523079.

164 ?- paxos_get(k,V).
false.

165 ?- process_id(PID), ls('/'/proc/PID/fd).
% 0     103   109   114   12    17    22    28    33    39    44    5     55    60    66    71    77    82    88    93    99
% 1     104   11    115   120   18    23    29    34    4     45    50    56    61    67    72    78    83    89    94
% 10    105   110   116   13    19    24    3     35    40    46    51    57    62    68    73    79    84    9     95
% 100   106   111   117   14    2     25    30    36    41    47    52    58    63    69    74    8     85    90    96
% 101   107   112   118   15    20    26    31    37    42    48    53    59    64    7     75    80    86    91    97
% 102   108   113   119   16    21    27    32    38    43    49    54    6     65    70    76    81    87    92    98
PID = 1523079.

166 ?-

I use two machines for this. On the first machine I run paxos_set(k,5); then I ssh to the second machine, fail and rejoin it, and get the output above.

jan · November 4, 2025, 12:15pm

That is clearly wrong … I do not manage to create a failing node though. Ok, I use two terminals on the same host, but I guess that should be fine. What exactly do you do?

swi · November 4, 2025, 5:04pm

With two terminals on the same host it is OK, the problem happens when I use two separate machines.

Machine 1 & 2:

modprobe tipc
tipc bearer enable media eth dev eth0

Machine 1 (first):

$ swipl
91 ?- uml(tipc/tipc_paxos).
true.

92 ?- uml(paxos).
true.

93 ?- tipc_initialize.
true.
94 ?- paxos_set(k,5).
true.

Machine 2 (after running the code above in machine 1):

$ swipl
154 ?- uml(tipc/tipc_paxos).
true.

155 ?- uml(paxos).
true.

156 ?- tipc_initialize.
true.

157 ?- paxos_get(k,V).
false.

It happens only on two different machines, not with two terminals on the same host. I think the key is to do the paxos_set(k,5) on machine 1 before you run swipl on machine 2.

If I run paxos_set(k,5) a few times while machine 2 is already running, I can eventually get it to work by interleaving paxos_set(k,5) on machine 1 and paxos_get(k,V) on machine 2.

EDIT 1: One of the machines is on Wi-Fi and the other is on Ethernet. I mention this because it could cause latencies longer than the default 20ms timeout, which could explain why it doesn’t fail on a single host. But maybe not, because there are 5 default retries…
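
One way to test the latency idea is to compare the round-trip time between the two machines with the 20 ms figure. This is only a rough sketch: ping measures ICMP over IP rather than TIPC traffic, so it is a proxy at best, and avg_rtt_ms, exceeds_timeout and PEER are names invented for this example.

```shell
#!/bin/sh
# Extract the average round-trip time (ms) from ping's summary line,
# e.g. "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms".
avg_rtt_ms() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeeds if the average RTT (first argument, ms) exceeds the timeout
# (second argument, ms; defaults to the 20 ms mentioned above).
exceeds_timeout() {
    awk -v rtt="$1" -v limit="${2:-20}" 'BEGIN { exit !(rtt + 0 > limit + 0) }'
}

# Usage (PEER is a placeholder for the other machine's address):
#   avg=$(ping -c 10 "$PEER" | avg_rtt_ms)
#   exceeds_timeout "$avg" && echo "average RTT above 20 ms"
```

An average well below 20 ms would not rule out occasional spikes on Wi-Fi, so the max field of the same summary line is worth a look too.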

jan · November 5, 2025, 11:15am

But that may leak file handles? It would be pretty weird if there is, besides timing, a difference between running on the same machine and running on different machines. AFAIK, TIPC traffic is poorly or not supported by many smart things that may be routing your internet traffic. It is pretty much intended for creating a cluster from machines in a well connected subnet. Is basic TIPC working properly?
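
A quick sanity check of the TIPC layer itself could look like the sketch below. It assumes the tipc tool from iproute2 is installed and that TIPC is built as a loadable module (a built-in TIPC would not show up in /proc/modules); tipc_module_loaded is a made-up helper name.

```shell
#!/bin/sh
# Succeeds if the tipc kernel module appears in a modules listing
# (default /proc/modules). tipc_module_loaded is a hypothetical helper.
tipc_module_loaded() {
    grep -q '^tipc ' "${1:-/proc/modules}" 2>/dev/null
}

if tipc_module_loaded && command -v tipc >/dev/null 2>&1; then
    tipc bearer list      # bearers enabled on this node
    tipc link list        # links to peer nodes and their state
    tipc nametable show   # service names published in the cluster
else
    echo "tipc module not loaded or tipc tool not installed"
fi
```

If the link to the other machine is listed as up on both sides, the problem is more likely above the transport than in it.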

swi · November 5, 2025, 3:53pm

Yes, TIPC is working perfectly between the two machines; notably, tipc_linda works, even when connecting and disconnecting the nodes. There must be some bug in the paxos code.
