Performance implications of `--threads=false` in embedded swipl

We embed SWI-Prolog in an OCaml application and noticed today that SWI-Prolog was handling stack overflows in a way that masked the application’s own handler, leading to very unintuitive error messages. We tried initializing the embedded interpreter with --signals=false, but now our queries are sometimes a hundred times slower than they used to be. Unfortunately, we cannot reproduce this behavior using the swipl binary with the exact same command line arguments.

Is this a known problem? How would we go about debugging this? We don’t yet have a minimal reproducing example but we might be able to minimize our FFI code to something that can be shared, if necessary.

It only handles overflows on its own stacks. Well, on platforms that support signals and handling these on an “alt” stack it uses signal handling on a few recursive C functions to turn SIGSEGV into controlled stack overflow exceptions. That is only activated if signals are handled.

No. It also seems pretty weird to me. It just doesn’t set up signal handling. You’d expect no impact on performance and I never heard of any impact. My first guess at debugging would be to try profiling using e.g. valgrind on platforms where this is available.

What platform are we talking about?

It only handles overflows on its own stacks. Well, on platforms that support signals and handling these on an “alt” stack it uses signal handling on a few recursive C functions to turn SIGSEGV into controlled stack overflow exceptions. That is only activated if signals are handled.

It might just be that the OCaml process receives a fatal signal and that it is caught by SWI-Prolog when not using --signals=false. Here’s the trace I get in that configuration when forcing a stack overflow in OCaml code after the call to PL_initialise.

SWI-Prolog [thread 1 (main) at Fri Oct 13 11:15:42 2023]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
  [0] save_backtrace() at :? [0x7f2380de16ed]
  [1] print_c_backtrace() at :? [0x7f2380de187c]
  [2] sigCrashHandler() at :? [0x7f2380de19a3]
  [3] __restore_rt() at libc_sigaction.c:? [0x7f2391b36510]
  [4] automation.cmxs(camlAutomation__Prolog__overflow_3144+0x9) [0x7f235d07ccb9]
  (* the line above is where I force a stack overflow in OCaml. It is then repeated many times in the trace *)
Prolog stack:
Running on_halt hooks with status 139
Killing 467720 with default signal handlers

I don’t think this is entirely surprising, given that SWI-Prolog installs signal handlers, so the important part of the question is really why there is such a large performance delta when the handlers are disabled.

My first guess at debugging would be to try profiling using e.g. valgrind on platforms where this is available.

It took me a while to get debugging symbols. Neither the stable nor the dev PPA seems to be compatible with current Debian unstable (libjpeg8 cannot be found), and Debian’s own package doesn’t have debugging symbols. I ended up recompiling from the sources (version 9.0.4, in case that matters).

I have no clue how to interpret the results, so I captured a perf trace of our application running code that requests all solutions to a particular query and uploaded it here: Firefox Profiler. The profile in the link should already be focused on the SWI-Prolog parts of the trace, which amount to about 1s. There is some application-specific (non-constant) work happening for every solution returned, which is why the Prolog work isn’t contiguous.

Our application has debug options to output the Prolog state before the query. Feeding that output to swipl --signals=no and running a findall query equivalent to what our application computes (plus some overhead from gathering the results into a list, I assume) takes between 5 and 10ms. I also uploaded a trace of that, but I am not sure it’s very useful: Firefox Profiler

The timing from the swipl binary matches what we see in our application without --signals=false.

What platform are we talking about?

This is happening on Linux (Debian unstable and Arch Linux have been tested).

If the profile doesn’t lie, it spends an awful lot of time trying to find clauses. You could wrap your Prolog goal in profile/1 to get a clue which predicate is involved. Is there a lot of assert/retract on some predicate? Although I have little clue why, one possible explanation is a lot of dead clauses from retract/1 calls that are not reclaimed because for some reason the clause garbage collector does not run. You could run statistics/0 after the goal and compare clause and atom garbage collection counts between the good and bad run.
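To make that comparison easier to script, the same counters can also be read programmatically. Here is a minimal sketch using statistics/2 (cgc_report/1 is a hypothetical helper; the cgc and codes keys exist in recent SWI-Prolog versions, but check the statistics/2 documentation for yours):

```prolog
% Sketch: report clause-GC activity around a goal via statistics/2.
% cgc  = number of clause garbage collections so far,
% codes = total size of the virtual machine code.
cgc_report(Goal) :-
    statistics(cgc, C0),
    statistics(codes, S0),
    call(Goal),
    statistics(cgc, C1),
    statistics(codes, S1),
    CGC is C1 - C0,
    Growth is S1 - S0,
    format('clause GCs: ~d, VM-code growth: ~d~n', [CGC, Growth]).
```

If CGC stays at 0 while the VM-code growth keeps climbing across runs, that would point at clause GC not being triggered.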

If not running the clause garbage collector is the problem, we still have to find why … Yes, starting CGC involves signals, but synchronous Prolog-level signals rather than asynchronous OS signals.

Although it is not a solution, you can also pass --no-threads to PL_initialise(). That runs CGC in the main thread, avoiding the need for (Prolog-level) signals to start CGC.

A possible scenario for this is code like this:

:- dynamic s/1.

t(N) :-
    s(N),
    !,
    retractall(s(_)).
t(N) :-
    (   retract(s(N0))
    ->  N1 is N0+1
    ;   N1 = 0
    ),
    assertz(s(N1)),
    t(N).

This works just fine (~0.7sec for t(1_000_000)) with or without signals.

Yes, we do constantly assert and clear facts in dynamic predicates.

I managed to get access to the interpreter from within the application. --no-threads does not seem to change anything. Running garbage_collect. directly before the query also doesn’t seem to make any difference.

I did manage to generate the profile, at least. I’ve run this several times to make sure we aren’t seeing the effects of initial tabling.

profile(findall([A,B,C,K,I],match_toplevel(fwd,A,B,C,K,I),B)).
=====================================================================
Total time: 0.384 seconds
=====================================================================
Predicate                       Box Entries =    Calls+Redos     Time
=====================================================================
prem/2                                  389 =      270+119     100.0%
call_with_inference_limit/3              90 =       75+15        0.0%
start_tabling/3                          48 =       48+0         0.0%
$tabling:start_tabling_2/6               48 =       48+0         0.0%
member/2                                224 =      224+0         0.0%
lists:member_/3                          87 =       87+0         0.0%
$add_findall_bag/1                       60 =       60+0         0.0%
$collect_findall_bag/2                    1 =        1+0         0.0%
$destroy_findall_bag/0                    1 =        1+0         0.0%
$inference_limit/2                       75 =       75+0         0.0%
$inference_limit_false/1                 15 =       15+0         0.0%
$inference_limit_true/3                  60 =       60+0         0.0%
$new_findall_bag/0                        1 =        1+0         0.0%
$tbl_variant_table/6                     48 =       48+0         0.0%
not/1                                   287 =      149+138       0.0%
sig_atomic/1                              1 =        1+0         0.0%
trie_gen_compiled/2                      48 =       48+0         0.0%
box_compatible/2                        197 =      197+0         0.0%
boxes_compatible/3                       75 =       75+0         0.0%
enabled_db/1                            207 =      207+0         0.0%
hint_as_subgoal/5                        48 =       48+0         0.0%
hint_subgoals/5                          75 =       75+0         0.0%
hint_subgoals_with_limit/6               90 =       75+15        0.0%
match_cncls/4                           207 =      207+0         0.0%
match_prems/4                           464 =      207+257       0.0%

prem/2 is a dynamic predicate.

I got the listings of all involved predicates and loaded them into a swipl session using the same command line arguments. Again, this is after the first run, which seems to generate noisy profiles due to tabling and autoloading of libraries.

profile(findall([A,B,C,K,I],match_toplevel(fwd,A,B,C,K,I),B)).
=====================================================================
Total time: 0.001 seconds
=====================================================================
Predicate                       Box Entries =    Calls+Redos     Time
=====================================================================
call_with_inference_limit/3              90 =       75+15        0.0%
start_tabling/3                          48 =       48+0         0.0%
$tabling:start_tabling_2/6               48 =       48+0         0.0%
member/2                                224 =      224+0         0.0%
lists:member_/3                          87 =       87+0         0.0%
$add_findall_bag/1                       60 =       60+0         0.0%
$collect_findall_bag/2                    1 =        1+0         0.0%
$destroy_findall_bag/0                    1 =        1+0         0.0%
$inference_limit/2                       75 =       75+0         0.0%
$inference_limit_false/1                 15 =       15+0         0.0%
$inference_limit_true/3                  60 =       60+0         0.0%
$new_findall_bag/0                        1 =        1+0         0.0%
$tbl_variant_table/6                     48 =       48+0         0.0%
not/1                                   287 =      149+138       0.0%
sig_atomic/1                              1 =        1+0         0.0%
trie_gen_compiled/2                      48 =       48+0         0.0%
box_compatible/2                        197 =      197+0         0.0%
boxes_compatible/3                       75 =       75+0         0.0%
enabled_db/1                            207 =      207+0         0.0%
hint_as_subgoal/5                        48 =       48+0         0.0%
hint_subgoals/5                          75 =       75+0         0.0%
hint_subgoals_with_limit/6               90 =       75+15        0.0%
match_cncls/4                           207 =      207+0         0.0%
match_prems/4                           457 =      207+250       0.0%
match_toplevel/6                        472 =        1+471       0.0%

I have inspected the results on both sides and they look identical.

What you’ll probably see in the profile, and what I forgot to mention, is that we use call_with_inference_limit/3. Does that use signals?
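For context, here is a minimal illustration of how we use that interface (the goal and limit are made up, not our real code). As we understand call_with_inference_limit/3, Result unifies with inference_limit_exceeded when the limit is hit, and otherwise reflects the goal’s own outcome:

```prolog
% Hypothetical example, not our application code: run Goal with a
% bounded number of inferences and fail cleanly if the budget runs out.
limited(Goal) :-
    call_with_inference_limit(Goal, 10000, Result),
    (   Result == inference_limit_exceeded
    ->  format('gave up on ~p~n', [Goal]),
        fail
    ;   true
    ).
```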

So, prem/2 is the problematic predicate. Could you get the output of calling statistics/0 after the inferencing, so we can check the number of clause garbage collections?

This is the classical Prolog stack garbage collector. That is not related. What is related is garbage_collect_clauses/0. Normally, that is triggered after the system detects that running the code skips too many erased clauses, or the combined size of the erased clauses gets over a certain limit. I think something is wrong with this triggering, but I have little clue what, especially as --no-threads seems ineffective. Might get hard to debug if I can’t reproduce this.

This is the classical Prolog stack garbage collector. That is not related.

I’ve tried garbage_collect_clauses and garbage_collect_atoms as well. Neither one improves the timing of the query (or takes any noticeable time by itself).

Could you get the output of calling statistics/0 after the inferencing, so we can check the number of clause garbage collections?

I should mention that the dynamic facts are not asserted in the match_toplevel query we are running here, but rather in a preparatory step (executing a bunch of assertz queries) that we trigger from our application. That preparatory step does not seem to be any slower with --signals=false.
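For a sense of what that preparatory step looks like, here is a schematic sketch (predicate and argument names are simplified stand-ins for our real code, which asserts into dynamic predicates such as prem/2):

```prolog
:- dynamic prem/2.

% Schematic stand-in for our preparatory step: clear the previous
% state, then assert the facts for the next round of queries.
prepare(Facts) :-
    retractall(prem(_, _)),
    forall(member(prem(A, B), Facts),
           assertz(prem(A, B))).
```

This retractall/assertz churn happens thousands of times over a session, which is presumably where the erased clauses come from.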

Still, in case it could be useful, here’s the output before and after the query and then after running garbage_collect_clauses. This is from the middle of my current debugging session and the particular point in the application has already run thousands of queries before, which is why there have been so many collections already.

statistics.
% Started at Fri Oct 13 15:04:20 2023
% 300.494 seconds cpu time for 9,242,462 inferences
% 7,291 atoms, 5,463 functors, 3,465 predicates, 47 modules, 11,818,360 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -      500 Kb      307 Kb
% Global stack:           -      256 Kb      237 Kb
% Trail  stack:           -       34 Kb      360  b
%        Total:    1,024 Mb      790 Kb      545 Kb
% 
% 72 garbage collections gained 104,678,712 bytes in 0.038 seconds.
% 2 atom garbage collections gained 33,574 atoms in 0.010 seconds.
% 1,556 clause garbage collections gained 165,548 clauses in 0.091 seconds.
% Stack shifts: 6 local, 10 global, 33 trail in 0.008 seconds
true.

profile(findall([A,B,C,K,I],match_toplevel(fwd,A,B,C,K,I),B)).
[..] % output omitted

statistics.
% Started at Fri Oct 13 15:04:20 2023
% 300.783 seconds cpu time for 9,249,342 inferences
% 7,291 atoms, 5,463 functors, 3,465 predicates, 47 modules, 11,818,383 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -      500 Kb      307 Kb
% Global stack:           -      512 Kb      300 Kb
% Trail  stack:           -       34 Kb      360  b
%        Total:    1,024 Mb    1,046 Kb      607 Kb
% 
% 72 garbage collections gained 104,678,712 bytes in 0.038 seconds.
% 2 atom garbage collections gained 33,574 atoms in 0.010 seconds.
% 1,556 clause garbage collections gained 165,548 clauses in 0.091 seconds.
% Stack shifts: 6 local, 11 global, 36 trail in 0.009 seconds
true.

garbage_collect_clauses.
true.

statistics.
% Started at Fri Oct 13 15:04:20 2023
% 300.787 seconds cpu time for 9,251,247 inferences
% 7,291 atoms, 5,463 functors, 3,465 predicates, 47 modules, 11,818,396 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -      500 Kb      307 Kb
% Global stack:           -      512 Kb      320 Kb
% Trail  stack:           -       34 Kb      360  b
%        Total:    1,024 Mb    1,046 Kb      627 Kb
% 
% 72 garbage collections gained 104,678,712 bytes in 0.038 seconds.
% 2 atom garbage collections gained 33,574 atoms in 0.010 seconds.
% 1,558 clause garbage collections gained 165,550 clauses in 0.092 seconds.
% Stack shifts: 6 local, 11 global, 38 trail in 0.009 seconds
true.

Might get hard to debug if I can’t reproduce this.

I understand. I will try to come up with a bit of C code that initialises the interpreter, loads our definitions, and puts the system in the same state that I am observing with --signals=false.

I am also setting things up so that I can compare the embedded interpreter with and without --signals=no (instead of comparing embedded versus the swipl binary) to get a better, more direct comparison. That should help with interpreting the output of statistics, which probably won’t mean anything without a direct comparison, although I am not sure statistics is going to tell us much more.

Note that the normal swipl binary is a very small program that embeds the real system, which is in the shared object. I consider it possible that OCaml modifies the environment in a way that conflicts. I have little clue what that could be, though.

I can’t tell from the output what is run before the statistics and whether one run is ok (with signals) and the other is slow (no signals). statistics/0 gives us two interesting numbers: the number of clause GC calls and the size of the VM codes. If the latter is much larger in the no-signals scenario while the number of clause GC calls is comparable, it means CGC doesn’t do its work for some reason.

An alternative route we can go is to build the debug version of Prolog, so we can enable some messages, e.g. -d MSG_CGC_CONSIDER to print messages about the CGC trigger events.

That seems to be the case. I’ve been running both versions (signals and no signals) side by side and comparing the output of statistics at various points. The numbers before the output are just location indicators to match up the outputs.

Default (with signals)
951
% Started at Mon Oct 16 11:04:27 2023
% 161.294 seconds cpu time for 5,094,397 inferences
% 22,063 atoms, 5,193 functors, 3,297 predicates, 42 modules, 786,790 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      186 Kb
% Global stack:           -    2,048 Kb      541 Kb
% Trail  stack:           -      514 Kb      640  b
%        Total:    1,024 Mb    3,574 Kb      727 Kb
% 
% 41 garbage collections gained 65,560,144 bytes in 0.018 seconds.
% 7 atom garbage collections gained 670 atoms in 0.005 seconds.
% 328 clause garbage collections gained 107,933 clauses in 0.106 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds
% 2 threads, 0 finished threads used 0.000 seconds

985
% Started at Mon Oct 16 11:04:27 2023
% 184.272 seconds cpu time for 7,226,094 inferences
% 24,619 atoms, 5,216 functors, 3,297 predicates, 42 modules, 806,072 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      277 Kb
% Global stack:           -    2,048 Kb      660 Kb
% Trail  stack:           -      514 Kb       72  b
%        Total:    1,024 Mb    3,574 Kb      937 Kb
% 
% 54 garbage collections gained 88,837,600 bytes in 0.025 seconds.
% 7 atom garbage collections gained 670 atoms in 0.005 seconds.
% 445 clause garbage collections gained 151,252 clauses in 0.151 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds
% 2 threads, 0 finished threads used 0.000 seconds

1007
% Started at Mon Oct 16 11:04:27 2023
% 192.298 seconds cpu time for 8,161,451 inferences
% 29,533 atoms, 5,244 functors, 3,297 predicates, 42 modules, 868,284 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      298 Kb
% Global stack:           -    2,048 Kb      219 Kb
% Trail  stack:           -      514 Kb       72  b
%        Total:    1,024 Mb    3,574 Kb      517 Kb
% 
% 57 garbage collections gained 94,238,152 bytes in 0.027 seconds.
% 7 atom garbage collections gained 670 atoms in 0.005 seconds.
% 469 clause garbage collections gained 159,930 clauses in 0.160 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds
% 2 threads, 0 finished threads used 0.000 seconds
No signals
951
% Started at Mon Oct 16 11:30:46 2023
% 174.443 seconds cpu time for 5,094,396 inferences
% 22,666 atoms, 5,193 functors, 3,247 predicates, 42 modules, 7,469,982 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      186 Kb
% Global stack:           -    2,048 Kb      541 Kb
% Trail  stack:           -      514 Kb      640  b
%        Total:    1,024 Mb    3,574 Kb      727 Kb
% 
% 41 garbage collections gained 65,560,144 bytes in 0.017 seconds.
% 940 clause garbage collections gained 107,998 clauses in 0.051 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds

985
% Started at Mon Oct 16 11:30:46 2023
% 236.799 seconds cpu time for 7,226,030 inferences
% 25,222 atoms, 5,216 functors, 3,236 predicates, 42 modules, 11,107,093 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      277 Kb
% Global stack:           -    2,048 Kb      660 Kb
% Trail  stack:           -      514 Kb       72  b
%        Total:    1,024 Mb    3,574 Kb      937 Kb
% 
% 54 garbage collections gained 88,835,688 bytes in 0.024 seconds.
% 1,401 clause garbage collections gained 151,324 clauses in 0.077 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds

1007
% Started at Mon Oct 16 11:30:46 2023
% 254.268 seconds cpu time for 8,161,941 inferences
% 30,136 atoms, 5,244 functors, 3,236 predicates, 42 modules, 11,939,140 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -    1,012 Kb      298 Kb
% Global stack:           -    2,048 Kb      230 Kb
% Trail  stack:           -      514 Kb      832  b
%        Total:    1,024 Mb    3,574 Kb      529 Kb
% 
% 57 garbage collections gained 94,234,328 bytes in 0.026 seconds.
% 1,503 clause garbage collections gained 160,138 clauses in 0.083 seconds.
% Stack shifts: 5 local, 5 global, 4 trail in 0.005 seconds

The version with signals spends a little more time in clause GC (about 2x, although that’s negligible compared to the totals we are looking at here) but runs it roughly a third as many times. It also ran atom GC, which is completely missing from the version running --signals=false. More importantly, the VM-codes grow at a much, much higher rate in the version without signals. That seems to match your suspicion about GC not working correctly.

Yes. What remains strange though is that the number of collected clauses is comparable. Might be an ambiguity in how the statistics are collected. Atom-gc calls may be explained by collected clauses that cause atoms to become possible garbage.

I see three ways to get closer to this

  • Get a reproducible scenario 🙂
  • Compile for debugging (cmake -DCMAKE_BUILD_TYPE=Debug ...) and enable the various CGC related messages (named MSG_CGC_* in pl-debug.c) using -d topic1,topic2,.... That will produce a lot of noise and hopefully comparing the signal vs no-signal version will give a hint on what is going on.
  • See (using gdb or whatever) which signals are actually processed. SWI-Prolog doesn’t use signals for anything important. The most important use is to unblock blocked system calls, but that only applies when using multiple threads and signalling between the threads (as well as timeouts and alarms, which also use thread signals). The other usages are keyboard interrupts and turning fatal signals into a proper error report. So far I have the impression none of this is relevant, so why would (no) signal handling matter?

It’s an interesting mystery 🙂 Never saw anything similar …

I still don’t have a reproducing example that isn’t tied intimately to our specific application and the code around it, but I have made a bit of a breakthrough: it seems --signals=no implies --threads=no (visible in the statistics output in my previous reply, where the “2 threads, …” line is absent from the no-signals run), and --no-threads alone is enough to trigger the slowdown and produce statistics output identical to the --signals=no version.

I see. --no-signals also sets the Prolog flag gc_thread to false. I saw no reason for that, but removing it makes the tests that embed Prolog in Java hang. Needs some study.

In the meanwhile you can call set_prolog_flag(gc_thread, true) and see what happens. That still does not explain why CGC behaves so badly without a thread, though, or why this only shows up when embedding in OCaml.
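In code, the workaround would look something like this (a sketch based on the flag discussed above; whether it remains necessary may change in future versions):

```prolog
% Sketch: re-enable the GC thread after initialising with
% --signals=false. Guarded so it is a no-op on versions where
% the gc_thread flag does not exist.
:- (   current_prolog_flag(gc_thread, _)
   ->  set_prolog_flag(gc_thread, true)
   ;   true
   ).
```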

I’ll have full certainty later tonight or tomorrow (when our automated performance analysis in CI finishes) but it does seem like re-enabling gc_thread restores the original performance characteristics of our application.

Thank you very much for all your help!

Unfortunately I have not managed to minimize the example any further. I can only observe a slowdown in our biggest files, and only after a substantial number of queries (including intermediate retracting and asserting of dynamic facts).

I’m not yet happy 🙁 Had a look at the hanging Prolog embedded in Java. This is totally unclear. I disabled the GC thread explicitly in JPL and stopped disabling it when using --no-signals. We’ll see …

Still, I’m mostly interested in the scenario where clause garbage collection obviously doesn’t do its job.