The subtle costs of garbage collection

When trying to make a program run faster recently, I wondered what effect garbage collection has on it, and in particular how much the threaded “don’t stop the world” GC was costing. So, I built swipl with -DMULTI_THREADED=OFF and saw a 20% performance boost, even with GC turned off. I hypothesize that this is due to all the “atomic” instructions in the threaded code (memory barriers, atomic increments, etc. – these can be seen as macros in pl-inline.h), which do horrible things to the pipeline and (maybe) the cache. I haven’t yet tried this on a server-class machine. (I’ve noticed that some things I ran on dev.swi-prolog.org were about 3x faster than on my machine; the most significant differences were the number of cores (which shouldn’t have mattered for my code) and the cache size (16MB vs 0.5MB, according to /proc/cpuinfo), while bogomips were almost the same; there are probably also other architectural differences between Intel server CPUs and AMD laptop CPUs.)

I wonder if it’s worth thinking about having 2 “engines” - a single-threaded one that is used until something triggers the multi-threaded one (e.g., garbage collection, or a call to concurrent_maplist/3) and that can switch back when multi-threading is no longer needed. I suspect that this would be complicated, and not all code would see a 20% performance boost, but if it’s straightforward, it might be worth thinking about …

Incidentally, my measurements show that PGO gives a roughly 20% performance boost. And, extrapolating roughly from some experiments with GNU Prolog, compiled code would give a ~3x performance boost (probably even more on a machine with a large cache).

[Usual caveats about not very scientific benchmarks, YMMV, etc. etc.]

Perhaps a related question: I’ve also noticed in some of my performance tests that the load gets distributed across multiple cores (my CPU has 8), maybe for power/heat management. (I have a small widget on my menu bar that shows the current load on each core.) I’ve often wondered how much this disrupts the pipeline and primary cache performance.

Maybe it’s OS dependent, but are there any mechanisms for pinning a thread to a specific core?
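(For what it’s worth, on Linux the taskset utility can at least pin the whole process, and that can be driven from Prolog. A rough sketch – pin_to_core/1 is just a made-up helper, not a built-in:)

% Pin the whole swipl process (all its threads) to one core, Linux only.
pin_to_core(Core) :-
    current_prolog_flag(pid, Pid),                         % OS process id of this swipl
    format(atom(Cmd), 'taskset -p -c ~w ~w', [Core, Pid]),
    shell(Cmd).

?- pin_to_core(2).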

When running CPU-intensive SWI-Prolog tests, I’ve noticed that one CPU gets 100% load, then another, typically staying with one CPU for 5-10 seconds but sometimes for less than a second; I just assumed it was a quirk of the way the Linux kernel assigns threads/processes to cores (and maybe some interaction with the time slicing – after all, there are other processes running on the machine, so if a lot of processes get woken up, the swipl process could lose affinity). There’s a taskset command, but it didn’t seem to have any effect when I tried it.

That is quite extreme. Just ran some tests. The default benchmark suite is 6% faster using the single threaded version, the famous old CHAT80 program 13%. I guess the difference also depends on the CPU and compiler. This is GCC 11 on AMD 3950X (16 cores, 32 threads).

I think the difference used to be smaller. The introduction of transactions is possibly part of the trouble because it demands a fully consistent view in each thread.

Rather than two engines, a split of some of the code that walks clause lists and checks the visibility of clauses might be an option. Although static code does use the same visibility rules as dynamic code to realize an atomic switch for reloading source files, strict consistency is probably not required, and surely not as long as no source file has been modified and reloaded. In that case we can omit the entire visibility test as all static code is visible.

It would surely make sense to analyze the code’s cache behavior using valgrind. I also assume that the layout of various structs is not optimal. Neither, probably, is their placement in memory. Possibly some linked lists should also be replaced by arrays to improve cache behavior. No clue how much there is to gain. There are surely other places where we can easily gain some performance …

I just tried my heavy backtracking program (countdown2.pl) on @jan’s machine - I saw 9% difference between threaded and non-threaded. I also saw about that same variation between sequential runs of the test with the same executable – there’s no obvious pattern. But for the PGO benchmarking, the threaded version ran about 16% faster. And, comparing Jan’s AMD 3950x vs my AMD 3700C, performance was essentially the same (for other programs, I had observed up to 3x performance difference, which I had ascribed to cache size difference … but now I’m wondering whether I made a mistake in doing the comparison).

I don’t know what to make of these numbers. I know that the CPU can do throttling based on temperature; I’ll try things out on an Intel machine to see if it shows anything interesting.

If anyone wants to join in this, my “benchmark” uses https://github.com/kamahen/nerdle (Nerdle game in Prolog) and runs this test in around 27 seconds:

[countdown2]
time(forall(solve([1,3,5,10,25,50], 999, E), true)).

Benchmarking is complicated. And I’m the opposite of a meticulous experimenter/record-keeper, so I probably should re-do everything. (But not today)

For static predicates there shouldn’t be a difference between single-threaded and multi-threaded. It should rather be the other way around: multi-threaded could be faster, if one additionally manages to run garbage collection tasks in separate Prolog system threads. Although the recent trend has been away from such models and towards garbage collection with thread affinity. But it is a little complicated to achieve a compilation scheme that does that. The idea is that static predicates are read-only and don’t need any memory barriers. The read-only invariant is already violated in SWI-Prolog in that abolish/1 can act on a static predicate, whereas the ISO core standard would require a permission error. The retract/1 is blocked but the abolish/1 isn’t:

?- retract(app([], X, X)).
ERROR: No permission to modify static procedure `app/3'
?- abolish(app/3).
true.

Of course that is not a perfect argument that static predicates have a dynamic aspect in SWI-Prolog, since abolish/1 does not affect individual clauses but a predicate as a whole. If static predicates can be seen as read-only, then this opens up the path to access them without locks and without lock-free concurrent data structures that use atomic operations for synchronization.

Edit 22.02.2023
Currently SWI-Prolog uses MEMORY_BARRIER() at the level of a supervisor, which is responsible for a predicate. One compilation scheme that can make static predicates read-only is to use a MEMORY_BARRIER() one level up, i.e. at the module level, and see predicates swapped in and out together with their module, synchronizing the module during consult. On the downside, in one of my Prolog systems I still need synchronization – either through locks or through a lock-free concurrent data structure – if the static predicate is multifile; but if the static predicate is not multifile, it is now read-only and doesn’t need any synchronization at all. If I am not totally mistaken, some other programming languages like Erlang also use module-level consult synchronization.

I keep my development machine for me alone :slight_smile: The dev machine to which you have access runs in the cloud. It is rather slow old hardware as the provider had a nice offer for a fairly decent machine using their previous generation hardware. /proc/cpuinfo says “Intel Core Processor (Broadwell, IBRS)”

Anyway, just for fun I also tried my MacBook with M1. Now we have:

  • Benchmark suite from benchmark dir: 6% (same as AMD 3950X+GCC11)
  • CHAT80: 0% (actually the threaded version seems a little faster). AMD: 13%
  • countdown2.pl: 13% (using -O: 8%), AMD: 13%.

The AMD versions were compiled with GCC 11 and PGO; the Apple M1 used Apple Clang 14.0 without PGO, as PGO doesn’t help with Clang (last time I tried).

None of this is very scientific; everything was measured only once on a pretty idle machine. Differences between runs can easily be 5%. Also, consistent differences after recompiling with changes that surely should not affect the performance of the benchmarks can be significant.

They mostly are, but there are three areas of concern: abolish/1, reloading (reconsult) modified sources and JITI. Actually, there is practically no difference between static and dynamic code. The main concern is in visibleClause() that decides whether a clause is currently visible, given the logical update semantics (classical Prolog), reconsult (SWI-Prolog specific to allow hot patching of running code) and transactions (SWI-Prolog specific). Notably transactions require consistency. If we do

  transaction((retract(p(1)), assert(p(2))))

other threads are guaranteed to see either p(1) or p(2), and will never see neither or both. This requires careful ordering of reading and writing the clause generations. So, probably we can and should use a simpler visibleClause() for static code as this cannot be involved in transactions.

I think (hope) you misread. If a supervisor is created, we synchronize before updating the predicate to point to the new code. As index updates are not that frequent, it doesn’t matter. The thread executing the predicate just jumps to the supervisor without synchronizing. We don’t care much if it gets the old one as, in the worst case, it will redo creating an appropriate one; the old one is not freed before clause garbage collection decides it is safe to do so.

I did a little experiment by using an early if to choose between static and dynamic code. It seems to lower the difference a little, but leads to an overall slowdown. Some more reproducible results on countdown2.pl give me a difference of about 11%, compared to 2-3% on CHAT80.

Hmm. At least for my setup I solved the mystery. In my init file I have

:- set_prolog_flag(prefer_rationals, true).

That implies that division uses rational arithmetic, and since countdown2.pl only verifies that the division returned an integer after the whole computation, once we have a rational we keep doing rationals.
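(For illustration – a minimal toplevel session; the 7r2 answer syntax is how current SWI-Prolog versions print rationals:)

?- set_prolog_flag(prefer_rationals, true).
true.

?- X is 7/2.
X = 7r2.

?- X is 7/2, integer(X).
false.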

But, the GMP library uses malloc(). If I compare without preferring rationals the CPU time drops from about 25 to 12 seconds and the difference between the single and multi-threaded version becomes about 3% again. So, my not-so-scientific experiment suggests the real difference between the single and multi-threaded version is due to malloc().

Peter, do you also have rational arithmetic enabled?

Not as far as I can tell. At one point, I did have it set for my “countdown” code, but I’ve removed that (as @jan observed, preferring rationals slows things down a bit). But I’m now seeing so much skew in the repeated tests that even 20% differences might not be meaningful.

I have a separate stupid question: how does compacting the stacks work without any mutexes or atomic operations? Is it the case that the “mark” phase of garbage collection runs in a separate thread (and, because it’s a conservative garbage collector, if it misinterprets the top-of-stack because another thread changes it, that’s OK); and the compaction runs in the main thread?

Here’s my check for rational-vs-float (confirming that I don’t have a setup file):

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 9.1.4-56-g119163351-DIRTY)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- current_prolog_flag(prefer_rationals, Z).
Z = false.

You are mixing things up. SWI-Prolog has three independent garbage collectors:

  1. Traditional Prolog stack GC. That compacts the term (global) and trail stacks. Each thread has its own stacks and each thread does its own stack GC. That is a mark-and-sweep collector implemented by closely following the excellent paper by Mats Carlsson et al. on the SICStus garbage collector.
  2. The atom garbage collector. That is triggered after each 10k (flag agc_margin) atoms have been created that are not explicitly locked using PL_register_atom(). Atoms are shared resources and therefore the collecting thread and the others have to cooperate. The old version was a stop-the-world collector. The current version, mostly by Keri Harris, does not stop other threads. That makes it attractive to run the atom garbage collector in its own thread so we can do atom-gc without blocking normal progress. There is a small price of course. While scanning the atoms referenced from a thread the thread may not execute stack shifts or (stack) garbage collection. The scan is quick though: it does a simple linear scan through the stacks to find all atom-like cells. That causes the conservative nature: it may find things that look like an atom but are not, such as bit patterns in strings, floats or big integers. It may also scan a bit further than the currently in use stack.
  3. Clause garbage collector. This is triggered on heuristics derived from counting the number of clauses considered that are actually erased and the memory wasted on erased clauses relative to the total number of clauses. The asynchronous behavior is very comparable to atom-gc: it performs an asynchronous scan of the local stack to find referenced predicates and clauses. Old versions used reference counts, but that severely hindered multi-threaded scalability.

The events to consider atom and clause GC are detected by the normal threads. If one thinks it is about time, a bit of code is executed that creates the GC thread if it doesn’t exist yet, or sends it a wakeup. The GC thread executes the atom and clause garbage collections. It might be better to wait with creating this thread until we get into a situation where it is really worthwhile. As is, the normal startup sequence is already enough to create this thread, while, for example, your countdown does not produce any garbage atoms or clauses. Proper tuning of all these processes is an interesting challenge …
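(For reference, these collectors can also be poked at from Prolog; a quick sketch assuming the current defaults:)

?- current_prolog_flag(agc_margin, M).   % atoms created before atom-GC is considered
M = 10000.

?- garbage_collect.            % stack GC, runs in the calling thread
true.

?- garbage_collect_atoms.      % trigger atom garbage collection
true.

?- garbage_collect_clauses.    % trigger clause garbage collection
true.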


But doesn’t this still need to have a memory barrier if there’s a possibility of another thread changing the value? (That is, a memory barrier at every “call” to another predicate)

Also, I’m wondering if there’s something in time/1 that causes memory barriers or atomic operations?

Anyway, the GC paper appears to be this, so I have some reading ahead of me: https://www.researchgate.net/publication/279463524_Garbage_Collection_for_Prolog_Based_on_WAM_Revised_version
As this was written in 1986, am I correct in assuming that it runs in the same thread but it’s so fast that the stop for garbage collection isn’t noticeable?

No. Having the old one is normally safe, possibly less efficient.

time/1 should not be involved in this. It just fetches CPU and wall time before and after the call. Why?
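(Roughly like this hypothetical my_time/1 – not the real implementation, which also reports inferences:)

% Sketch: sample CPU and wall time around the goal, nothing more.
my_time(Goal) :-
    statistics(cputime, T0),
    statistics(walltime, [W0|_]),
    call(Goal),
    statistics(cputime, T1),
    statistics(walltime, [W1|_]),
    CPU is T1 - T0,
    Wall is (W1 - W0)/1000.0,       % walltime is reported in milliseconds
    format('~3f sec CPU in ~3f sec wall time~n', [CPU, Wall]).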

I would not claim that. Yes, for most programs GC time is small. There are exceptions though. The profiler tells you about the impact of normal stack GC. Roughly speaking, backtracking code such as your countdown does little GC. Good functional-style code does, but it is typically quite efficient. Forward progress that leaves choice points behind is the most common scenario for problems.
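For example (reusing the countdown goal from above), running it under profile/1 shows where the time, including stack GC, goes:

?- profile(forall(solve([1,3,5,10,25,50], 999, E), true)).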

Just to confirm: the stack GC runs synchronously in the same thread as the stacks that it’s GC-ing? (That is, it “stops the world”.)

Trying to eliminate variables while figuring out why I get different times for threaded and unthreaded swipl (albeit with more “jitter” than I’d like, so it’s possible that all I’m seeing is due to random effects – I suppose I should try this on a machine where nothing else is running, and not my Chromebook, which might have its own ideas about thread priorities).

I tried this both with GC on and GC off.
With GC: 4 seconds
Without GC: 7 seconds
(Using statistics/0, GC took 0.1 seconds)
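(For anyone reproducing this: I assume GC gets switched off with the Boolean gc flag; stack shifts still happen either way:)

?- set_prolog_flag(gc, false).
true.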

It would appear that GC makes things faster by avoiding some stack shifts (3 local, 19 global, 15 trail in 1.669 seconds).

(I’m seeing a lot of jitter in multiple runs, so I’ll need to try this on a different machine).

One could imagine a Prolog system where all predicates are by default thread local, and where threads can only share information through message passing.

You would more or less get JavaScript Workers. Such a Prolog system would no longer need synchronization of predicates, neither for static nor for dynamic predicates.

I am currently wondering whether I could build such a Prolog variant and what the performance would be. But a special form of thread local would be needed, since the predicate need not be visible to multiple workers.
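(For what it’s worth, the ingredients already exist in SWI-Prolog – thread_local/1 data plus message queues; a small made-up sketch, not a proposal for the actual design:)

:- thread_local counter/1.        % each thread sees only its own counter/1 facts

worker(Queue) :-
    thread_get_message(Queue, Msg),
    (   Msg == stop
    ->  aggregate_all(count, counter(_), N),
        format('worker saw ~w messages~n', [N])
    ;   assertz(counter(Msg)),    % private to this worker thread
        worker(Queue)
    ).

demo :-
    message_queue_create(Q),
    thread_create(worker(Q), Tid, []),
    forall(member(M, [a,b,c]), thread_send_message(Q, M)),
    thread_send_message(Q, stop),
    thread_join(Tid, _),
    \+ counter(_).                % the main thread's counter/1 is still empty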

Edit 27.02.2023
A bonus would be if the Prolog system had event loops in its Workers by default, so that a kind of fibers would be available and one would still have some of the benefits of a multi-threaded Prolog system, just done async in the same thread. Maybe a couple of service threads besides the worker threads nevertheless, so that, for example, things like call_with_time_limit/2 still work. Maybe such a Worker Prolog could be bootstrapped from a multi-threaded Prolog system by changing some defaults, for example that predicates are by default this new form of thread local.

It seems quite a waste to replicate all static code over threads, no? I think I did make a mistake by having dynamic predicates shared by default. XSB has them local by default, which IMO is better. I made this choice because it seemed to have fewer consequences. There are indeed some cases, such as the file search path, that you probably want shared. For a lot of user code, simply sharing dynamic predicates between threads does not work though.

YMMV, but it seems the difference between single and multi-threaded version is somewhere between 2 and 20%, depending on hardware, C compiler and Prolog program. Not very impressive I’d say.

That is surely still something we’d want, notably for the WASM version. I don’t see the relation to the above though? I see roughly two ways out:

  • When waiting for an event, grab the continuation and start it when the event comes in from the “event loop”
  • Have engines and make the event loop dispatch the event to the engine waiting for it.

The first has some issues because continuations in (SWI-)Prolog do not properly handle cuts that should cut “over” the continuation. The second may use too many resources, but might also work well. An engine is just some K-bytes if its stacks are nearly empty. As is, it is not an option, as engines are built on top of the thread infrastructure that we do not use for the WASM version. It is surely technically possible to untangle this and get engines in the single-threaded version. Together with yielding back to JavaScript, that can (I think) give a fully transparent “async” experience with Prolog.

Fibers in Prolog terms, at least as interpreted by https://en.wikipedia.org/wiki/Fiber_(computer_science), are pretty much engines as introduced by Paul Tarau and implemented in SWI-Prolog. And yes, they would allow implementing an event loop and doing all this stuff. As said, that would require untangling them from threads. This is possibly not very hard. I don’t know.
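(For instance, a tiny coroutine-style helper on top of the engine API – first_two/4 is just for illustration:)

first_two(Template, Goal, A1, A2) :-
    engine_create(Template, Goal, E),   % a suspended computation with its own stacks
    engine_next(E, A1),                 % resume until the first answer
    engine_next(E, A2),                 % resume again for the second answer
    engine_destroy(E).

% ?- first_two(X, member(X, [a,b,c]), A1, A2).
% A1 = a, A2 = b.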

And call_with_time_limit/2 uses a thread (not even a Prolog thread), so it won’t work using the current implementation. Implementing it on top of the WASM version won’t be particularly hard: check whether the timeout has elapsed when Prolog yields on a heartbeat.

I claim “would allow (for)”. To use an event loop with multiple threads of control, you either need these threads (not to be confused with OS threads) to be able to yield: if you encounter something that has to wait for an event you yield, and when the event happens the event loop resumes you. Alternatively you can use continuations, but both automatically generated as well as user generated (callback) continuations cannot cooperate well with cuts.

It uses the heartbeat yield, which yields every N inferences from the WASM Prolog VM back to JavaScript. That allows the JavaScript event loop to do its job. The event loop can set a request for an abort in the Prolog instance; the resume then tells Prolog to pick up the abort and act.

We only have one engine though and after Prolog yields, you can make another Prolog call. The “nested” goal must be completed before we can resume the earlier yield. I.e., we cannot yield from the nested goal and resume the older one.

catch/3, setup_call_cleanup/3, etc. all work fine in the WASM version. They do not create a nested VM call. In the absence of normal signals, the WASM version implements sig_atomic/1 without a call through C, such that yielding works. It is just that the PL_open_query() … PL_close_query() pairs must be strictly nested, i.e., if you open another query at the …, you must close it before closing this one. That is because the inner one's stacks are on top of the outer one's. Possibly some spaghetti-stack implementation could resolve that. No idea how hard that would be to implement. It is an interesting idea as it would provide cheaper fibers compared to engines.