While trying to make a program run faster recently, I wondered what effect garbage collection has on it, and in particular what the threaded "don't stop the world" gc was costing. So, I built with -DMULTI_THREADED=OFF and saw a 20% performance boost, even with gc turned off. I hypothesize that this is due to all the "atomic" instructions in the threaded code (memory barrier, atomic increment, etc. – these can be seen as macros in pl-inline.h), which do horrible things to the pipeline and (maybe) the cache.

I haven't yet tried this on a server-class machine. I've noticed that some things I ran on dev.swi-prolog.org were about 3x faster than on my machine; the most significant differences were the number of cores (which shouldn't have mattered for my code) and cache size (16MB vs 0.5MB, according to /proc/cpuinfo), while bogomips were almost the same. There are probably also other architectural differences between Intel server CPUs and AMD laptop CPUs.
I wonder if it's worth thinking about having 2 "engines": a single-threaded one that is used until something triggers the multi-threaded one (e.g., garbage collection, or a call to concurrent_maplist/3), switching back when multi-threading is no longer needed. I suspect that this would be complicated, and not all code would see a 20% performance boost, but if it turns out to be straightforward, it might be worth exploring …
Incidentally, my measurements show that PGO gives a roughly 20% performance boost. And, extrapolating roughly from some experiments with GNU Prolog, compiled code would give a ~3x performance boost (probably even more on a machine with a large cache).
[Usual caveats about not very scientific benchmarks, YMMV, etc. etc.]