Porting the SWI-Prolog benchmark suite: comparing 8 Prolog systems

See Porting the SWI-Prolog benchmark suite: comparing 8 Prolog systems - #31 by jan for updated figures for queens_clpfd, improved timing, updated Ciao Prolog and Scryer Prolog.

I did some work on GitHub - SWI-Prolog/bench: Prolog benchmarks (`van Roy' set). This suite is used for training “Profile Guided Optimization” builds as well as for assessing the impact of changes. It used to run on SWI-Prolog, SICStus and YAP. As the driver depended heavily on conditional compilation and a Quintus-Prolog-derived module system, portability was limited.

I added a portability layer that relies on rewriting the original benchmark programs. This allows us to generate something that easily runs on different Prolog systems (using SWI-Prolog both for the rewrite process and for scripting the workflow; a sketch of the rewriting idea follows the table below). After some refinement, adding a system is now doable in less than an hour. Currently the suite supports SWI-Prolog in several configurations and 7 other systems:

Identifier     Description                   Version
――――――――――――――――――――――――――――――――――――――――――――――――――――
swipl          SWI-Prolog (-O,PGO)            9.1.19
swipl-no-pgo   SWI-Prolog (-O,no PGO)         9.1.19
swipl-no-O     SWI-Prolog (PGO)               9.1.19
swipl-st       SWI-Prolog (no threads,-O)     9.1.19
swipl-wasm     SWI-Prolog (WASM,-O)           9.1.19
swipl-win64    SWI-Prolog (Win64,-O)          9.1.19
gprolog        GNU-Prolog                      1.6.0
yap            YAP                             6.3.4
sicstus        SICStus Prolog                  4.8.0
scryer         Scryer Prolog                 0.8.127
xsb            XSB                             5.0.0
trealla        Trealla Prolog                 1.6.18
ciao           Ciao Lang                      1.21.0

For SWI-Prolog, “-O” means running swipl -O (optimized arithmetic, which matters as many of the benchmarks do a lot of arithmetic). “PGO” means compiled using Profile Guided Optimization, “no threads” means the single-threaded version, “WASM” the WebAssembly version running under node.js, and “Win64” the Windows binary running under Wine. SICStus Prolog is the official binary version; other systems were built from source with default options.
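
The rewriting mentioned above is essentially a term-by-term source transformation. Here is a minimal sketch of the idea (hypothetical code; the real driver under port/ in the repo is more involved, and runs under SWI-Prolog):

% Copy a program term by term, mapping system-specific constructs to
% something the target system accepts. rewrite_term/2 is a made-up
% example mapping, not one the suite actually uses.
rewrite_file(In, Out) :-
    open(In, read, I),
    open(Out, write, O),
    repeat,
      read_term(I, Term, []),
      (   Term == end_of_file
      ->  !, close(I), close(O)
      ;   rewrite_term(Term, Ported),
          portray_clause(O, Ported),
          fail
      ).

rewrite_term((:- use_module(library(lists))), (:- ensure_loaded(lists))) :- !.
rewrite_term(Term, Term).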

Hardware and OS: all benchmarks were executed on an AMD 3950X with 128 GB RAM, running Ubuntu 22.04.

Benchmarks that use more than minimal ISO:

  • perfect – requires unbounded integers.
  • queens_clpfd – requires clp(fd) :slight_smile:
  • pingpong – requires tabling
  • fib – requires tabling and unbounded integers (the XSB result is wrong)
  • moded_path – requires tabling with answer subsumption
  • det – benchmarks single-sided unification (=> rules); see the sketch below this list
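
To make some of this concrete, here are minimal sketches of tabling with unbounded integers (as in fib) and of single-sided unification rules (as in det). These are illustrative only, not the actual benchmark code:

% fib: tabling plus unbounded integers.
:- table fib/2.

fib(0, 1).
fib(1, 1).
fib(N, F) :-
    N > 1,
    N1 is N-1, N2 is N-2,
    fib(N1, F1), fib(N2, F2),
    F is F1+F2.

% det: single-sided unification rules; a matching head commits,
% so no choice point is left behind.
max(X, Y, M), X >= Y => M = X.
max(_, Y, M) => M = Y.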

Below I’ll share some charts. Missing bars imply the system could not run the benchmark. In some cases this can probably be fixed; in others it is a lack of features such as unbounded integers, clp(fd), tabling, etc. Given SWI-Prolog (GIT version) and GNU-Plot (for the charts), the command used to generate each chart is included. The number of iterations of each benchmark was calibrated in 2021 such that it took about 1 second on SWI-Prolog 8.5.2.

Some classic systems

swipl compare.pl -o classics.svg swipl gprolog yap sicstus xsb ciao

[chart: classics.svg]

The “modern” systems vs. SWI-Prolog

swipl compare.pl -o modern.svg --ymax 50 swipl trealla scryer

[chart: modern.svg]

  • The sieve benchmark tests the dynamic database. It is, AFAIK, fully ISO, but not accepted by Scryer. Trealla accepts it, but did not finish within an hour; it seems to get slower on each iteration. A sketch of this style of workload follows below.
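
For reference, the kind of workload sieve represents is assert/retract churn on the dynamic database. A minimal hypothetical sketch (not the actual benchmark, and using the common but non-ISO forall/2 and between/3 for brevity):

:- dynamic candidate/1.

% Assert all candidates, then repeatedly retract the first (smallest)
% remaining number, which is a prime, and all of its multiples.
sieve(Limit, Primes) :-
    retractall(candidate(_)),
    forall(between(2, Limit, N), assertz(candidate(N))),
    collect(Primes).

collect([P|Ps]) :-
    retract(candidate(P)), !,
    forall((candidate(M), M mod P =:= 0),
           retract(candidate(M))),
    collect(Ps).
collect([]).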

SWI-Prolog using different compilers and setup

This comparison uses GCC 11.4 and Clang 15. For GCC we compile with and without PGO (Profile Guided Optimization), and we also built a version without multi-thread support. With Clang, using PGO makes the system slower, so we stopped using it.

swipl compare.pl -o c-compiler.svg swipl swipl-no-pgo swipl-st swipl-clang-15

[chart: c-compiler.svg]

SWI-Prolog native vs WASM

This compares the native AMD64 binaries with the WASM version, compiled using Emscripten 3.1.44 (based on Clang) and running on node 18.17.1.

[chart: wasm]

All data

Below is the CSV (well, the separator is | as GNU-Plot does not like , in quoted fields) with all raw data. Generated using

swipl compare.pl -o all.svg swipl swipl-no-pgo swipl-clang-15 swipl-no-O swipl-st swipl-wasm swipl-win64 gprolog yap sicstus scryer xsb trealla ciao

all.csv.log (5.0 KB)

Discussion

There is a lot to say about these figures. At the same time, benchmarks are always highly debatable. It is quite possible that notably the really poor results on some individual benchmarks are caused by a trivial oversight in the port. Some benchmarks failing on some implementations can probably be fixed easily, and some failures are probably caused by me not being patient enough to find a workaround.

Porting turned out to be hard. I was surprised how hard it is to load one Prolog file from another. Some systems resolve the name of the file to load relative to the file in which the directive appears. Some use a search path and do not like the extension (XSB). For Trealla I failed to figure out the rules; the only solution that worked was to include all code in one file except for the benchmarks themselves, and to run the benchmarks from the directory holding the source. Porting to the “classic” systems was a lot easier because error messages are much clearer, documentation is better, and I experienced no crashes on any of them. My respect for Paulo Moura, who makes Logtalk run on all these systems, has grown a lot!
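
To illustrate the kind of variance (hypothetical snippets, not verbatim from any of the ports):

% Three spellings of the same intent; which one works, and what the
% path is resolved against, differs per system:
:- ensure_loaded('programs/nreverse.pl').  % relative to this file, or to the CWD?
:- ensure_loaded(programs(nreverse)).      % search-path alias, SWI-Prolog style
:- [nreverse].                             % extension implied; some systems reject an explicit .pl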

A few highlights

  • SICStus Prolog is unbeatable on this benchmark set. Congrats!
  • While “Scryer Prolog” is advertised as “A modern Prolog implementation written mostly in Rust”, its GIT history goes back 7 years. It has a long way to go wrt. performance, context in error messages, and even simply loading a file (include/1 does not work; only by using use_module/1 did I manage to load non-module files). “Trealla” is faster, but has some weird outliers (poly_10 and sieve) and I had to work around many crashes.
  • WASM is always claimed to provide “near native performance”. On average the WASM version of SWI-Prolog is over 5 times slower. Ok, about 30% of this goes to Clang vs. GCC.

Future work

If you think some system is handled unfairly, please let me know.

  • Include more systems. Please provide a PR for your system.
  • Include benchmarks that cover other areas. Think about GC, built-ins (sort/2, findall/3, etc.), program load time, indexing large databases, coroutining, constraints, tabling.
  • Include memory usage, possibly monitored using time(1) on Linux?
  • Improve the GNU-Plot output, e.g., show the value on hover, or somehow break exceptionally long bars into two. Both seem to be possible. Any GNU-Plot wizards around here?

Acknowledgements

@j4n_bur53 brought benchmarking to our attention.

I would like to thank the SICStus team for providing me with a permanent license for SICStus Prolog.


Good job. What prevents SWI-Prolog from achieving higher speeds? For my tasks, the speed of SWI-Prolog is quite enough.

That is not so easy to answer. In some places I probably made the wrong choice in the design of the VM, basing it on the ZIP rather than the WAM. The stack-based argument passing of the ZIP does make it easier to integrate the debugger into the VM, though. Some choices wrt the data representation are also dubious. This could be improved quite easily by actually using the benefits of 64-bit architectures. I’m a little reluctant to do that, as it would imply either giving up on 32 bits or getting into a dual representation with an extra maintenance burden. Possibly when 64-bit WASM becomes the norm we can ditch 32-bit support?

Clause indexing for predicates that you typically find in normal static code, with say 2 to 20 clauses per predicate, is not very good. That can be fixed without a redesign: just decide on appropriate indexing instructions and compile the clause selection into this instruction language.
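
As an illustration (a made-up example, not from the suite), consider a predicate whose clauses are only distinguished by the second argument. With just first-argument indexing the clauses are tried one by one; compiled clause selection could switch on the atom in argument 2 directly:

% The first argument is always an unbound input, so first-argument
% indexing cannot discriminate between these clauses:
op_apply(X, plus,  Y, Z) :- Z is X + Y.
op_apply(X, minus, Y, Z) :- Z is X - Y.
op_apply(X, times, Y, Z) :- Z is X * Y.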

The VM is overloaded with features such as synchronous interrupts, depth and inference limits, the debugger, profiler, coverage analysis, etc. I think it is worth the slowdown, but I would not be surprised if that reaches 20% (measured very long ago, when it was still possible to remove all that stuff). Possibly there are better alternatives. If we implement the VM instructions as an array of functions, we can swap optimized and instrumented versions. As is, this is implemented by @dmchurch and works for Clang, which seems to be bad at optimizing large functions; as a result, using functions or one big switch performs about equally. GCC does a good job on the big switch and wins by about 30%, as we can see in the comparison.

And, finally, SICStus compiles to native code on various platforms. That notably improves performance on small benchmarks. The gain on many larger programs is less obvious because virtual machine code is shorter and thus utilizes the caches better.

But optimization is not easy. It is notably hard to predict the effect of rewrites, some of which are very involved. Spending weeks to end up with something that turns out to be slower than where you started is rather frustrating :frowning:


Asking this in general.

Are there any users of SWI-Prolog that can only use the 32-bit version?

I wonder where speed plays a role outside of the benchmarks themselves? I don’t know of any real applications based on Prolog where someone would say they were limited by the performance of the application, or saved a lot on servers by replacing SWI-Prolog with SICStus Prolog.

Sometimes a system needs not only speed but also to stay responsive. I am not 100% sure whether this is guaranteed by all Prolog systems. I noticed even with SWI-Prolog that the cursor sometimes stops blinking and that aborting a long-running problem becomes difficult. Not sure what’s going on. If you garbage collect a large memory occupation with a non-concurrent or non-incremental garbage collector, you might see such effects. This might be annoying for applications running in WASM in the browser as well as for standalone applications.

This is sometimes mitigated by separate workers or threads, but that might not be very successful if garbage collection itself isn’t interruptible. It is a recurring topic in language design.

Those applications surely exist. SWI-Prolog still has a reputation of being slow. Over time it got a few times faster, placing it now amongst most of the “classical” systems (leaving the “modern” systems far behind :slight_smile: ). I assume that applications for which performance really matters mostly use YAP or SICStus.

SWI-Prolog has been reported to compete well with YAP and SICStus on many large applications, though. It all depends on what type of operations are critical. Do not forget that these old benchmarks have pretty poor coverage of what many large applications use.

If you mean by responsive being capable of handling interrupts or, for the WASM version, yielding back to the browser: the only long-term blocking action is the (stack) garbage collector. Atom and clause garbage collection happens in a separate thread (as yet not in the WASM version) and thus runs concurrently without blocking other threads. As we have seen, some foreign predicates may block interrupt handling for quite a long time, but this is fairly easily fixed. The interfaces for making them yield, to keep the browser responsive when using the WASM version, are there, but they are not easy to use, in particular if a lot of state is involved in the foreign code. With some effort, GC could be made to yield. Atom and clause GC are probably not even hard to keep responsive in the WASM version.

And, who knows what will happen in the WASM world. Emscripten already allows C code to be compiled into asynchronous code, AFAIK at the cost of a large code blowup and a lot of performance for now. I see no real reason why the WASM VM could not support asynchronous behavior, though.

Hi Jan,

According to our experience with Ciao, I believe that your analysis is quite correct. Disabling some of the features that you mention (limits, profiler, etc.) may gain you 10-20%, but it is not worth it if your users require those features. Microbenchmarks are good to show how fast your language/implementation is and how it compares with other languages. But in the end, what really matters is how fast the system is for each particular real application (and sometimes well-engineered built-ins, indexing, etc. can achieve orders-of-magnitude speedups).

I also agree that 32-bit cannot be dropped at least until WASM goes 64-bits.

It’d be great to see the benefits of PGO in other systems (just to learn whether the plethora of low-level tricks traditionally used in VM implementations is still worth it), as well as to try other CPUs (different cache sizes, etc.).

Nevertheless, one takeaway from your data is that “modern systems” are not learning from the state of the art at all (or they know it, but it is really hard to get to this point). This is really dangerous for the community, especially when newbies choose a Prolog system based on GitHub popularity rather than its technical qualities (this view has been publicly ridiculed on social media but I really won’t stop complaining about it). Serious technical analysis of systems should be more important than popularity. I’m sure there are many useful performance/resource metrics where each new or old system excels.

EDIT: Please accept my apologies if anyone felt offended by the comments above; it was really not my intention. My (unfortunately exaggerated) “dangerous for the community” only refers to the possibility that some good implementation ideas (which took decades of research) may be forgotten or reinvented. This only applies to “VM implementation technology”, and systems are much bigger than that. Every new system I’ve seen definitely solves some issues better than others! I tried to reflect that with the last sentence about using diverse performance/resource metrics to compare systems. Clearly microbenchmarks only show a limited picture and there are many dimensions to consider (ISO conformance, advanced features (CLP, tabling, etc.), stability, ease of maintenance, flexibility, code-base size, portability, resource usage, performance per built-in, etc.).


The engine is simply tested with the usual ctest, which runs all normal command-line tests using node. There is not much testing of the browser integration yet. The WASM port is usable, but far from perfect. Interactive tests are at SWI-Prolog WASM demos, notably Some tests. There is no CI for it. Complete CI is anyway a bit of a problem due to the enormous diversity of environments in which you can compile SWI-Prolog and the various configuration options. So, as yet there is only fully integrated CI for the default build on 64-bit Intel/AMD Ubuntu Linux. In my dev environment I test a lot more configurations after potentially non-portable changes.

The repository is a GIT submodule of the entire SWI-Prolog distribution. This works if you clone the whole system and build the WASM version in a build directory build.wasm, as I do. The same holds for the single-threaded and non-PGO versions.

Building different configurations is supported by creating a directory build.feat1-feat2-... under the project root, going there, and running ../scripts/configure. This scans the desired features and assembles the environment and command-line options for CMake to configure the system. This avoids the need to remember and type the long CMake options. It is comparable to using CMake presets, but you have to define the mapping of each feature to flags only once, after which you can explore all combinations. Won’t work on Windows (but does run on almost anything else).

It is a source repository. There are no generated files in it (well, except for the BibTeX-generated bibliography, such that we can build the HTML documentation without the need for a full LaTeX installation).

The WASM version of SWI-Prolog is available as an npm package. See GitHub - SWI-Prolog/npm-swipl-wasm: SWI-Prolog WebAssembly build as a NPM package. Thanks to @jeswr!

This CPU has a big cache (1/8/64 MiB) and fast memory – the CPU alone costs about the same as my desktop machine. :slight_smile:
I wonder if you’d get significantly different relative results on a machine with a smaller cache (e.g., my desktop’s Intel at 0.256/1/6 MiB) – ISTR some things ran on the AMD 3950X about 3x faster than on my desktop. And cache isn’t the only factor; my desktop (Intel) is generally faster than my Chromebook (AMD), despite the latter having a bigger cache (0.384/3/16) and more cores.

Of course, the future will bring bigger and bigger caches, so what is now a server CPU will be similar to a laptop CPU in a few years. But comparisons across different CPU architectures may provide insights into where the performance bottlenecks are (and these tests would need their own PGO builds, I presume).

If you want to replicate the results, get a non-Windows machine (WSL will do as well, be it slow), clone the git repo and build the things you want to check. It takes less time than sending all these messages and allows for playing around with a lot more parameters.

I am still grateful that it was donated by TerminusDB :slight_smile:

That is a good question. I think not. PGO mainly helps the compiler with branch prediction and with identifying hot places in the code, AFAIK. You can probably do the same using GCC/Clang __builtin_expect() to help branch prediction, and AFAIK there are also attributes to tell the compiler that a function is hot. There are also compiler built-ins to prefetch memory locations, etc. Quite likely you can get almost the same results as with PGO by using these annotations. It would be great if you could use PGO to automatically annotate the source :slight_smile: It would also be nice if there were tools to optimize structure layout wrt. cache behavior. I guess that should be possible, but my last search for such tools had no success.

One would assume that tuning for the target CPU is attractive. GCC says (by default, none of these flags is specified):

Where none of -mtune=, -mcpu= or -march= are specified, the code is tuned to perform well across a range of target processors.

So, I tried with -march=native -mtune=native. I’m not excited, but it seems to help a little:

[chart: bench]

Surely, both CPU architecture and cache architecture cause some choices to work out better on one platform and others better on another. As keeping the alternatives around is too costly, we have to make a choice. These choices are mostly based on the measurements I get on my CPU.

As @jfmc says, a lot of tricks have been tried over the years. Some remain valid, others have been made obsolete by progress in CPU and/or compiler design. So-called jumbo instructions are an interesting example. When I talked to Bart Demoen about these things in the early 2000s, Bart was quite clear: identify common instruction sequences and create jumbo instructions for them. That is quite a bit of work. I did so for some really common sequences with quite good results. The main reason is that the VM switch in those days was not handled well by the CPU pipeline. Modern CPUs do (often?) anticipate the jump though, making the gain much smaller. These days, optimizing cache behavior is probably one of the most important considerations. But then, CPUs get bigger caches and possibly predictive capabilities to reduce cache misses. Who knows …


The current version of Scryer (as of 2 weeks ago) is 0.9.3. Scryer Prolog version 0.9.3 is out! · mthom/scryer-prolog · Discussion #2162 · GitHub

I’m wondering if there was a particular reason to benchmark against 0.8.2 (especially since it seems there were issues/bugs getting that to work).

I’m mainly asking because I quite enjoy Markus Triska’s enthusiasm and insights around logical purity, and so I have high hopes for Scryer Prolog’s lofty goals given that he is the #2 contributor.

(Having said that, I have a practical need to get things done, run fast concurrent code, have an active community, and so I’m using SWI-Prolog for day-to-day coding.)

Anyway, this is why I’m interested in a fair comparison between SWI and Scryer (and haven’t looked at other version numbers).


My mistake. I built from the Nov 16 git repo, but there was an older version in ~/.cargo/bin and some part of the rust/cargo/… setup apparently added by ~/.bashrc :frowning: I’ll check the configuration and rerun the benchmarks.

I used rustc 1.74, installed from the Rust side. It wasn’t a big deal to build. It would be nice if somewhat older Rust versions worked too. As this is my only Rust project it is ok, but if version requirements are too tight things quickly get messy on your dev machine unless you use a lot of containers :frowning:


:slight_smile: love this !

Hi @jan, we’ve created a PR [1] to address a possible problem when measuring times in the Ciao port (under some circumstances call/1 in Ciao can be relatively slower than in other systems; this commit attempts to reduce this negative impact so that benchmark results really reflect the benchmarked program and not something else). It has been only partially tested, as I have some problems running the benchmarks (probably I’m using a wrong version of SWI or not following the instructions carefully) – I’ll continue looking at it.

[1] attempt to remove meta-call related noise from ntimes/2 loop by jfmc · Pull Request #1 · SWI-Prolog/bench · GitHub
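
For context, harnesses of this kind typically time the benchmark loop and subtract the time of a “dummy” loop with the same iteration and meta-call overhead. A minimal sketch of the pattern (hypothetical, not the actual ntimes/2 from the repo; cputime/1 stands in for whatever per-system CPU-clock primitive the port maps to):

ntimes(_, 0) :- !.
ntimes(Goal, N) :-
    call(Goal),
    N1 is N - 1,
    ntimes(Goal, N1).

% Compensated timing: real loop minus a no-op loop with the
% same call/1 and counting overhead.
bench(Goal, N, Time) :-
    cputime(T0), ntimes(Goal, N), cputime(T1),
    cputime(T2), ntimes(true, N), cputime(T3),
    Time is (T1-T0) - (T3-T2).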

Thanks for the input. So, here is take 2:

  • Updated the “dummy” loop timing as contributed by @jfmc
  • Fixed queens_clpfd.pl. The old code posted the constraints, but did not solve them. There is something weird going on in SICStus here. SICStus is surely much faster in clp(fd) in general. I asked for a response, so this may get updated with either an explanation or a fix.
  • Some driver updates, notably an option to regenerate the charts in different combinations without rerunning the benchmarks.

Some classic systems

[chart: classics.svg (updated)]

  • Updated Ciao Prolog to 1.22.0
  • Updated Ciao chart to include @jfmc’s improved loop.

The “modern” systems vs. SWI-Prolog

[chart: modern.svg (updated)]

  • Updated Scryer Prolog to 0.9.3. They made serious progress. Congrats! The queens_clpfd.pl and sieve.pl benchmarks have been added. The ISO predicates number/1 and retractall/1 have been added. I had to make more changes to get the code loaded: creating a module with the programs and some support predicates somehow did not work anymore (predicates became invisible), and loading a file programs.pl from a directory holding a subdirectory programs silently loaded nothing until I added the .pl suffix. The sieve bar is cut at 20, but the actual value is 359.

I will add a pointer to these results in the original post.


Thank you for the quick update!! I think that the numbers reflect more closely our expectations on this set of benchmarks (taking into account the optimizations that each system implements).

This looks like a really interesting project to extend with more benchmarks, perhaps classified by topic (clp, tabling, dynamic database, arithmetic, etc.). I think that it’d be a mistake to take it too seriously while doing micro-optimization of our VMs. For example, Ciao seems the fastest system executing nreverse (even faster than SICStus). But it is fine to sacrifice some performance (even in all benchmarks) if you need to implement some nice feature, e.g., tabling.
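
For readers who do not know it, nreverse is the classic “naive reverse” micro-benchmark; its usual formulation (give or take the list length used and the name of the concatenation helper) is:

nreverse([], []).
nreverse([H|T], R) :-
    nreverse(T, RT),
    concatenate(RT, [H], R).

% Named concatenate/3 to avoid clashing with a built-in append/3
% on some systems.
concatenate([], L, L).
concatenate([H|T], L, [H|R]) :-
    concatenate(T, L, R).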


For some rather strange reason it seems that Ciao tends to produce negative results, i.e., the dummy loop consistently takes more time than the real loop. Yes, some of the tests are really simple, but this is still weird, notably because it seems fairly consistent. Any idea what could be happening? If you have a recent version of SWI-Prolog (devel), you can reproduce it using (in the bench top dir):

swipl port/prepare/ciao.pl
ciao
?- ['port/run/ciao.pl'].
?- run(1).

You can use run(0.1) to run 10 times fewer iterations. If you can’t do the prepare step, I can send the files in a private mail.

Agree. It would anyway be good to have more acknowledged benchmarks than the old van Roy suite.

1 Like