This targets LLVM IR rather than implementing the interpreter in C++ … the main advantages, it seems, are:
- reducing the size of the giant switch statement (the size of the switch inhibits optimization) - I think @dmchurch did this a while ago.
- using tail calls for the functions in the switch statement, avoiding some register-loading overhead.
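To make the second point concrete, here is a minimal C sketch of tail-call dispatch. This is not the SWI-Prolog code; all names (`vm`, `dispatch`, `op_add1`, etc.) are hypothetical. The idea is that each VM instruction ends by calling the next instruction's handler directly, so the hot state stays in argument registers and control never returns to a central loop:

```c
#include <assert.h>

/* Illustrative tail-call dispatch (hypothetical names): each VM
 * instruction is a function that ends by calling the handler of the
 * next instruction, keeping hot state in argument registers. */

typedef struct vm { const int *pc; long acc; } vm;
typedef long (*vmi_fn)(vm *m);

static vmi_fn dispatch[3];          /* indexed by opcode */

static long op_halt(vm *m) { return m->acc; }
static long op_add1(vm *m) { m->acc += 1; m->pc++; return dispatch[*m->pc](m); }
static long op_dbl (vm *m) { m->acc *= 2; m->pc++; return dispatch[*m->pc](m); }

static vmi_fn dispatch[3] = { op_halt, op_add1, op_dbl };

static long run(const int *code)
{ vm m = { code, 0 };
  return dispatch[*m.pc](&m);       /* kick off the chain of tail calls */
}
```

Whether the compiler actually emits these calls as jumps (rather than growing the stack) is exactly the portability question discussed further down in this thread.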
Using LLVM for a JIT interpreter turns out to be problematic – or, it used to be: the Unladen Swallow project for Python ran into a lot of problems with using LLVM as a JIT, and eventually gave up. However, the Lua project seems to avoid these problems by generating the main interpreter with LLVM rather than using LLVM as a JIT.
I missed the original post (had some email trouble around that time). Yes, I agree this is interesting. @dmchurch indeed wrote some clever macros that allow running the main interpreter loop as a set of functions or as a switch.
The toolchain described in the blog post relies on clang/llvm. Unfortunately, clang does a rather poor job on SWI-Prolog. The numbers below are on Ubuntu 22.04, using the latest versions of gcc (12) and clang (15), running on an AMD 3950X. All versions run the default benchmark suite that is also used for PGO (Profile Guided Optimization), as
```
src/swipl -O ../bench/run.pl
```
Time is the average over all benchmarks (less is better).
gcc-12 PGO Functions
clang-15 PGO Functions
I have no clue what is wrong with clang's PGO. Possibly something is wrong with our configuration; otherwise it is broken.
GCC seems to do a much better job on the large switch than clang.
In @dmchurch’s current version, AFAIK, we do not use tail calls. I.e., the main interpreter loop picks the next VMI and calls its implementation from an array of function pointers.
The time needed to rewrite the virtual machine to get an additional 30% performance would be considerable – I think that it would be better to spend the same amount of time on new features. (A 200% improvement might be worth the effort – but that can be achieved right now by switching between clang and gcc; an even greater performance gain is possible by using a machine with larger L1 and L2 cache, as I’ve observed by comparing Jan’s machine with mine.)
Whether it should be at the top of the priority list is another matter. There are potentially interesting features enabled by having virtual machine instructions as functions:
- We can dynamically replace VMI instructions. This allows instrumented vs. normal mode for instructions to come at no cost. By instrumented I mean all the checking in debug mode, profiling, coverage analysis, waiting for signals, inference/depth limits, etc.
- We can much more easily create jumbo instructions, i.e., sequences of common instructions. Using jumbo instructions reduces the VM code size, avoids VMI dispatching and allows the compiler to schedule the combined action. In the current design, however, this is a lot of work, so only some have been done.
- It probably simplifies JIT compilation?
- It is simply a lot more elegant: easier to write, easier to debug, easier to profile (and thus optimize), etc.
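The first point can be sketched in a few lines of C. This is only an illustration with hypothetical names, not the actual SWI-Prolog table: switching to an instrumented (debug/profiling) mode is just swapping entries in the dispatch table, so normal mode pays nothing for the checks:

```c
#include <assert.h>

/* Sketch of dynamic VMI replacement (hypothetical names): the
 * instrumented variant counts steps; the normal variant does not.
 * Mode switching only rewrites the function-pointer table. */

typedef struct vm { long acc; long counted; } vm;
typedef void (*vmi_fn)(vm *m);

static void i_add1(vm *m)         { m->acc += 1; }
static void i_add1_counted(vm *m) { m->counted++; m->acc += 1; }

static vmi_fn dispatch[1] = { i_add1 };

static void set_instrumented(int on)
{ dispatch[0] = on ? i_add1_counted : i_add1;
}
```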
For clang it appears that the gain we get from better optimization of short functions outweighs the function-call overhead, but for gcc the story is very different, mainly because it seems to be much better at optimizing the giant function. If we could find a workable way to turn the VM instructions into functions that doesn't lose performance compared to the best we have (gcc+PGO), I'd be all in favor. As is, we still lose nearly 30% though.
*edit* We might get away with defining all VM instructions as functions and using a switch where each entry simply calls the function inline.
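A minimal sketch of that edit's idea, with hypothetical names: the VMIs are written as `static inline` functions, and gcc keeps its well-optimized giant switch, where each case merely calls (and inlines) the corresponding function:

```c
#include <assert.h>

/* Sketch: VMIs as static inline functions, dispatched from a switch
 * whose cases simply inline them (hypothetical opcodes and names). */

enum { OP_HALT, OP_ADD1, OP_DBL };

static inline long vmi_add1(long acc) { return acc + 1; }
static inline long vmi_dbl (long acc) { return acc * 2; }

static long run(const int *code)
{ long acc = 0;

  for(const int *pc = code; ; pc++)
  { switch(*pc)
    { case OP_ADD1: acc = vmi_add1(acc); break;
      case OP_DBL:  acc = vmi_dbl(acc);  break;
      case OP_HALT: return acc;
    }
  }
}
```

The same function definitions could also populate a function-pointer table for a clang build, so both dispatch styles share one set of VMI bodies.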
Is this the work that @dmchurch did? Is it independent of generating LLVM IR?
Similarly, creating jumbo instructions would be independent of whether the VMI instructions are written in C or IR?
How universal is the IR? Can it be used by both gcc and clang? Presumably Microsoft’s compiler has something different, so would there be a problem on Windows with using the IR?
My concern about the possible performance gain is that possibly gcc already generates something close to the optimal IR, so rewriting the interpreter might gain less than a 30% performance improvement. Is it possible to feed gcc's IR to clang and clang's IR to gcc, to find out where the bad optimization is?
Yes. This work redefines the VMI() macros in pl-vmi.c to either generate a switch or functions. It has nothing to do with LLVM IR. It simply rebuilds the VM as a set of functions and runs a VM instruction as (simplified)
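The snippet that followed "(simplified)" appears to be missing here. The shape described is roughly the following; the names are illustrative, not the actual SWI-Prolog identifiers:

```c
#include <assert.h>

/* Hedged guess at the simplified dispatch: the loop fetches the next
 * opcode and calls its implementation from an array of function
 * pointers (illustrative names only). */

typedef struct state { const int *pc; long acc; int halted; } state;
typedef void (*vmi_impl)(state *s);

static void vmi_halt(state *s) { s->halted = 1; }
static void vmi_add1(state *s) { s->acc += 1; }

static vmi_impl vmi_table[] = { vmi_halt, vmi_add1 };

static long run(state *s)
{ while(!s->halted)
    vmi_table[*s->pc++](s);         /* the simplified dispatch step */
  return s->acc;
}
```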
Yes. If we have functions we can define a jumbo VMI simply by calling the sequence of instructions it is composed of, passing the right arguments. If we carefully inline, the compiler will optimize the sequence.
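As a sketch (again with hypothetical names): a jumbo VMI is just the small VMIs called in sequence, and with `static inline` definitions the compiler can fuse and schedule the combined body as one unit:

```c
#include <assert.h>

/* Sketch of a jumbo instruction composed from two small VMIs
 * (hypothetical names); the inline bodies let the compiler
 * optimize the combined sequence. */

typedef struct state { long acc; } state;

static inline void vmi_add1(state *s) { s->acc += 1; }
static inline void vmi_dbl (state *s) { s->acc *= 2; }

/* Jumbo instruction: ADD1 immediately followed by DBL. */
static void vmi_add1_dbl(state *s)
{ vmi_add1(s);
  vmi_dbl(s);
}
```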
This is pretty clang-specific, AFAIK. gcc also has an intermediate language, and AFAIK you can have gcc stop at a specific step and emit the intermediate result. I am not sure whether you can manipulate this and have gcc continue from it. The intermediate representations are S-expressions. I once used that to find code that handles Prolog terms as native pointers and calls some function that may (indirectly) call GC or stack shifts.
That is indeed not impossible. In that case the only benefit would be that clang can produce binaries with the same performance as gcc. We can only start using a function-based VM if it performs as well as what we can achieve now on the major platforms. So far that doesn't work. One option that might work is to write the VM instructions as functions and, for e.g. gcc, use a switch that inlines all these functions. That would allow for the simple definition of jumbo instructions, but not for dynamically selecting execution modes by modifying the array of functions.
I fear not. gcc and clang seem to walk different paths here: gcc focuses on optimizing the giant switch, while clang comes with extensions that allow using small functions and tail calls. E.g., clang has a [[musttail]] attribute for return statements. gcc does tail calls at higher optimization levels only and gives no user control over them. There is a gcc plugin that allows for an __attribute__((musttail)) property, but this is not (yet?) accepted. The next thing we'd need, according to the blog, is to modify the calling convention.
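A portability shim along these lines is common (assumption: clang 13 or later supports `musttail`). Under clang the tail call is then guaranteed; on other compilers the macro expands to nothing and the call is merely an optimization opportunity:

```c
#include <assert.h>

/* Sketch of a musttail portability shim: guaranteed tail call under
 * clang (assumed >= 13); a plain call elsewhere. */

#if defined(__clang__)
#  ifdef __has_attribute
#    if __has_attribute(musttail)
#      define MUSTTAIL __attribute__((musttail))
#    endif
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

static long countdown(long n, long acc)
{ if ( n == 0 )
    return acc;
  MUSTTAIL return countdown(n - 1, acc + n);  /* no stack growth under clang */
}
```

With the plain-gcc fallback the deep call chain can still overflow the stack, which is why the thread concludes a function-based VM cannot rely on tail calls on all platforms yet.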
… And then we have MSVC. As is, it loses big time against both gcc and clang for SWI-Prolog. Maybe we should simply forget about that compiler for now.
It is probably too early to call. On the other hand, if someone likes experimenting, please do so and report your findings.
*edit* I just pushed a patch to use function pointers for clang by default, as it seems to give the same speed using clang-14 on M1 and about 10% better using clang-15, both on x64 and wasm targets.