VM code refactor - thoughts?

This is basically what PGO does, actually! Rather than the programmers having to figure out which conditions are more or less likely to happen, the PGO process runs a benchmark suite and records all the branches that get taken and how often, and then the optimizer can use that information to duplicate instructions, reorder code and/or branches, etc. It’s certainly a good idea! Just, one that’s already getting done :smiley:

I doubt very much this would help. Note that more code typically also means a slowdown, as it means more cache misses. Only in tight loops does it pay off. The biggest opportunity for gains without changing any algorithms is (I think) reordering, notably the memory layout, to get fewer cache misses.

All is merged! Also tested the Win64 build. Builds fine.

What cache are we talking about here?

Hmmm … that means a 32K L1 cache has 512 cache lines (of 64 bytes each).

If the working set of the swipl VM is more than 32K, then we could have cache thrashing. But assuming that’s not the case, we might also have a problem if there’s significant wastage within the cache lines – e.g., if on average there’s a jump every 8 instructions, then almost half of the L1 cache is wasted, so a working set of more than 19K could cause cache thrashing (plus lower performance in general because of the wasted 13K of cache). This could be ameliorated by rearranging the code so that all the high-usage code is adjacent, which would lead to each cache line containing mostly high-use code. [The average x86 instruction is 5 bytes (x86 Machine Code Statistics - strchr.com), so ~13 x86 instructions per cache line.]

(ARM instructions would probably be similar to x86, on average; x86 instructions can become quite large, but presumably they fit within 64 bytes)
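To make the cache-line arithmetic concrete, here is a small sketch (all numbers are assumptions for illustration, in particular the 8-instructions-per-jump figure):

```c
/* Back-of-the-envelope cache-waste arithmetic; the numbers are
   assumptions for illustration, not measurements of the swipl VM. */
#include <stdio.h>

int
main(void)
{ const int cache_size     = 32 * 1024; /* L1 I-cache, bytes */
  const int line_size      = 64;        /* bytes per cache line */
  const int insn_size      = 5;         /* average x86 instruction, bytes */
  const int insns_per_jump = 8;         /* assumed basic-block length */

  int used_per_line = insns_per_jump * insn_size;             /* 40 bytes */
  int useful_cache  = cache_size / line_size * used_per_line; /* ~20K */

  printf("useful bytes per line: %d of %d\n", used_per_line, line_size);
  printf("effective cache size:  %d of %d bytes\n", useful_cache, cache_size);
  return 0;
}
```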

Branch prediction can also be a big factor, as the PGO results show; it’s not clear to me how branch prediction helps/hurts L1 caching.

More stuff on caches here: https://danluu.com/
There’s also some interesting stuff on branch prediction here:
Branch prediction
Other discussions of caching strategies and virtual-memory replacement strategies can be found by doing searches for, e.g., [virtual memory site:garlic.com/~lynn].

Presumably the likely() and unlikely() sprinkled throughout pl-incl.c are an attempt to capture what you think PGO comes up with? Is there a way to get more information from PGO and use that to fine-tune the use of likely/unlikely? (It would be nice if the PPA version of swipl were as fast as building it on my own machine with PGO.)
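For reference, such annotations are usually thin wrappers around GCC’s __builtin_expect; a minimal sketch of the usual definitions (the actual macros in the SWI-Prolog sources may differ in detail):

```c
/* Typical likely/unlikely macros built on GCC's __builtin_expect; shown
   for illustration -- SWI-Prolog's own definitions may differ. */
#ifdef __GNUC__
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Example: hint that the error path is rare, so the compiler keeps the
   hot path as the fall-through and moves the cold code out of the way. */
static int
checked_div(int a, int b, int *result)
{ if ( unlikely(b == 0) )
    return -1;          /* cold path */
  *result = a / b;      /* hot path */
  return 0;
}
```

PGO effectively supplies the same hints (and more) from measured branch frequencies rather than from hand annotation.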

I guess the way to figure this out is using valgrind --tool=cachegrind (if I recall correctly). The core file has a text size of 210KB (on x64). This contains more than the VM, but on the other hand the VM also makes calls to functions outside this object. This is a bit outside my expertise. Some of this is handled well by modern compilers, in particular when using PGO optimization; some probably not. I suspect that structure layout matters: trying to make sure subsequent accesses are in the same cache line. The layout of a structure in memory is, AFAIK, not something the compiler can decide upon.
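To illustrate the structure-layout point with a generic sketch (purely hypothetical, not SWI-Prolog’s actual data structures): fields that the hot path touches together can be grouped so they tend to share a cache line, with rarely-used fields pushed to the end.

```c
/* Hypothetical structure, for illustration only: group the fields that
   every VM instruction touches so they land in the same 64-byte cache
   line, and push cold fields (debugging, error info) to the end. */
typedef struct frame_state
{ /* hot: touched on every instruction */
  void          *pc;           /* program counter */
  void          *sp;           /* top of stack */
  void          *env;          /* current environment */
  unsigned int   flags;

  /* cold: only touched on errors or when debugging */
  char           name[64];
  unsigned long  debug_counter;
} frame_state;
```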

Surely, @dmchurch’s input makes analysis easier. I don’t know what is really worth the investment when it comes to low-level optimization. The target changes over time with changing pipeline logic, changing compiler intelligence and changing cache architectures. Possibly it makes more sense to chase higher-level issues such as clause indexing. Last summer I did some comparison work with SICStus. Their performance on small benchmarks is truly impressive. On large programs the story is not that clear though.

The likely/unlikely annotations are intended to get better results without PGO. The effect is not impressive; I guess there are far too few annotations. Probably the PPA scripts should be changed to build the PGO versions. Note that the other binaries (SNAP, Windows and macOS bundle) do use PGO.

I tend to agree; but the fact that PGO can give 20-30% performance gains makes me think that there is some low-hanging fruit that hasn’t been tried.

It’s not clear whether cachegrind gives the information needed to suggest improvements; it seems to be more of a tool for comparing two different versions of an executable.
There’s also linux-tools (it seems you install it with sudo apt install linux-tools-$(uname -r)), which gives the “perf” tool, which has a long list of hardware measurements – again, this seems to be more for measuring changes than for suggesting what to do.

I noticed that the VMI macro has a call to count() … is there an easy way to use this to get VM instruction frequencies? I’m guessing that ordering the VM instructions by frequency could give some cache benefits; but doing fancier analysis might be more trouble than it’s worth.
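A minimal sketch of the kind of counting I have in mind (names such as VMI_COUNT and MAX_VMI are hypothetical, not the actual macros in the source):

```c
/* Hypothetical per-opcode counters for the VM dispatch loop; the names
   are made up for illustration and do not match the real sources. */
#include <stdio.h>

#define MAX_VMI 256                         /* assumed number of opcodes */
static unsigned long vmi_counts[MAX_VMI];

#define VMI_COUNT(op) (vmi_counts[(op)]++)  /* called from the VMI macro */

/* Dump frequencies at exit (register with atexit()); sorting this output
   would suggest an ordering of the VMI bodies by popularity. */
static void
dump_vmi_counts(void)
{ for(int op = 0; op < MAX_VMI; op++)
  { if ( vmi_counts[op] )
      fprintf(stderr, "%3d %12lu\n", op, vmi_counts[op]);
  }
}
```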

I read some time ago about Intel’s oneAPI [1] – I wonder how this could fit into the performance picture … could this be relevant, or would it require a complete rewrite of the core code?

Edit: or would this, out of necessity, lock the code in to Intel’s architecture, excluding other ones – e.g. ARM …

[1] Intel® oneAPI: A Unified X-Architecture Programming Model

That is really old stuff. I think it is time to remove that, as it probably doesn’t work anyway and the @dmchurch VMI functions allow doing the counting simply using C-level profiling. Probably more important is to look at longer common sequences of (especially small) instructions. Turning these into a single VM instruction tends to help. That added instructions like H_LIST_FF, dealing with [H|T] where H and T are first-seen variables; otherwise you would get H_LIST, H_FVAR, H_FVAR, H_POP. But remember: below a certain threshold of popularity, such sequences speed up the tight loops that use them and slow down big programs, because a larger part of the VM is active and thus there are more cache misses.
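As a generic illustration of why such combined instructions help (this is not SWI-Prolog’s actual dispatch code, just a schematic interpreter):

```c
/* Schematic bytecode interpreter showing a combined ("super")
   instruction; purely illustrative, not the real VM. */
enum op { OP_LIST, OP_FVAR, OP_POP, OP_LIST_FF, OP_HALT };

static void
run(const unsigned char *pc)
{ for(;;)
  { switch(*pc++)
    { case OP_LIST:    /* build a list cell */              break;
      case OP_FVAR:    /* bind a first-seen variable */     break;
      case OP_POP:     /* pop the argument context */       break;
      case OP_LIST_FF: /* does the work of the sequence
                          OP_LIST, OP_FVAR, OP_FVAR, OP_POP:
                          one dispatch instead of four, and the
                          compiler can optimise across the
                          combined body */                   break;
      case OP_HALT:    return;
    }
  }
}
```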

To find sequences you can reuse the low-level predicates from vm_list/1. Especially with the new VM infrastructure it should also be trivial to add some instrumentation that dumps the execution of a program as a long sequence of VM instructions. If you dump them as single-byte opcode values you have a nice sequence for data mining :slight_smile:
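A sketch of what such instrumentation could look like (hypothetical hook, assuming opcodes fit in a single byte):

```c
/* Hypothetical trace hook, called from the dispatch loop with the opcode
   about to be executed; the resulting byte stream can be mined for
   frequent instruction n-grams. */
#include <stdio.h>

static FILE *vmi_trace;

static void
trace_vmi(unsigned char opcode)
{ if ( !vmi_trace )
    vmi_trace = fopen("vmi-trace.bin", "wb");
  if ( vmi_trace )
    fputc(opcode, vmi_trace);
}
```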

I keep wondering what it takes to generate compiled code out of such VMIs (and supervision code) – JIT style …

Dan

As have I! It’s one of the directions I’ll be exploring in, don’t worry :slight_smile:


This would be wonderful – please do share your thoughts and insights – I am super curious about this.

The performance improvements by adding “combined instructions” suggest two hypotheses (they’re not mutually exclusive):

  • the VM interpreter overhead is sometimes significant
  • the I-cache is not being fully utilized

It’s not clear that generating compiled code (e.g. a JIT) would give much more benefit compared to adding “combined instructions” – the resulting “bloat” of compiled code could hurt the I-cache. Adding combined instructions might get the best of both worlds: remove interpreter overhead where it matters but maintain code compactness (and also avoid the complexities of a JIT).

The results of the Python “Unladen Swallow” project show that compiling out the interpreter loop isn’t always worth the effort, especially when the VM instructions are moderately complex. You might want to look at the disappointing benchmarks and discussion of how difficult it is to build a JIT using LLVM. (There are also some specific aspects of Python that might or might not be relevant – if you want me to pontificate on them, you merely have to ask.)


The “Unladen Swallow” project was aware of PyPy – and PyPy has had a lot of work put into it and at best gives 5x performance improvement and sometimes a slowdown. https://speed.pypy.org/
From a quick look at the benchmarks, performance improvements seem best for array and matrix manipulation – for which the numpy package might be a better choice anyway.

Performance improvement is complicated, and what looks like a good approach doesn’t always pay off. I would rather work on improving features, some of which might also pay off with performance improvements. For example, it might be worthwhile to develop O(1) data structures for Prolog (e.g., hash tables, arrays).


Indeed – and I always wonder what it would take to have (safe) pointers as part of the dynamic fact base as well …

But I also think that an empirical evaluation of a compiled approach such as a JIT could be very worthwhile as well.

I think this might have been due to not having set the optimized flag for the Rust compiler – could you check that …

I just injured myself moving furniture, so don’t expect any brilliant insights from me through the pain meds, but what I can tell you is that my intuition says there are some serious gains to be found, looking through the VM code. Whether those come from JIT compilation or not is something I won’t know until I’ve done some of the prerequisite exploratory work.

Consider, though: the gains are less likely to come from eliminating the interpreter loop itself, and more from the fact that the compiler can eliminate some code paths entirely when it’s optimizing across a bunch of instructions.


Very sorry to hear – if you have access to a capable osteopath, that will work wonders!