VM code refactor - thoughts?

A few more Python-optimization links from Python Software Foundation News: The 2021 Python Language Summit: Making CPython Faster – I have no idea how applicable any of these ideas might be.
PEP 659 -- Specializing Adaptive Interpreter | Python.org
GitHub - faster-cpython/ideas
GitHub - faster-cpython/tools
http://github.com/markshannon/faster-cpython

I still don’t understand why those FASM libraries are (apparently) so much faster than functionally equivalent C implementations …

The libraries themselves aren’t complex code, and complexity analysis applies equally to both implementations – so why does assembly make such a difference? Where is performance lost with C compilation?


Good question. I will take a different approach to answering it, since I observed:

implementation                time (msec)
latest SWI-Prolog                     192
GNU Prolog                            172
gplc (compiled GNU Prolog)             48

for the pure Prolog benchmark of 10 queens (724 solutions, no arithmetic, just list processing)

place_queen(I,[I|_],[I|_],[I|_]).
place_queen(I,[_|Cs],[_|Us],[_|Ds]):-place_queen(I,Cs,Us,Ds).

place_queens([],_,_,_).
place_queens([I|Is],Cs,Us,[_|Ds]):-
  place_queens(Is,Cs,[_|Us],Ds),
  place_queen(I,Cs,Us,Ds).

gen_places([],[]).
gen_places([_|Qs],[_|Ps]):-gen_places(Qs,Ps).

qs(Qs,Ps):-gen_places(Qs,Ps),place_queens(Qs,Ps,_,_).

% goal(Ps):-qs([0,1,2,3],Ps).
goal(Ps):-qs([0,1,2,3,4,5,6,7,8,9],Ps).

@CapelliC do you think the ratio will hold for much larger inputs as well?

So, how do you interpret these results …

Where in list processing does performance get lost?

One thought of mine was time to access memory (pointer chasing), but this likely happens equally often in both systems – so it is then pure processing speed of term/list referencing, manipulation, and stack handling (aka unification).

That’s a great observation …

I would imagine that in the overall structure of Prolog programs, built-ins can appear in two places: a) the base case and b) in much-used data structures.

Regarding base cases, these would be called less often, so they shouldn’t contribute much to the speedup.

But data structures are often accessed, and hence making them built in could be significant: e.g., instead of the Prolog assoc library, having an assembly implementation of assoc.

Another consequence of your observation might be that the execution of VM instructions will likely have less impact – or its impact will only be significantly felt once often-used data structures are optimized.

Edit:

Btw, in @CapelliC’s example, there aren’t really any built-ins – as far as I can tell …

Comparing GNU Prolog and SWI-Prolog is not very instructive, even ruling out arithmetic, because gprolog doesn’t have multithreading, which – I think – will slow down the SWI-Prolog engine in single-thread computations. What is interesting is the improvement in the compiled gprolog implementation. More than 300% – it’s a lot… now I’m reading the document On the implementation of GNU Prolog before delving into details…

Perhaps, on the contrary, having both work on a problem single-threadedly is good for comparing the machinery … and deriving conclusions.

It’s an artificial setup, but it can perhaps shed light on where to make optimizations that are then felt by each processing thread as well.

Dan

Threading doesn’t slow down much. Normal (pure) Prolog execution is not affected at all, and neither are most built-ins. A couple are, but given that most low-level operations are lock-free, the impact is low unless threads really hammer some shared resource. There is some impact for dynamic code from which some clauses have been retracted, because the logic for deciding whether a clause is visible or not is more complicated, and we need an additional indirection in the clause representation to safely add and remove clauses concurrently.

There are a couple of reasons that make SWI-Prolog slower though. To name a few (I don’t know the impact of each)

  • Unbounded arithmetic. E.g. A is 1<<1000 says 1099511627776 in GNU-Prolog. SWI-Prolog has to do overflow checking and move to GMP on overflow.

  • Lack of GC and stack expansion in GNU-Prolog. Try the program below. GC and stack expansion are not free as we need to be aware of all reachable Prolog data at any time and we must be aware that data can move at any time. This implies we need indirect pointers to data or, when we use direct pointers we must be careful to recover if we call a function that may shift the stacks or run GC.

    t :- t(_).
    t([_|T]) :- t(T).
    
  • SWI-Prolog has rather aggressive GC. That is intended to keep stack usage low and accommodate many threads. That surely implies a performance loss compared to just allocating a fixed stack size and giving up if that overflows. SWI-Prolog’s initial stacks are small, and at any overflow it must decide on GC and/or enlarging the stacks. Under some workloads the result can be pretty bad.

  • Lack of atom garbage collection in GNU-Prolog. This too requires visibility of data structures, plus reference counting for atoms that are (only) accessible from foreign code.

  • Lack of support for cyclic terms in GNU-Prolog. Try ?- X = f(X), Y = f(Y), X = Y. Cyclic-safe unification and comparison are slower, except when there is significant structure sharing in the terms being unified or compared.

  • Almost all Prolog systems distinguish list cells from normal compounds. SWI-Prolog does not, though it does have some special instructions. This surely makes SWI-Prolog lose on space requirements for lists. I’m not too sure about time, as it also means we avoid code duplication.

It is. Native code generation together with GC/shift is a lot harder though. Also, these evaluations are almost invariably on small programs. On big programs that have no clear (small) hotspots, things might well be different. It always reminds me of SWI-Prolog’s optimized arithmetic: programs that are mostly doing arithmetic tend to get 2 or 3 times faster. Larger programs that do some arithmetic occasionally tend to get slower by a couple of percent, even if profiling shows they spend a clearly measurable fraction of their time on arithmetic.

I’ve observed the big gains with optimized arithmetic but was surprised that some programs get slower by a few percent. Any ideas as to why (and when) that’s the case, e.g., complex expressions? Or is it caused by second-order effects like caching efficiency or small errors in the performance measurement techniques?

I think caching is the only sensible explanation.

A little late to jump in…

I assume this is the time difference between translating a Prolog program to C and then compiling it with a conventional C compiler, vs. not doing so.

For SICStus, I assumed that the speedup came from experimenting with a PL2AM “emulator” (a PL2VMI) … and experimenting, and experimenting, and experimenting, until the VM, instead of having only XX instructions, now has XXX instructions (lots of specializations that get rid of unneeded work but add lots of extra cruft).

One pathway I went down towards finding the “right size” of VMI (not that I am interested in doing it anymore) was to start out with a PL2AM (example https://github.com/TeamSPoon/SxxMachine/blob/cb7ee3dab4035bdc0d2e383a4cb962129b9d5922/prolog/sxx_pl2cpp.pl ), translate several benchmarks / typical programs, and find the least dumb outputs. No matter how dumb (or smart) they were relative to each other, this gave me tons of optimizations I would not otherwise have considered had it hidden the intermediate output from me.

Whenever I, as a human, could have written a faster impl than my translator could write, I pretended it was a bug. Examples would be program translations involving between/3 or repeat/0. Not that I needed more or better built-ins, but I needed to catch the many idioms that Prolog programs are going to try to silently sneak past the translator, like:

foo:-foo.
foo.

bar :- foo, baz, fail.


From page 4 of the document referred to above, it seems that GNU Prolog was born when they ditched that route and introduced instead the Mini-Assembly intermediate language…

There’s also Richard O’Keefe’s tokeniser (section 10.7 of The Craft of Prolog). It’s probably online somewhere, but I didn’t see it at http://www.cs.otago.ac.nz/staffpriv/ok/software.htm

Yeah… I mashed together PrologCafe, jProlog, and KernelProlog (predecessor to JinniProlog). Although KernelProlog and jProlog had much more advanced system objects (like attributed variables, multivars, memoization fluents, closures, etc.), with some care the PrologCafe compiler had no problem targeting all three systems. Even though most of the code you will see is emitting Java, it was re-translated as a next step into C++ using the source code converters. Though I tried hard to make sure the Java was still testable.

The repo link I gave above was wrong; I meant to give https://github.com/DouglasRMiles/SxxMachine

Edit

Still, you might be mistaking Eärendil’s “JProlog” for Bart’s “jProlog”.

jProlog does have a compiler, but it was rudimentary… originally on the Wayback Machine (the rest is found here https://web.archive.org/web/*/http://people.cs.kuleuven.be/~bart.demoen/PrologInJava/* )

Examples of different compilers (boven.pl / comp*.pl) and their output in the .j files are in SxxMachine/jsrc-kuleuven/prolog at master · DouglasRMiles/SxxMachine · GitHub


It does,
see:



The 3 systems I used each had their own methodology (all three very good at things the other 2 were 10x slower at). The resulting combined set became a PredikatenPrologMachine that ended up having typical WAM and non-WAM machine operations and Tarau’s kernel machine unfolders… Who knows, maybe I overlapped with some ZIP instructions. In as many cases as I had time for, I ensured they did not simply convert into inline javafications of WAM instructions (the PrologCafe method) but instead into a set of static method calls around the ever-evolving 3±system Prolog machine.

My point was answering why some systems are faster than others (my two cents costs only that). I believe it has less to do with whether they are compiled and more to do with the different styles of abstract “PrologMachine” instructions available. If ZIP instructions have unneeded slowdowns, they will NOT be fixed by starting to compile/emit them into LLVM, nor by some settings change that could have been made all along in the existing workflow. ZIP will be sped up by adding/changing instructions, detecting when new instructions will be more optimal, and updating how existing Prolog code can benefit from them.
