Optimzing context argument in the VM

Hello,

I am rereading the Crafts book again. Section 1.4 discusses the context argument that is used in lieu of a local variable in imperative languages.

The example given is below, where Multiplier is a “local” variable that is passed across all recursive calls unchanged.

scale([], _, []).
scale([X|Xs], Multiplier, [Y|Ys] :-
  Y is X * Multiplier,
   scale(Xs, Multiplier, Ys).

O’Keefe then provides a template for preprocesing that factors out the Multiplier argument, which, however, would be reintroduced again by the preprocessor.

I am wondering about the cost of Multiplier argument passing vs. a local variable access in imperative code. The cost of passing an argument + accessing the arguments value seems much higher in Prolog.

I am however, wondering, if the VM could (or is already) catching such context argument passing and ensures that its binding accessible without passing it across each call?

Edit:
A first idea to reduce the cost was to use a global variable, however, the cost would then still be a hash lookup … i think a VM catching this pattern could do better.

Does this make sense?

thanks,

Dan

My observation is that low-level details like this seldom have much over-all effect on performance.

In the case of the WAM VM, there would be no overhead by passing the Multiplier argument because it stays in the same register; at worst, there would be a “move register-register” if it were in a different argument position. I don’t know SWI-Prolog’s VM very well, but my guess is that it similarly have a low cost.

When I profile code I sometimes wonder about such small overheads, yet distributed across the program – since such constructions are common.

Each adds a bit cycle time – quite like background noise – and, they don’t show up as a sluggish peak (where code spends most of its time), but as a constant hum in the long tail – quite like a car being held back by baggage – causing the car to slow down

Perhaps, its just my run-away imagination :slight_smile:

Dan

Nothing beats a good old car analogy.

1 Like

Using the devel versions and ?- vm_list(scale)., we see

      16 l_nolco('L1')
      18 l_var(0,4)
      21 l_var(2,6)
      24 i_tcall

So, the last call copies to the first and last arguments, but leaves the 2nd arg alone. This only applies for optimized last calls.

The stable version doesn’t optimize this.

Hi Jan,

Thank you for shedding light on this.

Is there, btw, a description of the VM language available somewhere …

thanks,

Dan

I just happen to listen to Markus video on memory usage in prolog [1].

It seems that every call – without last call optimization – would create a new environment frame and push the variables on it – quite like function calls in conventional languages.

So, if i understood this correctly, without optimization, would the context argument be repeatedly pushed onto the stack for each recursive call.

During last call optimization – the same stack area is reused and the context argument can just stay where it is, which is what is happening.

Dan

[1] https://youtu.be/xABsTV8kuGE

Yes.
If you turn off LCO, the stack space can grow with the number of iterations (recursion calls) but with LCO the stack space can be bounded – without LCO is also slower because of all the environment copying.

You’re not wrong to think about those … I’ve sped up a C program by ~5x, mostly by tweaking small overheads that were distributed. Conversely, some of those overheads can be mitigated by using macros or goal expansion (e.g., library(apply_macros)). You’ll also notice that the implementation of maplist/2 et al uses an auxiliary predicate maplist_/2 to get 1st-term indexing (if you try the naïve version of maplist, without the auxiliary predicate, you’ll see that it leaves choicepoints and jiti_list/0 shows that it doesn’t get indexed at all).

Nevertheless, most of the time, it’s better to think about improving the algorithms or data representations rather than worrying about low-level improvements. And remember the three rules of optimization (I don’t know whom to attribute):

  1. Don’t
  2. Don’t
  3. (For experts only) Not yet

BTW, at the hardware level, branch prediction is an example of a “small overhead” that’s distributed and can cause major performance issues. Here’s a historical survey including some performance numbers (and which explained why Quintus Prolog on the early PowerPC performed much better than I had expected from reading IBM’s “Redbook”):
https://danluu.com/branch-prediction/

Thank you Peter,

Indeed, i am very aware of the rules of optimization, and am therefore spending most time to complete functionality, yet as i continue to add functionality I see a noticeable slow down.

One key uncertainty that comes with it is whether I will be able to increase performance significantly in prolog or whether I will have to rewrite some core in C/C++ (or Rust).

Another uncertainty is whether Prolog is already optimized enough so that a move to, say, C, will not make a significant difference – an outcome like this would need deep rethinking what to do – including considering hardware acceleration already at this earlier stages.

Finally, i may have a deadline in 3 months where I will need to have resolved these questions in principle and possible in practice, to some extend, as well.

I am therefore trying to analyze in parallel, at least in principle, and with a bit of tinkering, where improvements can be had.

Dan

The obvious question is: have you run profile/1 to see where the current bottlenecks are? And if your code uses things like maplist/3, have you tried concurrent_maplist/3?

The Aquarius compiler showed that in theory Prolog could run competitively with C++ … but people haven’t worked much on compilers for Prolog because (IMHO) Prolog tends to be used for problems that are clumsy to write in C++ and that won’t necessarily speed up a lot when rewritten in C++ - and they’d be a nightmare to maintain in C++.

(The experience of PyPy vs Python is instructive – part of the problem is that the semantics of Python don’t lend themselves well to compilation (it’s a very dynamic language); but another part is that the interpreter is already high enough level that removing the interpreter loop doesn’t gain a lot. I expect something similar for a lot of typical Prolog programs that depend heavily on pattern matching or backtracking.)

1 Like

I am running profile from time to time – and it leads me to two things – one is the property graph traversal, including property lookup, which is currently an assoc, is high up there in usage, and the second is a very long tail.

And, with the more recent introduction of transaction/1, also this overhead started to show up.

I hope that with a better understanding of choice points are and how to control them via code / argument structuring, indexing and cuts, the more I will be able to improve things.

Here are strategies i am thinking about:

re: cuts:
One move of a cut from client code, to the called predicate, which happened to be heavily used, had a dramatic effect the other month, hopefully, other such opportunities are lurking.

re: property graph representation
I need to scrutinize the graph representation – currently, its a straight forward one node/1, link/3 (each link having an identifier as well) – this is possibly not the most performant one, since it causes choice points for accessed outgoing (and incoming) link, when traversing the graph.

An adjacency list based graph representation, might be a better choice, since it would access all adjacencies at once, binding them into a variable, without need for choice points – and reduced recurring access of the fact base – its one thing i am considering. Something like: node/1, node_adjacency/2. E.g. (which also indicates an identifier for the link)

node_adjacent(a, items(l1-a, l2-b, l3-c)).

(note: if its not a list, but a functor with arguments, as above, – i would need some predicates to access the arguments like a list – the assumption here is that fact access to an array would be faster than to a cons list).

A key drawback is that i would need to maintain two such adjacencies, one for outgoing and one for incoming links – since its a directional graph.

Perhaps, tabling could offer a happy medium – by using the normalized representation (node/1, link/3) and a tabled rule that turns it into the node_adjacency/2 representation – it would still require more memory, and a bit more compute, but at least, only those graph areas that are traversed will take up more of the memory.

It would be important to update tabled adacencies in case new ones are asserted – best, if this can be done in a pin-pointed way, rather than invalidating the whole table.

Alternatively, i could roll my own memoing for this purpose, and ensure that the memoing of adjacent’s is updated in case nodes and links are added.

re: graph properties
Perhaps assoc to store properties over nodes and links isn’t the best choice, and there is a better way to associate nodes and links with properties. Jan, suggested an implementation based on arg (array based, essentially, with node and link id’s as indexes – so, instead of an assoc hash its an indexed lookup).

Some properties on the graph could eventually be turned into bitmaps (essentially hard coding them), and tests of bitmaps in some code areas, perhaps there is some benefit there as well – this could be one (research) outcome of my work, to identify such properties and elevate them into the “language” as it were.

re: parallelism
Also, there is some potential for parallelism that i noticed and haven’t yet fully exploited. I’d also like to try the spawn library for this.

re: first argument indexing
And, i will scrutinize maplist, which is used quite a bit to filter results of queries, first for potentially missed first argument indexing and potential for parallelism.

re: term expansion to remove call indirection and “simulate” constants
Also, i have been experimenting a bit with term expansions, to remove call indirection mainly due to the modular architecture (g1 :- module1:g2 and g2:- module2:g3).

Also to simulate constants, instead of retrieving constants by predicate (e.g. constant_1(A) translated into A=1 in the code).

re: guards

The current library and call structure, caused some guards to be executed more than once – i will need to clean this up – essentially, having guarded calls – intended for the end-user, and unguarded calls, intended for internal (re)use.

However, i also want to have strict rules for calls, and failure should most often lead to an exception – however, i will likely relegate this to debugging time via conditional compile - to ensure performance.

re: tabling
Finally, I need to investigate tabling – and reuse of computed results – when recompute is fresh and when it might become stale (disregarding at this stage memory footprint).

Would be glad to hear about additional optimization tactics …

Dan

Its possible that my problem is better suited for C/C++ (or Rust) because it is at the system level - and not the typical (relational) database application.

Hence, quite possibly, my struggle – system level development with Prolog lacks many of the things (memory level control) C/C++ and now Rust offers that Prolog inherently lacks.

Yet, i believe that Prolog was the right choice given the research uncertainty of my work – there is, in my mind, no better language to explore relevant research questions declaratively – yet, i am now paying the price for that :slight_smile:

I knew all along that a rewrite in C/C++ (or Rust) might be in the cards – but, i am now hoping that I can push Prolog all the way.

And, perhaps one outcome of my work is to show that Prolog (with some possible additions) can even be used at system level for some purposes.

Dan

p.s. I explored the last couple of days how a graph could be implemented in Rust. It turns out that the memory ownership model of Rust makes it effectively impossible to represent graph via linked nodes, and a list based approach must be taken.

If you add the need for memory (re) allocation and transactions to the mix, then the overhead adds up in Rust as well. Perhaps closing the gap to Prolog from that direction.

There are also benchmarks that seem to show that list based approaches perform better than linked-lists based approaches:

https://www.diva-portal.org/smash/get/diva2:1463648/FULLTEXT01.pdf

Here is a Picat example, a translated foreach loop:

p(_,[]).
p(V,[E|T]) :- writeln(V-E), p(V,T).

Was checking the generated code between SWI-Prolog 8.3.21 and SWI-Prolog 8.3.22, the code is the same. But I see another problem in compling aka preprocessing Picat, which wasn’t adressed so far.

There is a problem with indexing and the argument order, I still see an unnecessary choice point, both in SWI-Prolog 8.3.21 and SWI-Prolog 8.3.22:

?- p(a,[1,2,3]).
a-1
a-2
a-3
true ;
false.

Edit 09.04.2021:
The problem goes away by using (=>)/2, since (=>)/2 does
perform a cut. Try this Prolog code instead:

p(_,[]) => true.
p(V,[E|T]) => writeln(V-E), p(V,T).

Now I don’t see a choice point anymore at the end of execution:

?- p(a,[1,2,3]).
a-1
a-2
a-3
true.

But the Prolog system would be more efficient, if it wouldn’t generate
a choice point in the first place, one that is cut away by (=>)/2.

Yes, i had this discussion with Jan – apparently, the overhead is negligible – although, in my case, working at the system level – I think that even such small costs add up quickly.