When are evaluable functions looked up?

Because expression evaluation runs outside the usual
stack limits, can I use it to bomb a Pengine or SWISH?

That’s strange, because in many other scenarios you have a
configurable stack limit. Why the exception to this rule?

bomb(f(X,Y)) :- bomb(X), bomb(Y).

?- bomb(X).
ERROR: Stack limit (1.0Gb) exceeded

Since it says stack limit, why doesn’t the same stack limit apply to
expression evaluation as well? If you had a resource quota in (is)/2,
you could get rid of the cycle test.

Here is a sketch of a post-mortem analysis, written in 100% pure
Prolog, which provides an error analysis service:

X is Y :-
    catch(eval(Y, X), error(foo, _),
          (   acyclic_term(Y)
          ->  throw(error(bar, _))
          ;   throw(error(baz, _))
          )).

The single error “foo” produced by the internal eval/2 predicate is
turned into either an error “bar” or an error “baz”, to give the
end-user some more information. This could also be implemented
natively by the built-in is/2, making a cycle test during eval/2
unnecessary, since acyclic_term/1 exists. Or is there a danger that
acyclic_term/1 crashes? It probably uses less stack than eval/2.
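
To make the sketch concrete, here is a hedged, runnable variant:
safe_is/2 and the context terms are made-up names, and the built-in
is/2 stands in for eval/2:

safe_is(X, Y) :-
    catch(X is Y, error(Formal, _),
          (   acyclic_term(Y)
          ->  throw(error(Formal, context(safe_is/2, acyclic)))
          ;   throw(error(Formal, context(safe_is/2, cyclic)))
          )).

For example, ?- safe_is(X, foo+1). rethrows the evaluation error with
the acyclic context, while a rational tree built by X = f(X) (on which
acyclic_term/1 fails) would get the cyclic one.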

C routines that process Prolog data structures do not use C recursion, but instead malloc chunks of memory to stack the required context. There are no limits to this stack (except for malloc failing), so this only works nicely if we ensure these routines trap cycles; the limit of what needs to be processed is then given by the bounded size of the Prolog term, which has to fit on the Prolog stack.

Most routines (depending on how they need to handle sharing and cycles) maintain a bit to check what has been visited. Arithmetic evaluation does not, so we need to check for a cycle. The alternative would be to limit the C context stack proportionally to the Prolog stack. Still, I see no reason to modify anything here. Evaluating huge, very deeply nested expressions should be extremely rare.
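
As an illustration of the idea (a sketch in Prolog rather than the
actual C code; term_nodes/2 is a made-up name), such a walk keeps its
own agenda instead of recursing, and indeed loops forever on a cyclic
term unless cycles are trapped:

% The agenda list plays the role of the malloc()ed context stack.
term_nodes(Term, Nodes) :-
    term_nodes_([Term], Nodes).

term_nodes_([], []).
term_nodes_([T|Agenda0], [T|Nodes]) :-
    (   compound(T)
    ->  T =.. [_|Args],
        append(Args, Agenda0, Agenda)
    ;   Agenda = Agenda0
    ),
    term_nodes_(Agenda, Nodes).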

Also, scenarios such as SWISH make it attractive to trap errors more eagerly than waiting for resource exceptions :slight_smile:

You can blow up a traversal. It’s not proportional to the cell count
used by a term. For example, this code here, not based on cycles,
creates quite a large traversal, but uses very few cells.

bomb(0, 1) :- !.
bomb(N, (X+X)) :- M is N-1, bomb(M, X).

This makes the printed width of the term 2^N, with only N nodes. You
see this when you run a small example, like the one below. The same
happens in my Prolog systems: bomb/2 terminates quickly, but does (is)/2?

?- bomb(40,X), _ is X.
%%% hangs, even Ctrl-C doesn't work anymore
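
An even smaller N makes the sharing visible directly (my addition; the
printed shape below is from SWI-Prolog, other systems may render it
differently). The printed term has width 2^N, yet internally there are
only N (+)/2 cells, shared via (X+X):

?- bomb(3, X).
X = 1+1+(1+1)+(1+1+(1+1)).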

But I am not sure: can a similar trick be used to create a very
deep term as well, not only a very wide term?

Good catch. Depth is not a problem and it will eventually (provided your computer lives long enough) terminate. Evaluating an expression runs fully in C and SWI-Prolog’s interrupt handling is synchronous. On POSIX systems you can get out by pressing ^C twice. The second is handled asynchronously and provides only a few safe options.

Added a check for synchronous interrupt processing after every 1,024 nodes. That is a bit tricky, as evaluation assumes the stacks are not affected. So, if we processed a signal that ran GC or a stack shift but did not ask for an abort, we have little choice but to abandon the evaluation and restart it from the beginning :frowning:

I have a unit test that checks for recovery from stack overflow and other out-of-memory situations – test_cpp.pl test square_root_2, also tests malloc, new_chars_2. Some of the tests rely on C++ std::bad_alloc exception, but that just wraps standard C runtime stuff, I think.

So, if you don’t mind raising some kind of resource_error instead of detecting a cycle, I think you can safely remove the cycle check in evaluating expressions.

Relying on malloc() failures is not a good idea. If you press the system that far, Linux will at some point heuristically start killing processes to keep the system alive (look for “OOM killer”). The (then) huge Prolog process is a likely candidate and your exception handling is not going to help here. And, of course, you do not want other tasks on the system to become victims of this.

Notably to accommodate SWISH and Pengines, SWI-Prolog has to act in a sane way on any (safe) query. Sane meaning it should not crash, not eat all system resources and listen to timeouts within some reasonable delay. The arithmetic bomb example by @j4n_bur53 must be considered a bug (and is now fixed, but not yet on the public SWISH instance). Quite likely there are a few more of these :frowning: Please try to find them on your local instance and not on SWISH.

It’s not a good example, since it doesn’t use extra heap memory nor
extra execution stack. It only needs CPU time. Maybe you see some low
jitter in heap memory and execution stack, since it produces eval
sub-results, but it doesn’t explode on the memory or stack side:

?- bomb(40,X), _ is X.
%%% hangs, even Ctrl-C doesn't work anymore

It’s only busy with itself burning CPU cycles, because we gave
it a mesh structure, which has many, many paths. What is a mesh
structure? It’s a DAG that isn’t a tree:

                A
              ^   ^
             /     \
            B       C
             ^     ^
              \   /
                D

As soon as you have such little diamonds in your Prolog term,
which I created via (X+X), you will visit A twice, leading to an
execution time explosion. Your solution of polling for interrupts only
makes it slower for the sunshine case, when a non-pathological term is
passed to eval.
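
The path explosion can be counted directly (a hedged sketch: visits/2
is a throwaway predicate that counts how often a naive walk touches a
node; sum_list/2 as in SWI-Prolog):

visits(T, 1) :-
    \+ compound(T), !.
visits(T, N) :-
    T =.. [_|Args],
    maplist(visits, Args, Counts),
    sum_list(Counts, N0),
    N is N0 + 1.

With the bomb/2 from above, ?- bomb(20, X), visits(X, N). gives
N = 2097151, i.e. 2^21-1 visits for a term with only 20 (+)/2 cells.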

Edit 13.09.2023
That Prolog terms can form a mesh exists thanks to the invention of
so-called structure-sharing Prolog systems. For example, the above
might not happen in Prolog systems that are not based on structure
sharing. Many toy Prolog systems you find on the internet, which
juggle with some explicit substitution (helpful to draw execution
diagrams, as in Tau Prolog), are often not “structure sharing”.

By nothing measurable. A counter, which we have anyway to trigger the cycle check, and an if ( (counter % 1024) == 0 && is_signalled() ), where is_signalled() simply checks that the (Prolog) signal mask is non-zero. The gain is that we can respond in a timely fashion to timeout signals, and that is vital to prevent DoS attacks on, e.g., SWISH.

If you can keep such a counter, you could also keep
another counter, for recursion depth or something, and
throw the cycle test out of the window; it’s not needed.

Make it faster. And ultimately throw out one of the
stacks, use the native stack instead, and make it bleeding
fast. It’s as easy as that! What’s the problem?

Currently your solution has two stacks:

segstack term_stack;
segstack arg_stack;

https://github.com/SWI-Prolog/swipl-devel/blob/06aed1f8902aaba51ae6818e18999bd99da2ec46/src/pl-arith.c

In principle, since you have only number results, maybe you can throw
out both stacks? I don’t know; what happens to bigint results? If you
reach the point of GNU Prolog, you don’t even need
set_prolog_flag(optimise, true) anymore; I guess a fast evaluator is
competitive with generated code, because your generated code is still
some interpreted “byte-code”, so there is not much gain. It’s not
metal code that is generated.

Edit 13.09.2023
I don’t have the time or the specific SWI-Prolog experience to push an
implementation. Of course, one can only get wiser by actually trying
it. But there might be a lack of SWI-Prolog contributors that can do
it. My bet: a fast evaluator is faster than generated code, if you
find a way to solve this problem as well, namely “When are evaluable
functions looked up?”.

This is one of the tricky parts as well in my opinion.

Well, it used to work the way GNU-Prolog does it. It was replaced by the current implementation to be able to process arbitrarily large terms without using a lot of native stack and to have reliable and portable overflow handling. Guess what: it didn’t get slower. Actually it got significantly faster, but that was because the change also removed user-defined Prolog arithmetic functions as callbacks from the evaluation. After that change we no longer have to fear GC during function evaluation, so we can use “raw pointers” rather than term handles.

During these changes almost all functions using native stacks were replaced by C loops using an explicit stack, and this had no significant impact on performance. I don’t recall whether it got a bit better or worse, but it surely was barely measurable.

Avoiding a table lookup for getting the function looks interesting, but according to valgrind it only takes 1.5% of the time. Thanks to the ultrafast lockfree tables by Keri Harris :slight_smile:

… And @ridgeworks has contributed a lot to arithmetic. Unfortunately the better functionality made it a little slower, as it required some tests where we don’t want them …

Nice story, but somehow irrelevant. I am asking something totally different!

I don’t know; from vm_list/1 it’s not clear “when an evaluable function
is looked up”, the title of this thread. And why there is a difference
between compiled arithmetic and your (is)/2 function, which uses its
own stacks and which you say is fast.

For the math example vm_list/1 shows me the following op-codes:

Fastest?
a_add: No atom involved in the op-code; evaluation done directly?
Medium?
a_func1: Evaluable function handle fetched from the given atom and
evaluation done through this function handle?
a_func2: I guess similar to a_func1?
Slowest?
b_functor: The given atom is used to create a compound, evaluation
done later? But essentially the same function-handle lookup from the
atom then?
b_rfunctor: What’s the difference from b_functor?

Maybe compiled is only faster when op-codes like a_add are involved,
and not so fast for a_func1 respectively a_func2? Or is it always
faster since it doesn’t use two stacks, but only one stack? What kind
of stack does compiled use; can’t you use the same one for your (is)/2
function? That means you wouldn’t allocate an extra stack, but just
dive into the existing stack used by a compiled arithmetic expression.
You nowhere reported any experimentation with such an approach. The
resource quota is then very naturally the stack limit of the existing
stack used by a compiled arithmetic expression, now also used for your
(is)/2 function. I hinted at that possibility already.

Here is what the vm_list/1 shows me:

Not compiled:

   9 b_void
  10 b_functor(exp/1)
  12 b_rfunctor((+)/2)
  14 b_smallint(1)
  16 b_rfunctor((/)/2)
  18 b_argvar(0)
  20 b_smallint(1000000)
  22 b_pop
  23 i_call(system:(is)/2)

Compiled:

   9 b_void
  10 a_enter
  11 a_integer(1000000)
  13 a_var0
  14 a_func2((/)/2)
  16 a_integer(1)
  18 a_add
  19 a_func1(exp/1)
  21 a_is
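
For reference (my sketch; the compiled source was presumably something
like Y is exp(1 + X/1000000), and op-codes vary between versions), such
listings can be reproduced with a throwaway predicate:

?- set_prolog_flag(optimise, false),
   assertz((demo(X, Y) :- Y is exp(1 + X/1000000))),
   vm_list(demo/2).

Asserting the same clause after set_prolog_flag(optimise, true) should
yield the a_ variant.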

<begin excursion>

BTW: Your a_ code is just the postfix traversal, whereas the
b_ code is just the prefix traversal.

?- postfix_visitor(a_foo(a_bar(1,2),3)).
1 2 a_bar 3 a_foo 
true.

?- prefix_visitor(b_foo(b_bar(1,2),3)).
b_foo b_bar 1 2 3 
true.
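
For completeness, a minimal sketch of these visitors (my definitions;
anything printing the nodes in this order will do):

postfix_visitor(T) :-
    (   compound(T)
    ->  T =.. [F|Args],
        maplist(postfix_visitor, Args),
        format('~w ', [F])
    ;   format('~w ', [T])
    ).

prefix_visitor(T) :-
    (   compound(T)
    ->  T =.. [F|Args],
        format('~w ', [F]),
        maplist(prefix_visitor, Args)
    ;   format('~w ', [T])
    ).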

Postfix traversal leads to Reverse Polish notation (RPN), which
can be easily processed by a stack machine, basically a little FORTH
interpreter. So I guess you have a stack for your compiled code.

This is different, for example, from Scryer Prolog, which uses
three-address code, storing the evaluation results in registers
that are named by arguments of op-codes of the form:

t1 := func(t2, t3)

So there would still be alternative ways to compile evaluable
expressions, since your stack is also a register file. In your ZIP
implementation the stack is also the local variables array, right?
Some variants of three-address code, like SSA, sometimes find their
way only partially into native-code compiler back-ends; other
back-ends rely more heavily on such a form, like LLVM.
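
A hedged sketch of such a translation (my code, not Scryer Prolog’s):
flatten a binary expression into instr(Target, Op, A, B) triples, with
registers t(0), t(1), … allocated left to right:

% expr_tac(+Expr, -Result, -Instrs): three-address code for Expr.
expr_tac(Expr, Result, Instrs) :-
    expr_tac_(Expr, Result, 0, _, Instrs, []).

expr_tac_(X, X, N, N, Code, Code) :-
    \+ compound(X), !.                 % constant or variable
expr_tac_(Expr, t(N2), N0, N, Code0, Code) :-
    Expr =.. [Op, A, B],
    expr_tac_(A, RA, N0, N1, Code0, Code1),
    expr_tac_(B, RB, N1, N2, Code1, Code2),
    N is N2 + 1,
    Code2 = [instr(t(N2), Op, RA, RB)|Code].

?- expr_tac((X+1)*(Y-2), R, Code).
R = t(2),
Code = [instr(t(0), +, X, 1), instr(t(1), -, Y, 2), instr(t(2), *, t(0), t(1))].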

<end excursion>

<begin terminology>

I don’t understand your remark: “After that change we no longer
have to fear GC during function evaluation, so we can use
“raw pointers” rather than term handles.”

I used the word “handle” for the function pointer. So I guess I used
“handle” for something else, not term handles. I used “handle” for
what is behind (+)/2, sin/1, etc. on the native side, when you
ultimately process the numeric arguments. I was referring to function
handles, not term handles. But maybe this is bad terminology, I don’t
know. From your compiled-code example, I guess the stack used by
compiled evaluable expressions is not subject to GC issues? Or are the
a_ instructions more complex than they look from far away?

<end terminology>

a_add is a simple shorthand for a_func2(ar_add). As it is so common, we shorten the VM code a bit. a_func1 and a_func2 carry a single argument that is a pointer to the function.

These instructions create the term to be evaluated on the stack as a normal Prolog term, for non-optimized arithmetic. They start a compound term and make the argument pointer point at the first argument. The “r” variant is used for the last argument to avoid a pop, i.e., similar to last-call optimization.

The gain comes from two aspects. First of all, if we find A is B+1, non-optimized arithmetic creates +(B,1) on the stack and then calls is/2 to analyze the term again; finally, GC needs to get rid of the +(B,1) term if we do not fail soon enough (in which case backtracking gets rid of it). In the optimized version we simply push B, then 1 (already in native representation, so we do not have to figure out what it is) and can immediately call ar_add(). This uses a stack of number structs, indeed using the same implementation as the argument stack used for valueExpression(). Note that another advantage of this is that if something goes wrong, we have a nice stack of all our intermediate pending results that we can easily clear. We don’t need to clear floats and small integers, as these fit in the number struct, but we do need to clear big integers and rational numbers.

All this stuff uses the “seg(mented) stacks”. This is a C struct that is most often created as a local variable in a native frame. Typically it starts from a small buffer, also in the C stack frame. As it grows, it uses malloc() to allocate new segments. For the couple of elements used for typical terms it doesn’t need any allocation and thus no cleanup.

The a_ code computes the result from compiled form. The b_ instructions build arguments for predicates to be called (“B” for Body).

As it is now, either compiled or term evaluation, arithmetic does not change the Prolog stacks, so the engine will not force a GC or stack shift. All arithmetic intermediate results are stored in the (seg)stack of number structs, carrying numbers in native format (it combines the type (int/float/big int/rational) and a union with the value in native representation). Before, we did callbacks on Prolog for user-defined arithmetic, in which case we had to translate the numbers back to Prolog representation, call the predicate and convert its result. This may run arbitrary Prolog and thus may force stack shifts and/or GC. As a result we must use “indirect” handles to the terms that are updated by GC/stack shift. Also, for bigints and rationals that we read from the Prolog stack, we leave the actual bits representing the number on the stack. As long as it is not shifted, that is fine. If the stack could shift, we would need to copy the bits away or make GC and the stack shifter aware of this problem. One costs time and space, the other complicates code.

Enough about arithmetic. IMO the design is good as it is. Some individual functions could be improved and some careful analysis of the implementation would probably find possibilities for optimization. I did a few while analyzing your examples, making some of this considerably faster. Further progress on these examples is marginal.

You would spare this allocate and free if you used the existing stack.
I still don’t understand why the existing stack cannot be used, the
same stack that the a_ instructions use. I have to check; you say it’s
allocated initially on the native stack? And doesn’t need allocate and
free for a small size? The small size is 16 elements, as seen here:

Word term_buf[16];
number arg_buf[16];

Its used as actual parameters in:

initSegStack(&term_stack, sizeof(Word), sizeof(term_buf), term_buf);
initSegStack(&arg_stack, sizeof(number), sizeof(arg_buf), arg_buf);

Wouldn’t it be faster to use the head-room of the existing stack? The
register window that ZIP uses? Can I create a test case where the above
chunk size is too small and it switches to allocate and free, and there
is then a drop in the speed of your (is)/2? I have to try.

I’d first run a profiling tool such as valgrind on the code. I very much doubt that there will be a significant cost from the malloc()/free() after we overflow the 16 stack entries. But yes, one could try to see whether the bump is noticeable and, if so, raise the initial buffer a bit. Reducing malloc/free load is typically worthwhile, although the effect differs wildly depending on the malloc implementation used. Notably Windows’ default malloc seems way slower than tcmalloc on Linux.

It’s difficult to measure the two stacks separately; I don’t know a
specific test for the term_stack or the arg_stack. But measuring both
plus the rest of valueExpression can be done. Maybe the cycle test is
also measured?

Is it possible to switch off the cycle test for measuring? You can
measure the lump of it all via left, middle and right, which all
compute the same number 1024, only with a different tree layout, and
thus different allocation bumps:

/* SWI-Prolog 9.1.14 */
?- time(left), time(middle), time(right).
% 201,023 inferences, 4.734 CPU in 4.733 seconds (100% CPU, 42460 Lips)
% 204,092 inferences, 3.984 CPU in 3.984 seconds (100% CPU, 51223 Lips)
% 201,023 inferences, 3.156 CPU in 3.156 seconds (100% CPU, 63690 Lips)
true.

In my system I see some relatively smaller bumps, having to do with the native
stack, my reference counting and Java Garbage Collection:

/* Jekejeke Prolog, 1.6.3, JDK 8 */
?- time(left), time(middle), time(right).
% Time 2077 ms, GC 19 ms, Wall 14/09/2023 15:22
% Time 1804 ms, GC 0 ms, Wall 14/09/2023 15:22
% Time 1772 ms, GC 17 ms, Wall 14/09/2023 15:22
true.

In GNU Prolog middle wins, although left, middle and right all have
1023 ops; I don’t know what’s going on there:

/* GNU Prolog 1.5.0 */
?- left.
(1234 ms) yes
?- middle.
(1203 ms) yes
?- right.
(1375 ms) yes

This is the test case:

left :- left(1024, X), between(1, 100000, _), _ is X, fail.
left.
middle :- middle(1024, X), between(1, 100000, _), _ is X, fail.
middle.
right :- right(1024, X), between(1, 100000, _), _ is X, fail.
right.

left(1, 1) :- !.
left(N, (X+1)) :- M is N-1, left(M, X).
middle(1, 1) :- !.
middle(N, (X+Y)) :- M is N//2, middle(M, X), K is N-M, middle(K, Y).
right(1, 1) :- !.
right(N, (1+X)) :- M is N-1, right(M, X).

It seems my recent changes made quite a difference.

Old version, Linux

% 200,002 inferences, 4.742 CPU in 4.750 seconds (100% CPU, 42174 Lips)
% 200,001 inferences, 3.795 CPU in 3.796 seconds (100% CPU, 52699 Lips)
% 200,001 inferences, 2.993 CPU in 2.993 seconds (100% CPU, 66824 Lips)

Old version, Windows

% 200,002 inferences, 4.230 CPU in 4.224 seconds (100% CPU, 47282 Lips)
% 200,001 inferences, 3.580 CPU in 3.583 seconds (100% CPU, 55866 Lips)
% 200,001 inferences, 3.020 CPU in 3.022 seconds (100% CPU, 66225 Lips)

So, that more or less confirms your findings. But, current GIT version:

Linux

% 200,000 inferences, 2.313 CPU in 2.321 seconds (100% CPU, 86480 Lips)
% 199,999 inferences, 1.884 CPU in 1.884 seconds (100% CPU, 106165 Lips)
% 199,999 inferences, 1.994 CPU in 1.994 seconds (100% CPU, 100324 Lips)

Windows

% 200,000 inferences, 1.820 CPU in 1.819 seconds (100% CPU, 109890 Lips)
% 199,999 inferences, 1.840 CPU in 1.840 seconds (100% CPU, 108695 Lips)
% 199,999 inferences, 2.030 CPU in 2.031 seconds (100% CPU, 98522 Lips)

I think this is the first time I see the Windows version outperform the Linux version. In fact, the measurement is running the Windows version under Wine. In my experience, timing CPU-bound tasks under Wine is pretty close to Windows under a VM. I do not have “bare metal” Windows.

Anyway, this doesn’t seem to suggest that native vs. userland stacks make much difference. That would also be my expectation. Creating, destroying and checking overflows is a bit more expensive.
You need the overflow check anyway and it is always going to cost (and it is hard to implement for the native stack). Diving into the term, however, just pushes a single pointer; that is cheaper than a C function call. Note that diving into the term means (1) push the current pointer and (2) set it to the last argument of the new compound. The next iteration decrements this pointer until we find the functor that marks that we are done with all arguments.

Whoa! You can use bomb/2 to kill much more stuff. For example the top-level:

?- bomb(40, X).
%%% hangs, even Ctrl-C doesn't work anymore

But I wanted to test something else: how assertz/1 and evaluable
expressions behave in the face of a bomb. But what is causing the above
blockage? Is it acyclic_term/1 or the similar factoring of the term for
printing?

Edit 28.09.2023
Nope, must be something else; the two suspects are fine. Or at least
the built-ins below use some clever algorithms:

/* SWI-Prolog 9.1.16 */
?- bomb(40, X), time(acyclic_term(X)), fail; true.
% -1 inferences, 0.000 CPU in 0.000 seconds (0% CPU, Infinite Lips)
true.

?- bomb(40, X), time(term_factorized(X,_,_)), fail; true.
% 73,667 inferences, 0.016 CPU in 0.020 seconds (79% CPU, 4714688 Lips)
true.

It is the toplevel trying to figure out which variables in the term need to be named and which can be printed as _. That code walks the term in such a way that shared subterms are walked each time they are encountered. Added interrupt handling to this, as well as numbervars/4 using the singletons(true) option. Quite likely there are more, roughly all C code that walks a Prolog term this way (and has not been improved to handle signals).

Sounds familiar; for example, term_singletons/2 will explode:

?- between(23,27,N), bomb(N,X),
    time(term_singletons(X,_)), fail; true.
% -1 inferences, 0.078 CPU in 0.074 seconds (106% CPU, -13 Lips)
% -1 inferences, 0.141 CPU in 0.153 seconds (92% CPU, -7 Lips)
% -1 inferences, 0.328 CPU in 0.322 seconds (102% CPU, -3 Lips)
% -1 inferences, 0.672 CPU in 0.667 seconds (101% CPU, -1 Lips)
% -1 inferences, 1.344 CPU in 1.350 seconds (100% CPU, -1 Lips)
true.

term_singletons/2 and friends are tricky cases for which to devise a smart algorithm.