Change in performance measures with non-tail recursion

Greetings. I was experimenting with very simple predicates, taking notes on the conversion from non-tail recursion to tail recursion, when I stumbled upon this interesting phenomenon:

The first time you query a non-tail-recursive predicate (simple recursion, or stack recursion), the time/1 predicate reports a large number of inferences and Lips. From the second time the same query is made onwards, the number of inferences decreases and the number of Lips increases.
By contrast, a tail-recursive predicate always reports the same numbers when queried.

It seems the compiler performs some kind of optimization, and I would like to understand it.

Check the following example: a simple predicate for reversing a list in two versions, non-tail recursion and tail recursion:

%=====================
% Non-tail recursion version
%=====================
   reverse01([], []).
   reverse01([X|Rest], Solution) :-
       reverse01(Rest, InvertedRest),
       append(InvertedRest, [X], Solution).
%=====================
% Tail recursion version
%=====================
   reverse02(List, Inverted) :-
       reverse_aux(List, [], Inverted).

   reverse_aux([], Acum, Acum).
   reverse_aux([X|Rest], Acum, Solution) :-
       reverse_aux(Rest, [X|Acum], Solution).

Both versions are correct and work as expected.
Now, starting from a fresh console, I query each version twice with the same query
and get the following results:

?- time( reverse01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 26,846 inferences, 0.007 CPU in 0.007 seconds (100% CPU, **4068915 Lips**)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7|...].

?- time( reverse01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 134 inferences, 0.000 CPU in 0.000 seconds (88% CPU, **4964618 Lips**)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7|...].

?- time( reverse02([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 15 inferences, 0.000 CPU in 0.000 seconds (84% CPU, **860042 Lips**)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7|...].

?- time( reverse02([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 15 inferences, 0.000 CPU in 0.000 seconds (84% CPU, **860042 Lips**)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7|...].

As you can see, the non-tail-recursive version performs differently the first and second times it is used (26,846 inferences the first time, and 134 inferences from the second time onwards), but the tail-recursive version behaves consistently.

I then tested various other examples, and the non-tail-recursive versions always behaved in the same way, although the increase/decrease in the number of inferences varied in magnitude.

Can you please explain this phenomenon?

Thanks in advance.

I don’t get that result (although I have noticed that sort of first-time delay in the past, which I think is due to SWI-Prolog’s just-in-time clause indexing). On a fresh load:

Welcome to SWI-Prolog (threaded, 64 bits, version 9.2.5)

?- [reverse_slow_first_time].
true.

?- time( reverse01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 275 inferences, 0.000 CPU in 0.000 seconds (99% CPU, 5667649 Lips)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1].

?- time( reverse01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).
% 134 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 3886875 Lips)
L = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1].

What version are you using?

For profiling info, try instead:

profile( reverse01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], L) ).

… from the SWI-Prolog manual section “Execution profiling”.

Greetings @brebs,

I’m using SWI-Prolog (threaded, 64 bits, version 9.3.7) over Linux.

I don’t know why your results are so different…

Most likely, you are measuring the auto-load time of append/3 in the first call:

Predicate append/3
Availability: :- use_module(library(lists)). (can be autoloaded)
https://www.swi-prolog.org/pldoc/doc_for?object=append/3

If you put a use_module/1 directive before your test, you won’t see it.
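A minimal sketch of that fix (assuming the reverse01/2 definition from the first post is in the same file):

```prolog
% Load library(lists) explicitly at load time, so the first time/1 call
% does not also measure the autoloading of append/3.
:- use_module(library(lists)).
```

With this directive at the top of the file, the first and subsequent timings of reverse01/2 should be much closer.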


I have to comment that, upon studying SWI-Prolog’s just-in-time clause indexing, it may well be the explanation for the difference in performance between the first query and the rest. However, its relationship to tail recursion versus stack recursion is still unclear…

Further testing led me to observe that not all tail-recursive predicates perform better than non-tail-recursive ones.

The first example I gave (reverse01 and reverse02) may have been biased by the presence of append in one version but not in the other. So consider a new example where the two versions have the same expressions, just in a different order: the sum of the elements of a numeric list. The code looks like this:

%=====================
% Non-tail recursion version
%=====================
   sum01([], 0).
   sum01([X|Rest], Solution) :-
       sum01(Rest, AuxSum),
       Solution is X + AuxSum.
%=====================
% Tail recursion version
%=====================
   sum02(List, Solution) :-
       sum_aux(List, 0, Solution).

   sum_aux([], Acum, Acum).
   sum_aux([X|Rest], Acum, Solution) :-
       AuxSum is X + Acum,
       sum_aux(Rest, AuxSum, Solution).

And the time test yields:

?- time( sum01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], S) ).
% 30 inferences, 0.000 CPU in 0.000 seconds (93% CPU, 1484414 Lips)
S = 120.

?- time( sum01([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], S) ).
% 29 inferences, 0.000 CPU in 0.000 seconds (87% CPU, 1425131 Lips)
S = 120.

?- time( sum02([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], S) ).
% 30 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 1322985 Lips)
S = 120.

?- time( sum02([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], S) ).
% 30 inferences, 0.000 CPU in 0.000 seconds (86% CPU, 1322985 Lips)
S = 120.

In this case, the tail-recursive version uses one more inference than the non-tail-recursive version.

So, can someone explain this? Again, thanks in advance…

I don’t know what makes you claim anything on performance. All CPU times are zero :slight_smile: If we try a bit bigger data sets we get differences:

69 ?- numlist(1, 1_000_000, L), time( sum02(L, S) ).
% 2,000,000 inferences, 0.144 CPU in 0.144 seconds (100% CPU, 13912840 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = 500000500000.

70 ?- numlist(1, 1_000_000, L), time( sum01(L, S) ).
% 1,999,999 inferences, 1.171 CPU in 1.176 seconds (100% CPU, 1708421 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = 500000500000.

If we use swipl -O (optimized), we get:

71 ?- numlist(1, 1_000_000, L), time( sum01(L, S) ).
% 1,000,000 inferences, 0.419 CPU in 0.421 seconds (100% CPU, 2389244 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = 500000500000.

71 ?- numlist(1, 1_000_000, L), time( sum02(L, S) ).
% 1,000,000 inferences, 0.028 CPU in 0.029 seconds (100% CPU, 35125913 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
S = 500000500000.

Some of the reasons: the tail-recursive version uses far less memory, as it does not need to build 1M stack frames (building those also requires expanding the stacks quite a few times); it does not need to exit from all these stack frames; and the tail call to the same predicate is optimized using dedicated VM instructions.

Professor @jan,

Thank you for your observation.
Maybe “performance” was not the correct word; I meant performance in terms of the numbers the time/1 predicate reports.

Your explanation about tail recursion using less memory is sound, but I will need to dive into that optimization… Where can I read more about it?

Search for “last call optimization” or “tail call optimization” or “tail recursion optimization”. There’s a fairly long description in Wikipedia (which I haven’t read in detail), including a slightly confusing/misleading description of TRO for Prolog.

Greetings @peter.ludemann,

As you suggested I have read the Wikipedia entry for “Tail Call” and particularly focused on the Prolog example code (in section “Tail recursion modulo cons”).

I would love to see your (preferably long) explanation about why you think it is confusing/misleading. I’m sure everyone on this forum would learn much from it.

Please!

My simple explanation of tail-recursion optimization (TRO) is that it transforms a call/return into a goto.
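A minimal sketch of that idea (count_down/1 is a toy predicate of my own, not from the thread):

```prolog
% The recursive call is the last goal of the second clause, and clause
% selection on the integer first argument is deterministic, so the system
% can reuse the current frame (effectively a goto) instead of pushing a
% new frame for each call.
count_down(0).
count_down(N) :-
    N > 0,
    N1 is N - 1,
    count_down(N1).
```

A query like ?- count_down(10_000_000). then runs in constant stack space; putting any goal after the recursive call would instead force one frame per iteration.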

The “tail recursion modulo cons” section implies that TRO only works with “cons”, but the concept is more general: it applies to any term that is constructed and would be a return value in a functional language. The Wikipedia explanation also doesn’t mention logical variables, which are what make this possible in Prolog.

The simplest example of TRO (using “cons”) is append/3, whose 2nd clause is tail-recursive in Prolog but not in LISP (unless something special is done with “cons”):

append([], Ys, Ys).
append([X|Xs], Ys, [X|Zs]) :- append(Xs, Ys, Zs).

You can see the LISP “cons” here:

(define append
  (lambda (ls1 ls2)
    (if (null? ls1)
      ls2
      (cons (car ls1) (append (cdr ls1) ls2)))))

Technically speaking, the final “cons” in the LISP version of append is a tail call; there is no need to create a new stack frame and then do a call/return; instead, the tail call can reuse the stack frame and do a goto to the “cons” function. But the major cost of append is in the recursive calls: in Prolog, the recursive call to append/3 is a tail call, but in LISP it isn’t, so the LISP stack grows (and needs to be unwound) whereas the Prolog stack doesn’t grow.

(The Prolog tail call isn’t completely without cost: at the implementation level, there’s an extra level of indirection because of the logical variable that gets bound later — the list cell in the third argument of the recursive clause head above. However, this is still cheaper than creating/deleting stack frames.)

An example of TRO with something other than “cons” is Peano arithmetic, where the s(C) is a “return value” term and the call to sum/3 is tail-recursive:

sum(A, 0, A).
sum(A, s(B), s(C)) :-
	sum(A, B, C).

(This code is tail-recursive both when it’s used for addition (e.g. sum(s(s(0)), s(s(s(0))), Sum)) and for subtraction (e.g. sum(s(s(0)), Diff, s(s(s(0))))).)
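Hand-tracing the clauses above, the two modes look like this (output as I’d expect SWI-Prolog to print the first answer):

```prolog
?- sum(s(s(0)), s(s(s(0))), Sum).   % addition: 2 + 3
Sum = s(s(s(s(s(0))))).

?- sum(s(s(0)), Diff, s(s(s(0)))).  % subtraction: 3 - 2
Diff = s(0).
```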