Have you any information on Prolog’s performance in the 2025 “Billion Nested Loop Iterations” test? Three hundred different programming languages are said to have taken part. [My apologies if a thread already exists on this issue]
So, this is kind of the worst type of benchmark for Prolog, because it relies on two features from procedural languages that Prolog completely lacks:
- loops (for x in y)
- in-place assignment (+=)
But I enjoy applying Prolog where it doesn’t belong, so here is my take, following the Python code and the official description of the task:
loops(U, Return) :-
    random(0, 10_000, R),
    compound_name_arity(A, array, 10_000),
    init(A, 10_000),
    loops(U, R, A, 10_000),
    R1 is R + 1,
    arg(R1, A, Return).

init(A, I) :-
    (   I == 0
    ->  true
    ;   arg(I, A, 0),
        I1 is I - 1,
        init(A, I1)
    ).

loops(U, R, A, I) :-
    (   I == 0
    ->  true
    ;   loops(U, R, A, I, 10_000),
        arg(I, A, V1),
        V2 is V1 + R,
        nb_setarg(I, A, V2),
        I1 is I - 1,
        loops(U, R, A, I1)
    ).

loops(U, R, A, I, J) :-
    (   J == 0
    ->  true
    ;   arg(I, A, V1),
        J1 is J - 1,
        V2 is V1 + J1 mod U,
        nb_setarg(I, A, V2),
        loops(U, R, A, I, J1)
    ).
Notes on the implementation:
- I used a compound term to model the array rather than a Prolog list (just try to write this task with a Prolog list and you will understand why I didn’t ^^).
- Loops are obviously translated to recursion. One twist: I iterate from 10_000 down to 0 instead of 0 up to 10_000, because I find it easier and I believe slightly faster. I believe it doesn’t change the spirit of the benchmark.
- The in-place operation += is translated to: arg(I, Array, V1), V2 is V1 + Other, nb_setarg(I, Array, V2). nb_setarg is slightly faster than setarg, but I don’t really know whether it is portable.
- The code needed to be adapted to use 1-based indexing for arg.
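To make the translation of += concrete, here is the destructive-update pattern factored into a small helper. This is just a sketch; increment_arg/3 is my own name, not part of the benchmark code, and nb_setarg/3 is SWI-specific (setarg/3 is more widely available, but still not standard):

```prolog
% increment_arg(+I, +Term, +Delta)
% Destructively adds Delta to the I-th argument of Term,
% emulating term[i] += delta via arg/3 + nb_setarg/3.
increment_arg(I, Term, Delta) :-
    arg(I, Term, V0),          % load:  V0 = term[i]
    V is V0 + Delta,           % add
    nb_setarg(I, Term, V).     % store (non-backtrackable)
```

Usage: `?- T = t(0, 0), increment_arg(1, T, 5), arg(1, T, V).` binds V to 5, with the update surviving even if we later backtrack over the call.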
Here are the results on my machine:
$ lscpu
12th Gen Intel(R) Core(TM) i7-1255U % on Fedora 43
$ swipl -O loops.pl
Welcome to SWI-Prolog (threaded, 64 bits, version 9.2.9)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.
For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).
?- time(loops(40, R)).
% 200,072,550 inferences, 13.724 CPU in 13.782 seconds (100% CPU, 14578271 Lips)
R = 201415.
Profiling reveals that 99.9% of the time is spent in loops/5 (the inner loop), of which 77.5% is spent in itself and 22.5% in nb_setarg.
Impressive!
Thank you for your response; but you did not tell us where this result would have placed Prolog among the 300 or so languages.
Regarding the actual configuration of the test, how about organizing a parallel test more attuned to the characteristics of higher-level languages? Could SWI-Prolog take the lead?
Hm, how about last!
I tried to track down the runtimes of other languages, and I believe I saw something like 4 sec for Python.
So no, Prolog can’t take the lead among higher-level languages on a procedural benchmark.
What would be interesting would be a backtracking-heavy benchmark. Maybe Prolog could take the lead there?
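For instance, here is a toy sketch of my own (not an official benchmark) of what a backtracking-heavy workload could look like: counting all permutations of a list via select/3, which exercises exactly the choicepoint creation and backtracking machinery that procedural languages would have to simulate explicitly.

```prolog
% perm_count(+N, -Count): count the N! permutations of [1..N],
% generated purely by backtracking through select/3.
perm_count(N, Count) :-
    numlist(1, N, L),
    aggregate_all(count, a_permutation(L, _), Count).

a_permutation([], []).
a_permutation(L, [X|P]) :-
    select(X, L, R),       % choicepoint: pick any remaining element
    a_permutation(R, P).
```

For example, `?- perm_count(8, C).` explores 8! = 40320 solutions by backtracking alone, with no arithmetic in the hot path.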
This is indeed just about the worst task imaginable for comparing Prolog performance. I think the code is about as good as it gets. On my AMD Ryzen 9 3950X, Fedora 43, using 10.1.1, I get 8.8 sec. Compared to ~4 sec for Python, I think this is excellent!
Python is considered to be a slow language; twice as slow as Python is not excellent. It might be understandable given Prolog’s nature, but it is certainly not excellent. Improving it might require a complex optimization subsystem.
I’m not sure nb_setarg is a key Prolog function, but what else does Prolog have if you want O(1) update time?
Using :- set_prolog_flag(optimize, true). on the original function reduces execution time by 50%. Replacing the original code with your code, keeping optimize on, reduces another 30% of the execution time.
Wait: the original code has a 37% distortion in profiler time, and the new code has 2%. This accounts for all of the time savings. But what accounts for the 37% distortion?
“What would be interesting would be a backtracking heavy benchmark”
… and why not! It is not just a question of relative advantage, but also one of a higher level of computational capability.
Well, profiling shows that nb_setarg is 22% of the whole runtime, so pretty important I would say.
Well, I believe your proposed implementation using a Prolog list doesn’t really follow the challenge rules. The task description explicitly says:
Note: Like with the reference implementations the innermost sum should be assigned directly to the array index.
You don’t assign the result of each innermost sum to the list, reducing the 10^8 array accesses to just 10^4, which of course leads to your mistaken observation that nb_setarg doesn’t contribute much to the runtime.
I wonder how Prolog compares with Haskell?
Not sure how to read this. Prolog is designed to search for solutions (variable bindings) that satisfy some logic-based specification. Pattern-matching-based data transformation could also be considered something Prolog is good at. Loops over integers and handling arrays are not what it is good at. Then there are low-level operations: incrementing an array element is expensive. Imperative languages typically have low-level support for array[i]++; Prolog has not. It doesn’t have a generally accepted, usable, and portable notion of arrays. Few array-based algorithms, notably those that update arrays, can be represented comfortably and efficiently in Prolog.
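For completeness, the closest pure alternative that many Prolog systems ship is library(assoc) (AVL trees), which gives O(log N) backtrackable updates instead of O(1) destructive ones. A small sketch, where inc_assoc/4 is my own helper name:

```prolog
:- use_module(library(assoc)).

% inc_assoc(+Key, +Delta, +Assoc0, -Assoc)
% Pure "a[key] += delta": yields a new tree, the old one is unchanged.
inc_assoc(Key, Delta, Assoc0, Assoc) :-
    (   get_assoc(Key, Assoc0, V0)
    ->  true
    ;   V0 = 0                       % missing keys default to 0
    ),
    V is V0 + Delta,
    put_assoc(Key, Assoc0, V, Assoc).
```

Since every update allocates O(log N) fresh nodes, this would be far slower than nb_setarg on this benchmark, but it survives backtracking and stays pure.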
Note that Python is not that good at these tasks either. It compensates for that using vector libraries that are implemented in C(++). That can be done for Prolog as well.
Richard O’Keefe implemented FFTs in Prolog: Learning to implement data structures & algorithms in Prolog - #4 by peter.ludemann
It’s explained in the README of the project:
Changes to the benchmarks compared to legacy runner
…
- loops: The inner loop is now 10k, again to allow slower languages to complete more runs.
It was 100k before.
Perhaps a move away from plain number crunching to logic intensive number crunching? Even better, provide Prolog with the missing “low-level” extra capabilities? … Sounds like F1 racing.
Hi, with your code as is, this is the result I get with the same query as you. I compared this to a naive “count nested loops” query:
101 ?- time( loops(40, R) ).
% 200,051,123 inferences, 5.729 CPU in 5.742 seconds (100% CPU, 34922005 Lips)
R = 198000.
101 ?- time( aggregate_all(count, ( between(1, 10_000, _), between(1, 10_000, _) ), N) ).
% 200,010,003 inferences, 5.090 CPU in 5.099 seconds (100% CPU, 39296557 Lips)
N = 100000000.
It seems that using aggregation to count doesn’t add much in run time, or number of inferences. I then used the following query to try and guess how much of this time is caused by aggregation:
108 ?- time( forall( between(1, 10_000, _), forall( between(1, 10_000, _), true ) ) ).
% 100,009,998 inferences, 2.237 CPU in 2.240 seconds (100% CPU, 44705238 Lips)
Could anyone comment on the correctness of this approach in regards to the problem statement? Am I measuring the right things?
[EDIT]
The above timings were done with swipl -O. Without -O, I get such numbers:
101 ?- time( loops(40, R) ).
% 300,061,123 inferences, 17.066 CPU in 21.231 seconds (80% CPU, 17582905 Lips)
R = 204685.
102 ?- time( forall( between(1, 10_000, _), forall( between(1, 10_000, _), true ) ) ).
% 200,019,999 inferences, 4.503 CPU in 4.511 seconds (100% CPU, 44415360 Lips)
true.
So, the benchmark/challenge measures 100 million recursive calls where each call does a load_memory + add + store_memory operation. This took you 5.7 sec.
You are comparing that with:
- 100 million backtracking steps + a simple non-backtracking counter: that took 5 sec;
- 100 million backtracking steps alone: that took you 2 sec.
I’m not sure we can draw any conclusion from this, as you are not comparing the same things.
But here are some observations:
- There is a big performance drop from doing nothing to doing something on each backtracking/recursion step. This is shown in the big difference between your two comparisons (5 sec vs 2 sec). I attribute this to the shuffling of registers that needs to happen when you do something, instead of just jumping back to the same instruction (this is very hand-wavy, I don’t really know what I am talking about).
- swipl -O is extremely effective when there is arithmetic in a hot path.
- Your backtracking + non-backtracking counter takes almost the same time as the full benchmark with recursion. I believe this hints that backtracking is in general slower than recursion.
This compares a 100-million-step non-backtracking counter implemented with recursion vs with backtracking:
$ swipl -O
Welcome to SWI-Prolog (threaded, 64 bits, version 9.2.9)
?- [user].
|: loop(T) :- arg(1, T, N), (N == 0 -> true ; N1 is N - 1, nb_setarg(1, T, N1), loop(T)).
|: ^D
% user://1 compiled 0.00 sec, 1 clauses
true.
?- time(loop(t(100_000_000))).
% 200,000,001 inferences, 2.849 CPU in 2.856 seconds (100% CPU, 70211263 Lips)
true.
?- time( aggregate_all(count, between(1, 100_000_000, _), N) ).
% 200,000,002 inferences, 5.204 CPU in 5.219 seconds (100% CPU, 38433252 Lips)
N = 100000000.
A big problem with this type of code in the current system is calling out to predicates defined in C. If we try the code below, doing a simple loop and only retrieving elements from “the array”, according to the profiler, 44% of the time is spent in arg/3, the only involved foreign predicate.
Reconsidering the foreign predicate interface might be worthwhile. It will remain problematic though because we need to save enough of the VM state to allow the foreign code to call the garbage collector, setup callbacks to Prolog, etc. Possibly we should define a more low level foreign interface …
t(N) :-
    functor(State, state, 100),
    loop(N, State).

loop(N, State) :-
    (   N > 0
    ->  loop2(State, 1, 100),
        N2 is N - 1,
        loop(N2, State)
    ;   true
    ).

loop2(State, I, Max) :-
    (   I =< Max
    ->  arg(I, State, _),
        I2 is I + 1,
        loop2(State, I2, Max)
    ;   true
    ).
Would that be a “no call-backs allowed” limitation that would help make faster foreign predicates? Or something completely different?
An additional twist to the benchmark could also help: a coefficient for terseness/brevity of the code.
There are several things we do before calling out to a foreign predicate:
- Save several VM registers, so we get the complete view for running GC.
- More VM state to allow for callbacks
- Create a choicepoint, so we can backtrack and we can implement backtracking foreign predicates.
In addition, we create a normal environment stack frame holding the arguments. This allows us to have a uniform simple calling convention and provide a backtrace that includes the foreign predicate in case of an exception.
Omitting some of these steps will reduce the calling overhead, but also complicates the codebase. It is not that clear what the net effect would be …