Prolog program is much slower than Haskell -- why?

I thought I’d learn a bit of Haskell, so I started watching Graham Hutton’s online course. The first example is a generate-and-test solver for the “countdown” puzzle, and I thought “that’s something that Prolog can do both more compactly and faster”. I was right about the “more compactly” but terribly wrong about “faster” – on my machine, Haskell takes 0.3 seconds for all solutions (compiled; 4.2 seconds using the ghci interpreter); but Prolog takes over 40 seconds. And, looking at my Prolog code, I don’t see an obvious way of speeding it up - the profiler shows almost all the time spent here:
81% eval/2
16% expr/2 (14% expr_2/2, 7% op_primitive/3
One thing that doesn’t show is any first-argument indexing – I assume it’s there but vm_list/1 doesn’t show anything. I was able to make the Prolog code run about 20% faster by changing some of the comparison operations; and the -O option got another 20%; but to be competitive with the Haskell code, I’d need a 10x speedup, and I don’t see how to do that.

The Prolog code is in countdown.pl; the Haskell code is in pgp-countdown.hs and documented in https://www.cs.nott.ac.uk/~pszgmh/countdown.pdf ; I wrote a few notes in countdown-notes.md. The test problem is run by:

time(forall(solve([1,3,5,10,25,50], 999, Expr), writeln(Expr))).

Is this slow performance because SWI-Prolog is designed for nice development features with performance being secondary, or is there some fundamental difference between Prolog and Haskell that accounts for the speed difference? Or is my Prolog code just really bad?

I tried GNU-Prolog’s compiler – it ran in 18 .4 seconds, which is a definite improvement, but still not enough to compete with Haskell. (GNU-Prolog’s interpreter was nearly 4x slower than SWI-Prolog)

(I noticed that GNU-Prolog builds with -O3 -fomit-frame-pointer -Wno-char-subscripts -funsigned-char but I haven’t tried that, and I suspect it wouldn’t make much difference.)

EDIT:

I rebuilt swipl with the GNU-Prolog compiler options and there was at most a few percent performance change. (In both cases, I used PGO)

I’ve improved the code, now in countdown2.pl – it’s now only 7x slower than interpreted Haskell. It’s hard to see where I can get more speed because the profiler shows it spending 50% of its time in is/2. (My speedup was by changing eval/2 to is/2; the overall time is now split evenly between evaluation and generating candidate expressions.)

With compiled GNU-Prolog, I’m down to 18 seconds (almost twice as fast as SWI-Prolog), or roughly 4x slower than interpreted Haskell.

The Haskell code (as far as I can tell, because I’m not an expert Haskell programmer) generates a lazy list of permutations (whereas Prolog uses backtracking). The eval and apply functions presumably are faster than Prolog’s is/2 because the Haskell code doesn’t have to do any extra tag checking. But that accounts for at most of half the performance difference between the Haskell and Prolog programs.

4 Likes

Does this help: Prolog finding target number in a list by applying operations (+,-,/,*) - Stack Overflow

?- time(solve_countdown([1,3,5,10,25,50], 999, L)).
% 7,324,251 inferences, 0.454 CPU in 0.455 seconds (100% CPU, 16115631 Lips)
L = [1+(3+(25*(50-10)-5)), 1+(25*(50-10)-(5-3)), 1+(3+25*(50-10)-5), 
3+(1+(25*(50-10)-5)), 3+(25*(50-10)-(5-1)), 3+(1+25*(50-10)-5), ...

Without changing anything to the algorithm, but moving the evaluation into generating the expression we get two advantages: the pruning happens earlier and the evaluation is not on an unknown term and thus -O optimization works. That gets me down to 0.462 seconds. It also avoids duplicating the code in eval and clean, so it gets a lot shorter :slight_smile:

Code
% -*- mode: Prolog; coding: utf-8 -*-

% Solve the "countdown" problem.
% https://www.cs.nott.ac.uk/~pszgmh/countdown.pdf
% https://en.wikipedia.org/wiki/Countdown_(game_show)#Numbers_round

:- module(countdown,
          [solve/3]).

solve(Numbers, Target, Expr) :-
    subseq(Numbers, Numbers2),
    perm(Numbers2, NumbersPermuted),
    expr(NumbersPermuted, Expr, Target).

%! expr(+Ns, -Expr, -Value).
% Ns is a list of numbers; Expr is an expression
% made by interpolating all the possible operators.
% For example, expr([1,2,3], 1+2*3, 7).

expr([N], Expr, Value) => Expr = N, Value = N.
expr(Ns, Expr, Value) =>
    append(Left, Right, Ns),
    expr_(Left, Right, Expr, Value).

expr_(Left, Right, Expr, Value) :-
    Left = [_|_],
    Right = [_|_],
    expr(Left, LeftExpr, LValue),
    expr(Right, RightExpr, RValue),
    op_primitive(LeftExpr, RightExpr, Expr, LValue, RValue, Value).

op_primitive(Left, Right, Left+Right, LV, RV, V) :-
    LV =< RV, RV =\= 0, V is LV+RV.
op_primitive(Left, Right, Left*Right, LV, RV, V) :-
    LV =< RV, RV > 1, V is LV*RV.
op_primitive(Left, Right, Left-Right, LV, RV, V) :-
    RV =\= 0,
    LV > RV,
    V is LV-RV.
op_primitive(Left, Right, Left/Right, LV, RV, V) :-
    RV =\= 0, V is LV/RV, integer(V).

% Permutation - taken from library(lists).
perm([], []).
perm(List, [First|Perm]) :-
    select(First, List, Rest),
    perm(Rest, Perm).

subseq(List, Subseq) :-
    Subseq = [_|_],
    subseq_(List, Subseq).

subseq_([], []).
subseq_([Head|Tail], [Head|SubTail]) :-
    subseq_(Tail, SubTail).
subseq_([_|Tail], SubSequence) :-
    subseq_(Tail, SubSequence).

And yes, GNU-Prolog only does 32/64 bit integer arithmetic without overflow checking. With native code that will be faster :slight_smile:

I don’t know whether the comparison is now fair. Seems we are fairly on par regarding performance though (even with interpreted vs compiled code) and with about 50 vs 180 lines of source I think that is not too bad :slight_smile:

edit a quick scan seems to suggest the Haskell version does the pruning also early. If you can confirm that, it seems to fair performance comparison. The code comparison is not entirely fair as the Haskell version has some boilerplate for which you use the Prolog toplevel as replacement.

5 Likes

A faster method has appeared :grinning: : Prolog finding target number in a list by applying operations (+,-,/,*) - Stack Overflow

1 Like

I’ve incorporated @jan 's code into countdown3.pl with a few small enhancements. The performance improvements were roughly:

  • ~3x by doing the arithmetic with known terms instead of a general term (by itself, turning -O on and off gives 2x performance)
  • ~3x by earlier pruning of the associative duplicates, “plus 0”, “times 1”, etc. sub-expressions.

I tried tabling but it had no effect or made things a bit worse – the reduction in inferences was offset by the overhead of tabling.

Anyway, it seems that Prolog is faster for this kind of problem after all, and it’s a lot more succinct. When compiled (with GNU-Prolog), it’s about 15% faster than Haskell (SWI-Prolog is about 3x slower, but it’s interpreted).

2 Likes

On my machine, this is about 60% faster than @jan’s improvement on my code. But on compiled GNU-Prolog, it’s only about 15% faster.

Another place for a small optimization, which is, I think, also a bit more elegant is to change this

expr([N], Expr, Value) => Expr = N, Value = N.
expr(Ns, Expr, Value) =>
    append(Left, Right, Ns),
    expr_(Left, Right, Expr, Value).

expr_(Left, Right, Expr, Value) :-
    Left = [_|_],
    Right = [_|_],
    expr(Left, LeftExpr, LValue),
    expr(Right, RightExpr, RValue),
    op_primitive(LeftExpr, RightExpr, Expr, LValue, RValue, Value).

Into

expr([N], Expr, Value) => Expr = N, Value = N.
expr(Ns, Expr, Value) =>
    Left = [_|_],
    Right = [_|_],
    append(Left, Right, Ns),
    expr_(Left, Right, Expr, Value).

expr_(Left, Right, Expr, Value) :-
    expr(Left, LeftExpr, LValue),
    expr(Right, RightExpr, RValue),
    op_primitive(LeftExpr, RightExpr, Expr, LValue, RValue, Value).

But, all and all that is not really the point. The point seems to be that @peter.ludemann’s first solution didn’t do early pruning where its Haskell version did. That is why Prolog lost badly on performance. With that fixed we are pretty much in the same league and all the rest is small details in how it is written and which Prolog system you use. Bottom line is that the Prolog version is a lot shorter and about as fast :slight_smile: Please correct me if I’m wrong.

1 Like

Yes, that’s what I wanted to see. However, it is interesting that a generate-and-test that uses lazy evaluation of a list of permutations runs about as fast as Prolog’s backtracking. I’m also surprised the GNU-Prolog’s compiler gets a 3x performance improvement on code that is heavily backtracking (10x improvement, compared to GNU-Prolog’s byte-code interpreter) – I would have expected a much lower performance improvement for that kind of code, compared to the performance improvement by compiling deterministic code.

(Profiling turned out to be a bit misleading – it showed that using is/2 on known terms would get at most a 2x speedup; but it turned out to be 3x. Also, I did have pruning in my original code [my original original code (which I didn’t post) didn’t have pruning, and it ran about 2x slower - and I didn’t compare the timing with the Haskell code without pruning, which might have given me a hint]; my mistake was in not realizing that pruning could be done on the results of evaluating the nodes in the expression tree whereas I was pruning on the unevaluated nodes (that is, I was doing a top-down style of pruning but @jan (and the Haskell code) did bottom-up – and these turned out to be equivalent.)

Yes, I had noticed this also, and it got another 10% or so of performance. And I put expr_/4 into expr/4 – the 2 separate predicates were an artifact of some earlier refactoring.
Here’s my final code (using :-! instead of => so that it works with GNU-Prolog also):

expr([N], Expr, Value) :- !, Expr = N, Value = N.
expr(Ns, Expr, Value) :-
    Left = [_|_], Right = [_|_],
    append(Left, Right, Ns),
    expr(Left, LeftExpr, LValue),
    expr(Right, RightExpr, RValue),
    op_primitive(LeftExpr, RightExpr, Expr, LValue, RValue, Value).
1 Like

The SWIPL Application use the “Virtual-ZIP-Mascine” for reprasentation the Code
that was created by the Compiler (Interpreter).
This Code is a simple ByteCode, packed as ZIP-Archive File.
This Zip File will be stored after the Exe File.
The Exe File itself is a Stub, which call the SWIPL Environment Files.

BTW: The Format and Idea of the virtual ZIP machine is very old.

This means: The ByteCode must be converted back (depack, create CPU Menomics…)
And these Step’s can take a while.

A second Point is your File System Layout.
So, tiny 512 Byte Blocks can be used, to save Disk Space for the data Informations,
but the Operating System have to do more “seek” things as if you use
2048 Byte Blocks.

The cons on 2048 Byte Blocks is, that a File of 128 Bytes consume 2048 Bytes.

I’m familiar with that technique, but didn’t see an obvious way of doing it for this problem.
My experience in the past is that freeze/2 roughly doubles the cost of the tests, so the pruning needs to reduce the number of items generated by at least 3x to be effective. (A similar comment about tabling – which I did try, and found that the cost of tabling equalled or exceeded the speedup from re-doing goals)

Constraint solving looks like an interesting alternative – @hakank - do you have any thoughts on whether this kind of problem would work well with a constraint solver?

Somehow freeze/2 and related attributed variable wakeup is relatively slow. There seem to be faster ways, although I don’t see obvious opportunities that are likely to make a big difference. For tabling, that would be much easier. As is, we wrap the whole tabled predicate. What would need to happen is to have the tabled predicate “supervisor” look for a table and generate answers from it and only start the more complicated route if there is no (completed) table. That is no rocket science. It is a bit of work though.

@peter.ludemann

Regarding a constraint model (using clpfd), there are at least two issues that complicates the matter:

  1. The division is only integer division so some solutions might be missing.
  2. A bigger challenge is to group the operations properly, at least if one is doing a “pure” list based version (which is what I tend to do).

That being said, below is a Q&D/PoC clpfd version with several warts (i.e. possibilities for improvement):

  • It does not group the operations at all. Instead it assumes the operations are left to right, e.g. (((((10 + 5) * 3) - 25) * 50) - 1). Thus it only shows 36 solutions to the [1,3,5,10,25,50] = 999 instance (instead of 77).
    The parenthesis are added afterwards during the printing of the solution (also an ugly wart).
  • It’s slower than countdown3.pl (from your repo): 1.5s (f36 solutions) vs 0.261s (77 solutions).
  • A proper model would use global_cardinality/2 to ensure that a number is not used more than the number of occurrences in the given list. However, I skipped that and instead useall_different/1 assuming that only distinct integers are in the given list (which means that it cannot solve the quite famous [4,4,4,4] = 24 problem.)

The model uses three lists of decision variables:

  • X: includes the selected numbers (in order) to use. The list is created from length 2 to the full length (all numbers).
  • Ops: the operations coded as 1…4, for +,-,*, and // (integer division) respectively
  • Y: contains the (intermediate) results of
    X[I] Op[I] X[I+1]
    The last element of Y should unify with Target

Here’s the code, with warts and all:

:- use_module(library(clpfd)).
go :-
    L = [1,3,5,10,25,50],
    Target = 999,
    countdown(L,Target,X,Ops),
    p(X,Ops),
    nl,
    fail,
    nl.
go.

countdown(L,Target, X,Ops) :-
    length(L,LLen),
    between(2,LLen,Len),
    length(X,Len),
    list_domain_disjunction(L,Domain),        
    X ins Domain,
    
    length(Y,Len),
    % Some arbitrary allowed range of the intermediate results
    Y ins -100000000..100000000,

    OpsLen is Len-1,    
    length(Ops,OpsLen),
    Ops ins 1..4, % 1:+,  2:-, 3:*, 4://

    % Note: This should really be global_cardinality/2
    %       but I'm lazy...
    all_different(X),

    % Ensure that the equation is correct
    check_c(X,Ops,[],Y),
    last(Y,Target),
        
    flatten([X,Y,Ops],Vars),
    
    labeling([ffc,enum,down],Vars).

% Create the equation
check_c([],_Op,Y,Y).
check_c([V],_Op,Y,[V|Y]).
check_c([V1,V2|Ls],[Op|Ops],Y0,[V|Y]) :-
    make_op(V1,V2,Op,V),
    check_c([V|Ls],Ops,Y0,Y).

% Convert the operatos number to constraints
make_op(A,B, Op,Res) :-
  (Op #= 1) #==> (Res #= A + B),
  (Op #= 2) #==> (Res #= A - B),
  (Op #= 3) #==> (Res #= A * B),
  (Op #= 4) #==> (A #= Res * B). % Res = A / B division 

% convert a list of integers to a proper clpfd domain
list_domain_disjunction([D|Ds],Disj) :-
        foldl(disj,Ds,D,Disj).
disj(A,B,C) :-C = \/(B,A).

% Print the solution
p(X,Ops) :-
    length(X,Len),
    forall(between(1,Len,_),write("(")),
    p_(X,Ops).
p_([],_).
p_([V],_) :- format("~d)",[V]).
p_([V1|Vs],[Op|Ops]) :-
    OpsV = [+,-,*,//],
    nth1(Op,OpsV,OpS),
    format("~d) ~w ",[V1,OpS]),
    p_(Vs,Ops).

The output is:

((((50) - 10) * 25) - 1)
((((25) - 5) * 50) - 1)
(((((25) * 10) - 50) * 5) - 1)
(((((10) * 25) - 50) * 5) - 1)
(((((5) - 3) * 50) * 10) - 1)
(((((5) - 3) * 10) * 50) - 1)
(((((25) - 10) + 5) * 50) - 1)
(((((5) - 10) + 25) * 50) - 1)
(((((25) + 5) - 10) * 50) - 1)
(((((5) + 25) - 10) * 50) - 1)
((((((50) // 5) * 3) + 10) * 25) - 1)
((((((25) // 5) - 3) * 50) * 10) - 1)
((((((25) // 5) - 3) * 10) * 50) - 1)
((((((50) // 10) + 3) * 5) * 25) - 1)
...

(The model is also available here: http://hakank.org/swi_prolog/countdown.pl )

1 Like

This is an interesting comparison. FYI: I made a performance comparison quite a few years back which also tested Haskel; the number would probably need updating and the example was not making use of CLPFD: Why Prolog? - ProB Documentation

I know this is the SWI Prolog discussion list, but you may also want to have a look at the commercial SICStus Prolog. It has a very efficient CLPFD implementation in C, but it has limited precision (i.e., overflows can occur as opposed to SWI). I have ported your example to make it work with both SWI and SICStus and there is about an order of magnitude improvement (0.1 sec vs 1.1 sec on my MacBook Air M2 to run the goal “go.”).

Here is my version running in both SWI/SICStus

/*

  Countdown in SWI Prolog

  See Peter Ludemann's SWI-Prolog Discord post 
     Prolog program is much slower than Haskell – why?
     https://swi-prolog.discourse.group/t/prolog-program-is-much-slower-than-haskell-why/6266/16

  as well as his GitHub repo: https://github.com/kamahen/nerdle

  Model created by Hakan Kjellerstrand, hakank@gmail.com
  See also my SWI Prolog page: http://www.hakank.org/swi_prolog/
  
  Ported to SICStus Prolog by Michael Leuschel

*/

:- use_module(library(clpfd)).
go :- statistics(walltime,[_,_]), go2,
      statistics(walltime,[_,T]), format('Walltime ~w ms~n',[T]).
go2 :-
    L = [1,3,5,10,25,50],
    Target = 999,
    countdown(L,Target,X,Ops),
    %print(setup_countdown(L,Target,X,Ops)),nl,
    p(X,Ops),
    nl,
    fail,
    nl.
go2.

:- if(current_prolog_flag(dialect, sicstus)).
:- write(sicstus),nl.
:- op(600,xfx,ins).
:- op(200,xfx,#==>).
:- use_module(library(between)).
:- use_module(library(lists)).
% SICStus does not have ins/2:
ins([],_).
ins([V|TV],Domain) :- in(V,Domain), ins(TV,Domain).

% SICStus does not have foldl/4:
foldl(MPred,List,Start,Result) :-
    foldl2(List,MPred,Start,Result).
foldl2([],_Pred,Value,Value).
foldl2([Elem|Rest],MPred,OldValue,NewValue) :-
    call(MPred,Elem,OldValue,Value),
    foldl2(Rest,MPred,Value,NewValue).

% SICStus does not have flatten/2:
flatten(List,FlatList) :- flatten1(List,[],FlatList).
flatten1([],L,L) :- !.
flatten1([H|T],Tail,List) :- !, (var(H) -> List=[H|FlatList] ; flatten1(H,FlatList,List)), flatten1(T,Tail,FlatList).
flatten1(NonList,Tail,[NonList|Tail]).

% Quick and dirty implementation of forall/2:
forall(Call1,Call2) :- call(Call1),(call(Call2) -> fail).
forall(_,_).

% we use element/2 from SICStus; SICStus does not like the \/ terms generated in disj
elements(L,X) :- maplist(element(L),X).
:- else.
elements(L,X) :- list_domain_disjunction(L,Domain),        
    X ins Domain.
% convert a list of integers to a proper clpfd domain
list_domain_disjunction([D|Ds],Disj) :-
        foldl(disj,Ds,D,Disj).
disj(A,B,C) :-C = \/(B,A).
:- endif.

countdown(L,Target, X,Ops) :-
    length(L,LLen),
    between(2,LLen,Len),
    length(X,Len),
    elements(L,X),
    
    length(Y,Len),
    ins(Y,-100000000..100000000),
    OpsLen is Len-1,
    
    length(Ops,OpsLen),
    ins(Ops,1..4),

    % Note: This should really be global_cardinality/2
    %       but I'm lazy...
    all_different(X),

    % Ensure that the equation is correct
    check_c(X,Ops,[],Y),
    last(Y,Target),

    flatten([X,Y,Ops],Vars),
    
    labeling([ffc,enum,down],Vars).

% Create the equation
check_c([],_Op,Y,Y).
check_c([V],_Op,Y,[V|Y]).
check_c([V1,V2|Ls],[Op|Ops],Y0,[V|Y]) :-
    make_op(V1,V2,Op,V),
    check_c([V|Ls],Ops,Y0,Y).

:- current_prolog_flag(dialect, X), write(dialect(X)),nl.

:- if(current_prolog_flag(dialect, swi)).
% Convert the operatos number to constraints
make_op(A,B, Op,Res) :-
  (Op #= 1) #==> (Res #= A + B),
  (Op #= 2) #==> (Res #= A - B),
  (Op #= 3) #==> (Res #= A * B),
  (Op #= 4) #==> (A #= Res * B). % Res = A / B division 
:- else.
% SICStus uses #=> and not #==>
make_op(A,B, Op,Res) :-
  (Op #= 1) #=> (Res #= A + B),
  (Op #= 2) #=> (Res #= A - B),
  (Op #= 3) #=> (Res #= A * B),
  (Op #= 4) #=> (A #= Res * B). % Res = A / B division 
:- endif.


% Print the solution
p(X,Ops) :-
    length(X,Len),
    forall(between(1,Len,_),format("(",[])),
    p_(X,Ops).
p_([],_).
p_([V],_) :- format("~d)",[V]).
p_([V1|Vs],[Op|Ops]) :-
    OpsV = [+,-,*,//],
    nth1(Op,OpsV,OpS),
    format("~d) ~w ",[V1,OpS]),
    p_(Vs,Ops).

forall/2 is defined as

 forall(Gen,Test) :- \+ ( Gen, \+ Test ).

And yes, SWI-Prolog’s clp(fd) has proven to be very solid, but also slow. I have no clue how much room there would be to improve on that. For example, how much time is involved in low-level comparison and intersection of domains? Moving that to C is probably easy.

Looking at the constraint solutions, it seems to me:

  • they’re more complicated than my (and @jan’s) generate-and-test solution (but my original solution was 50% more code, so maybe the constraint solutions can be refined somewhat).
  • they aren’t faster

(Also, I’m not sure that they’re correct; for example, the problem statement would allow a number to be repeated in the list of allowed numbers, but I would @hakank’s use of all_different/1 find all the solutions in that situation?)

Some of the complication in the constraint solutions is because clp(fd) uses integers, so a bit of extra work is needed to map between non-integer items and integers (I’ve run into this in the past when playing with clp; is there a variant of clp(fd) that works with atoms rather than integers?).

But it also seems that the “countdown” problem is ill-suited to constraint solving. Typically, pruning depends on some kind of monotonic feature: if a sub-solution isn’t satisfied, then no solution that uses it will be satisfied either. But for “countdown”, this doesn’t work because “+” and “*” in subexpressions increase the expression’s value but “-” or “/” decrease it. The only pruning in the Haskell solution (and in mine) is in eliminating some solutions that are mathematically equivalent (e.g., add/subtract 0, multiply/divide by 1, or commute the operands). [There’s also a performance boost in @jan’s version, by moving the is/2 computation to a place where it can be optimized.]
In the end, for this problem, all the permutations need to be tried (but some don’t need to be fully computed because there is a mathematically equivalent permutation), so a constraint solver can’t gain much performance; and the constraint formulation seems to be more complicated than the naïve generate-and-test solution.

On the other hand, it’s pretty clear that backtracking (and/or constraint propagation) is more efficient than Haskellś lazy evaluation of list comprehensions; and I think the backtracking solution is easier to understand than the lazy lists (to me, anyway – in general I find “back chaining” easier to understand than “forward chaining”).

2 Likes

Yes, this simple model only shows some of the solutions (as mentioned in my post) since it does not handle different grouping of the expressions. Such a grouping is much more messier to do in “plain” CP. than in a non-CP program (as you and others have shown). That’s surely a shortcoming of the model, as well as it require that the listed numbers are distinct (the model uses all_different/1 instead of global_cardinality/2).

At the risk of rat-holing into Haskell optimization – logict: A backtracking logic-programming monad. offers some functionality on this front, which might be worth looking into, or at least bookmarking for future investigation.

cheers
meng

In SWI-Prolog it is linear in the size of the term, where shared subterms count as one (and cyclic terms are handled correctly). But yes, you can imagine quadratic implementations.