Semi-deterministic version of `nth0` as C primitive

I’ve spent some time looking at the performance characteristics of library(simplex) (linear optimization) and on one particularly extreme example it’s spending about a third of the time calling lists:nth0 semi-deterministically (looking up column coefficients in a matrix structure constructed from lists). So in the spirit of memberchk/3, I wonder if there’s a case to be made for implementing the equivalent of lists:nth0_det/3 as a C primitive. I would imagine the implementation is pretty similar to memberchk except using a counter value as a terminating condition rather than unification.

FYI: the current implementation of lists:nth0_det:

nth0_det(0, [Elem|_], Elem) :- !.
nth0_det(1, [_,Elem|_], Elem) :- !.
nth0_det(2, [_,_,Elem|_], Elem) :- !.
nth0_det(3, [_,_,_,Elem|_], Elem) :- !.
nth0_det(4, [_,_,_,_,Elem|_], Elem) :- !.
nth0_det(5, [_,_,_,_,_,Elem|_], Elem) :- !.
nth0_det(N, [_,_,_,_,_,_   |Tail], Elem) :-
    M is N - 6,
    M >= 0,
    nth0_det(M, Tail, Elem).
1 Like

What exactly do you mean by “semi-deterministically”? Do you mean that you are finding the index in the list based on a value?

?- nth0(I, [1,2,3], 2).

Or that both the index and the value are ground?

?- nth0(1, [1,2,3], 2).

Or do you just mean that the first argument (the index) must be a a non-negative integer?


Either way:

It might be better to implement the matrix as a flat term; so instead of [1,2,3,4] or [[1,2],[3,4]] you’d have v(1,2,3,4) or v(v(1,2),v(3,4)).

Or of course just have an array or matrix structure implemented in C, using an array v[]

Iterating the Prolog list in C might be faster but still will have the same issue with access time depending on the index. In contrast, this is constant time:

?- arg(2, v(1,2,3), X).

Sorry, I thought that was a commonly understood term. One reference: SWI-Prolog -- Manual

Semi-deterministic predicates are predicates that either fail or succeed exactly once and, for well behaved predicates, leave no choicepoints

In this case the first argument is always an integer.

This a pre-existing module (about 1500 lines of Prolog) so I’m not looking to re-implement it. I also think a C implementation has potentially a broader benefit, and although it may still be a linear (rather than constant) time performance, the linear constant will be much smaller for a C version compared to the existing Prolog.

But I will look at other options, including “do nothing”, if this fails to gain any traction.

Semi-deterministic external predicates are pretty easy to write for SWI-Prolog, once you’ve had a bit of practice. (Backtracking predicates are more work)

If you’re not in a rush, I can try writing nth0 as a test case for the changes I’m currently making to SWI-cpp.h. It shouldn’t be very difficult. (I assume that the first argument is always instantiated and that the predicate should throw an exception if it’s either uninstantiated or not a non-negative integer – it’s easier to write the external predicate if that’s the case)

So I’m definitely not in a rush. Regarding errors (as you may recall) , I would happier if if any new primitive mirrored the existing lists:nth_det so it could easily be dropped in there as well, but not a huge deal (just add more pre-call tests). I would be happy to provide an update to library(simplex) to take advantage of it.

Don’t mind O(n), just don’t want the overhead of a Prolog recursion vs. a tight C loop. I also think there are valid reasons why library(simplex) uses lists, namely the incremental addition of constraints, so using compounds would require copying arguments out to a list, adding a new item, then reconstructing the compound (with “univ”). Not to mention that I really don’t want to rewrite simplex.

My main point is that if the C primitive memberchk/2 is available as a semi-deterministic variant of member/2, for valid reasons I assume, why not provide the same for nth0/3 (primitive name T.B.D.).

If you want a significant speedup you probably need to dive below the level of the C(++) public interface. You also need to consider constraints in most cases. All in all I doubt we should strive for moving library(lists) to C as much as possible and we’d better spent time speeding up nth0/3 using better compilation. Just a data point:

80 ?- numlist(1, 1 000 000, L), time(nth0(999999, L, V)).
% 333,335 inferences, 0.015 CPU in 0.015 seconds (98% CPU, 22571438 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
V = 1000000.

81 ?- numlist(1, 1 000 000, L), time(memberchk(999999, L)).
% 2 inferences, 0.013 CPU in 0.014 seconds (97% CPU, 150 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...].

Which hints that the unwinding in Prolog is about as good as the C version. Long ago the difference was much bigger, but the Prolog compiler and runtime system improved and the need to handle constraints required more high level functions for the C version, slowing it down.

1 Like

Not sure why this is the case for nth0. Unlike memberchk, no (head) unifications are required until the nth element is reached.

This result is a bit surprising to me so I’ll need to rethink my assumptions.

Followup: I would expect nth0 to “perform” more like length; on my laptop:

?- numlist(1, 1 000 000, L), time(nth0(999999, L,X)).
% 333,334 inferences, 0.037 CPU in 0.039 seconds (93% CPU, 9127937 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
X = 1000000.

?- numlist(1, 1 000 000, L), time(length(L,X)).
% 2 inferences, 0.005 CPU in 0.005 seconds (99% CPU, 426 Lips)
L = [1, 2, 3, 4, 5, 6, 7, 8, 9|...],
X = 1000000.

2 Likes

Not much followup discussion (other than Peter’s offer) so thought I’d share my motivating example: The Stigler Diet Problem  |  OR-Tools  |  Google Developers

This is a minimization problem with 77 variables and 9 constraints. According to the web page, available tools in C, Java, and Python produce a solution in about 1 ms. (Apparently, when originally solved in 1949, this problem took 120 man-days using desk calculators.) Out of the box, library(simplex) takes about 1.5 sec on 8.5.14 (MacOS fat binary), so about 1500 times slower than the documented OR tools. Simple tweaks to remove a couple of unnecessary rational arithmetic operations reduces this to about 650 ms. - a nice improvement.

As explained earlier, profiling data indicates nth0/3 is now a significant bottleneck. As suggested arg/3 is an O(1) alternative to nth0 when the data is in a compound term rather than a list so I wrote a simple version of nth0_chk which copies the list into a term and then uses arg:

nth0_chk(N,L,X) :-  % integer(N), N >=0, 
	C=..[c|L],
	N1 is N+1,
	arg(N1,C,X).

Strategically replacing nth0 by nth0_chk in two locations reduced the execution time from 650 ms. to 440 ms. (now less than a factor of 500). Note that this isn’t near optimal since a full copy of the whole list is produced on each call and, on average, only half the the list is required anyway. A more focused benchmark accessing the 38th element (the average) of a 77 element list shows an almost 2x improvement for this specific case and one would expect a properly implemented primitive to do somewhat better (the last query copies a half list for comparison):

?- numlist(1,77,L), time((between(1,1000000,_), nth0(38,L,X), fail)).
% 15,000,003 inferences, 1.437 CPU in 1.437 seconds (100% CPU, 10441438 Lips)
false.

?- numlist(1,77,L), time((between(1,1000000,_), nth0_chk(38,L,X), fail)).
Correct to: "simplex:nth0_chk(38,L,X)"? yes
% 4,000,001 inferences, 0.758 CPU in 0.759 seconds (100% CPU, 5275028 Lips)
false.

?- numlist(1,38,L),time((between(1,1000000,_),nth0_chk(38,L,X),fail)).
Correct to: "simplex:nth0_chk(38,L,X)"? yes
% 4,000,001 inferences, 0.464 CPU in 0.464 seconds (100% CPU, 8620172 Lips)
false.

As an aside, for me performance analysis is complicated due to the fact that AFAICT MacOS fat binaries appear to run 20-30% slower on Intel than the the old Intel-only releases, presumably due to some C compiler option restriction. Sure would be nice if this wasn’t the case even if it entails rebuilding a custom version for Intel from source.

1 Like

You cannot make an omelette without breaking some eggs.

The above doesn’t make any sense, if you don’t move (=..)/2 out of the nth0_chk/3.
Do it only once, before nth0_chk/3 is called multiple times. That was the suggestion.

It has only near traditional nth0/3 performance, since (=..)/2 is also O(n). So you
have first O(n) in (=..)/2 and then O(1) in arg/3, which is the same as O(n).

There are a few Prolog systems where (=..)/2 isn’t O(n), like LPA Prolog. On the
other hand in LPA Prolog arg/3 is O(n). But usually (=..)/2 takes as

much time as the list is long. Since it needs to copy each argument reference.

Its easy to verify that the speedup is a tremendous factor 10x, 100x, etc…
depending on the size. You can run these benchmarks by yourself:

nth0_chk_out(N,C,X) :-  % integer(N), N >=0, 
	N1 is N+1,
	arg(N1,C,X).

Now I get:

?- numlist(1,77,L),
    time((between(1,1000000,_), nth0(76,L,X), fail)).
% 27,000,001 inferences, 1.641 CPU in 1.632 seconds
(101% CPU, 16457143 Lips)
false.
?- numlist(1,777,L),
   time((between(1,100000,_), nth0(776,L,X), fail)).
% 26,100,001 inferences, 1.609 CPU in 1.603 seconds (100% CPU, 16217476 Lips)
false.

?- numlist(1,77,L), C=..[c|L],
    time((between(1,1000000,_), nth0_chk_out(76,C,X), fail)).
% 3,000,001 inferences, 0.141 CPU in 0.126 seconds
(112% CPU, 21333340 Lips)
false.
?- numlist(1,777,L), C=..[c|L],
    time((between(1,100000,_), nth0_chk_out(776,C,X), fail)).
% 300,001 inferences, 0.016 CPU in 0.012 seconds (129% CPU, 19200064 Lips)
false.

You can lower the polynomial degree of the complexity of
the algorithm. Which is most visible when you have large arrays.

I have pushed a change to nth0/3 that tries to skip the first Index elements of the list in C. That is simple and preserves all nasty cases such as the list being partial, optionally ending in a constrained variable. The effect is not impressive though. Measured on a MacOS M1 system the new version starts to win for Index > 6, with only a very modest loss before. From about Index > 100 the C based version wins by about 4 times.

The real gain is in using terms for arrays …

Thanks for doing this; I would call the result impressive so it’s all relative. For small index values, the implementation hardly matters, but for values greater than 20, an improvement of 2-4 times is significant (to me). Also note that the arrays of interest here are not overly large (77 columns); just not “toy size”.

I understand that - when I implemented immutable arrays as an arithmetic type, that’s exactly what I did. But that’s not what Markus Triska did when he wrote simplex which is about 1500 lines of Prolog. And as I tried to explain, that’s not an “egg” I’m prepared to break. If nth0 took zero time, it would only improve overall performance of this particular problem by less than a factor of 2.

As you may have guessed, I’m looking for the “low hanging fruit”, not the ultimate solution. I’m not that unhappy with the status quo but it can be better and Jan’s enhancement to nth0 will benefit any users of that predicate with medium to long lists, not just library(simplex).

Those with other requirements will need to look for other solutions, of which you have listed many.

very interesting subject, i make nutritional software for gyms and i have implemented this generating function several times, but i have never had the time yet to investigate a really good algorithm.

in general people dont want to eat liver in the same meal as where they eat cereals, and this is difficult to implement.