How would you parallelize between(1,20,N), prime(N)?

The problem is that some programmers might nevertheless prefer Python, Haskell, Elixir, etc., since these languages might provide parallel streaming (a form of OR-parallelism realizing parallel backtracking).

Even recent Java introduces Spliterator etc. and can do some generate-test in parallel. Maybe Prolog is a little lacking here, although different Prolog systems offer some solutions. For example:

High level thread primitives
https://www.swi-prolog.org/pldoc/doc/SWI/library/thread.pl

Homework: How would you parallelize between(1,20,N), prime(N) ?

--------- part that was missing from the thread migration ------------

One important programming pattern in Prolog is generate and test. But you don’t use findall/3 for it. Here is an example, testing whether a number is a prime number, spot the generate and the test:

prime(N) :-
    M is floor(sqrt(N)),
    between(2,M,K),
    N mod K =:= 0, !, fail.
prime(_).

Here is an example run:

?- between(1,20,N), prime(N), write(N), nl, fail; true.
1
2
3
5
7
11
13
17
19
true.

findall/3, on the other hand, is a primitive aggregation predicate. So we might have a new pattern, generate-test-aggregate, which data science might like. Time to write a book about it.
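As a minimal sketch of that generate-test-aggregate pattern (primes_upto/2 is a made-up name for this example; prime/1 is the definition from above, repeated so the snippet stands alone):

```prolog
% Test: N has no divisor in 2..floor(sqrt(N)) (as defined above).
prime(N) :-
    M is floor(sqrt(N)),
    between(2, M, K),
    N mod K =:= 0, !, fail.
prime(_).

% Generate with between/3, test with prime/1, aggregate with findall/3.
primes_upto(Max, Ps) :-
    findall(N, ( between(1, Max, N), prime(N) ), Ps).
```

?- primes_upto(20, Ps). gives Ps = [1, 2, 3, 5, 7, 11, 13, 17, 19], the same answers as the backtracking loop above, collected into a list.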

BTW: There is no difference between a test and a filter; they are just Prolog goals. Don’t tell any Python, Haskell, Elixir, etc. programmer that Prolog is that easy.


It is not completely elegant, but IMO not too bad either :slight_smile: A concurrent_include/3 wouldn’t be that hard to write.

primes(N, L) :-
    numlist(1, N, Candidates),
    concurrent_maplist(is_prime, Candidates, Result),
    include(integer, Result, L).

is_prime(N, false) :-
    M is floor(sqrt(N)),
    between(2,M,K), N mod K =:= 0,
    !.
is_prime(N, N).

?- time((primes(1 000 000, L), length(L, N))).
% 21,921,613 inferences, 5.542 CPU in 9.603 seconds (58% CPU, 3955186 Lips)
L = [1, 2, 3, 5, 7, 11, 13, 17, 19|...],
N = 78499.
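For what it is worth, the concurrent_include/3 hinted at above might be sketched like this (an untested sketch built on concurrent_maplist/3; the helper include_one/3 and the dropped sentinel are invented for the example, so dropped must not occur in the input list):

```prolog
:- use_module(library(thread)).   % concurrent_maplist/3
:- use_module(library(apply)).    % exclude/3

:- meta_predicate concurrent_include(1, +, -).

% Run the filter goal on all elements in parallel, mapping each
% element either to itself or to a sentinel, then drop the sentinels.
concurrent_include(Goal, List, Included) :-
    concurrent_maplist(include_one(Goal), List, Marked),
    exclude(==(dropped), Marked, Included).

include_one(Goal, X, X) :- call(Goal, X), !.
include_one(_, _, dropped).
```

Since concurrent_maplist/3 keeps the result list positional, the filtered list comes out in input order, e.g. ?- concurrent_include([X]>>(0 =:= X mod 2), [1,2,3,4,5,6], L). gives L = [2,4,6].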

While I like the answer by Jan, this might help to improve upon it.

Parallel Prime Sieve: Finding Prime Numbers

Can you give a reference that gives more details on this?

I get the part about doing parts of the problem in parallel. I am also thinking that because the answer by Jan pre-generated the list with numlist(1, N, Candidates), that is not allowed/preferred. What I am not getting is how the other languages would use parallel streaming to solve this problem in a way that is different from the answer by Jan. Mostly I don’t understand how the backtracking in parallel backtracking works in those languages. :slightly_smiling_face:

The proper way to implement parallel streaming (as you call it) is to create a message queue, have one (or more) threads feeding in terms to process and N threads reading. A simple prototype is this:

:- meta_predicate
    pstream(?, 0, ?, 0, +).

pstream(Var, Generator, WVar, WGoal, NWorkers) :-
    message_queue_create(Q, [max_size(1000)]),
    length(Workers, NWorkers),
    maplist(thread_create(work(Q, WVar, WGoal)), Workers),
    forall(Generator, thread_send_message(Q, work(Var))),
    forall(between(1, NWorkers, _), thread_send_message(Q, done)),
    maplist(thread_join, Workers).

work(Q, WVar, Goal) :-
    repeat,
    thread_get_message(Q, Msg),
    (   Msg = work(WVar)
    ->  call(Goal),
        fail
    ;   !
    ).

primes(Updo, Primes, Threads) :-
    message_queue_create(Results),
    thread_create(( pstream(C, between(1, Updo, C),
                            WC, test_prime(WC, Results),
                            Threads),
                    thread_send_message(Results, end_of_file)
                  ),
                  _, [detached(true)]),
    message_queue_members(Results, Primes).

message_queue_members(Queue, List) :-
    thread_get_message(Queue, H),
    message_queue_members(H, Queue, List).

message_queue_members(end_of_file, _, []) :-
    !.
message_queue_members(H, Q, [H|T]) :-
    thread_get_message(Q, H2),
    message_queue_members(H2, Q, T).

test_prime(N, Results) :-
    \+ ( M is floor(sqrt(N)),
         between(2,M,K),
         N mod K =:= 0
       ),
    thread_send_message(Results, N).

This doesn’t scale very well either:

?- time(primes(1 000 000, L, 1)), length(L, N).
% 157,006 inferences, 0.552 CPU in 10.394 seconds (5% CPU, 284480 Lips)
L = [1, 2, 3, 5, 7, 11, 13, 17, 19|...],
N = 78499.

?- time(primes(1 000 000, L, 2)), length(L, N).
% 157,004 inferences, 0.489 CPU in 6.381 seconds (8% CPU, 321170 Lips)
L = [1, 2, 3, 5, 7, 11, 13, 17, 19|...],
N = 78499.

?- time(primes(1 000 000, L, 4)), length(L, N).
% 157,004 inferences, 0.361 CPU in 4.875 seconds (7% CPU, 434847 Lips)
L = [1, 2, 3, 7, 5, 11, 13, 19, 17|...],
N = 78499.

?- time(primes(1 000 000, L, 8)), length(L, N).
% 157,004 inferences, 0.505 CPU in 10.893 seconds (5% CPU, 310887 Lips)
L = [1, 2, 3, 5, 7, 11, 13, 19, 17|...],
N = 78499.

That is to be expected: for the lower numbers the exchange overhead probably outweighs the actual work. The somewhat surprising result is that if we change the code to find primes in a range, we get:

?- time(primes(900 000, 1 000 000, L, 1)), length(L, N).
% 14,456 inferences, 0.045 CPU in 1.211 seconds (4% CPU, 321732 Lips)
L = [900001, 900007, 900019, 900037, 900061, 900089, 900091, 900103, 900121|...],
N = 7224.

?- time(primes(900 000, 1 000 000, L, 2)), length(L, N).
% 14,454 inferences, 0.039 CPU in 0.671 seconds (6% CPU, 369823 Lips)
L = [900001, 900007, 900019, 900037, 900061, 900089, 900091, 900103, 900121|...],
N = 7224.

?- time(primes(900 000, 1 000 000, L, 4)), length(L, N).
% 14,454 inferences, 0.040 CPU in 0.379 seconds (10% CPU, 363849 Lips)
L = [900001, 900007, 900019, 900037, 900061, 900091, 900121, 900089, 900103|...],
N = 7224.

?- time(primes(900 000, 1 000 000, L, 8)), length(L, N).
% 14,454 inferences, 0.049 CPU in 1.121 seconds (4% CPU, 294992 Lips)
L = [900001, 900007, 900019, 900037, 900061, 900089, 900091, 900103, 900121|...],
N = 7224.

?- time(primes(900 000, 1 000 000, L, 16)), length(L, N).
% 14,454 inferences, 0.050 CPU in 1.516 seconds (3% CPU, 286556 Lips)
L = [900001, 900007, 900019, 900037, 900061, 900089, 900091, 900103, 900121|...],
N = 7224.
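The range variant primes/4 used in these queries is not shown in the post; presumably it is primes/3 with the generator changed, along these lines (a sketch reusing pstream/5, test_prime/2 and message_queue_members/2 from the code above):

```prolog
% Hypothetical reconstruction of the primes/4 used above:
% same pipeline as primes/3, but generating candidates in a range.
primes(Low, High, Primes, Threads) :-
    message_queue_create(Results),
    thread_create(( pstream(C, between(Low, High, C),
                            WC, test_prime(WC, Results),
                            Threads),
                    thread_send_message(Results, end_of_file)
                  ),
                  _, [detached(true)]),
    message_queue_members(Results, Primes).
```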

I didn’t expect that. I guess it is related to contention on one of the queues.

The test results are from a dual AMD EPYC 7401 24-Core Processor, which means 48 cores and 96 threads. Without queues involved this typically scales close to linearly up to 48 tasks, so this is most likely a contention issue on one of the queues. In the first attempt there is a report on every number, but in the second there is only a report on primes. So, it is most likely the first queue, the one that distributes the work, that is the problem.

Instead of a single queue we could use a queue per worker thread and distribute the candidates over these queues. The advantage of a single queue is that you do not have to worry about the distribution of work, which is particularly important if the amount of work spent on a single task is hard to predict (as is the case for the prime example).
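The queue-per-worker idea might be prototyped along these lines (an untested sketch: pstream_rr/5 and its helpers are invented names; work/3 is repeated from the earlier code so the snippet stands alone; plain round-robin, so it ignores the load-balancing concern just raised):

```prolog
:- meta_predicate pstream_rr(?, 0, ?, 0, +).

% One queue per worker; the generator deals candidates round-robin.
pstream_rr(Var, Generator, WVar, WGoal, NWorkers) :-
    length(Queues, NWorkers),
    maplist(make_queue, Queues),
    make_workers(Queues, WVar, WGoal, Workers),
    nb_setval(rr_next, 0),
    forall(Generator, send_round_robin(Queues, NWorkers, Var)),
    forall(member(Q, Queues), thread_send_message(Q, done)),
    maplist(thread_join, Workers).

make_queue(Q) :-
    message_queue_create(Q, [max_size(1000)]).

make_workers([], _, _, []).
make_workers([Q|Qs], WVar, WGoal, [W|Ws]) :-
    thread_create(work(Q, WVar, WGoal), W),
    make_workers(Qs, WVar, WGoal, Ws).

% Pick the next queue using a non-backtrackable counter.
send_round_robin(Queues, N, Term) :-
    nb_getval(rr_next, I),
    I1 is (I + 1) mod N,
    nb_setval(rr_next, I1),
    nth0(I, Queues, Q),
    thread_send_message(Q, work(Term)).

% work/3 as in the earlier post.
work(Q, WVar, Goal) :-
    repeat,
    thread_get_message(Q, Msg),
    (   Msg = work(WVar)
    ->  call(Goal),
        fail
    ;   !
    ).
```

Whether this actually reduces contention would need measuring; it mainly removes the single mutex all workers fight over.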

This probably requires a smarter queue implementation. Currently this is a linked list of terms protected by a mutex and condition variable (and an emulation thereof for Windows).

This is bare metal here at CWI, i.e., no VM, just plain Fedora Linux on the raw machine. Running normal Prolog predicates (even dynamic ones) is completely read-only and involves no locks. As said, this typically scales well. The suspect shared resource is the queue to which the candidates are sent and from which the workers read them. A worker will grab the queue mutex and check for available messages. If one is found (likely in this case, as the writer seems much faster), it removes the message and unlocks the mutex. Otherwise it waits on the condition variable, which unlocks the mutex and re-locks it when it is signalled (so we get two lock/unlock pairs). And no, it slows down far too quickly to suspect memory bandwidth.

I tried without the max_size limit on the queue, which simply means the writer never blocks. It didn’t make a noticeable difference. The problem is quite likely the number of workers grabbing work from the same queue. It seems worthwhile to look into. Anyone interested, please go ahead; I’m too deeply involved in tabling right now.

I did a bit of timing. It turns out best performance is at 5 threads. At this speed the queue writer and readers are about in sync. With more threads the readers block, going through the much more expensive route of waiting on the condition variable and thus slowing down. So, the ideal number of readers (workers) is the number that can just not keep up with the writer. Thinking about it, this should not be a real surprise.

Multi-threaded loading is normal in SWI-Prolog. Any thread can load a source file. This checks whether any thread may already be loading this file. If not, the file is loaded by the calling thread. Else the thread waits for the thread that is doing the load to complete. This behavior is necessary for demand loading to work reliably.

#Threads   Wall Time   Cores * Time
    1       15.747       15.747
    2        7.927       15.854
    4        3.957       15.828
    8        1.988       15.904
   16        1.102       17.632
   32        0.716       22.912
   64        0.467       29.888
   96        0.446       42.816
   48        0.603       28.944
   24        0.787       18.888

So, it seems pretty much ok up to 24, the number of cores per CPU. From there to 48 we start to see serious degradation. I don’t know how the OS schedules the threads: whether it uses the cores of both CPUs, hyperthreading on one CPU, or a mixture of these. Total time does keep going down though.

The 3rd column is simply the product of the 1st and 2nd; thus, it should remain constant if the speedup is perfect. The number of logical CPUs is 96 (2 chips, each with 24 cores, each core 2 threads). So we see almost perfect behavior up to 24 threads, some degradation up to 48, and then more degradation.

The regular tabling algorithm (SLG resolution) does not guarantee a linear (Prolog SLD-like) evaluation, so the results will be out of order compared to Prolog’s SLD order. There have been, however, other tabling algorithms proposed which allow linear results:

See Tabling: Linear Resolution (SLT)


SWI-Prolog currently uses local scheduling (which is by far the most popular). This implies that all answers are added to the table before they are returned. Answers are thus always returned in table order, which is a trie that uses hash tables for choice points. Thus, the order is basically undefined and not even consistent between runs.
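A tiny, hypothetical illustration of the consequence:

```prolog
:- table p/1.

% Clause order below does not dictate the answer order a caller sees:
% with local scheduling, ?- p(X). first completes the table and then
% enumerates the answers in table (trie) order, which is effectively
% undefined and need not be stable between runs.
p(3).
p(1).
p(2).
```

The set of answers is of course fixed; sorting them (e.g. with msort/2 on a findall/3 result) gives a deterministic [1, 2, 3] regardless of enumeration order.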