Here is a first stab at writing a tutorial on parallel, or concurrent, programming with SWI Prolog. I’ve posted it here for some “peer reviewing” before submitting a version to https://prologhub.pl/ for posterity.
One reason multithreading is dangerous is that it only takes a small bug to freeze a computer.
In the example below, however many cores a given computer has are all told to keep listening for messages on a queue, so that whichever one is free or quickest can grab and execute them as they arrive. Without careful coding, all cores may end up waiting in perpetuity (or until a reboot) if messages never come.
There are various types of problems which can be sped up by dividing them between the growing number of cores available in modern processors. The most common is probably the manager-workers pattern, illustrated by Jan’s concurrent_maplist code.
The problem I’m working on, speeding up the decision of a move in games like chess or checkers, falls under “consensus problems” (as far as I understand it), which have fortunately been very well explained by Turing Award winner Leslie Lamport, and I hope to add a Prolog version of his algorithm here in due course. This involves stable shared storage between the parallel processes, a complexity which doesn’t arise in manager-workers problems like maplist. (SWI Prolog has wrappers for the pthread.h mutex functions, but since I’m not familiar with them, I’m planning to go the more Prologish and elegant way of using dynamic, assertz and retract.)
A trap most novices will fall into when rewriting the traditional sequential maplist/3 into a parallel version is producing an output list shuffled into random order (assuming they are not cheating by looking at Jan’s concurrent_maplist code, which shows how to avoid that).
One of the things I discovered while working through Jan’s code was the debug family of predicates. So to start playing with concurrent_maplist/3, I suggest typing this at the swipl command line:
?- debug(concurrent).
true.

?- concurrent_maplist(succ, [1,2,3,4], L).
% Concurrent: waiting for workers ...
% [Thread 4] Worker: received goal(1,thread:call(user:succ,1,_30),[_30])
% [Thread 4] Worker: received goal(2,thread:call(user:succ,2,_116),[_116])
% [Thread 4] Worker: received goal(3,thread:call(user:succ,3,_202),[_202])
% [Thread 4] Worker: received goal(4,thread:call(user:succ,4,_288),[_288])
% [Thread 4] Worker: received done
% Waiting: received done(1,)
% Concurrent: Job 1 completed with
% Concurrent: waiting for workers ...
% Waiting: received done(2,)
% Concurrent: Job 2 completed with
% Concurrent: waiting for workers ...
% Waiting: received done(3,)
% Concurrent: Job 3 completed with
% Concurrent: waiting for workers ...
% Waiting: received done(4,)
% Concurrent: Job 4 completed with
% [Thread 3] Worker: received done
L = [2, 3, 4, 5].
I’m working on an old and cheap PC which only has two cores, and the above simple code is effectively run sequentially, with the worker labelled [Thread 4] doing everything while lazy worker [Thread 3] only acknowledges “done” (i.e. knocking-off time).
The key thing to note is that instead of just call(succ, InVar, OutVar), the term sent via a message queue to the workers to execute is indexed, as in goal(+Idx, :Goal, +Var), and the return value sent back via the message queue the manager listens to is done(+Idx, +Var), so the manager knows where each result belongs in the output list irrespective of the order in which the parallel processes return their results.
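To see why the index matters, here is a small thread-free sketch (the predicate name reassemble/2 is my own invention, not from the library) showing how keysort/2 lets a manager rebuild the output list from done(Idx, Val) terms that arrived in any order:

```prolog
:- use_module(library(pairs)).

% Rebuild an ordered output list from done(Idx, Val) terms
% received in arbitrary order.
reassemble(DoneTerms, Outputs) :-
    findall(Idx-Val, member(done(Idx, Val), DoneTerms), Pairs),
    keysort(Pairs, Sorted),          % restore the original job order
    pairs_values(Sorted, Outputs).
```

For example, `reassemble([done(3,4), done(1,2), done(2,3)], Out)` binds `Out = [2,3,4]` even though job 3 finished first.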
The simple examples used in multithreading tutorials tend to be slower than just doing things the old-fashioned sequential way. The basic steps are:
- Create message queues along with the right number of threads (how many is a controversial topic I’ll touch on below)
- Rewrite the goals with input variables as call terms and place them in the worker queue
- Receive the outputs from the workers in the manager queue and place them in their correct place in the output list
- Finally clean up the threads and message queues to make their hardware resources available for new tasks
The above takes longer for simple predicates like succ/2 than just running them sequentially, so concurrency isn’t the right tool for all jobs.
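Putting the four steps together, here is a minimal end-to-end sketch for succ/2. All the predicate names here (simple_concurrent_maplist/2, worker/2, collect/3) are my own, not library ones, and error handling is omitted for brevity:

```prolog
:- use_module(library(pairs)).

simple_concurrent_maplist(Inputs, Outputs) :-
    % Step 1: create the queues and the worker threads
    current_prolog_flag(cpu_count, Cores),
    message_queue_create(JobQ),
    message_queue_create(DoneQ),
    length(WorkerIds, Cores),
    maplist([Id]>>thread_create(worker(JobQ, DoneQ), Id, []), WorkerIds),
    % Step 2: post one indexed job per input, then one sentinel per worker
    forall(nth1(Idx, Inputs, In),
           thread_send_message(JobQ, goal(Idx, In))),
    forall(member(_, WorkerIds),
           thread_send_message(JobQ, done)),
    % Step 3: collect the indexed results and restore their order
    length(Inputs, N),
    collect(N, DoneQ, Pairs),
    keysort(Pairs, Sorted),
    pairs_values(Sorted, Outputs),
    % Step 4: clean up the threads and queues
    maplist(thread_join, WorkerIds),
    message_queue_destroy(JobQ),
    message_queue_destroy(DoneQ).

worker(JobQ, DoneQ) :-
    thread_get_message(JobQ, Msg),
    (   Msg = goal(Idx, In)
    ->  succ(In, Out),
        thread_send_message(DoneQ, done(Idx, Out)),
        worker(JobQ, DoneQ)
    ;   true                        % received the done sentinel; exit
    ).

collect(0, _, []) :- !.
collect(N, Q, [Idx-Val|Rest]) :-
    thread_get_message(Q, done(Idx, Val)),
    N1 is N - 1,
    collect(N1, Q, Rest).
```

Calling `simple_concurrent_maplist([1,2,3,4], L)` should bind `L = [2,3,4,5]` however the workers interleave, because the keysort in step 3 undoes any shuffling.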
I found going through Jan’s code in the thread library very educational. Much of it is devoted to keeping concurrent_maplist bidirectional, and he has also created versions to handle different numbers of arguments for the goal. I’m just briefly going to go through the key predicates specific to multithreading which were new to me.
Along with understanding the basics of thread_create(:Goal, -Id, +Options) comes the question of how many “workers” are optimal.
One helpful predicate which I only discovered through this exercise is current_prolog_flag(cpu_count, Cores), which tells you how many cores SWI Prolog can see.
If I run that on my webhosting provider’s server, the result is 40, whereas my cheap home PC only has two cores. Sadly, my multithreading experiments don’t run forty times faster on my Webfaction account, which I suspect shows the operative word in Intel’s proprietary Hyper-Threading technology is hype.
The thread.pl library has a predicate workers(List, Count), which returns Count as the number of cores, unless the length of the list is smaller. (A joke I thought of while writing this: if you create too many threads, you paralyse rather than parallelise your code. It probably needs more work.)
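A hypothetical re-creation of the idea behind workers/2 (my predicate name worker_count/2 is made up) is simply to take the smaller of the core count and the job count, so you never start more threads than you have jobs:

```prolog
% Never start more worker threads than there are jobs or cores.
worker_count(Jobs, Count) :-
    current_prolog_flag(cpu_count, Cores),
    length(Jobs, Len),
    Count is min(Cores, Len).
```

On my two-core PC, `worker_count([a,b,c,d], N)` gives `N = 2`, while `worker_count([a], N)` gives `N = 1`.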
Join or detach?
One of the many things I find confusing about concurrent programming is which options to pick when. The SWI Prolog implementation, documented at Multithreaded applications, seems to offer the entire POSIX menu, either as options listed in the third argument of thread_create(:Goal, -Id, +Options) or set afterwards by dedicated predicates.
Though you can micromanage which core does what with the affinity(+CpuSet) option, the consensus appears to be that these decisions are best left to the operating system.
My bias when in doubt is just to use the default options (there is a built-in thread_create(:Goal, -Id) tailored for cowards like me). concurrent_maplist uses the default detached(false) option and then iteratively calls thread_join(+Id, -Status) for each worker in the cleanup phase.
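That cleanup pattern can be sketched in a few lines. This is my own simplified version (run_and_join/1 is a made-up name, not a library predicate), assuming each goal is deterministic and well behaved:

```prolog
% Create one joined thread per goal (default detached(false)),
% then join each one to reclaim its resources.
run_and_join(Goals) :-
    maplist([G, Id]>>thread_create(G, Id, []), Goals, Ids),
    maplist(thread_join, Ids).
```

For instance, `run_and_join([succ(1,_), succ(2,_)])` spins up two threads and returns only once both have finished and been reaped.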
Jan has provided an example using thread_create with the detached(true) option in this discussion forum, where he parallelises an include filter.
An advantage of “joined” threads (ones created with the default detached(false) option) is that they can be cleaned up neatly at the end by calling thread_join(+Id, -Status), or simpler yet thread_join(+Id) if you are sure it’s a simple example, such as succ(?Int1, ?Int2), which is never going to be fed any types that would make it barf.
Something that tripped me up about thread_join is that I thought it killed the thread. Instead, it waits for the thread to finish (which with buggy code may mean eternity) before freeing its resources.
In some cases I may want the thread to continue running as a background daemon (i.e. a detached process) while the rest of my program continues, and I think the problem I’m working on of searching a game tree may fall within this ambit, though I’m not sure how to go about it yet.
Creating message queues
In your setup phase you call message_queue_create(?Queue), which returns a unique ID looking something like <message_queue>(0x560dd4a8c140); it can subsequently be gotten rid of in the cleanup phase with message_queue_destroy(+Queue).
If you don’t want to reference the message queue by Id or whatever variable name, there is an option to set your own alias, which becomes globally available. (Threads work the same way: the initial thread has the alias main, which you can check by entering thread_self(Id). at the swipl command line.)
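For message queues, the alias is set via the options list of message_queue_create/2; the alias name jobs below is just an illustration:

```prolog
?- message_queue_create(_, [alias(jobs)]),
   thread_send_message(jobs, hello),
   thread_get_message(jobs, Msg).
Msg = hello.
```

Any thread can then refer to the queue as jobs without being handed the queue ID, which is exactly the kind of global visibility you may or may not want.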
Though you don’t have to call message_queue_destroy/1 if you’re not using an alias (which I don’t, since I’m ideologically opposed to globals), I feel it is good style to use this type of template:
setup_call_cleanup(
    message_queue_create(Id),
    business_section(Id, ...),
    message_queue_destroy(Id)
).
The manager-workers pattern is a bit more complex in that two message queues need to be created and subsequently destroyed: one the manager uses to send jobs to the workers, and another the workers use to send their finished products back to the manager.
When threads are created, each instance has as its goal a listening loop, which comes in two styles of template. I don’t think there is much practical difference between them, but I find the more common style, which goes as follows, confusing:
listening_loop(QueueIn, QueueOut) :-
    thread_get_message(QueueIn, JobIn),
    (   JobIn \== end_of_queue
    ->  do_job(JobIn, JobOut),
        thread_send_message(QueueOut, JobOut),
        listening_loop(QueueIn, QueueOut)   % Confusing recursion statement
    ;   thread_send_message(QueueOut, hometime)
    ).
What confuses me about the above style is that I tend to think of recursive predicates as having a base, terminating case, plus a Current input argument and a Next output argument that progress toward the base case. Here the arguments never change.
Streams and message queues don’t behave like normal Prolog variables in that their contents change even though their name stays the same, so as a matter of taste I find writing these as failure-driven loops easier to understand:
listening_loop(QueueIn, QueueOut) :-
    repeat,
    thread_get_message(QueueIn, JobIn),
    (   JobIn \== end_of_queue
    ->  do_job(JobIn, JobOut),
        thread_send_message(QueueOut, JobOut),
        fail                                % Repeats the loop
    ;   thread_send_message(QueueOut, hometime),
        !                                   % Terminates the loop
    ).
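To make either listening loop runnable, here is one hedged way to wire it up, assuming the listening_loop/2 above together with a trivial do_job/2; the names demo/1 and collect_until_hometime/2 are my own inventions:

```prolog
do_job(In, Out) :- succ(In, Out).

demo(Results) :-
    message_queue_create(QIn),
    message_queue_create(QOut),
    thread_create(listening_loop(QIn, QOut), Worker, []),
    % Send the jobs, then the end_of_queue sentinel
    forall(member(J, [1, 2, 3, end_of_queue]),
           thread_send_message(QIn, J)),
    collect_until_hometime(QOut, Results),
    thread_join(Worker),
    message_queue_destroy(QIn),
    message_queue_destroy(QOut).

collect_until_hometime(Q, Results) :-
    thread_get_message(Q, Msg),
    (   Msg == hometime
    ->  Results = []
    ;   Results = [Msg|Rest],
        collect_until_hometime(Q, Rest)
    ).
```

With a single worker the jobs are handled in order, so `demo(R)` should bind `R = [2,3,4]`; with several workers sharing QIn you would need the goal(Idx, ...) indexing from earlier to restore order.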
That’s it so far. I hope to add code for my concurrently programmed “consensus” best game move player soon.