Some notes on writing a concurrent programming HOWTO. Comments and corrections welcome.

Here is a first stab at a tutorial on parallel, or concurrent, programming with SWI Prolog. I’ve posted it here for some “peer reviewing” before submitting a version for posterity.

I’ve used @jan’s code for concurrent_maplist/3 in the thread library as learning material. It’s not quite what I need for my problem, which is one of the reasons this is still a work in progress.

The ability to do parallel programming appears to be an advantage SWI Prolog has over Python, Ruby, JavaScript and possibly other scripting languages whose standard implementations have a Global Interpreter Lock (GIL), which lets only one thread run interpreter code at a time. I won’t pretend to know the details, just that the tutorials on multithreading I’ve found on the web tend to be written in Python or Ruby, even though the small print says those languages can’t actually run their threads in parallel.

Multithreading needs to be handled with caution: sandboxed environments like SWISH forbid it, and I’m guessing the reason JavaScript is “single threaded” is that it’s primarily designed to run in the sandboxed environment of a browser.

One reason multithreading is dangerous is that it only takes a small bug to freeze a computer.

In the example below, however many cores a given computer has are told to keep listening for messages on a queue, so that whichever one is free or quickest can grab and execute each job as it arrives. Without careful coding, all cores may end up waiting in perpetuity (or until a reboot) if the messages never come.

There are various types of problems which can be sped up by dividing them between the growing number of cores available in modern processors. The most common is probably the manager-workers pattern, illustrated by Jan’s concurrent_maplist code.

The problem I’m working on, speeding up the choice of a move in games like chess or checkers, falls (as far as I understand it) under “consensus problems”, which have fortunately been very well explained by Turing Award winner Leslie Lamport, and I hope to add a Prolog version of his algorithm here in due course. Consensus involves stable shared storage between the parallel processes, a complexity which doesn’t arise in manager-workers problems like maplist. (SWI Prolog has wrappers for the pthread.h mutex functions, but since I’m not familiar with them, I’m planning to go the more Prologish and elegant way of using dynamic, assertz and retract.)
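To give a flavour of that dynamic-database approach, here is a minimal sketch under my own assumptions: candidate_move/2, score_move/1, score_in_thread/2 and best_move/2 are all hypothetical names, and evaluate/2 stands in for whatever game-specific scoring code you have. SWI Prolog’s dynamic predicates can safely be asserted to from several threads at once:

```prolog
:- dynamic candidate_move/2.   % candidate_move(Move, Score) - hypothetical

% Each worker evaluates one move and records its score in the shared
% dynamic database (evaluate/2 stands in for game-specific code).
score_move(Move) :-
    evaluate(Move, Score),
    assertz(candidate_move(Move, Score)).

score_in_thread(Move, Id) :-
    thread_create(score_move(Move), Id, []).

% The manager spawns one thread per candidate move, waits for them all,
% then reads the highest-scoring move out of the shared database.
best_move(Moves, Best) :-
    retractall(candidate_move(_, _)),
    maplist(score_in_thread, Moves, Ids),
    maplist(thread_join, Ids),
    findall(Score-Move, candidate_move(Move, Score), Pairs),
    sort(0, @>=, Pairs, [_TopScore-Best|_]).
```

The point of the sketch is that the workers communicate through shared storage rather than by sending messages, which is what distinguishes this from the manager-workers pattern.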

A trap most novices will fall into when rewriting the traditional sequential maplist/3 as a parallel version is an output list shuffled into a random order (assuming they are not cheating by looking at Jan’s concurrent_maplist code, which shows how to avoid that).

One of the things I discovered while working through Jan’s code was the debug family of predicates. So to start playing with concurrent_maplist/3, I suggest typing this at the swipl command line:

?- debug(concurrent).

?- concurrent_maplist(succ, [1,2,3,4], L).
% Concurrent: waiting for workers ...
% [Thread 4] Worker: received goal(1,thread:call(user:succ,1,_30),[_30])
% [Thread 4] Worker: received goal(2,thread:call(user:succ,2,_116),[_116])
% [Thread 4] Worker: received goal(3,thread:call(user:succ,3,_202),[_202])
% [Thread 4] Worker: received goal(4,thread:call(user:succ,4,_288),[_288])
% [Thread 4] Worker: received done
% Waiting: received done(1,[2])
% Concurrent: Job 1 completed with [2]
% Concurrent: waiting for workers ...
% Waiting: received done(2,[3])
% Concurrent: Job 2 completed with [3]
% Concurrent: waiting for workers ...
% Waiting: received done(3,[4])
% Concurrent: Job 3 completed with [4]
% Concurrent: waiting for workers ...
% Waiting: received done(4,[5])
% Concurrent: Job 4 completed with [5]
% [Thread 3] Worker: received done
L = [2, 3, 4, 5].

I’m working on an old and cheap PC with only two cores, and the above simple code effectively runs sequentially, with the worker labeled [Thread 4] doing everything while lazy worker [Thread 3] only acknowledges “done” (ie, knocking-off time).

The key thing to note is that instead of just call(succ, InVar, OutVar), the term sent via the message queue for the workers to execute is indexed as goal(+Idx, :Goal, +Vars), and the reply the workers send back on the queue the manager listens to is done(+Idx, +Vars), so the manager knows where each result belongs in the output list irrespective of the order in which the parallel processes return their results.
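To see why the index matters, here is a sketch of how a manager might reassemble out-of-order done(Idx, Value) replies into an ordered list (collect_results/3 and collect_replies/3 are my own invented helpers, not part of Jan’s code):

```prolog
% Take N done(Idx, Value) messages off the manager's queue, in whatever
% order the workers produced them.
collect_replies(_Queue, 0, []) :- !.
collect_replies(Queue, N, [Idx-Value|Rest]) :-
    thread_get_message(Queue, done(Idx, Value)),
    N1 is N - 1,
    collect_replies(Queue, N1, Rest).

% Restore the original order by sorting on the index.
collect_results(Queue, N, Ordered) :-
    collect_replies(Queue, N, Replies),
    keysort(Replies, Sorted),
    pairs_values(Sorted, Ordered).
```

keysort/2 sorts the Idx-Value pairs on the index, and pairs_values/2 (from library(pairs), autoloaded in SWI Prolog) then strips the indices off.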

The simple examples used in multithreading tutorials tend to be slower than just doing things the old-fashioned sequential way, because of the overhead of the basic steps:

  1. Create message queues along with the right number of threads (a controversial topic I’ll touch on below)
  2. Rewrite the goals with input variables as call terms and place them in the worker queue
  3. Receive the outputs from the workers in the manager queue and place them in their correct place in the output list
  4. Finally clean up the threads and message queues to make their hardware resources available for new tasks

The above takes longer for simple predicates like succ/2 than just running them sequentially, so concurrency isn’t the right tool for all jobs.
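Putting the four steps together, here is a stripped-down sketch of the whole pattern: a toy cousin of concurrent_maplist/3 with none of its error handling or bidirectionality, and with all predicate names my own invention:

```prolog
poor_mans_concurrent_maplist(Goal, In, Out) :-
    % 1. Create the two queues and one worker thread per core.
    message_queue_create(ToWorkers),
    message_queue_create(ToManager),
    current_prolog_flag(cpu_count, Cores),
    length(Ids, Cores),
    maplist(create_worker(ToWorkers, ToManager), Ids),
    % 2. Post each goal tagged with its position in the input list.
    forall(nth1(Idx, In, X),
           thread_send_message(ToWorkers, goal(Idx, Goal, X))),
    % 3. Collect the tagged replies and restore the original order.
    length(In, N),
    collect_replies(ToManager, N, Replies),
    keysort(Replies, Sorted),
    pairs_values(Sorted, Out),
    % 4. Tell each worker to stop, join them, destroy the queues.
    forall(member(_, Ids), thread_send_message(ToWorkers, done)),
    maplist(thread_join, Ids),
    message_queue_destroy(ToWorkers),
    message_queue_destroy(ToManager).

create_worker(ToWorkers, ToManager, Id) :-
    thread_create(worker(ToWorkers, ToManager), Id, []).

worker(ToWorkers, ToManager) :-
    thread_get_message(ToWorkers, Msg),
    (   Msg = goal(Idx, Goal, X)
    ->  call(Goal, X, Y),
        thread_send_message(ToManager, done(Idx, Y)),
        worker(ToWorkers, ToManager)
    ;   true   % Msg == done: fall through and let the thread exit
    ).

collect_replies(_Queue, 0, []) :- !.
collect_replies(Queue, N, [Idx-Value|Rest]) :-
    thread_get_message(Queue, done(Idx, Value)),
    N1 is N - 1,
    collect_replies(Queue, N1, Rest).
```

Calling poor_mans_concurrent_maplist(succ, [1,2,3,4], L) should give L = [2, 3, 4, 5], just like the trace above, however the jobs get divided between the workers.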

I found going through Jan’s code in the thread library very educational. Much of it is devoted to keeping concurrent_maplist bidirectional, and he has also created versions that handle different numbers of arguments for the goal. I’ll just go briefly through the key predicates specific to multithreading which were new to me.

Creating Threads

Along with understanding the basics of thread_create(:Goal, -Id, +Options)
comes the question of how many “workers” are optimal.

One helpful predicate which I only discovered through this exercise is:

current_prolog_flag(cpu_count, Cores).

If I run that on my webhosting provider’s server, the result is 40, whereas my cheap home PC only has two cores. Sadly, my multithreading experiments don’t run forty times faster on my Webfaction account, which I suspect shows the operative word in Intel’s proprietary Hyper-Threading technology is hype (cpu_count reports logical cores, and two hyperthreads sharing a physical core don’t double its throughput).

The thread library has an internal helper workers(List, Count) which returns Count as the number of cores, unless the length of the list is smaller. (A joke I thought of while writing this is that if you create too many threads, you paralyse instead of parallelise your code, but it probably needs more work.)
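The same guard is easy enough to write yourself; a sketch of the idea (worker_count/2 is my own name for it):

```prolog
% One worker per core, but never more workers than there are jobs.
worker_count(Jobs, Count) :-
    current_prolog_flag(cpu_count, Cores),
    length(Jobs, Len),
    Count is min(Cores, Len).
```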

Join or detach?

One of the many things I find confusing about concurrent programming is which options to pick when. The SWI Prolog implementation, documented at Multithreaded applications, seems to offer the entire POSIX menu, either as options listed in the third argument of thread_create(:Goal, -Id, +Options) or subsequently via predicates such as thread_detach(+Id) and thread_exit(+Term).

Though you can micromanage which core does what with the option affinity(+CpuSet), the consensus appears to be that these decisions are best left to the operating system.

My bias when in doubt is just to use the default options (there is a builtin predicate thread_create(:Goal, -Id) tailored for cowards like me). concurrent_maplist uses the default detached(false) option, and then iteratively calls thread_join(+Id, -Status) for each worker in the cleanup phase.

Jan has provided an example using thread_create with the detached(true) option in this discussion forum, parallelising an include filter.

An advantage of “joined threads” (ones with the default detached(false) option) is that they can be cleaned up neatly at the end by calling thread_join(+Id, -Status), or simpler yet thread_join(+Id) if you are sure the goal is a simple example such as succ(?Int1, ?Int2) which is never going to be fed any types that would make it barf.

Something that tripped me up about thread_join is that I thought it killed the thread. Instead, it waits for the thread to finish (which with buggy code may mean eternity) before freeing its resources.
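A quick way to see what thread_join/2 reports is to wrap it in a little helper (join_status/2 is my own invented name). Status comes back as true, false or exception(Term) depending on how the goal ended:

```prolog
join_status(Goal, Status) :-
    thread_create(Goal, Id, []),
    thread_join(Id, Status).   % blocks until the thread finishes
```

At the swipl command line:

?- join_status(succ(1, _), S).
S = true.

?- join_status(fail, S).
S = false.

?- join_status(throw(oops), S).
S = exception(oops).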

In some cases I may want the thread to continue running as a background daemon (ie a detached process) while the rest of my program continues, and I think the problem I’m working on of searching a game tree may fall within this ambit, though I’m not sure how to go about this yet.

Creating message queues

In your setup phase you call message_queue_create(?Queue), which returns a unique ID looking something like <message_queue>(0x560dd4a8c140), which can subsequently be gotten rid of in the cleanup phase with message_queue_destroy(+Queue).

If you don’t want to reference the message queue as Id or whatever variable name, there is an option to set your own alias, which becomes globally available. An example is main, which you can check by entering thread_self(Id). at the swipl command line. The documentation says:

Explicitly created queues come in two flavours. When given an alias, they must be destroyed by the user. Anonymous message queues are identified by a blob and subject to garbage collection.

Though you don’t have to call message_queue_destroy/1 if you’re not using an alias (which I don’t, since I’m ideologically opposed to globals), I feel it is good style to make the cleanup explicit with a setup_call_cleanup/3 template of this type:

                     setup_call_cleanup(message_queue_create(Id),
                                        business_section(Id, ...),
                                        message_queue_destroy(Id)),

The manager-workers pattern is a bit more complex in that two message queues need to be created and subsequently destroyed: one the manager uses to send jobs to the workers, and another the workers use to send their finished products back to the manager.
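Extending the setup-and-cleanup idea to this pair of queues could look something like the following (with_two_queues/1 is my own invented wrapper, called with a goal that takes the two queues as extra arguments):

```prolog
% Run Goal with two freshly created queues as extra arguments,
% destroying both queues even if Goal throws an exception.
with_two_queues(Goal) :-
    setup_call_cleanup(
        ( message_queue_create(ToWorkers),
          message_queue_create(ToManager)
        ),
        call(Goal, ToWorkers, ToManager),
        ( message_queue_destroy(ToWorkers),
          message_queue_destroy(ToManager)
        )).
```

This way a thrown exception in the business section can’t leak the queues’ resources.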

Listening loops

When threads are created, each instance has as its goal a listening loop, which comes in two styles of template. I don’t think there is much practical difference between them, but I find the most common type, which goes as follows, confusing:

listening_loop(QueueIn, QueueOut) :-
   thread_get_message(QueueIn, JobIn),
   (    JobIn \== end_of_queue
   ->   do_job(JobIn, JobOut),
        thread_send_message(QueueOut, JobOut),
        listening_loop(QueueIn, QueueOut)  % Confusing recursion statement
   ;    thread_send_message(QueueOut, hometime)
   ).

What confuses me about the above style is that I tend to think of recursive predicates as having a base, terminating case, and a Current input argument and a Next output argument that progresses toward the base case.

Message queues, like streams, don’t behave like normal Prolog variables in that their contents change even though their name stays the same, so as a matter of taste I find writing these as failure-driven loops easier to understand:

listening_loop(QueueIn, QueueOut) :-
   repeat,  % gives fail below a choice point to backtrack to
   thread_get_message(QueueIn, JobIn),
   (    JobIn \== end_of_queue
   ->   do_job(JobIn, JobOut),
        thread_send_message(QueueOut, JobOut),
        fail  % Repeats the loop by backtracking to repeat/0
   ;    thread_send_message(QueueOut, hometime),
        !     % Terminates the loop
   ).

This is it so far. I hope to add code for my concurrently programmed “consensus” best game move player soon.


For a higher-level threading API see:

It compiles to the low-level threading API found in SWI-Prolog, ECLiPSe, XSB, and YAP. For examples of threaded engines, see:

For examples of and-parallelism and competitive or-parallelism see:

For the Prolog threads standardization proposal see:

ECLiPSe was the latest system to follow this proposal.