Multiple threads not using available cores

I’m using SWI-Prolog version 8.2.4
on a 32-core CPU running Linux (Fedora 34).

I am running a large number of threads, 27 at a time, with the intention of making use of most of the cores.

But what I’m getting is: when the program is running, htop reports only one or two cores being used at any given time. Most of the threads are shown as sleeping. The program eventually finishes, and produces the right results. However, it looks like most of the cores go unused.

My code looks like this:

control:-
	file_to_list(closed_repr_f4_short_test,NodeList),
	length(NodeList,Len),
	iota(Len,Iota), clean(middle_zero),
	maplist(node_to_middle_zero,Iota,NodeList),
	current_prolog_flag(cpu_count,CPUCores),
	Running is min(CPUCores,Len) - 5, Running > 0, !,
	thread_create(init(Running,Running),_Control,[alias(control)]).

node_to_middle_zero(ID,Node):-
	middle(Node,Middle),
	recordz(middle_zero,[ID,Middle]).

init(Remain,0):-
	loop(Remain).
init(Remain,ToCreate):-
	ToCreate > 0,
	recorded(middle_zero,[ID,Middle],Ref),
	erase(Ref),
	single_launch(ID,Middle),
	NewToCreate is ToCreate - 1,
	init(Remain,NewToCreate).


loop(_Remain):-
	recorded(middle_zero,[ID,Middle],Ref),
	erase(Ref),
	get_message_and_record,
	single_launch(ID,Middle),
	fail.
loop(Remain):-
	finish(Remain).

finish(0):-
	thread_exit('control thread is done.').
finish(Remain):-
	Remain > 0,
	get_message_and_record,
	NewRemain is Remain-1,
	finish(NewRemain).

get_message_and_record:- !,
	thread_get_message([OldMiddle,OldMiddleCount]),
	recordz(middle_count,[OldMiddle,OldMiddleCount]).

single_launch(ID,Middle):- !,
	current_prolog_flag(cpu_count,CPUCores),
	Core is ID mod CPUCores,
	atomic_list_concat([m,t,ID],'_',TID),
	thread_create(mch_thread(Middle,ID),_MCH_Thread,[alias(TID),affinity([Core])]).

mch_thread(Middle,ID):-
	thread_self(Self),
	middle_count_hybrid(Middle,MiddleCount),
	thread_send_message(control,[ID,[Middle,MiddleCount]]),
	thread_detach(Self),
	thread_exit([Self,Middle,MiddleCount]).

Hard to say. There are surely some dubious things in your code. First of all, thread_exit/1 is deprecated. It seems you just want the thread to stop, which happens if you simply let its goal succeed (or fail or throw an exception, although for detached threads that results in a warning). Using the affinity option to lock threads to a core is probably also a bad idea. Using recorded/3 and friends is old, pre-ISO school. If you want to store work to do, a message queue might be a good alternative.
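To make the message queue suggestion concrete, here is a minimal sketch of a worker pool in which threads end simply by letting their goal succeed, with no thread_exit/1 and no affinity option. The names run_pool/3, worker/2, collect/3 and work/2 are hypothetical; work/2 only stands in for the real computation, and the sketch assumes work/2 always succeeds.

```prolog
% Hypothetical stand-in for the real per-item computation.
work(N, Result) :-
    Result is N * N.

% run_pool(+Inputs, +Workers, -Results)
% Results is a list of Input-Result pairs in completion order,
% which is not deterministic.
run_pool(Inputs, Workers, Results) :-
    message_queue_create(Jobs),
    message_queue_create(Done),
    length(Threads, Workers),
    maplist(start_worker(Jobs, Done), Threads),
    forall(member(I, Inputs), thread_send_message(Jobs, job(I))),
    % One stop token per worker, queued behind all jobs.
    forall(member(_, Threads), thread_send_message(Jobs, stop)),
    length(Inputs, N),
    collect(N, Done, Results),
    forall(member(T, Threads), thread_join(T, _Status)),
    message_queue_destroy(Jobs),
    message_queue_destroy(Done).

start_worker(Jobs, Done, Id) :-
    thread_create(worker(Jobs, Done), Id, []).

% A worker takes jobs until it sees a stop token, then just succeeds.
worker(Jobs, Done) :-
    thread_get_message(Jobs, Msg),
    (   Msg == stop
    ->  true
    ;   Msg = job(I),
        work(I, R),
        thread_send_message(Done, result(I, R)),
        worker(Jobs, Done)
    ).

% Collect exactly N results from the Done queue.
collect(0, _, []) :- !.
collect(N, Done, [I-R|T]) :-
    thread_get_message(Done, result(I, R)),
    N1 is N - 1,
    collect(N1, Done, T).
```

For example, ?- run_pool([1,2,3,4], 2, Rs). binds Rs to the four Input-Result pairs in whatever order the workers finish.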

If you want simple concurrency, concurrent_maplist/2 and friends keep your code simple.
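As a rough illustration, the whole control/launch machinery could shrink to a single call to concurrent_maplist/3 from library(thread). Here middle_count_stub/2 and count_all/2 are hypothetical names; the stub merely takes the length of an atom and stands in for middle_count_hybrid/2, which is not shown in this thread.

```prolog
:- use_module(library(thread)).

% Hypothetical stand-in for middle_count_hybrid/2.
middle_count_stub(Middle, Count) :-
    atom_length(Middle, Count).

% Apply the stub to every element; the library decides how many
% threads to use and keeps Counts in the same order as Middles.
count_all(Middles, Counts) :-
    concurrent_maplist(middle_count_stub, Middles, Counts).
```

Unlike a hand-written pool, the results come back in input order, so no IDs need to be carried along in messages.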

None of this really explains why the speedup is so poor. The core of the concurrent computation seems to be middle_count_hybrid/2, which is not included. There could be a clue there. Look notably for operations that affect shared resources, such as the recorded database (safe, but outdated and not at all optimized for concurrent access) or heavily modified dynamic predicates (consider thread_local/1).
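A small sketch of what thread_local/1 gives you: each thread gets its own, initially empty, set of clauses for the predicate, so heavy assert/retract traffic in one thread does not touch shared data. The predicate names scratch/1 and isolated/0 are made up for this illustration.

```prolog
% scratch/1 is hypothetical; every thread sees only its own clauses.
:- thread_local scratch/1.

% Assert a fact in the calling thread, then check from a new thread
% that the fact is not visible there.
isolated :-
    assertz(scratch(100)),
    thread_create(\+ scratch(_), T, []),
    thread_join(T, Status),
    retractall(scratch(_)),
    Status == true.
```

This only helps for data that is private to a thread; facts that all threads must read still need an ordinary dynamic (or static) predicate.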

On most platforms you can run ?- prolog_ide(thread_monitor). (also from many of the IDE menus) to get some more insight into what is happening.
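Where the graphical monitor is not available, a rough textual substitute can be built from thread_property/2 and thread_statistics/3. This is only a sketch: report_threads/0 is a hypothetical helper, and it assumes that CPU time may not be obtainable for every thread, in which case it prints unknown.

```prolog
% Print one line per thread: identifier, status and CPU time.
report_threads :-
    forall(thread_property(Id, status(Status)),
           (   (   catch(thread_statistics(Id, cputime, CPU), _, fail)
               ->  true
               ;   CPU = unknown
               ),
               format("~w ~w ~w~n", [Id, Status, CPU])
           )).
```

Calling it a few times while the program runs gives a crude picture of which threads are actually accumulating CPU time.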

Thank you Jan. You have given me a lot of ideas to explore, and I will. You are right that the core of the concurrent computation is done by middle_count_hybrid/2. All the instances of it access recorded data, but none of them modify it. Some of that data is stored in thousands of records, and some of it in thousands of clauses (actually facts) of a dynamic predicate, using hashing for faster access. All those records and facts are created prior to the call to control, and they are not modified at all during the execution of control.

Unfortunately though, concurrent access to records is realized using a simple lock (mutex). So, multiple threads using the recorded database will mostly fight each other, even if they only read.

Storing data as facts of a dynamic predicate is better. Concurrent reading of clauses runs without any locks and without any write operations in shared memory. Writing (assert) is coordinated using a very briefly held lock.

The hashing makes me wonder, though. Are you using anything besides Prolog’s normal clause indexing?

You can use mutex_statistics/0 to dump some stats on mutex usage. You have to guess a little what the names mean (or check the sources). The L_* named mutexes are system-wide.

My guess is you should replace the recorded database by dynamic predicates or do the dirty work and make handling of the recorded database thread-safe without locks. I’ll surely be interested in a PR doing this. A fairly easy change would be to have locks associated with each key in the recorded database rather than a single lock for all of them.
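The first option might be sketched as follows, assuming the records are read-only once the threads start. Both stored/2 and migrate/1 are hypothetical names; the idea is to copy every record under a key into facts once, up front, and have the worker threads call stored/2 instead of recorded/2.

```prolog
% Hypothetical replacement store: stored(Key, Value).
:- dynamic stored/2.

% Copy all records under Key into stored/2 facts, preserving order.
% Run this once, before any worker thread is created.
migrate(Key) :-
    forall(recorded(Key, Value),
           assertz(stored(Key, Value))).
```

A lookup such as recorded(Key, [ID, Data]) then becomes stored(Key, [ID, Data]), which readers can run concurrently without taking the recorded database lock.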

Thank you Jan. You nailed it. Replacing the recorded database with dynamic predicates made all the difference. Now all the CPU cores are humming. I believe your other comments and suggestions will allow me to make further improvements to the code.
