How to measure context switching, in the operating system and in user land?

Did somebody ever systematically measure context switching for SWI-Prolog?
Would it be possible to measure not only thread context switching but
also coroutine context switching? Here is my current test case:

fib(0, R) :- !, R=1.
fib(1, R) :- !, R=1.
fib(N, R) :- M is N-1, fib(M, A), L is M-1, fib(L, B), R is A+B.

fib_hell(0, R) :- !, R=1, sleep(0).
fib_hell(1, R) :- !, R=1, sleep(0).
fib_hell(N, R) :- M is N-1, fib_hell(M, A), L is M-1, fib_hell(L, B), R is A+B.

I assume sleep/1 is not allowed to simply do nothing for a zero argument;
it should make some operating system call that changes the scheduling
of the current thread. Assuming that holds for thread switching, I get:

/* SWI-Prolog 9.1.4 */
?- time(fib(29,_)).
% 2,496,119 inferences, 0.188 CPU in 0.193 seconds (97% CPU, 13312635 Lips)
true.
?- time(fib_hell(29,_)).
% 3,328,158 inferences, 0.328 CPU in 2.861 seconds (11% CPU, 10142958 Lips)
true.

For coroutine switching, using Node.js setImmediate(), I get:

/* Dogelog Player 1.0.5 */
?- time(fib(29,_)).
% Wall 1680 ms, gc 73 ms, 3466927 lips
true.
?- time(fib_hell(29,_)).
% Wall 4717 ms, gc 57 ms, 2293147 lips
true.
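As a rough sanity check outside Prolog, the same pattern can be reproduced in plain Node.js, where the analogue of sleep(0) is awaiting setImmediate(). This is only an illustrative sketch, not one of the measured systems; fib(25) is used instead of fib(29) to keep the run short:

```javascript
// Plain-JS analogue of the benchmark: fib() is synchronous, while fibHell()
// awaits setImmediate() at the leaves, forcing a trip through the event loop
// on every base case (the coroutine-switch analogue of sleep(0)).
function fib(n) {
  return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

async function fibHell(n) {
  if (n < 2) {
    await new Promise((resolve) => setImmediate(resolve)); // yield
    return 1;
  }
  return (await fibHell(n - 1)) + (await fibHell(n - 2));
}

async function main() {
  let t = Date.now();
  fib(25);
  console.log(`fib(25): ${Date.now() - t} ms`);
  t = Date.now();
  await fibHell(25);
  console.log(`fib_hell(25): ${Date.now() - t} ms`);
}
main();
```

The ratio of the two wall times gives a ballpark for the per-switch cost of one full turn of the event loop.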

For yet another form of thread switching, using Thread.sleep(), I get:

/* Jekejeke Prolog 1.5.6, JDK 1.8 */
?- time(fib(29,_)).
% Up 365 ms, GC 4 ms (Current 03/04/23 20:33:35)
true.
?- time(fib_hell(29,_)).
% Up 541 ms, GC 3 ms (Current 03/04/23 20:33:40)
true.

My benchmark assumes an implementation that checks t < 0.0, so that
sleep(0) acts like an operating-system thread yield. For SWI-Prolog the
implementation seems inconsistent: sometimes t < 0.0 and sometimes t <= 0.0 is checked:

int Pause(double time)
https://github.com/SWI-Prolog/swipl-devel/blob/308c9221b3676062c1ff46c13fa3782cd3311e39/src/os/pl-os.c

But frankly I don’t know exactly what my own Thread.sleep() invocation from
JDK 1.8 does. Maybe it is cheating? Anyway, one could wish that
Doug Lea would write another book about these matters.

Trealla Prolog doesn’t fare that well either:

/* Trealla Prolog 2.11.17 */
?- time(fib(29,_)).
   % Time elapsed 0.716444s
   true.
?- time(fib_hell(29,_)).
   % Time elapsed 63.235085s
   true.

On the other hand, Scryer Prolog isn’t doing that badly:

/* Scryer Prolog 0.9.1-187 */
?- time(fib(29,_)).
   % CPU time: 0.383s
   true.
?- time(fib_hell(29,_)).
   % CPU time: 0.674s
   true.

Comparison of the overhead ratios, all measured on Windows or under WSL2:

System      fib/2 (ms)   fib_hell/2 (ms)   Ratio
Jekejeke           365               541    148%
Scryer             383               674    175%
Dogelog           1680              4717    281%
SWI                193              2823   1463%
Trealla            716             63235   8832%

sleep(0) does lots of complicated things depending on the OS. For Windows it creates a waitable timer; see the function Pause() in pl-nt.c. I guess some of this can go, as all operating systems still in use have something better. What we want is something that provides good sub-second sleeps, long sleeps, and signal processing while sleeping.

I’ve added some code to call sched_yield() (POSIX) or SwitchToThread() (Windows) if the time is 0. That gives me 0.108 and 0.645 seconds wall time for these tests. That is under Wine on Linux; real Windows is probably a bit different.

On Linux the times are 0.083 and 0.435 (wall).

I have little clue what this proves, though …

The WASM version runs on Node. It probably still uses setTimeout(); see src/wasm/prolog.js. I’m not entirely sure how to run the npm version with node, but it is probably easy. For a local build, simply run node src/swipl.js.

As is, the WASM version can only cooperate in a single fiber (if that is the right terminology). I.e., you cannot have a query yield (Q1), start a new query (Q2), yield from that as well and then resume Q1. You can start Q2, it may yield, but you must close it before you can resume Q1.

I just pushed a patch that allows for multiple engines in the single-threaded version. Possibly that is enough to have multiple coroutines (after extending prolog.js and possibly also the C interface). I don’t know what to do when we want to start Q2. We now have two options: (1) just create it in the same engine, so we are in the situation above, or (2) create a new engine and create the query in it.

How does this work? Does JavaScript async mean it creates a fiber (with a stack) when such a function is called? Is there some way to figure out “which fiber am I in” that could help me decide whether to create an engine? That would match the current design when embedding SWI-Prolog in a multi-threaded application. Here we have two options:

  1. Test that the thread has a Prolog thread associated. If so, create a query. Else create a Prolog engine for the current thread and create the query.
  2. Have a pool of Prolog engines. If we need to make a call, get one from the pool, connect it to the current OS thread, create and run the query and return it to the pool. JPL (the Java interface) is doing this by default (it can also be asked to use (1)).

As this stuff is interesting for the WASM version and maybe for a Python binding, I gave it a try. See GitHub - JanWielemaker/fibers: Proof of concept for using engines to implement fibers

This uses the auto-yielding that is provided for the WASM version but supported in all versions. It can run arbitrary Prolog code.

Performance is about 5% worse due to the auto-yield scheduling, so running on the single-threaded engine version it performs about on par with the threaded version. It can run on both. It runs on older versions, but exception handling requires the latest git version.

The scheduler is really simple. I added set_immediate/1 and set_timeout/2 to create new fibers. Timeout scheduling is far too simple; I guess you need the waiting tasks in a different queue. To make this realistic you’d need at least cooperative versions of the I/O predicates. SWI-Prolog’s read_term/3 would be a challenge, as it is written in C. An incomplete term may be ready on the input, so waiting for input to become ready and then reading may still block.
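The shape of such a scheduler can be sketched with generators standing in for auto-yielding engines: each yield plays the role of an auto-yield point, and spawn() mimics set_immediate/1. This is a toy model for illustration, not the actual fibers code:

```javascript
// Toy cooperative scheduler: generators stand in for auto-yielding engines.
// The scheduler resumes fibers round-robin from a single run queue;
// spawn() mimics set_immediate/1.
const runQueue = [];

const spawn = (gen) => runQueue.push(gen);

function schedule() {
  const log = [];
  while (runQueue.length) {
    const fiber = runQueue.shift();
    const { value, done } = fiber.next(); // resume until the next yield
    if (value !== undefined) log.push(value);
    if (!done) runQueue.push(fiber);      // yielded: back of the queue
  }
  return log;
}

function* work(name) {
  yield `${name}:1`; // first slice of work
  yield `${name}:2`; // second slice of work
}

spawn(work("a"));
spawn(work("b"));
console.log(schedule().join(" ")); // a:1 b:1 a:2 b:2
```

The interleaved output shows the round-robin switching that the auto-yield points buy you.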

As is, yes, but indeed in a very silly way. It just keeps objects of the form start(Time,Engine) in the run queue until Time has passed. That should do pretty well as long as there are real runnable tasks. If there are only waiting tasks you get a short polling loop.

Probably you should keep waiting tasks in a tree sorted by the first to fire. You check the head of the waiting queue after each pass over the runnable tasks. If there are no runnable tasks, you sleep until the first waiting task is due. Something like that …
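That two-queue idea can be sketched as follows: runnables in a FIFO, waiters kept sorted by fire time, and when only waiters remain the loop jumps (in a real system: sleeps) to the earliest deadline. A sorted array stands in for the tree, a fake clock for real time; all names are made up for illustration:

```javascript
// Sketch: separate runnable FIFO plus a waiting list sorted by fire time.
// With no runnables, jump the (fake) clock to the earliest deadline
// instead of polling.
const runnable = [];
const waiting = [];   // kept sorted: earliest deadline first

function addTimer(at, fn) {
  waiting.push({ at, fn });
  waiting.sort((a, b) => a.at - b.at);   // a real impl would use a heap/tree
}

function loop(clock) {
  const fired = [];
  while (runnable.length || waiting.length) {
    while (waiting.length && waiting[0].at <= clock.t)
      runnable.push(waiting.shift().fn); // promote due waiters
    if (runnable.length) {
      fired.push(runnable.shift()(clock.t));
    } else {
      clock.t = waiting[0].at;           // "sleep" until the first deadline
    }
  }
  return fired;
}

addTimer(30, (t) => `late@${t}`);
addTimer(10, (t) => `early@${t}`);
runnable.push((t) => `now@${t}`);
console.log(loop({ t: 0 }).join(" ")); // now@0 early@10 late@30
```

In a real scheduler the clock assignment becomes a timed wait on the event source, so idle time costs nothing.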

I was mostly interested in whether concurrency based on engines and auto-yielding works. It does :slight_smile: To do something useful with it you first need event-based I/O.