How to measure context switching, operating system and user land?

Did somebody ever systematically measure context switching for SWI-Prolog?
Would it be possible to measure not only thread context switching but
also coroutine context switching? Here is my current test case:

fib(0, R) :- !, R=1.
fib(1, R) :- !, R=1.
fib(N, R) :- M is N-1, fib(M, A), L is M-1, fib(L, B), R is A+B.

fib_hell(0, R) :- !, R=1, sleep(0).
fib_hell(1, R) :- !, R=1, sleep(0).
fib_hell(N, R) :- M is N-1, fib_hell(M, A), L is M-1, fib_hell(L, B), R is A+B.

I guess sleep/1 is not allowed to simply do nothing for a zero argument.
It should make some operating system call that reschedules the current
thread. For thread switching (assuming that is what happens), I get:

/* SWI-Prolog 9.1.4 */
?- time(fib(29,_)).
% 2,496,119 inferences, 0.188 CPU in 0.193 seconds (97% CPU, 13312635 Lips)
?- time(fib_hell(29,_)).
% 3,328,158 inferences, 0.328 CPU in 2.861 seconds (11% CPU, 10142958 Lips)

For coroutine switching, using nodeJS setImmediate(), I get:

/* Dogelog Player 1.0.5 */
?- time(fib(29,_)).
% Wall 1680 ms, gc 73 ms, 3466927 lips
?- time(fib_hell(29,_)).
% Wall 4717 ms, gc 57 ms, 2293147 lips

For some other thread switching, using Thread.sleep():

/* Jekejeke Prolog 1.5.6, JDK 1.8 */
?- time(fib(29,_)).
% Up 365 ms, GC 4 ms (Current 03/04/23 20:33:35)
?- time(fib_hell(29,_)).
% Up 541 ms, GC 3 ms (Current 03/04/23 20:33:40)

My benchmark assumes an implementation that checks t < 0.0, so that
sleep(0) acts like an operating system thread yield. For SWI-Prolog the
implementation seems inconsistent: sometimes t < 0.0 and sometimes t <= 0.0
is checked, e.g. in:

int Pause(double time)

But frankly I don’t know what my own Thread.sleep() invocation from
JDK 1.8 does exactly. Maybe it is cheating? Anyway, one could wish
that Doug Lea would write another book about these matters.

Trealla Prolog isn’t that lucky either:

/* Trealla Prolog 2.11.17 */
?- time(fib(29,_)).
   % Time elapsed 0.716444s
?- time(fib_hell(29,_)).
   % Time elapsed 63.235085s

On the other hand, Scryer Prolog isn’t doing that badly:

/* Scryer Prolog 0.9.1-187 */
?- time(fib(29,_)).
   % CPU time: 0.383s
?- time(fib_hell(29,_)).
   % CPU time: 0.674s

Comparison of the overhead ratio (times in ms), all measured on Windows and WSL2:

System      fib/2   fib_hell/2    Ratio
Jekejeke      365          541     148%
Scryer        383          674     175%
Dogelog      1680         4717     281%
SWI           193         2823    1463%
Trealla       716        63235    8832%

sleep(0) does lots of complicated things depending on the OS. For Windows it creates a waitable timer, see pl-nt.c function Pause(). I guess some of this can go, as all operating systems still alive have something better. What we want is something that provides good sub-second sleeps, long sleeps, and allows for signal processing while sleeping.

I’ve added some stuff to call sched_yield() (POSIX) or SwitchToThread() (Windows) if the time is 0. That gives me 0.108 and 0.645 seconds wall time for these tests. That is under Wine on Linux; real Windows is probably a bit different.

On Linux the times are 0.083 and 0.435 (wall).
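
The patched dispatch can be sketched as follows (a Python model of the idea only; pause is an illustrative name, the real code is the C function Pause()):

```python
import os
import sys
import time

def pause(seconds):
    """Sketch of the patched behaviour: yield the CPU for a zero or
    negative argument instead of arming a timer."""
    if seconds <= 0.0:
        if sys.platform == "win32":
            time.sleep(0)        # the real patch calls SwitchToThread()
        else:
            os.sched_yield()     # POSIX
    else:
        time.sleep(seconds)      # ordinary sub-second or long sleep
```

The point of the zero-argument branch is that it involves no timer object at all: the thread just offers the scheduler a chance to run someone else and continues immediately if nobody is waiting.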

I have little clue what this proves though …

The WASM version runs on node. It probably still uses setTimeout(). See src/wasm/prolog.js. I’m not entirely sure how to run the npm version with node. Probably easy. For a local build simply run node src/swipl.js.

As is, the WASM version can only cooperate in a single fiber (if that is the right terminology). I.e., you cannot have a query yield (Q1), start a new query (Q2), yield from that as well and then resume Q1. You can start Q2, it may yield, but you must close it before you can resume Q1.

I just pushed a patch that allows for multiple engines in the single-threaded version. Possibly that is enough to have multiple coroutines (after extending prolog.js and possibly also the C interface). I don’t know what to do when we want to start Q2. We now have two options: (1) just create it in the same engine, so we are in the situation above, or (2) create a new engine and create the query in this new engine.

How does this work? Does JavaScript async mean it creates a fiber (with a stack) when such a function is called? Is there some way to figure out “which fiber am I in”, which could help me decide whether to create an engine? That would match the current design when embedding SWI-Prolog in a multi-threaded application. Here, we have two options:

  1. Test that the thread has a Prolog thread associated. If so, create a query. Else create a Prolog engine for the current thread and create the query.
  2. Have a pool of Prolog engines. If we need to make a call, get one from the pool, connect it to the current OS thread, create and run the query and return it to the pool. JPL (the Java interface) is doing this by default (it can also be asked to use (1)).
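
Option (2) can be sketched as follows; PrologEngine and its methods are hypothetical stand-ins for whatever the real binding exposes, the point is only the pool discipline:

```python
import queue
import threading

class PrologEngine:
    """Hypothetical stand-in for a real Prolog engine handle."""
    def run(self, goal):
        print(f"[{threading.current_thread().name}] running {goal}")
        return True

class EnginePool:
    """Option (2): keep N engines; hand one to whichever OS thread calls."""
    def __init__(self, size=4):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(PrologEngine())

    def call(self, goal):
        engine = self._free.get()        # blocks if all engines are busy
        try:
            # A real binding would attach the engine to the current
            # OS thread here, run the query, then detach it again.
            return engine.run(goal)
        finally:
            self._free.put(engine)

pool = EnginePool(size=2)
pool.call("fib(29, X)")
```

The blocking get is what bounds concurrency: with a pool of N engines, at most N OS threads can be inside Prolog at a time, and the rest wait rather than each allocating their own engine as in option (1).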

Coding such an event loop could be a hardening challenge for
the engine API of SWI-Prolog. I was able to get this far: the two tasks
terminate almost simultaneously, meaning the interleaving works:

?- go((start(task(30)), start(task(30)))).

These are the commands available for the tasks and the task definition:

stop :- engine_yield(stop).
start(G) :- engine_yield(start(G)).
sleep :- engine_yield(sleep).

fib_hell(0, R) :- !, R=1, sleep.
fib_hell(1, R) :- !, R=1, sleep.
fib_hell(N, R) :- M is N-1, fib_hell(M, A), L is M-1, fib_hell(L, B), R is A+B.

task(N) :- fib_hell(N, X), write('fib('), write(N), write(')='), write(X), nl.

This is the event loop; what I don’t know yet is how to replace the explicit sleep/0 by auto-yield:

go(G) :-
   engine_create(stop, G, F),
   loop([F]).

% loop(+List)
loop([]).
loop([E|L]) :-
   engine_next(E, H), !,
   (H = stop ->
       J = L;
    H = start(G) ->
       engine_create(stop, G, F),
       append(L, [F,E], J);
    H = sleep ->
       append(L, [E], J)),
   loop(J).
loop([_|L]) :-
   loop(L).

Edit 05.03.2023
Because loop/1 has a terminating clause loop([]), it can also be
used for time measurement; you can simply do:

?- time(go((start(task(30)), start(task(30))))).
% 10,770,175 inferences, 2.359 CPU in 2.373 seconds (99% CPU, 4564842 Lips)

I cannot yet do this measurement in Dogelog Player, since I have no go/1
task-group predicate yet. How could I measure this instead?

As this stuff is interesting for the WASM version and maybe for a Python binding, I gave it a try. See GitHub - JanWielemaker/fibers: Proof of concept for using engines to implement fibers

This uses the auto-yielding that is provided for the WASM version but supported in all versions. It can run arbitrary Prolog code.

Performance is about 5% worse due to the auto-yield scheduling. So, running on the single-threaded engine version it performs about equal to the threaded version. It can run on both. It runs on older versions, but exception handling requires the latest git version.

The scheduler is really simple. I added set_immediate/1 and set_timeout/2 to create new fibers. Timeout scheduling is far too simple. I guess you need the waiting threads in a different queue. To make this realistic you’d at least need cooperative versions of the I/O predicates. SWI-Prolog’s read_term/3 would be a challenge as it is written in C. An incomplete term may be ready on the input, so waiting for input to be ready and then reading may still block.
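
What a cooperative read could look like can be sketched at the byte level (a toy model on POSIX file descriptors; the end-of-term test is naively a trailing '.', which a real tokenizer would of course do properly):

```python
import os
import select

def cooperative_read_term(fd, yield_fn):
    """Accumulate bytes until a full term (naively: ending in '.') is
    buffered, yielding control to other fibers whenever no input is
    ready. We only ever read what select/2 says is available, so the
    blocking that plain read_term/3 risks cannot happen here."""
    buf = b""
    while not buf.rstrip().endswith(b"."):
        ready, _, _ = select.select([fd], [], [], 0)
        if not ready:
            yield_fn()               # let other fibers run
            continue
        chunk = os.read(fd, 512)
        if not chunk:
            break                    # EOF with an incomplete term
        buf += chunk
    return buf.decode()
```

This illustrates the incomplete-term problem: the loop may wake up, find half a term, and has to go back to yielding; only a parser that tracks its own state across such resumptions makes the whole read cooperative.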

As is, yes, but indeed in a very silly way. It just keeps objects of the form start(Time,Engine) in the run queue until Time has passed. That should do pretty well as long as there are real runnable tasks. If there are only waiting tasks you get a short polling loop.

Probably you should keep the waiting tasks in a tree sorted by the first to fire. Then after each run of the runnable tasks you check the head of the waiting tasks. If there are no runnable tasks you sleep until the first waiting task fires. Something like that …
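
That policy can be sketched with a heap of timers next to the run queue (a toy model with tasks as plain callables rather than Prolog engines; each task runs one slice and returns any follow-up tasks):

```python
import heapq
import itertools
import time
from collections import deque

_seq = itertools.count()  # tie-breaker so equal deadlines never compare callables

def scheduler(runnable, timers):
    """runnable: deque of thunks, each returning a list of follow-up thunks.
    timers: list of (fire_time, thunk) pairs, kept as a heap (first to
    fire on top), so checking the head is O(1)."""
    heap = [(t, next(_seq), fn) for t, fn in timers]
    heapq.heapify(heap)
    while runnable or heap:
        now = time.monotonic()
        while heap and heap[0][0] <= now:           # promote expired timers
            runnable.append(heapq.heappop(heap)[2])
        if runnable:
            task = runnable.popleft()
            runnable.extend(task())                 # run one slice
        else:
            time.sleep(max(0.0, heap[0][0] - now))  # sleep, don't poll
```

The key difference from the start(Time,Engine)-in-the-run-queue approach is the else branch: with only waiting tasks the loop sleeps until the earliest deadline instead of spinning.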

I was mostly interested in whether concurrency based on engines and auto-yielding works. It does 🙂 To do something useful with it you first need event-based I/O.

Now I made a new version of my non-fibers and fibers API. I removed the
name “engine” from the API, so as to avoid confusion. Engines are
lower level than the Python idea of callbacks and tasks.

The API now reads:

  • Part 1: Callbacks (non-fibers) (Changed)
    They are stackless and run in the main engine (the current task) of the current thread.
    In my current take, they run without Auto-Yield and without Yield-Allowed.

    • os_call_later(G, D, T):
      The predicate succeeds in T with a new timer. As a side effect
      it schedules the goal G to be executed after D milliseconds.

    • os_call_cancel(T):
      The predicate succeeds. As a side effect it cancels the timer T.

  • Part 2: Tasks (1:N fibers) (Changed)
    They are stackful and create their own engine task in the current thread.
    In my current take, they run with Auto-Yield and with Yield-Allowed.

    • os_task_current(E):
      The predicate succeeds in E with the current engine task.

    • os_task_abort(E, M):
      The predicate succeeds. As a side effect the engine task E gets
      the message M signalled.

    • os_task_create(G, E):
      The predicate succeeds in E with a new engine task for the goal G.
      The task gets immediately scheduled to be executed.
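
The Part 1 callbacks map quite directly onto a timer facility. A minimal Python sketch of the API shape (call_later/call_cancel are my names; note this uses one thread per timer, unlike the single-threaded event-loop semantics the spec implies, so it only illustrates the interface):

```python
import threading

def call_later(goal, delay_ms):
    """Schedule goal (a 0-ary callable) to run after delay_ms
    milliseconds; return the timer so it can be cancelled."""
    timer = threading.Timer(delay_ms / 1000.0, goal)
    timer.daemon = True
    timer.start()
    return timer

def call_cancel(timer):
    """Cancel a pending timer; a no-op if it has already fired."""
    timer.cancel()
```

As in the os_call_later/3 spec, scheduling succeeds immediately with a handle, and cancellation is a side effect on that handle.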

Edit 08.03.2023
A nice addition to the current API spec, already implemented for JavaScript
and Python: a callback now has the context of the task that scheduled it available.

?- os_task_current(T), write('task='), write(T), nl.

?- call_later((os_task_current(T), write('task='),
write(T), nl), 100), sleep(500).

?- create_task((os_task_current(T), write('task='), write(T), nl)), sleep(500).
task=[object Object]

?- create_task(call_later((os_task_current(T), write('task='),
write(T), nl), 100)), sleep(500).
task=[object Object]

And there is a new Prolog flag allow_yield, which can be illustrated
as follows; you can query what the above API spec says:

?- current_prolog_flag(allow_yield, A), write('allow_yield='), write(A), nl.

?- call_later((current_prolog_flag(allow_yield, A), write('allow_yield='),
write(A), nl), 100), sleep(500).