I don’t know. It could be simpler than untangling the current engines from threads. I have not dived into either of these though.
Yes. The SWI-Prolog stack shifter both enlarges and shrinks. The multi-threaded version provides thread_idle/2, which reduces the memory footprint of the calling engine. The single-threaded version provides trim_stacks/0, which does only the stack part. It is used by the toplevel to minimize the stacks after answering a query, while waiting for the user.
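For illustration, a worker thread can use this to release memory while blocked; a minimal sketch, where worker_loop/1 and handle/1 are made-up names:

```prolog
% Release as many resources as possible while blocked on the queue.
% With `long`, thread_idle/2 trims the stacks and other thread
% resources before running the goal, restoring them afterwards.
worker_loop(Queue) :-
    thread_idle(thread_get_message(Queue, Work), long),
    handle(Work),            % handle/1 is illustrative
    worker_loop(Queue).
```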
Otherwise I agree with your observations about an event loop. Most of that is not very hard: if you deal with blocking I/O and timers, you are practically there. I think everything is there to write an event loop in Prolog. I don’t really see the point, though (except when working with languages that use an event loop). For most applications preemptive multi-threading is pretty easy to handle in Prolog, because the primary working memory of most Prolog code is the stacks, and these are separated anyway.
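For instance, a select()-style loop can be built on the built-in wait_for_input/3; this is a minimal sketch, and dispatch/1 is a made-up handler:

```prolog
:- use_module(library(readutil)).

% Block until one or more of Streams has data, handle each ready
% stream, and loop. This covers only the I/O part; timers would use a
% finite timeout instead of `infinite`.
event_loop(Streams) :-
    wait_for_input(Streams, Ready, infinite),
    maplist(dispatch, Ready),
    event_loop(Streams).

dispatch(Stream) :-
    read_line_to_string(Stream, Line),
    format("event on ~p: ~p~n", [Stream, Line]).
```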
I am aware of that. I wonder how useful it is on modern OSes and hardware: threads scale easily to the thousands, and while the complexity of threads is serious, the slowdown is rather small. But, if you want, everything is there to write a web server using the N*M design (N threads, each using M engines). Note that engines can hop between threads.
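As a hedged sketch of that N*M idea (all names except the engine and thread built-ins are made up): M engines are parked on a queue, and any of N workers resumes whichever engine it picks up, so an engine may run on a different thread for every answer it produces.

```prolog
:- use_module(library(apply)).

% Create one engine per goal, park them on a queue and let N worker
% threads drive them cooperatively.
nm_pool(N, Goals) :-
    message_queue_create(Q),
    forall(member(G, Goals),
           (   engine_create(G, G, E),
               thread_send_message(Q, E)
           )),
    length(Ids, N),
    maplist(spawn_worker(Q), Ids),
    maplist(thread_join, Ids),
    message_queue_destroy(Q).

spawn_worker(Q, Id) :-
    thread_create(worker(Q), Id, []).

worker(Q) :-
    (   thread_get_message(Q, Engine, [timeout(1)])
    ->  thread_self(Me),
        (   engine_next(Engine, Answer)
        ->  format("~p computed ~p~n", [Me, Answer]),
            thread_send_message(Q, Engine)  % park it; any thread may resume it
        ;   engine_destroy(Engine)          % goal exhausted
        ),
        worker(Q)
    ;   true                                % idle for a second: assume done
    ).
```

For example, `?- nm_pool(2, [member(X,[a,b,c]), member(Y,[1,2,3])]).` runs two worker threads over two engines, and each engine typically produces answers on both threads.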
Note that the current HTTP server has dynamic worker-pool scheduling, where workers are added based on the workload (this can be programmed) and exit when they have to wait too long for new work. Creating and joining a thread takes about 10 microseconds, so we don’t need many waiting.
I’ve done Java scaling into the thousands with no problem, but I’ve also seen significant performance problems due to the overheads (cache misses) of switching threads.
About 20 years ago (but I think this still applies), I saw a multi-threaded email server (written in C++, running on BSD) turned into a single thread with cooperative multi-tasking, for a ~10x performance improvement. I’m not sure whether this is equivalent to “fibers”, but it’s similar: the cooperation was done by replacing epoll/kevent/whatever with a “dispatching” equivalent, so each operation ran until it had to do an I/O or similar. (On an 8-CPU machine, a single process with hundreds or thousands of threads was changed to 8 processes, each using cooperative multi-tasking.)
The reason for the big speed-up was that thread switching did bad things to the cache; I presume that this would still apply today, perhaps even more so (I’ve observed ~3x performance differences on one Prolog program between two machines that differ by 10x in the amount of cache).
Wouldn’t this mean that things would break if you added a dynamic/1 directive to one module without doing it everywhere? And you’d need to also add thread_global/1 everywhere, I think.
Possibly, on some workloads. In the early days of multi-threaded SWI-Prolog, running things concurrently typically meant a performance improvement with 2 or 4 threads, after which performance started dropping, often quickly getting below the performance of a single thread. At least on Linux I no longer observe that on most workloads.
We get this result. The machine has 16 cores and 32 hardware threads. So, you see close to perfect speedup up to 16 threads (total wall time remains almost flat), some more speedup up to 32 (hyperthreading), and flat performance after that.
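The shape of such a measurement can be reproduced with a sketch along these lines (cpu_task/0 is an arbitrary CPU-bound goal, not the benchmark actually used):

```prolog
:- use_module(library(thread)).

% Run Copies instances of a CPU-bound goal on NThreads threads and
% report the elapsed wall time.
bench(NThreads, Copies, Wall) :-
    length(Goals, Copies),
    maplist(=(cpu_task), Goals),
    get_time(T0),
    concurrent(NThreads, Goals, []),
    get_time(T1),
    Wall is T1 - T0.

cpu_task :-
    numlist(1, 1000000, L),
    sum_list(L, _).
```

Calling e.g. `bench(N, N, Wall)` for increasing N should show Wall staying roughly flat up to the number of cores.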
There is still a reason for not using system threads: scheduling gets unfair, as each of the threads has equal priority. You can work around that using scheduling settings, though.
My assessment is roughly this, where the details depend on the workload and the OS (and possibly the hardware):

- Threads use more resources (stack, OS resources).
- Fibers probably scale better with many, often-blocking tasks.
- Threads scale better with many CPU-bound tasks.
- Fibers are easier to program with.
If you use threads+fibers you probably get the best scalability. On the other hand, while old OSes used to use exactly that hybrid to implement POSIX threads (M:N scheduling), modern Linux doesn’t bother and maps each thread to a kernel thread.
With only fibers you can only use one CPU. With threads+fibers you can scale to multiple cores, but you get most of the complications of preemptive multi-threaded programming in return.
I know. That provides no resource sharing, though. There are quite a few cases where you really want that; consider Prolog services over large Prolog databases. Fibers can also not hop between workers (I guess), so you may end up with an unbalanced distribution of work over your workers. If you used SWI-Prolog threads and engines as fibers, engines could hop between threads.
I don’t see much gain for web servers, though. As is, we have a thread doing the socket accept; after the accept it sends the socket to the worker pool. That is a dynamically scaling (up and down) pool of threads, where one of them will pick up the socket, read the request, do the computation and write the result. Unless a lot of data is involved and the network is slow, the thread will be working. Websockets use a set of open connections and (on most systems) the poll() API to get requests. They also use a dynamically sized set of threads to handle the messages.
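The accept-then-dispatch shape is roughly the sketch below. This is deliberately simplified (fixed pool, made-up names, trivial response), not the actual http library code:

```prolog
:- use_module(library(socket)).
:- use_module(library(readutil)).

% One thread accepts; workers pick sockets from a queue, read the
% request and reply. The real server grows and shrinks the pool.
server(Port, NWorkers) :-
    message_queue_create(Q),
    forall(between(1, NWorkers, _),
           thread_create(worker(Q), _, [detached(true)])),
    tcp_socket(S),
    tcp_bind(S, Port),
    tcp_listen(S, 5),
    accept_loop(S, Q).

accept_loop(S, Q) :-
    tcp_accept(S, Client, _Peer),
    thread_send_message(Q, Client),
    accept_loop(S, Q).

worker(Q) :-
    thread_get_message(Q, Client),
    tcp_open_socket(Client, StreamPair),
    read_line_to_string(StreamPair, _RequestLine),
    format(StreamPair, "HTTP/1.0 200 OK\r\nContent-length: 3\r\n\r\nok\n", []),
    close(StreamPair),
    worker(Q).
```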
Two things go “wrong”: (1) long polling with many clients creates (too) many threads, and engines can help here; (2) long reads/writes over slow networks can leave (too) many threads mostly idle. Threads scale to the thousands, but maybe you can get many more using fibers. Fibers still need shared atom and clause garbage collection, which may limit scalability.
You shouldn’t, but you do need it between fibers on the same worker.
Anyway, concurrent web server programming is typically really easy in SWI-Prolog and comes at pretty low cost. The need for thread synchronization such as mutexes is rare for such tasks; the low-level machinery and libraries guarantee basic consistency, and multiple related changes to shared resources can be handled easily using transaction/1. There is no need to think about how many workers you need, to make sure CPU-intensive tasks are spread over (enough) workers, to set up message passing, etc.
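For example, a hedged sketch (balance/2 and transfer/3 are made up for illustration) of using transaction/1 to make two related updates atomic with respect to other threads:

```prolog
:- dynamic balance/2.

% Move Amount between two accounts. Other threads see either both
% updates or neither; if the goal fails or throws, nothing changes.
transfer(From, To, Amount) :-
    transaction(( retract(balance(From, FB)),
                  FB >= Amount,
                  NFB is FB - Amount,
                  assertz(balance(From, NFB)),
                  retract(balance(To, TB)),
                  NTB is TB + Amount,
                  assertz(balance(To, NTB))
                )).
```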
There are some scenarios where there can be many idle threads that may profit from fibers, but even there I’m not sure whether it will pay off.
I know the model is popular, and for good reason, in imperative OO languages. Typical Prolog computation involves few global resources, though, so you get a lot of isolation for free.
That’s the design that Google uses internally for its web servers (in fact, for almost all servers: by default, a server provides HTTP access for things like monitoring, modifying parameters, debugging, and looking at logs). IIRC, there were some issues when a worker set a callback and the callback ran in a different thread (I don’t know the details of why this was a problem; there was also a standard higher-level framework that guaranteed that the worker and its callback ran in the same thread).
There is more than one kind of context switch. I’m thinking of context switching where “cache affinity” is lost, TLBs are flushed, etc. Possibly this isn’t as important nowadays, as Jan’s benchmarks upthread indicate (The subtle costs of garbage collection - #38 by jan). Either way, one of the big advantages of fibers is that they don’t lose cache affinity as much as regular threads do, because the computation stays on the same CPU as much as possible. The other advantage is that a fiber runs until it needs something (e.g., waiting for I/O), whereas threads might be interrupted while doing computations; so they can lose cache affinity at that point, and again when they reach an epoll/kevent/etc.