The subtle costs of garbage collection

jan · March 2, 2023, 10:06am

Possibly on some workloads. In the early days of multi-threaded SWI-Prolog running things concurrently typically meant a performance improvement with 2 or 4 threads, after which the performance started dropping down, often quickly getting below the performance of 1 thread. At least on Linux I no longer observe that on most workloads.

Using this driver code

Driver

:- meta_predicate
    conc_table(+,0),
    conc(+,0).

conc_table(Max, Goal) :-
    format('~t~w~15|~t~w~15+~t~w~15+~n', ["#threads", "total", "per run"]),
    between(0,Max,Step),
    Conc is 1<<Step,
    call_time(conc(Conc, Goal), Time),
    PerSol is Time.wall/Conc,
    format('~t~D~15|~t~3f~15+~t~3f~15+~n', [Conc, Time.wall, PerSol]),
    fail.
conc_table(_, _).

conc(N, Goal) :-
    length(Threads, N),
    maplist(thread_create(Goal), Threads),
    maplist(thread_join, Threads).

we get this result. Machine is 16 cores, 32 threads. So, you see close to perfect speedup up to 16 threads (total time remains almost flat). Some more speedup to 32 (hyper threading) and after that flat performance.

?- conc_table(11, rtest_chats(100)).
       #threads          total        per run
              1          0.674          0.674
              2          0.668          0.334
              4          0.690          0.173
              8          0.723          0.090
             16          0.806          0.050
             32          1.201          0.038
             64          2.400          0.038
            128          4.640          0.036
            256          9.139          0.036
            512         18.337          0.036
          1,024         36.686          0.036
          2,048         73.649          0.036

There is still a reason for not using system threads: scheduling gets unfair as each of the threads has equal priority. Well, you can work around that using scheduling settings.

My assessment is roughly this, where details depend on workload and also OS (and possibly hardware).

Threads use more resources (stack, OS resources)
Fibers probably scale better with many often blocking tasks
Threads scale better with many CPU bound tasks
Fibers are easier to program with
If you use threads+fibers you get probably best scalability. On the other hand, where old OSes used to have that for implementing POSIX threads, e.g., Linux doesn’t bother.
With only fibers you can only use one CPU. With threads+fibers you can scale to multiple cores, but you get most of the complications of preemptive multi-threaded programming in return.

Topic		Replies	Views
How to achieve a x1000 performance increase till 2025 and why architecture matters Nice to know	2	564	September 7, 2021
Garbage Collection tuning in Go for Microservices -- saved 70K Cores Discussion	1	338	April 5, 2022
Performance differences? General	5	560	September 5, 2019
Prolog uses only 13% of CPU and takes an hour to complete General	23	1430	September 8, 2020
Garbage collect, atoms , heaps - freeze limits for gpu General garbage-collection	2	288	September 14, 2023

The subtle costs of garbage collection

Related topics