I don’t know. It could be simpler than untangling the current engines from threads. I have not dived into either of these though.
Yes. The SWI-Prolog stack shifter both enlarges and shrinks. The multi-threaded version provides thread_idle/2, which reduces the memory footprint of the calling engine. The single-threaded version provides trim_stacks/0, which does only the stack part. It is used by the toplevel to minimize the stacks after answering a query, while waiting for the user.
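For illustration, a worker thread can use this to release memory while blocked; a minimal sketch, where worker_loop/1 and handle/1 are made-up names:

```prolog
% Release as many resources as possible while blocked on the queue.
% With `long`, thread_idle/2 trims the stacks and other thread
% resources before running the goal, restoring them afterwards.
worker_loop(Queue) :-
    thread_idle(thread_get_message(Queue, Work), long),
    handle(Work),            % handle/1 is illustrative
    worker_loop(Queue).
```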
Otherwise I agree with your observations about an event loop. Most of that is not very hard: if you deal with blocking I/O and timers, you are practically there. I think everything is there to write an event loop in Prolog. I don’t really see the point, though (except when working with languages that use an event loop). For most applications preemptive multi-threading is pretty easy to handle in Prolog, because the primary working memory of most Prolog code is the stacks, and these are separated anyway.
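For instance, a select()-style loop can be built on the built-in wait_for_input/3; this is a minimal sketch, and dispatch/1 is a made-up handler:

```prolog
:- use_module(library(readutil)).

% Block until one or more of Streams has data, handle each ready
% stream, and loop. This covers only the I/O part; timers would use a
% finite timeout instead of `infinite`.
event_loop(Streams) :-
    wait_for_input(Streams, Ready, infinite),
    maplist(dispatch, Ready),
    event_loop(Streams).

dispatch(Stream) :-
    read_line_to_string(Stream, Line),
    format("event on ~p: ~p~n", [Stream, Line]).
```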
I am aware of that. I wonder how useful it is on modern OSes and hardware: threads scale easily to the thousands, and while the complexity of threads is serious, the slowdown is rather small. But, if you want, everything is there to write a web server using the N*M design (N threads, each using M engines). Note that engines can hop between threads.
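As a hedged sketch of that N*M idea (all names except the engine and thread built-ins are made up): M engines are parked on a queue, and any of N workers resumes whichever engine it picks up, so an engine may run on a different thread for every answer it produces.

```prolog
:- use_module(library(apply)).

% Create one engine per goal, park them on a queue and let N worker
% threads drive them cooperatively.
nm_pool(N, Goals) :-
    message_queue_create(Q),
    forall(member(G, Goals),
           (   engine_create(G, G, E),
               thread_send_message(Q, E)
           )),
    length(Ids, N),
    maplist(spawn_worker(Q), Ids),
    maplist(thread_join, Ids),
    message_queue_destroy(Q).

spawn_worker(Q, Id) :-
    thread_create(worker(Q), Id, []).

worker(Q) :-
    (   thread_get_message(Q, Engine, [timeout(1)])
    ->  thread_self(Me),
        (   engine_next(Engine, Answer)
        ->  format("~p computed ~p~n", [Me, Answer]),
            thread_send_message(Q, Engine)  % park it; any thread may resume it
        ;   engine_destroy(Engine)          % goal exhausted
        ),
        worker(Q)
    ;   true                                % idle for a second: assume done
    ).
```

For example, `?- nm_pool(2, [member(X,[a,b,c]), member(Y,[1,2,3])]).` runs two worker threads over two engines, and each engine typically produces answers on both threads.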
Note that the current HTTP server has dynamic worker-pool scheduling, where workers are added based on the workload (this can be programmed) and exit when they have to wait too long for new work. Creating and joining a thread takes about 10 microseconds, so we don’t need many waiting.
I’ve done Java scaling into the thousands with no problem, but I’ve also seen significant performance problems due to the overheads (cache misses) of switching threads.
About 20 years ago (but I think this still applies), I saw a multi-threaded email server (written in C++, running on BSD) turned into a single thread with cooperative multi-tasking, for a ~10x performance improvement. I’m not sure whether this is equivalent to “fibers”, but it’s similar: the cooperation was done by replacing epoll/kevent/whatever with a “dispatching” equivalent, so each operation ran until it had to do an I/O or similar. (On an 8-CPU machine, a single process with hundreds or thousands of threads was changed to 8 processes, each using cooperative multi-tasking.)
The reason for the big speed-up was that thread switching did bad things to the cache; I presume that this would still apply today, perhaps even more so (I’ve observed ~3x performance differences on one Prolog program between two machines that differ by 10x in the amount of cache).
Wouldn’t this mean that things would break if you added a dynamic/1 directive to one module without doing it everywhere? And you’d need to also add thread_global/1 everywhere, I think.
Possibly, on some workloads. In the early days of multi-threaded SWI-Prolog, running things concurrently typically meant a performance improvement with 2 or 4 threads, after which performance started dropping, often quickly getting below the performance of a single thread. At least on Linux I no longer observe that on most workloads.
We get this result. The machine has 16 cores and 32 hardware threads. So, you see close to perfect speedup up to 16 threads (total wall time remains almost flat), some more speedup up to 32 (hyperthreading), and flat performance after that.
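The shape of such a measurement can be reproduced with a sketch along these lines (cpu_task/0 is an arbitrary CPU-bound goal, not the benchmark actually used):

```prolog
:- use_module(library(thread)).

% Run Copies instances of a CPU-bound goal on NThreads threads and
% report the elapsed wall time.
bench(NThreads, Copies, Wall) :-
    length(Goals, Copies),
    maplist(=(cpu_task), Goals),
    get_time(T0),
    concurrent(NThreads, Goals, []),
    get_time(T1),
    Wall is T1 - T0.

cpu_task :-
    numlist(1, 1000000, L),
    sum_list(L, _).
```

Calling e.g. `bench(N, N, Wall)` for increasing N should show Wall staying roughly flat up to the number of cores.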
There is still a reason for not using system threads: scheduling gets unfair, as each of the threads has equal priority. You can work around that using scheduling settings, though.
My assessment is roughly this, where the details depend on the workload and the OS (and possibly the hardware):

- Threads use more resources (stack, OS resources).
- Fibers probably scale better with many, often-blocking tasks.
- Threads scale better with many CPU-bound tasks.
- Fibers are easier to program with.
If you use threads+fibers you probably get the best scalability. On the other hand, while old OSes used to use exactly that hybrid to implement POSIX threads (M:N scheduling), modern Linux doesn’t bother and maps each thread to a kernel thread.
With only fibers you can only use one CPU. With threads+fibers you can scale to multiple cores, but you get most of the complications of preemptive multi-threaded programming in return.
I know. That provides no resource sharing, though. There are quite a few cases where you really want that; consider Prolog services over large Prolog databases. Fibers can also not hop between workers (I guess), so you may end up with an unbalanced distribution of work over your workers. If you used SWI-Prolog threads and engines as fibers, engines could hop between threads.
I don’t see much gain for web servers, though. As is, we have a thread doing the socket accept; after the accept it sends the socket to the worker pool. That is a dynamically scaling (up and down) pool of threads, where one of them will pick up the socket, read the request, do the computation and write the result. Unless a lot of data is involved and the network is slow, the thread will be working. Websockets use a set of open connections and (on most systems) the poll() API to get requests. They also use a dynamically sized set of threads to handle the messages.
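The accept-then-dispatch shape is roughly the sketch below. This is deliberately simplified (fixed pool, made-up names, trivial response), not the actual http library code:

```prolog
:- use_module(library(socket)).
:- use_module(library(readutil)).

% One thread accepts; workers pick sockets from a queue, read the
% request and reply. The real server grows and shrinks the pool.
server(Port, NWorkers) :-
    message_queue_create(Q),
    forall(between(1, NWorkers, _),
           thread_create(worker(Q), _, [detached(true)])),
    tcp_socket(S),
    tcp_bind(S, Port),
    tcp_listen(S, 5),
    accept_loop(S, Q).

accept_loop(S, Q) :-
    tcp_accept(S, Client, _Peer),
    thread_send_message(Q, Client),
    accept_loop(S, Q).

worker(Q) :-
    thread_get_message(Q, Client),
    tcp_open_socket(Client, StreamPair),
    read_line_to_string(StreamPair, _RequestLine),
    format(StreamPair, "HTTP/1.0 200 OK\r\nContent-length: 3\r\n\r\nok\n", []),
    close(StreamPair),
    worker(Q).
```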
Two things go “wrong”: (1) long polling with many clients creates (too) many threads, and engines can help here; (2) long reads/writes over slow networks can leave (too) many threads mostly idle. Threads scale to the thousands, but maybe you can get many more using fibers. Fibers still need shared atom and clause garbage collection, which may limit scalability.
You shouldn’t, but you do need it between fibers on the same worker.
Anyway, concurrent web server programming is typically really easy in SWI-Prolog and comes at pretty low cost. The need for thread synchronization such as mutexes is rare for such tasks; the low-level machinery and libraries guarantee basic consistency, and multiple related changes to shared resources can be handled easily using transaction/1. There is no need to think about how many workers you need, to make sure CPU-intensive tasks are spread over (enough) workers, to set up message passing, etc.
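For example, a hedged sketch (balance/2 and transfer/3 are made up for illustration) of using transaction/1 to make two related updates atomic with respect to other threads:

```prolog
:- dynamic balance/2.

% Move Amount between two accounts. Other threads see either both
% updates or neither; if the goal fails or throws, nothing changes.
transfer(From, To, Amount) :-
    transaction(( retract(balance(From, FB)),
                  FB >= Amount,
                  NFB is FB - Amount,
                  assertz(balance(From, NFB)),
                  retract(balance(To, TB)),
                  NTB is TB + Amount,
                  assertz(balance(To, NTB))
                )).
```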
There are some scenarios where there can be many idle threads that may profit from fibers, but even there I’m not sure whether it will pay off.
I know the model is popular, and for good reason, in imperative OO languages. Typical Prolog computation involves few global resources, though, so you get a lot of isolation for free.
That’s the design that Google uses internally for its web servers (in fact, for almost all servers: by default, a server provides HTTP access for things like monitoring, modifying parameters, debugging, and looking at logs). IIRC, there were some issues when a worker set a callback and the callback ran in a different thread (I don’t know the details of why this was a problem; there was also a standard higher-level framework that guaranteed that the worker and its callback ran in the same thread).
There is more than one kind of context switch. I’m thinking of context switching where “cache affinity” is lost, TLBs are flushed, etc. Possibly this isn’t as important nowadays, as Jan’s benchmarks upthread indicate (The subtle costs of garbage collection - #38 by jan). Either way, one of the big advantages of fibers is that they don’t lose cache affinity as much as regular threads do, because the computation stays on the same CPU as much as possible. The other advantage is that a fiber runs until it needs something (e.g., waiting for I/O), whereas threads might be interrupted while doing computations; so they can lose cache affinity at that point, and again when they reach an epoll/kevent/etc.