We have a solver in which we recently added concurrency by means of first_solution/3. Sometimes first_solution/3 creates and runs up to 128 threads (on an 8-core computer). Each thread is enclosed in a time_out/3 call, so the idea is to wait for the fastest thread, but only until a time_out signal arrives.
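In outline, the pattern is something like the following sketch (prover_a/2 and prover_b/2 are just placeholders, not our actual predicates):

:- use_module(library(thread)).    % first_solution/3
:- use_module(library(timeout)).   % time_out/3 (SICStus emulation)

% Rough sketch of the pattern: alternative provers race for the first
% answer; each one is guarded by time_out/3 (time in milliseconds).
race_provers(Problem, TimeMs, Answer) :-
    first_solution(Answer,
                   [ ( time_out(prover_a(Problem, Answer), TimeMs, R1),
                       R1 == success )
                   , ( time_out(prover_b(Problem, Answer), TimeMs, R2),
                       R2 == success )
                   ],
                   []).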
We have noticed that when many threads are running, the whole computation runs well beyond the limit set in time_out/3. For instance:
?- time(setlog(oplus({X,Y,Z/R},S,T) & noplus({X,Y,Z/R},S,T),2000,Cons,Res,tryp(prover_all))).
% 138,944,518 inferences, 21.690 CPU in 3.537 seconds (613% CPU, 6405999 Lips)
Res = failure.
In this case the timeout is set to 2 s (2000 ms), but time/1 says the goal ran for 3.537 seconds. We were expecting a Res = time_out answer after around 2 s. The tryp(prover_all) argument asks the solver to run 128 threads. When we reduce this number, the computing time tends to approach the timeout. When we call the solver in single-thread mode (i.e. the standard solver), the running time equals the timeout.
Is it possible to lose a time_out signal when there are many threads running concurrently? Can we do something about it (besides reducing the number of threads)?
Furthermore, even if we reduce the number of threads (say to 32), when the same call is run over and over again in rapid succession it seems that not all the threads of the previous calls are killed, which ends up overloading the computer. Is this normal? Could we have a problem in our code?
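A quick way to check this would be a throwaway helper like the one below (running_workers/1 is not part of our solver), which lists the threads still alive between calls:

% Throwaway diagnostic: list threads that are still running, to see
% whether worker threads from a previous call survived.
running_workers(Threads) :-
    findall(T, thread_property(T, status(running)), Threads).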
I do have one indeed, but it would require you to download and install our solver and then run the goal we’re using as an example. The version of the solver on which we’re running the example can be downloaded here:
Thanks for the report. Note that the actual implementation of call_with_time_limit/2 depends on the OS (well, Windows vs. the rest). In both cases it works using a separate thread that signals the threads waiting for a timeout. On a system with 128 tasks competing for 8 cores, this can lead to some delay.
Second, reaching the timeout raises a time_limit_exceeded exception. If you catch that, you may have a problem.
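For example, an over-eager handler inside a worker can swallow the timeout (try_strategy/2 and try_fallback/2 below are just placeholders):

:- use_module(library(time)).              % call_with_time_limit/2

% Problematic: the catch-all also intercepts time_limit_exceeded, so the
% worker just continues with the fallback and runs past the limit.
bad_worker(Problem, Answer) :-
    call_with_time_limit(2,
        catch(try_strategy(Problem, Answer),
              _AnyError,
              try_fallback(Problem, Answer))).

% Better: re-throw timeouts and only handle the errors you really mean to.
good_worker(Problem, Answer) :-
    call_with_time_limit(2,
        catch(try_strategy(Problem, Answer),
              Error,
              (   Error == time_limit_exceeded
              ->  throw(Error)
              ;   try_fallback(Problem, Answer)
              ))).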
Alternatively, and possibly simpler, add one additional task to the pool that simply does sleep(Time), throw(something). If another thread completes first, this watchdog is killed. Otherwise, when it expires, all the others are killed.
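With first_solution/3 that could look roughly like this (untested sketch: prover_a/2 and prover_b/2 are placeholders, and I assume the watchdog's exception propagates out of first_solution/3 so you can map it to a time_out result):

:- use_module(library(thread)).            % first_solution/3

race_with_watchdog(Problem, TimeSecs, Answer) :-
    catch(first_solution(Answer,
                         [ prover_a(Problem, Answer),
                           prover_b(Problem, Answer),
                           ( sleep(TimeSecs),                 % watchdog task
                             throw(time_limit_exceeded) )
                         ],
                         []),
          time_limit_exceeded,
          Answer = time_out).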
We’re using time_out/3 instead of call_with_time_limit/2. Are you suggesting that we move to the latter? Do you think it’s safer/better to use call_with_time_limit/2 rather than time_out/3?
We’re not catching time_limit_exceeded.
Unfortunately, our emulation is not fully compatible with the SICStus original. Notably, Time is measured in wall time instead of virtual CPU time. Virtual CPU time is hard to measure in threaded environments; on most systems you probably need a thread that measures the CPU usage of the monitored thread.
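A rough sketch of such a monitor (polling-based and simplified; this is not how library(timeout) is implemented, and the 50 ms polling interval is arbitrary):

% Polling CPU-time watchdog for a single worker thread.  When the
% worker's cputime exceeds the limit it is signalled with
% time_limit_exceeded.
cpu_limit(Worker, LimitSecs) :-
    thread_create(cpu_monitor(Worker, LimitSecs), _, [detached(true)]).

cpu_monitor(Worker, LimitSecs) :-
    (   catch(thread_statistics(Worker, cputime, Used), _, fail)
    ->  (   Used > LimitSecs
        ->  catch(thread_signal(Worker, throw(time_limit_exceeded)), _, true)
        ;   sleep(0.05),
            cpu_monitor(Worker, LimitSecs)
        )
    ;   true                               % worker no longer exists
    ).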
This bit in the docs of time_out/3 seems to point to a problem and propose a work-around.
time_out/3 is implemented by raising and handling time_out exceptions, so any exception handler in the scope of Goal must be prepared to pass on time_out exceptions. The following incorrect example shows what can happen otherwise:
?- time_out(on_exception(Q,(repeat,false),true), 1000, Res).
Q = time_out, Res = success
Good. Originally, I think, it was implemented differently. SWI-Prolog has always handled timeout events using exceptions, though the exception is time_limit_exceeded, with some preparation to turn this into time_limit_exceeded(Context). Ideally, this should be synchronised.
If SICStus compatibility is not a target, I’d stick with call_with_time_limit/2 (or call_with_time_limit/3 to use the Context).
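If you want to keep the success/time_out result your code expects from time_out/3, a thin wrapper is enough (with_limit/3 is just a name I made up):

:- use_module(library(time)).              % call_with_time_limit/2

% Wrap call_with_time_limit/2 so callers still get a success/time_out
% flag, without catching any unrelated exceptions.
with_limit(Goal, TimeMs, Result) :-
    Secs is TimeMs/1000,
    catch(( call_with_time_limit(Secs, Goal),
            Result = success
          ),
          time_limit_exceeded,
          Result = time_out).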
AFAIK, SICStus uses the POSIX signal SIGVTALRM (maybe SIGXCPU) to trigger a timeout on CPU time. Unfortunately that does not work in multi-threaded environments because, AFAIK, time is measured for the process as a whole and it is not defined which thread will receive the signal. So, instead, all alarm signalling in SWI-Prolog uses a separate thread that maintains a schedule of which thread needs to be signalled when, and then uses Prolog’s thread_signal/2 to signal the right thread. Under heavy load this can be delayed quite a bit, as we see here. Possibly we could improve matters by raising the priority of the scheduling thread; as it uses practically no CPU, that would otherwise be harmless.
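Conceptually the scheme is no more than this toy version (the real scheduler is written in C, keeps an ordered schedule of many deadlines and can cancel entries; the toy handles one request at a time):

:- dynamic scheduler_queue/1.

start_scheduler :-
    message_queue_create(Q),
    asserta(scheduler_queue(Q)),
    thread_create(scheduler_loop(Q), _, [detached(true)]).

% Ask the scheduler to signal Thread after Seconds.
request_timeout(Thread, Seconds) :-
    scheduler_queue(Q),
    get_time(Now),
    Deadline is Now + Seconds,
    thread_send_message(Q, timeout(Thread, Deadline)).

scheduler_loop(Q) :-
    thread_get_message(Q, timeout(Thread, Deadline)),
    get_time(Now),
    Wait is max(0, Deadline - Now),
    sleep(Wait),
    catch(thread_signal(Thread, throw(time_limit_exceeded)), _, true),
    scheduler_loop(Q).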