We have a solver in which we recently added concurrency by means of first_solution/3. Sometimes first_solution/3 creates and runs up to 128 threads (on an 8-core computer). Each thread is enclosed in a time_out/3 call, so the idea is to wait for the fastest thread, but only until a time_out signal arrives.
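In outline, the pattern is something like the following sketch (prover_a/2 and prover_b/2 are just placeholders, not our actual predicates):

:- use_module(library(thread)).    % first_solution/3
:- use_module(library(timeout)).   % time_out/3 (SICStus emulation)

% Rough sketch of the pattern: alternative provers race for the first
% answer; each one is guarded by time_out/3 (time in milliseconds).
race_provers(Problem, TimeMs, Answer) :-
    first_solution(Answer,
                   [ ( time_out(prover_a(Problem, Answer), TimeMs, R1),
                       R1 == success )
                   , ( time_out(prover_b(Problem, Answer), TimeMs, R2),
                       R2 == success )
                   ],
                   []).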
We have noticed that when many threads are running, the whole computation runs well beyond the limit set in time_out/3. For instance:
?- time(setlog(oplus({X,Y,Z/R},S,T) & noplus({X,Y,Z/R},S,T),2000,Cons,Res,tryp(prover_all))).
% 138,944,518 inferences, 21.690 CPU in 3.537 seconds (613% CPU, 6405999 Lips)
Res = failure.
In this case the timeout is set to 2 s (2000 ms), but time/1 says the goal ran for 3.537 seconds. We were expecting a Res = time_out answer after around 2 s. The tryp(prover_all) argument asks the solver to run 128 threads. When we reduce this number, the computing time tends to approach the timeout. When we call the solver in single-thread mode (i.e. the standard solver), the running time equals the timeout.
Is it possible to lose a time_out signal when there are many threads running concurrently? Can we do something about it (besides reducing the number of threads)?
Furthermore, even if we reduce the number of threads (say to 32), when the same call is run over and over again in rapid succession it seems that not all the threads of the previous calls are killed, which ends up overloading the computer. Is this normal? Could we have a problem in our code?
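A quick way to check this would be a throwaway helper like the one below (running_workers/1 is not part of our solver), which lists the threads still alive between calls:

% Throwaway diagnostic: list threads that are still running, to see
% whether worker threads from a previous call survived.
running_workers(Threads) :-
    findall(T, thread_property(T, status(running)), Threads).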
I do have one indeed, but it would require you to download and install our solver and then run the goal we’re using as an example. The version of the solver on which we’re running the example can be downloaded here:
Thanks for the report. Note that the actual implementation of call_with_time_limit/2 depends on the OS (well, Windows vs. the rest). In both cases it works using a separate thread that signals the threads waiting for a timeout. On a system with 128 tasks competing for 8 cores, this can lead to some delay.
Second, reaching the timeout raises a time_limit_exceeded exception. If you catch that, you may have a problem.
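For example, an over-eager handler inside a worker can swallow the timeout (try_strategy/2 and try_fallback/2 below are just placeholders):

:- use_module(library(time)).              % call_with_time_limit/2

% Problematic: the catch-all also intercepts time_limit_exceeded, so the
% worker just continues with the fallback and runs past the limit.
bad_worker(Problem, Answer) :-
    call_with_time_limit(2,
        catch(try_strategy(Problem, Answer),
              _AnyError,
              try_fallback(Problem, Answer))).

% Better: re-throw timeouts and only handle the errors you really mean to.
good_worker(Problem, Answer) :-
    call_with_time_limit(2,
        catch(try_strategy(Problem, Answer),
              Error,
              (   Error == time_limit_exceeded
              ->  throw(Error)
              ;   try_fallback(Problem, Answer)
              ))).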
Alternatively, and possibly simpler, add one additional task to the pool that simply does sleep(Time), throw(something). If another thread completes first, this watchdog is killed. Otherwise, when it expires, all the others are killed.
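With first_solution/3 that could look roughly like this (untested sketch: prover_a/2 and prover_b/2 are placeholders, and I assume the watchdog's exception propagates out of first_solution/3 so you can map it to a time_out result):

:- use_module(library(thread)).            % first_solution/3

race_with_watchdog(Problem, TimeSecs, Answer) :-
    catch(first_solution(Answer,
                         [ prover_a(Problem, Answer),
                           prover_b(Problem, Answer),
                           ( sleep(TimeSecs),                 % watchdog task
                             throw(time_limit_exceeded) )
                         ],
                         []),
          time_limit_exceeded,
          Answer = time_out).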
We’re using time_out/3 instead of call_with_time_limit/2. Are you suggesting that we move to the latter? Do you think it’s safer/better to use call_with_time_limit/2 rather than time_out/3?
We’re not catching time_limit_exceeded.
Unfortunately, our emulation is not fully compatible with the SICStus original. Notably, Time is measured in wall time instead of virtual CPU time. Virtual CPU time is hard to measure in threaded environments; on most systems you probably need a thread that measures the CPU usage of the monitored thread.
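A rough sketch of such a monitor (polling-based and simplified; this is not how library(timeout) is implemented, and the 50 ms polling interval is arbitrary):

% Polling CPU-time watchdog for a single worker thread.  When the
% worker's cputime exceeds the limit it is signalled with
% time_limit_exceeded.
cpu_limit(Worker, LimitSecs) :-
    thread_create(cpu_monitor(Worker, LimitSecs), _, [detached(true)]).

cpu_monitor(Worker, LimitSecs) :-
    (   catch(thread_statistics(Worker, cputime, Used), _, fail)
    ->  (   Used > LimitSecs
        ->  catch(thread_signal(Worker, throw(time_limit_exceeded)), _, true)
        ;   sleep(0.05),
            cpu_monitor(Worker, LimitSecs)
        )
    ;   true                               % worker no longer exists
    ).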
This bit in the docs of time_out/3 seems to point to a problem and propose a work-around.
time_out/3 is implemented by raising and handling time_out exceptions, so any exception handler in the scope of Goal must be prepared to pass on time_out exceptions. The following incorrect example shows what can happen otherwise:
?- time_out(on_exception(Q,(repeat,false),true), 1000, Res).
Q = time_out, Res = success
Good. Originally, I think, it was implemented differently. SWI-Prolog has always handled timeout events using exceptions, though the exception is time_limit_exceeded, with some preparation to turn this into time_limit_exceeded(Context). Ideally, this should be synchronised.
If SICStus compatibility is not a target, I’d stick with call_with_time_limit/2 (or call_with_time_limit/3 to use the Context).
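If you want to keep the success/time_out result your code expects from time_out/3, a thin wrapper is enough (with_limit/3 is just a name I made up):

:- use_module(library(time)).              % call_with_time_limit/2

% Wrap call_with_time_limit/2 so callers still get a success/time_out
% flag, without catching any unrelated exceptions.
with_limit(Goal, TimeMs, Result) :-
    Secs is TimeMs/1000,
    catch(( call_with_time_limit(Secs, Goal),
            Result = success
          ),
          time_limit_exceeded,
          Result = time_out).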
AFAIK, SICStus uses the POSIX signal SIGVTALRM (maybe SIGXCPU) to trigger a timeout on CPU time. Unfortunately that does not work in multi-threaded environments because, AFAIK, time is measured for the process as a whole and it is not defined which thread will receive the signal. So, instead, all alarm signalling in SWI-Prolog uses a separate thread that maintains a schedule of which thread needs to be signalled when, and then uses Prolog’s thread_signal/2 to signal the right thread. Under heavy load this can be delayed quite a bit, as we see here. Possibly we could improve matters by raising the priority of the scheduling thread; as it uses practically no CPU, that would otherwise be harmless.
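Conceptually the scheme is no more than this toy version (the real scheduler is written in C, keeps an ordered schedule of many deadlines and can cancel entries; the toy handles one request at a time):

:- dynamic scheduler_queue/1.

start_scheduler :-
    message_queue_create(Q),
    asserta(scheduler_queue(Q)),
    thread_create(scheduler_loop(Q), _, [detached(true)]).

% Ask the scheduler to signal Thread after Seconds.
request_timeout(Thread, Seconds) :-
    scheduler_queue(Q),
    get_time(Now),
    Deadline is Now + Seconds,
    thread_send_message(Q, timeout(Thread, Deadline)).

scheduler_loop(Q) :-
    thread_get_message(Q, timeout(Thread, Deadline)),
    get_time(Now),
    Wait is max(0, Deadline - Now),
    sleep(Wait),
    catch(thread_signal(Thread, throw(time_limit_exceeded)), _, true),
    scheduler_loop(Q).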