Engine timeout and cross-thread destroy issues

With Claude's help, I'm looking for a way to implement the stateless HTTP API using engines instead of threads, since engines are cheaper to cache. It's looking good except for one thing: the engine-backed variant does not currently implement a robust timeout for nonterminating goals. Claude tested the obvious approaches: call_with_time_limit/2 did not interrupt tight loops inside engines, and destroying a running engine from another thread was unsafe in this SWI build.

Claude, the floor is yours …

Environment

Observed on:

SWI-Prolog version 10.1.3 for fat-darwin

Programmatic version:

?- current_prolog_flag(version_data, Version).
Version = swi(10, 1, 3, []).

Issue 1: call_with_time_limit/2 does not interrupt engine_next/2

For ordinary Prolog code, call_with_time_limit/2 behaves as expected:

?- catch(call_with_time_limit(0.1, (repeat, fail)),
         Error,
         writeln(Error)).
time_limit_exceeded

But the equivalent goal inside an engine does not appear to be interrupted:

?- engine_create(true, (repeat, fail), Engine),
   catch(call_with_time_limit(0.1, engine_next(Engine, _)),
         Error,
         writeln(Error)).

Expected behavior:

The call should raise time_limit_exceeded, or the limitation should be documented.

Observed behavior:

The call does not return or throw within several seconds.

During one run, SWI also printed this diagnostic before the process was killed externally:

foreign predicate system:repeat/0 did not clear exception:
    error(signal(alrm,14),context(repeat/0,_))

This suggests the alarm signal may be delivered but not converted into the expected time_limit_exceeded exception while executing through engine_next/2.

Minimal reproduction

:- initialization(main, main).

main :-
    current_prolog_flag(version_data, Version),
    format('SWI version: ~q~n', [Version]),

    catch(
        call_with_time_limit(0.1, (repeat, fail)),
        PlainError,
        format('plain loop: ~q~n', [PlainError])
    ),

    engine_create(true, (repeat, fail), Engine),
    catch(
        call_with_time_limit(0.1, engine_next(Engine, _)),
        EngineError,
        format('engine loop: ~q~n', [EngineError])
    ),

    writeln(done).

Expected output:

SWI version: swi(10,1,3,[])
plain loop: time_limit_exceeded
engine loop: time_limit_exceeded
done

Observed output:

SWI version: swi(10,1,3,[])
plain loop: time_limit_exceeded

Then the process hangs in the engine case.

Issue 2: engine_destroy/1 can segfault when called from another thread

As a workaround for the timeout problem above, I tried running engine_next/2 in a worker thread and destroying the engine from the main thread when a timeout expires. This appears unsafe: if one thread is actively running an engine and another thread calls engine_destroy/1 on that engine, SWI segfaults.

Expected behavior:

engine_destroy/1 should fail, throw, or safely interrupt/destroy the engine. It should not crash the process.

Observed behavior:

Segmentation fault.

Minimal reproduction

:- initialization(main, main).

main :-
    current_prolog_flag(version_data, Version),
    format('SWI version: ~q~n', [Version]),
    engine_create(true, (repeat, fail), Engine),
    thread_create(run_engine(Engine), Worker, []),
    sleep(0.05),
    writeln('Destroying engine from main thread while worker is running it...'),
    engine_destroy(Engine),
    writeln('engine_destroy/1 returned'),
    thread_join(Worker, Status),
    format('Worker status: ~q~n', [Status]).

run_engine(Engine) :-
    catch(
        engine_next(Engine, Answer),
        Error,
        (   format('Worker caught: ~q~n', [Error]),
            fail
        )
    ),
    format('Worker answer: ~q~n', [Answer]).

Observed output:

SWI version: swi(10,1,3,[])
Destroying engine from main thread while worker is running it...

ERROR: Received fatal signal 11 (segv)

Why this matters

The use case is a stateless HTTP API that evaluates Prolog goals with paged responses. A thread-backed implementation can cache suspended toplevel actor threads between requests, but engines look like a better fit because cached engines should be much cheaper than cached threads.

For example, an engine-backed cache entry can store:

cache(GoalId, NextOffset, Engine)

instead of a suspended actor thread:

cache(GoalId, NextOffset, Pid)

The engine can be made to yield whole protocol chunks, so it does not need to compute one solution beyond the current response:

chunk_session(Goal, Template, Offset, Answer) :-
    engine_fetch(Limit0),
    Limit = count(Limit0),
    chunk_loop(Goal, Template, Offset, Limit, Answer).

chunk_loop(Goal, Template, Offset, Limit, Answer) :-
    answer(Goal, Template, Offset, Limit, Answer),
    (   Answer = success(_, true)
    ->  engine_yield(Answer),
        engine_fetch(NextLimit),
        nb_setarg(1, Limit, NextLimit),
        fail
    ;   true
    ).
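
On the caller side, each request can then post the page size into the engine and receive one chunk via engine_post/3. A minimal sketch (next_chunk/3 is a hypothetical helper name, not part of the actual implementation):

```prolog
% Driver side (sketch): each HTTP request posts the requested page
% size into the engine and receives the next protocol chunk.  The
% posted term is what chunk_session/4 reads via engine_fetch/1.
next_chunk(Engine, PageSize, Chunk) :-
    engine_post(Engine, PageSize, Chunk).
```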

This works for normal finite goals and cached paging. The missing piece is a robust way to bound execution time for nonterminating goals.

Questions

  1. Should call_with_time_limit/2 interrupt nonterminating Prolog execution while it is being run through engine_next/2 or engine_post/3?
  2. If not, is this an intentional limitation that should be documented?
  3. Should engine_destroy/1 detect and reject attempts to destroy an engine that is currently running in another thread?
  4. Is there a recommended pattern for implementing timeouts around engines used this way?

engine_next/2 is specified:

% Switch control to Engine and if engine produces a result, switch
% control back and unify the instance of Template from
% engine_create/3,4 with Term.

Most likely call_with_time_limit/2 is bootstrapped from thread_signal/2.
If thread_signal/2 places the signal in the engine context of the
parent engine that called engine_next/2, then the child engine
will not see the signal. The same happens in other Prolog systems
when a signal is sent to a task: it does not automatically propagate
to other tasks. Erlang was a little different in that it kept a link
from an agent to its spawned agents and did some bookkeeping,
but many modern async libraries have abandoned this link; their
viewpoint is that async tasks live in a sea of tasks.

One structuring method that has become popular recently, for
example in the Python world, is the so-called task group: if you
have a task group notion and put a task group parenthesis around
your work, you can tear down the whole group at once. I am
currently using task groups in a notebook application. The
implementation is a little limited since there is only one task
group, but more mature systems allow multiple task groups, which
also addresses some quirks of the sea-of-tasks model that pertain
to the life cycle of such tasks:

The Heisenbug lurking in your async code
Will McGugan - February 11, 2023
https://textual.textualize.io/blog/2023/02/11/the-heisenbug-lurking-in-your-async-code/

More a feature than a bug, i.e. task garbage collection.

Indeed. Not sure that has a reasonable fix. It has a reasonable work-around: apply the time limit inside the engine to the goal running for the next answer.
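
A minimal sketch of that work-around (the wrapper name is illustrative): create the engine with the time limit wrapped around the goal, so the alarm fires on whichever thread is actually executing the engine. Note this bounds the total running time of the goal, not each individual engine_next/2 call.

```prolog
% Bound the engine's goal by Limit seconds from inside the engine.
% engine_next/2 then either delivers an answer or rethrows
% time_limit_exceeded in the calling thread.
timed_engine_create(Limit, Template, Goal, Engine) :-
    engine_create(Template,
                  call_with_time_limit(Limit, Goal),
                  Engine).
```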

It should not segfault. Pushed a fix that makes the call wait until the engine is done, which implies this example hangs.

You should either do your timeout management inside the engine or use threads. On Linux these scale well enough, i.e., 100K threads won't starve the machine. macOS has a much lower thread limit (a few thousand). I don't know about Windows.
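
For the thread-based route, one hedged sketch of a bounded call uses a result queue together with a timed thread_get_message/3 (goal_with_deadline/3 and report/2 are assumed helper names, not an official API):

```prolog
% Run Goal in a worker thread; give up after Limit seconds.
% Result is one of true, false, error(E) or timeout.
goal_with_deadline(Limit, Goal, Result) :-
    message_queue_create(Queue),
    thread_create(report(Goal, Queue), Worker, []),
    (   thread_get_message(Queue, Result0, [timeout(Limit)])
    ->  Result = Result0
    ;   thread_signal(Worker, throw(time_limit_exceeded)),
        Result = timeout
    ),
    thread_join(Worker, _),
    message_queue_destroy(Queue).

% Worker: run the goal once, catch any exception, report the outcome.
report(Goal, Queue) :-
    (   catch(Goal, E, R = error(E))
    ->  ( var(R) -> R = true ; true )
    ;   R = false
    ),
    thread_send_message(Queue, R).
```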