How to shut down a multithreaded program (not necessarily gracefully)

I’m using: SWI-Prolog version 8.1.17
I want the code to: I’m running a long-running program that uses the swi-mqtt-pack library, which apparently starts additional threads for callbacks. Sometimes I get an MQTT disconnect message from the MQTT broker, and in the callback I’ve set up for that message I want my code to just exit (it runs in a Docker container and will be restarted).

I am currently trying to exit the program with:
thread_send_message(main, quit).

But sometimes I’ll get a message like:
% The following threads wouldn’t die: [(3,0x16ef3d0)]

So it appears that I’d have to go through each of the threads the program might have started and kill them one by one, but I don’t know how to do this. Since I’m doing this from within a callback it could very well be a subthread that I’m doing this from. The “main” of my code is like:

main :-
    setup_signals,
    start_servers(swi_mqtt_brain),
    wait.

where the start_servers starts up the mqtt connection and registers a callback on_disconnect. It’s in the predicate called by the on_disconnect callback that I want to shut down everything, quit, and let my container be restarted by Docker.

The halt/1 machinery tries to do just that: it basically kills all threads using thread_signal(Thread, abort) and then waits a second for them to come to a halt. If some thread doesn’t stop before the timeout, it just calls the OS exit() routine, which should exit the process.
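As a rough illustration of that mechanism (not the actual implementation inside SWI-Prolog), the sketch below starts a worker thread, signals it with abort, and waits for it to finish. The names stop_worker_demo/1 and worker_loop/0 are made up for this example. As far as I know, a thread only handles the signal while it is running Prolog code or sitting in an interruptible blocking call, so a thread stuck inside foreign code is a plausible candidate for the "wouldn’t die" message.

```prolog
% Hypothetical demo: stop one worker thread the way the text describes.
stop_worker_demo(Status) :-
    thread_create(worker_loop, Id, []),
    thread_signal(Id, abort),       % ask the thread to run abort/0
    thread_join(Id, Status).        % blocks until the thread has ended

% A worker that never finishes on its own.
worker_loop :-
    repeat,
    sleep(0.1),
    fail.
```

The exact Status term reported for an aborted thread differs between SWI-Prolog versions, so the sketch does not match on it.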

Originally only the main thread could call halt/1, but now any thread can call halt/1. Telling main to stop is still fine of course.
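If your version allows halt/1 from any thread, the disconnect callback could in principle exit directly instead of messaging main. This is only a sketch: my_on_dis_halt/2 and disconnect_exit_status/1 are hypothetical names, and I am assuming it is safe to call halt/1 from inside the callback thread that the MQTT pack creates, which I have not verified.

```prolog
% Hypothetical exit status; non-zero so the exit is visible as a failure.
disconnect_exit_status(1).

% Hypothetical variant of the on_disconnect callback that halts directly.
my_on_dis_halt(_, _) :-
    format(user_error, "MQTT disconnected, halting~n", []),
    disconnect_exit_status(Status),
    halt(Status).                   % runs at_halt/1 hooks, then exits
```

Going through halt/1 rather than messaging main skips shutdown_servers, which seems acceptable here since the connection is already gone.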

I’m invoking my Prolog code as follows:
/bin/sh -c 'cd /rules; ./brain.pl'

And in brain.pl I’ve got (not showing all the code, just what I hope is needed):

:- initialization(main,main).

main :-
    attach_display_db, 
    setup_signals,
    start_servers(swi_mqtt_brain),
    wait.

setup_signals :-
    on_signal(int,  _, quit),
    on_signal(term, _, quit),
    on_signal(hup, _, reload).

start_servers(ClientId) :-
   connect(ClientId),
   format('-- connect as mqtt client~n'),
   subscribe_all,
   log('brain started'),
   at_halt(format('bye~n')),
   get_short_datetimestamp(Timestamp),
   format('~s brain started~n',[Timestamp]).

quit(_) :-
    get_short_datetimestamp(Timestamp),
    format('~s Quitting on signal~n',[Timestamp]),
    thread_send_message(main, quit).

shutdown_servers :-
   get_short_datetimestamp(Timestamp),
   format('~s brain shutting down~n',[Timestamp]),
   log('brain shutting down'),
   unsubscribe_all,
   disconnect.

wait :-
    thread_self(Me),
    repeat,
        thread_get_message(Me, Msg),
        catch(ignore(handle_message(Msg)), E, print_message(error, E)),
        Msg == quit,
        !,
        shutdown_servers,
        halt(0).

my_on_dis(_,_) :-
   get_short_datetimestamp(Timestamp),
   format('~s MQTT disconnected~n',[Timestamp]),
   quit(_).

So what I’m seeing in the container log when there is an MQTT disconnect is:

20200616 5:50 pm MQTT disconnected
20200616 5:50 pm Quitting on signal
20200616 5:50 pm handle_message quit
20200616 5:50 pm brain shutting down
ERROR: /rules/brain.pl:20: user:main: false
bye

Note that the ‘bye’ message at the bottom is the result of the at_halt/1 call in start_servers. The
ERROR: /rules/brain.pl:20: user:main: false
refers to the initialization line:

:- initialization(main,main).

But the problem is that the SWI-Prolog process doesn’t always actually exit. It just seems to stay alive, though I don’t know what it’s doing. So ./brain.pl doesn’t exit, and the container doesn’t exit. I thought that the call to thread_send_message(main, quit) in quit/1 would be enough to kill the entire process, but it doesn’t do so reliably.

main/0 fails if shutdown_servers/0 fails. I guess that can be the case if unsubscribe_all/0 fails or disconnect/0 fails.
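One way to keep a failing or throwing cleanup step from preventing the halt(0) that follows it is to wrap the cleanup goal. This is a minimal sketch; safe_shutdown/1 is a hypothetical helper, not something from the original code or from the MQTT pack.

```prolog
% Hypothetical helper: run Goal, but always succeed.
% Failure is ignored and exceptions are printed as errors.
safe_shutdown(Goal) :-
    catch(ignore(Goal), E, print_message(error, E)).
```

In wait/0 the call to shutdown_servers would then become safe_shutdown(shutdown_servers), so halt(0) is reached even when unsubscribe_all/0 or disconnect/0 fails.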

Especially with foreign extensions and threads involved, it is not very hard to deadlock during shutdown. The typical solution I use is to attach the C debugger and get the stack traces. The rough sequence is:

% gdb swipl pid    # pid is the process id of the hanging server
(gdb) info threads
# find interesting threads and do
(gdb) thread N   # id of interesting thread
(gdb) bt
<stack dump>
(gdb) call PL_backtrace(25,1)
<Prolog stack dump for thread>

See PL_backtrace(). If the error stream is rebound, you can use PL_backtrace_string() instead to get the backtrace in the gdb terminal. Its arguments are the same as for PL_backtrace():

(gdb) printf "%s", PL_backtrace_string(25,1)