Pengine client recovery after server timeout

I’m trying to find a way to gracefully recover a client after the server times out a pengine session. What I think is happening:

  1. A pengine session is established, but after a timeout on the server side the server terminates the session. I’m not sure whether it sends a message at that point but, in any case, the client isn’t ready to receive it. (It’s in the read phase of a read/eval loop.)

  2. After some time the client completes the read, generates an “ask”, and gets back an existence error; that makes sense. However, at this point I can’t find a way to “destroy” the client-side part of the pengine. pengine_destroy/1 appears to succeed, but pengine_property/2 indicates the pengine is still there.

The consequence of this is that the event loop for any pengines subsequently created doesn’t terminate normally, because the loop thinks there are other pengines still alive.
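
For concreteness, the kind of interaction I mean at the client top level (the ID is made up and this is a sketch from memory, not a verbatim transcript):

?- pengine_destroy('3a9c...').
true.                              % appears to succeed, but ...

?- pengine_property('3a9c...', self(P)).
P = '3a9c...'.                     % ... the client still lists the pengine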

Is this a hole in my understanding or a hole in the protocol implementation?

I think you are using the Prolog client (pengine_rpc) to access a remote server? I wouldn’t be surprised if error handling is lousy. Too many things to worry about right now. I am happy to see a pull request though :slight_smile:

I’m actually trying to write my own Prolog client to access a remote server (in another process) to make sure I understand the protocol before attempting a JavaScript version.

Everything seems to be working fairly well until the server times out and leaves the client side in a bit of a mess. So far my only strategy is to restart the client SWI-Prolog process and start over. Obviously I’d like something a little less drastic. I’ll dig into the pengine code to see if I can figure out what’s required.

I think the provided JavaScript client deals with this stuff ok. You’ll have to check the code to see what is supposed to happen. I think that if you send a request and wait for an answer you get a timeout error, after which the server side dies (and the client should thus assume the server is gone). Of course, the communication may also break due to a timeout or network event (connection reset by peer), and trying to ask for an answer may result in an existence error because the server reclaimed the pengine for some reason.

I don’t think I’m dealing with a client-side timeout:

  1. The client has an active pengine but is not waiting for a reply.
  2. The server times out. Nothing happens at the client, perhaps because the protocol is master/slave (?). I think this is the root cause because it’s an autonomous action taken by the server, i.e., it’s the master.
  3. The client subsequently sends a request.
  4. The server responds with an existence error. I’m thinking that the client then tries to send a message to destroy the (already dead) remote pengine.
  5. The server responds with another existence error. The destroy response which (I think) triggers the cleanup at the client end never arrives and the client pengine remains in limbo.

One possibility: if the server gets a destroy request for a non-existent pengine, send a destroy reply (the server is already in the correct state).
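
In Prolog-ish pseudocode, the suggestion amounts to something like this (purely a sketch; none of these predicate names exist in pengines.pl):

% Hypothetical server-side handling of a destroy request.
handle_destroy_request(ID, Reply) :-
	(   known_pengine(ID)
	->  destroy_pengine(ID, Reply)    % normal case: kill it and reply with destroy
	;   Reply = destroy(ID)           % already gone: reply with destroy anyway
	).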

That sounds like a plausible scenario. The client must respond to (4) by deleting its admin for the pengine and raising an error to its caller. Also consider the case where the server has actually been restarted and the pengine is thus gone. In that case I guess you would also expect a 404? In the timeout scenario this seems appropriate as well, as it allows the server to clean up everything it knows about the client immediately.

The server could remember that it killed the pengine and answer with a destroy, but I don’t think this is any more informative, and it merely raises the question of when the server should assume the client will never ask and it can thus dispose of the information that the pengine was killed.

P.s. The way it works in the implementation is that the Pengine writes a message to the response queue with a timeout. That waits until an HTTP request for the message arrives. If sending times out, the Pengine will commit suicide and remove all its traces (at least, that is the picture I have in my memory).
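
In other words, roughly the sketch below (my reading of that picture, not the actual pengines.pl source; reply_event/3 and self_destroy/0 are made-up names). thread_send_message/3 with a timeout(Seconds) option fails when the message cannot be delivered in time, which assumes the output queue was created with a max_size so the send blocks until the HTTP side reads from it:

% Sketch only: hand an event to the pengine's output queue and wait for the
% HTTP side to pick it up; if that times out, the pengine cleans itself up.
reply_event(OutputQueue, Event, TimeLimit) :-
	(   thread_send_message(OutputQueue, Event, [timeout(TimeLimit)])
	->  true
	;   self_destroy                  % stand-in for the real cleanup
	).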

I agree. The key thing is that the client “deletes the admin”. Perhaps that should always be done when it receives an existence error which, I assume, is how the server always responds when there isn’t an active pengine on its end. I would also assume a 404 gets returned when the pengine service itself isn’t running.

If the client can’t always arrange to delete the admin under these circumstances, then the application needs the ability to force quit a dead pengine so it doesn’t prevent the event loop from terminating.
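
One candidate primitive for that force quit is pengine_destroy/2, which takes a force(true) option; whether that is enough to clear the client-side admin for a remote pengine is exactly what I haven’t been able to confirm. A minimal wrapper (force_quit/1 is just my name for it):

% Best-effort removal of a pengine the application believes is already dead.
force_quit(ID) :-
	catch(pengine_destroy(ID, [force(true)]), Error,
	      print_message(warning, Error)).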

Yes. Only, I have little clue what you wrote on top of what and thus whether this is a bug in pengines.pl or something in your code. If you want help, please make some code available so I don’t have to guess …

My experiment is to construct a read/eval/print loop - like the Prolog top-level except the queries are sent to a remote server. A query of exit. exits back to the normal top level. To start the loop, enter the query p_listener(ServerURL). which is implemented as:

:- use_module(library(pengines)).

% destroy(false) keeps the pengine alive after each query completes.
p_listener(URL) :-
	pengine_create([server(URL),destroy(false)]),
	pengine_event_loop(client_handler,[]).

Then there are a set of client_handler clauses. Here’s a relevant subset:

client_handler(create(ID, _)) :-
	writeln(pengine_create(ID)),
	p_query(ID).

client_handler(success(ID, [X], false)) :-
	!,
	writeln(X),nl,
	p_query(ID).

client_handler(success(ID, [X], true)) :-
	write(X),
	p_next(ID).
	
client_handler(failure(ID)) :-
	writeln(false),nl,
	p_query(ID).

client_handler(stop(ID)) :-
	p_query(ID).

client_handler(error(ID, error(existence_error(pengine,ID),Etc))) :- 
	!,
	fail.  % ??

client_handler(error(ID, Err)) :-
	writeln(Err),nl,
	p_query(ID).
	
client_handler(destroy(ID)) :-
	writeln(pengine_destroy(ID)).
	

The listener part (p_query and p_next):

p_query(ID) :-
	prompt1('>?- '), 
	catch(read(Query),Err,(writeln(Err),fail)), !,
	(Query=exit ->
		pengine_destroy(ID) ;
		pengine_ask(ID, Query, [])
	).
p_query(ID) :- p_query(ID).  % read failed, try again.

p_next(ID) :-
	current_input(In),
	read_pending_codes(In,_Chs,[]),   % flush pending input chars
	prompt1(' '),get_single_char(C),  % get single char
	put_code(C),nl,                   % echo with nl
	(C==59 ->                         % if a ';'
		pengine_next(ID,[]) ;         %    get next
		pengine_stop(ID,[])           %    else stop query
	).

So the question is: what should the application do when the error event existence_error(pengine,ID) arrives? And when a destroy(ID) event arrives?
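
Concretely, is something along these lines the intended handling for the existence error (a sketch replacing the clause above that just fails; like the force quit mentioned earlier, it leans on pengine_destroy/2 with force(true), which I haven’t verified clears the client-side admin)?

client_handler(error(ID, error(existence_error(pengine,ID), _))) :-
	!,
	catch(pengine_destroy(ID, [force(true)]), _, true),
	writeln(pengine_lost(ID)).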

This might be completely irrelevant to your problem, but I remember that when I was trying to get a Prolog server running on an Ubuntu EC2 instance on AWS, I kept having the server time out, and nothing I did in the code made any difference. What I found did work was to use the tmux command (terminal multiplexer) to start a session that would persist when I logged out of the server. After starting tmux I’d start up my Prolog server and then detach from the session (with Ctrl-B, D). After that I could happily log out of the server and there would be no timeouts.

To rejoin it later, use
tmux attach

Thanks for the tip. I need to get into server testing at some point, but my main current focus is a robust working client: something that can survive server instability without hanging in the event loop due to dead client-side pengines.

If you’re looking for hints about the protocol, I implemented

though I was just guessing. AFAIK no one has actually used it in production.

Thanks for taking an interest. I have a strong suspicion that it’s a Prolog client issue having to do with how events are handled (event loop dependent on admin state). As a Prolog client is not my end objective in any case, I should probably move on.