Web Prolog ISOBASE Test Cluster - Hack Away, Find Security Bugs

Hi All,

Based on an early prototype, I’ve deployed two Web Prolog ISOBASE profile nodes. This is alpha-quality code. I’m putting it up early because I want to find security issues now, rather than later when there are important things on these machines and I’m relying on them for public testing by the masses.

There’s not much to see from the outside other than the fact that queries work, including distributed queries. All of this has been provided by Torbjörn; I’ve only fixed a few bugs to make it work in a cluster on named hosts instead of only on ‘localhost’.

There are two nodes with sample knowledge bases at:

http://one.prolog.computer:3010 and http://two.prolog.computer:3010

There’s a link at the bottom of the page which opens a new tab with the source of the knowledge bases loaded on each node.

These nodes can utilize each other for distributed Web queries via RPC. For example, go to the http://one.prolog.computer:3010 page and submit the following example query (thank you Torbjörn!):

?- actress(M, scarlett_johansson, _),
   rpc('http://two.prolog.computer:3010', director(M, D)).

This allows true stateless distributed Web queries, with the query and solution index kept on the client side.

We need to know whether these nodes can be hacked. They use the sandboxing built into SWI-Prolog. I can imagine all kinds of ways that sandboxing might be broken, and, as I said, I want to find out now if there’s a security hole.

We are now looking into a good way to do performance and scalability testing, too. Ideas appreciated!

All my best,
Damon

Thanks for trying it out! No, what we refer to as the ISOBASE profile doesn’t support code injection. Here’s the diagram showing how Web Prolog, very tentatively, has been split up into profiles:

As you can see, unless we are querying an ISOTOPE or ACTOR node, source code injection isn’t supported.

The ISOBASE profile provides two things: a stateless HTTP API (or “RESTful” if you want) which can be used from e.g. JavaScript, and the rpc/2-3 predicate which can be used from Prolog.

The shell is written in JavaScript and talks to the node over the stateless HTTP API. The interaction is of course restricted (no I/O etc), but that’s what you should expect for an ISOBASE node.

Suppose you make the following query:

?- movie(A,B).
A = american_beauty, 
B = 1999 <blinking cursor>

Now, if Damon decides to restart his nodes tonight, you would still be able to continue exploring the answers to your query tomorrow, and get the correct second solution, just as if the node had not been restarted:

?- movie(A,B).
A = american_beauty, 
B = 1999 ;
A = anna,
B = 1987 <blinking cursor>

That’s a true sign of a stateless API. Only the client keeps track of the state of the interaction, and it represents it with a pair consisting of the current query and an integer - that’s all. The node doesn’t need to remember anything from the previous interaction.

The statelessness of the API makes it necessary to handle slow or non-terminating queries in a special way, as we cannot abort a query once it has been submitted. Here’s an example:

?- repeat, fail.
Error: Time limit (1s) exceeded.

To get an idea of how the stateless HTTP API works, you may want to select the following links. (How the responses are shown depends on your browser.)

http://one.prolog.computer:3010/ask?query=movie(A,B)&limit=5

Then let’s ask for the next five solutions:

http://one.prolog.computer:3010/ask?query=movie(A,B)&offset=5&limit=5

Adding a request parameter format=prolog gives us Prolog text rather than JSON:

http://one.prolog.computer:3010/ask?query=movie(A,B)&limit=5&format=prolog
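To make the request format concrete, here’s a small client-side sketch in Python that builds such /ask URLs. (The helper name ask_url, the fmt parameter, and the argument ordering are my own, not part of the API.)

```python
from urllib.parse import urlencode

def ask_url(base, query, offset=0, limit=1, fmt=None):
    """Build a stateless /ask request URL for a node at `base`.

    All interaction state lives in the URL itself: the query text,
    the offset into its solution sequence, and a page size (limit).
    """
    params = [("query", query)]
    if offset:
        params.append(("offset", offset))
    params.append(("limit", limit))
    if fmt:
        params.append(("format", fmt))
    return base + "/ask?" + urlencode(params)

# e.g. ask_url("http://one.prolog.computer:3010", "movie(A,B)", limit=5)
```

Any HTTP client can then fetch the resulting URL; from the node’s point of view, every request is independent and self-describing.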

The rpc/2-3 predicate is a meta-predicate for making non-deterministic remote procedure calls. It allows a process running in a node A to call and try to solve a query in the Prolog context of another node B, taking advantage of the data and programs being offered by B, just as if they were local to A.

That is why, as Damon showed, the following works from the shell at http://one.prolog.computer:3010

?- actress(M, scarlett_johansson, _), 
   rpc('http://two.prolog.computer:3010',director(M, D)).
D = peter_webber, 
M = girl_with_a_pearl_earring ;
D = sofia_coppola,  
M = lost_in_translation ;
... 

(It’s a silly example, of course, since director/2 is available at http://one.prolog.computer:3010 as well. We need to come up with better examples here.)

Here’s how rpc/2-3 is implemented on top of the stateless HTTP API:

:- use_module(library(http/http_open)).
:- use_module(library(option)).
:- use_module(library(url)).

rpc(URI, Query) :-
    rpc(URI, Query, []).

rpc(URI, Query, Options) :-
    parse_url(URI, Parts),
    option(limit(Limit), Options, 1),
    format(atom(QueryAtom), '(~q)', [Query]),
    rpc(Query, 0, Limit, QueryAtom, Parts, Options).
    
rpc(Query, Offset, Limit, QueryAtom, Parts, Options) :-    
    parse_url(ExpandedURI, [
        path('/ask'),
        search([query=QueryAtom, offset=Offset, limit=Limit, format=prolog])
      | Parts
    ]),
    setup_call_cleanup(
        http_open(ExpandedURI, Stream, Options),
        read(Stream, Answer), 
        close(Stream)),
    wait_answer(Answer, Query, Offset, Limit, QueryAtom, Parts, Options).

wait_answer(error(Error), _, _, _, _, _, _) :-
    throw(Error).
wait_answer(failure, _, _, _, _, _, _) :-
    fail.
wait_answer(success(Solutions, false), Query, _, _, _, _, _) :- !,
    member(Query, Solutions).
wait_answer(success(Solutions, true), 
            Query, Offset0, Limit, QueryAtom, Parts, Options) :-
    (   member(Query, Solutions)
    ;   Offset is Offset0 + Limit,
        rpc(Query, Offset, Limit, QueryAtom, Parts, Options)
    ).

The most important thing here is that the HTTP API is stateless. This is what makes it different from the HTTP API that’s available in library(pengines), and the pengine_rpc/2-3 that this library offers.

A naive implementation of the stateless HTTP query API is easy to build. For a request of the form <BaseURI>/ask?query=<Q>&offset=<N>&limit=<M> the node needs to compute the slice of solutions 0..N+M, drop the solutions 0..N and respond with the rest of the slice. Even a CGI script could do this, but this would be slow, not only because CGI by its nature is slow, but also since such a naive approach may involve a lot of recomputation. Computing the first slice (i.e. the one starting at offset 0) is as fast as it can be, but computing the second slice involves the recomputation of the first slice and, more generally, computing the nth slice involves the recomputation of all preceding slices, the results of which are then just thrown away. This, of course, is a waste of resources and puts an unnecessary burden on the node.
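To illustrate the naive scheme in runnable form, here’s a Python sketch with a made-up movie table standing in for a node’s knowledge base (ask_naive and the data are mine, purely for illustration):

```python
from itertools import islice

# Hypothetical stand-in for a node's movie/2 facts.
MOVIES = [("american_beauty", 1999), ("anna", 1987),
          ("girl_with_a_pearl_earring", 2003), ("lost_in_translation", 2003)]

def solve():
    # Enumerates solutions from the start, as Prolog backtracking would.
    yield from MOVIES

def ask_naive(offset, limit):
    # Compute the slice 0..offset+limit, then drop the first `offset`
    # solutions: every request recomputes all preceding slices.
    return list(islice(solve(), offset, offset + limit))
```

Each call re-enumerates everything before the requested slice, which is exactly the recomputation the caching scheme described next avoids.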

To implement the HTTP API in a way that makes it both stateless and scalable we use a novel kind of caching scheme that Jan W and I experimented with some years ago, and which has now been refined and reimplemented. It’s somewhat complicated, and I won’t describe it here unless you ask me to, but it seems to work just fine. One way to see that recomputation is avoided is to run the following query at http://one.prolog.computer:3010:

?- (sleep(0.9), X=a ; X=b).

You will then find that because of the call to sleep/1, the answer X=a will take around a second to appear, but when you ask for the second answer, it will appear immediately. This shows that the first disjunct isn’t recomputed.
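The observable behaviour can be mimicked with a toy node-side cache. This is only my reading of the scheme, not the actual implementation: a suspended generator stands in for a pooled pengine, keyed by (query id, offset) and popped atomically, so a paused computation is resumed rather than redone:

```python
import itertools
import threading

class StatelessQueryCache:
    """Toy model: maps (query_id, offset) -> suspended generator."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def ask(self, query_id, make_gen, offset, limit):
        with self._lock:
            gen = self._cache.pop((query_id, offset), None)
        if gen is None:
            # Cache miss: fall back to the naive catch-up.
            gen = make_gen()
            for _ in itertools.islice(gen, offset):
                pass
        solutions = list(itertools.islice(gen, limit))
        if len(solutions) == limit:
            # Possibly more solutions: park the generator for the
            # request that asks for the next slice.
            with self._lock:
                self._cache[(query_id, offset + limit)] = gen
        return solutions
```

A second request for the next slice resumes the parked generator instead of recomputing the first disjunct, which is what the sleep/1 experiment demonstrates.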

Yes, thanks for spotting these bugs. We are not going to fix them just yet, though, since the terminal is only a quick hack to make it possible to demo Web Prolog online. In the future, I think SWISH should be used as a front-end instead.

I don’t think that HTTP becomes stateful when keep-alive is used, at least not in any interesting sense. WebSocket, however, is a truly stateful protocol. But I’m glad you asked for source code, because here comes the implementation of the API. It is not exactly the implementation used in the demo nodes (as that’s a bit too hacky to show), but it should give you an idea of how the API can be built on top of pengines.

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_parameters)).

:- http_handler(root(ask), http_actor_manager, [spawn([])]).

:- dynamic cache/3.

http_actor_manager(Request) :-
    http_parameters(Request, [ 
        query(QueryAtom, []),
        offset(Offset, [integer, default(0)]),
        limit(Limit, [integer, default(1)])
    ]),
    read_term_from_atom(QueryAtom, Query, []),
    query_id(Query, QID),
    (   with_mutex(cache, retract(cache(QID, Offset, Pid)))
    ->  self(Self),
        pengine_next(Pid, [
            limit(Limit),
            return_to(Self)
        ])
    ;   pengine_spawn(Pid),
        pengine_ask(Pid, offset(Offset, Query), [
            limit(Limit)
        ])
    ),
    receive({
        failure(Pid) ->
            respond(failure);
        error(Pid, Exception) ->
            respond(error(Exception)) ;
        success(Pid, Solutions, true) ->
            Index is Offset + Limit,
            with_mutex(cache, assertz(cache(QID, Index, Pid))),
            respond(success(Solutions, true));
        success(Pid, Solutions, false) ->
            respond(success(Solutions, false))   
    }).
  
query_id(Query, QID) :-
    copy_term(Query, QID0),
    numbervars(QID0, 0, _),
    term_hash(QID0, QID).

To make it easier to understand, the above code illustrates the main idea of our caching scheme, but leaves out many of the details that are necessary for a fully working node. For example, it does not prune the cache when it has grown too big (that should be done from the top when a max-size has been reached), it does not implement the mechanism that will terminate a pengine when it has run for too long, and it does not implement the format option. Code for handling all of this would be easy to add, and our demonstrator adds it.
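For what it’s worth, the missing pruning step could look something like the following Python sketch (terminate is a hypothetical hook standing in for stopping a pengine; the real node is of course written in Prolog):

```python
from collections import OrderedDict

def prune(cache, max_size, terminate):
    """Prune from the top once the cache exceeds max_size.

    `cache` is an OrderedDict mapping pid -> (query_id, offset) with
    the oldest entry first; `terminate` stops the evicted pengine.
    """
    while len(cache) > max_size:
        pid, _entry = cache.popitem(last=False)  # oldest entry first
        terminate(pid)
```

Pruning oldest-first keeps the cache bounded while preserving the entries most recently parked by clients.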

I’ll be happy to explain even further, if you want me to. Also, the following illustration might help:

If you scroll you’ll see that the definition of query_id/2 is in fact there.

Dynamic DB manipulation should not - and can not - work at all in the ISOBASE profile, so that’s yet another bug.

It’s coming! This is an absolute must. Won’t be long now.

Yes. Pending. This is a must-have. (I replied the same to one of your other posts below.)

Source code is here: https://github.com/Web-Prolog/swi-web-prolog-lite .

However, to understand what is going on, it might be better to have a close look at the code here.

Excellent questions.

My long term goal is to create a p2p secure distributed search engine with a conversational agent interface on mobile. The sponsored hardware would be your mobile phone and everyone else’s. My hope, and this is beyond the scope of just Web Prolog, is that our prolog.computer cluster will prove out Web Prolog and seed the network using Web Prolog as the interaction language.

This requires that the other profiles beyond ISOBASE be implemented as well to do things like code injection for peer knowledge bases and distributed Web search queries. Also, I foresee WebAssembly playing a big role.

This is very much a research project. I’m stepping out on a limb here as not all the details are understood, but I’m a big believer in Release Early Release Often. Always interested in hearing everyone’s thoughts.

I’m grateful for Torbjörn’s work here as it fleshes out the security model in many ways and enables the use of Prolog as a central feature and implementation path. I’m interested in algebraic and so-called Explainable AI approaches to indexing and caching derived from the queries and knowledge bases themselves. That is all very far-future so I cannot really answer detailed questions there as I’m still researching.

Please be kind. Don’t shred my ideas just yet. I need help exploring these ideas. Helpful comments and criticism are definitely wanted. :slight_smile:

Sincerely,
Damon

Yes. Edge-AI is exactly what I’m talking about. Cell phones are the primary deployment target. Like that article says, 80% of mobile devices will have AI chips by 2020. A secure p2p search engine which protects your identity and keeps your queries safe from prying eyes. Using distributed queries which hop from host to host means they can be anonymized. This should boost security.

You mean better as in never fails? No, I don’t think so. Computers fail, software fails - that’s life. Or do you mean better as in faster? For queries that both can handle, an SWI-Prolog implementation of an ISOBASE node might be marginally faster since there are things it doesn’t need to do in order to deliver a response, such as creating a module, or register its identity. On the other hand, it needs to search the cache before it tries to compute a response, so I don’t think the difference is measurable.

What I don’t understand is how the cache should work for actors.

It works for pengines that don’t do I/O or dynamic database modifications. Thus it works for pure Prolog, but also for many Prolog programs that are not pure. Note that a pengine is a kind of actor, so it works for some actors, but not for all. It’s easy to see that a strict request-response protocol cannot work for arbitrary actors.

Arbitrary actors, such as pengines that do I/O or dynamic db modifications, should be queried over the WebSocket API. In such cases, an ACTOR node is needed. This will be much faster than SWISH, especially for applications that do a lot of “server push” - a game engine which very frequently needs to communicate the state of the game to a client, for example.

The library behind SWISH (library(pengines)) relies on something akin to long polling. Unfortunately, when it comes to concurrent and distributed programming, library(pengines) promises more than it can deliver. Hopefully, in the future, a Web Prolog node can replace the SWISH backend.

Yes, to some extent, Damon and I probably do have different goals. I don’t see this as a problem. And yes, a sophisticated conversational agent running on a node and talking to a browser needs to talk over WebSockets and thus needs to run on an ACTOR node. Alternatively, it can keep the state on the client (in JavaScript or in Prolog running in the browser), and use the stateless API when it needs to consult resources located elsewhere.

Of course it can - some of them. But try to do the following in SWISH, a straightforward translation of an Erlang program into Web Prolog. Two actors, residing on two different nodes, playing ping-pong:

ping(0, Pong_Pid) :-
    Pong_Pid ! finished,
    io:format('Ping finished',[]).
ping(N, Pong_Pid) :-
    self(Self),
    Pong_Pid ! ping(Self),
    receive({
        pong ->
            io:format('Ping received pong',[])
    }),
    N1 is N - 1,
    ping(N1, Pong_Pid).
    
pong :-
    receive({
        finished ->
            io:format('Pong finished',[]);
        ping(Ping_Pid) ->
            io:format('Pong received ping',[]),
            Ping_Pid ! pong,
            pong
    }).

start :-
    spawn(pong, Pong_Pid, [
        node('http://another_node.org'),
        src_predicates([pong/0])
    ]),
    spawn(ping(3, Pong_Pid), _, [
        src_predicates([ping/2])
    ]).

I have already pointed you to a PoC implementation of an ACTOR node. It’s here: https://github.com/Web-Prolog/swi-web-prolog. Unfortunately, it is not yet safe to open up to the world, so you have to download it, install it locally, and run it there. If you do, you’ll find there is a tutorial too, which will let you try examples such as the ping-pong program above, as well as many others.

It’s a good example of something that cannot be done in SWISH, not even when library(pengines) is imported.

At this point in time, no. Later, yes, when we have figured out how to use library(sandbox) in this case.

Time is measured from the moment a pengine starts to compute a solution to the query until it has found one. There is a setting that allows the owner of a node to change the time limit.

The Web Prolog subset that can be used for querying an ISOBASE node doesn’t come with such primitives, but if your node wraps a Prolog system that can run A and B in parallel, and your node-resident program imports predicates defined in terms of such primitives, it might work. Making two HTTP requests in parallel from (say) a JavaScript program running in browser may also work. The Erlang-ish concurrency primitives are only available when programming against an ACTOR node.

Yes, an actor can be passed an option timeout at creation time. The process will be terminated when either of the two timeouts is exceeded.

Telescript seems to be framework for mobile agents. Web Prolog does not aim for mobility, except in the sense that a running actor can decide to create a copy of itself on a remote node. However, note that if the original actor then decides to terminate, the copy running remotely will (by default) be forced to terminate too. This is an effect of the linking mechanism that Web Prolog inherits from Erlang. When running on the Web it is important that the default for the link option is true, and that only an authorised user (e.g. the owner of the node) can set it to false.

An obvious example of the use of linking is that when a programmer closes/leaves/reloads the web-based IDE, the top-level pengine which served the shell as well as actors that may have been spawned by this pengine are forced to terminate. (This is what happens in SWISH too.)

Note that the discussion above is only relevant when we are programming against the ACTOR profile. Let’s return to the properties of the ISOBASE profile. When making a stateless HTTP request to a node, notions such as linking aren’t really relevant. For example, if you submit a query ?- movie(A,B) in the shell at http://one.prolog.computer:3010 and then, after having seen the first answer, close or reload the window/tab, the remote pengine will still be kept in the cache in a suspended state. Later, it may become useful to another client requesting the second answer to ?- movie(A,B), and so on. For the benefit of any client that can make use of it, the pengine will stay around until either the query has been run to completion, or the max size of the cache is exceeded so that some pooled pengines - which may include “your” pengine - must be terminated and removed from the cache.

Thanks, I believe you’re right! Will have a look at it tomorrow.

I’m not familiar with shared tabling, and therefore I’m not sure what you’re saying here. But I’d like to try to respond to the following question.

Let me try to explain how it works with respect to the following implementation:

:- http_handler(root(ask), http_actor_manager, [spawn([])]).

http_actor_manager(Request) :-
    http_parameters(Request, [ 
        query(QueryAtom, []),
        offset(Offset, [integer, default(0)]),
        limit(Limit, [integer, default(1)])
    ]),
    read_term_from_atom(QueryAtom, Query, []),
    query_id(Query, QID),
    (   retract(cache(QID, Offset, Pid))
    ->  self(Self),
        pengine_next(Pid, [
            limit(Limit),
            return_to(Self)
        ])
    ;   pengine_spawn(Pid),
        pengine_ask(Pid, offset(Offset, Query), [
            limit(Limit)
        ])
    ),
    receive({
        failure(Pid) ->
            respond(failure);
        error(Pid, Exception) ->
            respond(error(Exception)) ;
        success(Pid, Solutions, true) ->
            Index is Offset + Limit,
            assertz(cache(QID, Index, Pid)),
            respond(success(Solutions, true));
        success(Pid, Solutions, false) ->
            respond(success(Solutions, false))   
    }).
  
query_id(Query, Query).

(Note that I’ve simplified query_id/2 in order to make my line of reasoning easier to follow.)

Suppose this is the current cache:

cache(mortal(_), 1, 55321100).
cache(mortal(_), 1, 66783421).
cache(p(_),      3, 84926378).

And suppose two identical requests of the following form are made at the same time:

GET http://ex.org/ask?query=mortal(Who)&offset=1

When they try to retract the first cache/3 clause, only one of them will succeed. (At least, that’s how I understand the documentation on retract/1.) It will thus grab the pengine with the pid 55321100. The other request will grab the pengine with the pid 66783421.

For a short while, there will only be one clause in the cache, namely

cache(p(_), 3, 84926378).

but when both requests have returned their responses, the cache has been updated and might look as follows:

cache(p(_),      3, 84926378).
cache(mortal(_), 2, 66783421).
cache(mortal(_), 2, 55321100).

So far, since retract/1 is an atomic operation, nothing bad has happened. And AFAICS, in this scenario, nothing bad can happen. So far, I see no need for a mutex.

However, the above implementation is not complete. We also need a mechanism that will prune the cache when its max size is exceeded. Suppose the max size is 3. (Of course, normally it would be 1,000, or 10,000, or something like that.)

Now, suppose the following request is made:

GET http://ex.org/ask?query=member(X,[a,b])&offset=0

The cache will be of no help here, so a new pengine will be spawned which will do the job. Since there are more solutions, a clause will be added to the cache:

cache(p(_),            3, 84926378).
cache(mortal(_),       2, 66783421).
cache(mortal(_),       2, 55321100).
cache(member(_,[a,b]), 1, 98237833).

Now, the cache has exceeded its max size and must be pruned. This is done from the top, since that’s where we have the oldest entry. The clause cache(p(_),3,84926378) must be removed and the pengine with the pid 84926378 must be forced to terminate.

This is where we have a potential race condition. Because when the pruning mechanism tries to retract cache(p(_),3,84926378), there may also be an incoming request of the form:

GET http://ex.org/ask?query=p(X)&offset=3

We have two cases:

  1. The incoming request retracts the clause cache(p(_),3,84926378) before the pruning mechanism has a chance to do so. The pruning mechanism will then instead retract cache(mortal(_),2,66783421). This, AFAICS, won’t lead to any problems.

  2. The pruning mechanism retracts the clause cache(p(_),3,84926378) before the incoming request has a chance to do so. This only means that the incoming request needs to spawn a new pengine. No problem here either.

There is no need for a mutex in the context of the pruning mechanism either. The “entry point” into the cache is always through the atomic retract/1. Once a thread has succeeded in retracting a cache/3 clause, no other thread can get at it, and thus no other thread can get hold of the corresponding pengine. (Note that for each pengine in the system, there can exist at most one cache/3 clause.)
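The atomic-grab argument can be demonstrated with a toy version of the cache in Python, where dict.pop plays the role of the atomic retract/1 (the pids are the ones from the example above; "fresh" marks the request that misses and would spawn a new pengine):

```python
import threading

# One suspended pengine cached for mortal(_) at offset 1.
cache = {("mortal", 1): 55321100}
results = []
results_lock = threading.Lock()

def handle_request():
    # Atomic pop: at most one concurrent caller obtains the pid.
    pid = cache.pop(("mortal", 1), None)
    if pid is None:
        pid = "fresh"  # cache miss: spawn a new pengine instead
    with results_lock:
        results.append(pid)

threads = [threading.Thread(target=handle_request) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the two concurrent requests gets the cached pengine; the other simply spawns a fresh one, matching cases 1 and 2 above.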

I know only too well how difficult programming with concurrency becomes in the presence of shared state, so I’m perhaps a bit too optimistic when I boldly claim that 1) the proposed scheme will work, and 2) that there is no need for mutexes. (I thought there was, but I now believe there isn’t.)

Do you, or anyone else, see any holes in my line of reasoning?

My tabling idea doesn’t apply to your current prototype, since your prototype caches processes(*), not results, whereas tabling caches results. Nevertheless, one could also try an actor prototype that is closer to tabling.

I guess in shared tabling the result caching is extended to cache results plus some process(*) info, because of the deadlock management. So this type of tabling is a little closer to your prototype, but it is still not the same.

(*) Erlang terminology; it is in fact typically a thread inside an OS process.

@j4n_bur53 Thank you for the race condition bug report. Torbjörn has submitted a patch for it. As a token of appreciation, I’ve tipped your GitHub account with a small amount of BAT cryptocurrency.

Again, thank you for the help. :slight_smile:

Yours faithfully,
Damon