Merits of Erlang style for real-time systems

What are the merits of Erlang style for real-time systems?

Hard real-time or soft real-time?
(Telephone systems, which are what Erlang was developed for, are soft real-time.)

BTW, Ericsson wasn’t the only company working on Prolog for telephone switching; BNR (Nortel) had its own Prolog research group. When I worked at BNR on telephone switches (DMS-100 et al), the general consensus was that writing reliable software with C and Unix derivatives (the AT&T way) was amusing, to say the least.

Anyway, off the top of my head, one big advantage of Erlang is the ability to replace modules without stopping the system. (We developed a way of doing that for the DMS-100, but it was a non-trivial problem; Erlang makes it much easier.)

Fair enough, this time it may well have been me who failed to adhere to the stick-to-the-topic discipline. On the other hand, error handling is often hopelessly intertwined with normal operation, and I guess it’s sometimes difficult to determine what should be thought of as an error. Suppose a process is left in a living (but idle) state on a remote node even though it is no longer needed because, as in your example, a cut on the client side made it clear that no more backtracking over nodes will take place. Is that an error? It’s a programmer’s error in the sense that the programmer could have avoided it (e.g. by wrapping the query in once/1), and it certainly leads to a problem, especially when the node on which the “stale” actor resides is owned by someone else. SWISH and library(pengines) deal with this problem by imposing a time limit on how long a stale pengine is allowed to hang around. When this limit is exceeded, the pengine is terminated.
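For illustration, a minimal sketch (using the rpc/2 interface discussed further down in this thread): wrapping the goal in once/1 makes the call deterministic, so the remote pengine can be reclaimed as soon as the first solution has been delivered, rather than lingering until a timeout fires.

?- rpc('http://ex.org', once(member(X, [a,b]))).
X = a.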

Erlang is often described as a concurrent and distributed language with a functional core. When it comes to error handling, Erlangers seem to argue (in my mind correctly) that there’s a need for two fundamentally different ways of handling errors. In sequential Erlang code, catch and throw can and should be used, but for concurrent and distributed programming, tools such as error messages, monitors and links are more important. (In addition, Erlangers often seem to argue that too much “defensive programming” using catch and throw is a mistake, and that “let it crash” (and then restart) is often a better error recovery model.)
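To make the contrast concrete, here is a minimal sketch in the Web Prolog notation used later in this thread (illustrative only, not a definitive API): catch/3 guards a sequential goal, while a monitor lets a separate process react to a crash after the fact.

% Sequential style: guard the goal itself with catch/3.
safe_div(X, Y, Z) :-
    catch(Z is X / Y,
          error(evaluation_error(zero_divisor), _),
          Z = undefined).

% Concurrent style: don't guard the worker; monitor it and react
% to the down message if it crashes ("let it crash").
run_monitored(Goal) :-
    spawn(Goal, Pid, [monitor(true)]),
    receive({
        down(Pid, Reason) ->
            writeln(worker_exited(Reason))
    }).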

So how about error detection and error recovery in Web Prolog - thinking about it as a concurrent and distributed language with a relational/logic programming core? I believe we should allow ourselves to be inspired by Erlang here too. Now, when I think about setup_call_cleanup/3, I find it easy to think of it as a construct, alongside catch and throw, for error detection and recovery in sequential Web Prolog, but not as a very useful construct for dealing with errors and recovery from errors outside sequential code. For this, error messages, monitors and links seem more useful.
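For the sequential case, here is a small standard-Prolog example of the kind of thing I mean (process_stream/1 is just a stand-in for whatever processing is wanted):

% setup_call_cleanup/3 guarantees that the stream is closed whether
% process_stream/1 succeeds, fails, or throws an exception.
read_file(File) :-
    setup_call_cleanup(
        open(File, read, Stream),
        process_stream(Stream),
        close(Stream)).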

I think you need to do a lot less of that in Erlang (or in Web Prolog) - most of that is supposed to happen automatically.

In the context of the definition of rpc/2-3 it serves to “convert” an error message into an exception, which is, as far as I can see, the only sane way to deal with it in this context. In other, more “asynchronous” contexts, e.g. when pengine_spawn/1-2 and friends are used as is, other ways to deal with this message are probably called for, and they are left for the programmer to decide.
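Roughly like this, in other words (a much simplified sketch that ignores the machinery for retrieving further solutions, and where the exact message formats are my guesses):

% A (simplified) rpc/2: an error message arriving from the remote
% pengine is re-thrown locally as an exception.
rpc(URI, Goal) :-
    pengine_spawn(Pid, [node(URI), exit(true)]),
    pengine_ask(Pid, Goal),
    receive({
        success(Pid, Solutions, _More) ->
            member(Goal, Solutions);
        failure(Pid) ->
            fail;
        error(Pid, Exception) ->
            throw(Exception)
    }).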

Yes, that’s what it means.

Let’s look at the following statechart instead, which is the one to keep in mind when discussing Web Prolog.

[Image: PLAP statechart]

It’s important to understand that the statechart captures only one (but important!) aspect of what’s going on, not everything. One thing that it doesn’t capture is the effect of the exit option we discussed before. If exit(true) is passed when pengine_spawn/1-2 is called, the pengine will always be automatically destroyed upon reception of the messages success(false), stop, failure, or error. That is, the pengine is set up to run just one query to completion, which may, of course, involve the reception of several next messages and the sending of several answer messages.
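A hypothetical session illustrating the effect (the exact message shapes are my guesses, but the final success message carries false, and with exit(true) the pengine then goes down without an explicit stop):

?- pengine_spawn(Pid, [exit(true), monitor(true)]),
   pengine_ask(Pid, member(X, [a,b])).
Pid = 3458194@'http://local.org'.

?- flush.
Shell got success(3458194@'http://local.org',[member(a,[a,b])],true)
true.

?- $Pid ! next.
true.

?- flush.
Shell got success(3458194@'http://local.org',[member(b,[a,b])],false)
Shell got down(3458194@'http://local.org',exit)
true.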

While that makes sense to me, when I think of Erlang I think about its history and why it was specifically created for telephone exchanges. When software is used in a real-time embedded system, halting a physical device because of an error would be a bad thing. A common way to handle such errors in an embedded system is to use a watchdog timer.

So while what you say is true, I find that many programmers either have little to no experience with embedded systems, or embedded systems are all they do. The concept of “let it crash” would sound shocking to someone not used to embedded systems programming, but many don’t know that when the system is allowed to fail, the watchdog timer will kick in and reboot it. Such systems know this and are designed to reboot very quickly; in some systems, even taking a second to reboot is too long.

As I often like to note, when you fly in a plane (Boeing or Airbus), who or what is really flying the plane? Most would say the pilot, but it is really the software running on the computers; the pilot just gives input to the computer. If the software were to crash, think of the problems.

Don’t take this to mean that all errors in an embedded system are handled by fail-and-reboot; that is far from reality. The key difference is that in embedded systems, you reboot as a last resort instead of just halting.

The same technology has been moving into the automotive world for the last few decades.
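In the Erlang (or Web Prolog) style, the software half of a watchdog might be sketched as follows. This is speculation on my part as far as Web Prolog is concerned: I’m assuming a receive/2 that accepts a timeout(Millis) option and falls through to a timeout branch, mirroring Erlang’s receive ... after.

% A software watchdog: the supervised component must send a heartbeat
% message at least once a second; if none arrives in time, the
% timeout branch fires and we restart as a last resort.
watchdog(Restart) :-
    receive({
        heartbeat ->
            watchdog(Restart);
        timeout ->
            call(Restart)
    }, [timeout(1000)]).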

Ok, or perhaps it’s about both error handling and (a particular kind of) “garbage collection”, and how these two things interact?

In any case, I agree that ensuring the termination of pengines and other kinds of actors (and don’t forget the other kinds of actors!) at the proper point in time is an important topic. Part of the reason it is both important and surprisingly tricky is that the program and the node on which we want to run the program may have different owners. In this case, who’s responsible for cleaning up the garbage in the form of stale pengines - the person who has written and is running the program, or the owner of the node? So far, as I see it, we’ve touched upon a couple of ideas:

  1. The client (i.e. the person who authored the program or the query) is responsible. As far as I can see, this would only work if the client and the owner of the node are the same person (or organisation), e.g. if you’re setting up a cluster of nodes inside a firewall and/or with proper authorisation. This is how distribution in Erlang normally works. However, in the case of Web Prolog, we want a scheme which is more open, in the sense that the Web is open.

  2. We also have the case which can be illustrated by the way SWISH works. As long as the sandbox code is fine with it, SWISH allows unauthorised clients to run any query or program on the SWISH server. So far at least, this has worked fine. If a client sends the query ?-repeat,fail to the SWISH backend, it will work, but only for a while, until (after a minute or so, depending on an owner-controlled setting) a timeout occurs, which kills the pengine running the query. And in a situation like this

    ?-member(X,[a,b]).
    X = a 
    

    if the client waits long enough (perhaps 5 minutes or so, again depending on a setting), then the pengine will be destroyed, and the second solution to the query will no longer be available.

    So, in this case, the client has a lot of freedom and has almost full control over the remote pengine, but the final say, when it comes to looking after resources, belongs to the owner of the node.

  3. In the case of this situation

    ?- rpc('http://ex.org', member(X,[a,b])).
    X = a 
    

    the same scheme as in 2) can be used by the owner of http://ex.org, but there may be another way which is better - the way I had in mind when I wrote this post. Here’s what I write in the Erlang’19 paper about the idea (which was first proposed by Jan Wielemaker back in 2009 or so):

    "Interestingly, it also turns out that since rpc/2-3 does not produce output or request input, it can be run over HTTP instead of over the WebSocket protocol. In our proof-of-concept implementation this is the default transport.

    To retrieve the first solution to ?-mortal(X) using HTTP, a GET request can be made with the following URI:

    http://remote.org/ask?query=mortal(X)&offset=0 
    

    Here too, responses are returned as Prolog or as Prolog variable bindings encoded as JSON. Such URIs are simple, they are meaningful, they are declarative, they can be bookmarked, and responses are cachable by intermediates.

    To ask for the second and third solution to ?-mortal(X), another GET request can be made with the same query, but setting offset to 1 this time and adding a parameter limit=2. In order to avoid recomputation of previous solutions, the actor manager keeps a pool of active pengines. For example, when the actor manager received the first request it spawned a pengine which found and returned the solution to the client. This pengine - still running - was then stored in the pool where it was indexed on the combination of the query and an integer indicating the number of solutions produced so far (i.e. 1 in this case). When a request for the second and third solution arrived, the actor manager picked a matching pooled pengine, used it to compute the solutions, and returned them to the client. Note that the second request could have come from any client, not necessarily from the one that requested the first solution. This is what makes the HTTP API stateless.

    The maximum size of the pool is determined by the node’s settings. To ensure that the load on the node is kept within limits, the oldest active pengines are terminated and removed from the pool when the maximum size is reached. This may mean that some solutions to some subsequent calls must be disposed of, but this will not hurt the general performance."

    As I wrote before, if you want to know (much!) more, have a look at sections 6.3-6.5 in the longer manuscript. The tutorial in the PoC provides a couple of examples.

    Using this scheme, the node becomes 100% responsible for the maintenance of resources, which, in many cases, may be a good thing.
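To make the scheme in 3) concrete, here is how a client might call such a node from SWI-Prolog. The /ask endpoint and its parameters are exactly as in the quote; reading the response with read_term/3 assumes the node is asked to answer in Prolog text rather than JSON.

:- use_module(library(http/http_open)).

% Ask http://remote.org for solutions 2 and 3 of ?- mortal(X). The
% node's pengine pool (if a matching pengine is still alive) avoids
% recomputing the first solution.
ask_remote(Answer) :-
    setup_call_cleanup(
        http_open('http://remote.org/ask?query=mortal(X)&offset=1&limit=2',
                  Stream, []),
        read_term(Stream, Answer, []),
        close(Stream)).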

Yes, although as things have turned out, Erlang has been used for many kinds of applications outside telecommunications, and for web-based applications in particular (WhatsApp is often quoted). And Elixir, which compiles to bytecode for the Erlang VM, is, in combination with the Phoenix web framework, one of the hottest web-related technologies around.

I agree with most of what you’re saying, although a distinction is often made between soft and hard real-time systems. I believe the claim is that Erlang is appropriate for programming soft real-time systems, but maybe not hard real-time embedded systems. Real-time web applications should likely be counted as soft.

Also, the “let it crash” strategy must probably be given a somewhat different interpretation when we are dealing with a system for web programming. In any case, the notion of links between processes, where one process is forced to terminate if another one crashes is applicable. For example, when SWISH is closed, or the page is reloaded, any remote process which is created by SWISH must be terminated, and so on.
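In code, I imagine that would come down to something like the following (the link(true) option is my assumption, modelled on Erlang’s spawn_link and on the monitor(true) option used elsewhere in this thread; some_long_running_goal is just a placeholder):

% Spawn the remote process linked to the local one: if the local
% process dies (e.g. because the SWISH window is closed), the
% remote process is terminated too.
?- spawn(some_long_running_goal, Pid, [
       node('http://remote.org'),
       link(true)
   ]).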

About the Erlang “let it crash” philosophy and watchdog timers … both were used in the DMS-100 telephone switches, and they weren’t written in Erlang. But of equal importance was error handling – every crash was caught by something, a log message was written, and action taken (typically, restarting the failed process). This was very robust – the system could be rebooted without losing ongoing telephone calls.

But “let it crash” was used for handling unanticipated situations. There was plenty of code that looked at return codes and did something sensible (e.g., exponential backoff, and eventually crash if the timeout became too long; this would typically trigger the parent to do some cleanup and restart the failed process).

[DMS-100 had a message-passing OS, which evolved a fair bit with experience, and had many similarities to Erlang as an OS; the DMS code was written in an extended Pascal, which – if you remove the syntactic differences – reminds me of Go.]

Side note. IIRC, SWI-Prolog has some support for hot patching but I cannot seem to find it on the website. Logtalk supports hot patching using categories.

There is library(hotfix). That merely allows you to replace files in a saved program using files with the same name in some directory. The real crux is that SWI-Prolog allows safe reloading of files that are being executed. See here.

Well, in Joe Armstrong’s Ph.D. thesis, “let it crash” occurs 5 times, and Google counts about 97,300 hits when I search for ‘“let it crash” erlang’.

The “let it crash” philosophy is just one way to achieve disciplined error handling - a (frequently misinterpreted) strategy for obtaining fault tolerance.

Yes, thank you Peter, I believe that’s how it should be understood. Erlang even has a generic supervisor library/framework (aka a behaviour in Erlang jargon) dedicated to this strategy.
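A toy version of the strategy is easy to write with the constructs discussed in this thread (a minimal sketch; the real Erlang/OTP supervisor is of course far more elaborate, with restart strategies, restart intensities, and so on):

% A one-for-one supervisor for a single child: monitor it, and if
% it goes down, log the reason and spawn a fresh copy.
supervise(Goal) :-
    spawn(Goal, Pid, [monitor(true)]),
    receive({
        down(Pid, Reason) ->
            writeln(restarting(Pid, Reason)),
            supervise(Goal)
    }).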

Yes, I think so too.

Related to this thread, here’s a post (adapted from examples in my manuscript) which hopefully will serve two purposes:

  1. A reminder that an ordinary non-pengine actor can be quite useful on its own, so that you don’t always need to spawn a pengine in order to get something done, and

  2. An example showing how a generic server with hot code swapping may be written in Web Prolog.

First, here’s an example inspired by a fridge simulation example from a beginner’s textbook on Erlang:

fridge(FoodList0) :-
    receive({
        store(From, Food) ->
            self(Self),
            From ! ok(Self),
            fridge([Food|FoodList0]);
        take(From, Food) ->
            self(Self),
            (   select(Food, FoodList0, FoodList)
            ->  From ! ok(Self, Food),
                fridge(FoodList)
            ;   From ! not_found(Self),
                fridge(FoodList0)
            );
        terminate ->
            true
    }).

Here’s a test session:

?- spawn(fridge([]), Pid, [
       node('http://fridge.org'),
       monitor(true)
   ]).
Pid = 6734313@'http://fridge.org'.

?- self(Me), $Pid ! store(Me, meat),
   $Pid ! store(Me, cheese).
Me = 9201673@'http://local.org'.

?- flush.
Shell got ok(6734313@'http://fridge.org')
Shell got ok(6734313@'http://fridge.org')
true.

?- $Pid ! take($Me, cheese).
true.

?- flush.
Shell got ok(6734313@'http://fridge.org',cheese)
true.

?- $Pid ! terminate.
true.

?- flush.
Shell got down(6734313@'http://fridge.org',exit).
true.

Here’s how we may program a generic stateful server in Web Prolog which can also handle hot code swapping:

server(Pred, State0) :-
    receive({
        rpc(From, Ref, Request) ->
            call(Pred, Request, State0, Response, State),
            From ! Ref-Response,
            server(Pred, State);
        upgrade(Pred1) ->
            server(Pred1, State0)
    }).

The first receive clause matches incoming rpc messages specifying a query, performs the required computation, and returns the answer to the client that submitted the query. The second clause is for upgrading the server.

As can be seen from the first receive clause, the generic server expects the definition of a predicate with four arguments to be present and callable from the server. In the case of our refrigerator simulation the expected predicate may be defined as follows in order to obtain the required specialisation of the generic server:

fridge(store(Food), FoodList, ok, [Food|FoodList]).
fridge(take(Food), FoodList, ok(Food), FoodListRest) :-
    select(Food, FoodList, FoodListRest), !.
fridge(take(_Food), FoodList, not_found, FoodList).

If we choose, the client can be built in a style, also commonly practised by Erlang programmers, which guarantees that the communication between client and server stays synchronous:

rpc_synch(To, Request, Response) :-
    self(Self),
    make_ref(Ref),
    To ! rpc(Self, Ref, Request),
    receive({
        Ref-Response -> true
    }).

Note the generation of a unique reference marker to be used to ensure that answers pair up with the questions to which they are answers.

This code too is generic and follows a common idiom in Erlang that implements a synchronous operation on top of the asynchronous send and the blocking receive. The predicate rpc_synch/3 waits for the response to come back before terminating. It inherits this blocking behaviour from receive/1, and it is this behaviour that makes the operation synchronous.

Here is how we start a server process on a remote node and then use rpc_synch/3 to talk to it:

?- spawn(server(fridge, []), Pid, [
       node('http://remote.org')
   ]).
Pid = 673645513@'http://remote.org'.

?- rpc_synch($Pid, store(meat), Response).
Response = ok.

?- rpc_synch($Pid, take(meat), Response).
Response = ok(meat).

Now, suppose we want to upgrade our server with a faster predicate for grabbing food from the fridge, perhaps one that uses an algorithm more efficient than the sequential search performed by fridge/4. Assuming a predicate faster_fridge/4 is already loaded and present at the node, we can make the upgrade without first taking down the server. This means that we retain access to the state (and thus don’t risk losing any food in the process):

?- $Pid ! upgrade(faster_fridge).
true.

Erlang/OTP has a generic behaviour for this too - the gen_server behaviour - which can also handle hot swapping of code.

I’m not sure all this is needed in the context of the Web, but it’s nice to see that it’s possible.

More precisely, “let it crash” is for robustly handling this situation:

(As opposed to, say, Windows, which might pop up a window on a display at an airport, where there’s no keyboard for responding.)

All of the process and synchronization design was in the OS, not the language, because the designers admitted that they didn’t know what would work best. In the end, message passing with read timeouts was the most common pattern, and the OS was optimized for making this very fast. Presumably Ericsson observed a similar situation in their code, hence Erlang.
(With Protel, buffer overflows didn’t exist because the language contained the notion of a “slice” and the compiler could optimize away many range checks. (I can’t remember how use-after-free was avoided, but I don’t recall it being a problem.) Even so, we estimated that the various runtime checks cost about 10% overhead, but judged the cost worthwhile for a high-reliability system.)