WASM performance

I had a brief discussion with @jfmc about the performance of the WASM version. Jose claims a difference of about 2 times compared to the native version. For SWI-Prolog this is closer to 10 times. Jose suspected threaded code may be problematic. Threaded code is a trick where the VM instructions are the direct addresses of the corresponding cases in the VM dispatch code. So, I promised to check. Well, SWI-Prolog has three modes for compiling the VM:

  • A classical switch.
  • Threaded code.
  • Compile each instruction to a function; the function pointer is then used as the VM instruction. This is the default for the WASM version as it builds the fastest and ran the fastest when the WASM port materialized.
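
For readers unfamiliar with the terminology, below is a minimal sketch of the three dispatch styles for a toy VM with just two instructions. This is not SWI-Prolog's actual code; all names and the structure are made up for illustration.

```c
/* Toy VM illustrating the three dispatch styles (not SWI-Prolog's code). */

enum vmi { I_INC, I_HALT };

/* 1. Classical switch: dispatch goes through the switch's jump table. */
static int
run_switch(const enum vmi *pc)
{ int acc = 0;
  for(;;)
  { switch(*pc++)
    { case I_INC:  acc++; break;
      case I_HALT: return acc;
    }
  }
}

/* 2. Threaded code (GCC/Clang computed goto): the "instructions" in the
 *    code array are the addresses of the labels themselves, so dispatch
 *    is a single indirect jump. */
static int
run_threaded(void)
{ static void *prog[] = { &&l_inc, &&l_inc, &&l_halt };
  void **pc = prog;
  int acc = 0;
  goto *(*pc++);
l_inc:  acc++; goto *(*pc++);
l_halt: return acc;
}

/* 3. One function per instruction: the function pointer is the instruction.
 *    E.g. run_functions((vmi_f[]){ f_inc, f_inc, f_halt }) yields 2. */
typedef struct vm_state { int acc; int halted; } vm_state;
typedef void (*vmi_f)(vm_state *);
static void f_inc(vm_state *s)  { s->acc++; }
static void f_halt(vm_state *s) { s->halted = 1; }

static int
run_functions(vmi_f *pc)
{ vm_state s = { 0, 0 };
  while( !s.halted )
    (*pc++)(&s);
  return s.acc;
}
```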

But, things have changed. Using Emscripten 4.0.15 (latest) and node v22.19.0 (Fedora 42), we get

  • A classical switch: 0.38 sec
  • Threaded code: 0.38 sec
  • VM functions: 0.61 sec

So, I think we should switch. Unfortunately the first two crash on one of the tests :frowning:


The story is more complicated. Using threaded code or a switch results in a huge function with many local variables. Due to the way setjmp()/longjmp() is handled, recursive calls to the VM interpreter quickly fill the JavaScript stack. The mere 3 levels of VM recursion used by one of the tests already exhaust the node JavaScript stack :frowning:

This can be fixed by using WASM exception handling (-fwasm-exceptions), but this makes the system again about 30% slower :frowning:

Not sure where to go. Possibly we can get rid of the (few) setjmp()/longjmp() usages.

So, the main VM loop uses C setjmp()/longjmp() to allow for PL_throw(), which raises an exception without using the usual return-failure way of unwinding the stacks. PL_throw(), which calls longjmp(), was how exceptions worked long ago. This path is now deprecated, but still in use to deal with GMP allocation failures that result from allocating huge numbers (more on that later).
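
To illustrate the structure, the old scheme looks roughly like the sketch below. The names are hypothetical and the real code (in src/pl-wam.c) is considerably more involved; the point to note is that the setjmp() sits inside the big VM function.

```c
/* Rough sketch of the old scheme: the VM loop owns a jmp_buf and
 * PL_throw()-style code longjmp()s back into it.  Hypothetical names. */
#include <setjmp.h>

static jmp_buf throw_env;            /* per Prolog engine in reality */
static void   *pending_exception;    /* the term being thrown */

static int
vm_loop(void)
{ /* ...many local "registers" of the VM live here... */
  if ( setjmp(throw_env) )           /* re-entered after a throw */
  { /* unwind the Prolog stacks to the matching catch/3 and continue */
  }
  for(;;)
  { /* fetch/decode/execute VM instructions; may call C code that throws */
    return 1;                        /* placeholder for the sketch */
  }
}

static void
throw_exception(void *exception)     /* the PL_throw() idea */
{ pending_exception = exception;
  longjmp(throw_env, 1);             /* discards all C frames in between */
}
```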

I added a (for now internal) #define O_THROW to make the code supporting PL_throw() optional. That leads to some interesting observations:

  • The performance of the WASM version improves by 35%
  • Using GCC-15, the performance on x86_64 (AMD3950X) improves by 13%
  • Using Clang-20 (Fedora 42), we gain 6% on AMD3950X.
  • While the monolithic (switch or threaded) code emulator is 60% faster on WASM, it is still slightly (barely measurably) slower on the above Clang-20/Fedora 42 platform.

It is tempting to get rid of setjmp/longjmp, at least for the main interpreter loop. GMP is one of the problems. All GMP functions return void, so they cannot report failure. GMP calls a hook to do memory allocation, but this hook is not allowed to fail. That is why we keep track of the GMP allocations and use longjmp() in case the requested memory is too large. LibBF, our GMP alternative, has a better API, but for now we emulate the GMP API on top of LibBF.
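
For illustration, the "track the allocations and longjmp() on overflow" trick looks roughly like this. The limit and the names are made up for the example; GMP's real hook API is mp_set_memory_functions().

```c
/* Sketch: GMP allocation hooks that bail out via longjmp() because GMP
 * itself cannot report an allocation failure.  Hypothetical names/limit. */
#include <gmp.h>
#include <setjmp.h>
#include <stdlib.h>

#define MAX_GMP_ALLOC (64*1024*1024)    /* assumed limit for the example */

jmp_buf gmp_abort_env;                  /* set before entering arithmetic */

static void *
checked_gmp_alloc(size_t size)
{ if ( size > MAX_GMP_ALLOC )
    longjmp(gmp_abort_env, 1);          /* GMP cannot handle a NULL return */
  void *p = malloc(size);
  if ( !p )
    longjmp(gmp_abort_env, 1);
  return p;
}

static void *
checked_gmp_realloc(void *old, size_t oldsize, size_t newsize)
{ (void)oldsize;
  if ( newsize > MAX_GMP_ALLOC )
    longjmp(gmp_abort_env, 1);
  void *p = realloc(old, newsize);
  if ( !p )
    longjmp(gmp_abort_env, 1);
  return p;
}

static void
checked_gmp_free(void *p, size_t size)
{ (void)size;
  free(p);
}

void
install_gmp_hooks(void)
{ mp_set_memory_functions(checked_gmp_alloc,
                          checked_gmp_realloc,
                          checked_gmp_free);
}
```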

As is, disabling PL_throw() makes several tests fail because the system terminates with a fatal allocation error rather than using a normal Prolog exception. In general, I think it is not acceptable for arithmetic evaluation to terminate the process this way.

Ideas are welcome.

(It’s an interesting problem, but I still have no familiarity with SWI’s source code, so just a couple of ideas while reading the above.)

Using threaded code or a switch results in a huge function that has many local variables.

Would you have a link to the source code? In particular, I am failing to figure out the need for “many local variables”.

Anyway, AFAIK, the JS VM to this day does not support tail call optimization, which might have to do with the stack problem. There are ways around that, though, namely “trampolines”; e.g. see here: https://douglasmoura.dev/en-US/understanding-tail-call-optimization-with-javascript

GMP is one of the problems.

Could you put GMP inside a thin wrapper that encapsulates away GMP’s requirement for setjmp/longjmp?

Just clone the git repo and check out src/pl-wam.c (should be renamed :slight_smile: ). The interpreter is PL_next_solution(). Most of the body is defined in src/pl-vmi.c, a file that is ‘#include’-ed into src/pl-wam.c. PL_next_solution() is hard to read, notably because it plays around a lot with cpp to compile the actual implementation of the instructions in the described three ways.

I’m also a little puzzled by the large number of local variables. I suspect they are caused by helper variables used by the many inlined functions. Without optimization these are apparently all allocated independently, so the system fails to compile. Using -O1 it does compile, but apparently there are still quite a few left.

The recursion issue (typically not deep, but used by callbacks such as with_output_to/2, sig_atomic/1, etc.) appears to occur because setjmp()/longjmp() in Emscripten is implemented by wrapping WASM functions in JavaScript functions and using exceptions. This seems to copy (part of?) the local variables of the involved WASM functions to JavaScript. I must admit I don’t understand in detail what is going on. The alternative is to use (still very new) WASM exception handling. This indeed solves running out of JavaScript stack, but reduces performance by about 30%. That seems to be caused by the boilerplate to make it possible to handle exceptions.

Well, I see two ways out. One would be to define a new big number API to be used in Prolog itself that exploits LibBF’s capabilities to gracefully handle allocation problems. Then this API should be implemented on top of LibGMP and LibBF. That won’t solve the issue for GMP, but the WASM version uses LibBF for reduced size and a simple, uniform license. Mostly, it is a lot of work. I guess the simplest way is to define LibGMP wrappers that are not void functions, but return a failure code. Next, both need to be implemented and all Prolog bignum code needs to check the return values and act appropriately.
To make LibGMP safe under this, all functions would need a setjmp()/longjmp() wrapper inside. That might get rather slow :frowning:
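
A sketch of what such a non-void wrapper could look like, assuming the allocation hook from the earlier sketch longjmp()s to gmp_abort_env (all names hypothetical):

```c
#include <gmp.h>
#include <setjmp.h>
#include <stdbool.h>

extern jmp_buf gmp_abort_env;   /* target of the allocation hook's longjmp() */

/* Wrap mpz_mul() such that an allocation overflow becomes a return code
 * instead of a longjmp() through arbitrary Prolog VM frames. */
bool
pl_mpz_mul(mpz_t rop, const mpz_t op1, const mpz_t op2)
{ if ( setjmp(gmp_abort_env) )  /* the hook aborted the operation */
    return false;
  mpz_mul(rop, op1, op2);
  return true;
}
```

Every arithmetic call site then has to check the return value and, as noted, paying for a setjmp() per GMP call may well be too slow.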

The other way out is to have setjmp()/longjmp() around arithmetic evaluation rather than globally. That requires some serious redesign of the VM’s handling of optimized arithmetic. Most likely it will also introduce a noticeable slowdown.

Yet another thought might be to wrap PL_next_solution() in a helper that does the setjmp()/longjmp(). ChatGPT says this will only make things worse, but I have my doubts that this is right. Anyone?

Also possible would be to introduce C++ and use C++ exception handling. That looks like a lot of work with uncertain benefits, though, and I doubt we can make LibGMP safe this way.

Also possible is to estimate the size of the LibGMP allocations and prevent overflow this way. That is in part already done, but for some functions the estimate is probably non-trivial.
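
For multiplication, for instance, the estimate is easy: the product of two integers of n and m limbs occupies at most n+m limbs. A hedged sketch (hypothetical limit and helper name):

```c
#include <gmp.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BIGNUM_BYTES (64*1024*1024)   /* assumed limit for the example */

/* True if multiplying a and b could exceed the limit; checked before the
 * GMP call so GMP never needs to allocate an oversized result. */
bool
mul_would_overflow(const mpz_t a, const mpz_t b)
{ size_t limbs = mpz_size(a) + mpz_size(b);
  return limbs > MAX_BIGNUM_BYTES / sizeof(mp_limb_t);
}
```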

No need to clone: those files are in the GitHub repo.

setjmp()/longjmp() in Emscripten is implemented by wrapping WASM functions in JavaScript functions and use exceptions […]

Indeed, exception handling should not be used for control flow, one reason being the horrible performance. Anyway, to me the problem still boils down to minimizing and even getting rid of any use of setjmp/longjmp in internal (i.e. SWI’s) code (see below).

The other way out is have setjmp/longjmp around arithmetic evaluation rather than globally.

I was indeed thinking along the lines of keeping it as local as possible: thin wrappers around the few external things that may need long jumps (because these are only needed by some “external” things, right?). I mean specific ad-hoc “adapters” to the internal logic, which, OTOH, would be a single implementation committed to being as efficient as possible, in particular free of any use of, or even the notion of, long jumps…

Anyway, that was the general idea. I’ll also try and have a look at that code, though I cannot promise anything on that side… But I will follow any developments with interest. – BTW, why do you say src/pl-wam.c should be renamed? What would you say I will find in that file?

It is not used for control flow. It is used for Prolog exceptions. Only when using the VM as a set of functions does it use setjmp()/longjmp() to terminate the VM loop.

Because the VM is (very loosely) based on the ZIP VM rather than the WAM.

Anyway, it turns out the solution is in the other direction: add a wrapper around PL_next_solution() that deals with the setjmp/longjmp() and remove it from the function that implements the VM loop. See 7b4b138ba0da0e1e8336a0712f2fdbe143fad85d:

ENHANCED: Move setjmp() out of the VM main function

Using setjmp() harms register allocation, which slows down the VM.
Some data points: Clang: 6% (Clang-17 on Apple M1 as well as Clang-20
on AMD3950X), GCC: 13% (GCC-15 on AMD3950X), WASM: 35% (Emscripten
4.0.15 on Node.js 22.19 on AMD3950X). MinGW-14: 18% (Windows binary
running on AMD3950X under Wine).

This is quite surprising for a pretty trivial commit :slight_smile: This also solves the WASM recursion limitation over PL_next_solution(). It now handles 218 levels on Node.js 22.19. That is plenty. Before it managed 3 levels, which is a bit too tight.
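
In a rough sketch (hypothetical names; see the commit for the real code), the change amounts to this:

```c
/* After the change: the setjmp() lives in a tiny wrapper, so the huge VM
 * function no longer has to keep its locals safe across a setjmp(). */
#include <setjmp.h>

static jmp_buf throw_env;

static int
next_solution_vm(void)          /* the big VM loop: no setjmp() in here */
{ /* fetch/decode/execute ... */
  return 1;
}

static int
handle_pending_exception(void)  /* hypothetical: unwind and report */
{ return 0;
}

int
PL_next_solution_wrapper(void)  /* small function: setjmp() only affects it */
{ if ( setjmp(throw_env) )      /* a PL_throw()-style longjmp() lands here */
    return handle_pending_exception();
  return next_solution_vm();
}
```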

Thanks for thinking with me. You are more than welcome to dig into this :slight_smile: I was quite surprised to see that some minor changes double the performance on WASM. It is still 4 times slower than native though :frowning:


Used for handling exceptions, setjmp()/longjmp() is of course fine. :slight_smile: I was specifically referring to how setjmp()/longjmp() is implemented in Emscripten.

Because the VM is (very loosely) based on the ZIP VM rather than the WAM.

OK, I see, thanks.

Anyway, it turns out the solution is in the other direction: add a wrapper around PL_next_solution() that deals with the setjmp/longjmp() and remove it from the function that implements the VM loop.

In general, that is exactly the direction I was talking about: of course, I do not know the details…

This seems to contradict a little. The implementation now places the setjmp() in a (small) wrapper around the function implementing the VM main loop. That is as high on the stack as it can be. As the JavaScript wrapper apparently needs to copy the local variables of the function holding the setjmp() (I think, not sure), it matters that this wrapper is now small. As the VM function no longer calls setjmp(), I think (again) that this gives the compiler more freedom in dealing with local variables and register allocation.

Anyway, the default benchmark set now runs in 0.190 seconds on average per benchmark in Chromium. The native GCC version does the job in 0.07 seconds on the same hardware. 2.7 times seems in line with other experiences. Even more so if we consider that the Clang native version takes 0.10 seconds instead of 0.07. I wonder whether there is a simple trick to make Clang compete with GCC on SWI-Prolog, as it in general seems to do elsewhere?

Take home: the devil is in the detail…

Maybe I have misunderstood you, I don’t know: at this point I am simply confused… :wink:

My lesson learned is that setjmp()/longjmp() for exception handling is fine, but you should ensure that the setjmp() does not happen inside a (large) time-critical function. As setjmp() is not cheap, you should also try to reduce the number of calls. The downside of that is that it complicates the required cleanup.