Hello! As I mentioned in What is the status of SWI-Prolog ported to WebAssembly, I’m working on enhancing swipl’s WebAssembly port. It’s going along fairly well, with the one unfortunate exception that pl-wam.c takes approximately 180 seconds to compile, and it won’t even run (PL_next_solution() gives a runtime error: "Too many local variables") when compiled at -O0.
So I’ve started down a separate path of refactoring the WAM code away from the one-big-function paradigm and towards one-function-per-instruction. My thought is to kill two birds with one stone - to allow compilation on WASM regardless of optimization level, but also to make the control/data flow more explicit, to give modern compilers more of a chance to optimize.
Initial results are promising - I have a prototype function-mode build (currently compiling to native Linux/64/gcc), and there’s no noticeable performance difference between that and either the switch-mode or goto-mode builds: the ctest suite passes all tests, and it takes close enough to exactly the same amount of time in each mode that I’d probably need to average out over a couple-hundred runs just to tell which was fastest. Is there a better benchmarking test that I can run?
Core devs, especially @jan - what are your thoughts on this move? Should it be restricted to the WASM build? If it proves to be as performant as the single-function version, should the old version be removed entirely given the clutter it requires in PL_next_solution()? Should it go even further? I can already think of some potential mechanisms that this model would enable, including the possibility to precompile sequences of WAM bytecodes to C, perhaps even whole procedures, with full optimization.
As for performance, the real test is to build the PGO-optimized version, as anyone interested in performance is using that. Depending on the platform it can be almost twice as fast as the normal -O3 build.
I did this at least 20 years ago. The fact that in the current VM the instructions look “function like” while they are in fact code fragments in a giant procedure is a remnant of that. It surely made the code a lot cleaner, but also about twice as slow, and thus I reorganised the code to be effectively a single function while looking as close as possible to a set of functions.
Hardware has changed (I was probably using a Sun SPARC in those days) and, notably, C compiler technology has improved a lot.
So yes, if it proves to have the same performance using current technology, I’m all in favor! I assume this means we have an array of function pointers, no? If so, we can also gain some performance by specializing VM instructions that use dynamic tests for debugging, signals and some more (currently mostly packed under the alerted flag), and swapping the function pointer. Also, creating mega-instructions (replacing a common sequence of VM instructions by a single one) becomes a lot simpler. That used to be an important optimization technique; I’m unsure how useful it is these days.
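For illustration, the swap could be as simple as this (just a sketch with made-up names, not the real swipl identifiers):

typedef void (*vmi_fn)(void);

static void i_call_fast(void)    { /* call, no debug/signal checks */ }
static void i_call_alerted(void) { /* same, but first test debugging, signals, ... */ }

/* made-up dispatch table, indexed by opcode */
enum { I_CALL = 0, VMI_COUNT = 1 };
static vmi_fn vmi_table[VMI_COUNT] = { i_call_fast };

/* swap the pointer whenever the "alerted" state changes */
static void set_alerted(int alerted)
{ vmi_table[I_CALL] = alerted ? i_call_alerted : i_call_fast;
}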
@dmchurch do you know whether this multithreading support could also get us a working Prolog console in the browser? Currently the blocking read call deadlocks the event loop, leaving no way for the user to provide data for the read.
It absolutely can! I haven’t actually gone and made a page-based console or anything, but I’ve used low-level primitives with my test build and it runs just fine. (In fact, it doesn’t even need to use a multithreaded build - having the Prolog code running in a worker solves that issue on its own.) That’s not the direction I’ve been directly pointed in, though; my MVP has been “get pengines working, run Swish against an in-browser implementation”. And that’s currently sitting behind “convince WebAssembly to shorten my compile-test time below three minutes”, ha.
Aha! Fantastic fantastic info here, @jan, thank you so much. Once I’ve gotten my function-mode code in slightly neater order I’ll publish the branch to my fork, even if I’m not quite ready to send in the PR yet. I’m looking forward to seeing what kind of proper benchmark results I can get!
And yes, I’m currently trying two implementations, using the same VMCODE_IS_ADDRESS switch: one with an array of function pointers indexed by opcode, the other with raw function pointers emitted directly into the VM code by the compiler. There’s much less advantage to be gained from the latter, though, because you can’t force a C compiler to tailcall, so it’s gotta return to PL_next_solution() prior to dispatching the next function no matter what.
It also occurs to me that swapping out function pointers would be a good way to support something like processor-specific optimizations determined at runtime, so that’s cool too.
Anyway, I’ll keep you in the loop as to how it goes!
BTW, if you’re having trouble with getting gcc or clang to optimize the way you want, I have some connections to people who work on the compiler, and might be able to get some advice from the gurus. (No promises – I haven’t talked with them for over 3 years and I know that they’re very busy.)
Ha, I think I saw that very SO post! Unfortunately, as I discovered when I made a little test cradle, while I can easily enough convince a compiler to do tail recursion without growing the stack, I can’t get it to tailcall a sibling function the same way (even if they have not just the same arg types but also the same arguments). If you’d like to dig into it further, by all means go ahead, I’m certainly curious! Don’t go to too much trouble, though - while saving on the function prologue/epilogue execution time would certainly be nice, my biggest reason for wanting to go the tailcall route would be to help the compiler understand the control flow better, and if there isn’t any general support for TCO, I doubt there would be that much optimization effort put into a -fforce-experimental-tailcalls option or something of the sort.
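For reference, the kind of test cradle I mean is roughly this (a sketch, not my exact code): the two siblings have identical signatures, yet nothing in standard C lets you require that the cross-calls compile as jumps rather than calls.

static int is_odd(unsigned n);

/* mutually recursive siblings; a compiler that performs sibling-call
   optimization may turn these calls into jumps, but you cannot force it */
static int is_even(unsigned n)
{ return n == 0 ? 1 : is_odd(n - 1);
}

static int is_odd(unsigned n)
{ return n == 0 ? 0 : is_even(n - 1);
}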
As for @peter.ludemann: what we are dealing with here is mapping a VM instruction to a function one-to-one. This means running the VM is typically something like:
for(;;)
{ (*vmi[*PC++])();     /* vmi[] maps opcodes to instruction functions */
}
Now I think @dmchurch wants to save the function pointer in the VM code (as the current version stores the switch addresses in the VM code, using GCC’s &&label syntax to take the address of a label). Then we get this code (disregarding arguments):
for(;;)
{ (*PC++)();           /* PC points directly at function pointers */
}
Ideally, however, we would not have to do this at all, but instead have each VM instruction do what is shown below. This would work fine if the C compiler did last-call optimization (also in debug mode, b.t.w., as otherwise debug builds would become useless).
some_instruction()
{ <do some work>;
  (*PC++)();           /* ideally compiled as a tail call to the next VMI */
}
I’m not sure I would like this anyway, as one of the nice things about having the VMI instructions returning is precisely that they return. That would allow getting rid of various dirty hacks and allow for easy-to-maintain “jumbo instructions”. It is not unlikely that this gains more than it loses.
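For illustration, with returning VMI functions a jumbo instruction is little more than this (a sketch; the names are made-up stand-ins):

typedef enum { VMI_NEXT, VMI_BACKTRACK } vmi_rc;

static vmi_rc h_firstvar(void) { return VMI_NEXT; }   /* stand-in VMI */
static vmi_rc h_atom(void)     { return VMI_NEXT; }   /* stand-in VMI */

/* a jumbo instruction replacing the sequence H_FIRSTVAR, H_ATOM */
static vmi_rc h_firstvar_atom(void)
{ vmi_rc rc = h_firstvar();
  if ( rc != VMI_NEXT )
    return rc;                 /* e.g. a backtrack propagates out */
  return h_atom();
}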
I’m probably in any case in favor of using a function pointer array, as that allows us to dynamically adjust the VM instructions between debugging and normal execution versions.
There are really no security implications here. If you can write arbitrary values to arbitrary locations (which you would need in order to alter the opcode table, which already exists in swipl, by the way), you already have full control of the system. The simplest and most common tactic at that point is just to overwrite the return address of the current stack frame, in which case the host processor itself will jump to any address you like. And even failing that, the swipl VM has the FCALL functionality and opcodes, which instruct it to call whatever function pointer you pass it.
Much like with hardware microprocessors, the security implications need to be addressed at a higher level than the VM implementation - in this case, via sandbox functionality.
I recall looking at some embedded software requirements (MISRA?), and they didn’t allow recursion (because of the possibility that the stack could overflow). And there were a lot of other standard software practices that were not allowed. So, I somehow doubt that Prolog would be accepted, at least at the lower levels, even though it could greatly simplify that software. (If you want to see truly scary stuff: Better Embedded System SW: A Case Study of Toyota Unintended Acceleration and Software Safety)
At a higher level, there are probably already interpreters for things like JavaScript (or Java, for that matter, if Android Auto is used; although maybe Dalvik nowadays compiles to machine code), so whatever security measures are applied at that level, something similar could probably be done for Prolog.
For anyone who’s interested, my initial bare-bones tests are showing about a 20% performance loss on the VM-as-functions version (testing a PGO gcc 9 build of both), which isn’t a bad place to be starting from! I’ll be pushing the code to my repo at some point later tonight.
Is this on a recent x86? I wonder what the performance difference would be on M1? The x86 needs to spend a lot of transistors to make up for the design flaws of the x86 ISA; the M1 can spend the transistors on things like fancier branch prediction and better caching …
(Some years ago, I did some performance investigation of Quintus Prolog on PowerPC … it turned out that some undocumented architectural features of the PowerPC made it run much faster than IBM’s “Red Book” calculations)
Yes, this is on my laptop, which has an 8th-gen i7. And, considering this really is just a 1:1 translation - the processor is doing all the same steps it was doing before in the branch model, only now with added stack transitions - I’m not particularly surprised to see it take a hit! It’ll be interesting to start doing some profiling and see where the time is getting eaten up.
I only know it has been audited in financial and telecom contexts. Both resulted in a couple of things to improve where it was unclear whether or not there was a bug, but there was a simple way to make the code obviously right. The VM itself is far too hard for automated tools to prove its safety. I guess that holds for most VM-based systems.
Is this difference a lot bigger than with the non-PGO version? You claimed nearly no difference before. That made me excited, as a function-based approach can also simplify and speed up some things, so we would end up with something both faster and cleaner. Recovering 20% could be a challenge, and there are too many users who would be pretty unhappy to hear about any significant performance loss. We’ll see …
With an approach similar to the one below you should not get any performance penalty, and you get full inlining.
It would also permit having both a debug and a non-debug run.
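Something along these lines (just a rough sketch with made-up names): the body of each instruction is written once in a macro and can then be expanded either as its own function or inside the big switch.

#define VM_AS_FUNCTIONS 1          /* or 0 for the switch build */

enum opcode { I_NOP, I_EXIT };

/* each instruction body is written once, in a macro */
#define VMI_BODY_I_NOP() \
  do                     \
  { /* do the work */    \
  } while(0)

#if VM_AS_FUNCTIONS
static void i_nop(void)            /* function mode: one function per VMI */
{ VMI_BODY_I_NOP();
}
#else                              /* switch mode: same body in the big loop */
static void run(const enum opcode *PC)
{ for(;;)
  { switch(*PC++)
    { case I_NOP:
        VMI_BODY_I_NOP();
        continue;
      case I_EXIT:
        return;
    }
  }
}
#endif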
Thanks. If I understand this correctly, this allows for both a function-based and a switch-based approach from the same source, no? Defining the code in a macro makes it pretty unreadable though, as we need a \ at the end of each line, and to the debugger the whole thing becomes one line.
I don’t know what @dmchurch did. A VMI now looks like this (as I think you know):
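VMI(H_FIRSTVAR, 0, 1, (CA1_FVAR))    /* schematic; see pl-vmi.c for the real thing */
{ <body>
  NEXT_INSTRUCTION;
}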
This tells the system there is a VMI named H_FIRSTVAR with no special properties (0 is a flag field) and one argument, which is the offset of a first var, i.e., an environment variable seen for the first time. By defining VMI appropriately it is pretty easy to turn this into a function. NEXT_INSTRUCTION means go on, and there are a couple of other macros that tell the system to backtrack or to continue execution in another VM instruction. These can simply be translated to return some_code.
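Schematically, the macro definitions for function mode would be along these lines (the return-code names are just made-up stand-ins):

typedef enum { VMI_NEXT, VMI_BACKTRACK } vmi_rc;

/* the VMI(...) header becomes a function definition and the
   continue/backtrack macros become return statements */
#define VMI(Name, Flags, NArgs, ArgTypes) static vmi_rc Name(void)
#define NEXT_INSTRUCTION return VMI_NEXT
#define CLAUSE_FAILED    return VMI_BACKTRACK   /* stand-in for the real macro */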
This should be pretty straightforward. A serious problem is that some VMIs share some variables and (typically) several VMIs prepare some of these variables before jumping to a common implementation. That requires a bit more restructuring. I’m curious how @dmchurch handled this. Who knows, maybe we have some clever people around who find a way to make the system cleaner and faster.
The variable passing should be handled by passing arguments to common helper functions.
If we want to preserve both the function-based and the switch-based approach, we should introduce some macros for such gotos.
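A rough sketch of what I mean (all names made up):

typedef enum { VMI_NEXT, VMI_BACKTRACK } vmi_rc;

static vmi_rc common_unify(long offset)    /* shared tail of several VMIs */
{ (void)offset;                            /* ... the common implementation ... */
  return VMI_NEXT;
}

/* in function mode the old `goto common_...` becomes a returned call;
   in switch mode the same macro can still expand to a goto */
#define VMI_GOTO_COMMON_UNIFY(off) return common_unify(off)

static vmi_rc h_var(void)                  /* two VMIs that prepare a value ... */
{ long off = 1;                            /* pretend this was decoded from PC */
  VMI_GOTO_COMMON_UNIFY(off);
}

static vmi_rc b_var(void)                  /* ... before sharing the common code */
{ long off = 2;
  VMI_GOTO_COMMON_UNIFY(off);
}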
I agree that a comparison on the same machine between the two implementation models is useful; but I also caution that laptops often perform worse than desktops with the same CPU (and mobile-optimized CPUs are often significantly worse than similar desktop-optimized CPUs, and that’s without getting into details of the other chips); and a VM is the kind of thing that can be greatly affected by things like caching, memory performance, and hyperthreading.
In a past life, I had to tune a finite-state machine … IIRC, I got something like 5x by examining instruction-level traces and doing things like reordering if-then-else’s and judicious inlining. (I wonder how PGO plays with [[likely]] and [[unlikely]] – and are these the same as __builtin_expect?) I also tried 4 different compilers and got very different results. Compiler choice and specific optimizations mattered a lot on a RISC machine, not as much on a CISC (x86), hence my musings about M1.