VM code refactor - thoughts?

That’s fair, I suppose. I’ll pull the auto-generated stuff out into a build-directory file, though I will keep the single-output architecture of mkvmi.c. It’s just too useful to have a polymorphic include file that can act as a “repeat this line of code for all instructions/helpers”. Also, this way we don’t have to have custom-written .h code living inside double-quoted strings in a C file. That bit certainly tripped me up when I was trying to examine the swipl code on github, before I checked it out.

There is one more thing that occurred to me: shouldn’t we make the registers struct be part of the LD struct? That saves a register as well as the need for SAVE_REGISTERS. PL_new_solution() can save the current register struct in a local variable and restore that on return to allow for recursive calls (which we need). A PL_next_solution without state in local variables might make other stuff much simpler as well. Think about forking a Prolog thread the Unix way, i.e. where both threads continue executing the same goal.

As long as there is no checked in generated code I think I’ll be happy :slight_smile:

YES! Thank you, thank you. I was hoping that’d be doable but I wasn’t sure if the impedances would (or could be made to) match. There’s a lot of finer-point stuff like that I’d like to go over with you, actually, once I’m not so heads-down in the mechanics of it.

Interesting read that notes VM implementations specifically near the end. We’ll see where this goes. As is, clang looses big time from gcc (about half the performance for the clang compiled version on the Mac with current Xcode compared to gcc 10).
I did a little browsing for gcc. It does do tail call optimization at the higher optimization level. I couldn’t find a way to force it to use that for specific functions on lower optimization levels.

Well, a search for musttail gcc does return results in the context of “jit”. Not from the gcc docs though :frowning:

edit well, if gcc can do so at the higher optimization levels it might be possible to use this for production releases and the simple loop for development builds.

Sorry for the radio silence, folks! I just got my second Pfizer shot, which on the one hand is hooray! But on the other hand I’m definitely still recovering. Should have some updates for you soon though, hopefully!

4 Likes

Hi all! I’m back with another hit single update, this time including pull request! So long as everything checks out, you should soon be able to experiment with the function-mode VMI implementation without switching away from master. The code is still disabled at present; GCC builds will operate in branch-mode, while LLVM builds operate in switch-mode. To enable it, you’ll still need to edit pl-wam.c and uncomment the appropriate line (helpfully and appropriately tagged with HACK, until the performance losses are mitigated, at which point it’ll go into the Cmake configuration like everything else).

New in this version: register-mode functions. If your development environment is enough like mine, which is to say GCC on an x86-64 Linux system, you can uncomment the two defines after VMI_FUNCTIONS as well to enable an even-more-experimental mode in which the registers don’t get passed via the standard C calling convention, but are instead pinned to host processor registers while the VM executes.

One thing to note: the wakeup, trace, and backtrack handlers have moved from the bottom of PL_next_solution() in pl-wam.c into pl-vmi.c, where they are now “VMH” pseudo-instructions. This means that in the default non-function implementation, they are still included into PL_next_solution(), while in the function-mode implementation they are standalone functions.

2 Likes

Great @dmchurch! All compiled and tests nicely. Walked through most of the changes. 99% is fairly mechanical so I guess we should not expect many regression issues. As is, this is a cleanup of some of the VM interactions. It is also pretty complicated C preprocessor code which I hope we can mostly get rid of if it turns out there is a way to switch completely to the function based version. That would be a big step forward.

There is a bit weird thing going on wrt PGO optimization. First I fixed the mistake that PGO compilation didn’t work at all (Ann: SWI-Prolog 8.3.24 - #2 by jan). Using the PGO training set I now observe the following:

Version Average on bench/run.pl
Public master, normal build 0.302
Public master, PGO 0.210
VMI-Functions, normal build (functions not enabled) 0.283
VMI-Functions, PGO (functions not enabled) 0.264
VMI-Functions, normal build (functions enabled) 0.305
VMI-Functions, PGO (functions enabled) 0.270
VMI-Functions, normal build (functions enabled, registers) 0.300
VMI-Functions, PGO (functions enabled, registers) 0.252

Note that there is some variance in this. Typically running the same benchmark a couple of times has a variance of about 0.010 (about 3%). All measurements are on AMD3950X CPU, GCC 9.3.0 (Ubuntu 20.04).

The good news seems that the function version does a good job compared to the switch version. It is also nice that the new source outperforms the old one without using PGO, but it is sad that it looses over 20% compared to the old PGO version :frowning: Do you have a clue why, @dmchurch?

Ha, you should see the preprocessor definitions file I was using before I cleared out all the dead-end exploration I did! I called it “macrobatics.h”. There was integer arithmetic. There were list manipulation primitives. There were closures.

Anyway, the main question, why is vmi-functions performing so poorly? Very good question. The only actual changes I can think of between the goto-mode code from the original and the branch are:

  • The goto labels changed from D_BREAK_LBL to instr_D_BREAK. Should be zero impact, unless I missed a reference in some way that didn’t cause a build failure.
  • The attvar wakeup handler is now listed before the VMI instructions. Should be zero impact, there’s no fall though, the compiler is free to reorder anyway.
  • Registers are members of a struct instead of individual variables. Shouldn’t make a difference to the optimizer, unless there’s some magic happening with in-memory layout of the variables, which seems unlikely given how widespread the usage is. Technically, it could increase the stack frame size if some of the register variables actually managed to avoid getting spilled into memory locations, but given the size of the function that seems incredibly unlikely.
  • The pl-alloc indirect functions now have internals which use struct-return. Should be zero impact, all these are marked inline and the optimizer should be able to figure that out.
  • Helper vars are now scoped to PL_next_solution instead of block-scoped. Ideally it shouldn’t make a difference? But it might cause more register spilling. Anyway I tried to minimize that with

Okay, never mind, I think I know what’s going on here. I threw all the various helper arg types in a union as an attempt to indicate to the compiler “when one of these is used, the others don’t matter, feel free to reuse the stack space” but I think it backfired; the union is probably forcing the compiler to spill these to the stack, rather than leaving them in registers. Try getting rid of the union declaration at the top of PL_next_solution and adjusting the HELPER_ARGS macro to compensate. I’ll be doing the same over here, of course, to see if I can reproduce and fix.

This is all beyond me since I haven’t worked at this level for many years, but to understand what PGO is I looked it up and noticed the intel description.

Apparently, PGO is an optimization that is based on instrumented analysis of code, so it has to see code running – does this mean that the function version needs to be run many times in some instrumented way before PGO can do its thing ?

Dan

LoL :slight_smile:

The rather odd thing is that the normally compiled (-O3) version outperforms the original but the PGO version looses. I’d indeed expect the compiler to figure out most. In particular the union as that is defined inside the PL_next_solution() function. For the register_file struct, this is defined outside the function. would the compiler be smart enough to see that it doesn’t leak outside the function and thus it can use the fields just as normal variables, property ordering them and possibly reuse the same location if there is no conflicting use?

Tried the two changes below (remove union and helper_args. This results in a lot of warnings as variable ‘wakeup’ set but not used. No measurable impact on the performance of both the normally compiled and PGO versions though. Is that what you had in mind?

#define HELPER_ARGS(n)			n

---

FOREACH_VMH(T_EMPTY,
    ,VMH_ARGSTRUCT, ,VMH_NAME,;
  )

edit tried moving the typedef struct register_file into PL_next_solution(). Makes no difference.

edit 2 Removed the register_file struct altogether. Now I get back the old performance, as well as considerably better performance without PGO: 0.268 without and 0.205 with PGO. It seems GCC isn’t smart enough to decide that a local struct can be reordered at will. My default compiler is GCC 9.3.0. GCC 10 behaves the same though.

edit 3 asked Stack overflow c - GCC local struct vs individual variables and PGO optimziation - Stack Overflow

The simple idea is that normally the compiler has to make a lot of guesses as to how often something is used (does unrolling or inlining make sense?), which branch is most commonly taken, etc. By first running a representative program using instrumentation and then recompiling with the output of this instrumented program the compiler doesn’t have to guess that much. The more representative the program, the better it is. Ideally you use the target program. SWI-Prolog uses the programs in the bench subdir. This works fairly well, although some areas are barely touched by these benchmarks. If you care about these, write more programs and send a PR.

Basically, yeah. Here’s a version that passes tests and doesn’t spit out any compiler warnings:

diff --git a/src/pl-wam.c b/src/pl-wam.c
index 3b16528ba..46ac70a5d 100644
--- a/src/pl-wam.c
+++ b/src/pl-wam.c
@@ -3050,7 +3050,7 @@ static vmi_instr jmp_table[] =
 
 #else /* VMI_FUNCTIONS */
 
-#define HELPER_ARGS(n)                 helper_args.n
+#define HELPER_ARGS(n)                 helper_args_ ## n
 #define _VMH_DECLARATION(Name,na,at,an)        helper_ ## Name:
 #define _VMH_GOTO(n,args...)           VMH_ARGSTRUCT(n) __args = {args}; \
                                        HELPER_ARGS(n) = __args; \
@@ -3097,12 +3097,10 @@ PL_next_solution(qid_t qid)
   Code PC;
 
 #else /* VMI_FUNCTIONS */
-  /* define local union with all "helper arguments" (formerly SHAREDVARS) */
-  union
-  { FOREACH_VMH(T_EMPTY,
-      ,VMH_ARGSTRUCT, ,VMH_NAME,;
-    )
-  } helper_args;
+  /* define all "helper arguments" structures (formerly SHAREDVARS) */
+  FOREACH_VMH(T_EMPTY,
+    ,VMH_ARGSTRUCT, ,HELPER_ARGS,;
+  )
 
 /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Get the labels of the various  virtual-machine instructions in an array.
@@ -3256,6 +3254,11 @@ resumebreak:
 #include "pl-vmi.c"
   }
 
+  /* ensure all helper args show as "used" */
+  FOREACH_VMH(T_EMPTY,
+    (void) ,HELPER_ARGS,;
+  )
+
 #endif /* VMI_FUNCTIONS */
 
   assert(0);

Don’t have time to run any PGO tests right this moment, but I’ll definitely keep digging, and I’ll let you know what I find.

Aha, I think I’ve tracked down the issue. @jan, can you test the latest push to vmi-functions and verify?

Curiously enough, this issue is one I had to deal with in the function-mode implementation as well, so I just pulled the code out of the #ifdefs. I’m happy to say that the root cause - the switch from local variables to a struct - is one of the things I mentioned in my analysis, though I did miss something. Specifically, that there is in fact one register that never gets spilled onto the stack because it changes constantly: PC. Having to update the RAM shadow of the PC value on every single increment is disastrous for performance, which is why in the function-mode implementation, I changed it into a value that gets passed around in parameters and return values.

The commit I just pushed to the vmi-functions branch removes PC from the register_file structure entirely; in the non-function-mode builds, it is again just a local variable in PL_next_solution().

It is a big step, moving from 0.265 to 0.220. Still, if we forget about the struct and define all variables independently we get 0.205, which is even a bit better than the current master. As said before, do not takes these figures too serious as an about 0.01 variation is common. The overall picture after some runs is fairly persistent though.

I thought this may be due to the compiler reordering the automatic variables. This looks like a misconception. It seems the compiler does reorder them from the declared order simply using the rule “smallest object first, preserve order for equally sized objects”. PGO or not doesn’t seem to change that. Just to be sure I did the same by hand for the struct to find what was to be expected: it doesn’t make any difference.

It is problematic that figuring out the addresses has effect in itself. This is to be expected as the address leaks to a function that is unknown to the compiler (Sdprintf()) and thus the compiler has to make sure this address contains the right value and the value is picked up from there at any time some external function is called. An interesting observation is that the impact of printing these addresses is the same as using the struct.

On SO it was noted that taking the address of any of the struct members may force the compiler to assume the entire struct is known externally. A grep shows we get the address of THROW_ENV (which can be outside the struct) and PC for e.g. equalIndirectFromCode(*k, &PC) (try a grep for '&[A-Z]' pl-wam.c pl-vmi.c).

Sorry for the self-reply. I’ve pushed some patches on top of your branch to GitHub - SWI-Prolog/swipl-devel at vmi-functions (i.e., vmi-functions branch on the main repo). This was on a comment on SO that leaking any of the addresses of a field in the register struct probably forces the compiler to assume pointers to all of them and keep them in the structure layout. The updates do:

  • Move the longjmp environment outside the registers as we need to get its address.
  • Use your VM_*Indirect() functions directly in the VM, also for the non-function version.
  • Add -DVMI_FUNCTIONS as CMake option to allow switching without editing.

Now we get the results below. For the classic GCC label addresses this implies a roughly
10% performance improvement for the non-PGO version and

Ubuntu 20.04, AMD 3950X

Version Time
Classic GCC 9.3 address labels 0.270
Classic GCC 9.3 address labels with PGO 0.210
Classic Clang 10 address labels 0.345
Classic Clang 10 address labels with PGO 0.245
Functions GCC 9.3 0.310
Functions GCC 9.3 with PGO 0.245
Functions Clang 10 0.325
Functions Clang 10 with PGO 0.265

Some timing on MacOS on a Macbook Air 2018

Version Time
MacOS Xcode 11, master branch 0.555
Classic XCode 11 address labels 0.504
Functions XCode 11 0.555

Intel i7-5557U, Ubuntu 20.10, GCC 10.2

Version Time
master branch 0.555
master branch PGO 0.337
Classic GCC 10.2 address labels 0.403
Classic GCC 10.2 address labels PGO 0.336

The good news is that this seems good enough to move forward with this branch. A bit weird is that on Linux the Clang (10) function version is faster than the label version, but in MacOS this is the other way around. XCode is Clang 11 AFAIK and the Mac has an Intel i5 CPU.

@dmchurch, please have a look at what I did. I think I made some things a bit ugly, notable dealing with the throw environment. Feel free to do a little cleanup. If you agree I’m happy to merge this into master and have it tested in practice :slight_smile:

2 Likes

Will this version become available on windows as well

This looks fantastic, @jan, thank you so much! I’ve always been intending to do some major cleaning up on the whole register variables bit after this gets merged; I just wanted to be sure my initial PR made as few changes to the actual VMI code as possible. Do you need me to merge your commits into my vmi-functions branch?

No reason it shouldn’t, @grossdan. The specific register selection I chose for the register pinning mechanism might need adjustment for the Windows x64 calling convention, but that’s all experimental, disabled-by-default code anyway. At some point I’ll start running some test Windows builds, though of course you could just do what I do and run it under WSL2 :grinning:

No. I think it can be merged as is and in due time we can do small stuff to make the code cleaner. Maybe today, otherwise tomorrow I’ll merge. Thanks!

Ha, I’d missed this and I’ve already pulled them in :sweat_smile: I’ll stop touching that branch until you’ve taken care of the merge, shall I?