VM code refactor - thoughts?

You might be interested in Peter van Roy’s PhD thesis (“Aquarius”). It used a compiler to generate code that could be competitive with C code. (The compiler was written in Prolog, and heavily used “extended DCGs”.)

You might also be interested in this talk (and others by Chandler and people on the Google C++ team) https://www.youtube.com/watch?v=rHIkrotSwcc (when you run hundreds of thousands of machines, even a 1% performance boost can make the accountants delight in paying a team of optimization specialists).

In theory, a clever enough JIT compiler for Prolog might be on par with compiled C code, for equivalent code. In practice, I have no idea what I am talking about.

It occurs to me that the performance hit could be caused by significant cache misses in the instruction pipeline, and these would show up more on a laptop than on a server.

@jan - is there any particular ordering to the instructions in pl-vmi.c? It might be beneficial to put heavily used instructions together (or there might be clusters of instructions, such that if instruction A is generated, there’s a high likelihood of instructions B and C being next in the instruction stream). I don’t think PGO would do any such reordering; also, if the instructions are turned into functions, the linker might not put things at ideal locations.

Hey folks! Sorry for the long delay. Goodness, this has been active, though!

Anyway, some random thoughts: I do currently have a build where it’s possible to switch between function-mode and switch-mode just by changing some preprocessor defines. And in fact, I’ve published the branch! It’s at:

Please note, this is very much an experimental branch; feel free to check it out, clone it, whatever, but I’ve been playing fast and loose with my own git history, so be aware there may be rebases and force-pushes in your future if you base anything off of it. With that said, here’s a quick walkthrough:

  1. VMI instruction “functions” are now accompanied by VMH helper “functions”, with callsite-based arguments. Since this is still based around the transfer-of-control model the switch mode introduced, this is performed using a macro VMH_GOTO() as opposed to VMH_CALL() or anything like that. VMH_GOTO takes the name of the helper, along with the arguments; in switch mode, this uses struct definitions and struct initializer syntax to mimic “argument passing”, which is easily optimized away by the compiler. This completely eliminates the SHAREDVARS mechanism, and also helpfully centralizes the list of “what shared-state variables exist in PL_next_solution()”. As a result, a macro invocation like VMH_GOTO(foo,1,2,&bar) gets turned into:
{ struct helper_foo_args __args = {1,2,&bar};
  global_helper_foo_args = __args;
  goto helper_foo;
}
  2. The VMH definitions look a little weird, because of the limitations of macro processing. The argument types and argument names are listed separately, so the foo helper above might be defined:
VMH(foo, 3, (int, int, int*), (x, y, pointer))
{ *pointer = x + y;
  NEXT_INSTRUCTION;
}
END_VMH
  3. Speaking of END_VMH and the parallel END_VMI: I originally modified VMI() so that the macro invocation encompassed the body, like below. While this had the lovely effect of allowing you to include the file in a mode that suppresses the code itself, effectively turning it into a header file, it had a pretty major downside: due to what the GCC team has considered a “bug” they have “intended to fix in the near future” for a little under two decades, anything produced by macro expansion gets concatenated onto one line, which is, shall we say, Not Great For Debugging. So we have instead moved to this model with a start macro and an end macro for each function. Technically, we could have had a starting macro and an extra close brace at the end, but I am very much not a fan of unbalanced braces in the source, so no.
VMI(H_FIRSTVAR, 0, 1, (CA1_FVAR),
{ etc etc...
})
  4. The mkvmi helper utility now creates the generated header in src/ rather than the build output directory. The intent here is to have it committed into the git repo, like a package lock file for a Node project, so that it’s obvious when and where anyone changes the VMI signature. The generated file is pl-vmi.ih, and it both gets included directly into pl-wam.c in a few places, and also forms the data source for some other newly-checked-in header files that used to be autogenerated by mkvmi, but now are just renderings of the macros in pl-vmi.ih.

That about sums it up for me, for now - I’m afraid I don’t have the energy to go through the whole discussion above, so if you have any other specific questions for me, please feel free to ask!

Well, I wouldn’t necessarily say that this change incurs that overhead. When I said this was a very basic implementation, I meant it - I’m passing around a lot more values than I really ought to for code of this criticality, for one. I’d expect a more refined version of this same concept could reduce the performance loss fairly significantly, but I’ll know more once I’ve done a little more exploration.

I built the stuff. It seems one has to set VMI_FUNCTIONS by hand; without it, it doesn’t build. That is irrelevant for now. I see what I like :slight_smile: I need to study this in more detail. I can confirm a 3% slowdown using default compilation and a 23% slowdown comparing the PGO versions. This is on Ubuntu 20.04, gcc 9.3.0, AMD 3950X CPU.

I don’t like your choice for including the intermediate products from mkvmi in git (I’m very strict in avoiding build artifacts under source control). We’ll find a way for that.

There are surely big advantages. Just think how much easier it is to profile the VM.

We should definitely make this move if we can close the performance gap!

I’m usually a big fan of not putting generated output in the build directory as well, but the way I figure it in this case is this: the usual practice is to write .c files with accompanying .h files by hand, even though one could just as easily generate the .h file off the C source through exactly this kind of text-parsing. In this case, it’s so critical that the .h file matches the C source exactly that we have an automated process that keeps it up-to-date for you. We could just as easily do this with a git hook, but what we currently have is a part of the build system, so I just kept it there. Does that make sense?

We could also change it to just being a consistency check, i.e. it throws an error if the header file doesn’t match the .c file, if that would give things a better feel.

-Danielle

Do you think there’s any merit in my idea about clustering the implementations of VM instructions that tend to be used together? (pl-wam.c.o is 204K stripped, so it’s definitely bigger than the L1 cache on my machine.) If so, I’d run vm_list/1 on the *.pl files in the library (plus whatever else I can easily find) and look for clustering patterns.

On my workstation, the caches are:
L1: 4 x 32 KB 4-way I-cache; L2: 4 x 256 KB 4-way; L3: 6 MB 12-way
(access times: L1 is 4 cycles, L2 is 12 cycles, L3 is 42 cycles, according to https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake#Memory_Hierarchy)

Right, yes, I forgot about that bit, ha. I have fixed the non-function build locally, and once that’s pushed, switching between the function and non-function versions only needs a define toggled in pl-wam.c before recompiling.

I’m gonna say “mostly no” on that one, at least in the present implementation. There are instructions that specifically transfer control to other instructions (mostly variants and whatnot, like T_TRY_VAR and T_VAR), and I’d expect the compiler is already siting those close together. What you’re talking about, though, is unrelated instructions that tend to get clustered together, and the issue with trying to co-site those is that between every call, control returns to PL_next_solution(), which would make it hard to keep the cache hot.

That said, I do have some other ideas in the queue that should help CPU cache times, so I’ll certainly keep that factor in mind!

My guess is that changing the indirect call into the big switch I suggested will end up producing more or less the same compiled code as the original.

Update: I’ve done a bit of rework on my branch, which as promised has involved some rebasing. If you’ve checked it out locally, make sure to do:

git fetch && git reset --hard @{u}

after, of course, stashing or otherwise preserving any local changes you might have made!

This update includes a commented-out #define VMI_FUNCTIONS 1 that will enable vm-as-functions mode if uncommented.

That’s certainly the intent! One of the reasons I keep rebasing and editing that branch is that I want to keep all the syntactic and semantic changes separate and discrete; each commit is intended to be (a) a fully-working build that, by default, compiles down to the same optimized code as the original and thus has no performance loss, and (b) a simple-enough change (like adding END_VMI to the end of every instruction definition) that it can be statically analyzed and proven to be the same implementation logic as the original.

Latest build cuts performance loss vs. original down from 20% to 10%. Still working on bringing it down further!

Just tried 1a97a5e0b5e3497619b48b10e516a46be919a9b6, but I still see a noticeable performance difference for the PGO version. It still loses by about 20 to 25% :frowning:

That is AFAIK not how CPU caches work. They work by caching small chunks of memory (cache lines), I think 64 bytes on most Intel(-based) machines. So your cache can hold some number of these chunks, and I think it doesn’t matter whether the chunks are close together in memory or not. What does matter is that addresses that are used frequently together appear in the same cache chunk, so you only get one cache miss rather than two.

At least, that is my understanding. If I’m wrong, please correct me.

P.s. I do see that for the PGO-compiled version the classical one has a pl-wam.o of 199340 bytes and the new one 268187 bytes.

P.s. Also note that pl-wam.o contains more than pl-wam.c. This file also includes the clause indexing and foreign function interface to allow the compiler to optimize and inline stuff together.

I don’t really buy the analogy. .h files are typically created by hand and many also document the public interface. They are things that not only keep a project internally consistent, but often also play this role to the outside world. I quite often look in the .h file of a project as it lists only the public stuff in a concise way. I never look in pl-vmi.ih. There is nothing interesting there. The “version” of the VM is maintained as a hash of the VM instruction signatures. That, for me, is enough. Generated files complicate merging, and I want to be able to jump backward and forward and exchange commits between the devel and stable branches. When moving to CMake I did a lot of work to remove all generated code, and that simplified maintenance considerably. I don’t want to go back …

P.s. I think your changes will eventually be merged into the main source. Even if we fail to bridge the performance gap I think there is enough value in it for development and maintenance (for example it produces valuable profiling information on the VM) and thus if we can produce release versions as a switch we should all be happy.

Potentially relevant for reclaiming the performance difference: apparently new versions of clang allow forcing tail-call optimization. Presumably not so useful right now, when we can’t require everyone to have a bleeding-edge clang, but hopeful for the future, I guess?

Music to my ears :slight_smile:

Oh, sorry! It looks like I forgot to actually upload my work before going to bed last night. I’ll get the new code up on the branch as soon as I have a chance to get to my computer.

So as not to leave you in suspense, though: I used GCC’s explicit register variable mechanism to pin REGISTERS, LD, and PC into nonvolatile registers and removed them as arguments to the function calls. In effect, I created my own calling convention within the standard x86-64 calling convention. :joy:

It goes without saying, of course, that for this to be effective, it requires someone to select viable registers for any given target architecture, but that only has to get done once.

It also occurs to me, though, that register-pinning LD in particular might make a good performance improvement across the board, if it were defined in pl-incl.h. Registers are thread-local, after all…
