Yet another thing to worry about – semantics of types and correctness definitions … 
Dan
An interesting article on optimizing Python and JavaScript, although it’s The Register, so some of it might not be completely accurate: The quest for faster Python: Pyston returns to open source, Facebook releases Cinder, or should devs just use PyPy? • The Register
JavaScript has a JIT implementation (Google’s V8 engine) that beats Python on simple benchmarks, but this often isn’t a net win for “production” code because many of the performance-critical parts of Python’s libraries have been rewritten in C. Pyston gets a ~30% performance boost by using a JIT – which fits with my observation of the overhead of the interpreter’s bytecode loop when the individual VM instructions are moderately complex (which is the case for both Python and Prolog).
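For a sense of where that overhead lives, here’s a minimal sketch of a switch-dispatched bytecode loop – the opcodes are hypothetical, nothing to do with CPython’s or SWI-Prolog’s actual VMs:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical opcodes; the point is the fetch/decode/branch cost
   paid on *every* instruction, which a JIT eliminates by emitting
   the handler bodies back to back. */
enum { OP_PUSH_CONST, OP_ADD, OP_PRINT, OP_HALT };

static void run(const uint8_t *code, const int64_t *consts)
{
    int64_t stack[64];
    int sp = 0;
    const uint8_t *pc = code;

    for (;;) {
        switch (*pc++) {          /* fetch + dispatch: per-instruction overhead */
        case OP_PUSH_CONST:       /* handler bodies are often just a few ops */
            stack[sp++] = consts[*pc++];
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_PRINT:
            printf("%lld\n", (long long)stack[--sp]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void)
{
    const int64_t consts[] = { 20, 22 };
    const uint8_t prog[] = { OP_PUSH_CONST, 0, OP_PUSH_CONST, 1,
                             OP_ADD, OP_PRINT, OP_HALT };
    run(prog, consts);            /* prints 42 */
    return 0;
}
```

When the handler bodies are tiny, the dispatch cost dominates and a JIT wins big; when each instruction does substantial work, the overhead washes out – which matches the observation above.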
This topic is tagged On-Topic-Only; please clean up the off-topic items.
I feel like it was more likely about the Rust digression? (shrugs) Either way, it doesn’t make that much difference to me.
An article about Python’s plans for a 2x speed improvement:
I’m guessing that there are more pay-offs in Python for JIT-like optimizations; a quick look at the instructions in ceval.c shows fairly short instruction sequences for LOAD_FAST etc.
There are a few interesting comments in that code.
It’ll be interesting to see what the Python folks use to get their 2x performance improvement, and whether they can actually get 2x. (BTW, if you want to talk with any of them, I can probably make introductions via a friend-of-a-friend; and their work will be open source anyway, so you can see who is doing what.)
I mean, if @jan wants to enlarge the instruction set, more power to him! I won’t claim nearly as much expertise on the operation and optimization of the WAM, so I’m sticking to the lower layers for my optimization work.
A couple more articles about what’s being proposed for Python; maybe a few of these ideas would be helpful for Prolog, but most of them seem Python-specific.
Personal opinion: in the end, they’ll probably get some nice overall boost – perhaps 1.2x or 1.3x – but I’m skeptical of 2x unless they make some changes to the language. Similarly, much as I’d like to see them, I’m not hopeful for huge gains with Prolog without changes to the language; and that’s already been tried, e.g. with Mercury or (in a sense) Erlang.
A concrete example: Instagram tried “inline caching for byte code”, which gave a 5% improvement in production. It takes quite a few such changes – and considerable effort – to get a 30% improvement, because not every program is helped by such “optimizations”; in some cases the optimizations make things slower.
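For anyone unfamiliar with the technique: the idea is to store the result of a slow lookup in a small per-call-site cache, keyed by something cheap to compare. A minimal sketch in C – the names and structure are invented for illustration, this is not Instagram’s or CPython’s actual code:

```c
#include <stddef.h>
#include <stdio.h>

typedef struct Class { const char *name; } Class;
typedef struct Obj   { Class *cls; int field; } Obj;

/* One-entry cache, conceptually embedded at the bytecode call site. */
typedef struct InlineCache {
    Class *cached_cls;     /* class seen last time at this site */
    size_t field_offset;   /* result of the last slow lookup    */
} InlineCache;

static size_t slow_lookup(Class *cls, const char *attr)
{
    /* Stand-in for a real dictionary/method-table walk. */
    (void)cls; (void)attr;
    return offsetof(Obj, field);
}

static int load_attr(Obj *o, const char *attr, InlineCache *ic)
{
    if (ic->cached_cls != o->cls) {        /* miss: refill from the slow path */
        ic->field_offset = slow_lookup(o->cls, attr);
        ic->cached_cls   = o->cls;
    }
    return *(int *)((char *)o + ic->field_offset);  /* hit: a compare and a load */
}

int main(void)
{
    Class point = { "point" };
    Obj a = { &point, 7 }, b = { &point, 35 };
    InlineCache ic = { NULL, 0 };
    printf("%d\n", load_attr(&a, "field", &ic)    /* first call misses...      */
                 + load_attr(&b, "field", &ic));  /* ...second hits: prints 42 */
    return 0;
}
```

The catch, as the 5% figure suggests, is that the win only materializes at call sites that really are monomorphic; polymorphic sites keep missing and just pay for the extra compare.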
I’ll certainly be taking a look at these! At some point, in my copious spare time …
To be honest, though? I absolutely am hoping to make some fairly ridiculous gains in swipl’s efficiency. I think an order of magnitude is probably too much to hope for, but 2x is where I’m setting my “initial goal” sights. This comes largely from my inspection of the generated assembly in swipl, at least for the amd64 case. But we’ll see! And we’ll certainly have a bit of a ways to go before we get there.
You know, that high-speed assembly-based library of data structures that was posted about earlier made me wonder – what if the VM core were rewritten in assembly? What would that achieve?
Sounds good
Not sure where you want to look. One of the candidates is to change the data representation on 64-bit. As is, a Prolog word has
On 64-bit machines we have at least 8 bits that we can freely use. We could omit the storage bits and have direct pointers. This requires some changes to the stack shifter and garbage collector, but it surely makes a pointer lookup much more direct. AFAIK at least some CPUs ignore the high bits of a pointer, so you can use the value directly as a pointer rather than the current steps: shift the word right by 7 bits to get rid of the above, get the base pointer for the stack specified by the storage bits from the LD struct, and compute the final address. That also means 6 bits for the tag at no cost, so we can get rid of the more indirect ways to separate the types. From memory, the types are
With 6 bits they can all get their own tag. I don’t think it is particularly hard to do. The main challenge is keeping the code working on 32-bit hardware. I don’t care too much about people with an old 32-bit-only desktop, but it would be a pity to lose the embedded devices. Also, WASM is 32 bits …
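To make that concrete, here is a rough sketch of what the proposed 64-bit encoding could look like – the field positions and sizes are my reading of the description above, not SWI-Prolog’s actual definitions:

```c
#include <stdint.h>
#include <assert.h>

typedef uint64_t word;

/* Assumed layout: 2 GC bits in bits 62-63, a 6-bit tag in bits 56-61,
   and the pointer stored directly in the low 56 bits. */
#define TAG_SHIFT 56
#define PTR_MASK  ((UINT64_C(1) << TAG_SHIFT) - 1)

static inline word mk_word(void *p, unsigned tag)
{
    return ((uint64_t)(uintptr_t)p & PTR_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

static inline unsigned tag_of(word w) { return (w >> TAG_SHIFT) & 0x3f; }

static inline void *ptr_of(word w)
{
    /* Portable path: mask the metadata away (one AND, no table lookup).
       On hardware that ignores the top byte of addresses (e.g. ARM's
       Top Byte Ignore) even the mask could be skipped. */
    return (void *)(uintptr_t)(w & PTR_MASK);
}

int main(void)
{
    int x = 42;
    word w = mk_word(&x, 0x21);   /* 0x21 is an arbitrary example tag */
    assert(tag_of(w) == 0x21 && *(int *)ptr_of(w) == 42);
    return 0;
}
```

Compare that with the three steps described above for the current scheme – shift by 7, fetch the stack base selected by the storage bits from the LD struct, add – and the attraction is obvious.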
I’ve certainly had some thoughts along those lines, and I’m not above writing assembly versions of certain routines if that’s the only way to get the performance! That said, I’d prefer to figure out a way to write the C code that convinces the compiler to generate the assembly we’re hoping for – and if we can make the data flows clear enough and the data structures self-describing enough, it might even manage it. After all, when you resort to writing assembly, you’re effectively forking your codebase once for each platform you target for optimizations, and I’d prefer not to have to maintain that many implementations.
I know I’m a polyglot who enjoys learning new assembly dialects for fun, but I’m not gonna assume anyone else is, and the last thing I’d want is for me to be the only person capable of maintaining the VM code!
I’d had that thought as well! Glad to hear you’re open to the idea!
And yes, I’m certainly invested in maintaining 32-bit support, since I did after all get into this by poking at the WASM build!
Out of curiosity, what would the downside be to adding one more metadata bit to the (32-bit) representation, bringing it up to an even 8 bits? Would it just mean that the compiler is one bit more likely to turn something into an externalized int, or would it break Prolog-level compatibility? I ask because it may in fact end up being more efficient to do byte-level operations to isolate the tag and the value, rather than bit-level.
Hmm, I suppose it would drop the max stack size on 32-bit from 128MB down to 64MB, which isn’t a small consideration. But if that could be avoided?
I think this is extremely important; there are many server applications for SWI-Prolog on older and cheaper 32-bit single-board computers.
That is indeed the reason. Is there a difference between x>>7 and x>>8? I’d be surprised. I very much doubt there is a way to avoid limiting the stack size as a result. Do keep in mind that using many threads is not an exception for SWI-Prolog programs. Each of them has three stacks!
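For what it’s worth, here is the comparison in sketch form (illustrative only, not the actual macros). Both shifts should compile to a single instruction, so x>>7 vs x>>8 is indeed a wash; the potential difference is on the metadata side, where 8 bits make the tag exactly the low byte:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t word32;

/* 7 metadata bits: isolating them needs a mask; the value needs a shift. */
static inline unsigned meta7(word32 w) { return w & 0x7f; }
static inline uint32_t val7(word32 w)  { return w >> 7; }

/* 8 metadata bits: the metadata is exactly the low byte (a single byte
   load on most targets), at the price of one value bit - hence the
   128MB -> 64MB stack-limit drop on 32-bit noted above. */
static inline unsigned meta8(word32 w) { return (uint8_t)w; }
static inline uint32_t val8(word32 w)  { return w >> 8; }

int main(void)
{
    word32 w7 = (1000u << 7) | 0x15;   /* value 1000, metadata 0x15 */
    word32 w8 = (1000u << 8) | 0x15;
    printf("7-bit: meta=%#x val=%u\n", meta7(w7), val7(w7));
    printf("8-bit: meta=%#x val=%u\n", meta8(w8), val8(w8));
    return 0;
}
```

Whether the byte load actually beats the AND on a given microarchitecture is something only a benchmark can settle.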
Anyway, the crucial thing is to make 64-bit faster. Losing a little on 32-bit is probably not a big issue, as long as it is not too bad.
If for nothing else than a benchmark, having one assembly-based version for the most popular processor architecture (x86?) might be worthwhile.
It’s certainly possible! One of the potential advantages of using assembly is that we wouldn’t be restricted nearly as much by C calling conventions and the like; moreover, it would allow for composability – if each VM instruction is represented by a predefined sequence of assembler instructions, then one could simply concatenate the requisite instruction implementations together to form a callable native predicate implementation.
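As a proof of concept of that concatenation idea, here is a hedged sketch for x86-64 (SysV ABI, POSIX mmap; the byte templates are hand-assembled toys, unrelated to SWI-Prolog’s real VM instructions, and hardened W^X systems would want an mprotect step instead of a writable+executable mapping):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>

/* Each "VM op" is a machine-code fragment keeping its state in rax;
   a "predicate" is compiled by concatenating fragments, then a ret. */
static const uint8_t PROLOGUE[] = { 0x48, 0x89, 0xf8 };  /* mov rax, rdi */
static const uint8_t OP_INC[]   = { 0x48, 0xff, 0xc0 };  /* inc rax      */
static const uint8_t OP_DBL[]   = { 0x48, 0x01, 0xc0 };  /* add rax, rax */
static const uint8_t EPILOGUE[] = { 0xc3 };              /* ret          */

int main(void)
{
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    uint8_t *p = buf;                         /* "compile": just memcpy */
    memcpy(p, PROLOGUE, sizeof PROLOGUE); p += sizeof PROLOGUE;
    memcpy(p, OP_INC,   sizeof OP_INC);   p += sizeof OP_INC;
    memcpy(p, OP_DBL,   sizeof OP_DBL);   p += sizeof OP_DBL;
    memcpy(p, EPILOGUE, sizeof EPILOGUE);

    long (*fn)(long) = (long (*)(long))buf;
    printf("%ld\n", fn(20));                  /* (20 + 1) * 2 = 42 */
    return 0;
}
```

A real implementation would generate the templates from compiled C rather than hand-writing bytes, but the composition step is exactly this concatenation.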
That said, it’s not anything I’ve got planned for the near future; there are too many other bits of the refactor that have to come first. Once that’s done, though, we’ll see.
Are you volunteering to help write the assembly?
I can’t deny (to myself) the excitement of delving into assembly – and I’ve also started briefly looking at the YouTube videos about fasm.
If I can find the time to do some assembly programming, and can get help understanding how to do this well in the x86 environment – i.e. how to make the best use of how the processor, within its hardware architecture, is intended to be programmed and used – then I would be glad to do it.
Dan
Haha, sounds great! I’ll definitely let you know when there’s a good time to start thinking about assembler implementations. In the meantime, if you haven’t seen it before, you might find it interesting to skim through the x64 instruction set manual:
and the x64 optimization manual:
Thank you.
Yes, I noticed the first manual referenced on the fasm site and glanced through it … I then noticed MMX and similar, and was wondering how this could be put to use as well …
I.e., if we remove portability considerations and focus on benchmarking – and on having one SWI version for a broadly used architecture that is really fast.
Dan
I once worked on an all-assembler implementation of Prolog (IBM Prolog/370), which was written by a genius programmer. I’m also a good IBM/370 assembler programmer (or was – it’s been some years), and there was one other expert on the project … nevertheless, it was not easy to maintain the code, and the IBM/370 has a cleaner instruction set than x86 and also has 16 registers.
ARM (M1) is probably a nicer target than x86. (Or RISC-V or PowerPC or just about anything for that matter) But do you want to be maintaining multiple code bases?
I would suggest first finding hotspots and then looking at the assembler that the C compiler generates for them. There are also some additional optimization options that might help, plus link-time optimizations.
My 2-cents’ worth.