C stack as environment stack?

I’ve been tossing around a number of thoughts as regards the prolog VM, and there’s one thing that keeps popping up in my head: what if we used the C call stack as the Prolog environment stack? Obviously there are some complications as regards, for example, returning to a stack frame underneath a choicepoint without deallocating it or destroying the choicepoint, but I have ideas for ways that could be addressed.

What are the other pitfalls, though? How would it interact with other parts of prolog not to have a classically-organized environment stack?

What happens when there’s a local stack overflow? I think swipl can either reallocate the local stack or add a new “segment”, but that probably doesn’t work for the C stack. The stack_limit flag only bounds the three stacks combined, not the size of any individual stack (see the SWI-Prolog manual).

Well, I figure you can either have an auto-expanding stack—like if you’re on Linux and you call mmap with the MAP_GROWSDOWN flag—or you give each prolog engine a fixed stack, which will either throw a prolog-level exception when an internal check shows that it has run out, or a SIGSEGV when it gets to the end of the mapped range (which can be caught and turned into a Prolog exception). Basically, the same thing will happen that happens when a native C program runs out of stack.

A Prolog program’s stack is nothing like a C stack … take for instance the following program

my_member(X,[X|_]).
my_member(X,[_|L]):- my_member(X,L).

?- findall(X, between(1, 2_000_000, X), List), my_member(1_000_000,List).

Using a C-like stack, this would take over a million frames.

Prolog does not use the stack in the same way (so it requires only a couple of frames).


What you’re talking about here is last-call (or tail-call) optimization, and that’s a technique common to both Prolog and (some dialects of) C. I can write your example in C as follows, assuming tail-call semantics:

#include <stdbool.h>

struct list {
  int head;
  struct list *tail;
};

bool my_member(int X, struct list *L) {
  if (!L) return false;
  if (X == L->head) return true;
  return my_member(X, L->tail);
}
As long as the compiler is capable of translating that return statement into a tail call, e.g. with Clang’s [[clang::musttail]] statement attribute or GCC’s -foptimize-sibling-calls command line option (which iirc is enabled at -O2 and above), my_member() will never use more than a single stack frame, and will be as efficient as the iterative version.

I’m not concerned about the question of how would I implement the Prolog stack using the C stack; trust that I have or will have figured that out. That part is, in the words of a former coworker of mine, a “known unknown”: I don’t have all the answers right now, but I know it’s an issue I’d have to address and I have ideas about how I’d work around it. I’m more concerned about the “unknown unknowns” - things that just aren’t on my radar, like parts of the prolog internals that might be attempting to directly examine the environment stack and expect it to be in a certain format.


Yes, many prolog predicates are optimizable in that way … Even with this working, there can be external predicates that you call back and forth to (examples: attr_unify_hook/2, portray/1). You would thus need to inline the whole family of predicates involved in order to create the opportunity to tail call.

Sorry for the late reply. It appears my mail wasn’t functioning properly. Anyway, some of the issues have been raised here. All in all I doubt that it can work. A simpler question is “what would we gain?”

Well, it’s one of the things I’ve been considering for dealing with memory/garbage collection issues. All told, my optimization goals are to attempt to make Prolog code run as much like C code as I can; not because C is an inherently better or superior language (it’s not), but because modern microprocessors have tons of little shortcuts and optimizations that assume code will execute a certain way, and I want to take advantage of that as much as possible.

Take prefetch, for example: there’s no inherent law of Turing machines that says that if one address is accessed, it’s more likely that subsequent addresses will be accessed in the near future. However, the overwhelming majority of code that executes on large regions of data does so by iterating through it sequentially, from low addresses to high, so the hardware takes advantage of it and accelerates that use case. As such, it makes sense to follow the same patterns when writing new code.

So, in terms of “what benefits would we gain from it”: well, reclaiming stack memory becomes a lot easier; a backtrack could simply become a longjmp() plus a trail unwind. Also, we never have to adjust our local environment pointers; once the processor returns to the VM code, the local variables it was accessing will by definition be back in scope, with registers pointing in the right direction. Not to mention, decoupling the local and global stacks means that one growing doesn’t affect the other. (Of course, you then have to figure out how to deal with the pointer-comparison method of determining variable age, if there’s no guaranteed relationship between the addresses - that complication I’m aware of; I’m curious what others there are.)

But really, I’m mainly just leaning into the, “processors expect the stack pointer to jump up and down a lot, and to frequently access addresses near it, especially in combination with CALL and RET instructions”. Not the only avenue I’m pursuing, but one I certainly wanted to get some other input on!


Ooh, good point. I hadn’t been thinking too much about the efficiency of longjmp(), either in terms of space or copying time, simply because we’ve already been using them for exception handling environments. It didn’t occur to me as such that I really do need a better mechanism for control passing, so that’s definitely on the list of things to explore.

This is great, keep it coming!

The Prolog environment stack is a bit different from the C stack though. One key aspect is that on return, a frame is not popped if the predicate succeeded while leaving a choicepoint. The environment stack has two root pointers: the current frame and the current choicepoint. Frames have parents as normal. Choicepoints have a pointer to the older choicepoint and a pointer to the frame that defines the environment that must be restored when backtracking to it. The frame chains of the choicepoints link up at some point, i.e., the environments of any two choicepoints share a common ancestor frame.

Unlike the C stack, the Prolog environment stack can become huge on highly nondeterministic programs. This makes the fixed-location (in virtual memory) allocation used for the C stack rather unattractive.

That said, my eyes are old and I (thus) doubt this can work out. There have been enough cases where new people find improvements in places experienced people stopped looking :slight_smile:

This actually reminds me - I’m sure you’ve considered segmented stacks for storage, so as not to have to shift them quite so much. Is that just because of the pointer-math aspect, or are there other considerations there as well?

I surely considered segmented stacks. I don’t think it is worth the complications. Stack shifts typically take far below 1% of the CPU time and rely on information and a data representation that we need for garbage collection anyway (GC needs a lot more than shifting does). The advantage of contiguous stacks is big, as several operations get easier: deciding which of two variables is older, compacting the stacks, scanning the stacks, deciding whether a choicepoint is more recent than an environment, etc.

The main drawback of contiguous stacks is that the virtual address space may get crowded if there are many threads and mapped memory regions, and we may not be able to find a contiguous region that is big enough. That was an issue on 32-bit systems, notably on older Windows versions that only provided a 2GB address space. On 64-bit systems we should be fine for a while :slight_smile:

Ha, yes - I’ve actually got an mmap implementation that spreads the stacks all over the address space, so they never have to shift at all, but like you said that only works quite so well in a 64-bit (well, 48-bit) address space.

Anyway, the reason I’m looking into it isn’t because the stack shifts themselves are eating much CPU time, but because there’s always a risk of a stack shift - and the resulting need to save/restore the VM registers around every call out to other libswipl code causes a lot of unnecessary register/memory churn. Unnecessary precisely because stack shifts are so rare.

One way around this, on mmap’able systems at least, would be to go a Python-esque “forgiveness-before-permission” route and simply assume the values won’t change, have the GC unmap the old regions whenever there’s a stack shift, and handle the segfault gracefully by getting the new values and resuming execution. That requires signal handling, though, which is never particularly fun, and it doesn’t work on systems where we don’t have some level of control over the hardware memory map.

Alternately, we could go all-in on the register-pinning and pin all three of PC, ARGP, and FR to hardware registers, at which point GC running on that thread could alter the pointer values with the VM being none the wiser. Of course, that’s three hardware registers—four, if you include LD—that are now unusable for any other purpose in the whole of the libswipl codebase. And, for 32-bit x86 systems especially, I think that may in fact be all the available registers. (And, if I’m remembering my computer science courses correctly, not allowing the computer to use any registers could in some cases be a Bad Thing.)

Anyway yeah, that’s why I’m trying to figure out what else can be done with the Prolog stacks :slight_smile:

That, but also recall that Python is single threaded at the OS level. SWI-Prolog is not. So one thread unmapping a stack may cause another thread to remap the same area :frowning: Of course, rather than unmapping, you may remap the area as non-accessible - but then, when can you release it? Older versions did use this (the system was single threaded back then). The portability is a nightmare though. Initially, avoiding the explicit checks paid off with some 10% performance boost. When I decided to remove that stuff, the loss was more like 1 or 2%. All in all, what happens now is:

if ( unlikely(stack_overflow) )
{ rc = enlarge_stacks();
  if ( !rc )
    goto exception;
}

If the CPU is smart enough to prepare the pipeline primarily for the case where there is no stack overflow, the price should be small, no? Maybe valgrind can tell?