Question on profiler data port counts

From time to time I use the profiler to get a big-picture view of where my code is spending time. The port count data can sometimes be difficult to interpret; here’s a simple example:

?- profile((between(1,1000,N),p(N),fail)).

where:

p(N) :- N mod 2 =:= 1 -> g1 ; g2.

g1.
g2.

Given the semantics of “If-Then;Else”, p/1 is deterministic, so I would expect p’s Call and Exit port counts to be 1000, and its Redo and Fail port counts to be 0 (with g1 and g2 each called half the time).

However the profiler data shows Redo and Fail port counts of 500. It looks like the Redo/Fails from “If-Then;Else” (not actually a predicate itself since the compiler optimizes it away) get accumulated up to the parent predicate.

I find this mildly confusing with this simple example, but when you have legitimate non-deterministic predicates (multiple clauses leaving choicepoints), making sense of the counts gets quite difficult.

Is there some way of discarding or separating out the port counts due to “If-Then;Else” from the parent predicate? Are there any other similar cases where apparent miscounting occurs?

The profiler counts are indeed a problem. AFAIK, only the Call port count is simple and reliable. Redo is complicated because it is hard to figure out which predicate to increment. Fail and Exit are hard because last call optimization removes intermediate frames. If I remember correctly, Exit is corrected by means of the dynamic call tree maintained by the profiler. I’m afraid I forgot about the status of the other ports.

I thought I could manage to get the port counting correct, and I liked the idea of adding all ports to the profile. That failed. I’m unsure whether it is possible or not. Lacking the time to figure that out, I decided to leave it as is.

If anyone wants to pick it up …


Caveat:

In the specific case I cited, I’d really prefer nothing got incremented. It looks like the ‘c_ifthenelse’ instruction creates a choicepoint that is indistinguishable from a choicepoint created by the parent predicate. So backtracking causes the parent’s Redo port to be incremented. Maybe something could be added to the choicepoint to distinguish between the two cases so the additional increment could be avoided.
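
To illustrate the idea, here is a sketch only: the choicepoint type and profile node stand in for the real VM data structures, and the guard itself is the hypothetical part.

#include <stdio.h>

/* Sketch: skip the Redo tick for in-clause "jump" choicepoints
 * (if-then-else, disjunction) so only real clause alternatives
 * count as box-model Redos. */
typedef enum { CH_JMP, CH_CLAUSE } ch_type;

typedef struct choice
{ ch_type     type;
  const char *prof_node;   /* stand-in for the real profile node */
} choice;

static void
prof_redo(const choice *ch)
{ if ( ch->type == CH_JMP )   /* in-clause control: not a real Redo */
    return;
  printf("Redo ticked for %s\n", ch->prof_node);
}

int
main(void)
{ choice jmp    = { CH_JMP,    "p/1" };
  choice clause = { CH_CLAUSE, "p/1" };

  prof_redo(&jmp);      /* prints nothing */
  prof_redo(&clause);   /* prints: Redo ticked for p/1 */
  return 0;
}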

So, in this case, are the “bogus” Fails due to the unwanted Redos?

Does the LCO issue not arise if debug mode is enabled?

Perhaps the doc could at least be updated with caveats covering the trustworthiness of the profiler data.

Not sure if there are wider ramifications, but I got numbers that make more sense to me if I comment out line 6908 in src/pl-vmi.c:

6908    Profile(profRedo(ch->prof_node));

But it’s still not right; it fixes the parent counts (an improvement, IMO) but leaves incorrect counts for goals in the “if-then-else”. I’ll keep exploring.

So a variation on the fix reported earlier seems to be an improvement IMO and handles the simple test correctly:

Replace line 6908 in src/pl-vmi.c:

6908    Profile(profRedo(ch->prof_node));

with:

   LD->profile.current = ch->prof_node;

or

    if ( unlikely(LD->profile.active) ) LD->profile.current = ch->prof_node;  

The next issue is that the Redo port is not incremented (but should be) when the failure is due to a simple test (filter), e.g. atom/1. So if you run:

?- profile((between(1,1000,N),p7(N,M),fail)).

where:

p7(N,N) :- false.
p7(N,N) :- atom(N).
p7(N,N) :- false.
p7(N,M) :- succ(N,RN), M = RN.

The 2000 Redos due to false/0 are counted, but the 1000 due to the failure of atom/1 are not recorded. atom/1 gets compiled to a single instruction (vs. a call), but at this point I get lost in the WAM machinery (and macros). As far as I can see, the deep backtracking instruction isn’t used, so I’m not sure where it goes.

It also appears that if the profiler is run in debug mode, you get quite different data trees built, so there seem to be a lot of moving parts. I don’t think I can progress much further without a better roadmap.

It is complicated :frowning: Thanks for some first steps. I guess the first step should be a test set that covers all cases (except possibly exception handling, which comes with its own can of worms).

If I recall correctly, there is a notion of deep vs shallow backtracking, but essentially they are the same. So, some might be labelled incorrectly.

I kept poking at this, but never found a fix that stopped the redos due to CH_JMP choicepoints from accumulating on the parent without something else going out of whack. But along the way, a couple of other questions popped up.

Assuming shallow and deep backtracking have their traditional meanings, i.e., shallow backtracking is due to failure of unification in the head, why are the failure macros defined in pl-wam.c as:

#define FRAME_FAILED		GO(deep_backtrack)
#define CLAUSE_FAILED		GO(shallow_backtrack)
#define BODY_FAILED		GO(shallow_backtrack)

Shouldn’t BODY_FAILED go to deep_backtrack?

And shouldn’t all the B_* instructions fail via BODY_FAILED? This isn’t the case for B_UNIFY_VC, but of course it doesn’t really matter since CLAUSE_FAILED is the same as BODY_FAILED.

Possibly. Looking through this stuff to see whether we can reliably detect head unification failure, I saw there was a comment that the difference didn’t matter any more. It is quite possible that some instructions use the “wrong” failure as a result. It also turned out this is not so easy, as inline unification also uses the H_* instructions, so at the VM level they are “head” failures. Duplicating the whole instruction set with a different failure mode seems unattractive. I experimented with a VM register that tracks whether we are in the head or the body. That should work. The price doesn’t seem noticeable.
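
Roughly like this toy version; all names here are hypothetical, with the register set when the body is entered and consulted on unification failure:

#include <stdio.h>

/* Hypothetical head/body VM register: H_* instructions are shared by
 * head matching and inlined body unification, so the instruction alone
 * cannot tell which failure this is; a register set on body entry can. */
typedef enum { PHASE_HEAD, PHASE_BODY } phase_t;

static phase_t phase = PHASE_HEAD;      /* the extra VM register */

static void
enter_body(void)                        /* done when the body starts */
{ phase = PHASE_BODY;
}

static const char *
unify_failure_target(void)              /* failure in an H_* instruction */
{ return phase == PHASE_HEAD ? "shallow_backtrack" : "deep_backtrack";
}

int
main(void)
{ printf("failure in head: %s\n", unify_failure_target());
  enter_body();                         /* inline Var = Term in the body */
  printf("failure in body: %s\n", unify_failure_target());
  return 0;
}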

If you think this solves your problems I can push this branch. I’m not going to take this on board unless there is a clear benefit. If it can fix profiling port counts and may improve indexing later, that is a good reason.

The main issue for me is understanding the design intent. If there’s really no difference between shallow and deep backtracking, why have two? Getting rid of one (presumably shallow) would simplify things and facilitate maintenance and future evolution. If there’s a reason for both, document the circumstances in which each should be used, and try to ensure that’s consistent with actual usage.

Regarding profiling, counting Redos seems to be the most significant problem (for now). I observe that this only occurs in deep_backtrack, so I wonder why BODY_FAILED is considered shallow. Just to confuse the issue, shallow_backtrack can continue to deep_backtrack. So can BODY_FAILED end up in a Redo count update? Under what circumstances?

Also, incrementing a Redo count can cause an increment in the parent’s Redo count as well. Why does that make sense? LCO? Something else?

To be honest, I don’t know what will solve my problem until I’ve solved it. But the current state of design clarity in the implementation isn’t helping much.

That is one way to view it. The other is that the distinction was needed in some earlier version, and a rewrite no longer needs it. Now we can change all the “callers”, or we can just leave it as is. There is some value in keeping visible that these things are different. Over the years, though, it is not uncommon for the “callers” to get polluted, i.e., some get it wrong, and as it doesn’t matter (they do the same thing anyway), the value of keeping the distinction gets lower. I guess that is where we are now. That could be a good moment to make everything uniform. Still, it is possible we will want to reintroduce the distinction.

Consider

p :- q.
q :- r.
r.
r.

?- p,fail.

This should indicate a redo on p, q, and r in the normal box model.

I think the first step is to get all these scenarios explicit and see how they relate to the VM events. Some relevant things the VM does are:

  • Head unification failures (classical shallow backtracking)
  • In-body if->then;else
  • Inlined unification (Var = Term) in the body, using H_* instructions, so they act as head unification failures
  • Inlined body primitives (type tests, Var = Const unification, arithmetic, …); all of these are currently det or semidet
  • Prolog and foreign (det/nondet) predicates
  • Last call optimization

For the latter, we must consider that the profiler creates a dynamic call tree as an explicit tree object. So, in the above it creates a tree p->q->r. One of the things “fixed” with this is that on exit we will get an exit from r and p, but not from q, as LCO has removed that frame. So, the exit of p sees there is still an open child q and will tick an exit for that as well. On a redo we get these events: redo(r), exit(r), exit(p). In the normal box model that would be redo(p), redo(q), redo(r), exit(r), exit(q), exit(p). It seems impossible to generate these events in the right order. If we get the counts correct, that is ok.
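
As a standalone sketch of that exit correction, using the p->q->r tree (names are illustrative; the real code maintains the tree in the profiler):

#include <stdio.h>

typedef struct node
{ const char  *name;
  struct node *parent;
  int          exits;
} node;

/* On a real exit event, walk from the profiler's current node up to the
 * node that actually exited, ticking an Exit for every still-open child
 * whose frame LCO removed. */
static void
prof_exit(node **current, node *exiting)
{ node *n = *current;

  for(;;)
  { n->exits++;
    if ( n == exiting )
      break;
    n = n->parent;
  }
  *current = exiting->parent;
}

int
main(void)
{ node p = { "p", NULL, 0 };
  node q = { "q", &p,   0 };
  node r = { "r", &q,   0 };
  node *current = &r;           /* after calling p, q (LCO'd away) and r */

  prof_exit(&current, &r);      /* the VM really exits r ... */
  prof_exit(&current, &p);      /* ... then p; q's exit is inferred */

  printf("exits: p=%d q=%d r=%d\n", p.exits, q.exits, r.exits);
  /* prints: exits: p=1 q=1 r=1 */
  return 0;
}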

Ah, yes, I’d forgotten the details of the classical box model; back to the drawing board.

But that model also has the property that a Redo count should only be incremented by a downstream failure. Incrementing Redo by an “if-then” surely violates that rule. So the current Redo count is incorrect (in general), and so is the Fail count, since that’s derived from the Redo count. (As stated in Clocksin & Mellish, there can be at most one Fail per Call.)

Good list, but I don’t think any of these are relevant to the Redo count as defined by the box model? I think you may be falling into the same trap I did.

We should ask the question: what is useful port count information that a profiler can generate? I think what would be useful to me is an accurate count of Calls, Exits (or successes), and Fails. This aligns with the deterministic (1 Call, 1 Exit), semi-deterministic (1 Call, 1 Exit or 1 Fail), and non-deterministic (1 Call, 0 or more Exits, 0 or 1 Fail) terminology. I think the current Call and Exit counts are good, but Redos are just noise. They’re currently necessary because the Fail count is derived from the other counts; if the Redo count is incorrect, so is the Fail count.

I’m wondering if we can flip the script and count Fails (as defined by the box model) instead of Redos (which can then be derived if useful). I think this could be done in the existing backtrack code as follows:

When backtracking occurs, increment the Fail count for each node, starting at the profile.current node and walking the node->parent links back to, but not including, the call node that comes from the choicepoint.

Note that if the current node is the same as the choicepoint node, nothing gets incremented - that’s a failure internal to the clause, and not an “event” that passes through the Fail port. What happens in the box, stays in the box.
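
A self-contained sketch of the rule (struct and function names are made up; the real version would be driven from the backtracking code):

#include <stdio.h>

typedef struct node
{ const char  *name;
  struct node *parent;
  int          fails;
} node;

/* On backtracking, every node from the current profile node up to, but
 * not including, the choicepoint's call node has passed through its
 * Fail port. */
static void
prof_fail(node *current, node *ch_call_node)
{ for(node *n = current; n && n != ch_call_node; n = n->parent)
    n->fails++;
}

int
main(void)
{ node p = { "p", NULL, 0 };
  node q = { "q", &p,   0 };
  node r = { "r", &q,   0 };

  /* r fails back to a choicepoint owned by p: r and q fail, p does not */
  prof_fail(&r, &p);
  printf("fails: p=%d q=%d r=%d\n", p.fails, q.fails, r.fails);
  /* prints: fails: p=0 q=1 r=1 */

  /* failure internal to a clause of p (current == call node):
   * nothing is incremented - what happens in the box, stays in the box */
  prof_fail(&p, &p);
  printf("fails: p=%d\n", p.fails);     /* still 0 */
  return 0;
}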

Haven’t looked at this in any detail, so I’m probably missing something. Is there a hidden flaw?

Agree. That should be fairly easy to avoid, as those are choicepoints that use the “jump” method. A little harder are the “shallow backtracking” redos. We probably want to avoid those as well, although there is some value in them, as they hint that clause indexing is not doing its job. They are, as is, not easily distinguished from trying the next clause due to deep backtracking.

Shared misery :slight_smile: And yes, maybe. On the other hand, VM events are all we have. We can decide what to do with them and can tune the exact events a little. In the end, there are events in the box model that simply do not exist as VM events, as they are optimized away. We don’t want to disable optimization, as that distorts the timing too much.

:+1: smart thinking!

I don’t know. It sounds plausible. Please give it a try!

Note that some calls will be off. VM-defined predicates like atom/1 do not create a node in the call tree. Their time is simply added to the predicate that calls them. If we wish to fix that, we must call profCall() and friends in all of these. I don’t like all that complexity, and because these “predicates” are so fast, the overhead of the profiler skews the timing; only the counts will be relevant. show_coverage/2 with the dir(Dir) option to annotate the sources gives a good picture of the impact of such conditions.

I’ll take that as encouragement. I’m looking to restrict any changes to the backtracking VM pseudo-instructions and the profiling code (pl-prof); hopefully everything else stays the same. For now I’ll also maintain the same external data structures so display is unaffected.

I also don’t plan on doing anything other than use the existing call nodes. I actually think it’s helpful to attribute the VM supported predicates’ time to the calling predicate - less clutter that way and it’s not as if the programmer could do much with them other than see them in the call tree.


I have something that makes sense, at least to me. I’ve replaced the (internal) Redo count with a Fail count. On backtracking, a node can only fail if it’s not the node in the choicepoint you’re backtracking to. So any internal backtracking due to head unification, in-body built-ins (including “if->then;else”), inlined unification, etc. does not count as a Fail. I think this is consistent with the semantics of the Fail port in the box model.

The net effect is that predicates become opaque. The profiler only sees how a predicate interacts with other predicates via the box model ports.

The Redo count is derived from the Call, Exit, and Fail counts. So it should also be consistent with the box model interpretation of the Redo port.
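
For reference, the box-model bookkeeping behind that derivation: every entry into the box eventually leaves it, so over a completed computation Call + Redo = Exit + Fail, giving Redo = Exit + Fail - Call (setting aside exceptions and still-open goals).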

As before, the exit events are out of order, since it takes an actual exit to infer all the exits that were lost due to last call optimization. This part of the implementation is unchanged.

Once the dust finally settled, the code changes external to the profiler are minimal. All the calls to profRedo in the deep_backtrack pseudo-instruction are replaced by a single call to profFail prior to the switch(ch->type) (choicepoint type) on line 6832 of pl-vmi.c. A profFail handler replaces profRedo in pl-prof.*. Mapping of the new counts to the existing semantics is done internally, so the external interface stays the same.
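
Schematically (following the fragments quoted earlier; the exact argument to profFail is a guess on my part):

    Profile(profFail(ch));   /* hypothetical signature: one Fail-port
                                update for the whole backtrack, replacing
                                the per-case profRedo() calls */
    switch(ch->type)
    { /* ... existing choicepoint handling, unchanged ... */
    }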

As I said, it looks good to me on a small selection of test cases as well as on a significant clpBNR global optimization problem (145 predicates, 779 nodes, ~4.5 sec.). I can put together a PR for review and testing. Let me know about any interesting corner cases or questions about the above.


Sounds good! Please make a PR. Lazy way: don’t clean it up so we can see whether it works as expected and if it does we can see what needs to be cleaned.

PR #1125 generated.

Not sure what you mean, but the changes are fairly well structured, so they should not be difficult to modify or remove.


Thanks. Merged. Looks a lot more realistic. Whether it is always correct compared to the box model is still an open question, I guess.

It’s better, but there still appears to be an issue with missing exits due to recursion. So I’m not ready to close the book just yet.


I noticed two runs of the same code came up with different “inferences” counts from time/1. Is there any sampling when using time/1? For example:

$ swipl -O countdown3.pl 
Welcome to SWI-Prolog (threaded, 64 bits, version 9.1.4-56-g119163351-DIRTY)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- time((solve([1,3,5,10,25,50], 999, Expr), fail; true)).
% 1,630,940 inferences, 0.688 CPU in 0.689 seconds (100% CPU, 2369411 Lips)
true.

?- time((solve([1,3,5,10,25,50], 999, Expr), fail; true)).
% 1,630,939 inferences, 0.655 CPU in 0.655 seconds (100% CPU, 2491127 Lips)
true.
$ swipl -O countdown3.pl 
Welcome to SWI-Prolog (threaded, 64 bits, version 9.1.4-56-g119163351-DIRTY)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit https://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- time((solve([1,3,5,10,25,50], 999, Expr), fail; true)).
% 1,657,368 inferences, 0.664 CPU in 0.665 seconds (100% CPU, 2495632 Lips)
true.

?- time((solve([1,3,5,10,25,50], 999, Expr), fail; true)).
% 1,630,939 inferences, 0.669 CPU in 0.670 seconds (100% CPU, 2436945 Lips)
true.