Question on profiler data port counts

Just something like this could suffice.

CFLAGS=-DO_PROFILE_EXACT cmake ...

I’ll give it a try for my personal build and see whether I can make it behave poorly. With some experience we can possibly reverse the default.

That works for me. I’m all for stress testing the recent changes.

What I do care about is unnecessary choicepoints. The det/1 directive is a big help with that, but a predicate can be made deterministic just by adding a few cuts; I’d prefer the code to be written so it doesn’t generate choicepoints in the first place. There doesn’t seem to be much in the profiling statistics that helps detect unnecessary choicepoints.
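For illustration, a det/1 declaration at least documents and checks that intent (a minimal sketch; sum_squares/2 is just a made-up example):

:- det(sum_squares/2).     % raises an error if it fails or leaves a choicepoint

sum_squares(List, Sum) :-
    foldl([X,S0,S]>>(S is S0 + X*X), List, 0, Sum).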

In theory, there should be. An unexpected Redo/Fail would suggest a “dangling” choicepoint to me, assuming of course the counts were trustworthy. Separating the expected from the unexpected is the issue.

A trivial example. Given:

p([_]).
p([_,_]).

Profiling with the improved profiler: (p([x,y]),fail) shows p with 1 Call and 1 Exit; (p([x]),fail) shows 1 Call, 1 Exit, 1 Redo and 1 Fail.
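For concreteness, the two profiled queries would look something like:

?- profile((p([x,y]),fail)).
?- profile((p([x]),fail)).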

Whether the profiler is the best tool for this job is another question, but the evidence is there if the counts are accurate.

I’m thinking of situations where a choicepoint is created but the computation succeeds … are you suggesting that the profiling should do a deliberate “fail” to force all dangling choicepoints to do a “redo”, and then look at the profiler output for unexpected Redos?

Choicepoints tend to be invisible until failure causes backtracking and it hits one, so, yes, failure is required to “observe” their effects. If the computation succeeds, I’m not sure why you care if there’s a cp (possible alternative solution) left behind.
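So a crude way to surface leftover choicepoints is to profile the goal with a forced failure appended, something like (my_goal is just a placeholder for whatever you are testing):

?- profile(( my_goal, fail ; true )).

Any dangling choicepoint in my_goal then gets hit by the backtracking and shows up in the Redo/Fail counts.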

But if you are, prolog_current_choice/1 and friends (SWI-Prolog -- Manual) may be of interest.

It has quite a big impact on memory usage and performance for “functional” code, i.e., code that implements a deterministic computation. The choicepoint blocks LCO, and more data remains accessible due to the additional call frame as well as the choicepoint and the data reachable through it.

But no, det/1 is all we have (and => to perform a committed-choice match). Debugging is easy, using the graphical debugger or the * command of the toplevel.

Good “functional” code should as much as possible rely on clause indexing and use cuts as early as possible. When writing the code this is fairly easy, although it would probably help to have the determinism info about built-in and library predicates more readily available. Automatic detection of whether cuts are correctly placed is probably harder. There are surely options for doing so, both using static analysis and using runtime inspection of what the cut actually does.
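To make the contrast concrete, a minimal made-up sketch (max_of/3 is not a real library predicate):

% First-argument indexing cannot distinguish these clauses, so a call such
% as max_of(3, 2, M) succeeds but leaves a choicepoint behind:
max_of(X, Y, X) :- X >= Y.
max_of(X, Y, Y) :- X < Y.

% Committed-choice alternatives: an early cut, or SSU rules using =>
max_of2(X, Y, M) :- X >= Y, !, M = X.
max_of2(_, Y, Y).

max_of3(X, Y, M), X >= Y => M = X.
max_of3(_, Y, M) => M = Y.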


Is there an equivalent for when running something non-interactively? I run some of my tests in “batch” mode.

Is there a way to see what the clause indexing does? When I used vm_list, I didn’t see any of the clause indexing instructions.

+1.

As soon as a predicate has more than one clause, the prospect of more than one solution exists - that’s what “or” in first order logic means. If that’s not the intent, fix it (cuts, clause indexing, reduce to a single clause). (Guards can also prevent multiple solutions, but require a cut to remove the CP.) Note that any “non-deterministic” predicate can be turned into a single-solution variant by following it with a cut.
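For example (a sketch; p/1 stands for any nondeterministic predicate):

first_p(X) :- p(X), !.     % commit to the first solution
% equivalently: first_p(X) :- once(p(X)).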

Clause indexing may be helpful but I’m never sure what the JIT indexer does in any particular case (ref. recent indexing discussion), so I don’t rely on it for removing unwanted choicepoints.

In any case, the profiler is a post-execution analysis tool and not intended to address this issue. But the counts may provide useful evidence when such unneeded choicepoints are actually encountered on backtracking.

The unit tester complains about tests that succeed but leave a choicepoint unless you tell it not to. And yes, handling the * toplevel operator is implemented in Prolog if I recall correctly, so you can do all these things. Note that by intercepting the debugger you can examine the stacks and choicepoints before and after a cut.
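For reference, in plunit that warning is suppressed per test with the nondet option; a small sketch:

:- use_module(library(plunit)).

:- begin_tests(demo).

test(member_leaves_choicepoint, [nondet]) :-    % succeeds with a choicepoint; don't complain
    member(x, [x, y]).

:- end_tests(demo).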

vm_list shows the approach taken in the supervisor instructions. Most predicates use the JITI indexing. That should eventually change, especially for static predicates with fairly low numbers of clauses as the current indexing is designed for predicates with many, fairly uniform clauses.

Determinism guarantees hold for first argument indexing, as well as for JITI hash tables on a single argument. Combined (multiple argument) indexes do not give this guarantee. But I’d rather keep the indexing a moving target, only providing the guarantee that it won’t get significantly worse.

Is it possible to see the first argument information? Because the implementation and I might have different ideas about what “first argument indexing” means …

And it turns out that the information can be output in a nice form (as suggested by @swi in Print last choice point):

print_choicepoint :-
   prolog_current_choice(ChI),
   prolog_choice_attribute(ChI,frame,F),
   prolog_frame_attribute(F,goal,Goal),
   format("Last choice point: ~q~n", [Goal]).

Traditionally it selects on small integers, atoms and functors of compound terms. That always works, and predicates whose first arguments are all one of the above are guaranteed not to leave dummy choicepoints. SWI-Prolog also indexes on floats, strings, big integers and rationals, but as these are larger objects it is possible some clause is considered a candidate that shouldn’t be.
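For instance, a (made-up) table like this falls entirely under that guarantee: the first arguments are distinct atoms, so a call with a bound first argument should commit to a single clause without leaving a dummy choicepoint:

color(red,   warm).
color(blue,  cold).
color(green, cold).

% ?- color(blue, T).   binds T = cold with no choicepoint left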

I was hoping that I could see the equivalent of the WAM switch_on_term, try_me_else, etc. instructions. I suppose it would be too voluminous to show the indexing information in jiti_list/0.

So I spent (too much) time seeing whether accurate counts could be done with a modest-size profiler call tree. Recall that the problem was that exits/fails get lost due to the indirect-recursion optimization which facilitates the construction of small trees. Alternatives explored included adding surrogate placeholders (pointing back to the original call tree node) that could be reclaimed after use (on exit or fail), and some sort of indirect-recursion loop counting mechanism for adjusting fail/exit counts as the recursion unwound.

The first option was overly complicated and the surrogates ended up getting pinned by choicepoints (just like environments) so the saving wasn’t that great. The second option just never worked because I couldn’t find a way to properly identify which call nodes were actually in the recursive loop for treatment during the unwinding process.

So that leaves me back with a modal profiler controlled by an option. I have a prototype, as described below, and would welcome any feedback on whether this is an acceptable path forward.

The prototype supports one of two modes. In the first mode all port counts are accurate, but the size of the call tree can get quite large. In the second mode, the call tree is a more manageable size, but only the call port counts are accurate. In both modes the time spent in each predicate and the callee/caller lists are retained. As is currently the case, this second mode permits long-term profiling without consuming a lot of memory for the call tree.

For the prototype, mode selection is controlled by an all_ports option which is false by default. This corresponds to the second mode above, which is basically equivalent to the current profiler semantics except that Exits and Fails are not even counted (in general they’re wrong), so that saves some overhead in the profiler compared to the current version. With the current post-processing left as is, this is displayed as Exit and Fail counts of 0, and a Redo count of -Call count, i.e., Calls + Redos = 0. So everything but the Call count is “nonsense”, which shouldn’t mislead. You do get Times, Callers, Callees, and Call Counts, so that may suffice in a large majority of use cases.

If the all_ports option is set to true, accurate counts on all ports will be recorded at the expense of more memory needed for the call tree (sometimes significantly more). Accurate counts can provide a more detailed view of where predicates are spending time internally. For example, if the Redo count is significantly higher than the Call count, the predicate is generating alternative solutions, which may or may not be the intention. The ratio of Redos to Calls gives the average number of clauses tried via external backtracking (not due to head unification failures or in-body failures) before success (or ultimate failure). Indeed failure itself may not be an intended consequence.

A current downside of this mode is that indirect recursion manifests itself in additional “Cluster” entries compared to the current model. This is, for the most part, undesirable clutter. Hopefully this can be cleaned up in the post-collection analysis phase at some point.

The all_ports=true mode is more memory intensive, so it’s meant to be used in the debugging phase on manageable subgoals of the overall application, or perhaps when testing an individual library’s exported predicates. In addition to providing “time spent where” information, the profiler can provide an overall view of the target Goal’s execution that may be difficult to discern using the standard debugging tools; at least that’s a helpful use case for me.
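Concretely, selecting the mode in the prototype would look something like this (my_goal is a placeholder; all_ports is the prototype option described above):

?- profile(my_goal, [all_ports(true)]).   % accurate counts on all ports, larger call tree
?- profile(my_goal, []).                  % default: accurate Call counts, times, callers/callees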

Thanks for the investigation. I think the two modes are fine. I think discarding all but the call port in the “classic” mode is a bit overkill. Many of them look pretty much ok. I don’t have the workings exactly in my head, but it looks plausible to figure out which counts are fine and which (probably) are not. If we label the node that we basically attached in the wrong place (just a flag), each walk over the tree that does not pass over that node is correct, no? If that is true we can possibly even introduce an uncertainty measure as the number of times we walked over this wrong edge?

Even if that is all not possible I’d prefer to keep the current state and simply add a warning that counts may be incorrect and how to re-run with correct counts.

Not quite following you. No node is attached in the wrong place. But when you have indirect recursion, the current node is set to the equivalent ancestor, effectively backing up the branch, and any intervening nodes are effectively lost for the purpose of counting Exits and Fails. The only Exit/Fail that will be counted is the one when the recursion starts to unwind at the end of the loop. When Exits/Fails are missed, that will be reflected in an incorrect Redo count, since Redos are now calculated as Exits + Fails - Calls. In many (most, all?) cases this will result in a negative Redo count. I guess that could serve as an uncertainty factor, but I’m not sure it can be depended upon.
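To make that concrete with made-up numbers: if a node in the recursion loop records 10 Calls but only the final unwinding Exit is seen (1 Exit, 0 Fails), the derived count is Redo = Exits + Fails - Calls = 1 + 0 - 10 = -9, hence the negative Redo counts.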

It’s easy enough to leave the counts as is. I was just hoping to achieve two objectives: a) eliminate any misinterpretation of almost correct counts (warning or not), and b) reduce the profiler overhead, since both exit and fail handling require looking back up the ancestor chain to either the node being resumed (exit case) or the choicepoint node (fail case) and that may not come cheap. Reducing overhead improves the accuracy of the timing data for use cases when the port counts “don’t matter”.

But I’ve probably been immersed in this too long, so if you have a better idea, I’d be glad to hear it.

P.S. If you’re interested in the profiling overhead of indirect recursion, try profiling countdown_fd.pl from the current Haskell thread. clpfd drives its propagation through recursive application of attributed variable unification ($wakeup) - the recursion loop length is 9 but it’s not all that deep (maybe 20 for this example). I’ve seen distortion values typically in the 15-20% range. (I think distortion measures the percentage of elapsed time that is spent building the call tree and collecting statistics.)

Another profiler related question while I’m looking at it. On MacOS:

?- time((between(1,1000000,_),member(X,[1,2,3,4,5]),fail)).
% 6,999,999 inferences, 0.682 CPU in 0.682 seconds (100% CPU, 10267150 Lips)
false.

?- profile((between(1,1000000,_),member(X,[1,2,3,4,5]),fail),[time(cpu)]).
false.

?- prolog_profile:prof_statistics(S).
S = summary{accounting:9, nodes:5, samples:99, ticks:803, time:0.8804530000000028}.

?- profile((between(1,1000000,_),member(X,[1,2,3,4,5]),fail),[time(wall)]).
false.

?- prolog_profile:prof_statistics(S).
S = summary{accounting:12, nodes:5, samples:180, ticks:804, time:0.9031078815460205}.

Observations:

  1. Overall cost of profiling this trivial example, based on total execution time, is about 30%. That includes kernel hooks for building the call tree and sampling overhead.

  2. On a lightly loaded system, calculating the thread time (time(cpu)) seems excessive, to the point where almost half the samples are lost (99 vs. 180) compared to time(wall). Since the sample interval is used to calculate ticks, this doesn’t have a major influence on the timing results as displayed, but using a smaller number of samples does reduce the overall accuracy (below some application dependent threshold).

Is 2. an OS dependent issue only affecting MacOS? Can anything be done about it?

More comparative data (from MacOS):

?- time((between(1,1000000,_),statistics(runtime,RT),fail)).
% 1,999,999 inferences, 1.275 CPU in 2.522 seconds (51% CPU, 1569097 Lips)
false.

?- time((between(1,1000000,_),statistics(system_time,RT),fail)).
% 1,999,999 inferences, 1.272 CPU in 2.535 seconds (50% CPU, 1571722 Lips)
false.

?- time((between(1,1000000,_),statistics(real_time,RT),fail)).
% 1,999,999 inferences, 0.198 CPU in 0.199 seconds (100% CPU, 10077085 Lips)
false.

?- time((between(1,1000000,_),statistics(walltime,RT),fail)).
% 1,999,999 inferences, 0.202 CPU in 0.202 seconds (100% CPU, 9911289 Lips)
false.

I think I simply want to keep the “classical” view, possibly with a warning and ideally with an indication of which nodes have incorrect counts. To me it is good enough and so far it never let me down in terms of scalability.

Not collecting. That is done after the sampling completes. It merely samples the time spent in building and updating the call tree.

Mostly. On Linux I get 0.34 sec for CPU and 0.15 for wall time for the tests in your next message. Still, we are talking about 1.2us per fetch of the CPU time and 1k samples per second, so that is pretty much negligible I’d say. On Linux, getting CPU time is purely CPU bound (same CPU and elapsed times). Might also have to be interpreted as MacOS counting 50% as “system time” and the rest as “user time”. I don’t know.