The profiler builds a data structure that reflects the call tree (with some abstraction to avoid explosion on recursion). That is done from hooks in the VM, so the tree and the counts should be correct, except for problems due to missing events: some predicates are not real predicates but VM instructions, and abstraction and optimization make it hard to get all the counts correct. The timing is added using sampling. Nowadays OSes provide high-resolution timers. I wonder whether using these would be a better alternative now?
Quite big. Just consider a member/2 call on a long list. It will blow up the tree structure, make the sampling prohibitively expensive (it has to walk up the tree) and slow down the data collection. Deep recursion is quite normal in Prolog.
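A hedged illustration (query invented for this discussion): without any abstraction for recursion, profiling a single deep member/2 call would create one call-tree node per recursion level.

?- numlist(1, 1000000, L), profile(member(1000000, L)).

Here member/2 recurses a million times, so an unabstracted tree would need on the order of a million nodes for this one goal alone.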
Not really. I don’t recall the exact history of this abstraction. I think this stuff was the result of studying some literature on execution profiling. “Normal” languages are easier though, as you just have call and exit and typically not even last call optimization. Possibly you can invent some annotation on the nodes when detecting this situation?
I do recall there is quite a bit of literature about the various tradeoffs. I’m open to something else. It must be able to deal with large and complex programs though.
Direct recursion, as in member/2, can be optimized since there are no intervening Exits to lose. The problem is deep (indirect) recursion, and it is quite normal. But recursion depth is usually a function of the (test) input data, and one doesn’t necessarily have to run the profiler on large test data suites to get meaningful results. In many cases you just have to run a piece of the code with enough repetitions to get a useful number of samples. The profiler is just a development tool and the programmer controls how it’s used; available memory is just another constraint.
If memory for profiler call nodes is a significant issue, perhaps using 64 bits for all the counts should be reconsidered. (I think the current profiler call node size is about 100 bytes.)
From my reading of the code, no tree walking is done at sampling time; it just increments a few counters, one being the current node ticks. So I don’t see how the indirect recursion optimization reduces data collection time. In fact, dropping this (faulty) optimization should help because it doesn’t have to walk up the tree on every call to see if indirect recursion applies.
I’m not really looking for anything different; just trying to make the existing profiler a tool that can be trusted. If the indirect recursion optimization produces incorrect counts, it should either be fixed (I can’t see how) or dropped. And, yes, dropping it will reduce the maximum size of problem that can be profiled.
Ok, good to hear it doesn’t walk the stacks any more. It used to in older versions that merely ticked the predicates rather than building a call tree. Also good that direct recursion is fine.
I don’t agree with the observation that profiling is mostly used on (small) test data. I particularly use it on large programs with real data to get some insight on what is going on.
The output from

?- show_coverage(Goal, [dir(cov)]).

is probably more insightful for optimization purposes. It creates annotated source files in a new directory cov, holding line-by-line information on calls, exits and failures. Well, provided the source code information can be produced that also drives the graphical debugger.
We could consider a Prolog flag to choose between correct counts and scalability. Both seem to have their merits.
And yes, the counts should probably be 64 bits. They are on 64 bit hardware.
I typically run it on problems requiring 1000 nodes or so (~100K, 150 predicates, 3000 samples), and memory is definitely not an issue for that. The Prolog execution stack requirements far (and I mean far) exceed the amount of memory required for profiler call nodes.
I’d be interested in seeing how many profiler call nodes get produced for your large profiling tasks, with and without the optimization for indirect recursion. Is it practical to measure that? (It’s pretty easy to make a build without the indirect recursion optimization.)
But I guess my real question is if a problem is too large to profile (as proposed), is it also too large to debug? Without LCO, is there any optimization for recursion in general?
Not sure I see the merit of a profiler with incorrect port counts. Well, maybe if you’re only interested in the sampling info. In any case, much better IMO as a profiling option rather than a flag. (Maybe an include counts option?)
Why? Seems like a memory/speed tradeoff at best. If memory is really more important, I can’t imagine the small difference in speed (even if 10x for the actual count update) makes that much difference overall.
But it’s really not going to make a huge difference - maybe 25% in the size of a call node.
Isn’t the usual reason for sampling not resolution, but the overhead of collecting the statistics? But if the overhead problem can be reduced, high resolution would be nice; I suspect it would have shown that more time was spent in is/2 when I was trying to speed up the “countdown” problem.
I’m curious how you’d reduce the overhead of collecting the statistics? (I’ve written a few profilers, with sampling in a separate thread; and even when writing in assembler, there’s a non-trivial overhead and resulting statistics skew.)
Likewise – and I’m skeptical of the results I get from small programs because they don’t run long enough to get reliable statistics. (Similarly, I wonder if the benchmark programs used in building the PGO version of swipl run long enough.)
Most of my code is pretty deterministic (or has shallow backtracking), so I seldom use port counts – the cumulative time statistics are generally the most useful. If there is a lot of backtracking, the port counts don’t tell me much beyond “there’s a lot of backtracking here”, but I knew that anyway.
These have been real life tests. Many thousands of predicates and nodes and up to hours of CPU time were surely no exception. As is, that is pretty manageable with an expected slow-down of less than 2 times. I guess hard figures are difficult to produce. Surely there are many programs for which not handling indirect recursion is fine and surely there are examples where it is prohibitive.
I don’t understand this. Profiling is no alternative to debugging in my experience. It is about getting a clue where the time is spent. The port counts are of some help in finding that: some things take a lot of time because they are called a lot, which means we should seek to reduce the number of calls. Others have few calls and are slow, which might indicate there is some way to optimize them locally.
And yes, in debug mode with disabled LCO things tend to get really slow. Possibly we should implement something to deal efficiently with direct recursion in debug mode. Still, for profiling I like using real data, while for debugging I prefer something small.
I propose using 64 bits, no? The history is simple: in the old days we used long, and 4 billion inferences was a lot. Then came 64-bit Windows and long got blindly replaced by intptr_t. Some of that was later replaced by something more adequate. Pushed a patch.
Agree, an option is much better.
I don’t know. Probably depends on the OS and possibly the hardware? (User) profile guided optimization is in my experience not always easy. Yes, it gives some clues and often finds real bottlenecks in poorly optimized code. Once the code is fairly good, it gets hard to find things to concentrate on.
I’m unsure, but I think it is not that relevant. The main contribution is probably to get better branch prediction. I don’t know whether it cares about hot/cold code. If it optimizes “cold” code for size rather than performance, we should try to extend the coverage of the PGO benchmark suite. There is surely stuff that is not touched in the PGO benchmarks while there are realistic Prolog programs that depend on it.
I think you missed my point. We’re talking about a flawed optimization that reduces the amount of memory used by the profiler tree. The question is whether that optimization is sufficient to allow profiling to succeed (not exceed available memory) in real-world cases where dropping the optimization (permitting accurate port counts) would fail.
Debugging and profiling are separate and distinct processes, but both consume memory (and time) over and above normal execution. My hypothesis is that disabling the flawed optimization during profiling would consume less memory than turning on debug mode (with the subsequent loss of LCO). So if a program can be debugged, it can be profiled (with accurate port counts).
And let’s put some real numbers on this. Suppose you had a recursion loop of 5 predicates, e.g., p -> q -> r -> s -> t -> p ..., and that loop got to be 1000 deep. That’s 5000 extra nodes, or about 0.5 MB. Unless I’m missing something, I don’t see that as a make-or-break proposition in very many scenarios.
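For concreteness, a minimal sketch of the kind of indirect recursion loop meant here (predicate names and the driver query are just placeholders):

p(0) :- !.
p(N) :- q(N).
q(N) :- r(N).
r(N) :- s(N).
s(N) :- t(N).
t(N) :- N1 is N - 1, p(N1).

Profiling p(1000) drives the p -> q -> r -> s -> t -> p loop 1000 levels deep, which is the ~5000-node scenario above.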
Profiling is not just about where the time is spent, it’s also how. There’s no point optimizing something that’s out of your control. I think port counts are a significant piece of that profiling puzzle. But they need to be trustworthy, and they currently aren’t.
And those are the examples I would like to see, or at least some numbers. If they exist, I suspect they’re very rare. Where it’s absolutely critical, throw more (cheap) memory at the problem. Frankly I’d rather have data I can trust, and crash on a few programs that are too “large” (wide and deep recursion), than have data I can never trust on any size program.
“Small” is a relative term. It doesn’t necessarily mean short in execution time; just put them in a backtracking/fail loop, e.g.:
profile((between(1,10,_),solve([1,3,5,10,25,50], 999, Expr), fail; true)).
And one of the ways of using the profiler is on smaller pieces of large programs. In the context of this discussion, “small” means no excessively deep, indirect recursion. My suspicion is that anything that runs in debug mode is capable of being profiled (and perhaps some that aren’t).
I’d welcome any data, i.e., node counts, from profiling “large” examples. Even better if I can run them myself to get a sense of what the current (flawed) optimization is achieving.
Note that with arithmetic optimization enabled, you won’t see any is/2 data at all, since it’s compiled into VM instructions. So arithmetic evaluation time will appear as part of the parent predicate’s self time.
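For example (a small sketch; assuming optimization is enabled via the optimise flag or the -O command line option):

:- set_prolog_flag(optimise, true).

double(X, Y) :- Y is X * 2.

With optimization on, the multiplication in double/2 is compiled to VM arithmetic instructions, so profiling code that calls double/2 heavily shows no separate is/2 entry; the evaluation time is folded into double/2’s self time.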
Not sure what you’re referring to. I did mention that removing the optimization in question would save some time because it wouldn’t be necessary to look back up the “call stack” looking for indirect recursion on every call.
In my particular case, is/2 couldn’t be optimized (it was operating on a generic structure), so I did see its times in the profile. But I suspect that the times were wrong (based on what happened when I changed the code so that is/2 could be optimized out), because the code was doing a lot of calls to is/2, but each call was very quick (e.g., something like Expr = 1+2, Result is Expr). So, if profiling could have been done without sampling, that would have given me better numbers, I think.
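A hedged sketch of that pattern (names invented), using the same fail-loop style as above: many very cheap is/2 calls whose expression is only known as a term at run time, so the is/2 call itself cannot be compiled away.

eval_many :-
    between(1, 1000000, _),
    Expr = 1 + 2,
    _Result is Expr,
    fail.
eval_many.

?- profile(eval_many).

With sampling, the very short individual is/2 calls make the reported time for is/2 sensitive to where the sample ticks happen to land.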
By this, I was referring to the overhead of capturing statistics using (as @jan says) “high resolution timers”, rather than capturing them by sampling … but I don’t know what changes to the profiler that @jan is thinking of by using “high resolution timers” – would each predicate entry/exit be recorded (and in each context, so that times could be accurately propagated to calling predicates)? [Now that I think about it, I’m wondering how the parents’ times are recorded in recursive situations, without double-counting … is there any documentation on how it’s done? I found Saumya Debray’s paper, but everything except the abstract seems to be paywalled.]
Not sure if there’s some misunderstanding. The profiler tree is built using handlers called from a few of the VM instructions. No sampling is involved in this process. The sampling interrupt handler just increments the total tick count and the tick count for the predicate currently being executed. If this happens inside one of the handlers, the tick is attributed to profiler overhead. Port counts are driven by the handlers so they can be accurate. Times are driven by sampling so it depends. I apologize if you already knew this.
That was my assumption - sampling gets replaced by capturing/accumulating time in the handlers using a high-resolution clock (not really a timer IMO). Things get a bit complicated when considering failures and LCO - where do the actual transitions between predicates occur? But that may be no different from how port counts are done in the current profiler.
Currently all the parent/child times are calculated in post processing of the tree as you mentioned. This can be done because the tree captures the call context (parents/children). I haven’t had to look at the post-processing yet (hope it stays that way).
Lots of good stuff in this thread. A quick look at using the CPU time clock suggests this is far too slow (Linux). Possibly there are alternatives. Anyway, it is not the main topic.
I still don’t buy the link between debugging and profiling overhead. Debug overhead is a lot larger than profiling overhead (at least in the worst case). Preferably you run the debugger on isolated code fragments though. Occasionally, spy and break points that are triggered in debug mode cause so much slowdown or additional memory usage that it does not work. I typically work around that by inserting a call to gtrace/0 at a strategic position and running in normal mode. The downside is that often a lot of the context is lost due to LCO and GC. Tracing around after the trap works fine though.
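A minimal sketch of that workaround (suspicious/1 and do_work/1 are made-up names): the program runs in normal mode, and gtrace/0 starts the graphical tracer only when the interesting situation is reached.

process(Item) :-
    (   suspicious(Item)
    ->  gtrace
    ;   true
    ),
    do_work(Item).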
You want to run the profiler on normal optimized code and, in my case, typically on the whole application rather than some fragment.
Anyway, I don’t care much about the counts, and neither does @peter.ludemann, but @ridgeworks does. I guess the most sensible is to have an option to profile/2. We could even consider a build-time option for now and see how it works. Looks like all that is needed is to skip the code that looks for a match in the call stack.
That is true; I largely work at the library level, both debugging and profiling. However, I would claim that the overheads are low enough that sizeable applications could also be profiled successfully.
If I understood @peter.ludemann’s comment, he didn’t care because he already knew the answer. I largely didn’t care before either (other than the Call count) because I couldn’t explain the numbers I was seeing. Now that they bear some resemblance to reality, the game has changed for me. But a sample of three is hardly compelling evidence one way or the other.
Pending further evidence I don’t mind a build option - it is as simple as you suggest. I was considering my own custom build especially for profiling if there is no consensus.
But be aware that the Exit and Redo counts will be adversely affected by leaving the optimization in the code. (You can now even get negative Redos due to missing Exits.) None of this should matter if you don’t care.
Just something like this could suffice.
CFLAGS=-DO_PROFILE_EXACT cmake ...
I’ll give it a try for my personal build and see whether I can make it behave poorly. With some experience we can possibly reverse the default.
That works for me. I’m all for stress testing the recent changes.
What I do care about is unnecessary choicepoints. The det/1 directive is a big help with that; but a predicate can be made deterministic by adding a few cuts; I’d prefer the code to be written so it doesn’t generate choicepoints in the first place. There doesn’t seem to be much in the profiling statistics that help detect unnecessary choicepoints.

There doesn’t seem to be much in the profiling statistics that help detect unnecessary choicepoints.
In theory, there should be. An unexpected Redo/Fail would suggest a “dangling” choicepoint to me, assuming of course the counts were trustworthy. Separating the expected from the unexpected is the issue.
A trivial example. Given:
p([_]).
p([_,_]).
Profiling with the improved profiler: (p([x,y]),fail) shows p with 1 Call and 1 Exit; (p([x]),fail) shows 1 Call, 1 Exit, 1 Redo and 1 Fail.
Whether the profiler is the best tool for this job is another question, but the evidence is there if the counts are accurate.

An unexpected Redo/Fail would suggest a “dangling” choicepoint to me,
I’m thinking of situations where a choicepoint is created but the computation succeeds … are you suggesting that profiling should do a deliberate “fail” to force all dangling choicepoints to do a “redo”, and then look at the profiler output for unexpected Redos?