Question on profiler data port counts

Having used the two modes more on my own examples, I think this is a mistake. If there are two modes, they should be readily distinguishable from the presentation. I’ll leave it to someone else to figure out how and when to indicate questionable data; I really have run out of ideas.

And don’t forget the API (library(prolog_profile)), where “warnings” and “indications” might be problematic. I also think warnings, indications, and phrases like “as long as there is no indirect recursion” aren’t very user friendly, particularly for new users, but maybe they aren’t the main focus of a profiling tool.

Having said that, I’ll add a third possibility to the now renamed ports option - classic, which will be the default and will result in numbers equivalent to the current presentation. (But I take no responsibility for explaining them.) For the record I am unlikely to ever use classic mode; for the same amount of memory ports(false) provides equivalent useful information (IMO) with slightly lower overhead. And you did say in a previous post:

Moving on:

I should have said sampling, not collecting, as in recording the ticks into the call tree. BTW I think “distortion” is also not a very user friendly term. It suggests, at least to me, that a higher number means the timing and port count data is less accurate. In fact, the only thing affecting accuracy is the number of “useful” samples; a better statistic to display might be that number, e.g., Total samples = sT, Useful samples = sU in t sec;... .

Casual readers may want to skip the rest of this.

Let’s go back to basics. We have a Prolog engine capable of 10 million calls (inferences) per second, give or take. We’re trying to construct a model of what that engine is doing using a nominal sample interval of 5 ms (minimum), so within a nominal interval about 50,000 calls are made on average. That’s roughly equivalent to trying to reconstruct a multi-MHz signal with a 200 Hz sampling rate. Clearly a few samples isn’t going to cut it; depending on code complexity, it’s going to take hundreds if not thousands of samples, i.e., typically on the order of 5 seconds’ worth, minimum.

Now the current handler is also trying to compensate for “jitter” in the sampling period. But what does that actually buy you when sampling a MHz signal at a 200 Hz sampling rate? I would claim nothing: sampling at a slightly different spot in time might cause a different node’s tick count to be incremented, but it’s all going to average out given that a sufficient number of samples are collected.

Making it even more questionable is that the jitter is calculated in milliseconds, so about 5000 calls are made (on average) within the rounding error of the calculation.

What really counts is how many samples are collected, assuming you have some assurance that the samples are relatively evenly distributed over the total profiling time. The latter seems to be guaranteed(?) by the OS, which says timer events will not occur before the set time period is reached, so there’s no real possibility of the samples being concentrated in a small region of the total time.

In terms of analysis the units of a tick are irrelevant as long as all ticks are equal. The other input requirement is the elapsed time of the profiling run, which can be based on either the thread clock or the wall clock. Time spent in each node is then calculated by multiplying the total profiling time by the ratio of the node’s tick count to the total tick count.

So if the tick units can be anything, why not make the tick equal to a sample? And the answer, after actually trying it, seems to be: no reason at all. AFAICT short-circuiting the jitter calculation and treating each sample as a single tick had no significant effect on the profiler data. And things get a whole lot simpler for the signal handler; in fact the handler doesn’t have to be aware of time at all. Now I don’t think I’ve made any bad assumptions in coming to this conclusion, but maybe I’ve missed something.

But I was surprised to see that this had no noticeable effect on the “lost sample” problem I was observing. However, changing the time option continued to have an effect on the number of samples collected per second. I conclude that the difference is not due to “lost” samples, but due to the rate at which the OS triggers the signal, i.e., out of our control. So my hypothesis is that, on MacOS at least, the thread timer interval does not include kernel time (including the overhead incurred by the timer itself in the kernel), but the thread clock does. This seems to be validated by experimenting with the sampling interval, e.g., from 1 ms to 10 ms. In any case I don’t think it matters for the big picture, since the SIGPROF timer is now only used to drive the accumulation of sample data.

With minimal timer overhead in the app itself, it’s interesting to consider the possibility of supporting an option for the timer interval (currently fixed at 5 ms.). Relatively modern hardware and OS’s can support considerably smaller (and obviously, larger) intervals, but I’m not sure when this flexibility would ever be useful.

Interesting thought. It’s been a long day. I have some doubts though. The call tree updates may not be (probably are not?) evenly distributed. To do it fairly we should probably have, for each node, a field for the “admin” ticks and tick that if we are updating the call tree? Then the display would have “update call tree” nodes below many nodes. So, I think (but may be wrong) that the update overhead does “distort” the picture. And surely, the term is not good :frowning:

There is a problem. Windows doesn’t have a timer that follows CPU time, so we must deal with a wall time clock and figure out how much CPU time we used during the wall time slice. That is why we need this stuff. Then of course we may not use it on POSIX systems. But, SIGPROF itimers run on the process CPU clock rather than the thread CPU clock. All we know is that the thread CPU clock runs at the same speed as the process CPU clock if this is the only thread. If not, it runs slower. We have no clue by how much. There could be another thread using 100% CPU while the profiled one is mostly blocked on I/O. Of course, if we are alone, we could simply count ticks. Most of the time we are not though as there is almost always the (atom&clause) GC thread.

The 5ms is based on (my) Linux system. Possibly MacOS needs a different rate or we may need to reduce the overhead. For example, it might be wise to allocate the nodes in chunks. As long as we have no performance timers per thread I fear it remains a bit complicated and hard to tune right given different operating systems and hardware.

P.s. Linux CPU clocks have microsecond resolution.

Well you are sampling at a pretty slow rate, so there’s no guarantee that anything is evenly distributed, but that goes for both node ticks and admin ticks. Certainly more admin ticks will happen for nodes with higher accumulated ticks, but I don’t see how that contributes to any “distortion”, statistically speaking.

I was confusing clocks, as in CLOCK_THREAD_CPUTIME_ID, with signals, as in SIGPROF. But my point was: why does the signal handler have to read any clock? Specifically, why are the samples “weighted” by the time (however calculated) since the last sample was collected? (The time since the last sample has little to do with the current node’s execution time, given that the number of Calls between samples is typically around 50,000.)

If the timer is not based on a thread clock, the only thing you have to guard against is the case when the thread doesn’t run at all. But that could be done relatively simply by just keeping a count of the number of Calls since the last sample: 0 means the thread wasn’t run (blocked?) and no sample should be recorded.

For the final analysis, you do need the thread clock for measuring the total time for the profiler “session”.

I’m even less familiar with Windows than Posix. As far as sampling is concerned, I now don’t see any difference between Windows and Posix when no clock is required for sampling, as described above. If the only clock available is the wall clock, then the total “thread” time can be approximated by scaling the total wall clock time by the ratio of valid samples (where the Call count > 0) to total samples (i.e., including those discarded).

Not sure there needs to be much (sampling) overhead on any OS if the above is correct.

How might this help? (Approx 10 Call’s (inferences) occur within the resolution of the clock.)

It is too easy to get statistics wrong. If there would be no profiling overhead the results are (I think) correct in the sense that if you have it run longer and longer the results get closer and closer to reality. With profiling overhead that is no longer the case because the overhead is not proportional to the time spent in the predicate. So, an idea about the profiling overhead is an indication on how bad you are doing. How (and whether) to tell the user is another matter.

I don’t see that. For example, we may have a concurrent task during half our measurement. During that period, the handler will fire twice as fast because the process CPU clock runs at twice the speed of the wall clock. As is, we nicely count the ticks during this period as half. Getting the CPU time does not cost too much and yes, a high enough resolution is important here, as we are counting 5ms during a normal one-threaded workload, but for example already only 2.5ms when two threads run.

Windows is quite simple: we do not have a CPU clock (unless they added one in later versions; did not check), so we have to do with the wall clock. That is why I added the CPU time measurements in the first place.

Disagree. The only thing that matters is that a sufficient number of “valid” samples is collected. It doesn’t matter if the overhead is 2% or 95%, as long as the overhead samples are “discarded”; what remains is an accurate reflection of where the application code itself was spending its time. If the number of valid samples collected is the same (obviously that takes a lot longer if the overhead is 95%), the results are equivalent, statistically speaking.

Also note that the time spent in the predicate is not proportional to the amount of thread time since the last sample either, given that a normal sample will include thousands of calls. More on this below.

Bottom line, the profiling overhead is only an indication of the profiling overhead, specifically the overhead of maintaining the call tree.

We do need to be precise about terminology, i.e., which clock is driving the signal handler and what is a “concurrent task”. For the current Posix implementation and time(cpu) option, the signal (SIGPROF) is driven by the process clock, and I’m assuming the concurrent task is a thread in the same process. (Other concurrent processes should not affect the operation of the signal handler in this scenario, i.e., it nominally triggers every 5 ms. of process time.)

In this scenario, the firing of the handler is not affected by how many threads are running, i.e., the process clock does not run at twice the speed, it raises the signal every 5 ms. of process time (by definition). But you don’t know how much time was spent specifically in the profiling thread over the last sample period. So, as currently done, you can read the thread clock and determine that, which is fine (but I think unnecessary). However, to then weight the current node according to that value just seems wrong. In all probability, that node did not run for anything near that amount of time (0-5 ms.?). Or equally wrong, the thread did not spend that whole time period doing “accounting”.

(It appears to me that you’re trying to accrue actual time in the call nodes. That is only a reasonable objective if the Call rate is approximately the same or slower than the sampling rate. For the Prolog profiler, the Call rate is several orders of magnitude larger than the sample rate.)

So I go back to my main point: the only thing that matters is the sample, not the time (CPU(?) or otherwise) between samples. And the sample counts are sufficient by themselves to give the timing percentages; only when converted to secs using the total profiling time, do you have to distinguish between thread, process, and wall clocks.

So obviously I disagree with this; resolution isn’t important for sampling because time isn’t. However, for the purpose of validating samples, it is important to detect when the thread didn’t run at all during the sample period.

I wouldn’t claim all this would make a significant difference in the final results for most usage scenarios (small or predictable competition for CPU resources). I do think it’s simpler, requires less sampling overhead, and is more robust (in general).

Another way of tackling the issue is to define a timer using the thread clock, which I believe is possible using the POSIX function timer_create(). Now the signal only fires every, e.g., 5 ms of thread time, so every sample period is the same (barring lost signals) in thread time. In this case all the signal handler would have to do is increment the counts (no clock reading involved) as above (every sample is valid, other than to separate accounting from normal nodes). But if you didn’t agree with my earlier arguments, this may not make sense either.

It seems you at least agree that it isn’t wrong to measure the CPU time. I still think it is right because

  • Suppose there is only one predicate. Now I get the right CPU time and you don’t because the clock runs the wrong speed.
  • Suppose there are two predicates, P running for a second and Q running for a second. During the first second another thread runs concurrently, so I get twice as many interrupts. With CPU time inspection it is right, without P gets 2/3 and Q 1/3.
  • We can go on like this, so “by induction”

I’m pretty sure the ratio of interrupts to calls doesn’t matter for this argument. Regardless of the ratio, and disregarding overhead, the result gets more and more accurate over time, approaching the true time distribution to arbitrary precision.

No, it’s actually wrong, but you get away with it because the errors get spread approximately evenly across all the predicates so the percentages aren’t greatly affected and the actual time presented is based on the end-to-end clock time, not that recorded in the individual nodes. (And to be precise, it’s currently thread time on Posix.)

Not true. I get the right CPU time. The predicate will get 100% of the samples and so 100% of the CPU time (specifically thread time) measured end-to-end.

And what do you mean the clock runs at the “wrong speed”? (See the definition of the CLOCK_THREAD_CPUTIME_ID.)

We get exactly the same number of interrupts/signals - 1 every 5 ms. of process time - because we’re using exactly the same timer.

So what does concurrently mean here? As far as its effect is concerned, I interpret this as meaning half the time the profiling thread will run during a given sample period and half the time it will not. I can detect that without using the clock, e.g., by keeping a count of the calls between samples (OS independent). So during the second sec., only half the samples will be recorded, ending up with the same 2:1 ratio for P:Q that you do. This results in P at 67% and Q at 33%. Converting to seconds of thread clock (total of 1.5 secs) for final presentation, that’s 1 sec for P and 1/2 sec for Q.

??

I think what this means is the more samples collected, the more accurate the result. This much we agree on, but I don’t think it’s relevant to the things we don’t agree on.

I seem to be having a hard time getting my point across, so let me boil it down to a simple model to critique:

The Prolog engine runs around 10 million inferences (Calls) per second; that’s 100 ns per call on average. That means the current call node changes that often.

The sample rate is 200 per second of process time. When a sample is triggered there will be a current node “active”, but there is no indication of when the execution of that node began or when it will end. In most cases the transition to a different current node will be of the order of a microsecond. So I conclude that the only information relevant to the current node is that it got sampled, i.e., we can conclude nothing about the time it took to execute. Attributing elapsed thread time from the last sample to that specific node is wrong. (But because it’s consistently wrong across all nodes, it gives the appearance of being right.)

And that’s why elapsed (thread) clock time has no bearing on the sampling procedure. It is only relevant at measuring the end-to-end time of the entire profiling session. What is wrong with this logic?

clock_gettime() with CLOCK_THREAD_CPUTIME_ID is used to read the CPU time used by the thread to get the actual thread CPU time used since the last sample. The samples are triggered by setitimer() with ITIMER_PROF which, according to the POSIX standard, runs on process CPU time. That should mean that if the process uses two cores it runs at double the speed of the real (wall) clock. I have not actually tested that and I’m not sure whether or not all systems conform. Notably Linux used to consider threads closer to processes than POSIX asks for, although they moved a lot closer over time. Even if this is ok, we still have Windows, whose sampling rate runs on real time, as well as POSIX systems that use process time.

Thus wrong. Maybe you missed the scenario.

  • thread 1, profiled, runs ?- profile((p,q)). p and q are the same, doing some work that takes one
    second CPU.
  • thread 2 runs concurrent, running something (could also be p or q as profile admin is per thread) that runs for 1 second. So, p runs concurrent with some thread and q runs alone in the process.

I would like to see that p and q used the same time.

And I still claim that doesn’t matter. It just distributes the CPU time over the calls that happen to be on top of the stack at the sampling rate. If the thread gets little CPU time since the last sample, the value of these samples must reflect that. The p,q example shows that. I can turn p and q into more complex (but equally shaped) call trees and both should still get the same timing result, regardless of concurrent activity, regardless of whether my samples follow real, process or thread time. Yes, in the latter case we don’t need this. The latter case surely is not something we can expect on all platforms though (if there are even platforms where this works).

If I’m following this discussion correctly, @ridgeworks is saying that port counts are likely to be more meaningful than CPU (or, maybe, port count divided by # of goals in the predicate, to adjust for predicate length).

How hard would it be to run some experiments with shorter sampling intervals? That might give some indication as to which theory is more likely to be correct (both @jan’s and @ridgeworks’s arguments seem valid to me …). I have one moderately complex piece of code that I’ve profiled in the past; but it might not be ideal because it does quite a bit of IO and also calls a subprocess using shell/1.

I think the penny finally dropped - sorry it took so long. The scenario you’re trying to guard against is over-valuing samples when there is a concurrent thread from the same process running on a different core, thus consuming process clock time and effectively resulting in “double clocking” or worse. (Concurrent here means real concurrency, not pseudo concurrency via time slicing or similar.) In this situation you want to under-value the “tick” for the current sample. Hopefully, over time, this under-valuing will be proportionally applied to all predicates being sampled. So nothing to do with timer jitter or trying to time predicates in an absolute sense, as it appeared to me.

I think a thread-based timer would nicely solve the issue and remove any dependency on assumptions about the distribution of samples, but it doesn’t matter if there isn’t reasonably broad support across platforms. (Linux seems to have declared setitimer() obsolete, but it appears to be the only common denominator.)

I don’t think this later discussion has anything to do with port counts - maybe I should have started a new topic, but I didn’t think there were many paying attention at this point.

I actually have run examples with higher sampling rates. It ends up collecting more samples in the same profiling session, but since there were sufficient samples at the lower rate it didn’t really affect the end result. (Not very scientific, I admit.) My gut feel is that it only matters when you’re trying to profile a goal that doesn’t take very long. But I wasn’t trying to run a truly concurrent thread in the same process, which is @jan’s concern (although not mine).

I’m happy to run other tests if you have any suggestions.

I see it is in fact POSIX that declares it obsolete. They recommend timer_settime(), but that doesn’t seem to use signals, so you could only run it in some other thread? Hopefully I’m wrong, but I don’t see how you can use this for profiling except using the rather clumsy way I do it in Windows.

If you plan to add more options anyway, you can also add an option for the sampling rate. The chosen value is rather arbitrary and it is not unlikely that other values are more suitable depending on hardware, OS and application. It will remain hard for the user to decide on a good value though. In my experience the exact values are not that interesting. The most satisfying use is if you have some program that is slow, find the bottleneck and get the code 100 times faster with a few small changes :slight_smile: Tuning to the last little bit takes a lot of time and the result could be different with the next hardware or Prolog version. For example, for a long time using between(1,10,X) with X bound was faster than X>=1,X=<10. It is now the other way around if optimized arithmetic is used, because that stays in the VM rather than calling C.

I’m no expert but I think signals can be used. You associate an event treatment with the timer when you create it using timer_create(); its man page has an example showing how it’s done. However, I think it’s only supported on Linux (and other POSIX-compliant systems) and would have to be emulated on Windows and MacOS (BSD).

Not really sure what the value is, but easy enough to do. Is it more natural to express it as a samples/sec rate (e.g., 200) or a sampling period in ms (e.g., 5)?

Agree. From the perspective of a black box, it can be pretty difficult to figure out what an ill-performing program is doing. But I most often use it for software tuning to figure out where optimizations can be significant, and how they might be accomplished (redesign, restructure, push into C, …) and there are usually tradeoffs involved. Better hardware or Prolog can only improve things, but you can’t depend on them and you have little control over what users have.

+++1 (for all arithmetic)

Ah, now I understand some of your confusion. Yes, on systems that support it, using timer_create(), etc. would avoid the need for these complications. Might be worthwhile. Unfortunately it complicates the code, as we then need three implementations rather than two (although they can still share the complicated part of the profiler) :frowning:

I was wondering about it as well. Definitely not ms, as all timing in SWI-Prolog uses seconds represented as a double. I’m glad about that when you look at most base APIs, which tend to come in seconds (old Unix sleep and alarm), milliseconds, microseconds and nanoseconds. But sample_interval(0.005) is harder to understand than sample_rate(200), I’d say. The docs should probably mention that the value is rounded up to what the OS can do.

Now that I understand the rationale, I’m not particularly concerned. For tuning contained pieces of code, the concurrency issues can be managed so it’s just a matter of extra overhead in the sampling process.

I’ll take that as a vote for sample_rate. I think enforcing the value to be between 1 and 1000 (timer period between 1 ms. and 1 sec.) should be fine given my minimal testing on MacOS.

I would suggest allowing shorter intervals for platforms that support it. Perhaps with a warning that sample rates greater than 1000 might not work on all platforms. What would be the symptoms if the sample rate were set too high?

I picked an upper bound of 1000 because that looks to me to be guaranteed across all OSes. (Windows timer resolution is 1 ms as I understand it.) I also don’t see much point in having a faster sampling rate. Do you have a particular use case in mind?

AFAIK, the “only” bad effect of an arbitrarily high rate is that eventually you spend all your time handling signals and then you start losing them. When this happens I think the profiling session takes an undefined amount of time (approaching infinity) to complete.

My machine shows performance between 2 and 10 MIPS, so a 1ms sampling interval gives 2,000 to 10,000 inferences (calls) per interval. That feels a bit high to me, especially if the program runs for only 1 or 2 seconds (yes, I’ve sometimes wanted to make some things run in sub-second time).

As for “too high” a sampling rate – I was wondering if on some platforms, specifying too short a sampling interval would result in an apparent infinite loop, with no indication as to why. If there’s a “minimum timeout” value available in the platform, that might be worth checking against.

BTW, when (many years ago) I wrote a profiler (in assembler) for a system that had only an elapsed-time timeout and no CPU-time timeout (OS/370), I made the interval adaptive, in an attempt to get the CPU intervals roughly equal. Presumably this same technique would work for Windows.

For reference, the current profiler runs at a fixed rate of 200 samples/sec so that’s around 50,000 calls/sample period, give or take.

But the bottom line is that for a statistical profiler, you have to collect sufficient samples (application dependent) for the results to be acceptably accurate. If the goal/query doesn’t last long enough for that to happen, then it has to be run in a loop, e.g., using between/3, to collect enough samples. Given such a wide discrepancy between the Call rate and the sample rate, pushing up the limit on the sample rate doesn’t change that materially IMO.

To really make a difference, I think you have to push the resolution of the thread clock and the cost of reading it to something close to 10 ns. At that point you can collect the actual time spent in the nodes as part of the kernel hooks and dispense with sampling entirely. But that seems pretty unrealistic at the moment.

In general I’ve not seen that other than for Windows which states that a 0 ms. value means as fast as possible. (Essentially this means all the real work had better be done by the signal handler.) For other platforms, I think it may well behave like an infinite loop (for that thread, maybe process), but I haven’t seen any minimum values published anywhere. I think that means the tool (in this case the Prolog profiler) has to guard against that situation.

I’ve submitted a PR to add the ports and sample_rate options to profile/2 as described previously. This means the sample-rate is constrained to be in the range 1 to 1000, but willing to relax this if need can be justified.

These options are not “sticky”, they are reset at the start of each profiling session.