Understanding the output of the profiler

Hi all,

I’m using: SWI-Prolog version 9.2.9

I’m troubleshooting a bottleneck in my application that uses swipl, so I ran the profiler as shown below.

% Profile Goal with the statistical profiler.  get_term_arity/3 is a
% helper (not shown here) that extracts the functor name and arity.
profile_h(Goal) :-
    strip_module(Goal, Module, Pred),
    get_term_arity(Pred, F, A),
    (   current_prolog_flag(optimise, false)
    ->  writeln(user_error, 'WARNING: Profiling non-optimized code can be misleading')
    ;   true
    ),
    format(user_error, 'INFO: Profiling ~p~n', [Module:F/A]),
    profile(Goal, [cumulative(false), top(100)]).

and this is the output I get (trimmed for clarity)

INFO: Profiling eclair:unit_work_orig/3
=====================================================================
Total time: 827.587 seconds
=====================================================================
Predicate                       Box Entries =    Calls+Redos     Time
=====================================================================
retract/1                         1,811,405 =1,469,505+341,900  30.5%
eclair:$metrics__value/6          1,308,047 =1,087,301+220,746  27.6%
eclair:parser_type_fqname/2       1,980,085 =1,980,085+0         5.7%
eclair:parser_get_type/2          4,247,369 =4,247,369+0         1.3%
nb_getval/2                     101,436,385 =101,436,385+0         1.1%
$memberchk/3                      9,942,183 =9,942,183+0         0.8%
eclair:parser_decl_fqname/2         272,729 =  272,729+0         0.8%
eclair:fQualType_getDesugaredType/3 2,819,578 =2,819,578+0         0.7%
retractall/1                        113,964 =  113,964+0         0.7%
term_hash/2                      29,634,269 =29,634,269+0         0.7%
garbage_collect_clauses/0                 1 =        1+0         0.6%
$garbage_collect/1                        7 =        7+0         0.6%
assertz/1                         3,967,256 =3,967,256+0         0.6%
eclair:ast_visitor_has_mode/1    56,679,914 =56,679,914+0         0.5%
eclair:parser_get_redecls/2         991,618 =  991,618+0         0.5%
utilities:matches/3               9,126,278 =8,649,698+476,580   0.5%
eclair:parsed_visitor_enter/2     1,616,883 =  920,024+696,859   0.3%
eclair:matches_approx/5           1,374,712 =1,373,625+1,087     0.3%
eclair:$mcrargpn__arg_expand_data/9 2,951,751 =1,532,630+1,419,121 0.3%
eclair:parser_expr_is_evaluatable/3 126,309 =  126,309+0         0.3%
eclair:parsed_visitor_exit/2      1,908,236 =  920,024+988,212   0.3%
eclair:regexec/2                  2,421,380 =2,421,380+0         0.3%
eclair:$unuslocl__decl/3            101,300 =   91,450+9,850     0.3%
nb_linkval/2                     21,574,876 =21,574,876+0         0.3%
eclair:g_type_selector__matches_aux/2 18,446,744,073,713,445,529 =8,250,221+18,446,744,073,705,195,308 0.3%
eclair:parser_loc_get_nspelling_info/4 3,229,208 =3,229,208+0         0.2%
eclair:parser_token_get_text/2    2,408,468 =2,408,468+0         0.2%
$bags:cleanup_bag/2               3,641,638 =3,626,803+14,835    0.2%
utilities:list_exists/3           1,484,346 =1,484,346+0         0.2%
eclair:has_function_props/10     15,178,388 =15,178,388+0         0.2%
parser:is_node/2                 12,904,999 =12,904,999+0         0.2%
eclair:is_inside_signature_for_any_function_decl/3 1,685,544 =1,685,544+0         0.2%
eclair:parser_fid_has_info/4      4,729,506 =4,729,506+0         0.2%
eclair:parser_type_add_qualifiers/3 697,502 =  697,502+0         0.2%
eclair:fStmt_getSourceRange/2     1,143,695 =1,143,695+0         0.2%
eclair:fTypeLoc_getSourceRange/2    311,247 =  311,247+0         0.2%
eclair:frame_matches/1           19,187,280 =19,187,280+0         0.2%
eclair:parser_file_get_info/3     2,063,583 =2,063,583+0         0.2%
eclair:g_decl_selector__matches_aux/2 18,446,744,073,713,234,421 =4,516,608+18,446,744,073,708,717,813 0.2%
eclair:node_has_handle/2          5,960,516 =5,960,516+0         0.2%
eclair:ast_subtree__is_selected/1 5,204,455 =5,204,455+0         0.2%
utilities:list_exists/2           3,625,976 =3,619,550+6,426     0.2%
eclair:matches_alt/5 18,446,744,073,714,797,408 =5,482,643+18,446,744,073,709,314,765 0.2%

Aside from the top offenders, which are not what interests me in this context, I’m trying to understand the reason behind the huge number of redos reported for e.g. matches_alt.

This is what I gathered so far from interpreting that output, but I may be totally mistaken:

  • It’s not possible that the predicate was actually redone that many times, especially given the very low CPU time percentage, and also because it’s a multi-clause predicate with at most tens of clauses, so clearly it’s not executed that many times;
  • Perhaps the number is really negative, which then prompts the question of why the number of redos would go negative, unless it’s somehow due to overflow.

I did test the latter hypothesis (in a different run, but one showing roughly the same behavior) and this is the result:

eclair:matches_alt/5 18,446,744,073,727,905,772 =20,682,929+18,446,744,073,707,222,843 1.1%
(gdb) print/x 18446744073707222843LL
$4 = 0xffffffffffdc773b
(gdb) print/x ~18446744073707222843LL
$5 = 0x2388c4
(gdb) print/d ~18446744073707222843LL
$6 = 2328772

which seems to confirm the hypothesis that the huge number is just a negative quantity.
Could it be that the redo counter is unsigned, so that subtracting too much from it wraps around like this? From my tests the application works correctly, so perhaps the problem only manifests in the profiler.
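
For completeness, the same check can be done from the SWI-Prolog top level, since Prolog integers are unbounded (the redo count below is taken from the matches_alt/5 line of this run):

?- Redos = 18446744073707222843,
   Signed is Redos - (1 << 64).     % reinterpret the 64-bit pattern as signed
Redos = 18446744073707222843,
Signed = -2328773.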

Thanks

Add ports(true) to the options. That gives correct port counts. Without it, the call ports are fine, but the others vary. The downside is that in some cases of mutual recursion the recorded call tree can become huge. If you are interested primarily in time with some indication of activity, the default is fine. If you want to understand your program, ports(true) is the way to go.
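
Applied to your wrapper above, that is just, e.g. (a sketch; only the option list changes):

profile_h_ports(Goal) :-        % hypothetical variant of profile_h/1
    profile(Goal, [ports(true), cumulative(false), top(100)]).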

Note that there is also coverage/2 to annotate your program behaviour by line. It relies on source code mapping though, and does not work well on programs that use a lot of term_expansion/2 and/or goal_expansion/2.
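
A minimal sketch of using it (my_goal is a placeholder; the annotate(true) option is from library(test_cover), so double-check the option list against your version’s documentation):

?- use_module(library(test_cover)),
   coverage(my_goal, [annotate(true)]).   % writes annotated copies of the covered sources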


Hi Jan,

I did some experiments with ports(true), but as you indicated it has an excessive impact for most of our use cases. coverage/2 might be interesting for finding unused code, though there are more than a few uses of goal_expansion/2 and term_expansion/2, so I’m not sure it would be effective. Thanks for the help.

Can you elaborate? Can you provide some information on memory usage? (See SWI-Prolog -- library(statistics): Get information about resource usage.) Having such data for different profiler options should determine whether the profiler is having a significant impact in your case; there may even be lingering bugs or optimization opportunities that need attention. (Open an issue if you think so.)
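
For instance, something along these lines (a sketch using a few common statistics/2 keys; statistics/0 prints the full report):

?- statistics.                    % full, human-readable resource report
?- statistics(cputime, T),        % CPU time used, in seconds
   statistics(globalused, G),     % bytes in use on the global (term) stack
   statistics(trailused, Tr),     % bytes in use on the trail stack
   statistics(localused, L).      % bytes in use on the local (environment) stack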

I also note that garbage collection (including clause GC) runs very infrequently, suggesting that the automatic mechanisms aren’t reacting the way you would expect if memory resources were being depleted.

Further, the data in the original post indicates that about 30% of the time is spent in retract/1. Assuming the counts for the call port are correct, that’s approximately 30% of 827 seconds for 1.5 million calls, or about 165 microseconds per call. That seems high, or at least worthy of further investigation.

And, as Jan points out, using ports(true) will consume more memory when the program is mutually recursive (but not immediately recursive). Does your program fit this description? (The called/calling details from the call graph can be used to determine this.) If so, running such a program for 827 seconds (~15 minutes) is probably going to run out of memory at some point; ports(true) may only be exacerbating an issue that already exists. In that case, there may well be a use case that runs out of memory even when the profiler isn’t being used at all.

It’s also not clear (to me) whether you’re interested in general timing information that points to broad areas needing attention in a large program with unknown performance characteristics. If so, the default where only the call port count is accurate makes sense and has fairly modest memory requirements.

If you’re more interested in the details at the individual predicate level, then having accurate port counts is useful, e.g., for non-deterministic predicates where clause ordering may be an issue. But in this case, you generally want to work with smaller programs or sub-programs (i.e., not programs that need to run for 15 minutes) where the memory overhead isn’t as big an issue. (This latter case dominates for my particular programs of interest.)


Hi,

Memory usage statistics are a bit tricky due to various factors, but one thing that stood out was definitely the retract/1 figure. While slightly strange, the program is deeply intricate (see here: BUGSENG) and definitely does not run out of memory when not using the profiler, at least on sane inputs. We can see correct port counts on simple examples (where “examples” unfortunately only loosely translates into Prolog input), but it doesn’t seem to scale too well. For now we opted for the call port count, because we deemed it the most effective compromise: when we spot unusually slow performance on certain inputs, it prompts manual digging into the code, given the intrinsic intricacy of the codebase.

So, in general it might be worth looking at this in more detail in the future, but for now the call port count is fine for an initial assessment, followed by reduced examples with ports(true) if we need to dig deeper, where possible.

In any case, thank you both for the helpful insights!

Dynamic predicates are subject to erased clauses that have not yet been reclaimed. There are (still) cases where this leads to serious amounts of not-yet-reclaimed memory, as well as performance loss due to the need to skip erased clauses. If this is what is bothering you, we might be able to improve on that, given good example cases and some thought.

As is, erasing a clause sets its erased generation. The need for clause GC is monitored based on statistics about the number of skipped erased clauses and the memory used by erased clauses. If the need is established, the gc thread is triggered, which will in due time asynchronously clean up the dead clauses.
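
If you want to check whether pending erased clauses are a factor at a given moment, a rough test is to force a clause GC by hand and compare the reports (a sketch; check_erased_clauses is a made-up name):

check_erased_clauses :-
    statistics,                 % report before: note the clause counts and clause GC lines
    garbage_collect_clauses,    % force reclaiming of erased (retracted) clauses now
    statistics.                 % report after: compare the clause and memory figures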

I assume that the issue is most severe when the table has a single key that is asserted and retracted a zillion times (as each query finds an increasing number of erased clauses).
Do you have some idea of what the code should do to minimize this nasty effect?

In many cases a global variable is a better option than a dynamic fact with a single clause. That is just one example though. The details are subtle and some can possibly be resolved with limited effort.
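
For the single-fact counter case the two patterns look like this (a sketch; the predicate and key names are made up):

:- dynamic current_count/1.

% Dynamic-fact version: every update erases a clause that clause GC
% has to reclaim later.
bump_dyn :-
    retract(current_count(C0)),
    C is C0 + 1,
    assertz(current_count(C)).

% Global-variable version: no clauses are created or erased at all.
init_gvar :- nb_setval(current_count, 0).

bump_gvar :-
    nb_getval(current_count, C0),
    C is C0 + 1,
    nb_setval(current_count, C).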

Actually it is not a single clause, but a dynamic predicate holding key-value pairs (with the key composed of several parts). The value for a given key is changed a zillion times and should survive choice points.

If you do not need the logical update semantics, use a dict in a global variable and nb_set_dict/3. Adding a key is a bit harder, but you can add it to the dict and update the global variable. This may work better up to a certain number of keys: the price of a dict operation is O(log N) and that of a dynamic predicate O(1) (provided the hash tables do their job). Unless you have over 100,000 keys I’d give the dict approach a try.
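
A minimal sketch of that pattern (the kv_* names are made up; note that dict keys must be atoms or small integers, so a composite key may first have to be mapped, e.g. with term_to_atom/2):

kv_init :-
    nb_setval(kv_store, kv{}).               % empty dict in a global variable

kv_put(Key, Value) :-
    nb_getval(kv_store, Dict),
    (   get_dict(Key, Dict, _)
    ->  nb_set_dict(Key, Dict, Value)        % destructive update; survives backtracking
    ;   put_dict(Key, Dict, Value, NewDict), % adding a key builds a new dict
        nb_setval(kv_store, NewDict)
    ).

kv_get(Key, Value) :-
    nb_getval(kv_store, Dict),
    get_dict(Key, Dict, Value).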

You could also try a trie :slight_smile:. That too loses the logical update semantics, but adding and updating keys is O(1) (again hashed).
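
Also as a sketch (again made-up names); here compound keys can be used directly:

kv_trie_init :-
    trie_new(Trie),
    nb_setval(kv_trie, Trie).        % keep the trie handle in a global variable

kv_trie_put(Key, Value) :-
    nb_getval(kv_trie, Trie),
    trie_update(Trie, Key, Value).   % insert the key or replace its value

kv_trie_get(Key, Value) :-
    nb_getval(kv_trie, Trie),
    trie_lookup(Trie, Key, Value).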

Apart from losing the logical update semantics, both should be semantically equivalent. The price for adding and retrieving complex values differs.