Profiling Prolog code

I’m working on a dashboard to provide as much insight as possible into the running Prolog program. This started as a toolkit to get insight into time and memory usage for a tabled program developed at CMU that I ported to SWI-Prolog (branch swi). The work in progress is in webstat.

Most of this dashboard tool is a lot of JavaScript, but now that I have the basics for charts, graphs and tables it gets easier :slight_smile: If you want to help, please step in. Eventually this tool should integrate all the inspection tools of SWI-Prolog into one easy-to-understand interface.

The current challenge is the execution profiler. The low-level stuff is older than SWI-Prolog’s recorded history, i.e., before 1992. The details have changed a couple of times, but it currently reflects the gprof way of collecting data, with some extensions. The current graphical view is also based on gprof.

The problem is that this way of collecting the execution context deals very poorly with mutual recursion and meta-calling predicates. The profiler has a hack that removes (some) meta-calling predicates from the graph, adding their cost to the predicate that calls them (see noprofile/1). Analyzing the state, I discovered that the way mutual recursion is handled is broken. Since we have delimited continuations, things got a lot worse. In short, it is time to look for something new.

Starting at Google Scholar, I quickly discovered there is a whole lot of relevant literature. So far I have selected:

  • “Comparative performance analysis of calling context profiling data
    structures”, Prita Yadav and Paramvir Singh. DOI: 10.1007/s13198-016-0447-x
  • “Exploiting hardware performance counters with flow and context
    sensitive profiling”, Glenn Ammons, Thomas Ball and James R. Larus.
    DOI: 10.1145/258916.258924
  • “Incremental call-path profiling”, Andrew R. Bernat and Barton P. Miller

For Prolog we have an additional challenge to deal with non-determinism.

Currently I hook into the VM, which tells me

  • When a predicate is called. That is reliable (and the only simple and reliable hook). It creates a node in the dynamic call graph and stores a reference to this node in the environment frame it creates for the call.
  • When we get an exit. We do know whether this is deterministic or not, but I currently do not use that. We know which call graph node exited (from the environment). There may be nodes missing due to last-call optimization. Currently I walk up the dynamic call graph until I find the target node and exit all the intermediate nodes. I fear this is dubious under certain abstractions of the call graph.
  • On a failure we can find the frame (and thus call graph node) where we resume. I tried to translate that into redo/fail counts, but this seems harder than I expected.

Note that we want something at the VM level that doesn’t interfere with normal optimized execution. If we do interfere with that, counting becomes much simpler, but large programs can no longer be handled and timing will be very poor. We also want to be able to enable/disable recording on the fly, such that we can record a running program for a while.

Timing is currently based on sampling. We could also consider using performance timers at the various ports. I don’t know how portable that is or how much impact it has. It would be great to have.

Does anyone have good leads or ideas on how to handle this? Ideally I’d like to end up with something similar to kcachegrind, with additional relevant annotations for Prolog (determinism and tabling come to mind).

If anyone is looking for a project to contribute, please step in. There is a lot to do to find and specify the required accounting data and algorithms. Implementing in src/pl-prof.c is pretty much isolated from the rest of the system. Next there is the challenge to write (Prolog) routines to nicely present the results in the web interface.


I don’t have any technical suggestions, but I can affirm that the current profiler is indispensable for me. My one complaint is that the time is very inconsistent from run to run, and lately I see a big difference between elapsed time and Time With Children on the top-level predicate.

I am not sure if this will help, but it may.

Roaring bitmaps are an extremely fast data structure for storing sets of integers. You can compute set intersection, union and difference very quickly; performance is measured in CPU cycles per value, much faster than hash tables. This could be very useful for profiling data, but also for other cases. You can find the C library here: CRoaring.

It is used by indexing engines like Lucene, and also at eBay. A nice talk by the author is here:

EDIT: Just so you have an idea of the performance, here are the costs of taking the union of sets of data from the 1881 census:

format            cpu cycles per value
------            --------------------
hash set                  200
bitsets                    30
arrays                      6
roaring bitmap              1

I know webstat is a work in progress, but it is looking very nice!

Thank you for this work!