Memory usage

Hi,
how do I print information on memory usage during execution?
I have a program that makes heavy use of multithreading and surprisingly exhausts 750Gb of memory, even though it does not use asserts and retracts.

Fabrizio

The thread monitor is as good as it gets for monitoring stack usage. I have a little utility in my init.pl to start it.

tmon :-
    call(prolog_ide(thread_monitor)).

There are many utility predicates that allow examining memory usage. Someone should write a little utility that prints the “top 10”, be it large stacks, a huge dynamic predicate, a huge table, a predicate with many tables, etc.
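For example, a minimal sketch of what such a report could look like for predicates, assuming predicate_property/2 reports size(Bytes) as described below (top_predicates/1 is just an ad-hoc name):

top_predicates(N) :-
    findall(Bytes-(M:Name/Arity),
            ( current_predicate(M:Name/Arity),
              functor(Head, Name, Arity),
              \+ predicate_property(M:Head, imported_from(_)),
              predicate_property(M:Head, size(Bytes))
            ),
            Pairs),
    sort(1, @>=, Pairs, Sorted),
    length(Prefix, N),
    (   append(Prefix, _, Sorted)
    ->  Top = Prefix
    ;   Top = Sorted
    ),
    forall(member(Bytes-PI, Top),
           format("~t~D~15| bytes  ~q~n", [Bytes, PI])).

Calling top_predicates(10) then prints the ten largest predicates; the same pattern could be repeated for modules, tables, etc.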

If you like a low level approach, Linux has heaptrack. That will relate the memory usage to C stack traces and should give a hint.


Thanks. The problem is that the process is executed non-interactively on an HPC machine.

Are they the ones described at SWI-Prolog -- Memory Management ?

That makes it a little more complicated. library(statistics) does the data collection, so you can also use that and print the data.
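For example, a minimal sketch that prints a few statistics/2 keys to user_error (report_memory/0 is just an ad-hoc name; the catch/3 skips keys a particular installation may not support):

report_memory :-
    forall(member(Key, [globalused, localused, trailused,
                        table_space_used, atoms, clauses]),
           (   catch(statistics(Key, Value), _, fail)
           ->  format(user_error, "~w: ~D~n", [Key, Value])
           ;   true
           )).

You can call this periodically from a monitoring thread or at selected points in the program.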

I doubt it. AFAIK that is a more general section on how stacks and some other memory areas are managed. Most objects have an xyz_property(size(Bytes)). This holds for clauses, predicates, modules, tries, etc. Tabling has its own set of diagnostics that is currently in my personal library at GitHub - JanWielemaker/my-prolog-lib: Personal SWI-Prolog utilities library. There are probably some more utilities there.
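For example (answer values omitted, they depend on the system), assuming the size(Bytes) property described above:

?- predicate_property(lists:append(_,_,_), size(Bytes)).
?- module_property(lists, size(Bytes)).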

Note that for some workloads the allocator matters. That is why the default is tcmalloc on Linux when available.

I found out that I compiled swipl with malloc. Does malloc leak memory?

That is more a matter of definition. Let us say that there are workloads under which the default “ptmalloc”, which is part of glibc under Linux, keeps accumulating memory that has already been freed by the application. So, theoretically that memory should be available to the application again, but in practice this does not happen (fast enough?). “tcmalloc” seems less sensitive to this problem. This notably affects web servers and SWISH, but as I do not really know what is going wrong, I’m not sure which workloads are involved.

That is what I meant.
So I’m running the experiment again with 450Gb of RAM available.
The last statistics call printed

% Started at Mon Aug  5 16:26:10 2024
% 85443.414 seconds cpu time for 3,010,589,621 inferences
% 69,639 atoms, 4,369 functors, 3,270 predicates, 38 modules, 53,284,829 VM-codes
%
%                     Limit   Allocated      In use
% Local  stack:           -       20 Kb    1,472  b
% Global stack:           -      256 Mb      136 Mb
% Trail  stack:           -       64 Mb      864  b
%        Total:      450 Gb      320 Mb      136 Mb
%
% 848 garbage collections gained 180,618,933,824 bytes in 81.219 seconds.
% 1 atom garbage collections gained 343 atoms in 0.158 seconds.
% 2 clause garbage collections gained 133 clauses in 0.000 seconds.
% Stack shifts: 0 local, 3 global, 11 trail in 0.061 seconds
% 98 threads, 0 finished threads used 0.000 seconds

which means that memory usage is small. However, an external node monitor tells me that only around 200Gb of RAM are available out of 450Gb. I guess there is memory leakage…
I will try to compile swipl with tcmalloc and see if it gets better.

Note that this does not include memory for the stacks of the threads nor memory in tables (tries). One of these is most likely where the memory is.
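A rough sketch for the thread-stack part, assuming thread_statistics/3 can report the *used keys for other threads (total_thread_stacks/1 is an ad-hoc name; threads that cannot be queried are simply skipped):

total_thread_stacks(Bytes) :-
    findall(B,
            ( thread_property(Thread, status(running)),
              member(Key, [localused, globalused, trailused]),
              catch(thread_statistics(Thread, Key, B), _, fail)
            ),
            Bs),
    sum_list(Bs, Bytes).

If tabling were involved, statistics(table_space_used, Bytes) should report the space used by the answer tries.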

So I printed the stack statistics with

  statistics(stack,S),
  format(user_error,"stack=~D\n",[S]),

I get around 33Gb but the memory on the machine is exhausted (500 Gb).
I also print the heap with

  statistics(heapused,H),
  writeln(user_error,heapused=H).

and I always get 0. In the end the process is killed by signal 9.
This is the output of /usr/bin/time

Command terminated by signal 9
        Command being timed: "/hpc/home/rzf/bin/swipl --stack-limit=450g inst2ranks_oi100.pl"
        User time (seconds): 185394.21
        System time (seconds): 646.28
        Percent of CPU this job got: 3162%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38:02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 522110568
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 128744
        Minor (reclaiming a frame) page faults: 506542389
        Voluntary context switches: 154610
        Involuntary context switches: 19907991
        Swaps: 0
        File system inputs: 8134000
        File system outputs: 41304
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Is this due to malloc? I don’t use tabling, tries, or asserts.

I would expect similar results with tcmalloc if the code doesn’t use threads, and somewhat more usage if it does. tcmalloc tries to avoid inter-thread locking issues by associating memory with the thread where it is allocated. It was originally designed for multi-threaded web servers that heavily use callbacks: memory is often allocated in one thread and then freed in a callback, which can run in another thread (although it is more complicated than that, and an attempt is made to run the callback in the same thread).

Thanks. That is precisely what SWI-Prolog does when it comes to atoms and clauses that are being garbage collected. They are allocated by some thread that creates the atom or asserts the clause and always released by the gc thread.

For @friguzzi , I’d recommend trying heaptrack. The data collection does not use a GUI; only the analysis of the result does. You can probably run that after copying the data file to your desktop. That should give an indication of where the majority of the allocation happens, and when the amount of memory reported by heaptrack is much less than what Linux tells you, you are hit by the ptmalloc issues.

I ran heaptrack.
These are the main results:

MOST CALLS TO ALLOCATION FUNCTIONS
65267025 calls to allocation functions with 1.08M peak consumption from
tmp_malloc
  at /hpc/home/rzf/swipl-devel/src/pl-alloc.c:1364
  in /hpc/home/rzf/lib/swipl/lib/x86_64-linux/libswipl.so.9
....
PEAK MEMORY CONSUMERS
120.58G peak memory consumed over 5467 calls from
PL_malloc_atomic_unmanaged
...
MOST TEMPORARY ALLOCATIONS
59259533 temporary allocations of 65267025 allocations in total (90.80%) from
tmp_malloc
  at /hpc/home/rzf/swipl-devel/src/pl-alloc.c:1364
  in /hpc/home/rzf/lib/swipl/lib/x86_64-linux/libswipl.so.9
...
total runtime: 459.41s.
calls to allocation functions: 69634372 (151574/s)
temporary memory allocations: 59784781 (130134/s)
peak heap memory consumption: 121.09G
peak RSS (including heaptrack overhead): 151.37G
total memory leaked: 121.09G
suppressed leaks: 1.79K

It seems there are massive memory leaks…

No. In the normal production version, SWI-Prolog does not try to clean up memory on halt; this only happens when compiled as a debug version. As there can be some nasty dependencies on data behind blobs, this is a tedious and not 100% safe process, so we normally leave it to the OS.

Do you have deeper stack traces? These are just the two main allocation functions :frowning:

Here they are lp4kg/out1.txt at 0b86684073b693b2afa578ee0a1a47b0d906a88c · friguzzi/lp4kg · GitHub

I have to add that this analysis was done while the process was still running, not at halt, and that the stack statistics were always around 32G.

I can’t make much of the stack traces. The most expensive one seems to be about 0.5Gb, contributed by the compiler for loading the code. I can’t find anything really big. Possibly you ran it for too short a time? Anyway, you can also identify the suspect stack traces yourself. If you cannot guess what is happening there from the functions involved, I’m glad to answer.

P.s. tmp_malloc() is a function that allocates large chunks outside malloc/tcmalloc using virtual-memory primitives. This memory can actually be returned to the OS immediately when done. It is used for the Prolog stacks and for big temporary allocations, e.g., the findall/3 data buffering.

To understand this better I ran the trace for the whole execution and used the GUI version of heaptrack.
I’m not an expert, but I caught this


Does it make sense?
I also added to the repo the raw trace that can be analyzed with heaptrack_gui:

in case @jan wants to have a look.

Unless I’m very mistaken, it seems the memory goes into message queues. If we are still talking concurrent_maplist/2, note that it creates all goals up front and sends them into a message queue. This implies all goals are copied. If there is a lot of data in the goal arguments, this gets pretty expensive (and slow).

Possibly concurrent_and/2 or concurrent_forall/2 are more appropriate? Note that both of these work with a generator that pushes new jobs into a queue of limited size. At least in theory we could also rewrite concurrent_maplist/2-4 to use this principle. If someone likes to do this, please make a PR.
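To illustrate the difference, a minimal sketch with made-up placeholders (big_input/1 enumerates the work items, process/1 handles one item):

:- use_module(library(thread)).

% concurrent_maplist/2 creates all goals up front and copies each of
% them, arguments included, into the message queue:
run_maplist(Inputs) :-
    concurrent_maplist(process, Inputs).

% concurrent_forall/2 works from a generator and keeps only a bounded
% queue of pending jobs, so far less data is copied at any one time:
run_forall :-
    concurrent_forall(big_input(X), process(X)).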

Yes, there is a lot of data in the goal arguments. I will rewrite the code to avoid this, thanks for the tip.

Using concurrent_forall/2 and dynamic Prolog clauses for the rules of the theory to check, I got two big improvements:

  1. it now takes about 1/4 of the time;
  2. it does not exhaust memory.

Here is the program:
lp4kg/wn18rr/inst2ranks_oi100_rules.pl at c34925f459537756295ff5415e1b8643ead5d010 · friguzzi/lp4kg · GitHub
I didn’t use concurrent_and/2 because I don’t want the instantiations of the goal; I just want the goal to be executed for each solution of the generator.