Hi,
how do I print information on memory usage during execution?
I have a program that makes heavy use of multithreading and surprisingly exhausts 750Gb of memory, even though it does not use asserts and retracts.
Fabrizio
The thread monitor is as good as it gets for monitoring stack usage. I have a little utility in my init.pl to start it.
tmon :-
call(prolog_ide(thread_monitor)).
There are many utility predicates that allow examining memory usage. Someone should write a little utility that prints the “top 10”, be it large stacks, a huge dynamic predicate, a huge table, a predicate with many tables, etc.
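Such a “top 10” does not exist as a library predicate yet; as a rough sketch of what it could look like, the following lists the N largest predicates using the `size(Bytes)` property of predicate_property/2 (the predicate name `top_predicates/1` and the output layout are made up for illustration):

```prolog
% Sketch: print the N largest predicates by memory use.
% Relies on SWI-Prolog's predicate_property/2 size(Bytes) property
% and the autoloaded pi_head/2 from library(prolog_code).
top_predicates(N) :-
    findall(Bytes-PI,
            ( predicate_property(M:Head, size(Bytes)),
              pi_head(PI, M:Head)
            ),
            Pairs),
    sort(0, @>=, Pairs, Sorted),        % descending, keep duplicates
    length(Sorted, Len),
    Top is min(N, Len),
    length(Shown, Top),
    append(Shown, _, Sorted),
    forall(member(Bytes-PI, Shown),
           format("~w ~t~40|~D bytes~n", [PI, Bytes])).
```

A similar loop over tries, tables, or modules could cover the other object types mentioned above.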
If you like a low-level approach, Linux has heaptrack. That will relate the memory usage to C stack traces and should give a hint.
Thanks. The problem is that the process is executed non-interactively on a HPC machine.
Are these the ones described at SWI-Prolog -- Memory Management?
That makes it a little more complicated. library(statistics) does the data collection, so you can also use it directly and print the data yourself.
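For a non-interactive HPC run, one option is a detached background thread that periodically dumps the stack figures to stderr. This is a sketch, not an existing utility; the predicate names `report_memory/1` and `report_loop/1` are made up, and the statistics/2 keys used are the standard SWI-Prolog ones:

```prolog
% Sketch: periodically print memory statistics from a batch job.
report_memory(Interval) :-
    thread_create(report_loop(Interval), _, [detached(true)]).

report_loop(Interval) :-
    statistics(stack, Stack),          % total stack memory
    statistics(globalused, Global),    % global stack in use
    statistics(trailused, Trail),      % trail stack in use
    format(user_error, "stack=~D global=~D trail=~D~n",
           [Stack, Global, Trail]),
    sleep(Interval),
    report_loop(Interval).
```

Calling `report_memory(60)` at startup then logs a line per minute without requiring any interaction.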
I doubt it. AFAIK that is a more general section on how stacks and some other memory areas are managed. Most objects have an xyz_property(size(Bytes)). This holds for clauses, predicates, modules, tries, etc. Tabling has its own set of diagnostics that is currently in my personal library at GitHub - JanWielemaker/my-prolog-lib: Personal SWI-Prolog utilities library. There are probably some more utilities there.
Note that for some workloads the allocator matters. That is why the default is tcmalloc on Linux when available.
I found out that I compiled swipl with malloc. Does malloc leak memory?
That is more a matter of definition. Let us say that there are workloads under which the default “ptmalloc” that is part of glibc on Linux keeps holding on to memory that has been freed by the application. So, theoretically it should be available to the application again, but in practice this does not happen (fast enough?). “tcmalloc” seems less sensitive to this problem. This notably affects web servers and SWISH, but as I do not really know what is going wrong, I’m not sure which workloads are affected.
That is what I meant.
So I’m running the experiment again with 450Gb of RAM available.
The last statistics call printed
% Started at Mon Aug 5 16:26:10 2024
% 85443.414 seconds cpu time for 3,010,589,621 inferences
% 69,639 atoms, 4,369 functors, 3,270 predicates, 38 modules, 53,284,829 VM-codes
%
% Limit Allocated In use
% Local stack: - 20 Kb 1,472 b
% Global stack: - 256 Mb 136 Mb
% Trail stack: - 64 Mb 864 b
% Total: 450 Gb 320 Mb 136 Mb
%
% 848 garbage collections gained 180,618,933,824 bytes in 81.219 seconds.
% 1 atom garbage collections gained 343 atoms in 0.158 seconds.
% 2 clause garbage collections gained 133 clauses in 0.000 seconds.
% Stack shifts: 0 local, 3 global, 11 trail in 0.061 seconds
% 98 threads, 0 finished threads used 0.000 seconds
which means that memory usage is small. However, an external node monitor tells me that only around 200Gb of RAM are available out of 450Gb. I guess there is memory leakage…
I will try to compile swipl with tcmalloc and see if it gets better.
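For reference, switching the allocator is a build-time choice. A sketch of the rebuild, assuming the `USE_TCMALLOC` CMake option of swipl-devel and that the gperftools/libtcmalloc development files are installed (paths are placeholders):

```shell
# Sketch: rebuild SWI-Prolog with tcmalloc enabled.
cd swipl-devel/build
cmake -DUSE_TCMALLOC=ON ..
make -j$(nproc)
make install
```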
Note that this does not include memory for the stacks of the threads nor memory in tables (tries). One of these is most likely where the memory is.
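To account for the per-thread stacks as well, one can sum over all running threads with thread_statistics/3, which accepts the same stack-related keys as statistics/2. A sketch (the predicate name `total_thread_stacks/1` is made up):

```prolog
% Sketch: total stack memory summed over all running threads.
% The catch/3 guards against threads that terminate mid-enumeration.
total_thread_stacks(Total) :-
    findall(S,
            ( thread_property(Id, status(running)),
              catch(thread_statistics(Id, stack, S), _, fail)
            ),
            Sizes),
    sum_list(Sizes, Total).
```

With 98 threads, this figure can differ substantially from what statistics/0 reports for the main thread alone.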
So I printed the stack statistics with
statistics(stack,S),
format(user_error,"stack=~D\n",[S]),
I get around 33Gb but the memory on the machine is exhausted (500 Gb).
I also print the heap with
statistics(heapused,H),
writeln(user_error,heapused=H).
and I always get 0. In the end the process is killed by signal 9 (SIGKILL).
This is the output of /usr/bin/time
Command terminated by signal 9
Command being timed: "/hpc/home/rzf/bin/swipl --stack-limit=450g inst2ranks_oi100.pl"
User time (seconds): 185394.21
System time (seconds): 646.28
Percent of CPU this job got: 3162%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38:02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 522110568
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 128744
Minor (reclaiming a frame) page faults: 506542389
Voluntary context switches: 154610
Involuntary context switches: 19907991
Swaps: 0
File system inputs: 8134000
File system outputs: 41304
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Is this due to malloc? I don’t use tabling, tries, asserts.
try to compile swipl with tcmalloc and see if it gets better
I would expect similar results with tcmalloc if the code doesn’t use threads, and somewhat higher usage if it does, because tcmalloc tries to avoid inter-thread locking by associating memory with the thread where it was allocated. tcmalloc was originally designed for multi-threaded web servers that heavily use callbacks: memory is often allocated in one thread and then freed in a callback, which can run in another thread (although it’s more complicated than that, and an attempt is made to run the callback in the same thread).
tcmalloc was originally designed for multi-threaded webservers that heavily use callbacks and memory is often allocated in a thread, then freed in the callback, which can be in another thread
Thanks. That is precisely what SWI-Prolog does when it comes to atoms and clauses that are being garbage collected. They are allocated by some thread that creates the atom or asserts the clause and always released by the gc thread.
For @friguzzi, I’d recommend trying heaptrack. The data collection uses no GUI; it has a GUI for analysing the result. You can probably run that after copying the data file to your desktop. That should give an indication of where the majority of the allocation happens, and when the amount of memory reported by heaptrack is much less than what Linux tells you, you are hit by the ptmalloc issues.
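A sketch of that workflow, assuming heaptrack is installed on both machines (program name, stack limit, and the exact data-file name are placeholders; heaptrack names the file after the traced binary and its pid):

```shell
# Sketch: collect the trace on the HPC node (no GUI needed).
heaptrack swipl --stack-limit=450g program.pl

# Copy the resulting data file to a desktop machine, then analyse it:
heaptrack_print heaptrack.swipl.12345.zst   # text report
heaptrack_gui heaptrack.swipl.12345.zst     # graphical analysis
```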
I ran heaptrack.
These are the main results:
MOST CALLS TO ALLOCATION FUNCTIONS
65267025 calls to allocation functions with 1.08M peak consumption from
tmp_malloc
at /hpc/home/rzf/swipl-devel/src/pl-alloc.c:1364
in /hpc/home/rzf/lib/swipl/lib/x86_64-linux/libswipl.so.9
....
PEAK MEMORY CONSUMERS
120.58G peak memory consumed over 5467 calls from
PL_malloc_atomic_unmanaged
...
MOST TEMPORARY ALLOCATIONS
59259533 temporary allocations of 65267025 allocations in total (90.80%) from
tmp_malloc
at /hpc/home/rzf/swipl-devel/src/pl-alloc.c:1364
in /hpc/home/rzf/lib/swipl/lib/x86_64-linux/libswipl.so.9
...
total runtime: 459.41s.
calls to allocation functions: 69634372 (151574/s)
temporary memory allocations: 59784781 (130134/s)
peak heap memory consumption: 121.09G
peak RSS (including heaptrack overhead): 151.37G
total memory leaked: 121.09G
suppressed leaks: 1.79K
It seems there are massive memory leaks…
It seems there are massive memory leaks…
No. In a normal production build, SWI-Prolog does not try to clean up memory on halt. That only happens when compiled in the debug version. As there can be some nasty dependencies on data behind blobs, this is a tedious and not 100% safe process, so we normally leave it to the OS.
Do you have deeper stack traces? These are just the two main allocation functions
Do you have deeper stack traces? These are just the two main allocation functions
Here they are lp4kg/out1.txt at 0b86684073b693b2afa578ee0a1a47b0d906a88c · friguzzi/lp4kg · GitHub
No. In a normal production build, SWI-Prolog does not try to clean up memory on halt.
I have to add that this analysis was done while the process was still running, not at halt, and that the stack statistics were always 32G.
I can’t make much of the stack traces. The most expensive one seems to be about 0.5Gb, contributed by the compiler for loading the code. I can’t find anything really big. Possibly you ran it for too short a time? Anyway, you can also identify the suspect stack traces yourself. If you cannot guess what is happening there from the functions involved, I’m glad to answer.
P.s. tmp_malloc() is a function that allocates large chunks outside malloc/tcmalloc, using virtual-memory primitives. This memory can be returned to the OS immediately when done. It is used for the Prolog stacks and for big temporary allocations, e.g., the findall/3 data buffering.
To understand better, I enabled the trace for the whole execution and used the GUI version of heaptrack.
I’m not an expert, but I caught this (attachment: heaptrack trace file, ~200Mb, stored via Git LFS). @jan, you may want to have a look.
Unless I’m very mistaken, it seems the memory goes into message queues. If we are still talking about concurrent_maplist/2, note that it creates all goals up front and sends them into a message queue. This implies all goals are copied. If there is a lot of data in the goal arguments, this gets pretty expensive (and slow).
Possibly concurrent_and/2 or concurrent_forall/2 are more appropriate? Note that both of these work with a generator that pushes new jobs into a queue of limited size. At least in theory we could also rewrite concurrent_maplist/2-4 to use this principle. If someone likes to do this, please make a PR.
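The difference can be sketched as follows; `item_id/1` and `process_item/1` are made-up placeholder predicates standing for the application’s job generator and worker:

```prolog
% concurrent_maplist/2 copies every (possibly large) goal into a queue:
%
%     concurrent_maplist(process_item, LargeDataList)
%
% With concurrent_forall/2, the generator pushes jobs into a queue of
% limited size, so only a small identifier is copied between threads
% and each worker fetches its own data:
process_all :-
    concurrent_forall(
        item_id(Id),            % generator: only the small Id is copied
        process_item(Id)).      % worker looks up the data for Id itself
```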
If there is a lot of data in the goal arguments, this gets pretty expensive (and slow).
Yes, there is a lot of data in the goal arguments. I will rewrite the code to avoid this, thanks for the tip.
Using concurrent_forall/2 and using dynamic Prolog clauses for the rules of the theory to check, I got two big improvements: