Disabling tcmalloc: warning for data science and big data apps

tcmalloc does not release memory properly to the OS for applications that allocate a large chunk of memory and then release it. The problem has been there for a long time.

This starves other applications running on the same machine, causing an out of memory error.

I tried changing TCMALLOC_RELEASE_RATE as described here, but the problem remains.

Is there a way to disable tcmalloc and use the regular malloc?

UPDATE: I rebuilt from scratch with -DUSE_TCMALLOC=OFF, and now swipl releases the memory to the OS properly. I think this is a serious issue, since it can easily kill other applications running on the same system. Allocating and releasing large chunks of memory is quite common for data science and big data applications, which could then die with an out of memory error. The problem goes unnoticed unless several such applications run on the same machine. For this reason, IMHO, I would suggest going back to the regular malloc, as this problem is very hard to track down and most users will not know what is going on. I am now building swipl from scratch with -DUSE_TCMALLOC=OFF and skipping the distro build because of this issue.

You can build your own version with cmake -DUSE_TCMALLOC=OFF.

The version of swipl that I installed from the standard PPA doesn’t use tcmalloc, according to ldd $(type -p swipl).

And I think you can remove tcmalloc from your system (on Ubuntu, I think it’s libtcmalloc-minimal4).
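
The ldd check can be scripted. The following is only a sketch: links_tcmalloc is a made-up helper name, and it only catches a dynamically linked tcmalloc, not one that is statically linked or injected with LD_PRELOAD.

```shell
# links_tcmalloc (hypothetical helper, not part of swipl):
# reads `ldd`-style output on stdin and succeeds if any line names libtcmalloc.
links_tcmalloc() {
    grep -q 'libtcmalloc'
}

# Typical use (assumes swipl is on PATH):
#   ldd "$(command -v swipl)" | links_tcmalloc && echo "tcmalloc is linked in"
```

Feeding the ldd output through stdin keeps the helper independent of where swipl is installed.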

Choosing a malloc library for 24x7 threaded applications is difficult. Each of them has its own issues. I switched the default from the system malloc to tcmalloc after seeing that both SWISH and the backends running www.swi-prolog.org kept growing while there was no (significant) memory leak in the traditional sense. In other words, most of the memory was wasted inside malloc. tcmalloc does a much better job. I also tried jemalloc. That was better than the system one, but for these workloads tcmalloc was the clear winner. I sent reports about my findings to this forum.

You can typically switch at build time as @peter.ludemann indicates. Depending on the platform you can often also use hacks like LD_PRELOAD on Linux to get the system malloc in front of one that is linked in.

To give an idea of the magnitude of the problem, here are the numbers for an app that allocated 9GB, and then released them after making a calculation:

13 ?- malloc_property(P), arg(1,P,V), format('~I',[V]).
19_121_688  % about 19 MB of actually allocated data after release, good!
P = 'generic.current_allocated_bytes'(19121688),
V = 19121688 ;
9_609_510_912 % but tcmalloc is holding on to 9GB !
P = 'generic.heap_size'(9609510912),
V = 9609510912 ;
...

As you can see, the currently allocated bytes after release amount to only 19MB, which is great. But tcmalloc was still holding on to about 9GB of memory when these numbers were taken, about one hour after the app released it.
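
To put one number on the gap, the two properties can simply be subtracted. This is a rough sketch: retained_bytes is a made-up helper name, and the difference is best read as an upper bound, on the assumption that generic.heap_size may also count pages tcmalloc has already unmapped.

```shell
# retained_bytes HEAP_SIZE ALLOCATED (hypothetical helper):
# bytes sitting in the allocator's heap that the application is not using.
retained_bytes() {
    echo $(( $1 - $2 ))
}

# Values taken from the malloc_property/1 output above.
retained_bytes 9609510912 19121688    # prints 9590389224
```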

This will starve all other apps except the one that did the large allocation first. This is why you won’t see the problem too much with SWISH because you need to do a very large allocation from a separate application; then you will see the app die with an out of memory error.

This does not happen with the system malloc, which releases the memory quickly.

I’m wondering what type of allocation is causing this problem. Large temporary allocations, including the stacks, explicitly use mmap() and friends and thus return memory to the system. The most likely causes that come to mind are large amounts of temporary dynamic clauses, large amounts of (large) atoms and large tables.

One flaw of the current memory management is that, for example, memory associated with a dynamic predicate for storing its clauses and indexes is simply malloced. Typically other stuff that uses malloc will prevent the memory for the dynamic predicate from being released after all its clauses have been retracted. Depending on the use case, clauses may have vastly different sizes, which complicates memory reuse even more. Similar problems are associated with applications that temporarily use large amounts of large atoms or big tables. Ideally, potentially big temporary objects should keep all the memory they use together.

The allocation is a large blob (atom) that represents a large matrix, on which calculations are done, then the data is properly released by the swipl GC a little time after the calculation is finished.

Everything is working well (I should say very well) from the swi-prolog point of view. The problem is that tcmalloc holds on to the 9GB heap even after prolog has released the data properly.

I see. There are several ways out. One would be to use a string rather than an atom and rely on the stack allocation that is released. There are two issues. One is that strings are subject to stack GC and may thus be shifted if you make any call on Prolog that affects the stacks. The other is that strings are now either byte arrays or wchar_t arrays, but there are plans to turn them into UTF-8. I don’t know what that will mean for using them as byte arrays. Probably some solution has to be found as some applications depend on this. Finally you cannot share strings between threads.

If this rules out strings you can opt to store a pointer in the blob and allocate the array yourself using mmap(). That will work. It is a bit more work and if you want it to be portable you have to do it differently for e.g., Windows.

A final solution might be to use SWI-Prolog’s internal “use mmap() for large chunks” mechanism, which is also used for the stacks and for some other potentially huge temporary allocations, such as those storing findall/3 results.

In general, atom-GC is not well suited to dealing with giant atoms. It currently triggers if more than 10,000 atoms are known to be potentially garbage (flag agc_margin). With GB-sized atoms that can become a little costly :frowning: In addition, it is a conservative garbage collector. This means there can be false references that prevent an atom from being garbage collected while it would be safe to do so. AGC scans the Prolog stacks for references to atoms asynchronously, i.e., while the Prolog thread that is being scanned continues execution. It just verifies that a cell has an atom tag and that the atom handle refers to a valid atom. The bit pattern could in fact be a float or some bytes from a string, though.

I think the best way out is to store the array outside the blob and provide a predicate to free the array (this leaves the blob, but it is now small). If you miss the free call for some reason, GC might do it for you. Several Prolog internals do that as well, for example a thread handle: if a thread is not joined by the user, atom-GC will do it for you (well, is likely to do it for you).

I see, that is a great explanation of the different options, thanks for writing it down.

I am using the ffi pack to connect to a third-party library: the ffi pack allocates the small blob which then holds a pointer to the data, which is allocated by the third-party library. A memory release function is provided using the ffi pack ~FreeFunc notation, and I can’t change the allocation method of the third party library. On the other hand, tcmalloc is not giving me any benefits, only a major problem.

Which tool(s) are you using for monitoring memory? In the past, I’ve seen some tools reporting more memory used than was actually used. (I’m guessing this isn’t the case, unless you’re doing forks (which can sometimes result in memory getting double-counted) or if file cache memory is somehow getting counted against your process.)

tcmalloc itself, which is reported through malloc_property/1 in swipl. malloc_property/1 is not available unless tcmalloc is installed.

13 ?- malloc_property(P), arg(1,P,V), format('~I',[V]).
19_121_688  % about 19 MB of actually allocated data after release, good!
P = 'generic.current_allocated_bytes'(19121688),
V = 19121688 ;
9_609_510_912 % but tcmalloc is holding on to 9GB !
P = 'generic.heap_size'(9609510912),
V = 9609510912 ;
...

The tcmalloc numbers match all the other OS tools.

Unless you can change the third-party library, I guess using an allocator that gets this right is your only choice. I do not see a reason to change the default to build using tcmalloc when possible. As far as I can tell it does a much better job on most (threaded) Prolog programs. I never really understood why, but glibc’s ptmalloc doesn’t seem to deal well with SWI-Prolog’s scenario where freeing memory typically happens in bursts in the GC thread while allocation is done by the other threads.

We learned there are also scenarios where tcmalloc does a bad job :frowning: It seems the Universal Memory Allocator is still not there :frowning:

For me, to get rid of tcmalloc is as simple as

LD_PRELOAD=/lib/x86_64-linux-gnu/libc.so.6 swipl

malloc_property/1 still thinks tcmalloc is there, but its numbers are bogus.

I haven’t looked inside tcmalloc, but I’m pretty sure that its normal use-case is running a lot of threads in a (kubernetes) container that sets a relatively tight limit on the amount of memory. So, it might have some check against the amount of available memory and decides whether to recycle a blob accordingly. If this theory is correct, changing the maximum memory (using ulimit, setrlimit, control groups, or similar) might solve your problem.
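
For anyone who wants to try this theory, a per-process cap can be set from the shell before starting swipl. This is a sketch only: run_with_vm_limit is a made-up name, ulimit -v is a shell builtin that is common (bash, dash) but not guaranteed by POSIX, and whether tcmalloc changes its behaviour under such a limit is exactly the untested part.

```shell
# run_with_vm_limit KIB CMD... (hypothetical helper):
# run CMD with its virtual address space capped at KIB kibibytes.
# The subshell keeps the limit from leaking into the calling shell.
run_with_vm_limit() {
    limit_kib=$1
    shift
    ( ulimit -v "$limit_kib" && exec "$@" )
}

# Example (assumes swipl is on PATH): cap at 8 GiB = 8 * 1024 * 1024 KiB
#   run_with_vm_limit 8388608 swipl
```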

Thanks, I think this workaround will work.

I think you are quite correct, it seems to be designed to work in containers, and so it won’t kill other apps when it hogs the memory.

Also, you are correct that setting ulimit will prevent starving other apps (much like the solution that @j4n_bur53 proposed above of setting a limit from the command line).

However, this defeats the purpose of the big data app, which needs to use as much memory as needed for its calculations and then release it. Artificially limiting the amount of memory using ulimit or similar will simply prevent some workloads from being processed, as some are larger than others and the big data app tries to use the memory if available, releasing it as quickly as possible.

All in all, disabling tcmalloc is the most common sense solution. And we have two ways of disabling it:

  • using cmake -DUSE_TCMALLOC=OFF
  • or as Jan mentioned: LD_PRELOAD=/lib/x86_64-linux-gnu/libc.so.6 swipl or something similar.

Any reason to claim this is not good enough? Note that runtime switching from one malloc to another is not possible AFAIK. I don’t see a better way than the above. Please share if I’m missing something. Once upon a time SWI-Prolog did its own memory allocation. As is, most generally available allocators are better than this home-brew alternative and using the established API allows relatively easy switching. We just have to wait for the allocator that is best in every setting :slight_smile:

For me these solutions solve the problem, so I am ok with them.

My real worry is not for me, but for the average SWI-Prolog user.

If at some point an application (at some moment, or with some specific input) allocates a large chunk of memory and then releases it, its users will encounter the problem, and it will never cross their minds that there is a problem with the underlying allocator.

In the worst case scenario other apps in their machine will starve for memory. They will simply think that they have to buy a bigger machine or add more RAM, or move applications to another machine; when in reality the problem was the memory allocator.

I have a partiality towards the less knowledgeable, average user, and so I think an out-of-the-box system should be tailored towards less knowledgeable/average users, and for this reason, I would choose to ship swipl with the system malloc by default, while compiling it myself with tcmalloc for SWISH, and telling advanced users that they can compile with tcmalloc by using -DUSE_TCMALLOC=ON or LD_PRELOAD=<tcmalloc> to get some advantages, but would also point out the pitfall of tcmalloc for the advanced user.

In other words, if I were maintaining swipl, I would add a little extra work for myself (i.e. recompiling SWISH with tcmalloc), rather than potentially affect a less knowledgeable user that may never even realize there is a problem with the allocator.

This is just my altruistic self, that wants to protect others even if it is a little bit more work for me.

The swipl that comes with Ubuntu (and on the PPA) uses regular malloc. If you build it yourself, you’ll get tcmalloc only if it’s installed on your system.
(I confirmed this by running ldd on the PPA executable and one I built from source, on Ubuntu 20.04.)

(The PPA version also isn’t PGO-optimized, unfortunately)

(I presume that tcmalloc doesn’t exist for Windows)

I see, this is good.

Wow, I wonder why. Nowadays you only need to set the build type to PGO, without running scripts or anything else (I think Jan and dmchurch made this change a little while ago).

Maybe the PPA version is PGO-optimized: Ann: SWI-Prolog 8.3.25 says “PORT: use PGO for the Ubuntu PPA build.”. But I don’t know how to verify this …

:slight_smile: I guess the question boils down to what is a more “normal” application: a multi-threaded application using a lot of temporary resources (volatile atoms, dynamic predicates) or an application that links an external library that relies on malloc() to release multi Gb sized allocations to the OS? Considering most serious applications I’m involved in I’m tempted to claim the first is a more typical scenario for SWI-Prolog.

That may change. Would be good for consistency as the self-compiled version prefers tcmalloc and both the snap and docker versions use it.

There seems to be a port. So far I didn’t bother. Multi-threaded 24x7 applications mostly run on Linux these days. I do not know whether or not the Windows native allocator suffers from any of these problems.

I added a build type option. It doesn’t seem to help :frowning: Need to figure out why. Probably it is overruled somewhere :frowning:

To me, the main conclusion is that it might be wise to add a section to the manual and possibly a testable Prolog flag. Any volunteers to give the text a first shot? Possibly some copy and paste from this and earlier reports on allocation issues is a good start.

SWI-Prolog also releases most of its large allocations back to the OS, even when using tcmalloc. The problem here concerns 3rd party libraries that rely on a specific behavior of malloc(). Unless Java dictates how malloc for foreign code must behave I guess this behavior remains “unspecified”.

The problem happens with prolog also, not just external libraries:

2 ?- forall( between(1,30 000 000,N), assert(foo(N)) ), retractall(foo(_)), 
     sleep(3), garbage_collect_clauses.
true.

2 ?- process_id(P), format(string(C),'cat /proc/~w/status|grep ^Rss',[P]), shell(C).
RssAnon:	3800772 kB     % still hogging memory
RssFile:	   6852 kB
RssShmem:	      0 kB
P = 8170,
C = "cat /proc/8170/status|grep ^Rss".

However, we could provide a release_free_memory/0 predicate which would call MallocExtension_ReleaseFreeMemory() for tcmalloc and malloc_trim(...) for glibc malloc.

garbage_collect/0 could call this predicate. This is not infallible, but it is much better than the current situation.

$ LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so swipl -l /tmp/p.pl 
2 ?- forall( between(1,30 000 000,N), assert(foo(N)) ), retractall(foo(_)),
     sleep(3), garbage_collect_clauses.
true.

3 ?- process_id(P), format(string(C),'cat /proc/~w/status|grep ^Rss',[P]), shell(C).
RssAnon:	3800772 kB
RssFile:	   6852 kB
RssShmem:	      0 kB
P = 8170,
C = "cat /proc/8170/status|grep ^Rss".

4 ?- 'MallocExtension_ReleaseFreeMemory'.  % This is for tcmalloc, use malloc_trim for glibc malloc
Warning: ldconfig import: 5613: could not parse Cache generated by: ldconfig (GNU libc) release release version 2.33
true.

5 ?- process_id(P), format(string(C),'cat /proc/~w/status|grep ^Rss',[P]), shell(C).
RssAnon:	  41952 kB    % memory has been returned to the OS
RssFile:	   6860 kB
RssShmem:	      0 kB
P = 8170,
C = "cat /proc/8170/status|grep ^Rss".
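
The before/after comparison can be done without eyeballing the output. A minimal sketch, with the made-up helper name rss_anon_kib, assuming a Linux kernel that reports RssAnon in /proc/PID/status:

```shell
# rss_anon_kib (hypothetical helper):
# read /proc/PID/status text on stdin and print the RssAnon value in kB.
rss_anon_kib() {
    awk '/^RssAnon:/ { print $2 }'
}

# Live use, for the process with id 8170 from the transcript:
#   rss_anon_kib < /proc/8170/status

# With the two RssAnon values from the transcript above:
before=3800772
after=41952
echo "returned to the OS: $(( before - after )) kB"    # prints: returned to the OS: 3758820 kB
```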

p.pl:

:- module(m,[   malloc_trim/1,
                'MallocExtension_ReleaseFreeMemory'/0
            ]).

:- use_module(library(ffi)).

:- c_import( " #include \"malloc.h\"
               #include \"gperftools/malloc_extension_c.h\"
             ",
	     [ libc, libtcmalloc_minimal ],
	     [ 'malloc_trim'( int,[void] ),                     % glibc malloc
	       'MallocExtension_ReleaseFreeMemory'( [void] )    % tcmalloc
	     ]).