Poor memory reuse (tcmalloc?)

It turns out that ru_idrss is not supported under Linux.

As for OS tools … it appears that mallinfo might do what I want. Or perhaps use process_id(Pid) and then parse /proc/Pid/statm.

(Life was simpler in the old days.)

See https://www.swi-prolog.org/pldoc/doc_for?object=malloc_info%3Amalloc_info/1. It is hard to interpret the output though (requires a good tutorial on glibc malloc()) and some heuristics, See also the SWISH infrastructure that provides these stats based on various of these ideas.

Thanks - but it doesn’t seem to be worth the effort right now to figure out why I’m seeing variation in indexing look-up; performance is “good enough” with a few million clauses.
(BTW, Google has a faster malloc; it’s aimed at multi-threaded applications but I think even single-threaded code benefits. And it wouldn’t surprise me if it doesn’t play well with malloc_info.)

And the access time shouldn’t go down with more clauses. Variations could be due to different hash distribution.

Some of these libraries you can simply try using LD_PRELOAD= (on Linux). I don’t really think there is a one size fits all. It all depends on the workload and different Prolog programs can exhibit quite different workloads wrt. malloc(). There are a lot of aspects on which malloc implementations can be optimized.

That said, multi-threaded SWI-Prolog does seem to suffer from poor memory reuse, at least when we look at the SWISH stats page. I’m still not 100% sure whether these stats are correct. There might be a problem because the various worker threads (Pengines and HTTP) do the allocations of (notably) atoms and clauses while the gc thread does all the deallocations. I’ve been doing some investigation work on this, but it is fairly hard to figure out what is going on. The SWI-Prolog site suffers from similar issues. Ideas are welcome. I’m particularly interested to figure out where not reused memory was allocated for and with which size distribution.

Google’s TCMalloc may also be better at reuse; but it was mainly designed for reducing lock-contention with multiple threads.

This somewhat old paper describes it: http://goog-perftools.sourceforge.net/doc/tcmalloc.html

The latest version seems to be here: https://abseil.io/blog/20200212-tcmalloc

1 Like

Thanks. I’ve spent some time figuring out when the server of our website and swish tend to grow pretty much forever. To do so I’ve replayed a couple of days worth of log data against a local copy of the website server using 4 concurrent clients as quick as the server wants to handle the requests. This gives some interesting insights.

  • Running under heaptrack, the server is claimed to be using 200Mb at the end of the run. After some initial growing it also stays pretty stable. So there are no serious memory leaks in the conventional meaning.
  • Using glibc ptmalloc implementation it uses at the end of the run 1.3Gb RSS and 16.3 Gb virtual memory.
  • Using tcmalloc (v4.3.0) it uses at the end 400Mb RSS and 680Mb virtual memory.
  • Using jemalloc (v3) it uses at the end 630Mb RSS and 5Gb virtual memory.

An interesting comparison of the performance of these allocators is here. This isn’t very interesting to us. Unlike C++. very few Prolog programs depend critically on malloc() performance. Most of SWI-Prolog’s fine grained heap memory management is either not very time critical (e.g., allocating a clause) or SWI-Prolog allocates a larger buffer and allocates from that (for example findall/3), such that it reduces stress on malloc() and can release memory to the OS for large findall/3 answer sets. The current tabling implementation is an exception, but this too should be handled in a more clever way such that table space is cheaper and can be handed to the OS.

I’m going to have a look in exploiting tcmalloc(), possible also providing some of its advanced features.

3 Likes

Wow! That is quite a difference between tcmalloc and glbc’s malloc.

Yes. I’m investigating what makes the web server case so special. I have some clues, but not definite answers. It seems wise to change the default build to use tcmalloc(). I plan to do so and see what effect that has on swish and the website.

2 Likes

I’ve pushed a number of patches that by default links against tcmalloc. Now works on Linux and MacOS and probably various other Unix-like systems. Both www.swi-prolog.org and swish.swi-prolog.org now run on this version.

If you build the git version you first have to install the Google perftools library that contains tcmalloc. Depending on the OS:

apt install libgoogle-perftools-dev      # Debian, Ubuntu, Mint, etc.
dnf install gperftools-devel             # Redhat, SuSe, Fedora, CentOS, etc.
port install gperftools                  # MacOS using Macports

After building, run

?- check_installation.

which should tell you that tcmalloc is working ok.

Further info is here

If you run heavily multi-threaded code that uses a lot of memory you might want to see whether this makes a difference. Please report success or regression.

2 Likes

For 4-thread code (on a 4-core machine, using concurrent_maplist), I saw no significant change in CPU (all cores were 100% CPU after initial setup). Multiple reruns of both the PPA version of swipl and the github version showed enough variation in maxresident that I can’t say anything about the (e.g., values ranging between 306740k and 504864k), although overall I think the tcmalloc version uses ~5-10% more memory.

Thanks. It all depends what you do of course. The scenario where things go wrong using ptmalloc is probably notably when a lot of temporary memory is allocated for (dynamic) clauses and atoms that are subject to garbage collection. The first happens in multithreaded applications using a lot of dynamic predicates in the various threads for doing their job or SWISH, which compiles the user program into a temporary module. We see the second (atoms) often in tasks that process documents, representing words/tokens/… as atoms. Does any of this apply?

As all malloc comparisons show, there is no single best implementation. I think the system should have a default allocator that may not be the best for all scenarios but performs decent in all scenarios. So far it seems ptmalloc performs (in sense of RSS) bad in a couple of quite common scenarios.

It also seems we still have a problem. tcmalloc seems to never return memory to the OS, where ptmalloc allocates big requests directly using mmap() and returns them immediately on free. So, running

t(N,M) :-
    length(L, N),
    concurrent_maplist(do(M), L).

do(M, V) :-
    numlist(1, M, L),
    sumlist(L, V).
?- t(16, 1 000 000).

We have the old version using practically no memory after completing this job and the new version using 1.5Gb. To reproduce you need 16 cores to actually run the 16 jobs concurrently. If you do not have these you can use ?- set_prolog_flag(cpu_count, 16). to fool the system.

Some more work seems needed :frowning: tcmalloc provides functions to return memory to the OS or else we should allocate the stacks directly using mmap(). Did that long ago, but then I was happy I no longer needed to, simplifying OS requirements :frowning:

2 Likes

I’m getting the impression that it makes sense to have a closer look at jemalloc. See also https://arxiv.org/pdf/1905.01135.pdf

Opinions welcome. The whole thing relies heavily on the workload :frowning:

My code is pretty simple (and a bit simple-minded) – it reads in JSON, converts to a nicer predicate form, asserts the predicates, then does a few forall loops to output the facts. The JSON/convert/assert code is done in multiple threads (one per input file). So, there’s lots of creation of clauses and atoms, but no obvious opportunities for garbage collection.

If you like, I could modify this code to run twice, with a retractall and explicit garbage-collection – would that be useful?

As for tcmalloc – it was designed to handle the typical Google server, which is a highly multi-threaded HTTP server with a dispatch similar to SWI-Prolog’s http_handler. So, returning memory to the OS probably isn’t a significant consideration. (Google servers run within something like a Kubernetes container, so they are strictly resource limited. However, nobody cares if a server OOMs occasionally – the infrastructure is designed to tolerate this and restart/migrate servers as needed.)

I noticed this comment: “It is worth noting that TCMalloc requests memory from the OS in large chunks (typically 1 GiB regions). The address space is reserved, but not backed by physical memory until it is used.”

So, you might need to use tcmalloc’s sdallocx rather than free in some situations.

My recollection is that tcmalloc tried to be no worse than regular malloc for non-server applications. But I could be wrong (I recall that there was a long period of roll-out, to ensure that tcmalloc didn’t make things worse – there are important parts of the Goolge infrastructure that aren’t high qps web servers).

The story seems to be complicated. Google is mostly C++, which typically implies high frequency allocations that are typically also freed (deleted) by the same thread that allocated the object. That at least is my mental model. In this model performance is the key issue.

Contrasting, Prolog does most of the work on its stacks. These are currently allocated using malloc() and resized using realloc(). (Re)allocation performance is totally irrelevant here. The other parts where allocation is more critical is adding clauses to (dynamic) predicates, finding atoms and (right now) many of the tabling operations. All this goes to a rather non-standard allocation scheme where the gc thread does most of the deallocation. ptmalloc suffers badly due to its use of arenas. tcmalloc seems to be doing a lot better using per-thread caches.

Gnu (ptmalloc) has practically nothing to configure. tcmalloc has a bit more, but it doesn’t seem to support Prolog stack allocation well, where we typically want to mmap() the stack and release it on thread termination or resize. jemalloc has zillions of parameters, some of which may support the stacks just fine. I now have contact with three projects suffering from poor allocation/reuse. One of the options I’m playing with is to get the stacks out of the equation and use mmap() for them. It shouldn’t be too hard to write a malloc()/realloc()/free() based on mmap() alone and use that for the stacks.

2 Likes

Partially true. (consults somewhat faded 10-year-old memories …) Google code heavily uses callbacks (e.g., http-handler sends request to another node, then creates a callback to handle the result), and they can happen in a different thread; data areas tend to be passed by unique_ptr, so I’d say it’s fairly common for objects to be allocated in one thread and freed in another. (Many objects are also created/freed on the stack, but that’s not interesting; there’s also quite a bit of RAII, which can be either stack- or heap-based.)

Thanks for clarifying. I guess as long as all threads allocate and deallocate there is still no problem for ptmalloc, but we are in the situation where typically the worker threads allocate and the gc thread deallocates. Anyway, tcmalloc deals well with that, especially as it allows flushing the thread allocation cache. SWI-Prolog now calls that in the gc thread after running one of the global collectors. I don’t have a test to see whether that really helps.

Anyway, there is good news and bad news. I’ve pushed a patch that will cause the system to use mmap() directly for the stacks when available. That at least fixes this test from a previous post. It can’t be bad as no allocator is going to do a better job for the stacks. A small gain is that we now know how the requested stack size is rounded up and thus we can actually use the rounded up memory.

The bad news is that the website server didn’t do noticeably better as a result of moving to tcmalloc. Now restarted with the above improvements, so we’ll see …

edit This seems to help a lot. As this uses life data we can of course not be sure, but it seems both backend servers for www.swi-prolog.org now use a lot less memory.

1 Like