Memory management

For a large part thanks to the work started by @mgondan1 SWI-Prolog can now cleanup all allocated memory during shutdown. This in itself has found a number of memory leaks, notably in temporary module handling as used by SWISH. Moreover, it makes identifying memory leaks a lot easier. For most applications there were no significant memory leaks. Now we can also claim this for SWISH :slight_smile: See SWISH -- SWI-Prolog for SHaring (Memory usage section).

There is still one area of concern: memory that is temporary in use for notably tables (tabling) and dynamic predicates. SWISH, but probably also other applications, may allocate huge amounts of memory temporary for populated tables or dynamic predicates. As that is allocated in many small chunks (using malloc()) and easily gets mixed with other more long term allocations, this memory is rarely returned back to the OS. We now find SWISH often in a situation where it is using 1GB memory and lost 5Gb or more in memory that was freed, but not reused.

I think the answer to that is arenas, i.e., larger segments of memory to which we have malloc/free style access and that can be released as a whole when the underlying resource is done. This is not trivial though. Some aspects to consider:

  • Both dynamic predicates and tables may be thread specific or shared. Threads often die after some time, so it would be great of the memory allocated specifically for a thread stays together such that we can release it to the OS as the thread dies.
  • Both dynamic predicates and tables vary enormously in size. Many of them are really small, some can be huge (gigabytes). Notably tables may occur in vast numbers (millions). So, while for the big ones we would be interested in keeping their memory separated and page aligned such that we can release it, we do not want to introduce much overhead of the small ones.
  • At least for the shared predicates and tables, we typically cannot move any memory around (e.g., for compacting) as the memory structures are used by lock free algorithms.
  • Most of the memory is allocated in small objects that occur in a small number of fixed sizes. The only exception is the executable (virtual machine) code associated with clauses which may vary from a few words to megabytes.

Is anyone aware of technology that can deal with this?

One possibly relevant reference is the Rust typed_arena crate (Rust package)

A fast (but limited) allocation arena for values of a single type.

Allocated objects are destroyed all at once, when the arena itself is destroyed. There is no deallocation of individual objects while the arena itself is still alive. The flipside is that allocation is fast: typically just a vector push.

I’ve had a brief experience with it where I needed cyclic references in Rust, which this crate also facilitates. I cannot comment on its performance though…

This is more or less the rough idea. This is already in use in SWI-Prolog at the moment for findall/3 and allocations to walk over terms. These operations may temporarily need large amounts of memory. There is only one such thing active though, there are no intermediate free operations and these temporary allocations are only accessed from a single thread. Notably the walking use what is called internally a chunked stack. That is a stack of small objects of the same size for which the allocation is done in chunks that are big enough to make the overhead small.

The problem outlined above is much more complex. For dynamic predicates as well as tries (tables) we may have frequent free/alloc sequences, so we do need reuse. For large temporary objects you’d like to immediately allocate using mmap() in pages so you can quickly release back to the OS. When it is created though, the object is small and it is hard/impossible to predict whether it will stay small, get huge, will be subject to series of free/alloc or whether its lifespan is limited or it will live as long as the process. A classical trick is of course to assume the object remains small and change policy as it grows over some threshold. Given concurrent access we cannot move it. We might be able to recreate it and leave the old one to GC.

Note that each (large) object needs its own arena. If we would allocate for two dynamic predicates in one arena it could be that one of them is emptied at the end of some computation while the other one remains alive for the rest of the process.

Which memory allocator(s) are you settling on?
I found a blog post that claimed a significant saving by using tcmalloc vs glibc (and mentions some tuning parameters, plus a bug in glibc), although it’s not clear that it would help for your specific situation:

There’s also some discussion about per-thread arenas (which is a different meaning of “arena” than what you’ve mentioned) – I presume that this isn’t an issue? - By the tcmalloc designers, including some tuning parameters: TCMalloc : Thread-Caching Malloc

FWIW, protobufs has an arena management class (C++ only) that’s independent of the underlying memory allocator. The typical use case is for creating a protobuf message with a lot of fields, but it should work with other situations. In particular, Arena::Create is not restricted to protobufs, nor to tcmalloc.