During cleanup, the normal order of reclaiming blobs isn’t followed. Quoting from the documentation:

> These objects are reclaimed regardless of their registration count. The order in which the atoms or blobs are reclaimed under `PL_cleanup()` is undefined.
Streams are a kind of blob (and therefore a kind of atom), so their cleanup order is also undefined. One consequence is that if a stream is created using `Snew()`, the garbage collector can deadlock.
This is @jan’s response (lightly edited) when I sent him a gdb backtrace:
> Shutdown first kills all threads, including the GC thread, so the final cleanup is single-threaded (and skipped if killing the threads failed).
>
> From the stack dump, this can happen if an archive that is still open becomes subject to GC (my guess; might be wrong). That poses an interesting problem. As the thread that abandoned the archive still holds a lock on its streams, we cannot close them, and neither can the initiating thread: while it has a lock, it no longer knows about the file.
>
> This is an interesting case. I have considered before generally closing I/O streams when reclaiming stream blobs. This shows that it’s problematic.
>
> I guess we have some options. I think all of these options imply adding a function to `pl-stream.c` that must be called to release streams from GC. I see some possible functionality:
>
> - If the stream is locked, leave it alone. Fairly safe, but loses precious resources such as file descriptors.
> - Forcefully reclaim the stream and destroy it. As it is garbage, that should be safe. The stream may be linked to other (stream) resources that may also be garbage collected before or after us. Might get complicated. Normal execution can guarantee the order of garbage collection between dependent objects using references on the atoms/blobs. Shutdown cannot. Do we need some API to find out that the system is performing its shutdown?
It’s not clear to me that an API for checking whether we’re in shutdown suffices. Such an API would help avoid a double free, but at the cost of possibly leaking resources instead.
The problem occurs if BLOB1 points to BLOB2 but BLOB2 is freed first. There are two ways of detecting that BLOB1 points to BLOB2:

- BLOB2 keeps some kind of reference count.
- BLOB1 calls `PL_register_atom(BLOB2)` when it adds the reference and `PL_unregister_atom(BLOB2)` when it removes it.
For scenario #1, the “in shutdown” API can be used to prevent BLOB2 from freeing itself while its reference count is non-zero; eventually BLOB1 will be freed, and it can then invoke BLOB2’s release callback.
But scenario #2 seems to require re-running GC over both atoms and streams (`PL_get_stream()` and `PL_release_stream()` are similar to `PL_register_atom()` and `PL_unregister_atom()` in this respect). In theory, there could be as many GC passes as there are blobs. [It’s not clear to me whether `PL_unregister_atom()` also needs to check for being in shutdown, which could add unacceptable overhead in the normal case.]