Deadlock in futex_wait() when using archive_foldl() & threads

I have an application that reads multiple ZIP files, iterating over the entries of each with archive_foldl(). The outer loop is a call to maplist(), with one list element per ZIP file to be read. That all works fine. If, however, the call to maplist() is replaced with concurrent_maplist(), the application still reads the ZIP files OK, but it hangs when halt() is called, with the following stack trace:
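For context, the structure of the code is roughly the following sketch (predicate names and the per-entry work are placeholders, not my actual code):

```prolog
:- use_module(library(archive)).
:- use_module(library(apply)).   % maplist/2
:- use_module(library(thread)).  % concurrent_maplist/2

% Process one ZIP file with archive_foldl/4 (here the per-entry
% goal just counts entries; the real code does more work).
process_zip(ZipFile) :-
    archive_foldl(count_entry, ZipFile, 0, _Count).

count_entry(_Entry, _Archive, N0, N) :-
    N is N0 + 1.

% Serial version: works, and halt/0 exits cleanly.
run_serial(ZipFiles) :-
    maplist(process_zip, ZipFiles).

% Concurrent version: reads fine, but hangs in halt/0.
run_concurrent(ZipFiles) :-
    concurrent_maplist(process_zip, ZipFiles).
```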

Thread 1 (Thread 0x7f926946e780 (LWP 95374) "p2k"):
#0  futex_wait (futex_word=0x55612e214570, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55612e214570, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f9269627fd9 in lll_mutex_lock_optimized (mutex=0x55612e214570) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x55612e214570) at ./nptl/pthread_mutex_lock.c:128
#4  0x00007f92698b8079 in S__close (s=0x55612e213100, flags=4) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-stream.c:2119
#5  0x00007f9269844c15 in Sgcclose (s=<optimized out>, flags=<optimized out>) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-stream.c:2172
#6  0x00007f926907d1e1 in libarchive_close_cb (a=<optimized out>, cdata=0x55612cca3480) at /data2/tmp/swipl-devel-9.3.35/packages/archive/archive4pl.c:369
#7  0x00007f9268358a8f in ?? () from /lib/x86_64-linux-gnu/libarchive.so.13
#8  0x00007f92683589af in ?? () from /lib/x86_64-linux-gnu/libarchive.so.13
#9  0x00007f9268358a36 in ?? () from /lib/x86_64-linux-gnu/libarchive.so.13
#10 0x00007f9268358b1c in ?? () from /lib/x86_64-linux-gnu/libarchive.so.13
#11 0x00007f926907cb45 in archive_free_handle (ar=0x55612cca3480) at /data2/tmp/swipl-devel-9.3.35/packages/archive/archive4pl.c:268
#12 ar_entry_close_cb (handle=0x55612cca3480) at /data2/tmp/swipl-devel-9.3.35/packages/archive/archive4pl.c:1262
#13 0x00007f92698b80ae in S__close (s=0x55612e213600, flags=flags@entry=0) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-stream.c:2136
#14 0x00007f92698b8024 in Sclose (s=<optimized out>) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-stream.c:2167
#15 0x00007f92698b488a in closeStream (s=0x55612e213600) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-file.c:1476
#16 0x00007f92698b532c in closeFiles (all=all@entry=1) at /data2/tmp/swipl-devel-9.3.35/src/os/pl-file.c:1510
#17 0x00007f92698b51cf in dieIO () at /data2/tmp/swipl-devel-9.3.35/src/os/pl-file.c:1411
#18 0x00007f926980e167 in PL_cleanup (status=status@entry=65536) at /data2/tmp/swipl-devel-9.3.35/src/pl-init.c:1847
#19 0x00007f92698b3556 in haltProlog (status=65536, status@entry=0) at /data2/tmp/swipl-devel-9.3.35/src/pl-fli.c:5022
#20 0x00007f92698b32a5 in PL_halt (status=0) at /data2/tmp/swipl-devel-9.3.35/src/pl-fli.c:5069
#21 0x00007f926989eae9 in halt_from_exception (ex=73) at /data2/tmp/swipl-devel-9.3.35/src/pl-pro.c:129
#22 query_loop (goal=goal@entry=417541, loop=false) at /data2/tmp/swipl-devel-9.3.35/src/pl-pro.c:189
#23 0x00007f926989e85d in prologToplevel (goal=417541) at /data2/tmp/swipl-devel-9.3.35/src/pl-pro.c:661
#24 0x00007f92698a7c62 in PL_initialise (argc=<optimized out>, argc@entry=3, argv=<optimized out>, argv@entry=0x7ffe8e8c9fc8) at /data2/tmp/swipl-devel-9.3.35/src/pl-init.c:1524
#25 0x000055611f0ac0b3 in main (argc=3, argv=0x7ffe8e8c9fc8) at /data2/tmp/swipl-devel-9.3.35/src/pl-main.c:139

I found this old thread, which looks related.

Is this a bug? It certainly smells like one. Is there a workaround? For example, would replacing archive_foldl() with a hand-crafted implementation that does all the archive opens/closes in the main loop, rather than in each concurrent_maplist() goal, work?

Some digging with lsof revealed that some calls to close() on the streams returned by archive_open_entry() were missing. With those addressed, the app no longer hangs on halt(), but the difference in behaviour between maplist() and concurrent_maplist() suggests something is still amiss.

A related question: Is library(archive) thread-safe, in particular can archive_open_entry() be called concurrently to read entries from an archive in parallel? The docs suggest not:

If the stream is not closed before the next call to archive_next_header/2, a permission error is raised.

Just checking, I have some ZIP files that are rather slow to read when done serially.

The zip object is not thread-safe. It is a stateful object. If you only do read access, you can open the archive multiple times. Thus, if you want to process all entries concurrently, either create N threads and make them process different entries by number, or first generate an index of all entries and then spread the work over a pool of worker threads. You can do that using concurrent_forall/2 or concurrent_and/2.
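If I've understood this correctly, the index-then-spread approach could look something like the sketch below (untested; process_entry_data/2 is a placeholder for the real per-entry work). Each worker opens its own read-only handle on the same file:

```prolog
:- use_module(library(archive)).
:- use_module(library(thread)).  % concurrent_forall/2

% Collect all entry names in a single, serial pass.
zip_entry_names(ZipFile, Names) :-
    setup_call_cleanup(
        archive_open(ZipFile, Archive, []),
        collect_names(Archive, Names),
        archive_close(Archive)).

collect_names(Archive, [Name|Names]) :-
    archive_next_header(Archive, Name), !,
    collect_names(Archive, Names).
collect_names(_, []).

% Each worker opens its own archive handle and skips to its entry;
% archive_next_header/2 scans forward until Name unifies.
process_one(ZipFile, Name) :-
    setup_call_cleanup(
        archive_open(ZipFile, Archive, []),
        (   archive_next_header(Archive, Name),
            setup_call_cleanup(
                archive_open_entry(Archive, Stream),
                read_string(Stream, _, Data),
                close(Stream)),
            process_entry_data(Name, Data)   % placeholder
        ),
        archive_close(Archive)).

process_zip_concurrently(ZipFile) :-
    zip_entry_names(ZipFile, Names),
    concurrent_forall(member(Name, Names),
                      process_one(ZipFile, Name)).
```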


Thanks, but that doesn’t explain the hang on exit - each archive is opened in a separate thread, with the entries being processed serially by that thread.

The stack trace tells me that your code is not closing all archive handles properly, so this happens at program termination. However, blob cleanup handlers are called in arbitrary order during Prolog shutdown (read: the order they happen to appear in the atom hash table). The archive library doesn’t handle that very well. As the stream locks are recursive locks, it works fine if all is in the same thread, but deadlocks if multiple threads are involved.

In theory this could be fixed in library(archive) by keeping track of all dependencies between the various handles. This is pretty tricky though. I’m happy to apply a PR for that, but I do not plan to actively fix this. Just make sure you properly close all entry streams as well as the archive itself. setup_call_cleanup/3 is your friend!
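For anyone finding this later, a sketch of the pattern that advice suggests: nested setup_call_cleanup/3 calls, so the entry stream is always closed before the archive handle, even on exceptions. The with_entry/3 name is mine, not part of library(archive):

```prolog
:- use_module(library(archive)).

% Nested setup_call_cleanup/3: the inner cleanup closes the entry
% stream before the outer cleanup closes the archive handle, so no
% open blob is left for shutdown-time cleanup (whose ordering is
% arbitrary and can deadlock across threads).
with_entry(ZipFile, Name, Goal) :-
    setup_call_cleanup(
        archive_open(ZipFile, Archive, []),
        (   archive_next_header(Archive, Name),
            setup_call_cleanup(
                archive_open_entry(Archive, Stream),
                call(Goal, Stream),   % Goal receives the open entry stream
                close(Stream))
        ),
        archive_close(Archive)).
```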

Got it, thanks, that explains the difference between the maplist() and concurrent_maplist() scenarios.