Size limitations on qsave_program?

Hi -

I am computing some very large tables (somewhere between 1 million and 100 million entries - I don’t actually keep track of how big they get) which I assert as facts. I then try to save the program state so I can resume my computation with the pre-computed tables:

qsave_program('gtestr', [goal(main), stand_alone(true)])

This works well for smaller problems but SWI dies for the problems that induce the very large tables. The stacktrace is below. Note that the computation of the tables works well, even when they are very large (we have 270GB physical RAM and the stack limits are set accordingly), it is just the saving that fails.
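
Since the table size is not tracked above, one way to check how many facts are about to be written is to ask for the clause count just before calling qsave_program/2. This is a minimal sketch: the predicate name t/3 and the helper table_size/2 are hypothetical, and the table is assumed to live in the module the helper is called from.

```prolog
:- dynamic t/3.            % hypothetical table predicate, for illustration only

%!  table_size(+Name/Arity, -Count) is semidet.
%   Count is the number of clauses currently stored for the
%   predicate Name/Arity (hypothetical helper).
table_size(Name/Arity, Count) :-
    functor(Head, Name, Arity),
    predicate_property(Head, number_of_clauses(Count)).
```

Calling ?- table_size(t/3, N). right before saving gives a rough idea of how large the saved state is likely to be.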

I’m using swipl 8.1.30 for x86_64-linux, running on a Linux kernel 4.15.0-54-generic, and an Intel Xeon CPU E5-2640 v2 @ 2GHz processor with 16 physical cores (although nothing else is running, and my application is single-threaded).

Any idea what can go wrong?

Thanks,

– Bernd

ERROR: In:
ERROR:   [21] close(<stream>(0x558954b2e090))
ERROR:   [19] qsave:write_zip_state(<zipper>(0x558954b2dad0),runtime,none,[goal(...),...]) at /usr/lib/swipl/library/qsave.pl:168
ERROR:   [18] setup_call_catcher_cleanup(qsave:zip_open_stream(<stream>(0x558954b2d6b0),<zipper>(0x558954b2dad0),[]),qsave:write_zip_state(<zipper>(0x558954b2dad0),runtime,none,...),_7419814,qsave:zip_close(<zipper>(0x558954b2dad0),...)) at /usr/lib/swipl/boot/init.pl:564
ERROR:   [16] qsave:write_state(<stream>(0x558954b2d6b0),runtime,none,[goal(...),...]) at /usr/lib/swipl/library/qsave.pl:155
ERROR:   [15] setup_call_catcher_cleanup(qsave:open('v20.exe',write,<stream>(0x558954b2d6b0),...),qsave:write_state(<stream>(0x558954b2d6b0),runtime,none,...),_7419950,qsave:finalize_state(_7419994,<stream>(0x558954b2d6b0),'v20.exe')) at /usr/lib/swipl/boot/init.pl:564
ERROR:   [14] '<meta-call>'(qsave:(...,...)) <foreign>
ERROR:   [13] setup_call_catcher_cleanup(qsave:open_map(...),qsave:(...,...),_7420078,qsave:close_map) at /usr/lib/swipl/boot/init.pl:564
ERROR:   [11] qsave:qsave_program('v20.exe',gtestr:[...|...]) at /usr/lib/swipl/library/qsave.pl:136
ERROR:    [9] gtestr:main at /app/src/gtestr.pl:67
ERROR:    [8] catch(gtestr:main,error(io_error(write,<stream>(0x558954b2e090)),context(...,'Invalid argument')),'$toplevel':true) at /usr/lib/swipl/boot/init.pl:482
ERROR:    [7] catch_with_backtrace('<garbage_collected>',error(io_error(write,<stream>(0x558954b2e090)),context(...,'Invalid argument')),'<garbage_collected>') at /usr/lib/swipl/boot/init.pl:532
ERROR:
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.

I wouldn’t expect problems with saved states below 2Gb, though I’m not sure whether that limit applies before or after compression. Part of the saved state code is really old, happily using long types. This should be safe on non-Windows, but can lead to a 2Gb size limit on Windows. Other parts may use (unsigned) int.

What is the message just before the “ERROR: In:” line?

Hi -

The missing message line reads ERROR: -g gtestr:main: I/O error in write on stream <stream>(0x56096b817ac0) (Invalid argument).

Thanks,
Phillip

Interesting. I guess the way out is to use a C debugger (e.g., gdb) and put a breakpoint on Sseterr() and possibly Sset_exception(). How to proceed depends a little on the OS, your expertise in C debugging and how easy it is for me to reproduce this. Is it as simple as merely generating a giant dynamic predicate and then creating a saved state?

Is it as simple as merely generating a giant dynamic predicate and then creating a saved state?

This seems to be the case, yes. We were able to induce the same error using this program:

:- module(test, []).

:- dynamic t/3.

fill(0) :-
	!.
fill(X) :-
	!,
	X > 0,
	numlist(0, X, L),
	random_permutation(L, K),
	assert(t(X, L, K)),
	Y is X - 1,
	fill(Y).

main :- writeln(success).

make :-
	fill(40000),
	qsave_program('./test', [goal(main), stand_alone(true)]).

Have not yet had the time to set up a debug bench for this, but thanks for the breakpoint locations!

Regards,
Phillip

Thanks. This reproduces in 21 min using 36Gb memory thanks to the new dev machine with 64Gb sponsored by DataChemists :slight_smile: It stops with the somewhat strangely sized output file of 3.1Gb. I’ll attach a debugger to see whether this is something obvious.

A bit of debugging showed that a flag is required to enable writing zip files over 4Gb. Pushed a patch (498c7dce6b274675b6e0a7b43dc8e9922980b0af) to fix this.

The test now creates a 3.1Gb compressed state representing a 9Gb uncompressed state. It loads in 66 sec on my machine. I wonder what you want to use this for …

Thank you for the effort Jan! I will give the patched version a go :slight_smile:

As for what we use this for, @thefisch is the mastermind behind an advanced test suite generation tool. Before it can generate whatever tests we specify, it needs to compute some standard properties for the given grammar. Snappy for small grammars, but slows down quickly as the grammar grows.

Interesting. Not sure how relevant it is, but to avoid the 10 min wait for the table to be filled before it can be saved, I changed the code a little to run in 16 threads, which does the job in 37 sec. Compare that to the 65 sec it takes to load the giant state. Here is how I did this; run it using ?- fill(40000, 16).

:- dynamic t/3.

rlist(Q) :-
    thread_get_message(Q, M),
    (   M == done
    ->  true
    ;   M = do(I),
        numlist(0, I, L),
        random_permutation(L, K),
        assert(t(I, L, K)),
        rlist(Q)
    ).

fill(N,M) :-
    length(Threads, M),
    message_queue_create(Q),
    maplist(thread_create(rlist(Q)), Threads),
    forall(between(0,N,I),
           thread_send_message(Q, do(I))),
    forall(between(1,M,_),
           thread_send_message(Q, done)),
    maplist(thread_join, Threads).

Very interesting indeed! I must say I have not made much use of the multithreading primitives in Prolog, but perhaps it is time that changed… :slight_smile:

Regarding the patch: we were able to create and load our massive saved state! Thanks a lot for the effort, this will be a huge boon to our work.

Regarding the concurrent creation, I’ve added concurrent_forall/2,3 to library(thread), so we can create the DB using the code below. That does more or less the same as my previous post. The implementation of concurrent_forall/3 is way more complicated because it has to deal with failure and exceptions in the worker pool, abandoning the computation and then failing or throwing. It was a bit of a challenge :slight_smile:

fill(N,M) :-
    concurrent_forall(between(1, N, I),
                      ( numlist(0, I, L),
                        random_permutation(L, K),
                        assert(t(I, L, K))
                      ),
                      [threads(M)]).

Will this abort all threads prematurely, even killing sibling threads,
as soon as an X is reached for which X < 50 fails?

?- concurrent_forall(between(1,100,X), X < 50).

Or will it dry run the test until the end, even if this is logically not
necessary since the outcome was already determined?
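
The logical outcome of such a call can be checked with a small wrapper. This is only a sketch: all_below/2 is a hypothetical helper, and it assumes a SWI-Prolog version whose library(thread) already provides concurrent_forall/2. It shows only whether the call succeeds or fails, not when the sibling workers are stopped.

```prolog
:- use_module(library(thread)).

%!  all_below(+Limit, +Max) is semidet.
%   True if every integer in 1..Max is smaller than Limit,
%   with the comparisons run by concurrent_forall/2
%   (hypothetical helper for illustration).
all_below(Limit, Max) :-
    concurrent_forall(between(1, Max, X), X < Limit).
```

Here ?- all_below(50, 100). fails, because X < 50 is false for X = 50 and above, while ?- all_below(200, 100). succeeds.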

I quote myself:

Ok, I see thread.pl, you use this in the workers:

thread_signal(Main, throw(Why))

If you replace this by a new result queue and a different logic you get balance/1.

Can two independent threads, Thread1 and Thread2, each call your concurrent_forall/2
for their own purposes, or would there be some interference?