Quick Load Files

Does anyone have statistics on the load time of very large files, e.g. in the hundreds of megabytes or larger? On my laptop I consulted one of these files (~500,000 facts) and the load time ran into the tens of minutes, not seconds. If several or more of these files are used at the same time, loading could take hours and then fail due to lack of memory.

Do not use consult for that. The compiler is far too general, dealing with source locations, checking for functions, etc. For lots of data, just use a read/assert loop. That is what library(persistency) does (well, it maintains a journal, so you can also retract). If the data doesn't change, you can also compile it using qcompile/1 and load the .qlf file using any of the normal load predicates.
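For illustration, a minimal library(persistency) sketch; the predicate name, argument types, and journal file name here are invented for the example:

:- use_module(library(persistency)).

% Declaring a predicate persistent makes the library generate
% assert_uniprot_id/3 and retract_uniprot_id/3 wrappers.
:- persistent
       uniprot_id(entry:atom, status:atom, length:integer).

% Attaching replays the journal and appends subsequent
% assertions/retractions to it.
attach_db :-
    db_attach('uniprot_id.journal', []).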

500,000 facts should be really easy, but of course it depends on the facts. If they contain large data structures it can get arbitrarily expensive.

Your timings seem ridiculously slow. Possibly one of the above explains it. Else, please share details.


Feedback on loading large, stable fact files (100 megabytes to a few gigabytes). Here are some load times using consult/1 on standard .pl files and then on the same files compiled to Quick Load Format (.qlf).

Processor: Intel Core i7-5500U CPU @ 2.40 GHz
RAM: 8.00 GB
D: drive: 256 GB SanDisk USB 3.0 thumb drive

% -------------------------------------

File Size: 41.0 MB (43,034,398 bytes) Lines: 559,077
Example line:

uniProt_identification(entry_name(swiss_prot,"001R","FRG3G"),reviewed,256).

Example usage:

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_identification')).
% 61,499,195 inferences, **12.328 CPU** in 12.469 seconds (99% CPU, 4988528 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_identification').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_identification.qlf')).
% 429 inferences, **0.219 CPU** in 0.337 seconds (65% CPU, 1961 Lips)
true.

*.pl Size: 41.0 MB (43,034,398 bytes)
*.pl GZip Size: 2.89 MB (3,033,871 bytes)
*.qlf Size: 22.3 MB (23,468,300 bytes)
*.qlf GZip Size: 3.52 MB (3,693,834 bytes)

% -----------------

File Size: 115 MB (120,774,927 bytes) Lines: 1,215,465
Example line:

uniProt_organism_english_name(entry_name(swiss_prot,"001R","FRG3G"),"FV-3").

Example usage:

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species')).
% 149,729,072 inferences, **28.672 CPU** in 29.530 seconds (97% CPU, 5222158 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species.qlf')).
% 429 inferences, **0.594 CPU** in 0.906 seconds (66% CPU, 723 Lips)
true.

*.pl Size: 115 MB (120,774,927 bytes)
*.pl GZip Size: 11.7 MB (12,324,123 bytes)
*.qlf Size: 71.9 MB (75,438,440 bytes)
*.qlf GZip Size: 13.8 MB (14,534,767 bytes)

% -----------------

File Size: 226 MB (237,750,972 bytes) Lines: 559,077
Example line:

uniProt_sequence_data(entry_name(swiss_prot,"001R","FRG3G"),"MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL").

Example usage:

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data')).
% 64,294,643 inferences, **21.563 CPU** in 22.370 seconds (96% CPU, 2981781 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data.qlf')).
% 429 inferences, **0.922 CPU** in 1.529 seconds (60% CPU, 465 Lips)
true.

*.pl Size: 226 MB (237,750,972 bytes)
*.pl GZip Size: 58.9 MB (61,827,281 bytes)
*.qlf Size: 212 MB (222,645,246 bytes)
*.qlf GZip Size: 62.7 MB (65,802,502 bytes)

% -----------------

File Size: 562 MB (589,457,325 bytes) Lines: 4,492,023
Example line:

uniProt_feature(entry_name(swiss_prot,"001R","FRG3G"),("CHAIN",1,256,"Putative transcription factor 001R.",,"PRO_0000410512")).

Example usage:

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data')).
% 1,087,048,078 inferences, **191.406 CPU** in 194.518 seconds (98% CPU, 5679272 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data.qlf')).
% 463 inferences, **3.688 CPU** in 6.738 seconds (55% CPU, 126 Lips)
true.

*.pl Size: 562 MB (589,457,325 bytes)
*.pl GZip Size: 40.3 MB (42,274,614 bytes)
*.qlf Size: 535 MB (561,345,143 bytes)
*.qlf GZip Size: 52.4 MB (55,027,409 bytes)

% -----------------

File Size: 2.30 GB (2,472,802,790 bytes) Lines: 25,624,211
Example line:

uniProt_reference_authors(reference_id(entry_name(swiss_prot,"001R","FRG3G"),1),"Tan W.G.").

Example usage:

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors')).
% 3,126,154,467 inferences, **649.938 CPU** in 671.015 seconds (97% CPU, 4809931 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors.qlf')).
% 429 inferences, **14.438 CPU** in 24.734 seconds (58% CPU, 30 Lips)
true.

*.pl Size: 2.30 GB (2,472,802,790 bytes)
*.pl GZip Size: 170 MB (178,322,169 bytes)
*.qlf Size: 1.34 GB (1,444,645,237 bytes)
*.qlf GZip Size: 211 MB (221,565,080 bytes)

% -----------------

The data in the files can basically be thought of as rows in an SQL table, with the structure added. Thus the structure is repeated on every line, e.g. uniProt_identification(entry_name(_,_,_),_,_).


For QLF internal details, see pl-wic.c.


So, roughly 15-30x speedup. I suspect that it would be more if I/O weren’t a significant component.
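(From the numbers above: 29.530 s / 0.906 s ≈ 33x for the 115 MB file, but only 22.370 s / 1.529 s ≈ 15x for the 226 MB sequence file with its long strings.)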

Do you have file size figures for the .pl vs. .qlf files? (Also for the gzipped .pl and .qlf files, to get a feeling for their information content.)

The file size is comparable, depending on the style (comments, identifier length, expansion). A large part of the difference is the term/goal expansion done for normal compilation: QLF files simply hold the final VM code. Another important factor is that atoms are stored only once in a QLF file. This implies a second use of the same atom is a simple array lookup instead of tokenizing and looking up the atom in the global atom table. This technique is used for all the stuff that normally requires table lookups.
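As a concrete (hypothetical) illustration of what "term expansion" means here: a hook like the following is tried for every term a normal load reads, whereas a .qlf load bypasses it because the stored VM code is already post-expansion. The id/3 term and the rewrite are invented for the example:

% Hypothetical expansion hook; a normal load offers every term it
% reads to term_expansion/2 (and goals to goal_expansion/2) before
% compiling it. A .qlf file skips this machinery entirely.
term_expansion(id(Entry, Status, Len),
               uniProt_identification(Entry, Status, Len)).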

It seems that on this Windows system reading becomes I/O bound. On Linux it is typically still CPU bound :-)

XSB has a dynload command next to loading normal source code. Possibly we need something similar, i.e., a simple loader that skips as many as possible of the normal compiler steps that deal with preprocessing, source administration, etc. Another thing on my wish list would be a way to write QLF files directly from Prolog.


Single data point. Using the Logtalk core files (compiled to and collected in a single Prolog file) and the embedding script for SWI-Prolog I get:

Prolog file: 1,079,943 bytes
QLF file: 340,538 bytes

Another data point, on Linux with Intel Core i5-7500 CPU @ 3.40GHz x 4, with 16 GB RAM and SSD (although I don’t think that the SSD matters – the files probably are all in cache).

600,439 clauses (116 MB in the .pl file, 56 MB in the .qlf file)
load (qcompile): 12-13 sec
read-assertz: 3.2 sec
load (qlf): 0.45 sec
speedup: 25-30x for qlf over consult; read-assertz takes ~7x the qlf time

With the .qlf format, the dominant cost of loading becomes building the JIT indexes. I force these by some initial queries in parallel:

?- concurrent_maplist(index_pred,
                      [fact(a,b,_), fact(a,_,_), fact(_,b,c), fact(_,_,c), ...,
                       edge(a,b,_),
                       ...]).

% Run Goal once (whether it succeeds or fails) so that the JIT index
% for that argument instantiation pattern gets built, and log the time.
index_pred(Goal) :-
    statistics(cputime, T0),
    ( Goal -> true ; true ),
    statistics(cputime, T1),
    T is T1 - T0,
    debug(log, 'Indexed ~q in ~3f sec', [Goal, T]).
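
If your SWI-Prolog version provides it, jiti_list/1 lists the just-in-time indexes that have actually been created; that is a quick way to check that the warm-up queries built what was intended.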

and this is the code for a read-assertz loop:

% Time a simple read/assertz loop over a file of facts,
% closing the stream when done.
load :-
    open('my_facts.pl', read, In),
    statistics(cputime, T0),
    read_assert(In),
    statistics(cputime, T1),
    close(In),
    T is T1 - T0,
    format('Loaded: ~3f sec~n', [T]).

% Read terms one at a time and assert them until end_of_file.
read_assert(In) :-
    read(In, T),
    (  T == end_of_file
    -> true
    ;  assertz(T),
       read_assert(In)
    ).

I updated my results with timings for a read/assertz loop.
tl;dr: consult : read-assertz : qlf – 25 : 7 : 1
YMMV
