Quick Load Files

Does anyone have statistics on the load time of very large files, e.g. in the 100s of megabytes or larger? On my laptop I did a consult of one of these files (~500,000 facts) and the load time was in the tens of minutes, not seconds. If several of these files are used at the same time, loading could take hours and then fail due to lack of memory.

Do not use consult for that. The compiler is far too general, dealing with source locations, checking for functions, etc. For lots of data, just use a read/assert loop. That is what library(persistency) does (well, it maintains a journal, so you can also retract). If the data doesn’t change you can also compile it using qcompile/1 and load the .QLF file using any of the normal load predicates.
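The read/assert loop mentioned above can be sketched in a few lines (a minimal sketch; `load_facts/1` and `load_stream/1` are illustrative names, not library predicates):

```prolog
% Minimal read/assert loader: no source administration, no term
% expansion, no function checking -- just read terms and assert them.
load_facts(File) :-
    setup_call_cleanup(
        open(File, read, In),
        load_stream(In),
        close(In)).

load_stream(In) :-
    read_term(In, Term, []),
    (   Term == end_of_file
    ->  true
    ;   assertz(Term),
        load_stream(In)
    ).
```

For static data, the QLF route is a one-time `?- qcompile(File).` which writes `File.qlf`; subsequent sessions can then load the `.qlf` file with any of the normal load predicates such as consult/1.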

500,000 facts should be really easy, but of course it depends on the facts. If they contain large data structures it can get arbitrarily expensive.

Your timings seem ridiculously slow. Possibly one of the above explains it. Otherwise, please share details.


Feedback on loading large, stable fact files (100 MB to a few GB). Here are some load times using consult/1 on standard .pl files, and again after compiling to Quick Load Format (.qlf) files.

Processor: Intel Core i7-5500U CPU @ 2.40 GHz
RAM: 8.00 GB
D drive: SanDisk USB 3.0 256 GB thumb drive

% -------------------------------------

File Size: 41.0 MB (43,034,398 bytes) Lines: 559077
Example line: uniProt_identification(entry_name(swiss_prot,"001R","FRG3G"),reviewed,256).
GZip Size: 2.89 MB (3,033,871 bytes)

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_identification')).
% 61,499,195 inferences, 12.328 CPU in 12.469 seconds (99% CPU, 4988528 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_identification').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_identification.qlf')).
% 429 inferences, 0.219 CPU in 0.337 seconds (65% CPU, 1961 Lips)
true.

qlf Size: 22.3 MB (23,468,300 bytes)
qlf GZip Size: 3.52 MB (3,693,834 bytes)

% -----------------

File Size: 115 MB (120,774,927 bytes) Lines: 1215465
Example line: uniProt_organism_english_name(entry_name(swiss_prot,"001R","FRG3G"),"FV-3").
GZip Size: 11.7 MB (12,324,123 bytes)

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species')).
% 149,729,072 inferences, 28.672 CPU in 29.530 seconds (97% CPU, 5222158 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_organism_species.qlf')).
% 429 inferences, 0.594 CPU in 0.906 seconds (66% CPU, 723 Lips)
true.

qlf Size: 71.9 MB (75,438,440 bytes)
qlf GZip Size: 13.8 MB (14,534,767 bytes)

% -----------------

File Size: 226 MB (237,750,972 bytes) Lines: 559077
Example line: uniProt_sequence_data(entry_name(swiss_prot,"001R","FRG3G"),"MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL").
GZip Size: 58.9 MB (61,827,281 bytes)

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data')).
% 64,294,643 inferences, 21.563 CPU in 22.370 seconds (96% CPU, 2981781 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_sequence_data.qlf')).
% 429 inferences, 0.922 CPU in 1.529 seconds (60% CPU, 465 Lips)
true.

qlf Size: 212 MB (222,645,246 bytes)
qlf GZip Size: 62.7 MB (65,802,502 bytes)

% -----------------

File Size: 562 MB (589,457,325 bytes) Lines: 4492023
Example line: uniProt_feature(entry_name(swiss_prot,"001R","FRG3G"),("CHAIN",1,256,"Putative transcription factor 001R.",[],"PRO_0000410512")).
GZip Size: 40.3 MB (42,274,614 bytes)

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data')).
% 1,087,048,078 inferences, 191.406 CPU in 194.518 seconds (98% CPU, 5679272 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_feature_table_data.qlf')).
% 463 inferences, 3.688 CPU in 6.738 seconds (55% CPU, 126 Lips)
true.

qlf Size: 535 MB (561,345,143 bytes)
qlf GZip Size: 52.4 MB (55,027,409 bytes)

% -----------------

File Size: 2.30 GB (2,472,802,790 bytes) Lines: 25624211
Example line: uniProt_reference_authors(reference_id(entry_name(swiss_prot,"001R","FRG3G"),1),"Tan W.G.").
GZip Size: 170 MB (178,322,169 bytes)

?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors')).
% 3,126,154,467 inferences, 649.938 CPU in 671.015 seconds (97% CPU, 4809931 Lips)
true.

?- qcompile('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors').
?- time(consult('D:/Cellular Information/UniProt/facts/uniProt_fact_reference_authors.qlf')).
% 429 inferences, 14.438 CPU in 24.734 seconds (58% CPU, 30 Lips)
true.

qlf Size: 1.34 GB (1,444,645,237 bytes)
qlf GZip Size: 211 MB (221,565,080 bytes)

% -----------------

The data in the files can basically be thought of as rows in an SQL table, with the structure made explicit. That structure is therefore repeated on every line, e.g. uniProt_identification(entry_name(_,_,_),_,_).
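Since all of the fact tables above share the entry_name/3 key, an SQL-style join is just a conjunction of goals (`protein_species_length/3` is a made-up name for illustration):

```prolog
% Join two of the fact tables on their shared entry_name/3 key.
protein_species_length(Entry, Species, Length) :-
    uniProt_identification(Entry, reviewed, Length),
    uniProt_organism_english_name(Entry, Species).
```

With first-argument indexing on the entry_name/3 term, such lookups stay fast even at these table sizes.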


For QLF internal details see: pl-wic.c


So, roughly 15-30x speedup. I suspect that it would be more if I/O weren’t a significant component.

Do you have file size figures for the .pl vs .qlf files? (Also for the gzip-ed .pl and .qlf file, to get a feeling for their information content)

The file size is comparable, depending on the style (comments, identifier length, expansion). A large part of the difference is the term/goal expansion done for normal compilation: QLF files simply hold the final VM code. Another important factor is that atoms are stored only once in a QLF file. This implies a second use of the same atom is a simple array lookup instead of tokenizing and looking up the atom in the global atom table. This technique is used for all the stuff that normally requires table lookups.

It seems that on this Windows system reading becomes I/O bound. On Linux it is typically still CPU bound :-)

XSB has a dynload command next to loading normal source code. Possibly we need something similar, i.e., a simple loader that skips as many of the normal compiler steps as possible (preprocessing, source administration, etc.). Another thing on my wish list would be a way to write QLF files directly from Prolog.


Single data point. Using the Logtalk core files (compiled to and collected in a single Prolog file) and the embedding script for SWI-Prolog I get:

Prolog file: 1079943 bytes
QLF file: 340538 bytes