Persistent predicates based on RocksDB

I’m starting to work again on rocks_preds.pl, beginning with multi-argument indexing. There doesn’t seem to be much detailed documentation on how SWI-Prolog does just-in-time indexing, so I started reading the code in pl-index.c, intending to learn from it and duplicate some of the logic in Prolog code (in rocks_preds.pl).

It occurs to me that a more general solution might be to extend the builtin assertz/1 to use multiple “backends”, one of which would use RocksDB. This would allow transparently extending existing programs to “unlimited” size clause databases(*). Does this make sense? If not, I’d still want to reverse-engineer the existing “JITI” code, to see what can be used by rocks_preds. Also, are there any test cases for indexing, so that I can ensure that indexing is being done correctly? (e.g., is there a way to make vm_list/1 show the “trust me” and similar 1st-argument indexing, or more details about the JIT indexing?)

(*) Although it shouldn’t be difficult to take existing code and modify it (and/or define term/goal expansion) to use the rocks_pred interface.
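For the footnote’s term/goal-expansion route, a minimal sketch might look like the following (assuming the rdb_assertz/1 interface of rocks_preds.pl; is_rocks_pred/1 is a hypothetical helper that decides which predicates live in RocksDB):

:- use_module(rocks_preds).

% Rewrite assertz/1 goals for selected predicates into rdb_assertz/1
% at compile time.  is_rocks_pred/1 is an assumed helper.
user:goal_expansion(assertz(Clause), rdb_assertz(Clause)) :-
    nonvar(Clause),
    clause_head(Clause, Head),
    is_rocks_pred(Head).

clause_head((Head :- _), Head) :- !.
clause_head(Head, Head).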

I think yes. We can already do so using wrap_predicate/4. The wrapper can detect it concerns a RocksDB pred and do the rocks thing or else call its wrapped goal to do the normal thing.
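A minimal sketch of that idea, assuming a hypothetical is_rocks_pred/1 test and a hypothetical rocks_call/1 that proves the goal against the RocksDB clauses:

% Route calls to p/2 through RocksDB when its clauses live there,
% otherwise fall through to the normal in-memory definition.
:- wrap_predicate(user:p(A, B), rocks_backend, Wrapped,
                  (   is_rocks_pred(p(A, B))
                  ->  rocks_call(p(A, B))
                  ;   Wrapped
                  )).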

You’ll anyway have to do the indexing for the rocks version yourself. Borrowing some ideas is of course good. Maybe you find it can be done better :-)

There are tests. Just grep for jit in the test dir. The VM code shows little of the indexing. In SWI-Prolog that is not part of the clauses, but of the “supervisor”. These VM codes choose the indexing technique used (as well as some other housekeeping such as the mentioned wrappers, meta argument expansion, calling foreign code, etc.) The index returns a chain of candidate clauses and the last one is trusted (no choicepoint is created).

Where do I find the “supervisor” code that does this? Is it pl-vmi.c:3506 VMI(S_STATIC, …)?

I was thinking of something different from using wrap_predicate/4: using RocksDB for the buckets, indexing, and storage, but somehow reusing the logic from the JITI code to decide which index to use, and adding an index as needed. That could be a lot more expensive than in the in-memory situation, so we might want a way of explicitly setting indexes — at worst, the way I do this for in-memory predicates, by doing a query with the instantiation pattern I want indexed.
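For the in-memory case that trick looks roughly like this (a sketch; edge/3 is just an illustrative predicate, and jiti_list/1 is only used to inspect the result):

% Force a JIT index by issuing one query with the instantiation
% pattern that should be indexed (here: second argument bound).
?- ignore(edge(_, some_key, _)).

% Show which indexes the JITI machinery has created.
?- jiti_list(edge/3).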

The instructions are named S_*. They are produced by pl-supervisor.c, normally on the first call from the initial supervisor S_VIRGIN or explicitly after commands such as wrap_predicate/4.

I don’t really see that working and I doubt that is needed. One never knows of course :-)

Jan, thank you very much for the RocksDB interface — it really helps to deal with billions of facts. However, loading with assertz is too slow: even in your benchmarks it takes hours, while reading the comparable CSV file takes tens of seconds.

Do you know if it is possible to speed up the load of predicates?

My current task is to store a CSV file with 0.1 billion lines as a fact database. I am using essentially this simple loop:

store_facts_from_file(File, Pred, Header) :-
    call_time(
        forall(csv_read_file_row(File, Row,
                                 [functor(Pred), skip_header(Header)]),
               rdb_assertz(Row)),
        T),
    print_message(informational,
                  store_facts_from_file(File, Pred, Header, T)).
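For the informational message to render nicely, a matching message rule is needed somewhere; a minimal sketch (the wording is illustrative, and T is the dict produced by call_time/2):

:- multifile prolog:message//1.

prolog:message(store_facts_from_file(File, Pred, Header, T)) -->
    [ 'Stored ~w as ~w (skip_header=~w): ~p'-[File, Pred, Header, T] ].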

The speed is roughly the same as for the RDF (Geonames) example in https://github.com/JanWielemaker/rocks-predicates

Experiments with Glean (glean.software) show that storing these facts should take tens of minutes, i.e. about 10 times faster. In those experiments the facts are stored in RocksDB in batches of 1 GB.

Perhaps we should also store facts in batches instead of calling rdb_assertz for each fact individually?

Probably. The rocksdb predicate library is more a proof of concept than anything else. If you have serious plans with it, please contact me in a direct message. We can discuss your case and see what can be done.


Unfortunately my experiments with Prolog are not destined to become production code. On the other hand, this experiment with RocksDB seems quite useful. At the end of the day, RAM usage was minimal (32 MB), while the resulting database is about 15 GB in size.

Though for SWI-Prolog it might be better not to play with RocksDB, but to optimize qcompile/1 to handle big-data fact databases in linear time with constant RAM usage. I tried converting my .csv files into a Prolog fact database and running qcompile/1 on it. Unfortunately, it was only possible to compile a 2 GB file, and it took 10 GB of RAM. After the compilation the fact database worked great.

What do you think: maybe it makes sense to write a C-based tool that writes .qlf files directly from .csv, using pl-wic.c? In theory it should work in a streaming fashion, using constant RAM.

How much RAM do .qlf files use when we load them using consult? Approximately their size, or much more?


There is library(table) that allows for read-only access to records in a static file, where it uses a binary tree to index on one key. There is also the HDT package that provides read-only access to large compiled triple sets (RDF). But if you do not need the functionality of a full Prolog clause with a head and a body, why not just use a database? If it is about local data, sqlite is quite nice. You can access that through the bundled ODBC or one of the packages that provide an sqlite interface. The latter options are probably faster. Then you add a small layer on top of it to make it appear as a set of normal Prolog predicates.
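Such a layer can be quite thin. A minimal sketch using the bundled ODBC interface (the DSN people and the table person(name, age) are illustrative assumptions):

:- use_module(library(odbc)).

% Expose an SQLite table person(name, age) as a Prolog predicate.
% The connection is opened once and reused on subsequent calls.
person(Name, Age) :-
    odbc_connect(people, Conn, [alias(people), open(once)]),
    odbc_query(Conn, 'SELECT name, age FROM person', row(Name, Age)).

On backtracking, odbc_query/3 enumerates the remaining rows, so person/2 behaves like an ordinary (read-only) predicate.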


A small remark: an in-memory DB in SQLite is “fast” (for some value of fast), but a persistent database can be really slow if you use autocommits (the default). One must handle the transactions manually. In other words, a “batch insert” is possible and roughly 1000x faster if you do:

BEGIN;
-- many inserts, possibly with a prepared statement
-- and new bind variables for each individual insert
END;
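The same pattern driven from SWI-Prolog through ODBC could look roughly like this (a sketch; the DSN mydb and the table facts(a, b) are assumptions):

:- use_module(library(odbc)).

% Batch insert inside a single explicit transaction, using a prepared
% statement with fresh bind values for every row.
batch_insert(Rows) :-
    odbc_connect(mydb, Conn, [auto_commit(false)]),
    odbc_prepare(Conn, 'INSERT INTO facts(a, b) VALUES (?, ?)',
                 [default, default], Stmt),
    forall(member(row(A, B), Rows),
           odbc_execute(Stmt, [A, B])),
    odbc_end_transaction(Conn, commit),
    odbc_free_statement(Stmt),
    odbc_disconnect(Conn).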

Do you know where it is explained how predicates receive their “namespaces” when we create them using csv_read_file_row/3?

I’ve tried to assemble a predicate that stores data from .csv in a tight loop with csv_read_file_row/3. I put this predicate next to rdb_assertz/2 in rocks_preds.pl, and instead of user:application (my CSV rows are transformed into an application predicate), the PI became just application — see https://github.com/JanWielemaker/rocks-predicates/blob/3071007fa6beed347418e4ed5ddcfa4af98f62a1/rocks_preds.pl

This is essentially what I use to write to RDB:

rdb_assertz_from_csv(File, Pred) :-
    default_db(Dir),
    rdb_assertz_from_csv(Dir, File, Pred).

rdb_assertz_from_csv(Dir, File, Pred) :-
    % Read the first row only to derive the predicate indicator.
    csv_read_file_row(File, FirstClause, [functor(Pred), line(1)]),
    rdb_open(Dir, DB),
    clause_head_body(FirstClause, FirstHead, _FirstBody),
    pi_head(PI, FirstHead),
%   print(PI),
    % Continue numbering after the last stored clause, or register the
    % predicate if it is new.
    pred_property_key(PI, last_clause, KeyLC),
    (   rocks_get(DB, KeyLC, Id)
    ->  NId = Id
    ;   register_predicate(DB, PI),
        NId = 0
    ),
    % Writing the rows (this enumeration starts again at line 1) ...
    forall(csv_read_file_row(File, Clause, [functor(Pred), line(LineNumber)]),
           (   LineId is LineNumber + NId,
               rocks_put(DB, KeyLC, LineId),
               pred_clause_key(PI, LineId, KeyClause),
               clause_head_body(Clause, Head, Body),
               rocks_put(DB, KeyClause, (Head:-Body)),
               add_to_indexes(DB, PI, Head, KeyClause)
%              print((PI, KeyClause))
           )).

Terms do not have a namespace in a predicate-based module system. The meta_predicate/1 declarations in rocks_preds.pl “qualify” the term with the calling context (as in embedding the term into Module:PlainTerm).
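A small illustrative example of that mechanism (the declaration mirrors what rocks_preds.pl does for its clause arguments; the exact signatures there may differ):

% With a declaration such as this, the clause argument is qualified
% with the calling module before rdb_assertz/1 sees it:
:- meta_predicate rdb_assertz(:).

% So a call  rdb_assertz(application(a, b, c))  made from module user
% arrives as  rdb_assertz(user:application(a, b, c)),
% which is where the user: "namespace" comes from.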

It is not clear to me what you are trying to do. If you are aiming at a bulk insert, I guess we need either a version of rdb_assertz/2 that accepts a list of clauses, or something that accepts a goal and uses a bulk insert to add all of its solutions.
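A rough sketch of what a list-accepting variant could look like, reusing the internal helpers shown above (pred_clause_key/3, clause_head_body/3, add_to_indexes/4) and assuming rocks_batch/2 from the rocksdb pack with put(Key, Value) actions; this is only an illustration, not an existing rocks_preds.pl API:

% Hypothetical rdb_assertz_list/2: add a list of clauses of one
% predicate using a single RocksDB write batch.
rdb_assertz_list(Dir, Clauses) :-
    rdb_open(Dir, DB),
    Clauses = [First|_],
    clause_head_body(First, Head0, _),
    pi_head(PI, Head0),
    pred_property_key(PI, last_clause, KeyLC),
    (   rocks_get(DB, KeyLC, Id0)
    ->  true
    ;   register_predicate(DB, PI),
        Id0 = 0
    ),
    % Collect put/2 actions for all clauses, then write them at once.
    foldl(clause_action(DB, PI), Clauses, Id0-[], LastId-Actions),
    rocks_batch(DB, [put(KeyLC, LastId)|Actions]).

clause_action(DB, PI, Clause, Id0-Acts, Id-[put(Key, (Head:-Body))|Acts]) :-
    Id is Id0 + 1,
    pred_clause_key(PI, Id, Key),
    clause_head_body(Clause, Head, Body),
    add_to_indexes(DB, PI, Head, Key).   % index entries still go one by one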