Memory use for database

I’m using: SWI-Prolog version 8.2.4 under Fedora 34

My code looks like this:
open a file for reading, call read(Term) followed by assertz(predicate(Term)), and repeat until end_of_file is reached.

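In outline (a minimal sketch, not my exact code; the predicate and file handling are simplified):

% Read every term from File and assert it as a fact.
load_facts(File) :-
    setup_call_cleanup(
        open(File, read, In),
        read_terms(In),
        close(In)).

read_terms(In) :-
    read(In, Term),
    (   Term == end_of_file
    ->  true
    ;   assertz(predicate(Term)),
        read_terms(In)
    ).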

I have a large file with terms, one per line, which I read and load into memory as facts using assertz(predicate(Term)). The ratio of memory use by swipl to file size is about 7 to 1, certainly more than I expected. Is there anything I can do to improve this ratio? All terms have the same format: [[A,B,C,D],E], where each entry is an integer.

This should reduce the memory usage quite a bit:

read([[A,B,C,D],E]), assertz(predicate_(A,B,C,D,E)).

and you can define

predicate([[A,B,C,D],E]) :- predicate_(A,B,C,D,E).

Is there some reason why you use the form [[A,B,C,D],E]? The general rule is: if a list has a known, fixed length, use a term instead.

2 Likes

Have you seen library(persistency) yet?

library(persistency) will also create a file of all the asserts, which I call a journal in another post here.

If you need to load the facts again, then using the journal file you need only call db_attach/2.
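Roughly, usage looks like this (just a sketch; the predicate name fact_/5 and the integer types are my guesses based on your [[A,B,C,D],E] terms):

:- use_module(library(persistency)).

% Declares fact_/5 and generates assert_fact_/5, retract_fact_/5, retractall_fact_/5.
:- persistent
       fact_(a:integer, b:integer, c:integer, d:integer, e:integer).

attach(File) :-
    db_attach(File, []).          % on a later run this replays the journal

add(A, B, C, D, E) :-
    assert_fact_(A, B, C, D, E).  % asserted in memory and appended to the journal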

Thanks Peter. I will try that.

Thanks Eric. I will look into it, although persistence is not the issue. The problem I have is that the database uses much more memory than I expected.

Hi Peter,
I tried what you suggested and it does help: it reduces the ratio from close to 7:1 down to just under 6:1. That helps, but it is not quite enough for what I need.

I’m going to guess that you have a lot of different atoms. (You are using atoms in the terms and not strings?)

Perhaps if you give a few concrete examples of your terms, people can suggest a way of reducing the memory footprint.

1 Like

Clauses are, in the end, quite large. On a 64-bit platform an empty clause is 64 bytes. It is linked to the predicate using a structure that takes 24 bytes, so you use 88 bytes plus the virtual machine code. Additional clause indexes use another 24 bytes per clause per index. See also clause_property/2 to find the size of a clause.
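For example, a quick check on the flattened representation suggested above (a sketch; predicate_/5 as in Peter’s reply, and the reported size excludes the index tables):

?- clause(predicate_(A, B, C, D, E), true, Ref),
   clause_property(Ref, size(Bytes)).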

YAP has an option to define a static predicate from a uniform table AFAIK. That can reduce memory a lot. We could do something similar if someone is willing to implement this or pay for the implementation. I have discussed this option before and have some ideas on how to do that :slight_smile:

Note that this means no rules and all clauses have simple atomic data where each argument has the same type for all clauses.
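For example, a fact table of the shape discussed in this thread would qualify (values purely illustrative):

% Every clause is a fact and every argument is an integer.
predicate_(1, 2, 3, 4, 10).
predicate_(1, 2, 4, 7, 42).
predicate_(3, 5, 1, 1,  7).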

Jan,

Could such a static predicate be generated via assert, i.e. at runtime, instead of being predefined at compile time, while keeping its simpler structure?

Or is that an oxymoron: a static predicate, dynamically generated? :slight_smile:

Could there also be a performance boost in accessing such static predicates? Somewhere I read that performance is inherently connected to the size of the data structure, so a smaller size may lead to better access performance as well.

I do have some static tables in my code, but they aren’t massive, so the benefit would probably be limited. However, I do generate a graph dynamically which, once generated, doesn’t change.

Perhaps, in the future, large portions of a dynamically generated graph could be provided as pre-existing static table structures, along with extensions that are dynamically added.

thanks,

Dan

The overall idea I once discussed is to take the ideas behind database column stores, as implemented by MonetDB, as a starting point. A column is basically a simple native array of objects of the same type (with some more complexity for string columns). That is also what e.g. R uses for representing vectors. The Apache Arrow project is also related.

The nice thing is that you can keep such a column in memory, or store it in a file and use memory mapping. The next step is to make this available as a predicate, combining one or more such columns (of equal length) and adding indexes to them.

There are lots of decisions to take: which of the above projects to reuse? Can we reuse indexing techniques? How about modifying the table? If we do, can we have Prolog’s logical update view? How can we use this to efficiently share data with R, MonetDB, etc.?

This is a major project that is waiting for a good use case and involvement to get it realized (either as paid development or contributed code). A first version can simply be implemented as a normal C/C++ foreign plugin. Later versions could move part to the VM to avoid the foreign language call overhead.

1 Like

Thanks to all who responded to my question. With the different suggestions, I’ve been able to cut the memory footprint by some 40%, which helps a lot. Part of what I’ve done is replace
predicate([[A,B,C,D],E]) with predicate(A,B,List)
where List is a list of pred(C,D,E) terms collecting all the C,D,E that go with each A,B.
This, however, has a run-time cost, since my queries, instead of
predicate(A,B,C,D,E)
now have to be
predicate(A,B,List), member(pred(C,D,E),List)
and each of these lists has thousands of members.
All of A,B,C,D are ground, both in the database and at query time; in fact they are just integers. E is the value that gets looked up.
Now I am wondering whether searching for pred(C,D,E) in the list can be sped up by keeping the list sorted and/or using memberchk/2. For each value of A,B there is only one list, and for each value of C,D there is one entry in the corresponding list.
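The lookup I have in mind is essentially this (a sketch; memberchk/2 commits to the first match, which is safe here because each C,D pair occurs only once per list):

lookup(A, B, C, D, E) :-
    predicate(A, B, List),          % A, B, C, D are ground integers
    memberchk(pred(C, D, E), List). % first (and only) match, no choice point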

You’re hoping that compression on columns will yield benefits? IIRC, that works best if the data are sorted, so it’d be a bit of a break with classical Prolog semantics that use row order for retrieval (SQL, of course, doesn’t specify any retrieval order, which leads to some classic beginners’ mistakes).

Is there anything written up on this? Or are you thinking of a foreign-function interface to Apache Arrow?

See my reply above. I looked at Arrow quite a while ago and do not recall all the details. It surely is an interesting library. I have no good overview of what we gain by using it, though, notably which extra connections would result from using it, i.e., to which data formats or systems would we get additional or better interfaces? Does Arrow provide indexes, so we can select rows based on values for one or more columns?

The alternative I have been looking at is MonetDB, as mentioned above. Embedded MonetDB would give compact storage, persistency, indexes and the possibility to get a column (in that scenario, the Nth argument of a MonetDB-backed predicate) as a C array, e.g., an array of integers, floats and maybe even strings (not sure how they represent these). This array can be either a pointer to an in-memory array or a memory-mapped file.

2 Likes

I think there are three scenarios here (maybe more):

  1. storing triples for ontologies, graphs, etc. (e.g., RDF, OWL, …)
  2. storing denormalized data, ideally where the columns compress nicely and the data are typically processed in (primary) key order
  3. conventional row-oriented data (especially if normalized or almost normalized).

#2 is the one where column-oriented databases work best, although they can be used for all situations (e.g., Google’s Bigtable is column-oriented). This works with either (primary) key-value (NoSQL) access or multiple indexes; the latter are essentially tables that map values to the (primary) keys.

The original rationale for column databases was slow seek times on disk and relatively fast scanning through the data once read into memory. So, taking Bigtable as an example, the data are segmented into “shards”, and each shard is a range of keys stored in an sstable as a set of key-value pairs. The data can be compressed, so accessing the data for a specific key may require a scan of the data (decompressing it), but the amount of scanning is limited to the size of a single sstable or a chunk within the sstable (e.g., 64K).

[A detail: sstables are write-once, so there’s actually a “stack” of sstables for each shard of the data; each “stack” of sstables is periodically reprocessed into a single sstable. This is somewhat reminiscent of PostgreSQL’s way of storing data that minimizes write locks.]

Each column in the database is indexed by the primary key. The indexes at the Bigtable level point to shards (sstables); the assumption is that the in-memory processing of an sstable (including decompressing the data) is much faster than the IO. For a purely in-memory database, it is less clear that compressing the columns is worthwhile. If the data are typically processed in primary-key order, it’s probably worth compressing the column data.

For case #1, it’s not clear that a column database has any advantage, either in performance or space, especially if the data are atoms, whose representation is already a kind of compression (atoms are represented as integers, not strings); I suspect that a conventional row-oriented data structure would actually work better here. Similarly for case #3.

So, if we want a more compact data structure for large tables containing mostly atoms, something similar to a conventional row-oriented database is probably best. If specific needs arise, a column-oriented data format might be added later.

If we’re talking about data that’s too big to fit in RAM, we should probably also consider a way of distributing the data over multiple nodes. For that, something like MonetDB might make more sense.

For more on the advantages/disadvantages of column data, here’s a reasonable overview: Column-oriented DBMS - Wikipedia

2 Likes

Thanks for sharing. Google Bigtable seems to be a commercial-only solution that is only available in the cloud. It might be nice to have an interface to it, but it cannot be part of the SWI-Prolog ecosystem under these conditions.

There are probably various sweet spots in this space.

  • As we have seen, clauses are fairly big. That is because there is no typing on the predicate arguments, there is an executable clause body attached, the logical update view requires generation numbers, and we want quite a bit of meta-data that allows for reflection on the code. We also want concurrent updates and lock-free read access. Notably, data (Prolog facts) often does not need all this. It is often fine if arguments are typed (arg 1 is an integer, arg 2 is an atom) and there are no logical variables and no executable rule body. With such restrictions we can store the data much more compactly in memory while preserving all the other Prolog goodies.
  • Use an (embedded) database. That provides (sometimes optionally) disk-based storage and sometimes lots of other goodies (transactions, replication, access from multiple processes, network-based access, etc.). It comes at a price, though. I have been playing with BerkeleyDB and RocksDB more extensively, and some years ago we played a bit with LevelDB and a few more, ending up with RocksDB as the fastest and most scalable (of course the tables may have turned; anyone?).
    Asserting 1,000,000 p/2 clauses takes 0.3 sec in Prolog; 1M queries on this data also run in 0.3 sec (a rough sketch of the pure-Prolog timing follows this list). Doing the same in RocksDB requires 3.6 seconds to add them and 1.9 seconds for the 1M queries. Bridging from the RocksDB key-value model, where both keys and values are just byte blobs, to Prolog with multiple indexes requires more RocksDB queries and some data conversion, so we are now at about 10 seconds of lookup time for 1M queries to achieve something that approaches Prolog. Most likely that can be improved a bit; doubling the performance might be possible, but that is about the limit, I fear.
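The pure-Prolog side of such a timing can be reproduced with something along these lines (a sketch only; exact numbers depend on the machine):

:- dynamic p/2.

bench :-
    time(forall(between(1, 1000000, I), assertz(p(I, I)))),  % load 1M facts
    time(forall(between(1, 1000000, I), p(I, _))).           % 1M lookups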

What is the best way of measuring memory use? I’m currently using the ps command and looking at the change in VSZ or RSS.
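From inside the process, something like the following can at least track atom, clause and VM-code counts before and after loading (a sketch; I’m assuming these statistics/2 keys, which may differ between versions):

mem_counters :-
    statistics(atoms, Atoms),      % number of atoms
    statistics(clauses, Clauses),  % number of clauses
    statistics(codes, Codes),      % size of the virtual machine code
    format('atoms=~D clauses=~D vm-codes=~D~n', [Atoms, Clauses, Codes]).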

I’m thinking of converting my code to use the new RocksDB interface, to see whether it gives good enough performance (and to allow my database of facts to be much bigger). Presumably, I’ll need to tune the RocksDB cache, which would require some modifications to rocks4pl.cpp.

When I load my data, I get an expansion of roughly 3x from the .pl form (1.5x from the .qlf form). I presume that most of this is in the clauses. There are 7 indexes with a total of 1 million buckets, according to jiti_list/0; how big is a bucket?

Details:
1 million facts: 151M; in QLF form, it’s 77M and it compresses to 8.5M (using bzip2).
When I load it, I get a change in VSZ of 550M (RSS: 450M). # atoms change: 311,0000; VM-codes change: 21.1 million.
When I load the file the first time, it takes ~40 seconds. The equivalent QLF file takes ~1 second to load. Building the 7 indexes in parallel takes about 4 seconds (sometimes less than a second, which is … strange)

The underlying sstable is available as open source: LevelDB and RocksDB. Bigtable uses these for its columns. So, the simplest Bigtable would have a single column and a single set of sstables (LevelDB / RocksDB).
The main thing that Bigtable provides is a “sharding service”, so that the data are sharded across multiple “tablet servers”, each with its own sstable(s). When a query is made, Bigtable tells the client which tablet server handles the key, and after that the client communicates directly with the tablet server.

There are some non-commercial databases listed here: RocksDB - Wikipedia
I don’t know how their capabilities compare with Bigtable.

1 Like