Scaling to billions of facts?


#1

Are there any papers on scaling Prolog databases to billions of facts (more than would fit in the largest available memory)? A SQL or NoSQL database could be used, but I suspect performance would be awful for queries that normal Prolog would do very fast, e.g., if the facts are triples and I want to do a “join” across multiple triples. It’s not clear to me that NoSQL or SQL, even with appropriate indexes and some denormalization, would be adequately fast.
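For example, with the triples as plain Prolog facts, the kind of “join” I mean is just a conjunction on a shared variable, resolved by clause indexing with no query planner (triple/3 and links_to are made-up names):

```
% Hypothetical triple store: triple(Subject, Predicate, Object).
triple(a, links_to, b).
triple(b, links_to, c).

% A two-hop "join" on the shared variable Y.
two_hop(X, Z) :-
    triple(X, links_to, Y),
    triple(Y, links_to, Z).
```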


#2

I don’t know about papers. There have been some Prolog systems in the past that could store predicates in a file; IBM-Prolog, perhaps? If I had to implement something like this, I’d think about the following:

  • Have a key-value store that stores the clauses by (index-)number as key and the .QLF representation of the clause as value. This would need some optimization of the QLF code handling for small objects; QLF is currently a bit heavy in setting up the input/output, dealing with a header, etc.
  • Have one or more additional key-value stores that map index keys to lists of clause numbers. That can use the same technology that now implements JIT indexing for clauses.
  • Now, to run a query (a sketch follows this list):
    • Create a shadow database containing a cache of normal clauses.
    • Determine the index to use from the instantiation pattern.
    • Look up the clauses in the appropriate index database.
    • Execute these clauses, which means:
      • Ensure the next clause is in the cache.
      • Execute it.
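A minimal sketch of that execution loop, assuming hypothetical index_lookup/2 and kv_get/3 predicates and clauses stored as (Head :- Body) terms (facts as (Head :- true)); a real version would decode QLF blobs instead:

```
:- dynamic clause_cache/2.          % clause_cache(ClauseNo, HeadBody)

% Run Goal against the external store.  index_lookup/2 and kv_get/3
% are hypothetical: the first maps Goal's instantiation pattern to a
% list of clause numbers, the second fetches a clause by number.
run_external(Goal) :-
    index_lookup(Goal, ClauseNos),
    member(No, ClauseNos),
    ensure_cached(No, (Head :- Body)),
    Goal = Head,                    % keep clauses whose head unifies
    call(Body).

% Fetch a clause into the cache on first use.
ensure_cached(No, Clause) :-
    (   clause_cache(No, Clause)
    ->  true
    ;   kv_get(clause_db, No, Clause),
        assertz(clause_cache(No, Clause))
    ).
```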

Note that if your data is static you can probably do way better by designing a clever file format and mapping that into memory.

A sensible alternative is probably to distribute the clauses of the predicate over multiple Prolog instances and distribute the query. Depending on the problem, that might be quite simple. You will typically lose clause ordering and the ability to commit (cut).
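A rough sketch of that using SWI-Prolog’s pengine_rpc/2 from library(pengines); shard_urls/1 is assumed configuration, one pengine server per shard:

```
:- use_module(library(pengines)).

% Assumed deployment: one pengine server per shard of the predicate.
shard_urls(['http://shard1:3030', 'http://shard2:3030']).

% Backtrack over shards; solutions come back in shard order, not in
% the original clause order, and a cut in Query commits per shard only.
distributed(Query) :-
    shard_urls(URLs),
    member(URL, URLs),
    pengine_rpc(URL, Query).
```

This visits the shards one by one on backtracking; a parallel fan-out would hide latency but scrambles ordering even further.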


#3

If I had billions of facts, I would change the structure. I would start by designing a database store that can efficiently store and access billions of facts (a graph database, for example), and then I would use Prolog to pose queries and work out answers to more complex questions via that database. Prolog itself wouldn’t hold facts except as temporary caches of the larger store.
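As an illustration (not a concrete design), the Prolog predicates become thin wrappers that memoise what the external store returns; graph_lookup/3 stands in for whatever graph-database API is used:

```
:- dynamic cached_node/1, edge_cache/2.

% edge/2 assumes X is bound: on first use it pulls all outgoing
% edges of X from the external store, then answers from the cache.
edge(X, Y) :-
    (   cached_node(X)
    ->  true
    ;   forall(graph_lookup(X, links_to, Y0),
               assertz(edge_cache(X, Y0))),
        assertz(cached_node(X))
    ),
    edge_cache(X, Y).
```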


#4

@abaljeu – are you suggesting that I interface to something like this (because my data are essentially a bunch of graphs):

For a key-value store, perhaps

It’ll take a bit of thinking to do sharding across multiple Prolog instances, especially as the data has a small number of very high fan-out nodes. There might be a way of imposing a hierarchy on the nodes that helps with this, and a quick way of putting many of the nodes into cliques.
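As a baseline before any hierarchy or clique detection, hash-based placement is trivial; a smarter scheme would replace this shard_of/3 (a made-up name):

```
% term_hash/2 is a SWI-Prolog built-in.  High fan-out nodes will
% still concentrate load on single shards; those would need to be
% split or replicated separately.
shard_of(Node, NShards, Shard) :-
    term_hash(Node, Hash),
    Shard is Hash mod NShards.
```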


#5

I agree with Alan that if you don’t need fancy stuff such as arbitrary Prolog terms and variables, or only need them in output arguments, using an external database directly to store and manage your data is most likely the best approach, possibly augmented with some cache (depending on the reasoning you need).

Note that there are connections to BerkeleyDB (bundled) and RocksDB (available as a pack) that are fairly mature. There are many key-value stores, each with advantages and disadvantages.
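For example, a few lines against the rocksdb pack, assuming its rocks_open/3, rocks_put/3, rocks_get/3 and rocks_close/1 interface with Prolog-term serialisation for keys and values:

```
:- use_module(library(rocksdb)).

% Store and retrieve a fact as a term key with a term value.
demo :-
    rocks_open('facts.db', DB, [key(term), value(term)]),
    rocks_put(DB, edge(a, b), true),
    (   rocks_get(DB, edge(a, b), V)
    ->  format("stored value: ~p~n", [V])
    ;   true
    ),
    rocks_close(DB).
```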

— Jan