Scaling to billions of facts?

#1

Are there any papers on scaling Prolog databases to billions of facts (more than would fit in the largest available memory)? A SQL or noSQL database could be used, but I suspect performance would be awful for queries that normal Prolog would do very fast (e.g., if the facts are triples and I want to do a “join” across multiple triples … and it’s not clear to me that noSQL or SQL (with appropriate indexes and even some denormalization) would be adequately fast).

#2

I don’t know about papers. There have been some Prolog systems in the past that could store predicates on a file. IBM-Prolog? If I had to implement something like this, I’d think about the following:

  • Have a key-value store that stores the clauses by (index-)number as key and the.QLF representation for the clause as value. This would need some optimization to the QLF code handling for small objects. QLF is now a bit heavy in setting up the input/output, dealing with a header, etc.
  • Have one or more additional key-value stores that map index keys to lists of clause numbers. That can use the same technology that now implements JIT indexing for clauses.
  • Now to run
    • Create a shadow database containing a cache of normal clauses.
    • Determine index to use from the instantiation pattern
    • Lookup the clauses in the appropriate index database
    • Execute these clauses, which means
      • Ensure the next clause is in the cache
      • Execute it.

Note that if your data is static you can probably do way better by designing a clever file format and mapping that into memory.

A sensible alternative is probably do distribute the clauses of the predicate over multiple Prolog instances and distribute the query. Depending on the problem that might be quite simple. You will typically loose ordering and committing (cuts).

#3

If I had billions of facts, I would change the structure. I would start with designing a database store that can efficiently store and access billions of facts (a graph database for example), and then I would use Prolog to put in queries and work out answers to more complex questions via that database. Prolog itself wouldn’t have facts except as temporary caches of the larger store.

#4

@abaljeu – are you suggesting that I interface to something like this (because my data are essentially a bunch of graphs):

For a key-value store, perhaps

It’ll take a bit of thinking to do sharding across multiple Prolog instances, especially as the data has a small number of very high fan-out nodes. There might be a way of imposing a hierarchy on the nodes that helps with this, and a quick way of putting many of the nodes into cliques.

1 Like
#5

I agree with Alan that if you don’t want fancy stuff such as arbitrary Prolog term and variables or you only want this in output arguments, using an external database directly to store and manage your data is most likely the best approach, possibly augmented with some cache (depending on the reasoning you need).

Note that there are connections to BerkeleyDB (bundled) and RocksDB (pack) that are fairly mature. There are many key-value stores, each with advantages and disadvantages.

— Jan

#6

@peter.ludemann Any updates on this concept. I too might have a need for it and have an idea or two (one is based on a NoSQL databse and the other is based on the file storage system under the NoSQL database), but haven’t given it a trial run yet for a query. The data would be stored purely as facts; a few examples here. I am estimating that it will be a few to several Terabytes of facts by the time I get done.

#7

Sorry for the late answer…

I think HDT should be evaluated as a viable technology right now.
It’s by default installed in SWISH (at least the docker version I used to test bousi_pack), and Wouter Beek et al. are implementing notable systems. See for instance LOD Laundromat. More documentation: this PDF is the first result from my silly googling.

#8

@EricGT My leading contender right now is Erlang’s mnesia, partly because it allows multiple indexes (NoSQL typically only allows single indexes and, with some contortions, multiple indexes). But it’s not clear to me how best to do transitive joins (e.g., the “ancestor” predicate), or whether that matters if the query node is colocated with the data node. (I wonder if there’s a Prolog-mnesia connector …)

Another possibility is to implement a distributed set of pengines or similar, using the design paradigm from Bigtable. (The standard paper on that is a bit opaque on the underlying concepts, but it’s fairly straightforward. The downside is that it’s not clear to me how to best implement efficiently inverse (transitive) relationships.)

#9

When I first played with NoSQL, specifically Neo4j I had to use the same restraint used to learn Prolog, meaning that I had to resist the urge to take what I knew about SQL databases and expect something similar with NoSQL. In learning Neo4j by reading a database with embedded Java instead of via the query language it was obvious that indexes were only really needed to search for the starting node, once that node was found all of the rest was done by following hard coded links (relationships). If you have an outline of the organization of the data as relationship, then you can also via a few relationship steps from one starting node get to most starting nodes in a fraction of a second. Walking a relationship is so much faster than doing a relational join.

While I was amazed at how fast that was and the amount of data that Neo4j could handle because it is all stored in files, the query language was not as powerful as Prolog. So as I noted one of my possible solutions is to reimplement the basics of Prolog for queries to work the Neo4j DB data through the Neo4j API. If you look at those example facts, they are all of the form functor(id,value). These become either properties of the node or a relation to another node which is a hard coded relationship. Even if I can’t get Prolog queries working right away with Neo4j API I can still access large amounts of data quickly using the Java API.

Here is an example of a real world web site with Neo4j database that has lots of data stored and is fast to navigate (Reactome)

I am not getting what you mean by that. If it means that you have two nodes and you can easily go from one to the other via relationship (father(A,B)), how to do you do the reverse, e.g. go from B to A. If that is the case, again with Neo4j as an example you have a node (B), and know a relationship (father), just use the relationship method on a node which returns a pair of nodes. Don’t think of the relationship like (father(A,B)), think of it like (A <-father-> B). Of the two nodes returned, the one that is not the same as the current node is the other node. With Neo4j the relationship has a property that indicates if it is directional, e.g. ->, <-, or <->. If so then a check of the direction can be used to see if it is the reverse. Even if it is -> or <- the relationship is still there for both nodes to retrieve via the relationship.

Normally in Neo4j most people access it via the query language which can also do the same, but doing it via code showed me the details that cleared out the confusion.