Scaling to billions of facts?

Are there any papers on scaling Prolog databases to billions of facts (more than would fit in the largest available memory)? A SQL or noSQL database could be used, but I suspect performance would be awful for queries that normal Prolog would do very fast (e.g., if the facts are triples and I want to do a “join” across multiple triples … and it’s not clear to me that noSQL or SQL (with appropriate indexes and even some denormalization) would be adequately fast).

I don’t know about papers. There have been some Prolog systems in the past that could store predicates on a file. IBM-Prolog? If I had to implement something like this, I’d think about the following:

  • Have a key-value store that stores the clauses by (index-)number as key and the.QLF representation for the clause as value. This would need some optimization to the QLF code handling for small objects. QLF is now a bit heavy in setting up the input/output, dealing with a header, etc.
  • Have one or more additional key-value stores that map index keys to lists of clause numbers. That can use the same technology that now implements JIT indexing for clauses.
  • Now to run
    • Create a shadow database containing a cache of normal clauses.
    • Determine index to use from the instantiation pattern
    • Lookup the clauses in the appropriate index database
    • Execute these clauses, which means
      • Ensure the next clause is in the cache
      • Execute it.

Note that if your data is static you can probably do way better by designing a clever file format and mapping that into memory.

A sensible alternative is probably do distribute the clauses of the predicate over multiple Prolog instances and distribute the query. Depending on the problem that might be quite simple. You will typically loose ordering and committing (cuts).

1 Like

If I had billions of facts, I would change the structure. I would start with designing a database store that can efficiently store and access billions of facts (a graph database for example), and then I would use Prolog to put in queries and work out answers to more complex questions via that database. Prolog itself wouldn’t have facts except as temporary caches of the larger store.

1 Like

@abaljeu – are you suggesting that I interface to something like this (because my data are essentially a bunch of graphs):

For a key-value store, perhaps

It’ll take a bit of thinking to do sharding across multiple Prolog instances, especially as the data has a small number of very high fan-out nodes. There might be a way of imposing a hierarchy on the nodes that helps with this, and a quick way of putting many of the nodes into cliques.

1 Like

I agree with Alan that if you don’t want fancy stuff such as arbitrary Prolog term and variables or you only want this in output arguments, using an external database directly to store and manage your data is most likely the best approach, possibly augmented with some cache (depending on the reasoning you need).

Note that there are connections to BerkeleyDB (bundled) and RocksDB (pack) that are fairly mature. There are many key-value stores, each with advantages and disadvantages.

— Jan

@peter.ludemann Any updates on this concept. I too might have a need for it and have an idea or two (one is based on a NoSQL databse and the other is based on the file storage system under the NoSQL database), but haven’t given it a trial run yet for a query. The data would be stored purely as facts. I am estimating that it will be a few to several Terabytes of facts by the time I get done.

Sorry for the late answer…

I think HDT should be evaluated as a viable technology right now.
It’s by default installed in SWISH (at least the docker version I used to test bousi_pack), and Wouter Beek et al. are implementing notable systems. See for instance LOD Laundromat. More documentation: this PDF is the first result from my silly googling.

@EricGT My leading contender right now is Erlang’s mnesia, partly because it allows multiple indexes (NoSQL typically only allows single indexes and, with some contortions, multiple indexes). But it’s not clear to me how best to do transitive joins (e.g., the “ancestor” predicate), or whether that matters if the query node is colocated with the data node. (I wonder if there’s a Prolog-mnesia connector …)

Another possibility is to implement a distributed set of pengines or similar, using the design paradigm from Bigtable. (The standard paper on that is a bit opaque on the underlying concepts, but it’s fairly straightforward. The downside is that it’s not clear to me how to best implement efficiently inverse (transitive) relationships.)

When I first played with NoSQL, specifically Neo4j I had to use the same restraint used to learn Prolog, meaning that I had to resist the urge to take what I knew about SQL databases and expect something similar with NoSQL. In learning Neo4j by reading a database with embedded Java instead of via the query language it was obvious that indexes were only really needed to search for the starting node, once that node was found all of the rest was done by following hard coded links (relationships). If you have an outline of the organization of the data as relationship, then you can also via a few relationship steps from one starting node get to most starting nodes in a fraction of a second. Walking a relationship is so much faster than doing a relational join.

While I was amazed at how fast that was and the amount of data that Neo4j could handle because it is all stored in files, the query language was not as powerful as Prolog. So as I noted one of my possible solutions is to reimplement the basics of Prolog for queries to work the Neo4j DB data through the Neo4j API. If you look at those example facts, they are all of the form functor(id,value). These become either properties of the node or a relation to another node which is a hard coded relationship. Even if I can’t get Prolog queries working right away with Neo4j API I can still access large amounts of data quickly using the Java API.

Here is an example of a real world web site with Neo4j database that has lots of data stored and is fast to navigate (Reactome)

I am not getting what you mean by that. If it means that you have two nodes and you can easily go from one to the other via relationship (father(A,B)), how to do you do the reverse, e.g. go from B to A. If that is the case, again with Neo4j as an example you have a node (B), and know a relationship (father), just use the relationship method on a node which returns a pair of nodes. Don’t think of the relationship like (father(A,B)), think of it like (A <-father-> B). Of the two nodes returned, the one that is not the same as the current node is the other node. With Neo4j the relationship has a property that indicates if it is directional, e.g. ->, <-, or <->. If so then a check of the direction can be used to see if it is the reverse. Even if it is -> or <- the relationship is still there for both nodes to retrieve via the relationship.

Normally in Neo4j most people access it via the query language which can also do the same, but doing it via code showed me the details that cleared out the confusion.

Neo4j looks interesting. So, I think “a solution exists” and I can postpone thinking about this for now. (I found one benchmark that seems to indicate that multi-hop queries could be a performance problem, so I’ll have to think about whether my specific case would encounter that problem.)

[It’s ironic that Codasyl-style databases are making a comeback, after having been thoroughly displaced by relational databases. What’s next - hierarchical databases? But the ugliness of SQL still seems to remain …]

I can read that a few ways.

  1. To go from one node to a node K hops away takes N seconds.
  2. To go from one node to all nodes K hops away takes N seconds.

Based on how fast the data reads were once the database was up when I was reading it with embedded Java on Neo4j it looks more like 2 than one. To bad the article didn’t give the query syntax the meta of the graph and the results as a list of values.

In deciding to use Neo4j I experimented with learning how to use it for a few months, using the Cypher query language, Cypher command line shell, using the Java drivers and then finally using the embedded Java. I used the GUI front end, the raw data dumps of a query, created many test cases (JUnit 5), walked every relationship and node of a Reactome DB starting from one node, picked an IDE (IntelliJ), picked a build tool (Gradle) used libraries (APOC), created user defined procedures to Cypher, looked at the means to write a Prolog driver using Bolt, used Docker images, and even test loaded those Prolog facts cleanly in to a Neo4j database, saw that it is open source on GitHub, is actively supported for free (paid if you also need) by the community and company, has free books, and after all of that I was more than satisfied that using Neo4j with a Prolog driver (needs to be written) should do what I need.

EDIT

I also regularly encountered the out-of-memory issue when using Neo4j at first. Moving away from Cypher to using embedded Java gave me much more freedom in how to tackle a problem and make use of resources. Here is an example of the out-of-memory problem I encountered and a work around/solution to the problem. While this may not work for every out-of-memory problem, it does show that these problems are not a dead end and can be resolved by the end user.