Facts vs Local knowledge

I’m currently working on a Prolog project and have run into a conundrum. While storing knowledge as facts leads to great efficiency (thanks to indexing), I’m interested in incorporating ‘local knowledge bases’ stored in a list or ordset for the sake of organization. However, this seems to come at the expense of efficiency. Is there a way to achieve both efficiency and organization in my Prolog program?

Probably yes. There are other data structure types, e.g. rbtrees: SWI-Prolog -- Manual

Or assert/1

Or tabling: SWI-Prolog -- Manual

Optimization/efficiency is highly problem-specific - so, what are the details of the problem?

I don’t know what you mean exactly, but relational databases are an interesting and efficient way to store and retrieve information, and Prolog is particularly suited for this kind of job. More recently, new data structures such as records and dicts have been introduced, but I don’t know them very well. It depends on what you want to do, I think.

Thanks, let me phrase my problem better:

I’ve been working on a program that deals with a list of changing facts. Each fact contains a UUID, some additional metadata, and a compound term whose functor can have different numbers of arguments. Here’s an example:

fact(UUID_1, OtherStuff_1, variant1(Thing1)).
fact(UUID_2, OtherStuff_2, variant2(Thing1, Thing2)).
fact(UUID_3, OtherStuff_3, variant3(Thing1, Thing2, Thing3)).

and my queries are things like:

fact(_, _, variant2(A, B)).

I have many collections of these facts, and I would prefer to encapsulate them
as ordsets. However, querying these ordsets using member/2 becomes
inefficient, since the first argument is a UUID.

[fact(UUID_1, OtherStuff_1, variant1(Thing1)),
 fact(UUID_2, OtherStuff_2, variant2(Thing1, Thing2)),
 fact(UUID_3, OtherStuff_3, variant3(Thing1, Thing2, Thing3))]

I am looking for a way to get the encapsulation of the second example with the efficiency of the first, without manually writing my own indexes or relying on a particular argument order. Additionally, I am not interested in using a relational DB, since my program has a lot of compound terms and I hope to achieve deep indexing (terminology I took from Argument Indexing in Prolog - YouTube).

I can however phrase the problem as “monotonic” in the spirit of event sourcing (if a fact is no longer true, I can just invalidate it explicitly). Also, as I do a lot of repeated queries, I think some variant of tabling would be useful. I have discovered that I can table dynamic monotonic predicates, and I am considering this approach.
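For reference, monotonic tabling of a dynamic predicate might look like this sketch in SWI-Prolog (variant2_pairs/2 is a hypothetical derived view, not something from this thread):

```prolog
:- dynamic fact/3 as monotonic.
:- table variant2_pairs/2 as monotonic.

% A derived view over the facts; answers are maintained
% incrementally as new monotonic facts are asserted.
variant2_pairs(A, B) :-
    fact(_UUID, _Meta, variant2(A, B)).
```

New facts added with assertz/1 then extend the table monotonically; it is retraction that requires the explicit-invalidation trick, e.g. recording an invalidating fact rather than retracting.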

I would greatly appreciate any insight or recommendations!

Ordsets require O(n) for lookup. Rbtrees require O(log n). If you don’t like how rbtrees display, it’s easy to fix that using portray/1.
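As a minimal sketch of the difference (keys and values here are invented):

```prolog
:- use_module(library(rbtrees)).

% Looking up a key in an rbtree is O(log n);
% member/2 on an ordset of the same size is an O(n) scan.
demo(Value) :-
    list_to_rbtree([uuid1-v(a), uuid2-v(a, b), uuid3-v(a, b, c)], Tree),
    rb_lookup(uuid2, Value, Tree).
```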

You didn’t say how many items you’re dealing with. I have some code that has over 50,000 items and lookup is not the most expensive thing (confirmed using profile/2).

Also, you didn’t say what operations you’re doing. Do you need set operations (intersection, etc.) or just lookup?

Here’s what I don’t understand of the rbtree argument here. I hope you can clear my misconception.

rbtrees are O(log n) if I’m using the right key to walk the rbtree. If I have a term like:

fact(UUID_1, OtherStuff_1, variant1(Thing1)).
fact(UUID_2, OtherStuff_2, variant2(Thing1, Thing2)).
fact(UUID_3, OtherStuff_3, variant3(Thing1, Thing2, Thing3)).

and I want:

fact(_, _, variant2(ParticularThing, X)).

then, as the functor is the same for all the terms, the rbtree is effectively indexed on UUIDs, since they are the first argument, no? If that’s the case, I won’t reach the response I want in O(log n). Am I missing something?

As for the other questions, I don’t have specific numbers, but I hope I could get to 100000 items, and it’s just lookup, no intersection or union.

If you have multiple keys, then you’ll need multiple rbtrees (they can all have the same values, just accessed by different keys). That’s no different from ordsets; it’s just that you can use member/2 to look up in an ordset by a non-key, whereas rbtrees would require rb_in/3, rb_visit/2, or rb_map/2.
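A sketch of keeping two rbtrees over the same fact terms (build_indexes/3 is a hypothetical helper; it assumes the secondary key — the first argument of the inner term — is unique, otherwise the values would need to be lists):

```prolog
:- use_module(library(rbtrees)).

% Two indexes over the same terms: one keyed on UUID, one on the
% first argument of the inner term. The fact terms themselves are
% shared between the trees; only the index structure is duplicated.
build_indexes(Facts, ByUuid, ByInner) :-
    findall(U-F, (member(F, Facts), arg(1, F, U)), UuidPairs),
    findall(T-F, (member(F, Facts), arg(3, F, V), arg(1, V, T)), InnerPairs),
    list_to_rbtree(UuidPairs, ByUuid),
    list_to_rbtree(InnerPairs, ByInner).
```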

It’s also annoying to pass around a pair of rbtrees (or multiple pairs). This can be avoided using EDCGs.

For an example of using rbtrees: pykythe/pykythe/pykythe_symtab.pl at 8e9aa1f938a129c1a61c8c1da6b0ed12d4d7b648 · kamahen/pykythe · GitHub … I use EDCGs to pass the symbol table around in pykythe.pl, but it’s moderately complex code, so probably not the best example. (I wrote pykythe_symtab.pl when I wasn’t sure which storage technique I would use; benchmarking showed that for my use case, rbtrees were best, but library(assoc) or even the builtin “dicts” might be better for your use case.)

You should maybe normalize your data first. At that point you might get the JIT indexing to work correctly. The other option is to make your own index, but I would first try the obvious solution.
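For instance, one way such a normalization could look (the predicate names are invented for illustration):

```prolog
% One predicate per variant, with the variant's arguments first,
% so first-argument and JIT multi-argument indexing apply directly.
:- dynamic variant1_fact/2.   % variant1_fact(Thing1, UUID)
:- dynamic variant2_fact/3.   % variant2_fact(Thing1, Thing2, UUID)
:- dynamic variant3_fact/4.   % variant3_fact(Thing1, Thing2, Thing3, UUID)

% The query fact(_, _, variant2(A, B)) then becomes:
%   ?- variant2_fact(A, B, _UUID).
```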

Thinking about this a bit more …

The persistent predicates project currently allows only a single key, but it shouldn’t be too difficult to modify it to allow multiple keys (similar to just-in-time indexing of clauses).

And a similar approach should work for rbtrees. I’ll try implementing it - and after doing this, I’d do something similar for persistent predicates.

Let’s call the new data structure “mkrbtree” (for “multi-key rbtree”). With each mkrbtree, there would be a list of keys and an associated predicate that defines which key to use. For example, if we’re storing triples node(From,Edge,To), then we might define

data_index(node(From,_,_),  Index, Key), ground(From)    => Index = 1, Key=From.
data_index(node(_,Edge,To), Index, Key), ground(Edge-To) => Index = 2, Key=Edge-To.
data_index(node(_,_,To),    Index, Key), ground(To)      => Index = 3, Key=To.
data_index(node(_,Edge,To), Index, Key), ground(Edge)    => Index = 2, Key=Edge-To.
data_index(node(_,_,_),     Index, _)                    => Index = 0. 

which would define 3 indexes: on From; on a combination of Edge and To; and on To.

So, with this data:

[node(n1, road, n2),
 node(n1, train, n3),
 node(n2, road, n3),
 node(n3, road, n4)]  

there would be three keyed lists (stored as rbtrees) plus one unkeyed list:

1-[n1-[node(n1,road,n2), node(n1,train,n3)],
   n2-[node(n2,road,n3)],
   n3-[node(n3,road,n4)]],
2-[(road-n2)-[node(n1,road,n2)],
   (train-n3)-[node(n1,train,n3)],
   (road-n3)-[node(n2,road,n3)],
   (road-n4)-[node(n3,road,n4)]],
3-[n2-[node(n1,road,n2)],
   n3-[node(n1,train,n3), node(n2,road,n3)],
   n4-[node(n3,road,n4)]],
0-[node(n1,road,n2), node(n1,train,n3), node(n2,road,n3), node(n3,road,n4)]

mkrb_lookup/3 would call data_index/3 to determine which index to use, then call rb_lookup/3 on the appropriate keyed list; if no key matches, it would fall back to member/2 on the unkeyed list. Similarly, mkrb_insert/4 would add to all the keyed lists (using rb_insert/4) plus to the unkeyed list.

This looks like a lot of data duplication, but the nodes would be shared amongst all the lists, so the only extra space would be for the indexes.

Anyway, this is a rough idea of how multiple indexes could be used with key-value data to do lookup as efficiently as clause lookup with indexing. Does this seem reasonable? (Some small details would change, of course; in particular, data_index/3 would be a bit different, to allow using both for lookup and insert, probably using once/1 instead of SSU.)
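A rough sketch of what such a lookup could look like under those assumptions (the mkrb/2 wrapper and the bucket layout are hypothetical):

```prolog
:- use_module(library(rbtrees)).

% mkrb(Indexes, Unkeyed): Indexes maps an index number to an rbtree
% whose values are lists ("buckets") of stored terms; Unkeyed is a
% plain list of all terms. data_index/3 is as defined earlier.
mkrb_lookup(Pattern, mkrb(Indexes, Unkeyed)) :-
    (   data_index(Pattern, Index, Key),
        Index > 0
    ->  rb_lookup(Index, Tree, Indexes),   % pick the keyed list
        rb_lookup(Key, Bucket, Tree),      % O(log n) to the bucket
        member(Pattern, Bucket)
    ;   member(Pattern, Unkeyed)           % no usable key: linear scan
    ).
```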

Great suggestions, thank you! May I ask a follow-up question? Is the argument indexing computational behaviour the same for:

  1. Facts
  2. Dynamic facts
  3. Facts issued with library(persistency)?

If your goal is ‘organization’, you could consider organizing facts using modules, e.g. one module for each ‘local knowledge base’ as you called it. If your local knowledge bases are a pre-defined set then you can assign one module to each; but you can also create modules dynamically using assertz/1.
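A minimal sketch of this idea (the module names kb_one and kb_two are invented):

```prolog
% Each 'local knowledge base' is a module; within each module the
% facts get the usual clause indexing. Modules spring into existence
% on first assert, so they can be created dynamically.
load_worlds :-
    assertz(kb_one:fact(uuid1, meta1, variant1(t1))),
    assertz(kb_two:fact(uuid2, meta2, variant2(t1, t2))).

% Querying one particular world:
%   ?- kb_one:fact(_, _, variant1(X)).
```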

This use of modules is fairly common; the module represents a “world” of knowledge.