Linking from clause store to terms on the stack

Currently, Prolog makes a clear distinction between facts stored in the fact-store, and terms created and referenced during goal processing in predicate variables that are stored on a stack space.

Facts in the fact-base are not backtrackable (unless one designs an explicit assert wrapper, that retracts a prior assert during backtracking) – hence an assert and retract is permanent, while terms referenced in variables during the processing of goals is backtrackable by default.

Could one evision a hybrid approach, where facts in the fact base have a pointer to “attributes” that live on the stack and are fully backtrackable.

The benefit of such a design could be the lookup time of such attributes – which can be reduced to a pointer based lookup, given a reference to a fact in the fact store.

Algorithms that temporarily associate meta data to facts for the purpose of processing, would experience significant lookup time savings – not unlike attributed variables – where attributes associated with variables during processing have fast lookup times (since attributes are associated with variables via pointers).

Some Prolog systems have reference data type, which can
point to an arbitary data structure. You find this reflected in JPL:

 +-- Term
 |    +-- JRef               <<<< here

https://jpl7.org/JavaApiOverview

You find papers about reference types in Prolog systems. It would be an implementation specific extension of an ISO core standard Prolog. So its not extremy standardized. Issues that need to be resolved are:

  • Equality comparison, i.e. (==)/2 and friends
  • Inequality comparison, i.e. (@<)/2 and friends.
  • Display during write, write_term/2 and friends.
  • Unification, (=)/2 and friends
  • Garbage collection, Prolog VM

I think in SWI-Prolog there is something like blobs, which can act as reference datatype. And you might have different types of blobs. In Jekejeke Prolog anything from the Java Object hierarchy can be a reference datatype, provide its not a Prolog number, atom, var or compound.

A reference type can also wrap a Prolog term. But this is extremly dangerous business, since not only the logical variables of a Prolog term are unbound, also the compounds of a Prolog term might cease to exist. But this might be nevertheless seen, for example I have a couple reference types in my

Prolog system to speed up constraint logic programming.

An example of reference datatype would be streams. This is part
of the ISO core standard. Another example of reference datatype, not
in the ISO core standard, are “record pointers”.

They were already found in early Prolog systems (DEC 10, C-Prolog)
and could match your requirement from here:

Its described here:

5.10. Internal Database
recorded(Key, Term, Ref)
recordz(Key, Term, Ref)
Etc…
https://www2.cs.duke.edu/csl/docs/cprolog.html

It even survived in SWI-Prolog:

The recorded database
https://www.swi-prolog.org/pldoc/man?section=recdb

I am not sure that a reference data type would suffice, to have this work.

You have to actually extend the fact store with another column as it were that is managed internally, since from the user point of view its always person(X) and not person(X, External_Ref_Blob).

Also, its a bit circular External_Ref_Blob would need to refer to the clause person(X, External_Ref_Blob) which can’t be done until its asserted. And, if it refers to person(X), then you add lookups, and loose the benefit of one pointer hop only.

It seems to me that what I have in mind requires extending core parts of prolog quite like attributed variables required extending core parts.

Dan

Thanks.

It looks like i wasn’t able to explain what i am looking for.

Let me make up an artificial example, just to illustrate.

Consider the following code. It essentially, it’s aim is to associate with each item(a), item(b) a calculated value. Due to backtracking, the calculated value for each item(a) and item(b) can change.

In the code below an assoc is used to map between item(a) and item(b) and respective calculated values, simulated here by the long_calculation/2 predicate. During backtracking, the assoc backtracks as well, and gets the next calculated values for each item(a) and item(b), respectively (one each, for each next backtrack request).

Now, the mapping is assoc based – so any time one wants to retrieve the current calculated value of an item, you have an assoc lookup – which can become expensive for a very large number of items – say, in the millions or tens of millions or more.

However, in C you wouldn’t use an assoc to map between an item and a calculated value – you would use a pointer (or simply another adjacent memory cell, in such a simple case of one value) to where item(a) is stored.

If you have a reference to a in the item table, then you can immediately retrieve the calculated value, say, via the pointer to its memory cell.

In Prolog, you have to take atom ‘a’ as key and look it up in the assoc to get the value. There is actually no other way to associate any item(a), item(b) to a backtrackable value, but use some kind of key-value data structure – and this is due to the dichotomy in Prolog between the fact store and the stack for terms bound to goal variables.

(one can implement it with backtrackable assert, but that is even more expensive, with all the indexing going on in the background, or one can read all items into a term on the stack, and use attributed variables (somehow) to associate via a pointer each item with its calculated value – which, however, replicates all items in stack memory – which is not feasible).

However, what if, in prolog along with item(a) that is stored in the fact base, one can store a pointer to an (backtrackable) attribute Value (or Term) as well, via a & operator, then a lookup item(a)&Attr would immediately bind the calculated value to Attr – and thereby save an assoc lookup, as in the code below.

For a very large list of items (and hence assoc) the saving can be substantial during lookup and during put – in particular when a put and get attribute is performed in a core area of the system that gets called a lot.

Dan

item(a).
item(b).


long_calculation(X, Count) :-
	item_result(X, Count).

item_result(a, 5).
item_result(a, 6).
item_result(b, 2).
item_result(b, 3).

found_items(Items, Assoc) :-
	empty_assoc(Assoc0),	
	aux_found_items(Items, Assoc0, Assoc).

aux_found_items([], Assoc, Assoc).

aux_found_items([First | Rest ], Assoc0, Assoc) :-
	item(First),
	long_calculation(First, N),	
	put_assoc(First, Assoc0, N, Assoc1),
	aux_found_items(Rest, Assoc1, Assoc).

top_level_process(Items, Result) :-
      found_items(Items, Assoc), 
      process_found(Assoc, Result).


?- top_level_process([a,b], Result).

An assoc or rb_tree lookup is O(log N), so the cost for a 100K is ~16; 1M is ~20; for 10M is 23. Are you sure that this will be the most expensive thing you do?

If you were to, for example, use a hash table in C, then the backtracking could also be moderately expensive (it’s pretty cheap for the assoc/rb_tree).

If I were you, I’d implement and measure rather than try to figure this out. I’d also create an abstraction layer that allow switching between assoc and rb_tree. I did that for a symbol table in pykythe/pykythe_symtab.pl at 9433a4516959f7ece1af7c8d081ce854e3cbbc9e · kamahen/pykythe · GitHub
My symtab grows to about 14K entries (log2: ~14) and it’s far from the most expensive thing in my code, according to profiling (and my code does a lot of both lookup and insertion). Your use case has a cost of roughly 2x mine, so I wonder if you’ll have a performance problem.

The trick is to use logical variables inside the assoc. Like this.
In the first pass create all logical variables. After the first pass all logical
variables are there and assoc is also already there:

:- use_module(library(assoc)).

make_vars(L, R, Assoc) :-
    empty_assoc(Assoc0),
    make_vars(L, R, Assoc0, Assoc).

make_vars([], [], Assoc, Assoc).
make_vars([X|L], [Y|R], Assoc, Assoc2) :-
      put_assoc(X, Assoc, v(Y), Assoc3),
      make_vars(L, R, Assoc3, Assoc2).

In the second pass, we can go trough backtracking, bind the logical variables to different values. Since the different values are not used in the keys of the assoc but only in the values, the existing variable bind side effect from the variables into the assoc is legit:

populate_vars([], []).
populate_vars([X|L], [Y|R]) :-
          item_result(X, Y),
          populate_vars(L, R).

item_result(a, 5).
item_result(a, 6).
item_result(b, 2).
item_result(b, 3).

found_items(L, Assoc) :-
    make_vars(L, R, Assoc),
    populate_vars(L, R).

Here is an example run:

?- found_items([a,b], A).
A = t(a, v(5), >, t, t(b, v(2), -, t, t)) ;
A = t(a, v(5), >, t, t(b, v(3), -, t, t)) ;
A = t(a, v(6), >, t, t(b, v(2), -, t, t)) ;
A = t(a, v(6), >, t, t(b, v(3), -, t, t)).

Credits: Using the compound v(_) is inspired by:
https://www.swi-prolog.org/pack/list?p=evil

Thank you for the observation.

Profiling leads me to see that a significant time is spent within the assoc put and get

The question is whether this can then be improved by other, more direct approaches, or whether i am looking down the wrong rabbit hole – and the focus needs to be indeed elsewhere.

Edit:

I have a few areas i am looking into – one of them is better handing of such a mapping – another is faster lookups of “adjacent” nodes – which are currently fact-based lookups.

Dan

For me, rb_tree gave a definite performance boost. (I don’t remember the details; but I think rb_tree were faster at inserts than assoc and similar for lookups – your use case might be different, of course).

thanks. I will need to study this approach to putting vars in the value.

i guess this also works when the initial list of items is unknown and itself part of the algorithm

Peter,

I am a bit confused — what is really the impact of a linear lookup at a cost of 16 vs. a pointer lookup in performance critical code.

I am for example told that in C++ for performance critical code (e.g. real time) I should avoid the use of virtual function calls – and stick to fixed addresses.

And here we are talking about (on average) 16 steps of misses lookups, before a value is obtained.

How does one square such two views ?

Dan

There are lots of options. You can also represent a graph “the C way”. Each node simply becomes a compound that has fields for whatever properties you want to attach to the node, including a list (or assoc) of outgoing edges (and possibly incoming edges). The elements in these edge lists can be the actual target nodes rather than some node identifier. The resulting structure is highly cyclic which means you need to be a bit careful traversing it to avoid running in circles. For the least I’d associate a unique id to each node. You can extend this data structure with an rbtree that maps unique node ids to the node compound. You can update the graph using setarg/3. Such as structure is a bit work to setup. I think the most useful property is that it is subject to backtracking. And yes, some operations are fast but many others are rather clumsy. It all depends on the critical actions you need perform.

Hi Jan,

Thank you.

And my apologies for laboring this same point again and again …

But, i guess, this all presupposes that the graph is represented as a Term linked to a Variable rather than in the fact base.

Do I understand this correctly?

I am looking for a performant way to keep the graph in the fact base, while still allowing very fast access to (retractable properties) and (static) adjacents …

Dan

Your code example, i.e. found_items/2, wasn’t like that. You kept an assoc as a perpetual argument. Can you make a code example? Do you want to replace put_assoc/4 by some sort of assert?

The classical SQL/relational database solution is to use:

  • primary keys play the role of C struct/array adresses
  • foreign keys play the role of pointers to C struct/array

@joeblog has recently made a tutorial on SQL for Prologers. Primary keys and foreign keys have a quite straight forward equivalent in Prolog. Namely primary keys:

/* SQL */
CREATE TABLE Persons (
    PRIMARY KEY (PersonID)

/* Prolog */
persons(123, ...)

Foreign keys:

/* SQL */
CREATE TABLE Orders (
    FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)

/* Prolog */
orders(...., 123, ...)

Sorry for the confusion …

I think there are two separate issues here, but connected.

  1. how to implement graph properties that are backtrackable in a performant way.
  2. how to implement a graph so that adjacent lookup is highly performant – although, the graph structure itself doesn’t need to be backtrackable – graph creation code can be placed within transactions.

In the design of a graph structure in c one would take care of both requirements in the same graph data structure.

Dan

If you use primary/foreign keys you have ultra fast lookup. Since the lookup is done via Prologs native clause indexing. Hash tables have time complexity O(1).

You cannot beat time complexity O(1). The only factor that is inbetween Prolog and lets say the C programming language, is that in Prolog O(1) is some constant c1 and in C programming language the O(1) is some constant c2.

Usually c2 < c1. But depending on what you do, this is possibly irrelevant. It could be that other factors of Prolog then outperform C programming language, like non-deterministic search Prologs intelligent backtracking or deeper recursion Prologs last call optimization.

Edit 28.04.2021:
Many Prolog systems offer hash tables as perpetual arguments as well. This is an alternative to library(assoc) which has insert something O(log n) and retrieval something O(log n). The hash table library would then offer insert something O(1) and retrieval something O(1).

But unfortunately this is not standardisized as well. Here is an example of such a library:

library(hash)
Hash table library
https://eclipseclp.org/doc/bips/lib/hash/index.html

You could also make a library with a reference datatype, that realizes an alternative to a Prolog compound v(_) and what Jan W. mentioned. i.e. setarg/3, which is a backtrackable updated. Because setarg/3 doesn’t work for SWI-Prolog compiled clauses compounds.

But you could make a matrix, like @dtonhofer did matrixes in the past. The matrix would show different data modelling options, different operations and their time complexity. The classical operations are insert and retrieval. Here is a take:

insert retrieval
assoc O(log n) O(log n)
hash O(1) O(1)

That hash has insert O(1) is a little strange. But if you are lucky the hash keys are well distributed, and expanding the hash table and rehash happens only every once log n times often.

For the backtrackable thingy, this is either mapped to insert/delete or something else. You could add the operation needed to make things backtrackable and then perform an evaluation, what would be the best option for your problem. Write a paper and send us all a copy.

Edit 28.04.2021:
Some Prolog systems balance between scan and hash for their clause indexing. Which on first sight looks crazy, since the matrix looks as follows:

insert retrieval
scan O(n) O(n)
hash O(1) O(1)

But for small n, scan can be faster. See also:

Prolog JIT choosing between if-then-else and hash table
https://gist.github.com/jburse/9fd22e8c3e8de6148fbd341817538ef6#gistcomment-3709689

This is indeed what i am wondering about, suppose there is an N that is very large, a cyclic – i.e. such the lookup happens a lot of times and and all the time during processing.

The although c2-c1 might be quite small N(c2-c1) adds up quickly – its what I lliked to call the “background” noise that creates a compute drag …

Just as an analogy think about a real time system that scans an environment every second or fraction of second – such a hard upper bound might play into this.

Dan

If you rely on primitive operation performance in your algorithms, you are anyway lost. Lets say you optimize and optimize primitve operations for an algorithm I of O(N’^3) complexity for a certain problem. You can waste your whole life in optimizing algorithm I.

But then somebody has a bright idea and makes an O(N^2) complexity algorithm II for the same problem. He can have the slowest of the slowest primitive operations, nevertheless algorithm II will beat algorithm I for larger N on the same problems.

This should be kept in mind for product development…
Same with your graphs? What algorithms will run on them?

Edit 28.07.2021:
I recently had such a case, a sparse matrice with many zero values. And I stupidly computed a*b1, a*b2, …, a*bn even when a==0. A simple if-then-else before the loop gave me ca. 10-100 times speed-up and my life is now bliss.

great point :slight_smile:

Its true – since my work is research oriented – i may learn, eventually, that I have a lower bound i can’t pass – unless i rethink the whole thing.

So, i will eventually need to test N*c1 – and to learn what c1 is, and to see if it is low enough in absolute terms.