Does SWI-Prolog have N+K-trees?

Does SWI-Prolog have N+K trees?

Found

library(nb_set) (GitHub)

and

logk_ith/4 in library(dialect/xsb/basics)

and many more data structures, so before testing them out figured it couldn’t hurt to also just ask.


tl;dr

tl;dr

In writing an emulator for PostScript there is a need to emulate an array that can dynamically grow in size. In searching I first found arg/3 which requires a fixed length when initializing the data structure, e.g.

?- functor(Data,data,10),arg(5,Data,"five"),arg(5,Data,Value_5),arg(9,Data,"nine"),arg(9,Data,Value_9).
Data = data(_7070, _7072, _7074, _7076, "five", _7080, _7082, _7084, "nine", _7088),
Value_5 = "five",
Value_9 = "nine".

Does not work if you try to access an item beyond the initial size.

?- functor(Array,array,10),arg(5,Array,"five"),arg(5,Array,Value_5),arg(20,Array,"twenty"),arg(20,Array,Value_20).
false.

Not satisfied with having to resize the array searched for a data structure that used arg/3 and did the resizing. That lead me to N+K tree.

Note: I can’t even find a link for N+K tree at Wikipedia.

From “The Craft of Prolog” by Richard A. O`Keefe (WorldCat)

Section: 4.6 Efficient Data Structures

Suppose you want to represent a collection of several thousand items, but you will need fast access to to any one of them. A list is a good way of representing a sequence; it is not a good way of getting at a particular item. Provided you don’t want to replace elements, it is sufficient to use a tree of terms, with the terms being as wide as your system will permit. A general hack is the N + K-tree, where each node of the tree contains slots for N elements and K subtrees.

But from the little test code for arg/3 above realized that it might be simple to modify the N + K-tree to also allow updates.

So I then thought it might already exist in SWI-Prolog and so searched GitHub. Not only did I find more data structures than I knew about, some of them appear to not even be in the documentation.

Must say I learning a lot today about Prolog data structures.

There is no N+K tree in the library. Note that O’Keefe’s text was written with Quintus in mind whose compound terms were limited in arguments (I think 256). SWI-Prolog doesn’t have that limitation. N+K trees also do not avoid that you need to define an upper bound to the size. There is still some value when
changing elements. Given SWI-Prolog’s destructive assignment, that too isn’t worth too much. In other words, N+K trees are rarely needed in SWI-Prolog (one can still imagine cases where they are useful though).

You can use another trick that is used inside SWI-Prolog’s C code to realise dynamically resizing arrays without re-allocating them (such that we do not need locks for accessing their elements). The trick is to have an array of log2(upperbound). The first element contains an array of size 1, the second of size 2, the next of size 4, then 8, etc. Now you use the msb function to compute the most significant bit, use that to find the proper array in the logN one and than do a little offset computation to find the real location. That works a bit different in C where we allocate an array, but return a pointer to a location before the array such that v[n] is still the right location. So, the C access is a[msb(i)][i]. The Prolog version won’t be as fast and elegant, but still requires two pretty simple steps to get at your data.

2 Likes

Now, that’s quite a nice trick!

Just trying to make sure I understand it.

Details

It can be thought of as a user array that is indexed by using N and hidden arrays that implement the user array.

There is an initial size limit to the user array: upperbound.

To access any item in the hidden arrays requires two indexes, index1 and index2

Given N the indexes into the arrays are:

index1 = log2 msb(N)

index2 = N - msb(N)

Helpful tables to see the details

Table 1

n log2 2log2 n
1 0 1
2 1 2
4 2 4
8 3 8
16 4 16
32 5 32
64 6 64
128 7 128

 

Table 2

Array index1
1-based 0-based Array Size
1 0 1
2 1 2
3 2 4
4 3 8

 

Table 3

0-based array
N Bits msb log2 msb N - msb index1 index2
8421
1 0001 1 0 1-1=0 0 0
2 0010 2 1 2-2=0 1 0
3 0011 2 1 3-2=1 1 1
4 0100 4 2 4-4=0 2 0
5 0101 4 2 5-4=1 2 1
6 0110 4 2 6-4=2 2 2
7 0111 4 2 7-4=3 2 3
8 1000 8 3 8-8=0 3 0

 

Table 4

Same as table 3 without all of the calculations.

N index1 index2
1 0 0
2 1 0
3 1 1
4 2 0
5 2 1
6 2 2
7 2 3
8 3 0

So the data is held in the the hidden arrays indexed by index2 and if the user array needs to grow, a copy is made of the hidden array indexed by index1 and then more items are added. Since the hidden arrays with the data and indexed by index2 are not copied, this is much faster to extend the size of the array.

It is always a bit of work to get the math correct. After a little playing, I get at:

:- debug(dyn).

dyn_new(Max, Dyn) :-
    Subs is msb(Max)+1,
    functor(Dyn, dyn, Subs).

dyn_access(I, Dyn, Value) :-
    must_be(positive_integer, I),
    I1 is msb(I),
    I11 is I1+1,
    I2 is I-(2**I1-1),
    debug(dyn, 'Located at index ~D in ~D', [I2, I11]),
    arg(I11, Dyn, A2),
    (   var(A2)
    ->  Len is 2**I1,
        debug(dyn, 'Created sub array of length ~D for ~D', [Len, I11]),
        functor(A2, s, Len)
    ;   true
    ),
    arg(I2, A2, Value).

Then we can run some tests and (by induction) conclude we are fine :slight_smile:

% Located at index 1 in 1
% Created sub array of length 1 for 1
X = dyn(s(1), _10312, _10314, _10316, _10318, _10320).

116 ?- dyn_new(32, X), dyn_access(2, X, 1).
% Located at index 1 in 2
% Created sub array of length 2 for 2
X = dyn(_10296, s(1, _10444), _10300, _10302, _10304, _10306).

117 ?- dyn_new(32, X), dyn_access(3, X, 1).
% Located at index 2 in 2
% Created sub array of length 2 for 2
X = dyn(_10296, s(_10442, 1), _10300, _10302, _10304, _10306).

118 ?- dyn_new(32, X), dyn_access(4, X, 1).
% Located at index 1 in 3
% Created sub array of length 4 for 3
X = dyn(_10300, _10302, s(1, _10448, _10450, _10452), _10306, _10308, _10310).
4 Likes
Personal notes (click triangle to expand) ---

dyn_new/2 - creates a new dynamic array, Dyn, of Max size. Size must be a positive integer. Dyn is a composite structure of two levels. The first level is a term of arity msb(Max)+1, the second levels are a term of arity 2**N, e.g.

?- dyn_new(7,Dyn).
Dyn = dyn(_5538, _5540, _5542).

This only creates the first level, the second level is created using dyn_access/3 on an as needed basis. The size calculation: N = 7, msb(7) = 3.

One of the nice things about this is setting a really large value up front does not use a lot of space up front, e.g.

?- dyn_new(1000000,Dyn).
Dyn = dyn(_1776, _1778, _1780, _1782, _1784, _1786, _1788, _1790, _1792, _1794, _1796, _1798, _1800, _1802, _1804, _1806, _1808, _1810, _1812, _1814).

msb(1000000) = 19.


dyn_access/3 - When Value is bound and at location I there is a free variable, updates Value at index I for structure Dyn otherwise fails. When Value is free, retrieves Value at index I for structure Dyn.

Example of structure populated at indexes 1,3,6,9,12,15.

?- dyn_new(16,Dyn),dyn_access(1,Dyn,1),dyn_access(3,Dyn,3),dyn_access(6,Dyn,6),dyn_access(9,Dyn,9),dyn_access(12,Dyn,12),dyn_access(15,Dyn,15).
% Located at index 1 in 1
% Created sub array of length 1 for 1
% Located at index 2 in 2
% Created sub array of length 2 for 2
% Located at index 3 in 3
% Created sub array of length 4 for 3
% Located at index 2 in 4
% Created sub array of length 8 for 4
% Located at index 5 in 4
% Located at index 8 in 4
Dyn = dyn(s(1), s(_6694, 3), s(_6832, _6834, 6, _6838), s(_6974, 9, _6978, _6980, 12, _6984, _6986, 15), _6422).

Example or reading an existing value.

?- dyn_new(8,Dyn),dyn_access(1,Dyn,1),dyn_access(3,Dyn,3),dyn_access(6,Dyn,6),dyn_access(3,Dyn,Three).
% Located at index 1 in 1
% Created sub array of length 1 for 1
% Located at index 2 in 2
% Created sub array of length 2 for 2
% Located at index 3 in 3
% Created sub array of length 4 for 3
% Located at index 2 in 2
Dyn = dyn(s(1), s(_1938, 3), s(_2076, _2078, 6, _2082), _1666),
Three = 3.

Using dyn_access/3 with an index larger than Max returns false, e.g.

?- dyn_new(7,Dyn),dyn_access(8,Dyn,8).
% Located at index 1 in 4
false.

so while not all of the variables are allocated when the dynamic array is created, it is not unlimited by memory size but by the initial value.


The most import thing I learned by doing this is that when using arg/3 you don’t have to use state variables.

In other words you might expect dyn_access to be written as dyn_access(I, Value, Dyn0, Dyn) where Dyn0 is the state being passed in and Dyn is the new state being passed out. After understanding this and looking at the Prolog code for SWI-Prolog (libraries) I see how often arg/3 is used and why.


Here was a result that I didn’t expect but makes sense.

?- dyn_new(8,Dyn),dyn_access(1,Dyn,1),dyn_access(3,Dyn,3),dyn_access(6,Dyn,6),dyn_access(4,Dyn,Four).
% Located at index 1 in 1
% Created sub array of length 1 for 1
% Located at index 2 in 2
% Created sub array of length 2 for 2
% Located at index 3 in 3
% Created sub array of length 4 for 3
% Located at index 1 in 3
Dyn = dyn(s(1), s(_6576, 3), s(Four, _6716, 6, _6720), _6304).

Created a dynamic array of size 8, added values at index 1,3,6 and then read value at index 4, e.g. dyn_access(4,Dyn,Four). Was expecting the variable Four to be bound to the a system generated variable, but was not expecting Dyn to be updated with the name of the system generated variable, e.g. s(Four, _6716, 6, _6720). In a way it makes sense, because the unification of the the variable Four and the system generated variable is successful, and the printing routine has a choice of which variable to print and chooses the one with the user generated name over the system generated name. What I was expecting was something like Four = _6701.

Using write_canonical/ to write out the value of variable Four does work as expected.

?- dyn_new(8,Dyn),dyn_access(1,Dyn,1),dyn_access(3,Dyn,3),dyn_access(6,Dyn,6),dyn_access(4,Dyn,Four),write_canonical(Four).
% Located at index 1 in 1
% Created sub array of length 1 for 1
% Located at index 2 in 2
% Created sub array of length 2 for 2
% Located at index 3 in 3
% Created sub array of length 4 for 3
% Located at index 1 in 3
_
Dyn = dyn(s(1), s(_2008, 3), s(Four, _2148, 6, _2152), _1736).

While arg/3 can set a free/unbound value, it can not set a bound value. To set a bound values requires the use of setarg/3. See: Non-logical operations on terms

Example:

  1. This demonstrates using functor/3 to create a term with name and arity. Note the term has free variables.
?- functor(S,s,5).
S = s(_6508, _6510, _6512, _6514, _6516).
  1. This demonstrates using arg/3 to bind a value to a variable for for one of the free variables.
?- functor(S,s,5),arg(3,S,a).
S = s(_6616, _6618, a, _6622, _6624).
  1. This shows that using arg/3 on a bound variable will not succeed (fail).
?- functor(S,s,5),arg(3,S,a),arg(3,S,b).
false.
  1. This shows that using setarg/3 on a bond variable will succeed.
?- functor(S,s,5),arg(3,S,a),setarg(3,S,b).
S = s(_6730, _6732, b, _6736, _6738).

When using must_be/2 with compound terms containing free variables, many of the types will fail with exceptions when using must_be/2 because of the free variables, even though the compound term with free variables is a valid term.

Note for my case: When an invalid term is used an exception is desired as opposed to failing or silently failing because during the development of the code any invalid term should stop the processing immediately so that the exact location of the problem is identified, e.g. don’t have to step through the code or look at a stack trace. Also when the creation of a predicate is done in step with the creation of the test cases, having the exception right away leads to better bug identification and better test cases as the test cases with exceptions are also created. Once the code is fully developed and tested, then a second version of the code can be created that removes most to all of the must_be/2 calls.

In writing deterministic code, one often thinks in terms of just input parameters with arguments (bound values) and output parameters (free values/variables). The input arguments are typically checked with must_be(ground,Input_1) and the output arguments checked with must_be(var,Output_1)

However these checks fail when using compound terms created that have one or more free variables.

Example

  1. A positive result of type ground when used with a compound term that has no free variable.
?- functor(S,s,2),arg(1,S,a),arg(2,S,b),must_be(ground,S).
S = s(a, b).
  1. An exception occurs if one of the variables is free. However this should be considered valid input even though it contains free variables.
?- functor(S,s,2),arg(1,S,a),arg(2,S,b),must_be(ground,S).
?- functor(S,s,2),arg(1,S,a),must_be(ground,S).
ERROR: Arguments are not sufficiently instantiated
ERROR: In:
ERROR:   [14] throw(error(instantiation_error,_9156))
ERROR:    [9] <user>
ERROR: 
ERROR: Note: some frames are missing due to last-call optimization.
ERROR: Re-run your program in debug mode (:- debug.) to get more detail.
  1. A positive result of type compound when used with a compound term that has free variable(s).
?- functor(S,s,2),arg(1,S,a),must_be(compound,S).
S = s(a, _11154).
  1. A positive result of type nonvar when uses with a compound term that has free variable(s).
?- functor(S,s,2),arg(1,S,a),must_be(nonvar,S).
S = s(a, _1680).

Looking beyond using must_be/2 for argument validation, these work for a single level compound term,

?- =(s(_,_),s(a,_)).
true.

?- subsumes_term(s(_,_),s(a,_)).
true.

and this works for a multi-level compound term

?- =(s(_,s(a,_)),s(_,_)).
true.

However here is a problem I still have.

The code was working correctly until a compound term B was added that used free variables to an existing compound term A that did not use free variables, in other words, the code worked with must_be(ground,A) but now that A contains B, must_be(ground,A) no longer works and the test has to be relaxed to must_be(nonvar,A) which is not desired.

1 Like

i am still trying to get my head around this – but, that sounds to me like pointer arithmetic based on knowledge of how cells (cons) are stored on some global memory location that holds dynamic data.

Not clear from where the additional memory comes from that doesnt need reallocation?

There’s no global memory per se – it’s just a regular compound term.

Perhaps this will make it clearer what’s going on:

% set_values/3 fills in the values of _Dyn_ with 1:'x-1', 2:'x-2', etc.
set_values(I, Limit, _) :- I > Limit, !.
set_values(I, Limit, Dyn) :-
    format(atom(A), 'x-~d', [I]),
    dyn_access(I, Dyn, A),
    I2 is I + 1,
    set_values(I2, Limit, Dyn).

% set the values and print out the Dyn structure and the values.
?- dyn_new(25, Dyn),
   set_values(1, 20, Dyn),
   print_term(Dyn, []),
   forall(between(1,21,I),
          (dyn_access(I, Dyn, A),
           format('~d: ~q~n', [I, A]))).

dyn(s('x-1'),
    s('x-2','x-3'),
    s('x-4','x-5','x-6','x-7'),
    s('x-8','x-9','x-10','x-11','x-12','x-13','x-14','x-15'),
    s('x-16','x-17','x-18','x-19','x-20',_,_,_,_,_,_,_,_,_,_,_))
1: 'x-1'
2: 'x-2'
3: 'x-3'
4: 'x-4'
5: 'x-5'
6: 'x-6'
7: 'x-7'
8: 'x-8'
9: 'x-9'
10: 'x-10'
11: 'x-11'
12: 'x-12'
13: 'x-13'
14: 'x-14'
15: 'x-15'
16: 'x-16'
17: 'x-17'
18: 'x-18'
19: 'x-19'
20: 'x-20'
21: _13844

and running with debug(dyn):

% Located at index 1 in 1
% Created sub array of length 1 for 1
% Located at index 1 in 2
% Created sub array of length 2 for 2
% Located at index 2 in 2
% Located at index 1 in 3
% Created sub array of length 4 for 3
% Located at index 2 in 3
% Located at index 3 in 3
% Located at index 4 in 3
% Located at index 1 in 4
% Created sub array of length 8 for 4
% Located at index 2 in 4
% Located at index 3 in 4
% Located at index 4 in 4
% Located at index 5 in 4
% Located at index 6 in 4
% Located at index 7 in 4
% Located at index 8 in 4
% Located at index 1 in 5
% Created sub array of length 16 for 5
% Located at index 2 in 5
% Located at index 3 in 5
% Located at index 4 in 5
% Located at index 5 in 5
dyn(s('x-1'),s('x-2','x-3'),s('x-4','x-5','x-6','x-7'),s('x-8','x-9','x-10','x-11','x-12','x-13','x-14','x-15'),s('x-16','x-17','x-18','x-19','x-20',_16516,_16518,_16520,_16522,_16524,_16526,_16528,_16530,_16532,_16534,_16536))
% Located at index 1 in 1
1: 'x-1'
% Located at index 1 in 2
2: 'x-2'
% Located at index 2 in 2
3: 'x-3'
% Located at index 1 in 3
4: 'x-4'
% Located at index 2 in 3
5: 'x-5'
% Located at index 3 in 3
6: 'x-6'
% Located at index 4 in 3
7: 'x-7'
% Located at index 1 in 4
8: 'x-8'
% Located at index 2 in 4
9: 'x-9'
% Located at index 3 in 4
10: 'x-10'
% Located at index 4 in 4
11: 'x-11'
% Located at index 5 in 4
12: 'x-12'
% Located at index 6 in 4
13: 'x-13'
% Located at index 7 in 4
14: 'x-14'
% Located at index 8 in 4
15: 'x-15'
% Located at index 1 in 5
16: 'x-16'
% Located at index 2 in 5
17: 'x-17'
% Located at index 3 in 5
18: 'x-18'
% Located at index 4 in 5
19: 'x-19'
% Located at index 5 in 5
20: 'x-20'
% Located at index 6 in 5
21: _14340

Thank you.

Do you know why terms memory allocation uses this scheme ? What was the driver of the design decision?

Dan

I don’t think that @jan said that term memory allocation uses this scheme; rather, he said that the internal implementation of extensible arrays uses something like this scheme. I don’t know whether these extensible arrays are visible in any way at the user level or are only used for internal data structures.

It is only internal. The C equivalent is used at several places where reallocating (and thus often moving) a normal array would require locking or complicated lock-free mechanisms. Examples are arrays atoms, functors, threads, thread-local implementations of predicates, etc.