New hashtable library

Yes. Destructive assignment kills some of the beauty of Prolog. On the other hand, library(yall) (and foreach/2) performing a copy_term/2 on potentially large argument terms is not good either. I wonder whether that is avoidable, i.e., whether it really needs to copy these terms.
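As a small illustration of the semantic side (my example, not from the library docs): bindings made against a copy never reach the original term, so a copying foreach/2 cannot let Goal instantiate variables of the caller.

        ?- copy_term(f(X), Copy), Copy = f(1).
        Copy = f(1).

Note that X is left unbound.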

Here is the result with -O:

          Database raw and crude micro benchmark           
                     1,000,000 entries                     
Database  Operation                              Wall time
--------  ---------                              ---------
nb        1st lookup  .......................... 0.143 secs.
trie      1st lookup  .......................... 0.192 secs.
rec       1st lookup  .......................... 0.258 secs.
asrt      1st lookup  .......................... 0.804 secs.
consult   1st lookup  .......................... 0.805 secs.
hashtable 1st lookup  .......................... 1.629 secs.

nb        2nd lookup  .......................... 0.151 secs.
trie      2nd lookup  .......................... 0.192 secs.
rec       2nd lookup  .......................... 0.262 secs.
consult   2nd lookup  .......................... 0.295 secs.
asrt      2nd lookup  .......................... 0.297 secs.
hashtable 2nd lookup  .......................... 1.608 secs.

nb        insert      .......................... 0.243 secs.
trie      insert      .......................... 0.284 secs.
rec       insert      .......................... 0.340 secs.
asrt      insert      .......................... 0.443 secs.
hashtable insert      .......................... 5.033 secs.
consult   insert      .......................... 20.782 secs.

It improves things, but the hashtable is way slower than any of the other options.

If the code is moved to C, and the hashtable becomes an SWI-Prolog blob, wouldn’t this solve the problem of term copying and, in addition, speed things up considerably?

EDIT: You already have a foreign hashtable in C, but this probably doesn’t backtrack, which I guess was your main point with library(hashtable).

Eliminating the need to copy the terms would be a great improvement, both semantically and also, probably, in terms of speed. I wonder, however, how this could be done.

Food for thought.

Perfect hash function

Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated.

Perhaps an option could be added to hashtable for when the set is static or seldom updated. Obviously this would impact insert times negatively, but lookup times should improve and become consistent.
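As a toy illustration of the static case (my sketch, unrelated to library(hashtable) internals): for a fixed set of distinct integer keys one can simply search for the smallest modulus that is collision-free, which is the crudest form of a (non-minimal) perfect hash.

        %  Find the smallest M such that Key mod M is collision-free over
        %  the (distinct, integer) Keys.  Lookup then needs no collision
        %  handling at all.
        perfect_mod(Keys, M) :-
            length(Keys, N),
            between(N, inf, M),
            findall(H, ( member(K, Keys), H is K mod M ), Hs),
            sort(Hs, Unique),
            length(Unique, N),
            !.

For example, perfect_mod([3,17,40,77], M) yields M = 8, so Key mod 8 maps these four keys to four distinct slots.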


EDIT

While reading “Succinct Data Structures and Delta Encoding for Modern Databases” by Matthijs van Otterdijk, Gavin Mendel-Gleason, and Kevin Feeney, I found:

Succinct data structures (Jacobson, 1988) are a family of data structures which are close in size to the information theoretic minimum representation.


EDIT

Source in C for minimal perfect hashing

Keeping the current data structure but doing most of the quite simple imperative work in C was what I was referring to. If you turn it into a blob you get a completely different beast that always needs to copy the keys and values and doesn’t backtrack. As is, neither keys nor values are copied. This may in fact change the figures considerably when keys and values are not just integers but large(r) terms.

Ideally you have the same semantics as the compiled version. This implies copying as well, but less. Whether that is possible, I do not know. Never really thought about that.

I pushed a branch foreach-no-copy that does this for foreach/2. This now happily works for hash tables. It reduces the overhead of foreach/2 significantly, although the findall/3 needed to collect all generator solutions still makes it slow. It also fixes a lot of the semantic issues with foreach/2. I have not merged this as I need some more time to be sure this really works. (ref) Test cases are welcome.
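For example, a candidate test case could be something along these lines (my sketch; ht_new/1, ht_put/3 and ht_get/3 from library(hashtable)): insert N keys into a single backtrackable hash table via foreach/2 and verify that all of them are found afterwards.

        :- use_module(library(hashtable)).

        %  All inserts must end up in the one table bound to HT; with a
        %  copying foreach/2 each ht_put/3 works on a fresh copy instead.
        test_foreach_ht(N) :-
            ht_new(HT),
            foreach(between(1, N, I), ht_put(HT, I, I)),
            forall(between(1, N, I), ht_get(HT, I, I)).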

Reran the benchmarks using foreach/2 instead of forall/2.
With the old regular foreach/2:


          Database raw and crude micro benchmark           
                     1,000,000 entries                     
Database  Operation                              Wall time
--------  ---------                              ---------
trie      1st lookup  .......................... 0.252 secs.
rec       1st lookup  .......................... 0.305 secs.
nb        1st lookup  .......................... 0.772 secs.
asrt      1st lookup  .......................... 0.864 secs.
consult   1st lookup  .......................... 0.872 secs.
hashtable 1st lookup  .......................... 1.643 secs.

trie      2nd lookup  .......................... 0.251 secs.
rec       2nd lookup  .......................... 0.306 secs.
asrt      2nd lookup  .......................... 0.356 secs.
consult   2nd lookup  .......................... 0.359 secs.
nb        2nd lookup  .......................... 0.778 secs.
hashtable 2nd lookup  .......................... 1.616 secs.

nb        insert      .......................... 0.305 secs.
trie      insert      .......................... 0.339 secs.
rec       insert      .......................... 0.410 secs.
asrt      insert      .......................... 0.582 secs.
hashtable insert      .......................... 4.852 secs.
consult   insert      .......................... 21.153 secs.

With the new foreach/2:

          Database raw and crude micro benchmark           
                     1,000,000 entries                     
Database  Operation                              Wall time
--------  ---------                              ---------
nb        1st lookup  .......................... 0.406 secs.
trie      1st lookup  .......................... 0.497 secs.
rec       1st lookup  .......................... 0.614 secs.
consult   1st lookup  .......................... 1.084 secs.
asrt      1st lookup  .......................... 1.087 secs.
hashtable 1st lookup  .......................... 1.612 secs.

nb        2nd lookup  .......................... 0.412 secs.
trie      2nd lookup  .......................... 0.503 secs.
consult   2nd lookup  .......................... 0.581 secs.
asrt      2nd lookup  .......................... 0.589 secs.
rec       2nd lookup  .......................... 0.671 secs.
hashtable 2nd lookup  .......................... 1.585 secs.

nb        insert      .......................... 0.586 secs.
trie      insert      .......................... 0.610 secs.
asrt      insert      .......................... 0.836 secs.
rec       insert      .......................... 1.059 secs.
hashtable insert      .......................... 4.585 secs.
consult   insert      .......................... 21.302 secs.

So overall it doesn’t seem to be an improvement speed-wise.
EDIT: both runs were using the -O switch.

I guess the important question is, why was it initially implemented using copy_term/2?

Not entirely sure what you did, but the hash table test doesn’t work with the old foreach/2: it just inserts each element into an empty hash table. The new foreach/2 does get the correct result and is slower for that reason, as inserting into a growing hash table deals with collisions and needs resizing. Finally, forall/2 is a lot faster. It always will be :)
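For reference, forall/2 can be defined by plain double negation, which is why it neither collects solutions nor copies anything, but also cannot leave bindings behind:

        forall(Cond, Action) :-
            \+ ( Cond, \+ Action ).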

Because you need to call the different instantiations produced by the generator without backtracking (that is the whole point of foreach/2). The simple way to do that is like this, which obviously copies Goal using findall/3.

        foreach(Generator, Goal) :-
            findall(Goal, Generator, Goals),
            maplist(call, Goals).

The actual implementation was a bit smarter, using a findall/3 template based on the shared variables between Goal and Generator. Now we need to instantiate Goal with each instance of this template, but in Prolog we can instantiate a variable only once. So, a very dirty trick was needed that only reverts the instantiation of the template, preserving bindings created by Goal. I still need to think whether this is really correct. If anyone can shoot a hole in this idea, please do.
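To make that concrete, here is a rough sketch (my reconstruction, not the actual source) of the copying, template-based variant; the new implementation replaces the copy_term/2 step by binding the template and later undoing only that binding:

        :- use_module(library(ordsets)).

        %  Build a template from the variables shared between Generator and
        %  Goal, collect its instances with findall/3, then run a fresh copy
        %  of Goal for each instance.
        foreach_copying(Generator, Goal) :-
            term_variables(Generator, GVs0), sort(GVs0, GVs),
            term_variables(Goal, Vs0),       sort(Vs0, Vs),
            ord_intersection(GVs, Vs, Shared),
            Templ =.. [v|Shared],
            findall(Templ, Generator, Instances),
            run_instances(Instances, Templ, Goal).

        run_instances([], _, _).
        run_instances([Inst|Insts], Templ, Goal) :-
            copy_term(Templ-Goal, Inst-Copy),   % the copy the new version avoids
            call(Copy),
            run_instances(Insts, Templ, Goal).

Because Copy is a fresh copy, any bindings or constraints Goal places on variables not shared with Generator are lost, which is one of the semantic issues the no-copy approach fixes.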

Merged into master. Docs:

foreach(:Generator, :Goal)
    True when the conjunction of instances of Goal created from solutions
    for Generator is true. Except for term copying, this could be
    implemented as below.

        foreach(Generator, Goal) :-
            findall(Goal, Generator, Goals),
            maplist(call, Goals).

    The actual implementation uses findall/3 on a template created from the
    variables shared between Generator and Goal. Subsequently, it uses every
    instance of this template to instantiate Goal, call Goal and undo only
    the instantiation of the template and not other instantiations created
    by running Goal. Here is an example:

        ?- foreach(between(1,4,X), dif(X,Y)), Y = 5.
        Y = 5.
        ?- foreach(between(1,4,X), dif(X,Y)), Y = 3.
        false.

    The predicate foreach/2 is mostly used if Goal performs backtrackable
    destructive assignment on terms. Attributed variables (underlying
    constraints) are an example. Another example of a backtrackable data
    structure is in library(hashtable). If we care only about the side
    effects (I/O, dynamic database, etc.) or the truth value of Goal,
    forall/2 is a faster and simpler alternative. If Goal instantiates its
    arguments, it will often fail as the argument cannot be instantiated
    to multiple values. It is possible to incrementally grow an argument:

        ?- foreach(between(1,4,X), member(X, L)).
        L = [1,2,3,4|_].

    Note that SWI-Prolog up to version 8.3.4 created copies of Goal using
    copy_term/2 for each iteration.