Fast Hash Tables

Greetings, everybody! I would like to ask if you know of any library anywhere that has got efficient (amortized constant time) Hash Tables. I am in dire need of a Hash Table for SWI Prolog, and it will contain 10k-700k key-value mappings. (of course, I will provide a hash function since keys are going to be tuples of ints: thinking of implementing those by 123’/'123 e.g.)

Having tried the library(assoc), I can tell that its O(logn) performance with those big constants isn’t favoring my application unfortunately… Therefore, speed is of utmost importance to me, and backtrackability is not (in this case) for me. I would prefer in this part of my code destructive modifications and as fast as possible.

Are you aware of any such implementations somewhere, or how I could make this work?

Note: I have also searched the newly proposed dict structure, and, as the documentation on its own states, it is totally inappropriate for speed needs…

Rbtrees proved to be faster for my application than assoc, but probably still not fast enough for you.

In that case, the first solution is to use a dynamic predicate and let clause indexing do the work. That has no problems with millions of clauses and gives excellent performance.

3 Likes

Just in case, an example may help, something like:

:- dynamic my_keystore/2.

kv_add(Key,Value) :-
    (   my_keystore(Key,Value)
    ->  true
    ;   assertz(my_keystore(Key,Value))
    ).

kv_get(Key,Value) :-
   my_keystore(Key,Value).

To test it:

kv_test(N) :-
   time(  forall( between(1,N,K),
                  kv_add(K,K) )
   ),
   time(  ( random_between(1,N,K1),
            kv_get(K1,K1) )
   ).

It is very fast:

8 ?- kv_test(100 000).
% 299,999 inferences, 0.037 CPU in 0.037 seconds (100% CPU, 8060872 Lips)
% 4 inferences, 0.000 CPU in 0.000 seconds (100% CPU, 143241 Lips)
true.

9 ?- kv_test(1 000 000).
% 4,799,999 inferences, 1.548 CPU in 1.550 seconds (100% CPU, 3100560 Lips)
% 4 inferences, 0.000 CPU in 0.000 seconds (99% CPU, 149745 Lips)
true.
1 Like

Update: on a second run, adding keys is much faster:

8 ?- kv_test(1 000 000).
% 2,999,999 inferences, 0.374 CPU in 0.374 seconds (100% CPU, 8022441 Lips)
% 4 inferences, 0.000 CPU in 0.000 seconds (99% CPU, 148440 Lips)

less than 400ms for adding a million keys and sub-millisecond lookup for one key. That is very good.

To get a better handle on the key lookup speed we can use:

kv_get_test(N) :-
   time(  forall( between(1,N,_),
                  (  random_between(1,N,K), kv_get(K,K) )
	        )
   ).

Getting 100,000 entries:

10 ?- kv_get_test(100 000).
% 599,999 inferences, 0.119 CPU in 0.119 seconds (100% CPU, 5049332 Lips)
true.

That is 1.2 µsecs per lookup, and that is INCLUDING generating a random number for each lookup!!! I wouldn’t look any further- thanks to Jan for such a great work :grin:

UPDATE: If I take the random number generation out, I am getting about 190 NANO-SECONDS per loopkup!! (I7 core). Hard to beat that.

1 Like

Hi Jan,

Why is assoc so much slower … can the code above be tweaked to support backtracking while essentially retaining speed …

Dan

assoc is a AVL (balancing binary tree) that requires log(N) calls to compare/3 and accompanying insert/re-balance operations. Compare that to clause index hash tables that compute that hash key from the value and do an O(1) lookup (provided a good distribution). Finally, clause indexing is in C and library(assoc) in Prolog.

No. The database is not backtrackable.

It is not very hard to write a hash table in Prolog that manages its data on the stacks. Several people have done so. library(nb_set) implements a non-backtrackable hash table. This implementation is easily changed to be backtrackable. It should also be pretty straightforward to write the whole thing in C and gain some extra performance. As is, we get this (distinct/2 uses nb_sets and spends most of its time there).

1 ?- time(forall(distinct(X, between(1, 1 000 000, X)),true)).
% 38,922,777 inferences, 3.514 CPU in 3.525 seconds (100% CPU, 11075635 Lips)
true.

For non-backtracking, you can also use tries:

6 ?- trie_new(T), time(forall(between(1, 1 000 000, X), trie_insert(T,X))).
% 2,000,001 inferences, 0.421 CPU in 0.421 seconds (100% CPU, 4751621 Lips)
T = <trie>(0x562eeac55680).

This makes me wonder whether tries are not a better basis for implementing distinct/1,2 …

Hi Jan,

In what way would such a hash table be (non) functionality different from AVLs – what is the key benefit of having an AVL as hash table – is it access time – but, i guess, that would also be improved in the “naive” implementation due to, for example, better indexing.

Access time for a hash table is typically better, but it is hard to define hash table update operations as a relation. This requires a copy of the main array, whereas updating a tree requires copying log(N) small terms. In other words, hash tables only make sense when implemented as a destructively updating data structure and this isn’t very Prolog minded.

distinct/1,2 plays a fundamentally different role: it modifies an answer stream without buffering the whole thing before producing the first answer. That property is quite interesting if you are not interested in all answers. As a consequence distinct answers are defined as the store not containing a variant. I doubt that can be avoided.

It just passes on the first answer for which it doesn’t have a variant in its hash table. Possibly it should unify the answer with the created copy. I think that complicates the code quite a bit though and I’m not 100% sure that is better or worse. The code is pretty simple and just ?- edit(distinct). should show it.