Performance cost of dictionaries?

I was trying to ground my understanding of the performance cost of dictionaries compared to compound terms. I expected dicts to be slower, but not by a substantial amount.

When actually measuring it while doing a field swap, I found a few things surprising.

  • Dictionaries were about 5x slower than compound terms (or 10x if you subtract the baseline; though I don’t know if that’s a fair thing to do).
  • The delta between dicts and compound terms is greater in SWI than the equivalent gap between Clojure’s immutable data structures (Clojure shows about a 1.5x difference between maps and vectors, and is generally 8x faster than Prolog; but I’m not certain whether this is just a problem with my methodology).
  • The no-op is surprisingly slow, taking about 600 ms compared to 10-20 ms in other languages (I’m guessing this is just some Prolog VM cost for backtracking and such?)

Now, I am in no way certain my methodology is correct. I’d be more than happy to be told I’m doing something silly or expecting something I shouldn’t; this is a learning exercise for me, after all.

Is this all actually expected? I would also love to be enlightened about where the costs are going. Thanks!

Prolog benchmark

Code

baseline_no_op :- true.

swap_field_term(
    foo(A, B, C, D, E),
    foo(A, B, C, E, D)).

swap_field_dict(Foo1, Foo2) :-
    Foo2 = Foo1.put(_{d: Foo1.e, e: Foo1.d}).

run_test :-
    format("Baseline no op x 10M~n"),
    forall(between(1, 3, _),
	   time(forall(between(1, 10_000_000, _),
		       baseline_no_op))),
    nl,
    nl,
    format("Swap field term x 10M~n"),
    forall(between(1, 3, _),
	   time(forall(between(1, 10_000_000, _),
		       swap_field_term(foo(a, b, c, d, e), _)))),
    nl,
    nl,
    format("Swap field dict x 10M~n"),
    forall(between(1, 3, _),
	   time(forall(between(1, 10_000_000, _),
		       swap_field_dict(foo{a:a, b:b, c:c, d:d, e:e}, _)))),
    nl,
    nl.

Output

Baseline no op x 10M
% 20,000,000 inferences, 0.648 CPU in 0.648 seconds (100% CPU, 30872506 Lips)
% 19,999,999 inferences, 0.670 CPU in 0.670 seconds (100% CPU, 29870419 Lips)
% 19,999,999 inferences, 0.657 CPU in 0.657 seconds (100% CPU, 30431102 Lips)


Swap field term x 10M
% 19,999,999 inferences, 1.021 CPU in 1.021 seconds (100% CPU, 19579253 Lips)
% 19,999,999 inferences, 1.015 CPU in 1.015 seconds (100% CPU, 19707627 Lips)
% 19,999,999 inferences, 1.018 CPU in 1.018 seconds (100% CPU, 19646922 Lips)


Swap field dict x 10M
% 120,000,002 inferences, 5.323 CPU in 5.323 seconds (100% CPU, 22545290 Lips)
% 119,999,999 inferences, 5.269 CPU in 5.269 seconds (100% CPU, 22774026 Lips)
% 119,999,999 inferences, 5.403 CPU in 5.403 seconds (100% CPU, 22208489 Lips)

The surprising bits to me:

  • Dictionaries come in at about 5x the time of compound terms
  • The baseline takes well over 100 ms (around 650 ms here)

Clojure benchmark

Disclaimer: I don’t know Prolog or Clojure well enough to know whether this is truly apples to apples. I suspect it’s not, but I couldn’t tell you how.

Code

(defn no-op [] true)

(println "baseline: " (no-op))
(dotimes [_ 3]
  (time
    (dotimes [_ 10000000]
      (no-op))))

(defn swap-dict-test [x]
  (assoc x
     :d (:e x)
     :e (:d x)))

(println "dict: " (swap-dict-test {:a :a :b :b :c :c :d :d :e :e}))
(dotimes [_ 3]
  (time
    (dotimes [_ 10000000]
      (swap-dict-test {:a :a :b :b :c :c :d :d :e :e}))))

(defn swap-array-test [vec]
  (assoc vec
    3 (vec 4)
    4 (vec 3)))

(println "array: " (swap-array-test [:a, :b, :c, :d, :e]))
(dotimes [_ 3]
  (time
    (dotimes [_ 10000000]
      (swap-array-test [:a, :b, :c, :d, :e]))))

Output

baseline:  true
"Elapsed time: 8.7682 msecs"
"Elapsed time: 6.775901 msecs"
"Elapsed time: 5.1539 msecs"
dict:  {:a :a, :b :b, :c :c, :d :e, :e :d}
"Elapsed time: 738.496829 msecs"
"Elapsed time: 684.561426 msecs"
"Elapsed time: 635.70142 msecs"
array:  [:a :b :c :e :d]
"Elapsed time: 488.002514 msecs"
"Elapsed time: 486.411614 msecs"
"Elapsed time: 487.424913 msecs"

Note: Dictionary updates only took ~1.5x the time.

Javascript (no-op only)

I only checked the no-op here, since JavaScript has no standard immutable data structures, so the rest would definitely not be apples to apples.

console.time(); x = []; for (i = 0; i < 10000000; i++) { x.push(1); x.pop(); }; console.timeEnd();
VM181:1 default: 19.4169921875 ms

That is not a fair comparison. You’d have to use

swap_field_dict(_{a:A,b:B,c:C,d:D,e:E},
                _{a:A,b:B,c:C,d:E,e:D}).

Writing a compound version that is a fair comparison is even harder. In my view, one should compare get_dict/3 with arg/3. These perform pretty close, with get_dict/3 getting significantly slower above 1,000 keys, if I recall correctly.
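A minimal sketch of such a comparison (just an illustration, not a carefully controlled benchmark) could be:

% Build the dict and the term once; the loops only measure field access.
compare_access :-
    Dict = foo{a:1, b:2, c:3, d:4, e:5},
    Term = foo(1, 2, 3, 4, 5),
    format("get_dict/3 x 10M~n"),
    time(forall(between(1, 10_000_000, _), get_dict(d, Dict, _))),
    format("arg/3 x 10M~n"),
    time(forall(between(1, 10_000_000, _), arg(4, Term, _))).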

P.S. You can speed things up a little by using this and swipl -O, which compiles the forall/2:

:- use_module(library(apply_macros)).
:- use_module(library(statistics)).

But still, it is relatively slow. Most is loop overhead. profile/1 tells me that the no-op call itself is only 20%.
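For reference, that profile can be obtained with something like the following (profile/1 opens the graphical profiler when a GUI is available; otherwise it falls back to a textual report):

?- profile(forall(between(1, 10_000_000, _), baseline_no_op)).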

SWI-Prolog doesn’t do very well on such small artificial tests compared to e.g., SICStus or YAP. On real programs it tends to do well :-)


Ah, thanks! I knew I was doing something silly! It performs much more similarly with that: the dict version is now about 1.1 s on my system (much closer to ~1.5x the term version, which seems perfectly reasonable).

Also good to know that get_dict/3 is expected to perform well. I didn’t do a direct comparison with arg/3 yet, but I’m finding that the speed of get_dict/3 scales well, even past 1,000 keys. Though maybe I’m doing something wrong again?

size_dict(N, Dict) :- size_dict(N, foo{}, Dict).
size_dict(N0, Dict0, DictF) :-
    N0 > 0, !,
    put_dict(N0, Dict0, N0, Dict1),
    N1 is N0 - 1,    
    size_dict(N1, Dict1, DictF).
size_dict(N, Dict, Dict) :- N =< 0.

run_get_test :-
    forall(between(1, 2_000, I),
	   ( N is I * 100,
	     size_dict(N, Dict),
	     N1 is N + 1,
	     call_time(
		 forall(between(1, 100_000, _),
			( random(1, N1, X),
			  get_dict(X, Dict, _)
			)),
		 Timing),
	     format("~w,~w~n", [N, Timing.cpu])
	   )).

Result (I forgot to label the axes, but y is seconds and x is the number of items):
[chart: get_dict/3 lookup time vs. number of keys]

Get speed seems to remain pretty consistent (maybe loop overhead is dwarfing and hiding it?).

Creation speed gets pretty slow, though, but this is less surprising:

[chart: dict creation time vs. number of keys]


It’s also good to know that the loop/no-op test’s overhead is negligible in more realistic scenarios. But I am curious: why is it so heavy? I don’t have a good mental model for why it would be costly, so I’m looking to get a little more insight there.


forall(between(1,N,_), Goal) is quite a bit more work than for(int i=0; i<N; i++) { some code }. For one thing, unless library(apply_macros) is in effect, Goal is meta-called, which means it has to go through two hash tables to find the implementation. Using this library, forall/2 is implemented as \+ (Gen, \+ Test), which at least compiles the call to Test, so we do not need any hash tables. But still, it is a call.
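To make that concrete, here is a sketch of the double-negation shape (not the actual library code):

% "There is no solution of Gen for which Test fails."  library(apply_macros)
% inlines this at the call site, so the call to Test is compiled; in this
% standalone sketch both goals still go through call/1.
my_forall(Gen, Test) :-
    \+ ( call(Gen), \+ call(Test) ).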

Next, the looping in the imperative version is a simple jump. Here we must activate the Prolog backtracking mechanism to undo all work and find the choice point of between/3, call the foreign (C) function implementing it, go through a couple more VM instructions to reach our Goal, call it, fail again, etc. Surely you can do all this faster than SWI-Prolog, but there are few practical cases that are likely to profit a lot. One of the things that bothers me is the rather high cost of calling predicates defined in C. That is probably worth some reconsideration …
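If you want to get a feel for these relative costs yourself, a rough sketch (with the usual micro-benchmark caveats) is to time a C-defined built-in such as succ/2 against a plain Prolog clause doing the same work:

% succ/2 is implemented in C; plus1/2 is an ordinary compiled clause.
plus1(X, Y) :- Y is X + 1.

c_call_vs_prolog_call :-
    time(forall(between(1, 10_000_000, _), succ(1, _))),
    time(forall(between(1, 10_000_000, _), plus1(1, _))).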


SWI-Prolog dicts have copy semantics. If there were a nb_setarg/3 for SWI-Prolog dicts, maybe they would be faster, but also less “logical”. Here is a test case comparing copy-semantics dicts with shared-semantics dicts; it’s just an increment-counter scenario:

/* SWI-Prolog 9.3.8 */
?- Dict = foo{a:0,b:1,c:2,d:3,e:4,f:5,g:6,h:7,j:8,k:9},
|    time((between(1,1000,_), loop_copy(1000, Dict, Dict2), fail; true)).
% 10,002,000 inferences, 0.438 CPU in 0.469 seconds (93% CPU, 22861714 Lips)

/* Dogelog Player 1.2.3, JDK 21 */
?- time((between(1,1000,_), make_shared(Dict), 
   loop_shared(1000, Dict), fail; true)).
% Zeit 341 ms, GC 0 ms, Lips 17645480, Uhr 13.09.2024 10:46
true.

For small dicts of 7 or so elements it doesn’t matter so much. But if they are larger, you will obviously see more and more of a speed advantage for shared dicts. Source code:

dicts.pl.log (1.5 KB)
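For readers who do not open the attachment, a hypothetical sketch of what a copy-semantics counter loop could look like (this is not the attached code, just an illustration using put_dict/4):

% Hypothetical copy-semantics counter loop (illustration only): every
% update goes through put_dict/4, which builds a fresh dict.
loop_copy_sketch(0, Dict, Dict) :- !.
loop_copy_sketch(N, Dict0, Dict) :-
    get_dict(a, Dict0, V0),
    V is V0 + 1,
    put_dict(a, Dict0, V, Dict1),
    N1 is N - 1,
    loop_copy_sketch(N1, Dict1, Dict).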

nb_set_dict/3 is available for this purpose. Also the backtrackable destructive version b_set_dict/3. The latter would even be a sound replacement if you can prove that the original dict is no longer accessed.
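For example, an increment step could be sketched as:

% Sketch: destructive, backtrackable in-place update of one dict value.
inc_counter(Key, Dict) :-
    get_dict(Key, Dict, V0),
    V is V0 + 1,
    b_set_dict(Key, Dict, V).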

For large (> 50,000 entries) symbol tables, using “copy” semantics (that is, adding an entry produces a new table), with more inserts than lookups, I found library(rbtrees) to be the fastest. I used an abstraction layer to test the alternatives. The biggest performance problems I encountered:

  • input/output of symbol tables - library(fastrw) helped, but I/O still dominated
  • merging two symbol tables

For RB-trees, the cost of an update or insert is roughly logarithmic. Possibly hash tables would be faster, although that would require changing my algorithms to use “destructive” update. There are already lock-free hash tables in the foreign interface (but without a Prolog-level API it seems), or C++ hash tables could be used (e.g., SWI-cpp2-atommap.h). But it’s not clear to me how to change my code from copy semantics to update semantics.
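For anyone unfamiliar with library(rbtrees), its copy-semantics usage looks roughly like this (a minimal sketch, not my actual symbol-table code):

:- use_module(library(rbtrees)).

% Sketch: every insert returns a new tree; the old one stays valid.
demo_rbtree :-
    rb_empty(T0),
    rb_insert(T0, some_key, some_value, T1),
    rb_lookup(some_key, Value, T1),
    writeln(Value).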

You can call PL_register_atom() to make sure that SWI-Prolog’s atom garbage collection doesn’t remove an atom (SWI-cpp2-atommap.h does this using the C++ method register_ref()). And you can store arbitrary terms by using PL_record() and PL_recorded().

And a minor correction: Python dicts used to be unordered (or, more precisely, the ordering wasn’t specified and could change from one run to the next) but since Python 3.7, if you iterate over a dict, you get the items in the order they were inserted into the dict. (That is, the ordering for a dict became the same as OrderedDict’s.) Incidentally, OrderedDict is useful for implementing an LRU cache.

If you want an external, non-backtrackable key-value store, you can use the rocksdb API. That might be a little heavy. You can also use a trie. As this creates a hash table for choice nodes, mapping atomic → term is effectively using a hash map. It does, however, allow mapping arbitrary terms to a value :-)
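A minimal sketch of the trie route (the key may be an arbitrary term):

% Sketch: a trie as a non-backtrackable map from arbitrary terms to values.
demo_trie :-
    trie_new(Trie),
    trie_insert(Trie, point(1,2), visited),
    trie_lookup(Trie, point(1,2), Value),
    writeln(Value).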

In the Prolog world, I’d assume you typically want a backtrackable map, with or without destructive assignment, something like library(hashtable) (hashtable.pl -- Hash tables). This is a pure Prolog implementation of a destructive (but backtrackable) hash table. It is not fast, but we could speed it up quite a bit by moving some of the implementation to C.
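Basic usage is along these lines (a minimal sketch):

:- use_module(library(hashtable)).

% Sketch: basic put/get on the pure-Prolog hash table.
demo_ht :-
    ht_new(HT),
    ht_put(HT, some_key, 42),
    ht_get(HT, some_key, Value),
    writeln(Value).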

Although I wrote Prolog code for a hash table (library(pac/zdd/'zdd-array') for ZDD) one year ago, I gave up on improving it by rewriting it in C or Rust because I was not sure at all how much faster it could get.

However, I may change my mind. I have compared library(hashtable) and library(pac/zdd/zdd-array) on a tiny test query, which unexpectedly showed the latter to be about 2 times faster. As I am far away from C programming these days, I am reluctant to dive into C for now.

A Quick Tiny Comparison
% ?- N is 10^6, ht_new(_X), numlist(1, N, Ns),
%		time(ht_put_lists(Ns, _X)).
%@ % 40,834,905 inferences, 13.258 CPU in 16.113 seconds (82% CPU, 3079922 Lips)
%@ N = 1000000,
%@ Ns = [1, 2, 3, 4, 5, 6, 7, 8, 9|...].
% ?- N is 10^7, ht_new(_X), numlist(1, N, Ns),
%		time(ht_put_lists(Ns, _X)).
%@ % 563,061,499 inferences, 188.620 CPU in 229.540 seconds (82% CPU, 2985169 Lips)
%@ N = 10000000,
%@ Ns = [1, 2, 3, 4, 5, 6, 7, 8, 9|...].



% ?-zdd,  N is 10^6, numlist(1, N, Ns),
%		time(zdd_put_lists(Ns)).
%@ % 33,822,337 inferences, 7.868 CPU in 8.602 seconds (91% CPU, 4298801 Lips)
%@ N = 1000000,
%@ Ns = [1, 2, 3, 4, 5, 6, 7, 8, 9|...].

% ?-zdd,  N is 10^7, numlist(1, N, Ns),
%		time(zdd_put_lists(Ns)).
%@ % 320,819,563 inferences, 82.414 CPU in 89.058 seconds (93% CPU, 3892789 Lips)
%@ N = 10000000,
%@ Ns = [1, 2, 3, 4, 5, 6, 7, 8, 9|...].


zdd_put_lists([]).
zdd_put_lists([X|Xs]):- memo(X-X), zdd_put_lists(Xs).

ht_put_lists([], _).
ht_put_lists([X|Xs], H):- ht_put(H, X, X),
	ht_put_lists(Xs, H).

That is indeed not always easy to assess. There is quite a bit of overhead in calling out to C, mainly because we need to export enough of the VM state to be able to call GC in case the C code creates a term that would overflow the stack. In this case, though, there are already multiple calls to C predicates and we can combine them into one. If we added the complete hash table operations to the core, we could also use the low-level C interface, which is much harder to deal with but considerably faster.

At some recent point I realized that it is not appropriate to use the name ZDD for what I did in my package library(pac/zdd); it is in fact a library for binary DAGs based on hashing, a kind of assoc-list programming that is strongly aware of structure sharing.

I am really looking forward to seeing the core of library(hashtable) moved to C, though from your comment I can imagine the implementation difficulty. From my experience with my ‘ZDD’ library over the years, such a fast library(hashtable) is one of the most needed basics that existing Prologs lack.

?- set_prolog_flag(trace_gc, true).

And yes, GC coalesces the trail entries of multiple destructive assignments on the same location between the same two choicepoints.
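Another way to see whether garbage_collect/0 actually reclaimed anything is to compare statistics/2 before and after (a small sketch; the exact layout of the result list may differ between versions):

?- statistics(garbage_collection, Before),
   garbage_collect,
   statistics(garbage_collection, After),
   format("before: ~w~nafter:  ~w~n", [Before, After]).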

As I posted a year ago somewhere in this forum, a ZDD library of mine counts the number of paths from the lower-left node to the upper-right one of an unordered 13 x 13 grid graph as follows. I reran it recently, taking about two days on an Intel iMac with 128 GB of RAM running the macOS Sequoia beta. The Activity Monitor showed about 180 GB of VM!

% [2024/09/03] 13 x 13 grid graph passed also by the simple frontier vector.
% ?- time(rect_path_count(rect(12,12), C)).
% 1,273,378,663,129 inferences, 176187.026 CPU in 244597.424 seconds (72% CPU, 7227426 Lips)
% C = 64528039343270018963357185158482118.

As far as I know, Minato’s group went to more than 20 x 20 grid nodes about 10 years ago (in Python?). As a Prologer, I have to suspect some fatal error in my code for grids larger than 14 x 14 nodes. Otherwise it seems a pity for Prolog.

So far I have no idea how to close such a big gap, which is one of the reasons I am interested in an efficient library(hashtable).

BTW, I now have a question about my code for closing a memo (hash table).

To close them I use destructive nb_linkval/2 or nb_setval/2, expecting the garbage Prolog structures to be reclaimed by garbage_collect/0, which I hoped for without any definite basis.

close_state:-
	nb_linkval(zdd_hash, []),
	nb_linkval(zdd_vec, []),
	nb_linkval(zdd_extra, []),
	nb_linkval(zdd_compare, []),
	nb_linkval(zdd_child, []).

close_memo(Memo_name) :- nb_setval(Memo_name, []).

Is there a simple way to check whether garbage collection is really performed by garbage_collect/0? Thanks in advance.

In my implementation, what the ZDD library does is, in theory, ‘cofact’-ing (two-way directional) operations, and the ‘cofact’ is almost equal to the memo (hash table) in my implementation, as seen in the code in library(pac/zdd/zdd-array.pl). As I have no idea how to realize cofact operations in any form other than a hash table, I have no idea about your “macro” big steps. Thank you for the comment, though it will take me some time to digest it.

The set of required paths is a subset of the power set L of the given set of links, and L is too big for exhaustive search. If I remember correctly, an exhaustive search for rect(3, 3) did not terminate within one hour.

Plotting memory size for rect(n, n) does not seem difficult using the code in library(pac/zdd/vecter-frontier.pl). For now, for rect(4,4), in the listing below N-S means that when node N is added, the number of entries of the “main vector” is S.

% ?- time(rect_path_count(rect(4,4), C)).
%@ 25-1
%@ 24-2
%@ 23-5
%@ 22-11
%@ 21-23
%@ 20-47
%@ 19-87
%@ 18-175
%@ 17-291
%@ 16-454
%@ 15-608
%@ 14-586
%@ 13-722
%@ 12-851
%@ 11-984
%@ 10-1044
%@ 9-1006
%@ 8-1348
%@ 7-1270
%@ 6-1522
%@ 5-1246
%@ 4-1271
%@ 3-1123
%@ 2-1132
%@ 1-988
%@ done
%@ % 3,773,368 inferences, 0.888 CPU in 0.960 seconds (93% CPU, 4248135 Lips)
%@ C = 8512.

BTW, this run also shows that the frequent manual calls to garbage_collect/0 in my code seem to be working, though I am not sure about physical memory actually being freed.