I wanted to test how “large dicts” (50’000 entries) behave, i.e. how fast lookup works on them.
The answer: Pretty fast (I still have to compare with library(assoc))!
Unless…
I hit a bit of a phenomenon.
(First things first, this is a “Intel(R) Xeon(R) CPU W3520 @ 2.67GHz” with 20-GiB of RAM, so there is space in the system)
The code (it’s quite big and I have dumped all the necessary predicates into this single file for easy running):
It is supposed to be run like this:
?- using_dict(50_000,500_000,builtin,dict,verbose).
which means
“build a dict with 50’000 entries (which will be the dict on which lookups will be performed) using the builtin dict_create/3, create a lookup sequence (just a list of valid keys) of length 500’000 using a dict (to look up keys randomly by index), and finally be verbose when doing the test, printing a line every 10’000 lookups”,
Then the following happens:
- on the first run, lookup is quite slow, one reaches about 1100 dict lookup/s. So the test runs a few minutes.
- you then consult the code again
- after that the lookup is quite fast, one reaches about 800’000 lookup/s in the large dict. So the test runs in under a second.
I thought this had something to do with the memory growth (hence the printout of process information), or some code problem however, it apparently turns out that it has to do with the lambda expression in the test!
The above is observed if the code uses this implementation to run the 500’000 lookups on Dict
by visiting each key in the list Sequence
. Counters
is just a list of monotonically increasing integers, used to provide a counting service for the predicate in maplist/4.
perform_lookup_and_store_timed_2(Dict,Quiet,Sequence,SequenceLength,Counters) :-
maplist(
({Dict,Quiet,SequenceLength}/[Key,Value,C]
>>
(get_dict(Key,Dict,Value),
emit(Quiet,SequenceLength,C)))
,Sequence,Result,Counters),
length(Result,ResultLength),
format("ResultLength is ~d~n",[ResultLength]).
If the code uses a replacement implementation that does not rely on a library(yall)
expression one reaches top performance on first load (that code is commented out in the source for now):
perform_lookup_and_store_timed_2(Dict,Quiet,Sequence,SequenceLength,Counters) :-
maplist(
inside_maplist_b(Dict,Quiet,SequenceLength),
Sequence,Result,Counters),
length(Result,ResultLength),
format("ResultLength is ~d~n",[ResultLength]).
inside_maplist_b(Dict,Quiet,SequenceLength,Key,Value,C) :-
get_dict(Key,Dict,Value),
emit(Quiet,SequenceLength,C).
I don’t know why though. It’s a mystery!
Took me some time to find it it too.