The 1brc challenge! You’ve a billion rows of this:

```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0
```

and you need to calculate min/max/mean for every city. There’s a Python script to generate the row data here.
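The generator script itself isn't reproduced here; as a rough sketch of what such a generator does (the station list, noise model, and row count below are hypothetical, not the official script's):

```python
import random

# Hypothetical subset of stations with a rough mean temperature each;
# the real generator uses a much larger list.
STATIONS = [("Hamburg", 9.7), ("Istanbul", 13.9), ("Conakry", 26.4)]

def write_measurements(path, rows):
    """Write `rows` lines of 'Name;Temp' with one decimal place."""
    with open(path, "w") as f:
        for _ in range(rows):
            name, mean = random.choice(STATIONS)
            # Gaussian noise around the station's mean temperature.
            f.write(f"{name};{random.gauss(mean, 10):.1f}\n")

write_measurements("measurements.txt", 1000)
```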

Here is a straightforward Prolog version, but it’s epically slow: ~92 minutes. AWK takes about 6m30s on the same data, and a competitive (single-threaded) solution starts at roughly 1 minute. Am I doing something obviously wrong somewhere?

```
:- use_module(library(dcg/basics)).
:- use_module(library(hashtable)).  % ht_new/1, ht_update/4, ht_put/3, ht_gen/3

:- initialization(parse, main).

parse :-
    ht_new(HashTable),
    phrase_from_file(lines(HashTable), "measurements.txt"),
    % Enumerate every city and print its aggregated stats.
    forall(ht_gen(HashTable, Key, (Min,Max,Sum,Count)),
           ( Mean is Sum / Count,
             string_codes(Name, Key),
             format('~w = ~1f, ~1f, ~1f~n', [Name, Min, Max, Mean]) )).

line(H) --> string(K), ";", number(V), { update_hash(H, K, V) }.

lines(_) --> [].
lines(H) --> line(H), "\n", !, lines(H).

% Fold one measurement into the running (Min,Max,Sum,Count) for Key.
% ht_update/4 fails when Key is absent, so the first sighting falls
% through to ht_put/3. New is still unbound during the ht_update call
% and only gets bound once the new aggregate has been computed.
update_hash(H, K, V) :-
    (   ht_update(H, K, (Min0, Max0, Sum0, Count0), New)
    ->  Min is min(V, Min0),
        Max is max(V, Max0),
        Sum is V + Sum0,
        Count is Count0 + 1,
        New = (Min, Max, Sum, Count)
    ;   ht_put(H, K, (V, V, V, 1))
    ).
```
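For reference, here is the same single-pass min/max/sum/count aggregation sketched in Python. It is not a speed contender, just a way to pin down the intended semantics; the mean is computed at the end from the running sum and count, exactly as the Prolog version does:

```python
def aggregate(lines):
    # city -> [min, max, sum, count], updated in one pass
    stats = {}
    for line in lines:
        name, value = line.split(";")
        v = float(value)
        s = stats.get(name)
        if s is None:
            stats[name] = [v, v, v, 1]
        else:
            s[0] = min(s[0], v)
            s[1] = max(s[1], v)
            s[2] += v
            s[3] += 1
    # Derive the mean only once all rows have been folded in.
    return {name: (mn, mx, total / n)
            for name, (mn, mx, total, n) in stats.items()}

rows = ["Hamburg;12.0", "Istanbul;6.2", "Istanbul;23.0"]
print(aggregate(rows)["Istanbul"])  # → (6.2, 23.0, 14.6)
```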