1brc -- Billion row challenge

The 1brc challenge! You’ve a billion rows of this:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0

and you need to calculate min/max/mean for every city. There’s a Python script to generate the row data here.

Here is a straightforward Prolog version, but it’s epically slow : ~92 minutes. AWK takes about 6m30 for the same data, and a competitive (single threaded) solution starts at roughly 1 minute. Am I doing something obviously wrong somewhere?

``````:- use_module(library(dcg/basics)).
:- initialization(parse, main).

parse :-
ht_new(HashTable),
phrase_from_file(lines(HashTable), "measurements.txt"),
findall(_,( ht_gen(HashTable, Key, (Min,Max,Sum,Count)),
Mean is Sum / Count,
string_codes(Name, Key),
format('~w = ~1f, ~1f, ~1f~n', [Name,Min,Max,Mean])), _).

line(H)  --> string(K), ";", number(V), { update_hash(H,K,V) }.
lines(_) --> [].
lines(H) --> line(H), "\n", !, lines(H).

update_hash(H, K, V) :-
(   ht_update(H, K, (Min0, Max0, Sum0, Count0), New)
->  ( Min is min(V, Min0), Max is max(V, Max0),
Sum is V + Sum0, Count is Count0 + 1,
New = (Min, Max, Sum, Count))
;   ht_put(H, K, (V,V,V,1))).
``````

Have you tried using the profiler to determine the bottlenecks? (Suggest you start with a smallish dataset.)

I would immediately point a finger at the findall in there.

You can probably use the same looping technique as library(aggregate) but keep a running min, max, and a running mean for each city you encounter. You could for example use Welford’s online algorithm for the mean?

Updating the hash table and parsing the float come in at 40% and 35% respectively. But even if those were reduced to zero, that’d make the solution 75% better: 91 * 0.25 = 22.75 minutes: still > 20x away from a competitive entry.

This is a bit faster, using `string_upto`

``````:- use_module(library(dcg/basics)).
:- initialization(parse, main).

parse :-
ht_new(HashTable),
phrase_from_file(lines(HashTable), "measurements.txt"), !,
findall(_,( ht_gen(HashTable, Key, (Min,Max,Sum,Count)),
Mean is Sum / Count,
format('~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])), _).

line(H)  -->
string_upto(S, 0';),
string_upto(NC, 0'\n),
{ atom_codes(A, S), number_codes(V, NC), update_hash(H, A, V) }.
lines(_) --> [].
lines(H) --> line(H), lines(H).

string_upto([], C) --> [C], !.
string_upto([H|T], C) --> [H], string_upto(T, C).

update_hash(H, K, V) :-
(   ht_update(H, K, (Min0, Max0, Sum0, Count0), New)
->  ( Min is min(V, Min0), Max is max(V, Max0),
Sum is V + Sum0, Count is Count0 + 1,
New = (Min, Max, Sum, Count))
;   ht_put(H, K, (V,V,V,1))).
``````
1 Like

Digging a little deeper, it’s spending almost half the time in two `dcg_basics` predicates: `string/1` and `number/1`. Looks like a competitive solution will have to start with a competitive parsing implementation, although that’s just a start. `hashtable:ht_put/6` is 26%, i.e., 20 times slower than “competitive” all by itself. Stock library solutions may not scale well if you have huge amounts of data to process.

It might be an interesting exercise to just read the file a line at a time (no line level parsing or processing) and see how close that gets you to your end goal.

2 Likes

I’ve written solutions in AWK, Haskell, Fortran and Prolog. They are all here. The best naive solution by far is in AWK (6m21). Haskell was rather disappointing, and Fortran was really impressive. Fortran is shockingly easy to learn compared to anything I’ve ever tried before. I got about 2m8 out of the gate, and I manage to grind it down to just 5.7s in 203 LoC on a 4 core laptop.

I don’t think Prolog is really meant for this use case: an interpreting VM and implicit backtracking are going to add lots of unnecessary cycles, but I was a little surprised the baseline is as bad as it is.

It’s not obvious to me there’s much backtracking in the code where it matters. The VM is going to add some overhead, but I suspect the libraries you’re using aren’t optimal for this problem. For example, just looking at the `dcg_basics` library, it looks like there are multiple predicate calls (think function calls for imperative languages) per character processed. For example:

``````number(N) -->
{ var(N)
},
!,
int_codes(I),
(   dot,
digit(DF0),
digits(DF)
->  {F = [0'., DF0|DF]}
;   {F = []}
),
(   exp
->  int_codes(DI),
{E=[0'e|DI]}
;   {E = []}
),
{ append([I, F, E], Codes),
number_codes(N, Codes)
}.
number(N) -->
{ type_error(number, N) }.

sign(0'-) --> "-".
sign(0'+) --> "+".

dot --> ".".

exp --> "e".
exp --> "E".
``````

So while there may be some grounds for treating this as a language comparison exercise, I think it’s more of a library comparison. (It doesn’t mean you have to use a hammer, just because it looks like a nail.)

1 Like

This code cuts the time under 50%…
Seems assert/retract are faster than nb_linkval or rbtree (that I used because there is a destructive assignement)

``````:- module(by_capellic_db, []).

:- dynamic rec/5. % Key,Min,Max,Sum,Count

parse :-
parse(_).

parse(NumLinesFile) :-
(   integer(NumLinesFile)
->  format(atom(File), 'data/~I-measurements.txt', [NumLinesFile])
;   File = 'data/measurements.txt'
),
retractall(rec(_,_,_,_,_)),
forall(parse_line(Stream,Name,V), update_rec(Name,V)),
close(Stream),
findall(Out, (
rec(Key,Min,Max,Sum,Count),
Mean is Sum / Count,
format(atom(Out), '~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])
), Result),
length(Result,NResult),
writeln(NResult).

update_rec(A,V) :-
(   retract(rec(A, Min0, Max0, Sum0, Count0))
->  Min is min(V, Min0), Max is max(V, Max0),
Sum is V + Sum0, Count is Count0 + 1,
assert(rec(A, Min, Max, Sum, Count))
;   assert(rec(A, V, V, V, 1))
).

parse_line(Stream, Name, Value) :-
repeat,
(   Line == end_of_file
->  !, fail
;   true
),
append(NameCodes, [0';|ValueCodes], Line),
atom_codes(Name, NameCodes),
number_codes(Value, ValueCodes).
``````
2 Likes

This is elegant:

It doesn’t seem to work for me on SWI 9.1 for this small sample it just prints `9` at the end but no rows:

Istanbul;6.2
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
Bridgetown;26.9
Roseau;34.4
Conakry;31.2
Istanbul;23.0

Processing lines like this instead of using dcg/basics cuts run-time by about 20%:

``````lines(_,[],[]).
lines(H,In,Out) :-
append(Ks, [0';|Rest],In),!,
append(Vs, [0'\n|Out0], Rest),!,
number_codes(V,Vs),
update_hash(H,Ks,V),
lines(H, Out0, Out).
``````

Note though that the same input is re-processed multiple times. Once to break it into sections and then again to obtain the number. @CapelliC version has the same issue.

yes, I did some rather arbitrary usage modifications, like to print the number of cities found (Istanbul appears twice) instead of the full unsorted result

This works about as well. Clocks in at about 44m10 vs 91m+ original.

``````:- initialization(parse, main).
:- dynamic rec/5.

parse :-
phrase_from_file(lines, "measurements.txt"),
findall(_,( rec(Key, Min, Max, Sum, Count),
Mean is Sum / Count,
format('~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])), _).

lines([],[]).
lines(In,Out) :-
append(Ks, [0';|Rest],In),!,
append(Vs, [0'\n|Out0], Rest),!,
number_codes(V,Vs),
atom_codes(A,Ks),
update_rec(A,V),
lines(Out0, Out).

update_rec(A,V) :-
(   retract(rec(A, Min0, Max0, Sum0, Count0))
->  Min is min(V, Min0), Max is max(V, Max0),
Sum is V + Sum0, Count is Count0 + 1,
assert(rec(A, Min, Max, Sum, Count))
;   assert(rec(A, V, V, V, 1))).
``````

read_line_to_codes/2 eats the nl, so the second append is not required. But we need a failure driven loop to apply it easily to every line. BTW I think that such pattern could be the most effective to reduce memory usage, at least with Prologs with less advanced memory GC that SWI-Prolog… or that miss completely any GC

yeah but as a result read_line_to_codes processes the whole line. you then have one append to get the key which double processes most of the line again, and the float parsing which parses the rest of it. I.e. its processed twice. With the double append I think its processed 1.5 times.

The best I could get is the code below, processing 10M lines in 9.6 sec (M1, current GIT version of Prolog). Also tried 100M, which is exactly 10 times slower, so I think we can claim about 1,000 seconds.

It spends 84% in read_string/5. I wouldn’t be surprised if this built-in could be optimized a bit further.

Anyway, if I had this task I would write the code in C. It is so simple that it would barely cost more time to write (well, if you have some code lying around to quicky inplement the hash table) and it would surely run a lot faster.

``````process_file(File) :-
process_file(File, Dict),
forall(get_dict(City, Dict, city(Min,Max,Count,Sum)),
(   Mean is Sum/Count,
format('~w = ~1f, ~1f, ~1f~n', [City,Min,Max,Mean])
)).

process_file(File, Dict) :-
setup_call_cleanup(
process_stream(In, #{}, Dict),
close(In)).

process_stream(In, Data0, Data) :-
(   Sep == -1
->  Data = Data0
;   read_string(In, "\n", "", _, NumS),
atom_string(City, CityS),
number_string(Num, NumS),
update(City, Num, Data0, Data1),
process_stream(In, Data1, Data)
).

update(City, Num, Data, Data) :-
get_dict(City, Data, CData),
!,
CData = city(Min,Max,Count,Sum),
(   Num < Min
->  nb_setarg(1, CData, Num)
;   true
),
(   Num > Max
->  nb_setarg(2, CData, Num)
;   true
),
NewCount is Count+1,
nb_setarg(3, CData, NewCount),
NewSum is Sum+Num,
nb_setarg(4, CData, NewSum).
update(City, Num, Data0, Data) :-
put_dict(City, Data0, city(Num,Num,1,Num), Data).
``````
1 Like

I don’t think that Num can be both `< Min` and `> Max` - so can optimize that logic a bit, to:

``````    (   Num < Min
->  nb_setarg(1, CData, Num)
;   Num > Max
->  nb_setarg(2, CData, Num)
;   true
),
``````
2 Likes

Well spotted. I planned to look more careful at the update and try a few alternatives, but after profile/1 told me the culprit is read_string/5, I stopped bothering. I did not expect that.

The fastest solutions uses parallelism. Most likely some sharding approach.
Like the solution `CalculateAverage_thomaswue.java` on the 1brc GitHub website.
But I didn’t get my head around why parallelism even works. If you think

about it in the terms of min, max, count and sum operations alone, it wouldn’t
make sense. Maybe if you have random access files, which should exist
in an ISO compliant Prolog system, you could determine the file size,

and then seek equal distances, to make big bouquets. If you are at a seeked
file position, you just need to look for the next newline to start your bouquet, and
you would stop when you are past the next seek distance. This way you could

indeed make an embarrassingly parallel and lock-free solution, which also
speculates on parallel I/O read. If each bouquet produces a sorted association
list, you can later combine them towards the end results similare as in merge sort,

because of the aggregation they should be already shorter than each shard.