1brc -- Billion row challenge

The 1brc challenge! You've got a billion rows like this:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0

and you need to calculate min/max/mean for every city. There’s a Python script to generate the row data here.

Here is a straightforward Prolog version, but it's epically slow: ~92 minutes. AWK takes about 6m30 for the same data, and a competitive (single-threaded) solution starts at roughly 1 minute. Am I doing something obviously wrong somewhere?

:- use_module(library(dcg/basics)).
:- use_module(library(hashtable)).   % ht_new/1, ht_update/4, ht_put/3, ht_gen/3
:- initialization(parse, main).

parse :-
    ht_new(HashTable),
    phrase_from_file(lines(HashTable), "measurements.txt"),
    findall(_, ( ht_gen(HashTable, Key, (Min,Max,Sum,Count)),
                 Mean is Sum / Count,
                 string_codes(Name, Key),
                 format('~w = ~1f, ~1f, ~1f~n', [Name,Min,Max,Mean])
               ), _).

line(H)  --> string(K), ";", number(V), { update_hash(H, K, V) }.
lines(_) --> [].
lines(H) --> line(H), "\n", !, lines(H).

update_hash(H, K, V) :-
    (   ht_update(H, K, (Min0, Max0, Sum0, Count0), New)
    ->  Min is min(V, Min0), Max is max(V, Max0),
        Sum is V + Sum0, Count is Count0 + 1,
        New = (Min, Max, Sum, Count)
    ;   ht_put(H, K, (V, V, V, 1))
    ).

Have you tried using the profiler to determine the bottlenecks? (Suggest you start with a smallish dataset.)

I would immediately point a finger at the findall in there.

You can probably use the same looping technique as library(aggregate), but keep a running min, max, and mean for each city you encounter. You could, for example, use Welford's online algorithm for the mean?
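For the mean, a minimal sketch of such a Welford-style update could look like this (welford_update/3 and the mean/2 state term are made-up names, just to illustrate the idea, not code from this thread):

% Welford's running mean: Mean_n = Mean_{n-1} + (X - Mean_{n-1}) / n.
% The state is mean(Mean, Count).
welford_update(X, mean(Mean0, Count0), mean(Mean, Count)) :-
    Count is Count0 + 1,
    Mean is Mean0 + (X - Mean0) / Count.

Compared with keeping a running Sum and Count, this trades one division per row for not having to accumulate a very large sum.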

[profiler output screenshot]

Updating the hash table and parsing the float come in at 40% and 35% respectively. But even if those were reduced to zero, that would only cut the runtime by 75%: 91 * 0.25 = 22.75 minutes, still more than 20x away from a competitive entry.

This is a bit faster, using a hand-rolled string_upto//2:

:- use_module(library(dcg/basics)).
:- use_module(library(hashtable)).   % ht_new/1, ht_update/4, ht_put/3, ht_gen/3
:- initialization(parse, main).

parse :-
    ht_new(HashTable),
    phrase_from_file(lines(HashTable), "measurements.txt"), !,
    findall(_, ( ht_gen(HashTable, Key, (Min,Max,Sum,Count)),
                 Mean is Sum / Count,
                 format('~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])
               ), _).

line(H) -->
    string_upto(S, 0';),
    string_upto(NC, 0'\n),
    { atom_codes(A, S), number_codes(V, NC), update_hash(H, A, V) }.

lines(_) --> [].
lines(H) --> line(H), lines(H).

string_upto([], C) --> [C], !.
string_upto([H|T], C) --> [H], string_upto(T, C).

update_hash(H, K, V) :-
    (   ht_update(H, K, (Min0, Max0, Sum0, Count0), New)
    ->  Min is min(V, Min0), Max is max(V, Max0),
        Sum is V + Sum0, Count is Count0 + 1,
        New = (Min, Max, Sum, Count)
    ;   ht_put(H, K, (V, V, V, 1))
    ).

Digging a little deeper, it's spending almost half the time in two dcg/basics non-terminals: string//1 and number//1. Looks like a competitive solution will have to start with a competitive parsing implementation, although that's just a start. hashtable:ht_put/6 is at 26%, i.e. about 20 times slower than “competitive” all by itself. Stock library solutions may not scale well when you have huge amounts of data to process.

It might be an interesting exercise to just read the file a line at a time (no line level parsing or processing) and see how close that gets you to your end goal.
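For instance, a minimal sketch of such a baseline might be (count_lines/2 is a hypothetical helper, not from any solution in this thread):

:- use_module(library(readutil)).   % read_line_to_codes/2

% Read each line and discard it, only counting, to measure raw I/O and
% line-splitting cost without any parsing or hash-table work.
count_lines(File, Count) :-
    setup_call_cleanup(
        open(File, read, In),
        count_lines_(In, 0, Count),
        close(In)).

count_lines_(In, N0, N) :-
    read_line_to_codes(In, Line),
    (   Line == end_of_file
    ->  N = N0
    ;   N1 is N0 + 1,
        count_lines_(In, N1, N)
    ).

Timing that against the full program gives a floor for what any line-by-line approach can achieve.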


I've written solutions in AWK, Haskell, Fortran and Prolog. They are all here. The best naive solution by far is in AWK (6m21). Haskell was rather disappointing, and Fortran was really impressive. Fortran is shockingly easy to learn compared to anything I've ever tried before. I got about 2m8 out of the gate, and I managed to grind it down to just 5.7s in 203 LoC on a 4-core laptop.

I don’t think Prolog is really meant for this use case: an interpreting VM and implicit backtracking are going to add lots of unnecessary cycles, but I was a little surprised the baseline is as bad as it is.

It's not obvious to me there's much backtracking in the code where it matters. The VM is going to add some overhead, but I suspect the libraries you're using aren't optimal for this problem. Just looking at the dcg/basics library, for example, it looks like there are multiple predicate calls (think function calls in imperative languages) per character processed:

number(N) -->
    { var(N)
    },
    !,
    int_codes(I),
    (   dot,
        digit(DF0),
        digits(DF)
    ->  {F = [0'., DF0|DF]}
    ;   {F = []}
    ),
    (   exp
    ->  int_codes(DI),
        {E=[0'e|DI]}
    ;   {E = []}
    ),
    { append([I, F, E], Codes),
      number_codes(N, Codes)
    }.
number(N) -->
    { type_error(number, N) }.

sign(0'-) --> "-".
sign(0'+) --> "+".

dot --> ".".

exp --> "e".
exp --> "E".

So while there may be some grounds for treating this as a language comparison exercise, I think it's more of a library comparison. (It doesn't mean you have to use a hammer just because it looks like a nail.)
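As an illustration of the difference, a hand-rolled parser specialised to the 1brc value format (optional sign, integer part, exactly one decimal digit) can get by with far fewer calls per character. This is only a sketch under that assumption; value_tenths//1 and digit//1 are hypothetical non-terminals, and the value comes back in tenths so the aggregation can stay in integer arithmetic:

% Parse e.g. "-12.3" as the integer -123 (tenths of a degree).
value_tenths(V) -->
    (   "-"
    ->  { Sign = -1 }
    ;   { Sign = 1 }
    ),
    digits_value(0, N),        % integer part
    ".",
    digit(Dec),                % the single decimal digit
    { V is Sign * (N * 10 + Dec) }.

digits_value(N0, N) -->
    digit(D), !,
    { N1 is N0 * 10 + D },
    digits_value(N1, N).
digits_value(N, N) --> [].

digit(D) --> [C], { code_type(C, digit(D)) }.

Used in place of number//1, min/max/sum stay integers and only the final report has to divide by 10.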


This code cuts the time to under 50%…
It seems assert/retract is faster than nb_linkval or rbtrees (which I had tried because the task needs a destructive assignment).

:- module(by_capellic_db, []).

:- dynamic rec/5. % Key,Min,Max,Sum,Count

parse :-
  parse(_).

parse(NumLinesFile) :-
  (   integer(NumLinesFile)
  ->  format(atom(File), 'data/~I-measurements.txt', [NumLinesFile])
  ;   File = 'data/measurements.txt'
  ),
  retractall(rec(_,_,_,_,_)),
  open(File, read, Stream, [encoding(utf8)]),
  forall(parse_line(Stream,Name,V), update_rec(Name,V)),
  close(Stream),
  findall(Out, (
    rec(Key,Min,Max,Sum,Count),
    Mean is Sum / Count,
    format(atom(Out), '~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])
  ), Result),
  length(Result,NResult),
  writeln(NResult).

update_rec(A,V) :-
  (   retract(rec(A, Min0, Max0, Sum0, Count0))
  ->  Min is min(V, Min0), Max is max(V, Max0),
      Sum is V + Sum0, Count is Count0 + 1,
      assert(rec(A, Min, Max, Sum, Count))
  ;   assert(rec(A, V, V, V, 1))
  ).

parse_line(Stream, Name, Value) :-
  repeat,
  read_line_to_codes(Stream, Line),
  (   Line == end_of_file
  ->  !, fail
  ;   true
  ),
  append(NameCodes, [0';|ValueCodes], Line),
  atom_codes(Name, NameCodes),
  number_codes(Value, ValueCodes).

This is elegant. But it doesn't seem to work for me on SWI 9.1: for this small sample it just prints 9 at the end, and no rows:

Istanbul;6.2
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
Bridgetown;26.9
Roseau;34.4
Conakry;31.2
Istanbul;23.0

Processing lines like this instead of using dcg/basics cuts run-time by about 20%:

lines(_,[],[]).
lines(H,In,Out) :-
    append(Ks, [0';|Rest],In),!,
    append(Vs, [0'\n|Out0], Rest),!,
    number_codes(V,Vs),
    update_hash(H,Ks,V),
    lines(H, Out0, Out).

Note though that the same input is re-processed multiple times: once to break it into sections, and then again to obtain the number. @CapelliC's version has the same issue.

Yes, I made some rather arbitrary usage modifications, like printing the number of cities found (Istanbul appears twice) instead of the full unsorted result.

This works about as well. It clocks in at about 44m10, vs 91m+ for the original.

:- initialization(parse, main).
:- dynamic rec/5.

parse :-
    phrase_from_file(lines, "measurements.txt"),
    findall(_, ( rec(Key, Min, Max, Sum, Count),
                 Mean is Sum / Count,
                 format('~w = ~1f, ~1f, ~1f~n', [Key,Min,Max,Mean])
               ), _).

lines([], []).
lines(In, Out) :-
    append(Ks, [0';|Rest], In), !,
    append(Vs, [0'\n|Out0], Rest), !,
    number_codes(V, Vs),
    atom_codes(A, Ks),
    update_rec(A, V),
    lines(Out0, Out).

update_rec(A,V) :-
  (   retract(rec(A, Min0, Max0, Sum0, Count0))
  ->  Min is min(V, Min0), Max is max(V, Max0),
      Sum is V + Sum0, Count is Count0 + 1,
      assert(rec(A, Min, Max, Sum, Count))
  ;   assert(rec(A, V, V, V, 1))).

read_line_to_codes/2 eats the newline, so the second append is not required. But we need a failure-driven loop to apply it easily to every line. BTW I think such a pattern could be the most effective way to reduce memory usage, at least with Prologs that have less advanced memory GC than SWI-Prolog… or that lack GC entirely :)
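In isolation the pattern looks like this (process_lines/2 is a hypothetical sketch, not code from any post here):

:- use_module(library(readutil)).   % read_line_to_codes/2

% Failure-driven loop over the lines of File: each iteration fails back
% into repeat/0, so no list of lines is ever built and memory stays flat.
% Goal is called with the line codes and should succeed deterministically.
process_lines(File, Goal) :-
    setup_call_cleanup(
        open(File, read, In),
        (   repeat,
            read_line_to_codes(In, Line),
            (   Line == end_of_file
            ->  !                       % stop the loop at end of file
            ;   call(Goal, Line),
                fail                    % backtrack to repeat for the next line
            )
        ),
        close(In)).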

Yeah, but as a result read_line_to_codes/2 processes the whole line. You then have one append/3 to get the key, which re-processes most of the line, and the float parsing, which parses the rest of it; i.e. it's processed twice. With the double append I think it's processed about 1.5 times.

The best I could get is the code below, processing 10M lines in 9.6 sec (M1, current git version of SWI-Prolog). I also tried 100M, which is exactly 10 times slower, so I think we can claim about 1,000 seconds for the full billion rows.

It spends 84% in read_string/5. I wouldn’t be surprised if this built-in could be optimized a bit further.

Anyway, if I had this task I would write the code in C. It is so simple that it would barely cost more time to write (well, if you have some code lying around to quickly implement the hash table), and it would surely run a lot faster.

process_file(File) :-
    process_file(File, Dict),
    forall(get_dict(City, Dict, city(Min,Max,Count,Sum)),
           (   Mean is Sum/Count,
               format('~w = ~1f, ~1f, ~1f~n', [City,Min,Max,Mean])
           )).

process_file(File, Dict) :-
    setup_call_cleanup(
        open(File, read, In),
        process_stream(In, #{}, Dict),
        close(In)).

process_stream(In, Data0, Data) :-
    read_string(In, ";", "", Sep, CityS),
    (   Sep == -1
    ->  Data = Data0
    ;   read_string(In, "\n", "", _, NumS),
        atom_string(City, CityS),
        number_string(Num, NumS),
        update(City, Num, Data0, Data1),
        process_stream(In, Data1, Data)
    ).

update(City, Num, Data, Data) :-
    get_dict(City, Data, CData),
    !,
    CData = city(Min,Max,Count,Sum),
    (   Num < Min
    ->  nb_setarg(1, CData, Num)
    ;   true
    ),
    (   Num > Max
    ->  nb_setarg(2, CData, Num)
    ;   true
    ),
    NewCount is Count+1,
    nb_setarg(3, CData, NewCount),
    NewSum is Sum+Num,
    nb_setarg(4, CData, NewSum).
update(City, Num, Data0, Data) :-
    put_dict(City, Data0, city(Num,Num,1,Num), Data).

I don't think that Num can be both < Min and > Max, so you can optimize that logic a bit, to:

    (   Num < Min
    ->  nb_setarg(1, CData, Num)
    ;   Num > Max
    ->  nb_setarg(2, CData, Num)
    ;   true
    ),

Well spotted. I planned to look more carefully at the update and try a few alternatives, but after profile/1 told me the culprit is read_string/5, I stopped bothering. I did not expect that.

The fastest solutions use parallelism, most likely some sharding approach, like the solution CalculateAverage_thomaswue.java on the 1brc GitHub site. But I didn't get my head around why parallelism even works. If you think about it in terms of the min, max, count and sum operations alone, it wouldn't make sense.

Maybe if you have random-access files, which should exist in an ISO-compliant Prolog system, you could determine the file size and then seek equal distances to make big chunks. If you are at a seeked file position, you just need to look for the next newline to start your chunk, and you would stop once you are past the next seek distance. This way you could indeed make an embarrassingly parallel and lock-free solution, which also speculates on parallel I/O reads.

If each chunk produces a sorted association list, you can later combine them towards the end result, similar to the merge step of merge sort; because of the aggregation, each list should already be much shorter than its chunk.
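To make the combining step concrete, here is a minimal sketch assuming each chunk reduces to a sorted list of City-agg(Min,Max,Sum,Count) pairs (agg/4, merge_agg/3 and merge_shards/3 are hypothetical names, not taken from any solution above):

% Min/max/sum/count all combine associatively, which is why the per-chunk
% aggregates can be computed independently and merged afterwards.
merge_agg(agg(Min1,Max1,Sum1,Count1), agg(Min2,Max2,Sum2,Count2),
          agg(Min,Max,Sum,Count)) :-
    Min is min(Min1, Min2),
    Max is max(Max1, Max2),
    Sum is Sum1 + Sum2,
    Count is Count1 + Count2.

% Merge two sorted City-Agg lists, much like the merge step of merge sort.
merge_shards([], Bs, Bs) :- !.
merge_shards(As, [], As) :- !.
merge_shards([K1-A1|As], [K2-A2|Bs], Out) :-
    compare(Order, K1, K2),
    (   Order == (=)
    ->  merge_agg(A1, A2, A),
        Out = [K1-A|Rest],
        merge_shards(As, Bs, Rest)
    ;   Order == (<)
    ->  Out = [K1-A1|Rest],
        merge_shards(As, [K2-A2|Bs], Rest)
    ;   Out = [K2-A2|Rest],
        merge_shards([K1-A1|As], Bs, Rest)
    ).

Each per-chunk list has at most one entry per city, so these merges are tiny compared with scanning the raw data.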