1brc -- Billion row challenge

On my machine it completes 1B rows in about 25 min, compared to mine which completes in about 44 min. I think you’re totally right about effort vs reward for a C version. I wrote a Fortran version which runs in <6 seconds on 4 cores, in 193 LoC, in a tenth of the time it took me to write a Haskell version which runs in 7 minutes! “Horses for courses”, as they say in the UK.

1 Like

Lots of interesting takeaways here, the main one being that there’s a heavy price to be paid for converting text from a file into a list of characters just so you can use a DCG, when read_string can (presumably) parse simple records straight from the text buffer.
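
For anyone skimming, a minimal sketch (not the benchmarked code; it assumes the challenge’s “City;Temperature” line format) of what parsing straight off the text buffer with read_string/5 looks like:

% Minimal sketch: consume "City;Temperature\n" records directly with
% read_string/5, never materialising a list of character codes for a DCG.
parse_records(In) :-
    read_string(In, ";", "", Sep, CityS),
    (   Sep == -1                           % end of file
    ->  true
    ;   read_string(In, "\n", "", _, NumS),
        number_string(Temp, NumS),
        format("~w ~w~n", [CityS, Temp]),   % stand-in for real aggregation
        parse_records(In)
    ).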

I find this to be a pretty astonishing result. So that’s ~6 ns per record for the whole job. For comparison, using @jan’s numbers (84% of 9.6 sec for 10M records), read_string alone (i.e., just the string-level parse done in C?) comes to about 800 ns per record; that’s over two orders of magnitude difference. I’m not sure even massive parallelism can explain that.

Well, not really, if you have a look at the Fortran solution and compare it. It’s fully array-based and features:

  • Custom float parser.
  • Custom hash table implementation.
  • mmap.
  • OpenMP for parallel runs.

Further, the gfortran optimising compiler is amazing and generates architecture-native code, vectorised where possible. Finally, OpenMP is state of the art for efficient parallel processing, so adding cores reduces the time (on the same machine) almost linearly, since the problem is “embarrassingly parallel”.

Meanwhile, the best C version runs in 0.15s, additionally relying on packing 256-bit SIMD registers for insanely fast calculation of the critical parts.

The guys on the Haskell forum managed to pull together a version which runs in <4s on a 10-core Mac. Mine ran in 4.3s against the same data on the same hardware.

Other things to note: if you check the main 1BRC project page, all the best Java results are <2s.

I stand by my comment - still pretty astonishing to me. At 15 bytes per record, that’s about 2.5 GB per second. I’m pretty sure I couldn’t read the data from permanent storage that quickly. And a solution that runs in 0.15 sec well exceeds DDR4 memory transfer rates, even if the 15GB of data is pre-loaded, if I understand the published numbers correctly. But, admittedly, there’s a lot I don’t understand.

At the very least, it’s hard (for me) to imagine that normal I/O via OS calls is included in the fastest measured times. What is being measured here?

The official competition starts with the data on a RAM disk, the official benchmarks are just for the Java version, the best being 1.53s. People have all sorts of setups for non official benchmarks. A guy on the Haskell 1brc thread clocked the C version referenced above at 0.7s on his hardware.

Even if starting from permanent storage, a basic PCIe SSD reads at 5-6 GB/s, whilst the fastest (Crucial T705) has a sequential read speed of 14.5 GB/s.

There is a little write-up about the fastest C result here, but I’d encourage others to just give the problem a try. The juice has been worth the squeeze.

1 Like

I updated Jan’s example to do just that. It uses good old concurrent_maplist/4 to parallelise the work. Merging dicts seemed too painful, though, so I converted them to compounds at reduce time.

% Split the file into Partitions byte ranges, aggregate each range on its own
% thread with concurrent_maplist/4, then merge and print the results.
process_file_by_partitions(File, Partitions) :-
    size_file(File,Size),
    Chunk is Size//Partitions,
    P is Partitions-1,
    numlist(0,P,Ps),
    convlist(mult(Chunk),Ps,LBs),      % lower byte bound of each partition
    length(Ds,Partitions),
    LBs = [_|T],
    append(T,[inf],UBs),               % upper bounds; the last partition runs to EOF
    concurrent_maplist(process_file(File),Ds,LBs,UBs),
    convlist(c,Ds,Ss),                 % per-partition dicts -> lists of city/5 compounds
    flatten(Ss,Fs),
    msort(Fs,S),
    reduce(S,Os),
    forall(member(city(City,Min,Max,Count,Sum),Os),
           (   Mean is Sum/Count,
               format('~w = ~1f, ~1f, ~1f~n', [City,Min,Max,Mean])
           )).

% Aggregate rows starting in the byte range (Start, Stop].  If Start > 0, skip
% to the next newline first: a row cut by Start belongs to the previous partition.
process_file(File, Dict, Start, Stop):-
    setup_call_cleanup(
        (open(File, read, In),seek(In,Start,bof,_),(Start>0->skip(In,'\n');true)),
        process_stream(In, #{}, Dict, Stop),
        close(In)).

% Fold records from In into the dict; stop before reading a record once the
% stream's byte position exceeds Stop (the rest belongs to the next partition).
process_stream(In, Data0, Data, Stop) :-
    byte_count(In,Pos),
    (   Pos > Stop
    ->  Data = Data0
    ;   read_string(In, ";", "", Sep, CityS),
        (   Sep == -1
        ->  Data = Data0
        ;   read_string(In, "\n", "", _, NumS),
            atom_string(City, CityS),
            number_string(Num, NumS),
            update(City, Num, Data0, Data1),
            process_stream(In, Data1, Data, Stop)
        )
    ).

% Destructively update the running city(Min,Max,Count,Sum) aggregate with
% nb_setarg/3, so the dict itself never needs to be rebuilt; unseen cities
% get a fresh entry.
update(City, Num, Data, Data) :-
    get_dict(City, Data, CData),
    !,
    CData = city(Min,Max,Count,Sum),
    (   Num < Min
    ->  nb_setarg(1, CData, Num)
    ;   true
    ),
    (   Num > Max
    ->  nb_setarg(2, CData, Num)
    ;   true
    ),
    NewCount is Count+1,
    nb_setarg(3, CData, NewCount),
    NewSum is Sum+Num,
    nb_setarg(4, CData, NewSum).
update(City, Num, Data0, Data) :-
    put_dict(City, Data0, city(Num,Num,1,Num), Data).

mult(A,B,C):-C is A*B.

% Convert a partition's dict into a list of city/5 compounds keyed by station name.
c(I,O):-findall(city(Name,I1,I2,I3,I4),city(I1,I2,I3,I4)=I.Name,O).

% Merge adjacent entries for the same station in the sorted list.
reduce([],[]).
reduce([X],[X]):-!.
reduce([city(Name,I1,I2,I3,I4),city(Name,J1,J2,J3,J4)|Cs],Ks):-
   NewMin is min(I1,J1),
   NewMax is max(I2,J2),
   NewCnt is I3+J3,
   NewTot is I4+J4,
   reduce([city(Name,NewMin,NewMax,NewCnt,NewTot)|Cs],Ks),!.
reduce([H|T],S):-reduce(T,T2),S=[H|T2].

Partitions is best set to your cpu_count. On my WSL instance, 100 million rows completed in 180s with 1 partition and 30s with 16 partitions.
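
A tiny (hypothetical) wrapper can pick the partition count up from the cpu_count flag rather than hard-coding it:

% Hypothetical convenience wrapper: one partition per core.
process_file_parallel(File) :-
    current_prolog_flag(cpu_count, Cores),
    process_file_by_partitions(File, Cores).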

2 Likes

It’s a good try, but am I correct in saying that at least these things are an issue?

  • Partitions seem unaligned, i.e. if a partition cuts a row, it will not adjust the adjacent partition to make sure that the row is included in just one partition.
  • The whole file must end up in memory simultaneously, so if you’ve not enough memory it won’t work. Solutions typically use mmap to avoid this.

No, because it will be picked up by the previous partition, since each thread only consumes rows whose start positions are in its partition range (note that each thread still has access to the full file).

Maybe. What is in memory are the aggregates produced after streaming over a partition range, so at worst it is Num-of-Partitions * Output-Size (every weather station in the output has an entry in each partition prior to reduction). I think the default buffering for streams is full buffering; however, I’m not sure whether streaming over a file loads the full file into memory.

Moreover, I was merely demonstrating a simple map-reduce implementation, as outlined in the post above.

It would be interesting to see how fast Mercury is with this :grinning:

That is rather disappointing. As there is no serious interaction between the threads, the speedup should be pretty close to the number of cores as long as you do not need to rely on hyperthreading. With hyperthreading the speedup seems to depend a lot on the workload.
And of course, we may be faced with throttling and loss of the single CPU frequency boost.

Also, WSL file I/O is, AFAIK, a lot slower than Linux native file I/O, so possibly this is simply I/O bound?

I mean this situation:

Partition N


New York;23.4
Lon

Partition N+1

don;13.4
Paris;-23.2

There isn’t enough information in either partition to complete the entry for London. So, you need to cut the partitions ahead of time or have some kind of rule which backtracks beyond the boundaries of the partitions. On the sample given above, 1 and 5 partitions give different results (I also note that both outputs are wrong).

> process_file_by_partitions("sample.txt",1).
Bridgetown = 26.9, 26.9, 0.0
Bulawayo = 8.9, 8.9, 0.1
Conakry = 31.2, 31.2, 0.0
Cracow = 12.6, 12.6, 0.1
Hamburg = 12.0, 12.0, 0.1
Istanbul = 6.2, 23.0, 0.1
Palembang = 38.8, 38.8, 0.0
Roseau = 34.4, 34.4, 0.0
St. John's = 15.2, 15.2, 0.1
> process_file_by_partitions("sample.txt",5).
Bridgetown = 26.9, 26.9, 0.0
Bulawayo = 8.9, 8.9, 0.1
Conakry = 31.2, 31.2, 0.0
Cracow = 12.6, 12.6, 0.1
Hamburg = 12.0, 12.0, 0.1
Istanbul = 6.2, 23.0, 0.1
Palembang = 38.8, 38.8, 0.0
Roseau = 34.4, 34.4, 0.0

This seems to be the case. It runs in constant memory.

Not disk bound :slight_smile: Note, your original single-threaded code does 1B rows in 25min on my machine. Here is the multi-threaded solution above for 1B rows vs my Fortran code as a high benchmark, on exactly the same file and hardware:

time swipl test.prolog > /dev/null

real    7m14.351s
user    55m35.318s
sys     0m10.728s

Fortran:

time ./1brc > /dev/null

real    0m5.712s
user    0m20.291s
sys     0m0.643s

This single-threaded AWK solution runs in about 6m20s, which I think is a good benchmark for single-threaded performance. My best single-threaded Fortran code runs in about 32s.

There was an off-by-one error, and I had mixed up the Sum and Count variables (I added the output code at the last minute, having previously just looked at the data structures directly). Now corrected.

1 Like

I think I see. So, row start positions are demarcated by either the start of the file or a newline. In the London case above, Partition 1 would extend to the end of the record because it includes the start of the row, whilst Partition 2 would skip to the first newline and begin from there instead: the two processes being complementary.

This kind of problem is well-suited for “map-reduce”. SWI-Prolog can do the “map” part in parallel, but doesn’t (AFAIK) have support for multi-threading the “merge” or the “reduce”/“fold”/“aggregate”.

“Merge” can be partially multi-threaded, by taking the outputs of the “maps” as chunks and partially sorting them, then merging the chunks (also called “shuffling”); and the reduce/fold/aggregate can be multi-threaded by dispatching the same-key chunks to independent threads. Finally, these aggregated outputs are merged (with some additional opportunities for multi-threading).
So, it would be nice to have a multi-core version of msort/2.
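
For example, a minimal sketch of what a multi-core msort/2 could look like (pmsort/2, chunk/3 and merge_sorted/3 are made-up names, not library predicates): sort the chunks on separate threads with concurrent_maplist/3, then merge the sorted chunks.

:- use_module(library(thread)).   % concurrent_maplist/3
:- use_module(library(apply)).    % foldl/4

% Sketch of a parallel msort/2: split into roughly core-sized chunks, sort
% each chunk on its own thread, then merge the sorted chunks.
pmsort(List, Sorted) :-
    current_prolog_flag(cpu_count, N),
    length(List, Len),
    ChunkSize is max(1, Len // N),
    chunk(List, ChunkSize, Chunks),
    concurrent_maplist(msort, Chunks, SortedChunks),
    foldl(merge_sorted, SortedChunks, [], Sorted).

% Split a list into consecutive chunks of (at most) N elements.
chunk([], _, []).
chunk(List, N, [Chunk|Chunks]) :-
    length(Chunk, N),
    append(Chunk, Rest, List),
    !,
    chunk(Rest, N, Chunks).
chunk(List, _, [List]) :-
    List \== [].

% Merge two lists that are already sorted by standard order of terms,
% keeping duplicates (the same contract as msort/2).
merge_sorted([], L, L) :- !.
merge_sorted(L, [], L) :- !.
merge_sorted([X|Xs], [Y|Ys], [X|Zs]) :-
    X @=< Y, !,
    merge_sorted(Xs, [Y|Ys], Zs).
merge_sorted(Xs, [Y|Ys], [Y|Zs]) :-
    merge_sorted(Xs, Ys, Zs).

Whether this beats a single msort/2 depends on the merge cost, which stays single-threaded here; a proper implementation would also merge the sorted chunks pairwise in parallel.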

The distribution of the data is important – Google’s MapReduce(*) has a lot of tuning parameters and on-the-fly analysis (plus quite a bit of stuff for dealing with shards that fail(**) or stall, plus opportunistic parallel processing using different parameters).

(*) I used Google’s MapReduce a lot about 15 years ago [some MR jobs took over a week to run over petabytes of data, with thousands of machines and 10s of thousands of cores] – it’s been replaced by “Flume”, but AFAIK this replacement is to allow a more flexible workflow than map-shard-shuffle-reduce and also more automation in tuning.
(**) Failure in the sense of a disk error, network failure, machine check, out-of-memory, etc.; not in the sense of logic programming failure.

1 Like

I notice you are using Mawk as a flavour of Awk.
Have you considered using alternative flavours?

For your information, here is some data on Gawk’s recent utilization of persistent memory (pm-gawk):

AWK script       Complexity   Run time (sec)   Peak memory      Intermediate
                                               footprint (K)    storage (K)
naive            O(N)         242.132          2,081,360        n/a
rwarray build    O(N)         250.288          2,846,868        156,832
rwarray query    O(W)          11.653          2,081,444
freqtbl build    O(N)         288.408          2,400,120        69,112
freqtbl query    O(W)          11.624          2,336,616
pm-gawk build    O(N)         251.946          2,079,520        2,076,608
pm-gawk query    O(1)           0.026              3,252

Forgive me, but quickly scanning, is it that you are operating off an already compiled file from the cited dataset?

Is it formed from this?

I’m looking forward to benchmarking whether pm-gawk is impactful.

The answer is probably to get one of the “accepted” Java versions working locally and use that to validate. Okay.

Ignore the rest


I know I am very late to the party, but I have a small question. How do we compare the results of this challenge? I was looking at the data-generating script in Python, and it seems that it randomly picks up to 10K location names and randomly generates the temperature data. I didn’t see any seeding in the code. Maybe the Java version produces stable output?

So how do we know if the otherwise fast(er?) implementation is also correct? Do we compare the results between the multiple versions of the solution we have programmed?
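
In the absence of a seeded generator, one pragmatic check (a sketch, assuming each implementation’s output is redirected to its own file, one line per station) is to run two implementations over the same generated measurements file and compare the outputs:

% Succeeds if the two output files contain the same set of lines,
% ignoring line order and surrounding whitespace.  File names are placeholders.
outputs_match(MyOutput, ReferenceOutput) :-
    read_file_to_string(MyOutput, S1, []),
    read_file_to_string(ReferenceOutput, S2, []),
    split_string(S1, "\n", " \t", Lines1),
    split_string(S2, "\n", " \t", Lines2),
    msort(Lines1, Canonical),
    msort(Lines2, Canonical).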