Giter Site home page Giter Site logo

gunnarmorling / 1brc Goto Github PK

View Code? Open in Web Editor NEW
6.0K 58.0 1.8K 2.68 MB

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

Home Page: https://www.morling.dev/blog/one-billion-row-challenge/

License: Apache License 2.0

Shell 10.35% Java 88.76% Python 0.22% Dockerfile 0.06% Go 0.60%
1brc challenges

1brc's Introduction

1️⃣🐝🏎️ The One Billion Row Challenge

Status Feb 4: The final leaderboards have been published. Congrats to all the winners, and a big thank you to everyone participating in this challenge as well as to everyone helping to organize it!

Status Feb 3: All entries have been evaluated and I am in the process of finalizing the leaderboards.

Status Feb 1: The challenge has been closed for new submissions. No new pull requests for adding submissions are accepted at this time. Pending PRs will be evaluated over the next few days.

Status Jan 31: The challenge will close today at midnight UTC.

Status Jan 12: As there has been such a large number of entries to this challenge so far (100+), and this is becoming hard to manage, please only create new submissions if you expect them to run in 10 seconds or less on the evaluation machine.

Status Jan 1: This challenge is open for submissions!

Sponsorship

A big thank you to my employer Decodable for funding the evaluation environment and supporting this challenge!

The One Billion Row Challenge (1BRC) is a fun exploration of how far modern Java can be pushed for aggregating one billion rows from a text file. Grab all your (virtual) threads, reach out to SIMD, optimize your GC, or pull any other trick, and create the fastest implementation for solving this task!

1BRC

The text file contains temperature values for a range of weather stations. Each row is one measurement in the format <string: station name>;<double: measurement>, with the measurement value having exactly one fractional digit. The following shows ten rows as an example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
Bridgetown;26.9
Istanbul;6.2
Roseau;34.4
Conakry;31.2
Istanbul;23.0

The task is to write a Java program which reads the file, calculates the min, mean, and max temperature value per weather station, and emits the results on stdout like this (i.e. sorted alphabetically by station name, and the result values per station in the format <min>/<mean>/<max>, rounded to one fractional digit):

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...}

Submit your implementation by Jan 31 2024 and become part of the leaderboard!

Results

These are the results from running all entries into the challenge on eight cores of a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).

# Result (m:s.ms) Implementation JDK Submitter Notes Certificates
1 00:01.535 link 21.0.2-graal Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen GraalVM native binary, uses Unsafe Certificate
2 00:01.587 link 21.0.2-graal Artsiom Korzun GraalVM native binary, uses Unsafe Certificate
3 00:01.608 link 21.0.2-graal Jaromir Hamala GraalVM native binary, uses Unsafe Certificate
00:01.880 link 21.0.1-open Serkan ÖZAL uses Unsafe Certificate
00:01.921 link 21.0.2-graal Van Phu DO GraalVM native binary, uses Unsafe Certificate
00:02.018 link 21.0.2-graal Stephen Von Worley GraalVM native binary, uses Unsafe Certificate
00:02.157 link 21.0.2-graal Roy van Rijn GraalVM native binary, uses Unsafe Certificate
00:02.319 link 21.0.2-graal Yavuz Tas GraalVM native binary, uses Unsafe Certificate
00:02.332 link 21.0.2-graal Marko Topolnik GraalVM native binary, uses Unsafe Certificate
00:02.367 link 21.0.1-open Quan Anh Mai uses Unsafe Certificate
00:02.507 link 21.0.1-open gonix uses Unsafe Certificate
00:02.557 link 21.0.1-open yourwass uses Unsafe Certificate
00:02.820 link 22.ea.32-open Li Lin uses Unsafe Certificate
00:02.995 link 21.0.2-graal tivrfoa GraalVM native binary, uses Unsafe Certificate
00:02.997 link 21.0.1-open gonix Certificate
00:03.095 link 21.0.2-graal Jamal Mulla GraalVM native binary, uses Unsafe Certificate
00:03.210 link 21.0.1-open Quan Anh Mai Certificate
00:03.298 link 21.0.1-graal Subrahmanyam uses Unsafe Certificate
00:03.431 link 21.0.1-graal Roman Musin GraalVM native binary, uses Unsafe Certificate
00:03.469 link 21.0.2-graal Elliot Barlas GraalVM native binary, uses Unsafe Certificate
00:03.698 link 21.0.1-graal Jason Nochlin Certificate
00:03.785 link 21.0.2-graal zerninv GraalVM native binary, uses Unsafe Certificate
00:03.820 link 21.0.2-graal John Ziamos GraalVM native binary, uses Unsafe Certificate
00:03.902 link 21.0.1-open Juan Parera Certificate
00:03.966 link 21.0.1-open Jin Cong Ho uses Unsafe Certificate
00:03.991 link 21.0.1-graal Vaidhy Mayilrangam uses Unsafe Certificate
00:04.066 link 21.0.1-open JesseVanRooy uses Unsafe Certificate
00:04.101 link 21.0.2-graal Jaime Polidura GraalVM native binary, uses Unsafe Certificate
00:04.209 link 21.0.1-open Giovanni Cuccu Certificate
00:04.474 link 21.0.1-open Roman Stoffel Certificate
00:04.676 link 21.0.2-tem Peter Levart Certificate
00:04.684 link 21.0.1-open Florin Blanaru uses Unsafe Certificate
00:04.701 link 21.0.1-open Dr Ian Preston Certificate
00:04.741 link 21.0.1-open Cliff Click uses Unsafe Certificate
00:04.800 link 21.0.1-open Parker Timmins Certificate
00:04.884 link 21.0.1-open Aleksey Shipilëv Certificate
00:04.920 link 21.0.1-graal Subrahmanyam Certificate
00:05.077 link 21.0.2-graal Jonathan Wright GraalVM native binary, uses Unsafe Certificate
00:05.142 link 21.0.1-open Arjen Wisse Certificate
00:05.167 link 21.0.2-open Yevhenii Melnyk Certificate
00:05.235 link 21.0.1-open unbounded Certificate
00:05.336 link java Sumit Chaudhary uses Unsafe Certificate
00:05.354 link 21.0.2-graal Arman Sharif GraalVM native binary, uses Unsafe Certificate
00:05.478 link 21.0.1-open Olivier Bourgain uses Unsafe Certificate
00:05.559 link 21.0.1-graal Panagiotis Drakatos GraalVM native binary Certificate
00:05.887 link 21.0.1-graal Charlie Evans uses Unsafe Certificate
00:05.979 link 21.0.1-graal Sam Pullara Certificate
00:06.166 link 21.0.1-open Jamie Stansfield Certificate
00:06.257 link 21.0.1-graal Stefan Sprenger uses Unsafe Certificate
00:06.392 link 21.0.2-graal Diego Parra Certificate
00:06.576 link 21.0.1-open Andrew Sun uses Unsafe Certificate
00:06.635 link 21.0.1-graal Laake Scates-Gervasi GraalVM native binary, uses Unsafe Certificate
00:06.654 link 21.0.1-graal Jaroslav Bachorik Certificate
00:06.715 link 21.0.1-open Algirdas Raščius Certificate
00:06.884 link 21.0.1-graal rcasteltrione Certificate
00:06.982 link 21.0.1-open Chris Bellew Certificate
00:07.563 link 21.0.1-graal 3j5a Certificate
00:07.680 link 21.0.1-graal Xylitol uses Unsafe Certificate
00:07.712 link 21.0.1-graal Anita SV Certificate
00:07.730 link 21.0.1-open Johannes Schüth Certificate
00:07.894 link 21.0.2-tem Antonio Muñoz Certificate
00:07.925 link 21.0.1-graal Ricardo Pieper Certificate
00:07.948 link java Smoofie uses Unsafe Certificate
00:08.157 link 21.0.1-open JurenIvan Certificate
00:08.167 link 21.0.1-tem Dimitar Dimitrov Certificate
00:08.214 link 21.0.1-open deemkeen Certificate
00:08.255 link 21.0.1-open Mathias Bjerke Certificate
00:08.398 link 21.0.1-open Parth Mudgal uses Unsafe Certificate
00:08.489 link 21.0.1-graal Bang NGUYEN Certificate
00:08.517 link 21.0.1-graal ags uses Unsafe Certificate
00:08.557 link 21.0.1-graal Adrià Cabeza Certificate
00:08.622 link 21.0.1-graal Keshavram Kuduwa uses Unsafe Certificate
00:08.892 link 21.0.1-open Roman Romanchuk Certificate
00:08.896 link 21.0.1-open Andrzej Nestoruk Certificate
00:09.020 link 21.0.1-open yemreinci Certificate
00:09.071 link 21.0.1-open Gabriel Reid Certificate
00:09.352 link 21.0.1-graal Filip Hrisafov Certificate
00:09.725 link 21.0.2-graal Martin GraalVM native binary Certificate
00:09.867 link 21.0.1-graal Ricardo Pieper Certificate
00:09.945 link 21.0.1-open Anthony Goubard Certificate
00:10.092 link 21.0.1-graal Pratham Certificate
00:10.127 link 21.0.1-open Parth Mudgal uses Unsafe Certificate
00:11.577 link 21.0.1-open Eve Certificate
00:10.473 link 21.0.1-open Anton Rybochkin Certificate
00:11.119 link 21.0.1-open lawrey Certificate
00:11.156 link java Yann Moisan Certificate
00:11.167 link 21.0.1-open Nick Palmer Certificate
00:11.352 link 21.0.1-open karthikeyan97 uses Unsafe Certificate
00:11.363 link 21.0.2-tem Guruprasad Sridharan Certificate
00:11.405 link 21.0.1-graal Rafael Merino García Certificate
00:11.406 link 21.0.1-graal gabrielfoo Certificate
00:11.433 link 21.0.1-graal Jatin Gala Certificate
00:11.505 link 21.0.1-open Dmitry Bufistov uses Unsafe Certificate
00:11.744 link 21.0.2-tem Sebastian Lövdahl Certificate
00:11.805 link 21.0.1-graal Cool_Mineman Certificate
00:11.934 link 21.0.1-open arjenvaneerde Certificate
00:12.220 link 21.0.1-open Richard Startin Certificate
00:12.495 link 21.0.1-graal Samuel Yvon GraalVM native binary Certificate
00:12.568 link 21.0.1-graal Vlad Certificate
00:12.800 link java Yonatan Graber Certificate
00:13.013 link 21.0.1-graal Thanh Duong Certificate
00:13.071 link 21.0.1-open Dr Ian Preston Certificate
00:13.729 link java Cedric Boes Certificate
00:13.817 link 21.0.1-open Carlo Certificate
00:14.502 link 21.0.1-graal eriklumme Certificate
00:14.772 link 21.0.1-open Kevin McMurtrie Certificate
00:14.867 link 21.0.1-open Michael Berry Certificate
00:14.900 link java Judekeyser Certificate
00:15.006 link java Paweł Adamski Certificate
00:15.662 link 21.0.1-open Serghei Motpan Certificate
00:16.063 link 21.0.1-open Marek Kohn Certificate
00:16.457 link 21.0.1-open Aleksei Certificate
00:16.953 link 21.0.1-open Gaurav Anantrao Deshmukh Certificate
00:17.046 link 21.0.1-open Dimitris Karampinas Certificate
00:17.086 link java Breejesh Rathod Certificate
00:17.490 link 21.0.1-open Gergely Kiss Certificate
00:17.255 link 21.0.1-open tkosachev Certificate
00:17.520 link 21.0.1-open Farid Certificate
00:17.717 link 21.0.1-open Oleh Marchenko Certificate
00:17.815 link 21.0.1-open Hallvard Trætteberg Certificate
00:17.932 link 21.0.1-open Bartłomiej Pietrzyk Certificate
00:18.251 link 21.0.1-graal Markus Ebner Certificate
00:18.448 link 21.0.1-open Moysés Borges Furtado Certificate
00:18.771 link 21.0.1-graal David Kopec Certificate
00:18.902 link 21.0.1-graal Maxime Certificate
00:19.357 link 21.0.1-graalce Roman Schweitzer Certificate
00:20.691 link 21.0.1-graal Kidlike GraalVM native binary Certificate
00:21.989 link 21.0.1-open couragelee Certificate
00:22.188 link 21.0.1-open Jairo Graterón Certificate
00:22.334 link 21.0.1-open Alberto Venturini Certificate
00:22.457 link 21.0.1-open Ramzi Ben Yahya Certificate
00:22.471 link 21.0.1-open Shivam Agarwal Certificate
00:24.986 link 21.0.1-open kumarsaurav123 Certificate
00:25.064 link 21.0.2-open Sudhir Tumati Certificate
00:26.500 link 21.0.1-open Bruno Félix Certificate
00:28.381 link 21.0.1-open Hampus Certificate
00:29.741 link 21.0.1-open Matteo Vaccari Certificate
00:32.018 link 21.0.1-open Aurelian Tutuianu Certificate
00:34.388 link 21.0.1-tem Tobi Certificate
00:35.875 link 21.0.1-open MahmoudFawzyKhalil Certificate
00:36.180 link 21.0.1-open Horia Chiorean Certificate
00:36.424 link java Manish Garg Certificate
00:38.340 link 21.0.1-open AbstractKamen Certificate
00:41.982 link 21.0.1-open Chris Riccomini Certificate
00:42.893 link 21.0.1-open javamak Certificate
00:46.597 link 21.0.1-open Maeda-san Certificate
00:58.811 link 21.0.1-open Ujjwal Bharti Certificate
01:05.094 link 21.0.1-open Mudit Saxena Certificate
01:05.979 link 21.0.1-graal Hieu Dao Quang Certificate
01:06.790 link 21.0.1-open Karl Heinz Marbaise Certificate
01:06.944 link 21.0.1-open santanu Certificate
01:07.014 link 21.0.1-open pedestrianlove Certificate
01:07.101 link 21.0.1-open Jeevjyot Singh Chhabda Certificate
01:08.811 link 21.0.1-open Aleš Justin Certificate
01:08.908 link 21.0.1-open itaske Certificate
01:09.595 link 21.0.1-tem Antonio Goncalves Certificate
01:09.882 link 21.0.1-open Prabhu R Certificate
01:14.815 link 21.0.1-open twohardthings Certificate
01:25.801 link 21.0.1-open ivanklaric Certificate
01:33.594 link 21.0.1-open Gaurav Mathur Certificate
01:53.208 link java Mahadev K Certificate
01:56.607 link 21.0.1-open Abhilash Certificate
03:43.521 link 21.0.1-open 김예환 Ye-Hwan Kim (Sam) Certificate
03:59.760 link 21.0.1-open Samson Certificate
---
04:49.679 link (Baseline) 21.0.1-open Gunnar Morling

Note that I am not super-scientific in the way I'm running the contenders (see Evaluating Results for the details). This is not a high-fidelity micro-benchmark and there can be variations of up to +-3% between runs. So don't be too hung up on the exact ordering of your entry compared to others in close proximity. The primary purpose of this challenge is to learn something new, have fun along the way, and inspire others to do the same. The leaderboard is only means to an end for achieving this goal. If you observe drastically different results though, please open an issue.

See Entering the Challenge for instructions how to enter the challenge with your own implementation. The Show & Tell features a wide range of 1BRC entries built using other languages, databases, and tools.

Bonus Results

This section lists results from running the fastest N entries with different configurations. As entries have been optimized towards the specific conditions of the original challenge description and set-up (such as size of the key set), challenge entries may perform very differently across different configurations. These bonus results are provided here for informational purposes only. For the 1BRC challenge, only the results in the previous section are of importance.

32 Cores / 64 Threads

For officially evaluating entries into the challenge, each contender is run on eight cores of the evaluation machine (AMD EPYC™ 7502P). Here are the results from running the top 50 entries (as of commit e1fb378a, Feb 2) on all 32 cores / 64 threads (i.e. SMT is enabled) of the machine:

# Result (m:s.ms) Implementation JDK Submitter Notes
1 00:00.323 link 21.0.2-graal Jaromir Hamala GraalVM native binary, uses Unsafe
2 00:00.326 link 21.0.2-graal Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen GraalVM native binary, uses Unsafe
3 00:00.349 link 21.0.2-graal Artsiom Korzun GraalVM native binary, uses Unsafe
00:00.351 link 21.0.2-graal Van Phu DO GraalVM native binary, uses Unsafe
00:00.389 link 21.0.2-graal Stephen Von Worley GraalVM native binary, uses Unsafe
00:00.408 link 21.0.2-graal Yavuz Tas GraalVM native binary, uses Unsafe
00:00.415 link 21.0.2-graal Roy van Rijn GraalVM native binary, uses Unsafe
00:00.499 link 21.0.2-graal Marko Topolnik GraalVM native binary, uses Unsafe
00:00.602 link 21.0.1-graal Roman Musin GraalVM native binary, uses Unsafe
00:00.623 link 21.0.1-open gonix uses Unsafe
00:00.710 link 21.0.2-graal Jamal Mulla GraalVM native binary, uses Unsafe
00:00.727 link 21.0.2-graal tivrfoa GraalVM native binary, uses Unsafe
00:00.774 link 21.0.1-open Serkan ÖZAL uses Unsafe
00:00.788 link 21.0.2-graal Elliot Barlas GraalVM native binary, uses Unsafe
00:00.832 link 21.0.2-graal zerninv GraalVM native binary, uses Unsafe
00:00.840 link 21.0.1-open gonix
00:00.857 link 21.0.2-graal Jaime Polidura GraalVM native binary, uses Unsafe
00:00.880 link 21.0.2-graal John Ziamos GraalVM native binary, uses Unsafe
00:00.939 link 21.0.1-open Aleksey Shipilëv
00:01.026 link 21.0.1-open JesseVanRooy uses Unsafe
00:01.118 link 21.0.2-graal Jonathan Wright GraalVM native binary
00:01.140 link 21.0.2-graal Arman Sharif GraalVM native binary, uses Unsafe
00:01.143 link 21.0.1-open Cliff Click uses Unsafe
00:01.169 link 21.0.2-open Yevhenii Melnyk
00:01.188 link 21.0.1-graal Subrahmanyam uses Unsafe
00:01.193 link 21.0.1-open Florin Blanaru uses Unsafe
00:01.234 link 21.0.1-open Olivier Bourgain uses Unsafe
00:01.242 link 21.0.1-open Quan Anh Mai uses Unsafe
00:01.252 link 21.0.1-open Jin Cong Ho uses Unsafe
00:01.267 link 22.ea.32-open Li Lin uses Unsafe
00:01.363 link 21.0.2-tem Peter Levart
00:01.380 link 21.0.1-graal Jason Nochlin
00:01.391 link 21.0.1-open Quan Anh Mai
00:01.439 link 21.0.1-open Arjen Wisse
00:01.446 link 21.0.1-open Dr Ian Preston
00:01.504 link 21.0.1-open Jamie Stansfield
00:01.514 link 21.0.1-graal Subrahmanyam
00:01.516 link 21.0.1-graal Vaidhy Mayilrangam uses Unsafe
00:01.586 link 21.0.1-open yourwass uses Unsafe
00:01.647 link 21.0.2-graal Diego Parra
00:01.694 link 21.0.1-open Parker Timmins
00:01.694 link 21.0.1-graal Charlie Evans uses Unsafe
00:01.702 link 21.0.1-graal Sam Pullara
00:01.733 link java Sumit Chaudhary uses Unsafe
00:01.742 link 21.0.1-open unbounded
00:02.241 link 21.0.1-graal Stefan Sprenger uses Unsafe
00:02.294 link 21.0.1-open Giovanni Cuccu
00:02.990 link 21.0.1-graal Panagiotis Drakatos GraalVM native binary
00:03.205 link 21.0.1-open Juan Parera
00:10.929 link 21.0.1-open Roman Stoffel

10K Key Set

The 1BRC challenge data set contains 413 distinct weather stations, whereas the rules allow for 10,000 different station names to occur. Here are the results from running the top 40 entries (as of commit e1fb378a, Feb 2) against 1,000,000,000 measurement values across 10K stations (created via ./create_measurements3.sh 1000000000), using eight cores on the evaluation machine:

# Result (m:s.ms) Implementation JDK Submitter Notes
1 00:02.957 link 21.0.2-graal Artsiom Korzun GraalVM native binary, uses Unsafe
2 00:03.058 link 21.0.2-graal Marko Topolnik GraalVM native binary, uses Unsafe
3 00:03.186 link 21.0.2-graal Stephen Von Worley GraalVM native binary, uses Unsafe
00:03.998 link 21.0.2-graal Roy van Rijn GraalVM native binary, uses Unsafe
00:04.042 link 21.0.2-graal Jaromir Hamala GraalVM native binary, uses Unsafe
00:04.289 link 21.0.1-open gonix uses Unsafe
00:04.522 link 21.0.2-graal tivrfoa GraalVM native binary, uses Unsafe
00:04.653 link 21.0.2-graal Jamal Mulla GraalVM native binary, uses Unsafe
00:04.733 link 21.0.1-open gonix
00:04.836 link 21.0.1-graal Subrahmanyam uses Unsafe
00:04.870 link 21.0.2-graal Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen GraalVM native binary, uses Unsafe
00:05.240 link 21.0.2-graal zerninv GraalVM native binary, uses Unsafe
00:05.394 link 21.0.2-graal Yavuz Tas GraalVM native binary, uses Unsafe
00:05.906 link 21.0.2-graal Elliot Barlas GraalVM native binary, uses Unsafe
00:06.086 link 21.0.2-graal Van Phu DO GraalVM native binary, uses Unsafe
00:06.379 link 21.0.2-graal John Ziamos GraalVM native binary, uses Unsafe
00:07.113 link 21.0.2-open Yevhenii Melnyk
00:07.542 link 21.0.2-graal Jonathan Wright GraalVM native binary
00:07.889 link 21.0.1-open Florin Blanaru uses Unsafe
00:07.970 link 21.0.1-open Cliff Click uses Unsafe
00:08.857 link 21.0.1-open Serkan ÖZAL
00:09.333 link 21.0.1-open yourwass uses Unsafe
00:09.722 link 21.0.1-open Aleksey Shipilëv
00:09.777 link 21.0.1-graal Vaidhy Mayilrangam uses Unsafe
00:10.263 link 21.0.1-open Quan Anh Mai uses Unsafe
00:11.154 link 21.0.1-open Parker Timmins
00:13.175 link 21.0.1-open Quan Anh Mai
00:13.245 link 21.0.1-open Dr Ian Preston
00:13.377 link 21.0.1-open Giovanni Cuccu
00:13.761 link 21.0.1-open Juan Parera
00:14.441 link 21.0.2-tem Peter Levart
00:15.548 link 21.0.1-open Jin Cong Ho uses Unsafe
00:17.906 link 21.0.1-graal Jason Nochlin
00:18.770 link 22.ea.32-open Li Lin uses Unsafe
00:19.106 link 21.0.1-open Roman Stoffel
00:20.151 link 21.0.1-graal Roman Musin GraalVM native binary, uses Unsafe; seg-faults occassionally
00:22.953 link 21.0.2-graal Jaime Polidura GraalVM native binary, uses Unsafe
---
DNF link 21.0.1-open JesseVanRooy Incorrect output
DNF link 21.0.1-graal Subrahmanyam Doesn't complete in 60 sec
DNF link 21.0.1-open Arjen Wisse Incorrect output

Prerequisites

Java 21 must be installed on your system.

Running the Challenge

This repository contains two programs:

  • dev.morling.onebrc.CreateMeasurements (invoked via create_measurements.sh): Creates the file measurements.txt in the root directory of this project with a configurable number of random measurement values
  • dev.morling.onebrc.CalculateAverage (invoked via calculate_average_baseline.sh): Calculates the average values for the file measurements.txt

Execute the following steps to run the challenge:

  1. Build the project using Apache Maven:

    ./mvnw clean verify
    
  2. Create the measurements file with 1B rows (just once):

    ./create_measurements.sh 1000000000
    

    This will take a few minutes. Attention: the generated file has a size of approx. 12 GB, so make sure to have enough diskspace.

    If you're running the challenge with a non-Java language, there's a non-authoritative Python script to generate the measurements file at src/main/python/create_measurements.py. The authoritative method for generating the measurements is the Java program dev.morling.onebrc.CreateMeasurements.

  3. Calculate the average measurement values:

    ./calculate_average_baseline.sh
    

    The provided naive example implementation uses the Java streams API for processing the file and completes the task in ~2 min on environment used for result evaluation. It serves as the base line for comparing your own implementation.

  4. Optimize the heck out of it:

    Adjust the CalculateAverage program to speed it up, in any way you see fit (just sticking to a few rules described below). Options include parallelizing the computation, using the (incubating) Vector API, memory-mapping different sections of the file concurrently, using AppCDS, GraalVM, CRaC, etc. for speeding up the application start-up, choosing and tuning the garbage collector, and much more.

Flamegraph/Profiling

A tip is that if you have jbang installed, you can get a flamegraph of your program by running async-profiler via ap-loader:

jbang --javaagent=ap-loader@jvm-profiling-tools/ap-loader=start,event=cpu,file=profile.html -m dev.morling.onebrc.CalculateAverage_yourname target/average-1.0.0-SNAPSHOT.jar

or directly on the .java file:

jbang --javaagent=ap-loader@jvm-profiling-tools/ap-loader=start,event=cpu,file=profile.html src/main/java/dev/morling/onebrc/CalculateAverage_yourname

When you run this, it will generate a flamegraph in profile.html. You can then open this in a browser and see where your program is spending its time.

Rules and limits

  • Any of these Java distributions may be used:
    • Any builds provided by SDKMan
    • Early access builds available on openjdk.net may be used (including EA builds for OpenJDK projects like Valhalla)
    • Builds on builds.shipilev.net If you want to use a build not available via these channels, reach out to discuss whether it can be considered.
  • No external library dependencies may be used
  • Implementations must be provided as a single source file
  • The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary
  • Input value ranges are as follows:
    • Station name: non null UTF-8 string of min length 1 character and max length 100 bytes, containing neither ; nor \n characters. (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.)
    • Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
  • There is a maximum of 10,000 unique station names
  • Line endings in the file are \n characters on all platforms
  • Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported
  • The rounding of output values must be done using the semantics of IEEE 754 rounding-direction "roundTowardPositive"

Entering the Challenge

To submit your own implementation to 1BRC, follow these steps:

  • Create a fork of the onebrc GitHub repository.
  • Run ./create_fork.sh <your_GH_user> to copy the baseline implementation to your personal files, or do this manually:
    • Create a copy of CalculateAverage_baseline.java, named CalculateAverage_<your_GH_user>.java, e.g. CalculateAverage_doloreswilson.java.
    • Create a copy of calculate_average_baseline.sh, named calculate_average_<your_GH_user>.sh, e.g. calculate_average_doloreswilson.sh.
    • Adjust that script so that it references your implementation class name. If needed, provide any JVM arguments via the JAVA_OPTS variable in that script. Make sure that script does not write anything to standard output other than calculation results.
    • (Optional) OpenJDK 21 is used by default. If a custom JDK build is required, create a copy of prepare_baseline.sh, named prepare_<your_GH_user>.sh, e.g. prepare_doloreswilson.sh. Include the SDKMAN command sdk use java [version] in the your prepare script.
    • (Optional) If you'd like to use native binaries (GraalVM), add all the required build logic to your prepare_<your_GH_user>.sh script.
  • Make that implementation fast. Really fast.
  • Run the test suite by executing /test.sh <your_GH_user>; if any differences are reported, fix them before submitting your implementation.
  • Create a pull request against the upstream repository, clearly stating
    • The name of your implementation class.
    • The execution time of the program on your system and specs of the same (CPU, number of cores, RAM). This is for informative purposes only, the official runtime will be determined as described below.
  • I will run the program and determine its performance as described in the next section, and enter the result to the scoreboard.

Note: I reserve the right to not evaluate specific submissions if I feel doubtful about the implementation (I.e. I won't run your Bitcoin miner ;).

If you'd like to discuss any potential ideas for implementing 1BRC with the community, you can use the GitHub Discussions of this repository. Please keep it friendly and civil.

The challenge runs until Jan 31 2024. Any submissions (i.e. pull requests) created after Jan 31 2024 23:59 UTC will not be considered.

Evaluating Results

Results are determined by running the program on a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).

Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine. Each contender must pass the 1BRC test suite (/test.sh). The hyperfine program is used for measuring execution times of the launch scripts of all entries, i.e. end-to-end times are measured. Each contender is run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the results table above. The exact same measurements.txt file is used for evaluating all contenders. See the script evaluate.sh for the exact implementation of the evaluation steps.

Prize

If you enter this challenge, you may learn something new, get to inspire others, and take pride in seeing your name listed in the scoreboard above. Rumor has it that the winner may receive a unique 1️⃣🐝🏎️ t-shirt, too!

FAQ

Q: Can I use Kotlin or other JVM languages other than Java?
A: No, this challenge is focussed on Java only. Feel free to inofficially share implementations significantly outperforming any listed results, though.

Q: Can I use non-JVM languages and/or tools?
A: No, this challenge is focussed on Java only. Feel free to inofficially share interesting implementations and results though. For instance it would be interesting to see how DuckDB fares with this task.

Q: I've got an implementation—but it's not in Java. Can I share it somewhere?
A: Whilst non-Java solutions cannot be formally submitted to the challenge, you are welcome to share them over in the Show and tell GitHub discussion area.

Q: Can I use JNI?
A: Submissions must be completely implemented in Java, i.e. you cannot write JNI glue code in C/C++. You could use AOT compilation of Java code via GraalVM though, either by AOT-compiling the entire application, or by creating a native library (see here.

Q: What is the encoding of the measurements.txt file?
A: The file is encoded with UTF-8.

Q: Can I make assumptions on the names of the weather stations showing up in the data set?
A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; or \n characters).

Q: Can I copy code from other submissions?
A: Yes, you can. The primary focus of the challenge is about learning something new, rather than "winning". When you do so, please give credit to the relevant source submissions. Please don't re-submit other entries with no or only trivial improvements.

Q: Which operating system is used for evaluation?
A: Fedora 39.

Q: My solution runs in 2 sec on my machine. Am I the fastest 1BRC-er in the world?
A: Probably not :) 1BRC results are reported in wallclock time, thus results of different implementations are only comparable when obtained on the same machine. If for instance an implementation is faster on a 32 core workstation than on the 8 core evaluation instance, this doesn't allow for any conclusions. When sharing 1BRC results, you should also always share the result of running the baseline implementation on the same hardware.

Q: Why 1️⃣🐝🏎️ ?
A: It's the abbreviation of the project name: One Billion Row Challenge.

1BRC on the Web

A list of external resources such as blog posts and videos, discussing 1BRC and specific implementations:

License

This code base is available under the Apache License, version 2.

Code of Conduct

Be excellent to each other! More than winning, the purpose of this challenge is to have fun and learn something new.

1brc's People

Contributors

abeobk avatar alexanderyastrebov avatar armandino avatar artsiomkorzun avatar ddimtirov avatar ebarlas avatar filiphr avatar gamlerhart avatar gnabyl avatar gonix avatar gunnarmorling avatar hundredwatt avatar ianopolous avatar ianopolousfast avatar iziamos avatar jerrinot avatar jgrateron avatar karthikeyan97 avatar kuduwa-keshavram avatar kumarsaurav123 avatar merykitty avatar mtopolnik avatar nipafx avatar richardstartin avatar roman-r-m avatar royvanrijn avatar serkan-ozal avatar thomaswue avatar yavuztas avatar zerninv avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

1brc's Issues

Fix tests (kuduwa-keshavram)

Hey @kuduwa-keshavram, since your submission we've added a basic test suite for covering some corner cases (as we discussed already). One of the tests fails for your submission. Could you take a look and fix this so the submission can be re-evaluated? Thx!

./test.sh kuduwa-keshavram
Validating calculate_average_kuduwa-keshavram.sh -- src/test/resources/samples/measurements-1.txt

real	0m0.163s
user	0m0.067s
sys	0m0.037s
Validating calculate_average_kuduwa-keshavram.sh -- src/test/resources/samples/measurements-10.txt

real	0m0.057s
user	0m0.066s
sys	0m0.017s
Validating calculate_average_kuduwa-keshavram.sh -- src/test/resources/samples/measurements-2.txt

real	0m0.054s
user	0m0.060s
sys	0m0.011s
Validating calculate_average_kuduwa-keshavram.sh -- src/test/resources/samples/measurements-20.txt

real	0m0.057s
user	0m0.065s
sys	0m0.015s
Validating calculate_average_kuduwa-keshavram.sh -- src/test/resources/samples/measurements-3.txt

real	0m0.056s
user	0m0.063s
sys	0m0.014s
1c1
< {Bosaso=-15.0/0.0/20.0, Petropavlovsk-Kamchatsky=-9.5/0.0/9.5}
---
> {Bosaso=-15.0/1.3/20.0, Petropavlovsk-Kamchatsky=-9.5/0.0/9.5}

Specify how to set the JDK with SDKman in the launch scripts

Getting e.g. this one: ./calculate_average_ebarlas.sh: line 18: sdk: command not found

I think it's because sdk is a function or something? If anyone has an idea of how to set the SDK reliably in launch scripts, any input would be welcome.

Suggestion: clear disk cache between runs

Hi, In order to have a fair comparison between I/O methods I think that when measuring multiple runs it would be better to clean the disk cache before every run, otherwise only the first time that the file is read the reading will happen from disk, while all the subsequent read operations will be performed directly from the disk cache in RAM, which could lead to wrong overall conclusions.

In order to clean disk cache you can run this command: sync && echo 3 | sudo tee /proc/sys/vm/drop_caches.

Btw very nice challenge!

Question : does the hardware use hyperthreading?

if the hyperthreading is enabled its possible that just splitting the total work to N core count will have quite varying actual cpu time, so one worker may finish much sooner and be idle...

Might yield opportunity to improve the best time even further.

Use Same JVMs for Everyone

Thank you for this fun and interesting competition and maintaining it so well so everyone can enjoy it. I do have one suggestion:

I think the spirit of the competition should be more around the code, not who can figure out the best JVM to use for their code. So I would suggest two options:

  • Use a standard (perhaps best in class for this type of application) JVM for everyone
  • Run all submissions on a pre-defined set of JVMs and show the times next to each other (interesting to see, but expensive in terms of time to run)

Comparing one submission running on one JVM to another submission on another JVM is kind of like comparing apples to oranges if they radically differ in terms of execution model (Graal AOT vs HotSpot JIT for example). You have strictly limited the language here (Java, not even other JVM languages), the dependencies (only standard libraries), why not also strictly limit the JVM so it is a level playing-field?

Always use \n as line endings in measurements.txt for uniformity

This line here:

...is platform dependent. When running on windows it generates \r\n line endings, which threw me off a bit at first as some of the existing submissions weren't working on my machine.

Could we change it to bw.write('\n'); so that the line endings are consistent and the same format file is thus generated on all platforms?

(Happy to submit a PR, but it'd obviously be trivial.)

Add PR template

It should primarly say that one should run (and pass) the test suite and commit any changes from the formatter.

Project structure improvements

  • move the calculate_average_*.sh files to a new scripts folder inside the project.
  • move CreateMeasurements and CreateMeasurements2 to a different package since these classes have a different purpose

Tests fail or don't finish: fatroom

Hey @fatroom, could you please run ./test.sh fatroom and send a PR to make sure they pass and finish in a short time? We have added some tests to cover some corner cases and this implementation doesn't pass all of them right now. Thanks!

Fix tests (ddimtirov)

Hey @ddimtirov, since your submission we've added a basic test suite for covering some corner cases (as we discussed already). One of the tests fails for your submission. Could you take a look and fix this so the submission can be re-evaluated? Thx!

./test.sh ddimtirov
Validating calculate_average_ddimtirov.sh -- src/test/resources/samples/measurements-1.txt
PT0.029701S

real	0m0.337s
user	0m0.093s
sys	0m0.130s
1c1
< {=0.0/19.8/19.8, Kunming=0.0/19.8/19.8, g=0.0/19.8/19.8, ing=0.0/19.8/19.8, ming=0.0/19.8/19.8, ng=0.0/19.8/19.8, nming=0.0/19.8/19.8}
---
> {Kunming=19.8/19.8/19.8}

Sort leaderboard by time

When running ./evaluate2.sh for multiple entries, the emitted leaderboard should be ordered by decreasing time.

Mean or Avg

Great project/challenge Gunnar, thank you.
The value to be calculate is the average and not the mean, right?
Based on the baseline (src/main/java/dev/morling/onebrc/CalculateAverage.java) it is average, but just to double-check.

Use CSV output format

Use proper CSV (or semicolon-separated for that matter) output format

Addis Ababa;33.0;33.0;33.0
Aden;31.1;31.1;31.1,
...

instead of weird not-a-json oneliner output.

This would simplify result comparison (diff) between multiple implementations.
This would also enable parsing of results - think of import into database or e.g. database-based implementation.

Enforce source code formatting

Fix and enforce (e.g. via PR build check) source code formatting. Currently build produces changes to sources which is annoying when working on own implementation:

$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
$ ./mvnw clean verify
...
$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java
        modified:   src/main/java/dev/morling/onebrc/CalculateAverage_richardstartin.java
        modified:   src/main/java/dev/morling/onebrc/CalculateAverage_seijikun.java
        modified:   src/main/java/dev/morling/onebrc/CalculateAverage_truelive.java
        modified:   src/main/java/dev/morling/onebrc/CreateMeasurements2.java

no changes added to commit (use "git add" and/or "git commit -a")

CreateMeasurements4 proposal

Use case: simulate new data appearing (for example if a new weather station is built) => test if program handle new key appearing correctly.

Summary: the first 500M lines only have 2500 keys, then the remaining 500M lines have full 10K keys.

/*
 *  Copyright 2023 The original authors
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package dev.morling.onebrc;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.concurrent.ThreadLocalRandom;

public class CreateMeasurements4 {

    public static final int MAX_NAME_LEN = 100;
    public static final int KEYSET_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: create_measurements4.sh <number of records to create>");
            System.exit(1);
        }
        int size = 0;
        try {
            size = Integer.parseInt(args[0]);
        }
        catch (NumberFormatException e) {
            System.out.println("Invalid value for <number of records to create>");
            System.out.println("Usage: create_measurements4.sh <number of records to create>");
            System.exit(1);
        }
        final var weatherStations = generateWeatherStations();
        final var start = System.currentTimeMillis();
        final var rnd = ThreadLocalRandom.current();
        try (var out = new BufferedWriter(new FileWriter("measurements.txt"))) {
            int mid = size / 2;
            for (int i = 1; i <= mid; i++) {
                var station = weatherStations.get(rnd.nextInt(KEYSET_SIZE / 4));
                double temp = rnd.nextGaussian(station.avgTemp, 7.0);
                out.write(station.name);
                out.write(';');
                out.write(Double.toString(Math.round(temp * 10.0) / 10.0));
                out.newLine();
                if (i % 50_000_000 == 0) {
                    System.out.printf("Wrote %,d measurements in %,d ms%n", i, System.currentTimeMillis() - start);
                }
            }

            for (int i = mid + 1; i <= size; i++) {
                var station = weatherStations.get(rnd.nextInt(KEYSET_SIZE));
                double temp = rnd.nextGaussian(station.avgTemp, 7.0);
                out.write(station.name);
                out.write(';');
                out.write(Double.toString(Math.round(temp * 10.0) / 10.0));
                out.newLine();
                if (i % 50_000_000 == 0) {
                    System.out.printf("Wrote %,d measurements in %,d ms%n", i, System.currentTimeMillis() - start);
                }
            }
        }
    }

    record WeatherStation(String name, float avgTemp) {
    }

    private static ArrayList<WeatherStation> generateWeatherStations() throws Exception {
        // Use a public list of city names and concatenate them all into a long string,
        // which we'll use as a "source of city name randomness"
        var bigName = new StringBuilder(1 << 20);
        try (var rows = new BufferedReader(new FileReader("data/weather_stations.csv"));) {
            skipComments(rows);
            while (true) {
                var row = rows.readLine();
                if (row == null) {
                    break;
                }
                bigName.append(row, 0, row.indexOf(';'));
            }
        }
        final var weatherStations = new ArrayList<WeatherStation>();
        final var names = new HashSet<String>();
        var minLen = Integer.MAX_VALUE;
        var maxLen = Integer.MIN_VALUE;
        try (var rows = new BufferedReader(new FileReader("data/weather_stations.csv"))) {
            skipComments(rows);
            final var nameSource = new StringReader(bigName.toString());
            final var buf = new char[MAX_NAME_LEN];
            final var rnd = ThreadLocalRandom.current();
            final double yOffset = 4;
            final double factor = 2500;
            final double xOffset = 0.372;
            final double power = 7;
            for (int i = 0; i < KEYSET_SIZE; i++) {
                var row = rows.readLine();
                if (row == null) {
                    break;
                }
                // Use a 7th-order curve to simulate the name length distribution.
                // It gives us mostly short names, but with large outliers.
                var nameLen = (int) (yOffset + factor * Math.pow(rnd.nextDouble() - xOffset, power));
                var count = nameSource.read(buf, 0, nameLen);
                if (count == -1) {
                    throw new Exception("Name source exhausted");
                }
                var nameBuf = new StringBuilder(nameLen);
                nameBuf.append(buf, 0, nameLen);
                if (Character.isWhitespace(nameBuf.charAt(0))) {
                    nameBuf.setCharAt(0, readNonSpace(nameSource));
                }
                if (Character.isWhitespace(nameBuf.charAt(nameBuf.length() - 1))) {
                    nameBuf.setCharAt(nameBuf.length() - 1, readNonSpace(nameSource));
                }
                var name = nameBuf.toString();
                while (names.contains(name)) {
                    nameBuf.setCharAt(rnd.nextInt(nameBuf.length()), readNonSpace(nameSource));
                    name = nameBuf.toString();
                }
                int actualLen;
                while (true) {
                    actualLen = name.getBytes(StandardCharsets.UTF_8).length;
                    if (actualLen <= 100) {
                        break;
                    }
                    nameBuf.deleteCharAt(nameBuf.length() - 1);
                    if (Character.isWhitespace(nameBuf.charAt(nameBuf.length() - 1))) {
                        nameBuf.setCharAt(nameBuf.length() - 1, readNonSpace(nameSource));
                    }
                    name = nameBuf.toString();
                }
                if (name.indexOf(';') != -1) {
                    throw new Exception("Station name contains a semicolon!");
                }
                names.add(name);
                minLen = Integer.min(minLen, actualLen);
                maxLen = Integer.max(maxLen, actualLen);
                var lat = Float.parseFloat(row.substring(row.indexOf(';') + 1));
                // Guesstimate mean temperature using cosine of latitude
                var avgTemp = (float) (30 * Math.cos(Math.toRadians(lat))) - 10;
                weatherStations.add(new WeatherStation(name, avgTemp));
            }
        }
        System.out.format("Generated %,d station names with length from %,d to %,d%n", KEYSET_SIZE, minLen, maxLen);
        return weatherStations;
    }

    private static void skipComments(BufferedReader rows) throws IOException {
        while (rows.readLine().startsWith("#")) {
        }
    }

    private static char readNonSpace(StringReader nameSource) throws IOException {
        while (true) {
            var n = nameSource.read();
            if (n == -1) {
                throw new IOException("Name source exhausted");
            }
            var ch = (char) n;
            if (ch != ' ') {
                return ch;
            }
        }
    }
}

Java errors - what am I doing wrong?

I keep getting the following errors (990483d) when trying to use the code in Visual Studio Code. Compile with maven appears to work but run and debug in VS fails due to this. What am I doing wrong?

image

There are many other warnings too.

Fix tests (palmr)

Hey @Palmr, since your submission we've added a basic test suite for covering some corner cases. One of the tests fails for your submission. Could you take a look and fix this so the submission can be re-evaluated? Thx!

./test.sh palmr
Validating calculate_average_palmr.sh -- src/test/resources/samples/measurements-1.txt

real	0m0.161s
user	0m0.116s
sys	0m0.039s
Validating calculate_average_palmr.sh -- src/test/resources/samples/measurements-10.txt

real	0m0.058s
user	0m0.109s
sys	0m0.018s
Validating calculate_average_palmr.sh -- src/test/resources/samples/measurements-2.txt

real	0m0.068s
user	0m0.104s
sys	0m0.017s
Validating calculate_average_palmr.sh -- src/test/resources/samples/measurements-20.txt

real	0m0.075s
user	0m0.108s
sys	0m0.019s
Validating calculate_average_palmr.sh -- src/test/resources/samples/measurements-3.txt

real	0m0.084s
user	0m0.110s
sys	0m0.022s
1c1
< {Bosaso=-15.0/-4.3/20.0, Petropavlovsk-Kamchatsky=-9.5/-1.9/9.5}
---
> {Bosaso=-15.0/1.3/20.0, Petropavlovsk-Kamchatsky=-9.5/0.0/9.5}

Validate entry names eagerly in evaluate2.sh

It should be detected at the beginning of the script, whether all passed entry names are valid, i.e. whether a calculate_average_<entry>.sh exists. And if not, abort the script before running anything.

UTF-8 Normalization

Hey!

Can we assume the input is UTF-8 normalized? Many SIMD-powered implementation, or implementations that do not rely on building Strings right away will assume the input is UTF-8 normalized. Is this a valid assumption?

Tests fail or don't finish: moysesb

Hey @moysesb, could you please run ./test.sh moysesb and send a PR to make sure they pass and finish in a short time? We have added some tests to cover some corner cases and this implementation doesn't pass all of them right now. Thanks!

Segmenting the Overall Benchmark Problem for Detailed Analysis

The benchmark, though straightforward, encompasses various tasks that can be tackled through distinct approaches. The proposed method involves isolating key tasks for focused examination. This can be achieved by developing a basic skeleton program for the general framework while isolating a specific functionality segment for modifications. The initial step would be to implement an efficient (fast or simple) version, then adapt it to allow easy replacement of the targeted portion. To execute, use the following command:

java --class-path the.jar TaskSpecificSkeletonMain TaskSpecificModule

Here, TaskSpecificSkeletonMain remains constant across different task implementations, while TaskSpecificModule represents the particular implementation being evaluated.

Tests fail or don't finish: yemreinci

Hey @yemreinci, could you please run ./test.sh yemreinci and send a PR to make sure they pass and finish in a short time? We have added some tests to cover some corner cases and this implementation doesn't pass all of them right now. Thanks!

Measure memory consumption?

Hi,
I like this competition and wonder which effect the different approaches have on the amount of used memory (RAM).

I also triggered by the "always upcoming" benefits of native compilation.
It would be great to see the differences in the table.

So my question is: Could you memorythe memory consumption of the benchmarks and add it to the table?

Need acceptance test suite

E.g. current "winner" produces garbage output on trivial input:

$ ./create_measurements.sh 2
Created file with 2 measurements in 18 ms
$ cat measurements.txt 
Calgary;1.6
Tamanrasset;13.7
$ ./calculate_average_ebarlas.sh 
{=2.147483647E8/0.0/-2.147483648E8, manrasset=13.7/13.7/13.7, y=2.147483647E8/0.0/-2.147483648E8}

real    0m0,206s
user    0m0,415s
sys     0m0,048s

Wrong names in Windows?

Disclaimer: I haven't played with Java for ~18 years, so maybe I'm doing something wrong.

PS D:\git\github.com\gunnarmorling\1brc> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.3.10
PSEdition                      Core
GitCommitId                    7.3.10
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
PS D:\git\github.com\gunnarmorling\1brc> scoop install temurin21-jdk maven
...
PS D:\git\github.com\gunnarmorling\1brc> mvn clean verify
...

PS D:\git\github.com\gunnarmorling\1brc> java --version
openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Temurin-21.0.1+12 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode, sharing)
PS D:\git\github.com\gunnarmorling\1brc> chcp
Active code page: 437
PS D:\git\github.com\gunnarmorling\1brc> chcp 65001
Active code page: 65001
PS D:\git\github.com\gunnarmorling\1brc> $OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001


PS D:\git\github.com\gunnarmorling\1brc> $Env:JAVA_TOOL_OPTIONS = "-Dfile.encoding=UTF8"
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CreateMeasurements 1000000000
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Wrote 50,000,000 measurements in 17124 ms
Wrote 100,000,000 measurements in 34366 ms
Wrote 150,000,000 measurements in 51380 ms
Wrote 200,000,000 measurements in 68445 ms
Wrote 250,000,000 measurements in 85397 ms
Wrote 300,000,000 measurements in 102491 ms
Wrote 350,000,000 measurements in 119489 ms
Wrote 400,000,000 measurements in 136484 ms
Wrote 450,000,000 measurements in 153494 ms
Wrote 500,000,000 measurements in 170461 ms
Wrote 550,000,000 measurements in 187471 ms
Wrote 600,000,000 measurements in 205101 ms
Wrote 650,000,000 measurements in 222205 ms
Wrote 700,000,000 measurements in 239340 ms
Wrote 750,000,000 measurements in 256477 ms
Wrote 800,000,000 measurements in 273675 ms
Wrote 850,000,000 measurements in 290896 ms
Wrote 900,000,000 measurements in 307993 ms
Wrote 950,000,000 measurements in 325116 ms
Created file with 1,000,000,000 measurements in 342196 ms
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80: Abéché

measurements.txt
   527: Abéché

result.txt
     1: AbΘchΘ

PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80:
00000cc0  -- -- -- -- -- -- -- --  -- -- -- -- 41 62 c3 a9  |------------Ab..|
00000cd0  63 68 c3 a9 -- -- -- --  -- -- -- -- -- -- -- --  |ch..------------|

measurements.txt
   527:
00001c10  41 62 c3 a9 63 68 c3 a9  -- -- -- -- -- -- -- --  |Ab..ch..--------|

result.txt
     1:
00000030  41 62 ce 98 63 68 ce 98  -- -- -- -- -- -- -- --  |Ab..ch..--------|

I changed code page to 65001 (UTF-8) and set JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8 hoping it could improve the situation, but it didn't change anything (originally I tested without those steps).

Can someone explain why there is ce 98 for é instead of c3 a9?

Fix tests (richardstartin)

Hey @richardstartin, @AlexanderYastrebov added one more test for 10K distinct keys which fails with your impl. Could you take a look? Thx!

./test.sh richardstartin
Validating calculate_average_richardstartin.sh -- src/test/resources/samples/measurements-1.txt

real	0m0.077s
user	0m0.077s
sys	0m0.019s
Validating calculate_average_richardstartin.sh -- src/test/resources/samples/measurements-10.txt

real	0m0.065s
user	0m0.076s
sys	0m0.014s
Validating calculate_average_richardstartin.sh -- src/test/resources/samples/measurements-10000-unique-keys.txt
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
	at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:653)
	at dev.morling.onebrc.CalculateAverage_richardstartin.main(CalculateAverage_richardstartin.java:391)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 603 out of bounds for length 600
	at dev.morling.onebrc.CalculateAverage_richardstartin$Page.update(CalculateAverage_richardstartin.java:254)
	at dev.morling.onebrc.CalculateAverage_richardstartin$AggregationTask.computeSlice(CalculateAverage_richardstartin.java:307)
	at dev.morling.onebrc.CalculateAverage_richardstartin$AggregationTask.compute(CalculateAverage_richardstartin.java:339)
	at dev.morling.onebrc.CalculateAverage_richardstartin$AggregationTask.compute(CalculateAverage_richardstartin.java:276)
	at java.base/java.util.concurrent.RecursiveTask.exec(RecursiveTask.java:110)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

Integrate Eval machine in Github Action

GitHub support private Github Runner on own machines.

To reduce the manual evaluation efforts, two thinks could be considered:

  • Generated a ranked matrix on a daily based from GitHub action
  • Provide preview of the rank on Pull Requests.

@gunnarmorling I'm far away from Java, but I could assist with Github Action coding here.

Should the 1 billion row file be deterministic?

Currently it seems that the 1 billion rows file is generated randomly. Making the generation pseudorandom would make sharing the 1 billion row file a little easier (since it should always be the same), and would make sure that everyone is running exactly the same test.

Just using a Random with a predefined seed to pick out stations, and seeding a Random with the hash code of the city name to obtain measurements should do the trick.

Tests fail or don't finish: nstng

Hey @nstng, could you please run ./test.sh nstng and send a PR to make sure they pass and finish in a short time? We have added some tests to cover some corner cases and this implementation doesn't pass all of them right now. Thanks!

Seeing others solutions nice but unfair...

First of all huge respect for pretty insane solutions that people already provided.
But since anyone can take previous solution and do a minor improvement on it, it basically shadows the original contenders work no?

If someone really is tryharding to win this challenge they will wait for last day, take the best solution, add their little improvement on top of it and hope that someone else cant improve it further.

Smaller file segments is faster

At this point it seems difficult to come up with a faster solution (and call it your own), but I believe I can make a contribution.

I took Sam Pullara's solution @spullara and changed one code line from

int numberOfSegments = Runtime.getRuntime().availableProcessors();

to

int numberOfSegments = 20 * Runtime.getRuntime().availableProcessors();

On my MacBook Air M2 Sam's solution takes about 5.6s. With that one line changed it's about 1s faster. 20 may not be the optimal number on another machine. The key thing is to make the segments smaller, and the number of segments a multiple of the number of cores.

Tests fail or don't finish: twobiers

Hey @twobiers, could you please run ./test.sh twobiers and send a PR to make sure they pass and finish in a short time? We have added some tests to cover some corner cases and this implementation doesn't pass all of them right now. Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.