Giter Site home page Giter Site logo

1brc's Introduction

One Billion Row Challenge

A Rust implementation for the 1 Billion Row Challenge (1BRC) that generates a large dataset of temperature records and calculates statistics for each city.

Setup

Note: The following instructions are for Linux and macOS. If you are using Windows, you can install the Windows Subsystem for Linux (WSL) or use a virtual machine.

  1. Install Rust: https://www.rust-lang.org/tools/install
  2. Clone the repository: git clone [email protected]:ViniciosLugli/1brc.git
  3. Change to the project directory: cd 1brc
  4. You can set DATA_FILE_PATH environment variable to specify the path where the generated dataset will be saved. If not set, the default path is the local folder.
  5. Run the project: cargo run --release

Solution Strategy

I started the project by reading the challenge description and understanding the requirements. I then broke down the problem into smaller tasks and identified the key components of the solution. I decided to implement the solution in Rust due to its performance, safety, and concurrency features.

Data Processing

  • The data processing step involves reading the generated dataset, parsing the temperature data, and calculating statistics for each city.
  • The dataset is read in parallel to utilize multiple CPU cores effectively.
  • City-wise statistics such as minimum, maximum, and average temperatures are calculated and stored.
  • The results are displayed in a tabular format for easy interpretation.

Implementation

  • The solution utilizes asynchronous programming with Tokio and Rayon to achieve concurrency and parallelism.
  • Tokio is used for asynchronous I/O operations, while Rayon is used for parallel data processing.
  • Memory-mapping is employed to efficiently read large files without loading them entirely into memory.
  • Atomic counters are used for tracking progress and coordinating tasks across multiple threads.
  • The solution is designed to scale with the available CPU cores to maximize performance.

Performance

  • The solution is designed to utilize multi-core processors efficiently for both data generation and processing.
  • Memory-mapping is employed to minimize memory usage and optimize I/O performance.
  • Parallelization and concurrency are used to maximize throughput and reduce processing time.

Known Issues

The generated dataset has a rand seed problem setting all the temperatures to the same value, i know that this is a problem, but this dont impact the final result of the project, so I decided to not fix it and focus on the main problem.

Conclusion

The final result in my testing environment is a execution time of 27.51 seconds for the 1 billion rows generated dataset, which is a significant improvement over the initial implementation of project that took ~300 seconds. Far away from the ~6 seconds of the most optimized solution found in the internet, but I'm happy with the result, considering that is my first time with this kind of problem and all the implementation was done in a single day :) generator solution

Helpers / References

1brc's People

Contributors

vinicioslugli avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.