alexey-milovidov commented on May 21, 2024

There is a large catalog of prepared datasets: https://clickhouse.com/docs/en/getting-started/example-datasets

For example, these datasets are over 1 TB uncompressed:

  • Reddit comments;
  • YouTube likes;
  • GitHub events;
  • Wikipedia page views;
  • Environmental Sensors Data.

They can be loaded into ClickHouse in a few hours.
There is also a list of queries: https://github.com/ClickHouse/github-explorer/blob/main/queries.sql

But these datasets are not used in ClickBench, because testing all ~30 database management systems on them would be too slow.

For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load, because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.

"Testing on a 200 GB data set that is easily compressible down to 50 GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches)."

In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so it does test the IO subsystem. Also keep in mind that it requires 500 GB gp2 EBS volumes, which have a well-known IO profile (TL;DR: they are slow).
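To put the numbers above side by side, here is a rough back-of-the-envelope sketch in Python. The 200 GB and 50 GB figures come from the quoted question and 9.28 GB from this reply. The gp2 baseline formula (3 IOPS per GiB, with a floor of 100 and a cap of 16,000) is how AWS documents gp2 volumes as far as I know, and treating "500 GB" as 500 GiB is an assumption. The function names are made up for illustration.

```python
# Rough arithmetic only; not part of the ClickBench tooling.

def compression_ratio(uncompressed_gb: float, compressed_gb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return uncompressed_gb / compressed_gb


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume: 3 per GiB, at least 100, at most 16000.

    Assumption: this follows the publicly documented gp2 sizing rule.
    """
    return min(max(100, 3 * size_gib), 16_000)


if __name__ == "__main__":
    # Ratio assumed in the question (200 GB -> 50 GB).
    print(round(compression_ratio(200, 50), 1))    # 4.0
    # Ratio reported for ClickHouse (200 GB -> 9.28 GB).
    print(round(compression_ratio(200, 9.28), 1))  # 21.6
    # Baseline IOPS for the 500 GB volume, assuming GB means GiB here.
    print(gp2_baseline_iops(500))                  # 1500
```

So the data compresses roughly five times better than the question assumed, while the volume's modest baseline IOPS is what the cold run ends up measuring.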

from clickbench.
