Comments (1)
There is a large catalog of prepared datasets: https://clickhouse.com/docs/en/getting-started/example-datasets
For example, these datasets are over 1 TB uncompressed:
- Reddit comments;
- YouTube likes;
- GitHub events;
- Wikipedia page views;
- Environmental Sensors Data.
They can be loaded into ClickHouse in a few hours.
There is also a list of queries: https://github.com/ClickHouse/github-explorer/blob/main/queries.sql
But these datasets are not used in ClickBench, because testing all ~30 database management systems would be too slow.
For example, if you try to load Wikipedia page views (a typical time-series dataset) into TimescaleDB (a typical time-series DBMS), it will take months, making the benchmark impractical. If you try to load it into DuckDB, it will not load because DuckDB is not a production-quality database. If you try to use Druid or Pinot, you will need a long time to recover from the PTSD.
> Testing on a 200GB data set that is easily compressible down to 50GB with modern compression algorithms might exclude disk IO from the equation on systems with large caches (even if those are simple disk caches)
In fact, ClickHouse compresses it to only 9.28 GB. But the benchmark methodology requires one cold run with flushed caches, so it can still test the IO subsystem. Also keep in mind that it requires the use of 500 GB gp2 EBS volumes, which have a well-known IO profile (TL;DR: they are slow).
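To put the two compression figures side by side, here is a rough sketch of the arithmetic. It assumes the 200 GB quoted above is the uncompressed size; the function and variable names are made up for illustration and are not part of ClickBench.

```python
def compression_ratio(uncompressed_gb: float, compressed_gb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return uncompressed_gb / compressed_gb


# Assumption: 200 GB is the uncompressed size of the dataset quoted above.
UNCOMPRESSED_GB = 200.0

# The "easily compressible down to 50GB" estimate from the quoted text.
ratio_estimate = compression_ratio(UNCOMPRESSED_GB, 50.0)

# The 9.28 GB that ClickHouse is reported to reach.
ratio_clickhouse = compression_ratio(UNCOMPRESSED_GB, 9.28)

print(f"estimate:   {ratio_estimate:.2f}x")
print(f"ClickHouse: {ratio_clickhouse:.2f}x")
```

Under that assumption the estimate is a 4x ratio and the ClickHouse figure is a bit over 21x, so the compressed data is even more cache-friendly than the question supposes. That is why the cold run with flushed caches matters.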
from clickbench.