Comments (6)
Thanks Alexey for making your intentions clear:
-
If there are 2 ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and yourself, and the numbers will change.
-
Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. Then you want us to believe that's because your main goal is to make Snowflake better. I doubt that's your main goal.
from clickbench.
You are mixing up the results for ClickHouse and clickhouse-local:
Loading data into ClickHouse:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/benchmark.sh#L24
It takes 476 seconds to load from TSV file on c6a.4xlarge machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.json
Or 417 seconds if you use zstd compression:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.4xlarge.zstd.json
It takes 137 seconds to load from TSV file on c6a.metal machine:
https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/results/c6a.metal.json
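For context, the load step behind these numbers is a single-stream insert. A minimal sketch (table name and file as used in the benchmark scripts), with the TSV piped to clickhouse-client on stdin:

```sql
-- Executed as: clickhouse-client --time --query "INSERT INTO hits FORMAT TSV" < hits.tsv
-- One stream, no parallel loading, per the benchmark methodology
INSERT INTO hits FORMAT TSV
```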
In contrast, clickhouse-local is a stateless system (like AWS Athena): it takes no time to load the data, because it queries the files as-is without loading them, but query performance is lower.
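As a sketch of that stateless mode, assuming clickhouse-local is installed and hits.tsv is present locally, the file is queried in place with no load step:

```sql
-- Run inside clickhouse-local: the file() table function reads hits.tsv as-is,
-- so there is zero load time, at the cost of slower per-query performance
SELECT count() FROM file('hits.tsv', 'TSV');
```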
ClickHouse and clickhouse-local are present as different entries in the benchmark.
In your screenshot, you are comparing Snowflake with ClickHouse, and ClickHouse indeed loads faster.
There is no magic and you can reproduce the result by following the script.
The loading is not parallelized, and should not be, as per the methodology:
https://github.com/ClickHouse/ClickBench#data-loading
from clickbench.
About the comments in NOTES.md
I've spent multiple hours figuring out how to load the data.
First I tried to load it with SnowSQL. But it uses Python code to parse the CSV, saturated a single CPU core, and did not finish in 24 hours.
Fortunately, I found another way to load the data.
The usability issue with SnowSQL is real. I tried specifying my account name multiple times before discovering that I also needed to specify the region name on the command line. This was unclear from the documentation and is a usability issue worth fixing. There were two different substrings that looked like my account name; it was unclear which one to copy-paste, and neither worked by default.
The syntax @test.public.%hits does look weird.
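For readers unfamiliar with the notation: @db.schema.%table is Snowflake's "table stage", an implicit stage attached to each table. A minimal sketch of how it is typically used (the local file path is hypothetical):

```sql
-- Upload a local file to the table's implicit stage...
PUT file:///tmp/hits.csv @test.public.%hits;
-- ...then load; COPY INTO with no FROM clause reads from the table stage
COPY INTO test.public.hits;
```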
The pricing is also not quite clear. It shows the price in "credits", but it is difficult to find out what a credit is worth.
I eventually found it in some PDF, but it was not easy: the search in the documentation does not help, and random internet pages show contradictory info. I could not find the billing information in the UI. This is an opportunity for improvement.
The internet is flooded with half-spam pages that "help you figure out the cost of Snowflake".
That said, I found the overall UI experience to be one of the best. It works well and looks polished.
The ability to resize a warehouse in seconds is unique among comparable services.
Query performance is very consistent - all queries run fine.
While it's slower on average than ClickHouse, it is fairer to compare it with similar services, such as Redshift and Redshift Serverless.
I've already told my colleagues that Snowflake surprised me in a good way.
(easy scaling + good user experience)
from clickbench.
If there are 2 ways of doing something with Snowflake, you will choose the one that makes Snowflake look worse. Here you have acknowledged that there is a better way, but you refuse to change. Give Snowflake the same files that you provided to everyone else and yourself, and the numbers will change.
No, I selected the best way I found to load the data.
As mentioned in the NOTES.md, I've ended up using
COPY INTO test.public.hits2 FROM 's3://clickhouse-public-datasets/hits_compatible/hits.csv.gz' FILE_FORMAT = (TYPE = CSV, COMPRESSION = GZIP, FIELD_OPTIONALLY_ENCLOSED_BY = '"')
If there is an even better variant of data loading within the rules of this benchmark, let's use it.
from clickbench.
Of the 38 systems you tested, only Snowflake got a snarky NOTES.md. Then you want us to believe that's because your main goal is to make Snowflake better. I doubt that's your main goal.
You will find similar comments about other systems' usability, for example:
https://github.com/ClickHouse/ClickBench/tree/main/bigquery
from clickbench.
Please note that poor onboarding, obsolete documentation, and nothing working by default are sad but typical among these services, and your service is far from the worst in this regard.
I think that capturing the experience of a "clueless", "ignorant" user who is trying your service for the first time is the most valuable input for improving the product.
from clickbench.