Comments (18)
I played with one of my larger tables (30M rows) a little bit. The query execution itself seems to be quick, e.g. if I add LIMIT 1000000 it completes in 5 s, fetching included.
So far I collected this:
- Prints "ODBC env created". It starts with 20-25% CPU usage.
- After 1 min it prints the column info (11 columns: int, bigint, timestamp; I tried without the timestamp as well). Row size is 220 bytes, batch size 10M.
- It then switches to 100% CPU usage, and one fetch seems to take 12 s.
- Still at 100% CPU usage, writing the columns takes 10 s (so about 1 s per column).
- The complete workload finishes in 2 minutes (repeating points 2-3 above 4x); user time 70 s, sys time 7 s.
This definitely looks CPU bound. If I turn off compression it drops to 1m38s (and user time down to 50 s), but CPU usage is still at 100% (only one core used out of many). Setting the batch size to 1M doesn't really help: the column info is printed earlier, but the end-to-end time is the same.
How should I proceed in the investigation? Is this helpful?
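One way to put a number on "CPU bound" from figures like the ones above is the share of wall-clock time spent on CPU (user + sys). A minimal sketch, using the reported 2 min wall / 70 s user / 7 s sys; the helper name is made up for illustration:

```python
# Hypothetical helper: estimate how CPU-bound a run was from `time`-style numbers.

def cpu_bound_fraction(wall_s: float, user_s: float, sys_s: float) -> float:
    """Fraction of wall-clock time spent on CPU (user + system)."""
    return (user_s + sys_s) / wall_s

# Figures reported above: 120 s wall, 70 s user, 7 s sys.
frac = cpu_bound_fraction(wall_s=120.0, user_s=70.0, sys_s=7.0)
print(f"{frac:.0%} of wall time on CPU")  # prints "64% of wall time on CPU"
```

With a single core pegged at 100% but only ~64% of wall time on CPU overall, the rest is presumably spent waiting (e.g. on the network or the driver).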
from odbc2parquet.
Hello @vikramhn ,
feel free to evaluate and choose any tool you like for your use case. odbc2parquet is only concerned with ODBC <-> Parquet. As such I might be interested in benchmarks regarding odbc2parquet itself, comparing it to previous versions of itself. It would also be interesting to know how it competes with other ways of fetching Parquet from a database. So far I feel, though, that odbc2parquet is competitive, if not the best.
Personally I do not have the bandwidth to worry about the performance of other tools. If you want to do so here and share the results of your findings, you are welcome to, however.
Cheers, Markus
from odbc2parquet.
how did you generate the text file with MySQL?
from odbc2parquet.
and was the database used a MySQL, in all three cases?
from odbc2parquet.
It was the same (old) MySQL instance. The text is simply `mysql -e query` output piped to a file.
unixodbc, the mysql client, turbodbc and odbc2parquet were all installed from conda-forge (cargo coming from the rust dev channel), so everything is up to date.
from odbc2parquet.
I can make more scientific benchmarks (or even add them to this repo for comparison)
from odbc2parquet.
cool, thanks! I wonder how odbcsv would perform against the piped text output? That should be more comparable than odbc2parquet. I also wonder how odbc2parquet performs against turbodbc -> pandas -> parquet.
I don't know which of these comparisons with other tech stacks makes sense to include in the documentation, but I am all for a benchmark suite (I usually use criterion).
from odbc2parquet.
What do you think about https://github.com/sharkdp/hyperfine for e2e benchmarks? Criterion is nice for Rust lib benchmarking, but it wouldn't allow us to track performance against a CLI baseline (odbcsv, turbodbc or the plain mysql CLI client).
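The core of what hyperfine does for e2e comparison can be sketched in a few lines: run each command several times and compare mean wall times. A toy harness, not a substitute for hyperfine itself; the commands to benchmark are placeholders:

```python
# Minimal hyperfine-style harness: mean wall-clock time over a few runs.
import statistics
import subprocess
import sys
import time

def bench(cmd: list, runs: int = 3) -> float:
    """Return the mean wall-clock seconds over `runs` executions of `cmd`."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# In a real benchmark the commands would be e.g. an odbc2parquet invocation
# vs. an odbcsv one; here a trivial no-op command stands in.
print(f"{bench([sys.executable, '-c', 'pass']):.3f}s")
```

hyperfine adds the important parts on top of this: warmup runs, statistical outlier detection, and result export, which is why it is the better choice for a real suite.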
from odbc2parquet.
Have never used it. Yet it seems nice enough and the license checks out. I think criterion's primary use case is to catch performance regressions; comparing a bunch of different tools is a different usage scenario. Both have their uses. From your suggestion I gather that you are more interested in doing the latter.
How do you want to go about this? Do you want to test a bunch of stuff once and just publish the results? Or do you want to check a reproducible setup into this repository? What do we actually want to compare?
I am curious what we will learn from these benchmarks. So far the eerily similar numbers between turbodbc and odbc2parquet indicate that the bottleneck is transmitting the records over the network.
I'd be interested in learning what could be achieved by playing around with the batch-size parameter. Is there a point of diminishing returns, where the speedup does not warrant the additional memory usage? Is there a point after which it affects the total speed negatively?
I can see many use cases for a nice benchmarking setup. I am looking forward to it, should you wish to provide one.
from odbc2parquet.
odbcsv is similar in terms of performance: 1m50s with 1m user time. If I understand correctly it fetches 5000 rows per batch. Still 100% CPU usage continuously.
from odbc2parquet.
Hi @alippai ,
thanks for the investigation and the numbers. Increasing the fetch size has diminishing returns, if you assume a fixed overhead per batch: increasing the batch size from 1 to 100 cuts the overhead to 1/100. Going from there to a fetch size of 10,000 decreases it by 1/100 again, but the overhead you are decreasing is already much smaller.
Since the ODBC batch fetch size == Parquet row group size, you should probably choose the value so you get the desired row group size in the output. As long as we don't go row by row, the IO overhead per batch should not be too bad.
That equality is, however, not by accident. Having the row group size equal to the fetch size allows passing the non-nullable, binary-identical buffers directly to Parquet without copying them in memory. This optimization is currently implemented for inserting, but not yet for fetching data. Doing so would save at least some CPU cycles, depending on the use case.
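The fixed-overhead argument above can be made concrete with a one-line model: if every fetch call costs a constant amount regardless of batch size, the per-row overhead falls as 1/batch_size, so each successive 100x increase in batch size saves 100x less than the previous one. A sketch with an arbitrary unit cost:

```python
# Model of per-batch fixed overhead amortized over batch size.
FIXED_OVERHEAD = 1.0  # arbitrary cost units per fetch call

def per_row_overhead(batch_size: int) -> float:
    """Overhead attributed to each row when fetching `batch_size` rows at once."""
    return FIXED_OVERHEAD / batch_size

for size in (1, 100, 10_000, 1_000_000):
    print(f"batch {size:>9}: {per_row_overhead(size):.8f} overhead per row")
```

Going from 1 to 100 saves 0.99 units per row; going from 10,000 to 1,000,000 saves less than 0.0001, while the memory for the buffers grows 100x. That is the diminishing-returns point in a nutshell.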
from odbc2parquet.
I think you even mentioned the optimization already earlier, in another issue. ;-)
Cheers, Markus
from odbc2parquet.
Just another thought: as you said, the performance of odbcsv is similar. The data-writing part is different between odbc2parquet and odbcsv, while the work done by the ODBC driver is somewhat similar (with the notable difference that odbcsv requests all columns as strings). My guess is that the majority of the work happens within the ODBC driver.
from odbc2parquet.
I'm afraid the ODBC driver is the bottleneck. Is there an easy way to profile the run? Or are there good entry points in odbc-sys or the API to collect some basic perf stats? E.g. at the end of the job it could write a debug message: "receiving data 3s, time spent in ODBC driver 1s, odbc2parquet 0.5s, writing parquet 3s".
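The kind of phase breakdown asked for here is usually done by wrapping each stage of the fetch/write loop with a monotonic timer and accumulating per-phase totals. A generic sketch of the pattern (not odbc2parquet's actual code; `fetch_batch` and `write_parquet` are placeholders that just sleep):

```python
# Accumulate wall time per named phase of a fetch/write loop.
import time
from collections import defaultdict

timings = defaultdict(float)

def timed(phase: str):
    """Decorator adding the wrapped call's wall time to timings[phase]."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[phase] += time.perf_counter() - start
        return inner
    return wrap

@timed("fetch")
def fetch_batch():  # placeholder for the ODBC fetch call
    time.sleep(0.01)

@timed("write")
def write_parquet():  # placeholder for the Parquet row group write
    time.sleep(0.01)

for _ in range(3):
    fetch_batch()
    write_parquet()

for phase, total in timings.items():
    print(f"{phase}: {total:.3f}s")
```

Note the limitation the maintainer points out below: this can only separate "time spent inside the driver call" from "time spent in the tool"; attributing time within the driver itself needs a profiler on the driver side.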
from odbc2parquet.
In terms of this tool, we could always add more logging, but I think it won't be very enlightening. If you can build your ODBC driver with profiling information you can get a flamegraph, or something like that. Yet at that point we would have stretched even the most generous definition of "easy way". I am currently unaware of any profiling facilities in the ODBC API.
Sorry, I don't have better news.
from odbc2parquet.
No problem, I had to ask as you are super-experienced in this field. I'll let you know if I find anything useful.
from odbc2parquet.
After seeing @alippai's benchmark, it performs the same as the others? My focus was more on performance than on the size/format of the output file.
Option 1 seems better than this tool, see attached below. @pacman82, should we perhaps evaluate this tool for larger tables with strings, where it might make a difference? Sorry if I have misunderstood.
from odbc2parquet.
Closing this issue. I hope this does not discourage anyone from sharing benchmarks or interesting findings.
from odbc2parquet.