
Quick benchmark about odbc2parquet (18 comments, closed)

pacman82 avatar pacman82 commented on May 29, 2024
Quick benchmark

from odbc2parquet.

Comments (18)

alippai avatar alippai commented on May 29, 2024 1

I played with one of my larger tables (30M rows) a little bit. The query execution seems to be quick, e.g. if I add LIMIT 1000000 it completes in 5 s, fetching included.

So far I collected this:

  1. Prints "ODBC env created". It starts with CPU usage at 20-25%.
  2. After 1 min it prints the column info (11 columns; int, bigint, timestamp, but I tried without timestamp as well). Row size 220 bytes, batch size 10M.
  3. Switches to 100% CPU usage; one fetch seems to take 12 s.
  4. Still at 100% CPU usage, writing columns for 10 s (so 1 s per column).
  5. The complete workload finishes in 2 minutes (repeating steps 2 and 3 above 4x), with 70 s user time and 7 s sys time.

This definitely looks to be CPU bound. If I turn off compression it's down to 1m38s (and user time down to 50 s), but the CPU is still at 100% (only one core used out of many). Setting the batch size to 1M doesn't really help: the column info is printed earlier, but the end-to-end time is the same.

How should I proceed in the investigation? Is this helpful?
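As a rough sanity check, the figures reported above can be turned into a throughput estimate (a back-of-the-envelope sketch only; it assumes roughly 30M rows at 220 bytes each over the reported 2-minute run):

```python
# Rough throughput estimate from the numbers reported above
# (assumed figures: ~30M rows, 220 bytes/row, ~120 s end to end).
rows = 30_000_000
row_bytes = 220
total_seconds = 120  # "finished in 2 minutes"

rows_per_sec = rows / total_seconds
mb_per_sec = rows * row_bytes / total_seconds / 1_000_000

print(f"{rows_per_sec:,.0f} rows/s, ~{mb_per_sec:.0f} MB/s")
# → 250,000 rows/s, ~55 MB/s
```

At ~55 MB/s with a single core pegged at 100%, CPU rather than network looks like the plausible bottleneck, which matches the observation above.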


pacman82 avatar pacman82 commented on May 29, 2024 1

Hello @vikramhn ,

feel free to evaluate and choose any tool you like for your use case. odbc2parquet is only concerned with ODBC <-> Parquet. As such, I might be interested in benchmarks regarding odbc2parquet itself, comparing it to previous versions of itself. It would also be interesting to know how it competes with other ways of fetching Parquet from a database. So far, though, I feel odbc2parquet is competitive, if not the best.

Personally I do not have the bandwidth to worry about the performance of other tools. If you want to do so and share your findings here, however, you are welcome to.

Cheers, Markus


pacman82 avatar pacman82 commented on May 29, 2024

How did you generate the text file with MySQL?


pacman82 avatar pacman82 commented on May 29, 2024

And was the database MySQL in all three cases?


alippai avatar alippai commented on May 29, 2024

It was the same (old) MySQL instance. The text file is simply the output of mysql -e with the query, piped to a file.
unixodbc, the mysql client, turbodbc and odbc2parquet were installed from conda-forge (cargo coming from the rust dev channel), so everything is up to date.


alippai avatar alippai commented on May 29, 2024

I can make more scientific benchmarks (or even add them to this repo for comparison).


pacman82 avatar pacman82 commented on May 29, 2024

Cool, thanks! I wonder how odbcsv would perform against the piped text output? That should be more comparable than odbc2parquet. I also wonder how odbc2parquet performs against turbodbc -> pandas -> parquet.

I don't know which of these comparisons with other tech stacks makes sense to include in the documentation, but I am all for a benchmark suite (I usually use criterion).


alippai avatar alippai commented on May 29, 2024

What do you think about https://github.com/sharkdp/hyperfine for e2e benchmarks? Criterion is nice for Rust lib benchmarking, but it wouldn't allow us to track performance against a CLI baseline (odbcsv, turbodbc, or the plain mysql CLI client).


pacman82 avatar pacman82 commented on May 29, 2024

Have never used it. Yet it seems nice enough and the license checks out. I think criterion's primary use case is to catch performance regressions. Comparing a bunch of different tools is a different usage scenario. Both have their uses. From your suggestion I gather that you are more interested in the latter.

How do you want to go about this? Do you want to test a bunch of stuff once and just publish the results? Or do you want to check a reproducible setup into this repository? What do we actually want to compare?

I am curious what we will learn from these benchmarks. So far the eerily similar numbers between turbodbc and odbc2parquet indicate that the bottleneck is transmitting the records over the network.

I'd be interested in learning what could be achieved by playing around with the batch-size parameter. Is there a point of diminishing returns, where the speedup does not warrant the additional memory usage? Is there a point after which it affects the total speed negatively?

I can see many use cases for a nice benchmarking setup. I am looking forward to it, should you wish to provide one.


alippai avatar alippai commented on May 29, 2024

odbcsv is similar in terms of performance: 1m50s with 1m user time. If I understand correctly it fetches 5000 rows per batch. Still 100% CPU usage continuously.


pacman82 avatar pacman82 commented on May 29, 2024

Hi @alippai ,

thanks for the investigation and the numbers. Increasing the fetch size has diminishing returns, if you assume a fixed overhead per batch. Increasing the batch size from 1 to 100 cuts the overhead to 1/100. Going from there to a fetch size of 10,000 cuts it by 1/100 again, but the overhead you remove is already much smaller.

As the ODBC batch fetch size == the Parquet row group size, you should probably choose the value so you get the desired output row group size. As long as we don't go row by row, the IO overhead per batch should not be too bad.

The equality above is no accident, however. Having the row group size equal to the fetch size allows passing the non-nullable, binary-identical buffers directly to Parquet without copying them in memory. This optimization is currently implemented for inserting, but not yet for fetching data. Doing so would save at least some CPU cycles, depending on the use case.
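The diminishing-returns argument above can be made concrete with a toy cost model (the constants here are made up purely for illustration; only the shape of the curve matters):

```python
# Illustrative model: a fixed cost per fetch call plus a per-row cost.
# The per-batch overhead's share of total time shrinks by the same
# factor the batch size grows by, so gains flatten out quickly.
PER_BATCH_OVERHEAD = 1.0   # arbitrary time units per fetch call
PER_ROW_COST = 0.01        # arbitrary time units per row

def overhead_share(batch_size):
    total = PER_BATCH_OVERHEAD + PER_ROW_COST * batch_size
    return PER_BATCH_OVERHEAD / total

for size in (1, 100, 10_000):
    print(f"batch size {size:>6}: overhead {overhead_share(size):.2%}")
# batch size      1: overhead 99.01%
# batch size    100: overhead 50.00%
# batch size  10000: overhead 0.99%
```

Going from 1 to 100 removes almost all of the per-batch overhead; going from 100 to 10,000 buys comparatively little while multiplying the buffer memory (and the Parquet row group size) by 100.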


pacman82 avatar pacman82 commented on May 29, 2024

I think you even mentioned the optimization already earlier, in another issue. ;-)

Cheers, Markus


pacman82 avatar pacman82 commented on May 29, 2024

Just another thought: as you said, the performance of odbcsv is similar. The data-writing part differs between odbc2parquet and odbcsv, while the work done by the ODBC driver is roughly the same (with the notable difference that odbcsv requests all columns as strings). My guess is that the majority of the work happens within the ODBC driver.


alippai avatar alippai commented on May 29, 2024

I'm afraid that the ODBC driver is the bottleneck. Is there an easy way to profile the run? Or are there good entry points in odbc-sys/the API to collect some basic perf stats? E.g. at the end of the job it could write a debug message: "receiving data 3s, time spent in ODBC driver 1s, odbc2parquet 0.5s, writing parquet 3s".
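The phase-level summary proposed here could look something like the following sketch (Python for illustration; odbc2parquet itself is Rust, and the phase names and sleep calls are hypothetical stand-ins for the real fetch/write steps):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical phase timer of the kind proposed above: accumulate
# wall time per phase, then report a summary at the end of the job.
phase_totals = defaultdict(float)

@contextmanager
def phase(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - start

# Stand-ins for the real per-batch fetch and write steps.
for _ in range(3):
    with phase("fetch"):
        time.sleep(0.001)
    with phase("write parquet"):
        time.sleep(0.001)

for name, seconds in phase_totals.items():
    print(f"{name}: {seconds:.3f}s")
```

Note this only separates time spent in the tool's own phases; it cannot split "ODBC driver" from "network" within a fetch call, since the driver exposes no such timing, which is the limitation discussed in the next comment.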


pacman82 avatar pacman82 commented on May 29, 2024

In terms of this tool, we could always add more logging, but I think it won't be very enlightening. If you can build your ODBC driver with profiling information you can get a flamegraph, or something like that. Yet at that point we would have stretched even the most generous definition of "easy way". I am currently unaware of any profiling information in the ODBC API.

Sorry, I don't have better news.


alippai avatar alippai commented on May 29, 2024

No problem, I had to ask as you are super-experienced in this field. I'll let you know if I find anything useful.


vikramhn avatar vikramhn commented on May 29, 2024

After seeing @alippai's benchmark, it performs the same as the others? My focus was on performance rather than the size/format of the output file.
Option 1 seems better than this tool; see attached below. @pacman82, should we evaluate this tool for larger tables with strings, where it might make a difference? Sorry if I have misunderstood.
[attached screenshot: benchmark results]


pacman82 avatar pacman82 commented on May 29, 2024

Closing this issue. I hope this does not discourage anyone from sharing benchmarks or interesting findings.

