Comments (18)
I played with one of my larger tables (30M rows) a little bit. The query execution itself seems to be quick, e.g. if I add LIMIT 1000000 it completes in 5 s, fetching included.
So far I collected this:
- Prints "ODBC env created". It starts with 20-25% CPU usage.
- After 1 min it prints the column info (11 columns: int, bigint, timestamp; I tried without the timestamp as well). Row size is 220 bytes, batch size 10M.
- It then switches to 100% CPU usage, and one fetch seems to take 12 s.
- Still at 100% CPU usage, writing the columns takes 10 s (so about 1 s per column).
- The complete workload finishes in 2 minutes (repeating points 2-3 above 4x); user time 70 s, sys time 7 s.
This definitely looks CPU bound. If I turn off compression it drops to 1m38s (and user time down to 50 s), but CPU usage is still at 100% (only one core used out of many). Setting the batch size to 1M doesn't really help: the column info is printed earlier, but the end-to-end time is the same.
How should I proceed in the investigation? Is this helpful?
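One way to put a number on "CPU bound" from figures like the ones above is the share of wall-clock time spent on CPU (user + sys). A minimal sketch, using the reported 2 min wall / 70 s user / 7 s sys; the helper name is made up for illustration:

```python
# Hypothetical helper: estimate how CPU-bound a run was from `time`-style numbers.

def cpu_bound_fraction(wall_s: float, user_s: float, sys_s: float) -> float:
    """Fraction of wall-clock time spent on CPU (user + system)."""
    return (user_s + sys_s) / wall_s

# Figures reported above: 120 s wall, 70 s user, 7 s sys.
frac = cpu_bound_fraction(wall_s=120.0, user_s=70.0, sys_s=7.0)
print(f"{frac:.0%} of wall time on CPU")  # prints "64% of wall time on CPU"
```

With a single core pegged at 100% but only ~64% of wall time on CPU overall, the rest is presumably spent waiting (e.g. on the network or the driver).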
from odbc2parquet.
Hello @vikramhn ,
feel free to evaluate and choose any tool you like for your use case. odbc2parquet is only concerned with ODBC <-> Parquet. As such I might be interested in benchmarks regarding odbc2parquet itself, comparing it to previous versions of itself. It would also be interesting to know how it competes with other ways of fetching Parquet from a database. So far I feel, though, that odbc2parquet is competitive, if not the best.
Personally I do not have the bandwidth to worry about the performance of other tools. If you want to do so here and share the results of your findings, you are welcome to, however.
Cheers, Markus
from odbc2parquet.
how did you generate the text file with MySQL?
from odbc2parquet.
and was the database used a MySQL, in all three cases?
from odbc2parquet.
It was the same (old) MySQL instance. The text is simply `mysql -e query` output piped to a file.
unixodbc, the mysql client, turbodbc and odbc2parquet were all installed from conda-forge (cargo coming from the rust dev channel), so everything is up to date.
from odbc2parquet.
I can make more scientific benchmarks (or even add them to this repo for comparison)
from odbc2parquet.
cool, thanks! I wonder how odbcsv would perform against the piped text output? That should be more comparable than odbc2parquet. I also wonder how odbc2parquet performs against turbodbc -> pandas -> parquet.
I don't know which of these comparisons with other tech stacks makes sense to include in the documentation, but I am all for a benchmark suite (I usually use criterion).
from odbc2parquet.
What do you think about https://github.com/sharkdp/hyperfine for e2e benchmarks? Criterion is nice for Rust lib benchmarking, but it wouldn't allow us to track performance against a CLI baseline (odbcsv, turbodbc or the plain mysql CLI client).
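The core of what hyperfine does for e2e comparison can be sketched in a few lines: run each command several times and compare mean wall times. A toy harness, not a substitute for hyperfine itself; the commands to benchmark are placeholders:

```python
# Minimal hyperfine-style harness: mean wall-clock time over a few runs.
import statistics
import subprocess
import sys
import time

def bench(cmd: list, runs: int = 3) -> float:
    """Return the mean wall-clock seconds over `runs` executions of `cmd`."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# In a real benchmark the commands would be e.g. an odbc2parquet invocation
# vs. an odbcsv one; here a trivial no-op command stands in.
print(f"{bench([sys.executable, '-c', 'pass']):.3f}s")
```

hyperfine adds the important parts on top of this: warmup runs, statistical outlier detection, and result export, which is why it is the better choice for a real suite.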
from odbc2parquet.
Have never used it. Yet it seems nice enough and the license checks out. I think criterion's primary use case is to catch performance regressions; comparing a bunch of different tools is a different usage scenario. Both have their uses. From your suggestion I gather that you are more interested in doing the latter.
How do you want to go about this? Do you want to test a bunch of stuff once and just publish the results? Or do you want to check a reproducible setup into this repository? What do we actually want to compare?
I am curious what we will learn from these benchmarks. So far the eerily similar numbers between turbodbc and odbc2parquet indicate that the bottleneck is transmitting the records over the network.
I'd be interested in learning what could be achieved by playing around with the batch-size parameter. Is there a point of diminishing returns, where the speedup does not warrant the additional memory usage? Is there a point after which it affects the total speed negatively?
I can see many use cases for a nice benchmarking setup. I am looking forward to it, should you wish to provide one.
from odbc2parquet.
odbcsv is similar in terms of performance: 1m50s with 1m user time. If I understand correctly it fetches 5000 rows per batch. Still 100% CPU usage continuously.
from odbc2parquet.
Hi @alippai ,
thanks for the investigation and the numbers. Increasing the fetch size has diminishing returns, if you assume a fixed overhead per batch: increasing the batch size from 1 to 100 cuts the overhead to 1/100. Going from there to a fetch size of 10,000 decreases it by 1/100 again, but the overhead you are decreasing is already much smaller.
Since the ODBC batch fetch size == Parquet row group size, you should probably choose the value so you get the desired row group size in the output. As long as we don't go row by row, the IO overhead per batch should not be too bad.
That equality is, however, not by accident. Having the row group size equal to the fetch size allows passing the non-nullable, binary-identical buffers directly to Parquet without copying them in memory. This optimization is currently implemented for inserting, but not yet for fetching data. Doing so would save at least some CPU cycles, depending on the use case.
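The fixed-overhead argument above can be made concrete with a one-line model: if every fetch call costs a constant amount regardless of batch size, the per-row overhead falls as 1/batch_size, so each successive 100x increase in batch size saves 100x less than the previous one. A sketch with an arbitrary unit cost:

```python
# Model of per-batch fixed overhead amortized over batch size.
FIXED_OVERHEAD = 1.0  # arbitrary cost units per fetch call

def per_row_overhead(batch_size: int) -> float:
    """Overhead attributed to each row when fetching `batch_size` rows at once."""
    return FIXED_OVERHEAD / batch_size

for size in (1, 100, 10_000, 1_000_000):
    print(f"batch {size:>9}: {per_row_overhead(size):.8f} overhead per row")
```

Going from 1 to 100 saves 0.99 units per row; going from 10,000 to 1,000,000 saves less than 0.0001, while the memory for the buffers grows 100x. That is the diminishing-returns point in a nutshell.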
from odbc2parquet.
I think you even mentioned the optimization already earlier, in another issue. ;-)
Cheers, Markus
from odbc2parquet.
Just another thought: as you said, the performance of odbcsv is similar. The data-writing part is different between odbc2parquet and odbcsv, while the work done by the ODBC driver is somewhat similar (with the notable difference that odbcsv requests all columns as strings). My guess is that the majority of the work happens within the ODBC driver.
from odbc2parquet.
I'm afraid the ODBC driver is the bottleneck. Is there an easy way to profile the run? Or are there good entry points in odbc-sys or the API to collect some basic perf stats? E.g. at the end of the job it could write a debug message: "receiving data 3s, time spent in ODBC driver 1s, odbc2parquet 0.5s, writing parquet 3s".
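The kind of phase breakdown asked for here is usually done by wrapping each stage of the fetch/write loop with a monotonic timer and accumulating per-phase totals. A generic sketch of the pattern (not odbc2parquet's actual code; `fetch_batch` and `write_parquet` are placeholders that just sleep):

```python
# Accumulate wall time per named phase of a fetch/write loop.
import time
from collections import defaultdict

timings = defaultdict(float)

def timed(phase: str):
    """Decorator adding the wrapped call's wall time to timings[phase]."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                timings[phase] += time.perf_counter() - start
        return inner
    return wrap

@timed("fetch")
def fetch_batch():  # placeholder for the ODBC fetch call
    time.sleep(0.01)

@timed("write")
def write_parquet():  # placeholder for the Parquet row group write
    time.sleep(0.01)

for _ in range(3):
    fetch_batch()
    write_parquet()

for phase, total in timings.items():
    print(f"{phase}: {total:.3f}s")
```

Note the limitation the maintainer points out below: this can only separate "time spent inside the driver call" from "time spent in the tool"; attributing time within the driver itself needs a profiler on the driver side.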
from odbc2parquet.
In terms of this tool, we could always add more logging, but I think it won't be very enlightening. If you can build your ODBC driver with profiling information you can get a flamegraph, or something like that. Yet at that point we would have stretched even the most generous definition of "easy way". I am currently unaware of any profiling facilities in the ODBC API.
Sorry, I don't have better news.
from odbc2parquet.
No problem, I had to ask as you are super-experienced in this field. I'll let you know if I find anything useful.
from odbc2parquet.
After seeing @alippai's benchmark, it performs the same as the others? My focus was more on performance than on the size/format of the output file.
Option 1 seems better than this tool, see attached below. @pacman82, should we perhaps evaluate this tool for larger tables with strings, where it might make a difference? Sorry if I have misunderstood.
from odbc2parquet.
Closing this issue. I hope this does not discourage anyone from sharing benchmarks or interesting findings.
from odbc2parquet.