Comments (5)
Hello @lukeawyatt ,
thanks for your feedback! You are right. When I started writing this tool, I was concerned with memory usage on the machine executing this code. batch_size_mib refers to the size of a buffer for data in transit; it is not a way to specify the size of a batch within the file. Currently each batch becomes one row group, which is likely to be a lot smaller in terms of memory consumption, due to compression and to the fact that not every string or binary value maxes out its supported length. The factor by which a batch shrinks depends a lot on the shape of the data.
You likely know this already, but I write it down here anyway, as your issue made me realize that --help does a bad job of explaining this.
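To make the difference concrete, here is a small Python sketch (illustrative only, not odbc2parquet internals; the column shape, value distribution, and use of zlib are all made-up assumptions) of how a transit buffer sized for the maximum value length compares with the same data once compressed:

```python
import zlib

# Hypothetical shape: a VARCHAR(255) column fetched in batches of 100_000 rows.
rows_per_batch = 100_000
max_len = 255

# The in-transit fetch buffer must reserve the maximum length for every value.
buffer_bytes = rows_per_batch * max_len

# Actual values are usually much shorter and often repetitive, so the
# compressed row group on disk ends up far smaller than the buffer.
values = [f"customer-{i % 1000}".encode() for i in range(rows_per_batch)]
payload = b"\0".join(values)
compressed_bytes = len(zlib.compress(payload))

print(f"buffer: {buffer_bytes} B, compressed: {compressed_bytes} B")
```

How much smaller the on-disk row group is depends entirely on the data, which is why a memory-based batch size cannot predict output file sizes.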
"the file output sizes are inconsistent when specifying batch_size_mib and batches_per_file"
I would be interested to learn what kind of inconsistency you are most concerned with: the inconsistency between different datasets resulting from different queries, or the inconsistency in the size of the row groups within each dataset?
"I'd imagine it'd look something like file_size_limit_mib"
I agree, this feels like a sensible feature. I currently have little idea of how hard or easy it would be to implement. The biggest source of uncertainty is the capabilities of the upstream parquet library in that regard. If it allows writing to any io::Write, or even supports a file size limit directly, I would say this is likely to happen.
In both cases I would expect the limit to be fuzzy, though. I could imagine a file exceeding the limit by, say, the length of a footer. Would this be fine for your use case?
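A sketch of why the limit would be fuzzy (illustrative only; write_chunked, the fixed footer size, and the row-group sizes are made-up stand-ins, not parquet or odbc2parquet code): the size check can only run after a whole row group has been flushed, so a file may overshoot the threshold by up to one row group plus the footer.

```python
import io

def write_chunked(row_groups, threshold_bytes, footer_bytes=8):
    """Start a new file once the current one has reached the threshold.

    The check runs *after* each row group is written, so a file can
    overshoot the threshold by up to one row group plus the footer.
    """
    files, current = [], io.BytesIO()
    for group in row_groups:
        current.write(group)
        if current.tell() >= threshold_bytes:
            current.write(b"\0" * footer_bytes)  # stand-in for the parquet footer
            files.append(current)
            current = io.BytesIO()
    if current.tell():  # flush the last, possibly short, file
        current.write(b"\0" * footer_bytes)
        files.append(current)
    return files

# Ten 400-byte row groups against a 1000-byte threshold: each full file
# lands at 1208 bytes, overshooting the threshold by a row group + footer.
files = write_chunked([b"x" * 400] * 10, threshold_bytes=1000)
print([f.tell() for f in files])
```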
Also, if possible, I would like to learn more about the way you are using odbc2parquet and why consistent file sizes matter to you.
Thanks for all that you do!
Well, thank you. Even though you are just an anonymous person on the internet to me, this still means a lot to me.
Cheers, Markus
from odbc2parquet.
Hello @lukeawyatt ,
As of now, I feel I could only implement this using hacky workarounds. Let's see if the upstream issue gains some traction.
Cheers, Markus
Hey @pacman82
Thanks for being so prompt with this. I've subscribed and "thumbed up" the arrow-rs issue to help it gain traction. Regarding my use case and the inconsistencies I'm referencing: within my datasets, some batches hold significantly more data than others, likely due to certain varchar columns. I'd like to pass in a query agnostically and have the chunked output be fairly consistent in size. More predictability will help prevent file-cap issues when the files are in transit. My end goal can be seen in the scenarios below:
Example 1: Default parameters

odbc2parquet query --connection-string '{CS}' /tmp/out.par "{QUERY}"

OUTPUT:
out.par - 5.6 gb

Example 2: Current usage, batch configuration to enable partitioning

odbc2parquet query --batches_per_file 4000 --batch_size_mib 100 --connection-string '{CS}' /tmp/out.par "{QUERY}"

OUTPUT:
out_1.par - 745 mb
out_2.par - 1205 mb
out_3.par - 432 mb
out_4.par - 1461 mb
out_5.par - 894 mb
out_6.par - 863 mb

Example 3: Desired output, configure only a partitioned file limit

odbc2parquet query --file_size_limit_mib 976 --connection-string '{CS}' /tmp/out.par "{QUERY}"

OUTPUT:
out_1.par - 1022 mb
out_2.par - 1024 mb
out_3.par - 1019 mb
out_4.par - 1021 mb
out_5.par - 1020 mb
out_6.par - 494 mb
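For reference, the file count in Example 3 follows from simple ceiling division (numbers approximated from Example 1's single 5.6 GB output; the real threshold is fuzzy, so actual per-file sizes drift a little, as the example listing shows):

```python
import math

total_mib = 5.6 * 1024      # Example 1: one 5.6 GB file (approximate)
limit_mib = 976             # the proposed file_size_limit_mib from Example 3

# Every file but the last fills up to roughly the limit; the remainder
# spills into a final, smaller file.
n_files = math.ceil(total_mib / limit_mib)
last_file_mib = total_mib - (n_files - 1) * limit_mib
print(n_files)              # matches the out_1.par .. out_6.par listing
```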
Let me know if this helps or if you need further clarification. And thanks again!
~Luke
Hello Luke,
odbc2parquet 0.8.0 has been released. It features the --file-size-threshold option for the query subcommand. Please tell me how it works for you.
Cheers, Markus
Hi Markus,
This works flawlessly! Thank you for your efforts on this.
~Luke