Comments (7)
The newest version allows specifying the desired memory usage. It defaults to 2 GiB on 64-bit platforms. There is still more that could be done, both in terms of streaming large values and of growing the buffer upon encountering larger values. At least the latter requires driver support, and some research on my side is required on whether this would work with MariaDB.
from odbc2parquet.
Hello, I can look at this later, but your report has been very helpful so far. May I draw your attention to the --batch_size parameter. Its default value is 100000, and it describes the number of rows transmitted between data source and buffer in one go. This tool aims to lower IO overhead by fetching many rows at once. To do so it allocates enough memory to hold batch_size entries of the largest possible row.
I guess in your case that is too much. You can lower the batch size, of course. That is likely to work because modern machines have stupid amounts of memory, but it will be slow, because you need a roundtrip for each row of the database. Currently I have no option for you to provide the domain knowledge of the maximum value size via command line parameter (would this help you?). Sometimes casting to a fixed VARCHAR() in the select statement helps (if you try it, please tell me how it goes).
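To make the scale concrete, here is a back-of-envelope sketch of the allocation described above. The column width and the two-bytes-per-character buffer are assumptions for illustration, not values from this issue; only the batch size default comes from the thread.

```python
# The fetch buffer must hold batch_size copies of the largest possible row.
batch_size = 100_000            # odbc2parquet default per this thread
max_varchar_chars = 2_000_000   # assumed declared size of the widest column
bytes_per_char = 2              # assumed wide-character (UTF-16) buffer

buffer_bytes = batch_size * max_varchar_chars * bytes_per_char
print(buffer_bytes / 2**30)     # roughly 372.5 GiB for that one column
```

With numbers like these, even a single oversized VARCHAR column dominates the allocation, which is why lowering the batch size helps immediately.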
The very least I can do is to provide a better error message. I am totally open to suggestions and ideas from your side.
Well, do I feel stupid now :D
I did actually start specifying the batch size after I ran into this for the first time, but I didn't make the connection at all between the 100000 in that argument and the error. Wow, thank you, that completely makes sense.
I re-tried now with a batch size of 2 and batches per file of 100000, and that worked completely fine, so that already is a great solution for my problem.
The domain knowledge bit could be interesting, but I'm guessing it's a pain in the ass to implement, and given the, let's call it "questionable", design of my source db here, I'm not sure it'll really work out great. This isn't the only field that's completely off in sizing, just the largest of the ones that are off. And since the largest data in there could technically change (given that the data type obviously allows it), that could be tricky for exports that run more frequently. Unless there's some way of determining something like max(length) of the largest columns and using that instead of max(data_type_length) to keep it dynamic, it could also incur some additional maintenance needs.
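The max(length) idea above can be sketched like this, using an in-memory SQLite table as a stand-in for the real data source; the table and column names are made up for illustration.

```python
import sqlite3

# In the real schema the declared column size is huge; the stored values are not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (notes TEXT)")   # hypothetical table
conn.executemany("INSERT INTO tickets VALUES (?)",
                 [("short",), ("a slightly longer note",)])

# The longest value actually stored, as opposed to the declared column size:
(actual_max,) = conn.execute("SELECT MAX(LENGTH(notes)) FROM tickets").fetchone()
print(actual_max)
```

A pre-pass like this could size the buffer realistically, at the cost of an extra query and the risk that a longer value arrives between the pre-pass and the export.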
I'll have to look into the cast option just out of curiosity, but the smaller batch size is fine for me now, so this would just be to see if I can make it work at all. Currently the cast yells at me because of Unicode/ASCII issues, so I'm not entirely sure yet whether that's a feasible way.
Yeah, not sure what options you have regarding the error message. Can you "catch" the allocation error and print something along the lines of what you mentioned here regarding largest column times batch size? It completely makes sense, but it just didn't occur to me at all.
I think you are far from stupid, and I am happy you raised the issue. I do not know how much I can do to make this work out of the box, but at the very least the tool has a UX problem in case of out-of-memory errors.
I could calculate the required size beforehand and give a fair warning if it goes over a certain amount. I also came across this gem in the ODBC 4.0 specification:
That would allow me to set an upper bound and fetch truncated values later. Documentation is sparse, though, and this probably needs support from the driver too (I don't know, just a guess). Yet it is worth checking out.
How would you feel about specifying the --batch-size in memory rather than in number of rows?
That sounds like a great idea; it would make memory usage really easy to plan.
E.g. in my use case this is running in a Kubernetes pod, so I could allocate X GB of RAM to the pod and then use that (minus a buffer) for the batch sizing without running the risk of out-of-memory errors. Worst case, the operation just takes longer because of more small queries being made against tables that declare pointlessly large column sizes, but it would be almost guaranteed to work :)
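A memory-based batch size could be derived roughly like this; the budget and row width below are made-up numbers, and this is only a sketch of the idea, not odbc2parquet's actual logic.

```python
# Turn a fixed memory budget into a row-count batch size (assumed numbers).
memory_budget_bytes = 2 * 2**30     # e.g. 2 GiB granted to the pod
max_row_bytes = 2_000_000 * 2       # widest possible row, UTF-16 assumption

rows_per_batch = max(1, memory_budget_bytes // max_row_bytes)
print(rows_per_batch)
```

The max(1, ...) floor captures the worst case described above: if a single row exceeds the budget, the export degrades to one row per roundtrip instead of failing outright.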
So far my strategy for handling these large columns and memory allocations in general is:
- Calculate and log the amount of memory required per row
- Make batch_size configurable in terms of memory consumption
- Set default to 1GB
- Warn users if more than that is required for a single row
- Try something fancy using ODBC 4.0 and length exception behaviour
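The first four steps of the strategy above could look roughly like this; the numbers are placeholders, and this is only an illustration of the planned check, not actual odbc2parquet code.

```python
# Sketch of the planned per-row check against a default memory budget.
DEFAULT_BUDGET = 2**30              # 1 GB default mentioned in the list above
row_bytes = 2_000_000 * 2           # assumed widest possible row

if row_bytes > DEFAULT_BUDGET:
    print("warning: a single row needs more memory than the default budget")
else:
    print(f"rows per batch: {DEFAULT_BUDGET // row_bytes}")
```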