Comments (7)
The newest version allows specifying the desired memory usage. It defaults to 2 GiB on 64-bit platforms. There is still more that could be done, both in terms of streaming large values and of growing the buffer upon encountering larger values. At least the latter requires driver support, and some research on my side is required on whether this would work with MariaDB.
from odbc2parquet.
Hello, I can look at this later, but your report has been very helpful so far. May I draw your attention to the --batch_size parameter. Its default value is 100000, and it describes the number of rows transmitted between data source and buffer in one go. This tool aims to lower IO overhead by fetching many rows at once. To do so it allocates enough memory to hold batch_size entries of the largest possible row.
I guess in your case that is too much. You can lower the batch size, of course. That is likely to work because modern machines have stupid amounts of memory, but it will be slow, because you need a roundtrip for each row of the database. Currently I have no option for you to provide the domain knowledge of the maximum value size via command line parameter (would this help you?). Sometimes casting to a fixed VARCHAR() in the select statement helps (if you try it, please tell me how it goes).
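To make the scale concrete, here is a back-of-envelope sketch of the allocation described above. The column width and the two-bytes-per-character buffer are assumptions for illustration, not values from this issue; only the batch size default comes from the thread.

```python
# The fetch buffer must hold batch_size copies of the largest possible row.
batch_size = 100_000            # odbc2parquet default per this thread
max_varchar_chars = 2_000_000   # assumed declared size of the widest column
bytes_per_char = 2              # assumed wide-character (UTF-16) buffer

buffer_bytes = batch_size * max_varchar_chars * bytes_per_char
print(buffer_bytes / 2**30)     # roughly 372.5 GiB for that one column
```

With numbers like these, even a single oversized VARCHAR column dominates the allocation, which is why lowering the batch size helps immediately.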
The very least I can do is to provide a better error message. I am totally open to suggestions and ideas from your side.
Well, do I feel stupid now :D
I did actually start specifying the batch size after I ran into this for the first time, but I didn't make the connection at all between the 100000 in that argument and the error. Wow, thank you, that completely makes sense.
I re-tried now with a batch size of 2 and batches per file of 100000, and that worked completely fine, so that already is a great solution for my problem.
The domain knowledge bit could be interesting, but I'm guessing it's a pain in the ass to implement, and given the, let's call it "questionable", design of my source db here, I'm not sure it'll really work out great. This isn't the only field that's completely off in sizing, just the largest of the ones that are off. And since the largest data in there could technically change (given that the data type obviously allows it), that could be tricky for exports that run more frequently. Unless there's some way of determining something like max(length) of the largest columns and using that instead of max(data_type_length) to keep it dynamic, it could also incur some additional maintenance needs.
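The max(length) idea above can be sketched like this, using an in-memory SQLite table as a stand-in for the real data source; the table and column names are made up for illustration.

```python
import sqlite3

# In the real schema the declared column size is huge; the stored values are not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (notes TEXT)")   # hypothetical table
conn.executemany("INSERT INTO tickets VALUES (?)",
                 [("short",), ("a slightly longer note",)])

# The longest value actually stored, as opposed to the declared column size:
(actual_max,) = conn.execute("SELECT MAX(LENGTH(notes)) FROM tickets").fetchone()
print(actual_max)
```

A pre-pass like this could size the buffer realistically, at the cost of an extra query and the risk that a longer value arrives between the pre-pass and the export.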
I'll have to look into the cast option just out of curiosity, but the smaller batch size is fine for me now, so this would just be to see if I can make it work at all. Currently the cast yells at me because of Unicode/ASCII issues, so I'm not entirely sure yet whether that's a feasible way.
Yeah, not sure what options you have regarding the error message. Can you "catch" the allocation error and print something along the lines of what you mentioned here regarding largest column times batch size? It completely makes sense, but it just didn't occur to me at all.
I think you are far from stupid, and I am happy you raised the issue. I do not know how much I can do to make this work out of the box, but at the very least the tool has a UX problem in case of out-of-memory errors.
I could calculate the required size beforehand and give a fair warning if it goes over a certain amount. I also came across this gem in the ODBC 4.0 specification:
That would allow me to set an upper bound and fetch truncated values later. Documentation is sparse, though, and this probably needs support from the driver too (I don't know, just a guess). Yet it is worth checking out.
How would you feel about specifying the --batch-size in memory rather than in number of rows?
That sounds like a great idea; it would make memory usage really easy to plan.
E.g. in my use case this is running in a Kubernetes pod, so I could allocate X GB of RAM to the pod and then use that (minus a buffer) for the batch sizing without running the risk of out-of-memory errors. Worst case, the operation just takes longer because of more small queries being made against tables that declare pointlessly large column sizes, but it would be almost guaranteed to work :)
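A memory-based batch size could be derived roughly like this; the budget and row width below are made-up numbers, and this is only a sketch of the idea, not odbc2parquet's actual logic.

```python
# Turn a fixed memory budget into a row-count batch size (assumed numbers).
memory_budget_bytes = 2 * 2**30     # e.g. 2 GiB granted to the pod
max_row_bytes = 2_000_000 * 2       # widest possible row, UTF-16 assumption

rows_per_batch = max(1, memory_budget_bytes // max_row_bytes)
print(rows_per_batch)
```

The max(1, ...) floor captures the worst case described above: if a single row exceeds the budget, the export degrades to one row per roundtrip instead of failing outright.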
So far my strategy for handling these large columns and memory allocations in general is:
- Calculate and log the amount of memory required per row
- Make batch_size configurable in terms of memory consumption
- Set default to 1GB
- Warn users if more than that is required for a single row
- Try something fancy using ODBC 4.0 and length exception behaviour
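The first four steps of the strategy above could look roughly like this; the numbers are placeholders, and this is only an illustration of the planned check, not actual odbc2parquet code.

```python
# Sketch of the planned per-row check against a default memory budget.
DEFAULT_BUDGET = 2**30              # 1 GB default mentioned in the list above
row_bytes = 2_000_000 * 2           # assumed widest possible row

if row_bytes > DEFAULT_BUDGET:
    print("warning: a single row needs more memory than the default budget")
else:
    print(f"rows per batch: {DEFAULT_BUDGET // row_bytes}")
```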