
Comments (7)

pacman82 commented on May 30, 2024

Newest version allows for specifying the desired memory usage. It defaults to 2 GiB on 64-bit platforms. There is still more that could be done, both in terms of streaming large values and of being able to increase the buffer size upon encountering larger values. At least the latter requires driver support, and some research on my side is needed to check whether this would work with MariaDB.
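
A minimal sketch of what that looks like on the command line. The flag names and argument order here are assumptions based on this thread, not verified against a specific release; odbc2parquet query --help is authoritative for your version.

    # Assumed flag names and value syntax; verify with: odbc2parquet query --help
    # Caps the fetch buffer at roughly 500 MiB instead of the 2 GiB default.
    odbc2parquet query \
        --connection-string "$ODBC_CONNECTION_STRING" \
        --batch-size-memory 500MiB \
        out.parquet \
        "SELECT * FROM my_table"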


pacman82 commented on May 30, 2024

Hello, I can look at this later, but your report has been very helpful so far. May I draw your attention to the --batch_size parameter? Its default value is 100000, and it describes the number of rows transmitted between the data source and the buffer in one go. This tool aims to lower IO overhead by fetching many rows at once. To do so, it allocates enough memory to hold batch_size entries of the largest possible row.
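
In other words, the up-front allocation is roughly batch_size times the maximum possible size of a row. A quick back-of-the-envelope illustration (the numbers are hypothetical, not taken from this report):

    # One column whose declared type allows ~2 MB per value, combined with the
    # default batch size of 100000 rows:
    #   100000 rows * 2000000 bytes = 200 GB requested for that column alone
    echo $((100000 * 2000000))   # 200000000000 bytes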

I guess in your case that is too much. You can of course lower the batch size. That is likely to work, because modern machines have stupid amounts of memory, but it will be slow, because you need a roundtrip for each row of the database. Currently I have no option for you to provide the domain knowledge of the maximum value size via a command line parameter (would this help you?). Sometimes casting to a fixed VARCHAR() in the select statement helps (if you try it, please tell me how it goes).
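
For the cast approach, the idea is to declare a tighter maximum length directly in the query, so the per-row buffer is sized from the cast target instead of the oversized column type. A sketch with hypothetical table and column names (the exact CAST syntax and the length you can safely use depend on your database and data):

    # Hypothetical names; adjust the VARCHAR length to your real maximum.
    odbc2parquet query \
        --connection-string "$ODBC_CONNECTION_STRING" \
        out.parquet \
        "SELECT id, CAST(huge_text_col AS VARCHAR(255)) AS huge_text_col FROM my_table"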

The very least I can do is to provide a better error message. I am totally open to suggestions and ideas from your side.


langesven commented on May 30, 2024

Well, do I feel stupid now :D
I did actually start specifying the batch size after I ran into this for the first time, but I didn't make the connection at all between the 100000 in that argument and the error, wow. Thank you, that completely makes sense.
I re-tried now with a batch size of 2 and batches per file of 100000 and that worked completely fine, so that already is a great solution for my problem.
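
For reference, that run looks roughly like this (flag spellings are assumptions; check odbc2parquet query --help for the exact names in your version):

    # Tiny batches, many batches per output file; table name is hypothetical.
    odbc2parquet query \
        --connection-string "$ODBC_CONNECTION_STRING" \
        --batch-size 2 \
        --batches-per-file 100000 \
        out.parquet \
        "SELECT * FROM my_table"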

The domain knowledge bit could be interesting, but I'm guessing it's a pain in the ass to implement, and given the very, uhm, let's call it "questionable" design of my source db here, I'm not sure it'll really work out great. This isn't the only field that's completely off in sizing, just the largest one of the ones that are off. And since what the largest value in there is could technically change (given that the data type obviously allows it), that could be tricky for exports that run more frequently. Unless there's some way of determining something like max(length) of the largest columns and using that instead of the max(data_type_length) to keep it dynamic, it could also incur some additional maintenance needs.
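
One way to get at that by hand is a one-off probe for the actual maximum stored length, and then sizing the CAST in the export query from that instead of the declared type. A sketch with hypothetical names (LEN() is SQL Server dialect; other databases use CHAR_LENGTH() or LENGTH()):

    # Probe the real maximum length instead of trusting the declared type size.
    odbc2parquet query \
        --connection-string "$ODBC_CONNECTION_STRING" \
        probe.parquet \
        "SELECT MAX(LEN(huge_text_col)) AS max_len FROM my_table"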

I'll have to look into the cast option just out of curiosity, but the smaller batch size is fine for me now, so this would just be to see if I can make it work with that at all. Currently the cast yells at me because of unicode/ascii issues, so I'm not entirely sure yet whether that's a feasible way or not.

Yeah, not sure what options you have regarding the error message. Could you "catch" the allocation error and print something along the lines of what you mentioned here regarding largest column times batch size? It completely makes sense, it just didn't occur to me at all.


pacman82 commented on May 30, 2024

I think you are far from stupid, and I am happy you raised the issue. I do not know how much I can do to make this work out of the box, but at the very least the tool has a UX problem in case of out-of-memory errors.

I could calculate the required size beforehand and give a fair warning if it goes over a certain amount. I also came across this gem in the ODBC 4.0 specification:

https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/ODBC%204.0.md#381-sql_attr_length_exception_behavior

That would allow me to set an upper bound and fetch truncated values later. Documentation is sparse, though, and this probably needs support from the driver, too (don't know, just a guess). Yet it is worth checking out.


pacman82 commented on May 30, 2024

How would you feel about specifying the --batch-size in memory rather than in number of rows?


langesven commented on May 30, 2024

That sounds like a great idea; it would make this really easy to plan.
E.g. in my use case this is running in a Kubernetes pod, so I could allocate X GB of RAM to the pod and then use that (minus a buffer) for the batch sizing without running the risk of out-of-memory errors etc. Worst case, the operation just takes longer because more small queries are made against tables that expose pointlessly large column sizes, but it would be almost guaranteed to work :)


pacman82 commented on May 30, 2024

So far my strategy for handling these large columns and memory allocations in general is:

  • Calculate and log the amount of memory required per row
  • Make batch_size configurable in terms of memory consumption (rough arithmetic sketched after this list)
  • Set default to 1GB
  • Warn users if more than that is required for a single row
  • Try something fancy using ODBC 4.0 and length exception behaviour
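
To illustrate how a memory-based batch_size would play out (purely hypothetical numbers, not an implementation detail of the tool): the batch size in rows simply becomes the memory budget divided by the worst-case row footprint.

    # 1 GiB budget, 8 KiB worst-case row  -> ~131072 rows per batch
    echo $((1024 * 1024 * 1024 / 8192))      # 131072
    # 1 GiB budget, 2 MiB worst-case row   -> 512 rows per batch instead of an
    # allocation failure
    echo $((1024 * 1024 * 1024 / 2097152))   # 512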

