Comments (8)
Hello @theOvercloud ,
thanks for the excelletly written issue. To my own suprise (unlike odbcsv
) odbc2parquet
does not offer a flag yet to limit the size of an indivdual column. You are unlikely the first person to run into this issue, but you are kind of the first to ask for it.
In terms of workarounds. Could you try to cast the JSON column into a VARCHAR(4096)
column or the like in order to limit the size? You would need to of course apply some kind of domain knowledege how big your JSONS actually get, in order to avoid truncation.
This is more a limitation of ODBC rather than parquet. ODBC requires you for bulk (i.e. fast) fetching of rows to allocate the buffer before hand. Yet at that point in time you do not know the length of the actual values (otherwise they would be already fetched) so you have to assume the largest possible element there is.
Hope this answer is as helpful to you as your issue is to me. Otherwise feel free to ask again.
Best, Markus
from odbc2parquet.
Hallo Markus,
thanks a lot for your fast feedback! Good idea! I will certainly try that out. Hopefully the conversion does not introduces issues when inserting the data later into an other database. I will report back:)
Cheers,
Yanick
from odbc2parquet.
I was able to get it running using your suggestion. Thanks!
However, i noticed that the driver will cache all the batches before passing it to Rust. You can see that when looking at the network traffic and comparing it to the log outputs of the script. So i would still run into memory issues. The fix was to pass the driver argument NO_CACHE=1;
(mysql docu) to the connection string. With that, the batches were directly passed to Rust and it worked as expected. Can you see the same behavior? Would it make sense to add this to the documentation or even add this argument by default? I am using the MySQL ODBC 8.0 Unicode Driver
on Ubuntu.
from odbc2parquet.
Short update on the JSON front: As I said above, i was able to successfully export the table using the cast( <column_name> as char(..)) as <column_name>
. However, when importing i had the following error:
2023-02-08T15:57:24+01:00 - WARN - State: HY000, Native error: 3144, Message: [MySQL][ODBC 8.0(w) Driver][mysqld-8.0.31-0ubuntu0.22.04.1]Cannot create a JSON value from a string with CHARACTER SET 'binary'.
I was able to fix that by setting the charset when connecting to the database via the initstmt
. My full connection string now looks like this for inserts:
connection_string="Driver={MySQL ODBC 8.0 Unicode Driver};initstmt={set charset utf8mb4;};charset=utf8mb4;Server=$db_ip;Port=$db_port;UID=$username;PWD=$password;NO_CACHE=1;FORWARD_CURSOR=1;COMPRESSED_PROTO=1;"
from odbc2parquet.
Hello Yanick,
thanks for your feedback. I am a bit underwater at the moment, otherwise I would have already implemented an option to limit the column size for querying. I'll leave this issue open at least until I can do that.
I am happy it works now for you, and there is some value in persisting that knowledge. However, a lot of your issues are specific to MySQL, I am a bit hesitant to driver specific stuff, as I want for the most part to sit firmly on top of ODBC.
How about an odbc2parquet
wiki here at GitHub with driver specific issues as a first step?
Thanks again for all your work and your feedback, I want to answer in greater detail, but I am pretty spend today.
Cheers, Markus
from odbc2parquet.
FYI: A new flag column-length-limit
has just been added to the new version odbc2parquet 0.15.2
.
from odbc2parquet.
Hi Markus,
Same here;) Always too much on the plate...
FYI: A new flag column-length-limit has just been added to the new version odbc2parquet 0.15.2.
I was not expecting that feature so soon, especially as you are doing this for free! Thanks a lot for the great tool and your work!
How about an odbc2parquet wiki here at GitHub with driver specific issues as a first step?
I agree. Would be great to have a knowledge base on that.
I think you could close the issue if you want.
Cheers,
Yanick
from odbc2parquet.
Hello Yanick,
thanks for the feedback. I have created a minimal homepage for the odbc2parquet wiki. Feel free to create a page to share you knowledge about odbc2parquet and MySQL JSON there. I did not use GitHub wiki before, so tell me if I missed something in case you wanna give it a spin.
Closing this issue.
Cheers, Markus
from odbc2parquet.
Related Issues (20)
- Reserved Column Names not Supported HOT 1
- Feature Request - Support column encryption in the generated parquet file HOT 4
- JobName as .sql file in config file HOT 4
- Parquet format version support HOT 9
- Feature suggestion: connect to URL `postgresql://username:pass@host/database` HOT 1
- What permissions are needed? - State: 42501, Native error: 1, Message: ERROR: permission denied HOT 4
- StarRocks parquet file import of parquet file generated by odbc2parquet fails with encoding error HOT 11
- Memory allocation with column-length-limit HOT 11
- Build for alpine HOT 8
- file-size-threshold generates wrong size files HOT 1
- --no-empty-file option doesn't work properly when row-groups-per-file should devide result into few files HOT 6
- MSSQL nvarchar - missing column in output file HOT 2
- Feature request: Progress bar for full table copies HOT 6
- Data source must return valid UTF16 in wide character buffer: Utf16Error HOT 4
- Write statistics HOT 14
- Make zstd the default compression HOT 4
- Build release assets for Ubuntu ARM64 as well HOT 11
- Exporter adding trailing zero's in when exporting from PostgreSQL Numeric dtype HOT 5
- thread 'main' panicked at src/query/date.rs:60:87: called `Option::unwrap()` on a `None` value HOT 7
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from odbc2parquet.