Comments (8)
Hi @pacman82
I am much in favor of supporting writing to stdout. The conventional way on Linux to tell a program to write to stdout / read from stdin is to pass -
as the parameter value. The calling syntax would look like this:
# convert the table to parquet and write it to a local file via output redirection
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" > my_table.parquet
# convert the table to parquet and write it to an Azure blob storage using azcopy
# see https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
| azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob
I would suggest implementing reading the SQL query via stdin as well. Example:
# read the SQL query from stdin, convert the result to parquet, and write it to a local file
echo "SELECT * FROM my_table" \
| odbc2parquet query -c "<connection_string>" - - > my_table.parquet
This would make substitution with the small Linux tool sed possible, for example:
# gets the SQL query from 'echo', replacing 'TABLE_NAME' with 'my_table',
# using odbc2parquet to read the query and write it to stdout, which is passed
# to azcopy to write the content to an Azure blob storage
echo "SELECT * FROM TABLE_NAME" \
| sed "s/TABLE_NAME/my_table/g" \
| odbc2parquet query -c "<connection_string>" - - \
| azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob
Currently, I am on vacation. Recent parquet versions now support arbitrary io::Write for output, so this should be straightforward to implement once I have regular internet access again.
If you consider a PR, please also provide tests.
Would other tools be able to process chunks of the stream even though the parquet metadata is written at the end of the stream?
Hello @mskyttner,
I guess I formulated the issue a bit close to the implementation, rather than the use case. My motivation for streaming to stdout has not been to enable processing the data in-stream. When I originally created the tool, friends frequently asked me whether odbc2parquet
is able to stream its output to cloud storage directly, rather than write it to disk.
I feel adding native support for various cloud providers would be a huge increase in scope, but the use case seems valid to me, so the idea is to stream the parquet to standard out and let the command line tools of the cloud provider of your choice take over from there. The fact that the parquet metadata is written at the end of the stream is what enables this story at all, as the metadata is not known at the beginning of writing.
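For instance, the hand-off to a cloud CLI could look like this (a sketch using the - output syntax proposed above and Google Cloud's gsutil, whose cp command accepts - to stream from stdin; bucket and path are placeholders):
# stream the query result directly to a Google Cloud Storage object
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
| gsutil cp - gs://<bucket_name>/<path>/my_table.parquet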
Cheers, Markus
Hey Markus! Thanks for a nice tool, and tools that can fit in a pipe are nice.
I too thought it would be nice to be able to do odbc2parquet -O - ... | mc pipe s3dospaces/my250GBfile.parquet
(assuming one uses, for example, https://docs.min.io/docs/minio-client-complete-guide.html). The MinIO object storage software (and MinIO client) works with most providers (at least those I have tried), see https://github.com/minio/minio-rs
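Spelled out with the query subcommand syntax proposed earlier in the thread, that pipeline might look like this (a sketch; s3dospaces is the mc alias/target from the example above, and connection string and query are placeholders):
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
| mc pipe s3dospaces/my250GBfile.parquet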
Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long, often larger-than-RAM, or even endless stream? I see this: minio/mc#2271
If not, they would need the metadata at the end of the stream to know the length, which I guess would mean that most content would need to be buffered before being written. For large parquet files, wouldn't the stream often risk being too large to fit in memory if there is no chunking?
> Thanks for a nice tool, and tools that can fit in a pipe are nice
Thank you, and yes, I agree pipes are nice.
> Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long and often larger-than-RAM or even endless stream?
I am no expert on object storage CLI tools, but given that stdin does not communicate the length of a stream, my guess is that these tools work without prior knowledge of the stream's length. If the tool communicates with an S3-compatible backend on the other end, there seems to be the possibility of splitting the upload into up to 10000 chunks, with a configurable chunk size. See the S3 multipart upload documentation.
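For illustration, a sketch with the AWS CLI, whose aws s3 cp command accepts - as the source to stream stdin to S3 via multipart upload (bucket and path are placeholders; the chunk size setting is optional and controls the configurable part size mentioned above):
# raise the multipart chunk size from the default 8 MB to 64 MB
aws configure set default.s3.multipart_chunksize 64MB
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
| aws s3 cp - s3://<bucket_name>/<path>/my_table.parquet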
I would imagine such tools take advantage of file metadata from the filesystem if available, but are also able to work without it. I would not expect such a tool to care about the fact that it is a parquet file, or about any parquet-specific metadata.
Anyhow, we do not know how long the parquet file is going to be before it is completely written.
Cheers, Markus
I made it work for me via
odbc2parquet query -c "<connection_string>" /dev/stdout "$( \
echo 'SELECT * FROM TABLE_NAME' \
| sed 's/TABLE_NAME/my_table/g' \
)" \
| azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob
This looks nasty and isn't nice, but it works. Since I am not a Rust developer, I suggest someone else takes the lead here...
odbc2parquet 0.9.0 has just been released. It supports passing - for the output parameter of the query subcommand in order to allow writing to standard out.
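For reference, a minimal usage sketch matching the syntax requested above (connection string and query are placeholders):
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" > my_table.parquet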