Comments (8)

leo-schick commented on June 9, 2024

Hi @pacman82

I am much in favor of support for writing to stdout. The default way on Linux to tell a program to write to stdout / read from stdin is to pass - as the parameter value. The calling syntax would look like this:

# to convert the table into parquet and write it to a local file using output redirection
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" > my_table.parquet

# to convert the table into parquet and write it to Azure storage using azcopy
# see https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

I would suggest implementing reading the SQL query from stdin as well. Example:

# reading the SQL query from stdin, converting the result to parquet and writing it to a local file using piping
echo "SELECT * FROM my_table" \
  | odbc2parquet query -c "<connection_string>" - - > my_table.parquet

This would make substitution with the small Linux tool sed possible, for example:

# gets the SQL query from 'echo', replaces 'TABLE_NAME' with 'my_table',
# uses odbc2parquet to run the query and write the result to stdout, which is
# piped to azcopy to upload the content to Azure Blob Storage
echo "SELECT * FROM TABLE_NAME" \
  | sed "s/TABLE_NAME/my_table/g" \
  | odbc2parquet query -c "<connection_string>" - - \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

pacman82 commented on June 9, 2024

Currently, I am on vacation. Recent parquet versions support arbitrary io::Write for output, so this should be straightforward to implement once I have regular internet access again.

If you consider a PR, please also provide tests.

mskyttner commented on June 9, 2024

Would other tools be able to process chunks of the stream even though the parquet metadata is written at the end of the stream?

pacman82 commented on June 9, 2024

Hello @mskyttner,

I guess I formulated the issue a bit close to the implementation rather than the use case. My motivation for streaming to stdout has not been to enable processing the data mid-stream. When I originally created the tool, friends frequently asked me whether odbc2parquet could stream its output directly to cloud storage, rather than write it to disk.

I feel adding native support for various cloud providers is a huge increase in scope, but the use case seems valid to me, so the idea is to stream the parquet to standard out and let the command line tools of the cloud provider of your choice take over from there. The fact that the parquet metadata is written at the end of the stream is what enables this story at all, as the metadata is not known at the beginning of writing.

Cheers, Markus

mskyttner commented on June 9, 2024

Hey Markus! Thanks for a nice tool, and tools that can fit in a pipe are nice.

I too thought it would be nice to be able to do odbc2parquet -O - ... | mc pipe s3dospaces/my250GBfile.parquet ... (assuming one uses, for example, https://docs.min.io/docs/minio-client-complete-guide.html). The MinIO object storage software (and the MinIO client) works with most providers (at least those I have tried), see https://github.com/minio/minio-rs
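
Spelled out as a full pipeline, this could look as follows. This is only a sketch: the "myminio" alias and bucket are placeholders (assumed to be configured beforehand with mc alias set), and the - output argument is the one proposed in this thread.

# sketch: stream the query result straight into object storage via the MinIO client
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | mc pipe myminio/mybucket/my_table.parquet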

Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long and often larger-than-RAM or even endless stream? I see this: minio/mc#2271

If not, they would need the metadata at the end of the stream to know the length, which I guess would mean that most content would need to be buffered before being written. For large parquet files, wouldn't the stream often risk being too large to fit in memory if there is no chunking?

pacman82 commented on June 9, 2024

Thanks for a nice tool, and tools that can fit in a pipe are nice

Thank you, and yes, I agree pipes are nice.

Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long and often larger-than-RAM or even endless stream?

I am no expert on object storage CLI tools, but given that stdin does not communicate the length of a file stream, my guess is that these tools work without prior knowledge of the stream's length. If the tool communicates with an S3-compatible backend on the other end, there seems to be the possibility of splitting the upload into up to 10,000 chunks, with a configurable chunk size. See:

https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header#8881939
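
For illustration, the AWS CLI accepts - as the source of aws s3 cp and uploads from stdin as a multipart upload, so a pipeline along these lines should work without knowing the size up front. This is only a sketch: bucket, path and query are placeholders, and --expected-size is just a hint that lets the CLI pick a suitable part size for very large streams.

# sketch: stream the parquet output into S3 via a multipart upload from stdin
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | aws s3 cp - "s3://<bucket>/<path>/my_table.parquet" --expected-size 250000000000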

I would imagine such tools take advantage of filesystem metadata if available, but are also able to work without it. I would not expect such a tool to care about the fact that it is a parquet file, or about any parquet-specific metadata.

Anyhow, we do not know how long the parquet file is going to be before it is completely written.

Cheers, Markus

leo-schick commented on June 9, 2024

I made it work for me via

odbc2parquet query -c "<connection_string>" /dev/stdout "$( \
  echo 'SELECT * FROM TABLE_NAME' \
  | sed 's/TABLE_NAME/my_table/g' \
)" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

This looks nasty and isn't nice, but it works. Since I am not a Rust developer, I suggest someone else takes the lead here...

pacman82 commented on June 9, 2024

odbc2parquet 0.9.0 has just been released. It supports passing - for the output parameter of the query subcommand in order to allow writing to standard out.
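
With the - output, the /dev/stdout workaround above can be rewritten as, for example (a sketch based on the earlier examples in this thread):

# the query result is streamed to stdout via '-' and piped into azcopy
odbc2parquet query -c "<connection_string>" - \
  "$(echo 'SELECT * FROM TABLE_NAME' | sed 's/TABLE_NAME/my_table/g')" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob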
