Comments (8)

leo-schick commented on June 9, 2024

Hi @pacman82

I am much in favor of support for writing to stdout. The default way on Linux to tell a program to write to stdout / read from stdin is to pass - as the parameter value. The calling syntax would look like this:

# to convert the table into parquet and write it to a local file using output redirection
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" > my_table.parquet

# to convert the table into parquet and write it to Azure storage using azcopy
# see https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

I would suggest implementing reading the SQL query from stdin as well. Example:

# reading the SQL query from stdin, converting the result to parquet and writing it to a local file using piping
echo "SELECT * FROM my_table" \
  | odbc2parquet query -c "<connection_string>" - - > my_table.parquet

This would make substitution with the small Linux tool sed possible, for example:

# gets the SQL query from 'echo', replaces 'TABLE_NAME' with 'my_table',
# uses odbc2parquet to run the query and write the result to stdout, which is
# piped to azcopy to upload the content to Azure Blob Storage
echo "SELECT * FROM TABLE_NAME" \
  | sed "s/TABLE_NAME/my_table/g" \
  | odbc2parquet query -c "<connection_string>" - - \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

pacman82 commented on June 9, 2024

Currently, I am on vacation. Recent parquet versions support arbitrary io::Write for output, so this should be straightforward to implement once I have regular internet access again.

If you consider a PR, please also provide tests.

mskyttner commented on June 9, 2024

Would other tools be able to process chunks of the stream even though the parquet metadata is written at the end of the stream?

pacman82 commented on June 9, 2024

Hello @mskyttner,

I guess I formulated the issue a bit close to the implementation rather than the use case. My motivation for streaming to stdout has not been to enable processing the data mid-stream. When I originally created the tool, friends frequently asked me whether odbc2parquet could stream its output directly to cloud storage, rather than write it to disk.

I feel adding native support for various cloud providers is a huge increase in scope, but the use case seems valid to me, so the idea is to stream the parquet to standard out and let the command line tools of the cloud provider of your choice take over from there. The fact that the parquet metadata is written at the end of the stream is what enables this story at all, as the metadata is not known at the beginning of writing.

Cheers, Markus

mskyttner commented on June 9, 2024

Hey Markus! Thanks for a nice tool, and tools that can fit in a pipe are nice.

I too thought it would be nice to be able to do odbc2parquet -O - ... | mc pipe s3dospaces/my250GBfile.parquet ... (assuming one uses, for example, https://docs.min.io/docs/minio-client-complete-guide.html). The MinIO object storage software (and the MinIO client) works with most providers (at least those I have tried), see https://github.com/minio/minio-rs
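
Spelled out as a full pipeline, this could look as follows. This is only a sketch: the "myminio" alias and bucket are placeholders (assumed to be configured beforehand with mc alias set), and the - output argument is the one proposed in this thread.

# sketch: stream the query result straight into object storage via the MinIO client
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | mc pipe myminio/mybucket/my_table.parquet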

Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long and often larger-than-RAM or even endless stream? I see this: minio/mc#2271

If not, they would need the metadata at the end of the stream to know the length, which I guess would mean that most content would need to be buffered before being written. For large parquet files, wouldn't the stream often risk being too large to fit in memory if there is no chunking?

pacman82 commented on June 9, 2024

Thanks for a nice tool, and tools that can fit in a pipe are nice

Thank you, and yes, I agree pipes are nice.

Do object storage CLI tools that read from stdin start writing without knowing the total length of what could be a potentially long and often larger-than-RAM or even endless stream?

I am no expert on object storage CLI tools, but given that stdin does not communicate the length of a file stream, my guess is that these tools work without prior knowledge of the stream's length. If the tool communicates with an S3-compatible backend on the other end, there seems to be the possibility of splitting the upload into up to 10,000 chunks, with a configurable chunk size. See:

https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header#8881939
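
For illustration, the AWS CLI accepts - as the source of aws s3 cp and uploads from stdin as a multipart upload, so a pipeline along these lines should work without knowing the size up front. This is only a sketch: bucket, path and query are placeholders, and --expected-size is just a hint that lets the CLI pick a suitable part size for very large streams.

# sketch: stream the parquet output into S3 via a multipart upload from stdin
odbc2parquet query -c "<connection_string>" - "SELECT * FROM my_table" \
  | aws s3 cp - "s3://<bucket>/<path>/my_table.parquet" --expected-size 250000000000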

I would imagine such tools take advantage of filesystem metadata if available, but are also able to work without it. I would not expect such a tool to care about the fact that it is a parquet file, or about any parquet-specific metadata.

Anyhow, we do not know how long the parquet file is going to be before it is completely written.

Cheers, Markus

leo-schick commented on June 9, 2024

I made it work for me via

odbc2parquet query -c "<connection_string>" /dev/stdout "$( \
  echo 'SELECT * FROM TABLE_NAME' \
  | sed 's/TABLE_NAME/my_table/g' \
)" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob

This looks nasty and isn't nice, but it works. Since I am not a Rust developer, I suggest someone else takes the lead here...

pacman82 commented on June 9, 2024

odbc2parquet 0.9.0 has just been released. It supports passing - for the output parameter of the query subcommand in order to allow writing to standard out.
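
With the - output, the /dev/stdout workaround above can be rewritten as, for example (a sketch based on the earlier examples in this thread):

# the query result is streamed to stdout via '-' and piped into azcopy
odbc2parquet query -c "<connection_string>" - \
  "$(echo 'SELECT * FROM TABLE_NAME' | sed 's/TABLE_NAME/my_table/g')" \
  | azcopy cp "https://<account_name>.blob.core.windows.net/<container_name>/<path>/my_table.parquet?<SAS token>" --from-to PipeBlob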
