Comments (21)
It's a shame that this was turned down. It's a feature that every Python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up.
I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Storage, it should be possible to parse it line by line as it's downloaded.
It's fairly common for network libraries to offer this kind of functionality. For example, with the standard urllib.request HTTP library:
import urllib.request
import csv
from io import TextIOWrapper

with urllib.request.urlopen('http://test.com/big.csv') as f:
    wrapped = TextIOWrapper(f)  # decode from bytes to str
    reader = csv.reader(wrapped)
    for row in reader:
        print(row[0])
This parses the CSV as it's downloaded. I'd like to get the same functionality from Google Storage. If there's already a good way to do this with the current library, please let me know.
@dmsolow Hmm, Blob.download_to_file takes a file object -- does that not suit your use case?
The lack of a simple streaming interface is a challenge when implementing a cloud function that reads and writes large files. I need the ability to read an object from Cloud Storage, manipulate it, and write it out to another object. Since the only filesystem available to GCF is /tmp, which lives in the function's memory space, you are limited to files smaller than 2 GB.
@tseaver No. I would like something that is a "file-like object." This means something that supports standard Python io methods like readline, next, read, etc. Maybe that object buffers chunks under the hood, but it should be essentially indistinguishable from the file object returned by the built-in open function.
OK, looking at the underlying implementation in google-resumable-media, all that we actually expect of the file object is that it has a write method, which is then passed each chunk as it is downloaded. You could therefore pass in an instance of your own class which wraps the underlying stream, e.g.:
from google.cloud.storage import Client

class ChunkParser(object):
    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)
@thnee You should check back: gcsfs has the setxattrs() method to set metadata, including content-type.
This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903).
As an alternative, one can use the gcsfs library, which supports file-like objects for both read and write.
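For instance, a minimal read sketch (the project and bucket names here are illustrative):

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
# fs.open returns a file-like object that fetches data in blocks as you read
with fs.open('my-bucket/big.csv', 'rb') as f:
    header = f.readline()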
Well, if this new method is so much wanted, I'd propose a solution: a class that inherits from FileIO. It initializes a ChunkedDownload in a property of self, and then on every read() call it consumes the next chunk and returns it (with some variants provided, so that seek() and flush() will also work in that class). A new blob method would initialize this object and return it to the user.
It looks like it'll work, because (as far as I know) most file methods go through read(), so overriding it should do the trick. I've already rough-coded this and tried some tests; it worked, and it's compact.
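Not the original code, but a minimal sketch of the idea with two substitutions: io.RawIOBase in place of FileIO, and ranged Blob.download_as_bytes(start=..., end=...) calls (available in newer library versions) in place of ChunkedDownload. All names are illustrative:

import io
from google.cloud.storage import Client

class BlobRawReader(io.RawIOBase):
    """Read-only file-like wrapper that fetches one ranged chunk per call."""

    def __init__(self, blob):
        blob.reload()  # populate blob.size
        self._blob = blob
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        # RawIOBase.read() and io.BufferedReader are both built on readinto().
        remaining = self._blob.size - self._pos
        if remaining <= 0:
            return 0  # EOF
        size = min(len(b), remaining)
        # 'end' is an inclusive byte offset in the download API.
        data = self._blob.download_as_bytes(start=self._pos, end=self._pos + size - 1)
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

client = Client()
blob = client.get_bucket('my_bucket_name').blob('big.csv')
reader = io.BufferedReader(BlobRawReader(blob), buffer_size=1024 * 1024)
for line in io.TextIOWrapper(reader):
    print(line, end='')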
I've implemented gs-chunked-io to satisfy my own needs for GS read/write streams. It's designed to complement the Google Python API.

import gs_chunked_io as gscio
from google.cloud.storage import Client

bucket = Client().bucket("my-bucket")
blob = bucket.get_blob("my-key")

# read
with gscio.Reader(blob) as fh:
    fh.read(size)

# read in background
with gscio.AsyncReader(blob) as fh:
    fh.read(size)

# write
with gscio.Writer("my_new_key", bucket) as fh:
    fh.write(data)
I was really surprised to see that not only is this feature not available, but it has also been brought up and closed in the past. It seems like an obvious and important feature to have.
Fortunately, gcsfs works really well as a substitute, but it's a little awkward to need a second library for such core functionality.
But gcsfs does not support setting Content-Type, so I end up having to first upload the file using gcsfs, and then call gsutil setmeta via subprocess to set it after the file has been uploaded. This takes extra time and is brittle; it is more of a workaround than a solution.
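That workaround looks something like this (the paths and content type are illustrative):

import subprocess
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/report.html', 'wb') as f:
    f.write(b'<html>hello</html>')

# Fix up Content-Type after the upload, since gcsfs could not set it directly.
subprocess.run(
    ['gsutil', 'setmeta', '-h', 'Content-Type:text/html',
     'gs://my-bucket/report.html'],
    check=True,
)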
smart_open now has support for streaming files to/from GCS.
from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')
> @petedannemann great work - any ETA for an official release?

Release 1.10 last night included GCS functionality.
Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's os.pipe: see this gist, which produces the following output:
$ bin/python pipe_test.py
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
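The gist itself isn't reproduced above, but a minimal sketch of the same os.pipe approach (names illustrative) could look like:

import os
import threading
from google.cloud.storage import Client

def stream_blob(blob, chunk_size=64 * 1024):
    """Yield chunks of 'blob' as they arrive, via a pipe and a writer thread."""
    read_fd, write_fd = os.pipe()

    def _producer():
        # download_to_file writes each chunk into the pipe; the OS pipe
        # buffer blocks the writer until the consumer drains it.
        with os.fdopen(write_fd, 'wb') as sink:
            blob.download_to_file(sink)

    thread = threading.Thread(target=_producer, daemon=True)
    thread.start()
    with os.fdopen(read_fd, 'rb') as source:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            yield chunk
    thread.join()

bucket = Client().get_bucket('my_bucket_name')
for chunk in stream_blob(bucket.blob('my_blob.xml')):
    print('reader: read one chunk of %d bytes' % len(chunk))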
Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without extra threads would be widely useful, but I don't know how hard it would be to implement.
> Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good.

Unfortunately this doesn't work with uploading streams. https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L1160 reports the size of a pipe as zero. As a result the pipe never empties, and the child eventually blocks writing to it.
Are there known workarounds?
@akuzminsky The line you've linked to is in the implementation of Blob.upload_from_filename. This issue is about being able to process downloaded chunks before the download completes.
@dmsolow Does my file-emulating wrapper class solution work for you?
TensorFlow has an implementation that gives a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
Not sure whether it actually streams, though.
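For reference, a short usage sketch (the path is illustrative, and whether reads stream or buffer internally depends on the TensorFlow version):

import tensorflow as tf

# GFile exposes read/readline/seek and line iteration over gs:// paths
with tf.io.gfile.GFile('gs://my_bucket/my_file.txt', 'r') as f:
    for line in f:
        print(line, end='')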
@petedannemann great work - any ETA for an official release?
@rocketbitz No idea, but for now you could install from GitHub:

pip install git+https://github.com/RaRe-Technologies/smart_open
Any update on this?