
Comments (21)

dmsolow commented on September 22, 2024

It's a shame that this was turned down. It's a feature that every python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up.

dmsolow commented on September 22, 2024

I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Storage, it should be possible to parse it line by line as it's downloaded.

It's fairly common for network libraries to offer this kind of functionality. For example in the standard urllib.request HTTP library:

import urllib.request
import csv
from io import TextIOWrapper

with urllib.request.urlopen('http://test.com/big.csv') as f:
    wrapped = TextIOWrapper(f) # decode from bytes to str
    reader = csv.reader(wrapped)
    for row in reader:
        print(row[0])

This parses the CSV as it's downloaded. I'd like to get the same functionality from google storage. If there's already a good way to do this with the current library, please let me know.

tseaver commented on September 22, 2024

@dmsolow Hmm, Blob.download_to_file takes a file object -- does that not suit your use case?

ElliotSilver commented on September 22, 2024

The lack of a simple streaming interface is a challenge when implementing a cloud function that reads/writes large files. I need the ability to read an object in from Cloud Storage, manipulate it, and write it out to another object. Since the only file store available to GCF is /tmp, which lives in the function's memory space, you are limited to files smaller than 2 GB.

dmsolow commented on September 22, 2024

@tseaver No. I would like something that is a "file-like object." This means something that supports standard Python io methods like readline, next, read, etc. Maybe that object buffers chunks under the hood, but it should be essentially indistinguishable from the file object returned by the builtin open function.

tseaver commented on September 22, 2024

OK, looking at the underlying implementation in google-resumable-media, all that we actually expect of the file object is that it has a write method, which is then passed each chunk as it is downloaded.

You could therefore pass in an instance of your own class that wraps the underlying stream, e.g.:

from google.cloud.storage import Client

class ChunkParser(object):

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)

yan-hic commented on September 22, 2024

@thnee you should check back, gcsfs has the setxattrs() method to set metadata, including content-type.
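
For example (a rough sketch; the path and content type are placeholders, and the content_type keyword is assumed to match current gcsfs releases):

import gcsfs

fs = gcsfs.GCSFileSystem()

# Attach a Content-Type to an already-uploaded object.
fs.setxattrs("my_bucket/report.csv", content_type="text/csv")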

yan-hic commented on September 22, 2024

This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903).

As an alternative, one can use the gcsfs library which supports file-obj for read and write.
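
For example, parsing a large CSV as it streams down (a sketch; the bucket and object names are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()

# fs.open returns a file-like object, so the file can be consumed
# line by line without downloading it to disk first.
with fs.open("my_bucket/big.csv", "r") as f:
    for line in f:
        print(line, end="")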

IlyaFaer commented on September 22, 2024

Well, if this new method is so much wanted, I'd propose a solution: a class that inherits FileIO. It initializes a ChunkedDownload in a property, and on every read() call it consumes the next chunk and returns it (with some variations, seek() will work in that class, and so will flush()). A new blob method would initialize this object and return it to the user.

Looks like it'll work, because (as far as I know) most file methods work through read(), so overriding it should do the trick. I've already rough-coded this and tried some tests - it worked. And it's compact.
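
A rough sketch of that shape (for brevity it uses ranged downloads via Blob.download_as_bytes, available as download_as_string in older releases, rather than ChunkedDownload; buffering details and error handling are elided):

import io

class BlobReader(io.RawIOBase):
    """Read-only file-like object that fetches a blob in chunks."""

    def __init__(self, blob, chunk_size=1024 * 1024):
        blob.reload()  # populate blob.size
        self._blob = blob
        self._size = blob.size
        self._pos = 0
        self._chunk_size = chunk_size

    def readable(self):
        return True

    def readinto(self, b):
        if self._pos >= self._size:
            return 0  # EOF
        # Ranged download of at most len(b) (capped at chunk_size) bytes;
        # start/end are inclusive byte offsets.
        end = min(self._pos + min(len(b), self._chunk_size), self._size) - 1
        chunk = self._blob.download_as_bytes(start=self._pos, end=end)
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

# 'blob' is an existing google.cloud.storage.Blob. BufferedReader supplies
# read(), readline(), and iteration on top of readinto().
reader = io.BufferedReader(BlobReader(blob))
for line in reader:
    ...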

xbrianh commented on September 22, 2024

I've implemented gs-chunked-io to satisfy my own needs for GS read/write streams. It's designed to complement the Google Python API.

import gs_chunked_io as gscio
from google.cloud.storage import Client

bucket = Client().bucket("my-bucket")
blob = bucket.get_blob("my-key")
size = 1024 * 1024  # number of bytes to read at a time
data = b"x" * size  # example payload for the write

# read
with gscio.Reader(blob) as fh:
    fh.read(size)

# read in background
with gscio.AsyncReader(blob) as fh:
    fh.read(size)

# write
with gscio.Writer("my_new_key", bucket) as fh:
    fh.write(data)

thnee commented on September 22, 2024

I was really surprised to see that not only is this feature not available, but that it has been brought up and closed in the past. It seems like an obvious and important feature to have.

Fortunately, gcsfs works really well as a substitute, but it's a little awkward to need a second library for such core functionality.

But gcsfs does not support setting Content-Type, so I end up having to first upload the file using gcsfs and then call gsutil setmeta via subprocess to set it after the file has been uploaded. This takes extra time and is brittle; it's more of a workaround than a solution.
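
That workaround looks roughly like this (a sketch; names and the content type are placeholders):

import subprocess
import gcsfs

fs = gcsfs.GCSFileSystem()

# Step 1: upload through gcsfs's file-like interface.
with fs.open("my_bucket/report.csv", "wb") as f:
    f.write(b"col1,col2\n1,2\n")

# Step 2: shell out to gsutil to set the Content-Type after the upload.
subprocess.run(
    ["gsutil", "setmeta", "-h", "Content-Type:text/csv",
     "gs://my_bucket/report.csv"],
    check=True,
)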

petedannemann commented on September 22, 2024

smart_open now has support for streaming files to/from GCS.

from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

petedannemann commented on September 22, 2024

@petedannemann great work - any ETA for an official release?

Release 1.10 last night included GCS functionality

tseaver commented on September 22, 2024

Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's os.pipe: see this gist, which produces the following output:

$ bin/python pipe_test.py 
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
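
In outline, the approach looks like this (a sketch with illustrative names; 'blob' is an existing google.cloud.storage.Blob):

import os
import threading

def stream_blob(blob):
    read_fd, write_fd = os.pipe()

    def _download():
        # Closing the write end when the download finishes signals EOF
        # to the reader.
        with os.fdopen(write_fd, "wb") as write_end:
            blob.download_to_file(write_end)

    threading.Thread(target=_download, daemon=True).start()
    return os.fdopen(read_fd, "rb")

with stream_blob(blob) as stream:
    for chunk in iter(lambda: stream.read(65536), b""):
        ...  # process each chunk while the download is still in progress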

dmsolow commented on September 22, 2024

Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without using extra threads would be widely useful, but idk how hard it would be to implement.

akuzminsky commented on September 22, 2024

Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good.

Unfortunately this doesn't work with uploading streams.
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L1160 computes the size of a pipe as zero. As a result the pipe never empties, and the child eventually blocks writing to it.

Are there known workarounds?

tseaver commented on September 22, 2024

@akuzminsky The line you've linked to is in the implementation of Blob.upload_from_filename. This issue is about being able to process downloaded chunks before the download completes.

@dmsolow Does my file-emulating wrapper class solution work for you?

olejorgenb commented on September 22, 2024

TensorFlow has an implementation that gives a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile

Not sure if it actually streams or not though.
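
Usage looks like this (illustrative only; the gs:// path is a placeholder):

import tensorflow as tf

# GFile accepts gs:// paths and behaves like a built-in file object.
with tf.io.gfile.GFile("gs://my_bucket/big.csv", "r") as f:
    for line in f:
        print(line, end="")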

rocketbitz commented on September 22, 2024

@petedannemann great work - any ETA for an official release?

petedannemann commented on September 22, 2024

@rocketbitz no idea, but for now you could install it from GitHub:

pip install git+https://github.com/RaRe-Technologies/smart_open

abhipn commented on September 22, 2024

Any update on this?
