Comments (21)
It's a shame that this was turned down. It's a feature that every Python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up.
I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Storage, it should be possible to parse it line by line as it's downloaded.
It's fairly common for network libraries to offer this kind of functionality. For example, with the standard urllib.request HTTP library:
import urllib.request
import csv
from io import TextIOWrapper

with urllib.request.urlopen('http://test.com/big.csv') as f:
    wrapped = TextIOWrapper(f)  # decode from bytes to str
    reader = csv.reader(wrapped)
    for row in reader:
        print(row[0])
This parses the CSV as it's downloaded. I'd like to get the same functionality from Google Storage. If there's already a good way to do this with the current library, please let me know.
@dmsolow Hmm, Blob.download_to_file takes a file object -- does that not suit your use case?
The lack of a simple streaming interface is a challenge when implementing a cloud function that reads and writes large files. I need the ability to read an object from Cloud Storage, manipulate it, and write it out to another object. Since the only filesystem available to GCF is /tmp, which lives in the function's memory space, you are limited to files smaller than 2 GB.
@tseaver No. I would like something that is a "file-like object." This means something that supports standard Python io methods like readline, next, read, etc. Maybe that object buffers chunks under the hood, but it should be essentially indistinguishable from the file object returned by the built-in open function.
OK, looking at the underlying implementation in google-resumable-media, all that we actually expect of the file object is that it has a write method, which is then passed each chunk as it is downloaded. You could therefore pass in an instance of your own class which wraps the underlying stream, e.g.:
from google.cloud.storage import Client

class ChunkParser(object):
    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)
@thnee You should check back: gcsfs has the setxattrs() method to set metadata, including content-type.
This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903).
As an alternative, one can use the gcsfs library, which supports file-like objects for both read and write.
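For instance, a minimal read sketch (the project and bucket names here are illustrative):

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
# fs.open returns a file-like object that fetches data in blocks as you read
with fs.open('my-bucket/big.csv', 'rb') as f:
    header = f.readline()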
Well, if this new method is so much wanted, I'd propose a solution: a class that inherits from FileIO. It initializes a ChunkedDownload in a property of self, and then on every read() call it consumes the next chunk and returns it (with some variants provided, so that seek() and flush() will also work in that class). A new blob method would initialize this object and return it to the user.
It looks like it'll work, because (as far as I know) most file methods go through read(), so overriding it should do the trick. I've already rough-coded this and tried some tests; it worked, and it's compact.
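Not the original code, but a minimal sketch of the idea with two substitutions: io.RawIOBase in place of FileIO, and ranged Blob.download_as_bytes(start=..., end=...) calls (available in newer library versions) in place of ChunkedDownload. All names are illustrative:

import io
from google.cloud.storage import Client

class BlobRawReader(io.RawIOBase):
    """Read-only file-like wrapper that fetches one ranged chunk per call."""

    def __init__(self, blob):
        blob.reload()  # populate blob.size
        self._blob = blob
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        # RawIOBase.read() and io.BufferedReader are both built on readinto().
        remaining = self._blob.size - self._pos
        if remaining <= 0:
            return 0  # EOF
        size = min(len(b), remaining)
        # 'end' is an inclusive byte offset in the download API.
        data = self._blob.download_as_bytes(start=self._pos, end=self._pos + size - 1)
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

client = Client()
blob = client.get_bucket('my_bucket_name').blob('big.csv')
reader = io.BufferedReader(BlobRawReader(blob), buffer_size=1024 * 1024)
for line in io.TextIOWrapper(reader):
    print(line, end='')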
I've implemented gs-chunked-io to satisfy my own needs for GS read/write streams. It's designed to complement the Google Python API.

import gs_chunked_io as gscio
from google.cloud.storage import Client

bucket = Client().bucket("my-bucket")
blob = bucket.get_blob("my-key")

# read
with gscio.Reader(blob) as fh:
    fh.read(size)

# read in background
with gscio.AsyncReader(blob) as fh:
    fh.read(size)

# write
with gscio.Writer("my_new_key", bucket) as fh:
    fh.write(data)
I was really surprised to see that not only is this feature not available, but it has also been brought up and closed in the past. It seems like an obvious and important feature to have.
Fortunately, gcsfs works really well as a substitute, but it's a little awkward to need a second library for such core functionality.
But gcsfs does not support setting Content-Type, so I end up having to first upload the file using gcsfs, and then call gsutil setmeta via subprocess to set it after the file has been uploaded. This takes extra time and is brittle; it is more of a workaround than a solution.
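That workaround looks something like this (the paths and content type are illustrative):

import subprocess
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/report.html', 'wb') as f:
    f.write(b'<html>hello</html>')

# Fix up Content-Type after the upload, since gcsfs could not set it directly.
subprocess.run(
    ['gsutil', 'setmeta', '-h', 'Content-Type:text/html',
     'gs://my-bucket/report.html'],
    check=True,
)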
smart_open now has support for streaming files to/from GCS.
from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')
> @petedannemann great work - any ETA for an official release?

Release 1.10 last night included GCS functionality.
Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's os.pipe: see this gist, which produces the following output:
$ bin/python pipe_test.py
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
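The gist itself isn't reproduced above, but a minimal sketch of the same os.pipe approach (names illustrative) could look like:

import os
import threading
from google.cloud.storage import Client

def stream_blob(blob, chunk_size=64 * 1024):
    """Yield chunks of 'blob' as they arrive, via a pipe and a writer thread."""
    read_fd, write_fd = os.pipe()

    def _producer():
        # download_to_file writes each chunk into the pipe; the OS pipe
        # buffer blocks the writer until the consumer drains it.
        with os.fdopen(write_fd, 'wb') as sink:
            blob.download_to_file(sink)

    thread = threading.Thread(target=_producer, daemon=True)
    thread.start()
    with os.fdopen(read_fd, 'rb') as source:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            yield chunk
    thread.join()

bucket = Client().get_bucket('my_bucket_name')
for chunk in stream_blob(bucket.blob('my_blob.xml')):
    print('reader: read one chunk of %d bytes' % len(chunk))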
Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without extra threads would be widely useful, but I don't know how hard it would be to implement.
> Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good.

Unfortunately this doesn't work with uploading streams. https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L1160 reports the size of a pipe as zero. As a result the pipe never empties, and the child eventually blocks writing to it.
Are there known workarounds?
@akuzminsky The line you've linked to is in the implementation of Blob.upload_from_filename. This issue is about being able to process downloaded chunks before the download completes.
@dmsolow Does my file-emulating wrapper class solution work for you?
TensorFlow has an implementation that gives a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
Not sure whether it actually streams, though.
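For reference, a short usage sketch (the path is illustrative, and whether reads stream or buffer internally depends on the TensorFlow version):

import tensorflow as tf

# GFile exposes read/readline/seek and line iteration over gs:// paths
with tf.io.gfile.GFile('gs://my_bucket/my_file.txt', 'r') as f:
    for line in f:
        print(line, end='')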
@petedannemann great work - any ETA for an official release?
@rocketbitz No idea, but for now you could install from GitHub:

pip install git+https://github.com/RaRe-Technologies/smart_open
Any update on this?