
python-storage's Introduction

Python Client for Google Cloud Storage


Google Cloud Storage is a managed service for storing unstructured data. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

A comprehensive list of changes in each version may be found in the CHANGELOG.

Certain control plane and long-running operations for Cloud Storage (including Folder and Managed Folder operations) are supported via the Storage Control Client. The Storage Control API creates one space to perform metadata-specific, control plane, and long-running operations apart from the Storage API.

Read more about the client libraries for Cloud APIs, including the older Google APIs Client Libraries, in Client Libraries Explained.

Quick Start

In order to use this library, you first need to go through the following steps. A step-by-step guide may also be found in Get Started with Client Libraries.

  1. Select or create a Cloud Platform project.
  2. Enable billing for your project.
  3. Enable the Google Cloud Storage API.
  4. Set up Authentication.

Installation

Install this library in a virtual environment using venv. venv is a tool that creates isolated Python environments. These isolated environments can have separate versions of Python packages, which allows you to isolate one project's dependencies from the dependencies of other projects.

With venv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies.

Code samples and snippets

Code samples and snippets live in the samples/ folder.

Supported Python Versions

Our client libraries are compatible with all current active and maintenance versions of Python.

Python >= 3.7

Unsupported Python Versions

Python <= 3.6

If you are using an end-of-life version of Python, we recommend that you update as soon as possible to an actively supported version.

Mac/Linux

python3 -m venv <your-env>
source <your-env>/bin/activate
pip install google-cloud-storage

Windows

py -m venv <your-env>
.\<your-env>\Scripts\activate
pip install google-cloud-storage
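
Once installed, the following is a minimal usage sketch; the bucket and object names are placeholders, and Application Default Credentials are assumed to be configured.

from google.cloud import storage

client = storage.Client()                      # uses Application Default Credentials
bucket = client.bucket("my-bucket")            # placeholder bucket name
blob = bucket.blob("remote/path/data.txt")     # placeholder object name

blob.upload_from_filename("local/data.txt")    # upload a local file
blob.download_to_filename("copy-of-data.txt")  # download it back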

Next Steps

python-storage's People

Contributors

andrewsg, arithmetic1728, busunkim96, chemelnucfin, cojenco, crwilcox, dandhlee, daniellehanks, daspecster, dhermes, emar-kar, frankyn, gcf-owl-bot[bot], hemangchothani, jkwlui, lbristol88, lukesneeringer, miacy, parthea, plamut, release-please[bot], renovate-bot, shaffeeullah, surferjeffatgoogle, tritone, tseaver, tswast, william-silversmith, wyardley, yoshi-automation


python-storage's Issues

Storage: mtime of downloaded file is incorrect by UTC offset

Google Cloud Storage v1.25.0
Python 3.7.3
OS: OSX & Win7

Issue: If I upload a file to Google Cloud Storage and then immediately download it, the mtime is incorrect - for me, I'm in EST, so I'm 5 hours behind UTC. That's the exact timedelta that occurs between the file's original mtime and the recorded mtime after the file is downloaded.

Here's an example (screenshot omitted):
The original file mtime in Google Cloud Storage is 1/23/20 9:04 PM (which is correct for the file I uploaded), but when I download the file, the mtime becomes 1/24/20 2:04 AM, which is 5 hours ahead of what it should be (the UTC offset of my timezone).

The issue is here in blob.download_to_filename:

updated = self.updated
if updated is not None:
    mtime = time.mktime(updated.timetuple())
    os.utime(file_obj.name, (mtime, mtime))

In my example, updated is the timezone-aware datetime corresponding to 2020-01-24 02:04:11.184000+00:00 (it has tzinfo==UTC). The updated.timetuple() is

time.struct_time(tm_year=2020, tm_mon=1, tm_mday=24, tm_hour=2, tm_min=4, tm_sec=9, tm_wday=4, tm_yday=24, tm_isdst=0)

The problem, I believe, is that the timetuple doesn't know this is a UTC date, nor did it convert to my timezone. The docs of mktime note, "Its argument is the struct_time or full 9-tuple (since the dst flag is needed; use -1 as the dst flag if it is unknown) which expresses the time in local time, not UTC." Perhaps, we should do this instead:

if updated is not None:
    mtime = updated.timestamp()  # For Python 3; not sure of the Python 2 equivalent
    os.utime(file_obj.name, (mtime, mtime))

The timestamp() function accounts for the timezone information in the datetime object.
I've just been doing this manually in my code after downloading a file because my application is sensitive to mtimes, and it seems to fix the issue.
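
As a self-contained illustration of the difference, using only the standard library (the example datetime mirrors the one above):

import time
from datetime import datetime, timezone

updated = datetime(2020, 1, 24, 2, 4, 11, tzinfo=timezone.utc)

naive_mtime = time.mktime(updated.timetuple())  # interprets the tuple as *local* time
aware_mtime = updated.timestamp()               # honors the attached UTC tzinfo

# The difference equals the local UTC offset, e.g. 18000 seconds (5 hours) in EST.
print(naive_mtime - aware_mtime)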

Storage: No timeouts cause indefinite hanging

Library: Google Cloud Storage
Environment: Win7 and OSX
Python Version: 3.7.3
Google Cloud Storage Version: 1.25.0

I don't believe all methods of the storage client are using timeouts. I've come across several situations where an upload or download has completely hung because of this. Unfortunately, there's no stack trace because the thread is just hanging waiting for a response. Just from a brief code inspection, I can identify an example area where a timeout is not being honored:

Bucket.get_blob calls blob.reload(), which then calls the following without specifying a timeout:

api_response = client._connection.api_request(
    method="GET",
    path=self.path,
    query_params=query_params,
    headers=self._encryption_headers(),
    _target_object=self,
)

This then calls JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout to None), which makes the ultimate call to the requests.Session object with a None timeout. The end result is that a request is made without a timeout, which can very easily cause a thread to hang indefinitely waiting for a response.

I realize that it would be a huge pain to try to find all possible None-timeout paths and patch them, but I at least wanted to bring this to attention. I'm currently wrapping every call to the Google Cloud Python library with a custom timeout function that forcefully stops execution after my own specified timeout, since I have no way to pass one in to the library. A fix that allows developers to pass in a custom timeout, either to each function called (e.g. get_blob(...)) or to the client object so that it's applied to every request in the underlying http instance, would be amazing. (In this sense, I suppose this issue is a mix of a bug and a feature request, so my apologies if I chose the incorrect category.)
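
For reference, a minimal sketch of the kind of wrapper described above, using only the standard library (the function and argument names are illustrative). Note that it only stops waiting; the underlying request keeps running in its worker thread.

import concurrent.futures

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, *args, timeout=120, **kwargs):
    # Give up waiting after `timeout` seconds; raises concurrent.futures.TimeoutError.
    future = _executor.submit(fn, *args, **kwargs)
    return future.result(timeout=timeout)

# e.g. blob = call_with_deadline(bucket.get_blob, "some-object", timeout=30)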

Storage code for resumable uploads that makes the call to resumable_media/requests/_helpers.py, more specifically the http_request function, seems to do much better since that function sets a default timeout of (60, 61) as opposed to None.

Upload blob from HTTPResponse

I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible.

However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

blob.upload_from_file(response)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1081, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 991, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 934, in _do_resumable_upload
    predefined_acl=predefined_acl,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 883, in _initiate_resumable_upload
    stream_final=False,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/_upload.py", line 409, in _prepare_initiate_request
    if stream.tell() != 0:
io.UnsupportedOperation: seek

Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?
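
One possible approach, sketched under the assumption that buffering the response to a seekable spool is acceptable (the bucket, object name, and URL are placeholders):

import shutil
import tempfile
import urllib.request

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("my-file.csv")

response = urllib.request.urlopen("https://example.com/large.csv")

# Spool to memory, overflowing to a temporary file past 64 MiB, so the
# resumable upload gets a stream that supports tell()/seek().
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024) as spool:
    shutil.copyfileobj(response, spool)
    spool.seek(0)
    blob.upload_from_file(spool)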

Storage: Misleading return type in BucketNotification.reload method's docstring

The docstring for BucketNotification's reload method specifies a bool return type:

:rtype: bool
:returns: True, if the notification exists, else False.
:raises ValueError: if the notification has no ID.

However, the method returns nothing:

response = client._connection.api_request(
    method="GET", path=self.path, query_params=query_params, timeout=timeout
)
self._set_properties(response)

Python GCS Client library low performance on multi-thread

We are experiencing slow performance with a multi-threaded script on a GCE VM; the bucket and the VM are in the same region (us-east1). After upgrading the library to the latest version (1.25), performance improved, but there is a bottleneck once 10 or more threads are used.
Threads   GCP time   AWS time
5         48.4       118.0
10        25.1       58.6
15        22.5       41.3
20        24.1       30.9
25        24.5       25.3

The test data set consists of 114,750 files, ~25 GB in total.

The results are compared with the same app hosted on a VM in AWS. We want the time to keep decreasing as the number of threads increases.

Is the library going over the public internet instead of keeping the communication inside the GCP network?
Are there limitations that could be addressed by some kind of configuration of the library?
How can we improve performance and avoid the bottleneck?

We checked the performance of the bucket with cp and perf-diag directly on the VM in GCE and the results were fine. This narrows the issue down to the library itself.

Just as a reference, these are the copy times from VMs in GCE and AWS with SDK 1.20:

Source       Multi-thread App   gsutil -m cp
GCE VM       30+ min            8.5 min
AWS EC2 VM   25 min             26 min

Storage: Bucket.list_blobs(max_results=n) does not behave as documented

The max_results parameter of list_blobs() is documented as controlling the maximum number of blobs returned in each page of results, but actually limits the total number of results as the name implies.

Compare the Bucket.list_blobs() documentation:
https://googleapis.dev/python/storage/latest/buckets.html#google.cloud.storage.bucket.Bucket.list_blobs

max_results (int) – The maximum number of blobs in each page of results from this request. Non-positive values are ignored. Defaults to a sensible value set by the API.

With the Iterator documentation:
https://googleapis.dev/python/google-api-core/latest/page_iterator.html#google.api_core.page_iterator.Iterator

max_results (int) – The maximum number of results to fetch.

Also the implementation of HTTPIterator which is used by list_blobs() internally does treat max_results as a hard limit for total num_results:
https://github.com/googleapis/google-cloud-python/blob/master/api_core/google/api_core/page_iterator.py#L378

Code example

iterator = some_big_bucket.list_blobs(max_results=100)
assert len(list(iterator)) > 100  # throws
assert sum(len(list(page)) for page in iterator.pages) > 100  # throws

Suggested resolution

Change the documentation to match what the parameter actually does. If supplying a paging size is required, a new argument to HTTPIterator could be added and exposed up through the list_blobs() interface.
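
For reference, a small sketch of the behavior as it stands today (the bucket name is a placeholder): max_results caps the total across all pages, so the page loop below never yields more than 100 blobs.

from google.cloud import storage

client = storage.Client()
iterator = client.bucket("some-big-bucket").list_blobs(max_results=100)

total = 0
for page in iterator.pages:
    total += len(list(page))
print(total)  # <= 100 in total, regardless of how the pages are split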

Storage: TimeoutGuard raises TimeoutException even though upload successful

Environment Details:

  • Using google-cloud-storage==1.23.0 and 1.24.1
  • Using Mac OSX 10.14 and Windows 7 64 bit
  • Using Python 3.7.3

Issue: A file can upload completely to Google Cloud Storage, yet still raise a TimeoutException if the upload process took longer than ~60 seconds (not 100% sure on the timedelta, but I'm guessing that it's 60 seconds from a brief analysis of the code).

Details: The use of AuthorizedSession.request for blob uploads in the Google Cloud Storage Python library causes an unwarranted TimeoutException. The TimeoutGuard class raises an unnecessary TimeoutException on file uploads to Cloud Storage even when the Cloud Storage server is responding in a timely manner to file uploads. In fact, a file can completely upload and the TimeoutGuard will still raise a TimeoutException even though a true request timeout never occurred. The reason why is explained below.

Steps to reproduce:
I first encountered this when uploading a large file (1 GB) on a medium upload connection (10 Mbps upload). Although the upload was technically successful, I was still receiving a TimeoutException at the end of the upload from a call to blob.upload_from_filename(filepath) (a resumable upload, not multipart upload).
The stacktrace is below:

  File "site-packages\google\cloud\storage\blob.py", line 1320, in upload_from_filename
  File "site-packages\google\cloud\storage\blob.py", line 1265, in upload_from_file
  File "site-packages\google\cloud\storage\blob.py", line 1175, in _do_upload
  File "site-packages\google\cloud\storage\blob.py", line 1122, in _do_resumable_upload
  File "site-packages\google\resumable_media\requests\upload.py", line 425, in transmit_next_chunk
  File "site-packages\google\resumable_media\requests\_helpers.py", line 136, in http_request
  File "site-packages\google\resumable_media\_helpers.py", line 150, in wait_and_retry
  File "site-packages\google\auth\transport\requests.py", line 287, in request
  File "site-packages\google\auth\transport\requests.py", line 110, in __exit__
requests.exceptions.Timeout" 

The core of the issue is the TimeoutGuard class when used in a context like AuthorizedSession.request. Specifically, look at the following code in the aforementioned method:

with TimeoutGuard(timeout) as guard:
    response = super(AuthorizedSession, self).request(
        method,
        url,
        data=data,
        headers=request_headers,
        timeout=timeout,
        **kwargs
    )
timeout = guard.remaining_timeout

There are two timeouts going on. One of them is a true request timeout used by the requests library (note AuthorizedSession is a subclass of requests.Session), and this is functioning correctly. The other timeout is a naive timeout set by TimeoutGuard that is causing problems. Essentially, it starts a clock that will raise a TimeoutException if a certain amount of time passes, even if the Google Cloud Storage servers are responding in a timely manner. In this case, the requests library will not raise a TimeoutException (because a true network timeout never occurred), but the TimeoutGuard will.

This causes issues with large file uploads or slow internet connections. If a user tries to upload a file that takes a long time, then even if the upload is successful and the requests library never raised a TimeoutException (i.e. the server was responding in a timely fashion for the entire upload), TimeoutGuard.__exit__ will raise an unsolicited TimeoutException.

Here's a walkthrough of the error:
(1) File upload initiated
(2) The file uploads for a couple of minutes, exceeding the default timeout of 60/61 seconds that the TimeoutGuard uses (resumable_media/requests/_helper.py _DEFAULT_CONNECT_TIMEOUT and _DEFAULT_READ_TIMEOUT; it looks like the TimeoutGuard takes the minimum of the two). The server is responding normally to all chunk uploads. A TimeoutException is never thrown from the Python requests library because the server is consistently responding.
(3) File finishes upload, TimeoutGuard raises TimeoutException even though file upload was successful.

I've been able to work around this problem by monkeypatching the TimeoutGuard code, but I believe a proper fix is needed in the codebase. I would be happy to contribute or open a pull request if a maintainer can elaborate on the need for the TimeoutGuard TimeoutException when there is already a TimeoutException being used by the requests.Session class.

'Bucket.list_blobs' surface issues

While investigating googleapis/google-cloud-python#4154, I noticed the following problems with Bucket.list_blobs:

  • It exposes paging semantics, rather than a "normal" iterator.
  • It exposes fields, which is probably not optimal for a method which is supposed to return populated Blob instances.
  • The semantics of the versions flag are questionable, given googleapis/google-cloud-python#2463.

Given that we are in GA, my inclination would be to add another method which addresses these issues, and docs-deprecate the existing one. @lukesneeringer how would you like to proceed?

Storage: Capture relevant headers to blob properties during download

Residual from googleapis/google-cloud-python#9003.

@william-silversmith notes that even with raw_download enabled, he is unable to detect the content_type of a downloaded blob without performing an additional reload request, which is prohibitive for his use case at scale. E.g.:

blob = bucket.blob( key )
binary = blob.download_as_string(raw_download=True)
if blob.content_encoding == 'gzip':
    return gunzip(binary)
elif blob.content_encoding == 'br':
    return brotli.decompress(binary)
else:
    return binary

Potentially even...

if blob.content_type == 'application/json':
    return json.loads(binary.decode('utf8'))

'Blob.exists()' does not work within batch context

The Blob.exists() method does not work when run within a Batch context. The normal behavior of exists() is to return True unless a NotFound exception occurs. Within the Batch context the exception seems to be suppressed and the function returns True. After leaving the Batch context, an Exception is then thrown.

This is how I expected to be able to use the exists() function:

blobs = [storage.blob.Blob(path, bucket) for path in paths]
with client.batch():
  bools = [blob.exists() for blob in blobs]

Without the Batch contextmanager this code works, if inefficiently. With the Batch contextmanager the code returns all Trues and throws an exception when leaving the context.

This behavior seems unintuitive to me. Please let me know if the API is meant to be used differently. If it is meant to be used as in the provided code sample, I'd be happy to attempt a fix if one of the maintainers could point me in the right direction.

Environment configuration just in case:

  • macOS 10.13.3
  • Python 3.6.5
  • google-cloud-storage==1.8.0

Add support for JSON API headers and query string parameters

Description

The XML API and the JSON API support a large set of parameters, as described at https://cloud.google.com/storage/docs/xml-api/reference-headers.

Among those parameters you can find very useful features like x-goog-metageneration, which lets you control how to deal with versions and also block rewriting of blobs in a bucket.

These APIs used to be available in the App Engine library for Google Cloud Storage, as you can see at https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/functions#open

Proposition

Adds support for extended options on different operations like upload_from_string and delete, something like:

def upload_from_string(self, data, content_type='text/plain', client=None,
                           predefined_acl=None, options=None):

This would let you pick sets of options from https://cloud.google.com/storage/docs/xml-api/reference-headers

Add support for the Storage Transfer service

Request to support the Storage Transfer service, mainly 2 components: transferJobs and transferOperations (instantiation of jobs)

Currently only available through api:

If in the works, rough ETA would be helpful as we plan on writing corresponding Airflow operators and need to decide based on what libraries.

`Blob.rewrite()` does not work with batches.

google-cloud-storage v1.8.0

Not sure whether rewrites are supposed to work when batched, but it would be nice and useful if they did, otherwise there's no efficient way to copy lots of blobs across buckets in different locations or with different encryption keys.

Example:

with gcs_client.batch():
    dest_blob.rewrite(src_blob)

Traceback:

Traceback (most recent call last):
  File "batch_test.py", line 10, in <module>
    dest_blob.rewrite(src_blob)
  File ".../pyvirtenv/python-common/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 1359, in rewrite
    rewritten = int(api_response['totalBytesRewritten'])
  File ".../pyvirtenv/python-common/lib/python2.7/site-packages/google/cloud/storage/batch.py", line 105, in __getitem__
    raise KeyError('Cannot get item %r from a future' % (key,))
KeyError: "Cannot get item 'totalBytesRewritten' from a future"

Create documentation for parallel uploads, suggest multiprocessing.

While working through #69, we found that threading is pretty slow. There is likely work the client can do to become less blocking, but suggesting multiprocessing should help users to stay on the happy path. This is what gsutil -m is doing and works very well for multiple uploads.
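
A minimal sketch of parallel uploads with multiprocessing, assuming hypothetical bucket and file names; each worker constructs its own client, since client objects are not safely shareable across processes.

import multiprocessing

from google.cloud import storage

def upload_one(filename):
    client = storage.Client()          # one client per worker process
    bucket = client.bucket("my-bucket")
    bucket.blob(filename).upload_from_filename(filename)
    return filename

if __name__ == "__main__":
    files = ["a.bin", "b.bin", "c.bin"]
    with multiprocessing.Pool(processes=4) as pool:
        for done in pool.imap_unordered(upload_one, files):
            print("uploaded", done)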

Storage: Timeout when upload file using google.cloud.storage.Blob.upload_from_filename()

Environment details

OS: MacOS 10.15.1
Python: Python 3.7.4
Google-cloud version:

google-api-core==1.16.0
google-api-python-client==1.7.11
google-auth==1.11.2
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.4.0
google-cloud-core==1.3.0
google-cloud-error-reporting==0.32.1
google-cloud-firestore==1.5.0
google-cloud-kms==1.0.0
google-cloud-logging==1.14.0
google-cloud-storage==1.26.0
google-cloud-translate==1.7.0
google-resumable-media==0.5.0
google-translate==0.1
googleapis-common-protos==1.6.0

Steps to reproduce

  1. Prepare a file with size >300MB
  2. Run blob.upload_from_filename("path/on/storage", "path/of/big/file/on/local")

Stack trace

Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
socket.timeout: The write operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', timeout('The write operation timed out'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-cdc889c11775>", line 4, in <module>
    "data/tmp/averaging.joblib", None)
  File "/Users/dualeoo/PycharmProjects/mlweb-ml/mlweb_ml/firestore/google_storage.py", line 30, in upload
    blob.upload_from_filename(file_path_on_local, content_type, predefined_acl=predefined_acl)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1342, in upload_from_filename
    predefined_acl=predefined_acl,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1287, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1197, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1144, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 425, in transmit_next_chunk
    retry_strategy=self._retry_strategy,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/requests/_helpers.py", line 136, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/auth/transport/requests.py", line 317, in request
    **kwargs
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Expected result

No timeout error

Actual result

The upload times out after about 1 minute.

Storage: upload_from_string() with ifGenerationMatch=0

The GCS HTTP protocol -- but not the Python API -- has the ability to set ifGenerationMatch when creating a storage object:

Makes the operation conditional on whether the object's current generation matches the given value. Setting to 0 makes the operation succeed only if there are no live versions of the object.

Why it's useful: With this feature, the client could create a directory placeholder entry (a 0-byte object with a name ending in '/') very efficiently like this:

blob = bucket.blob('path/to/my/subdirectory/')
blob.upload_from_string(b'', if_generation_match=0)

That one round trip creates the directory placeholder entry if it doesn't already exist. The alternatives are to first make a round trip to check if the entry exists or else to let the bucket accumulate identical placeholder entries (esp. for top level directories) by blindly creating them. [Or does GCS check if an uploaded object matches the current generation and optimize that case? -- Nope.]

Why that matters: Directory placeholders speed up gcsfuse by an order of magnitude. Without the placeholders, you have to use gcsfuse in --implicit-dirs mode, and such a mount is frustratingly slow for interactive work. E.g. it takes several seconds just to list a tiny directory containing 2 files. With the placeholders, you can run gcsfuse without --implicit-dirs, and that mount lists directories in a tenth of a second or two.

Proposal: I could create a Pull Request adding this feature if you like, with either the specific if_generation_match query parameter or a way to pass in additional query parameters.

Another alternative is to recommend that callers do something like subclass Blob and override _add_query_parameters() to add the if_generation_match=0 name-value pair. That's ugly and fragile.

Is there a way to do this that I'm missing? Are there better alternatives?
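
For what it's worth, here is a sketch of the proposed call shape, assuming a library version that accepts an if_generation_match keyword on upload_from_string (the bucket name is a placeholder); a PreconditionFailed response means the placeholder object already exists.

from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("path/to/my/subdirectory/")

try:
    blob.upload_from_string(b"", if_generation_match=0)  # create only if absent
except PreconditionFailed:
    pass  # a live version already exists; nothing to do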

Client fails for reauth, while gsutil works fine

Environment details

  • OS: OS X High Sierra
  • Python version and virtual environment information: Python 3.6.9
  • google-cloud-storage version: 1.24.1

Steps to reproduce

I'm trying to use the cloud-storage Python package with user credentials, but it fails with "invalid_grant: reauth related error (rapt_required)". Reauthenticating does not help. However, if I try it with gsutil in the same shell, everything works.

Our company policy changed a couple of months ago, so gcloud asks for reauthentication a lot more often.

Code example

client = storage.Client(project="prod-xxx")
bucket = client.get_bucket("model-data-prod-xxx")

Stack trace

File "/Users/xxx/Library/Caches/pypoetry/virtualenvs/model-pitkaveto-abtfw7oX-py3.6/lib/python3.6/site-packages/google/oauth2/_client.py", line 60, in _handle_error_response
    raise exceptions.RefreshError(error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_grant: reauth related error (rapt_required)', '{\n  "error": "invalid_grant",\n  "error_description": "reauth related error (rapt_required)",\n  "error_subtype": "rapt_required"\n}')
(model-xxx-abtfw7oX-py3.6) [xxx@mbp ~/projects/model_xxx/model_xxx (trainer *)]


Thanks!

Fix test for IAM get/set.

__________________ TestStorageBuckets.test_get_set_iam_policy __________________

self = <tests.system.TestStorageBuckets testMethod=test_get_set_iam_policy>

    def test_get_set_iam_policy(self):
        import pytest
        from google.cloud.storage.iam import STORAGE_OBJECT_VIEWER_ROLE
        from google.api_core.exceptions import BadRequest, PreconditionFailed

        bucket_name = "iam-policy" + unique_resource_id("-")
        bucket = retry_429_503(Config.CLIENT.create_bucket)(bucket_name)
        self.case_buckets_to_delete.append(bucket_name)
        self.assertTrue(bucket.exists())

        policy_no_version = bucket.get_iam_policy()
        self.assertEqual(policy_no_version.version, 1)

        policy = bucket.get_iam_policy(requested_policy_version=3)
        self.assertEqual(policy, policy_no_version)

        member = "serviceAccount:{}".format(Config.CLIENT.get_service_account_email())

        BINDING_W_CONDITION = {
            "role": STORAGE_OBJECT_VIEWER_ROLE,
            "members": {member},
            "condition": {
                "title": "always-true",
                "description": "test condition always-true",
                "expression": "true",
            },
        }
        policy.bindings.append(BINDING_W_CONDITION)

        with pytest.raises(
            PreconditionFailed, match="enable uniform bucket-level access"
        ):
            bucket.set_iam_policy(policy)

        bucket.iam_configuration.uniform_bucket_level_access_enabled = True
        bucket.patch()

        policy = bucket.get_iam_policy(requested_policy_version=3)
        policy.bindings.append(BINDING_W_CONDITION)

        with pytest.raises(BadRequest, match="at least 3"):
            bucket.set_iam_policy(policy)

        policy.version = 3
        returned_policy = bucket.set_iam_policy(policy)
        self.assertEqual(returned_policy.version, 3)
        self.assertEqual(returned_policy.bindings, policy.bindings)

        with pytest.raises(
            BadRequest, match="cannot be less than the existing policy version"
        ):
>           bucket.get_iam_policy()
E           Failed: DID NOT RAISE <class 'google.api_core.exceptions.BadRequest'>

tests/system.py:315: Failed

Allow tracking upload progress.

This is related to googleapis/google-cloud-python#1830 reopening here as this seems to have been closed many years ago.

We would really like this feature, as we need to monitor large files being uploaded to Google Storage buckets. I am surprised that not many people are after this essential feature, which makes me feel we haven't done our research properly or that the solution is very obvious or trivial.

Can someone please share an example of how we could track progress during upload?

Update: Should we be looking at google-resumable-media? We will try that out and report back.
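
In the meantime, one workaround sketch is to wrap the source file so that its read() calls report progress, and let upload_from_file consume the wrapper. The callback and file names here are illustrative, and retried chunks may be counted twice.

import io
import os

class ProgressReader(io.BufferedReader):
    """File wrapper that reports bytes handed to the uploader via a callback."""

    def __init__(self, path, callback):
        super().__init__(io.FileIO(path, "r"))
        self._callback = callback
        self._total = os.path.getsize(path)
        self._sent = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self._sent += len(chunk)
        self._callback(self._sent, self._total)
        return chunk

# blob.upload_from_file(ProgressReader("big-file.bin", lambda sent, total: print(sent, total)))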

Thanks

Proposal: move API-methods to client

The Cloud Storage Python client was one of the first google-cloud-python clients, intended to be more reliable and more Pythonic than the google-api-client that it replaced. It is also a 100% hand-written client. Because it predates the auto-generated google-cloud-python clients, the Cloud Storage Python client has inconsistencies with other google-cloud-python clients.

This design proposal is to bring the Cloud Storage client further into alignment with other google-cloud-python clients and make the eventual inclusion of an auto-generated gRPC transport layer less disruptive to users of the Cloud Storage client.

Design document:

https://docs.google.com/document/d/1A4FIxThZK_enK7OV9ChY54IPrlpbdHwuhvVEu6iTHxM/edit?usp=sharing

Please share your feedback on the specifics of this as comments in the design document.

CC @crwilcox @frankyn @lbristol88

Update: according to a comment on the design document, we need to update samples for the moved methods before deprecating the old ones.

Handle 410 errors on resumable-media operations

A previous issue googleapis/google-cloud-python#7530 was closed in favor of a targeted issue for a feature request.

Resumable media operations can fail in such a way that they cannot be retried, at least at the chunk level. A 410 error indicates that the only choice is to restart the operation altogether.

Pseudo code:

try:
   resumable_operation_upload(some_file)
except 410_error:
    # Retry the operation, from the very beginning of the file.

We should find the instances of resumable uploads and protect them from the higher level failure. It is also possible this could be pushed down into resumable media, but these higher level failures are a different category than existing retry-able errors.
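
A sketch of that restart-from-scratch behavior at the caller level, assuming the 410 surfaces as a google.api_core GoogleAPICallError with a .code attribute (how the error actually surfaces from the upload path may differ):

from google.api_core import exceptions

def upload_with_restart(blob, filename, max_attempts=3):
    # Restart the whole upload from byte 0 when the resumable session is gone.
    for attempt in range(max_attempts):
        try:
            blob.upload_from_filename(filename)
            return
        except exceptions.GoogleAPICallError as exc:
            if getattr(exc, "code", None) != 410 or attempt == max_attempts - 1:
                raise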

Related: https://issuetracker.google.com/137168102 and internal bug 115694647.

'test_access_to_public_bucket' flakes with 503

From this test run:

_______________ TestAnonymousClient.test_access_to_public_bucket _______________

self = <test_system.TestAnonymousClient testMethod=test_access_to_public_bucket>

    @vpcsc_config.skip_if_inside_vpcsc
    def test_access_to_public_bucket(self):
        anonymous = storage.Client.create_anonymous_client()
        bucket = anonymous.bucket(self.PUBLIC_BUCKET)
>       blob, = retry_429_503(bucket.list_blobs)(max_results=1)

tests/system/test_system.py:1498:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:212: in _items_iter
    for page in self._page_iter(increment=False):
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:243: in _page_iter
    page = self._next_page()
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:369: in _next_page
    response = self._get_next_page_response()
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:419: in _get_next_page_response
    method=self._HTTP_METHOD, path=self.path, query_params=params
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.storage._http.Connection object at 0x7f5575846150>
method = 'GET', path = '/b/gcp-public-data-landsat/o'
query_params = {'maxResults': 1, 'projection': 'noAcl'}, data = None
content_type = None, headers = None, api_base_url = None, api_version = None
expect_json = True, _target_object = None, timeout = 60

    def api_request(
        self,
        method,
        path,
        query_params=None,
        data=None,
        content_type=None,
        headers=None,
        api_base_url=None,
        api_version=None,
        expect_json=True,
        _target_object=None,
        timeout=_DEFAULT_TIMEOUT,
    ):
    ... # docstring elided    
    url = self.build_api_url(
            path=path,
            query_params=query_params,
            api_base_url=api_base_url,
            api_version=api_version,
        )

        # Making the executive decision that any dictionary
        # data will be sent properly as JSON.
        if data and isinstance(data, dict):
            data = json.dumps(data)
            content_type = "application/json"

        response = self._make_request(
            method=method,
            url=url,
            data=data,
            content_type=content_type,
            headers=headers,
            target_object=_target_object,
            timeout=timeout,
        )

        if not 200 <= response.status_code < 300:
>           raise exceptions.from_http_response(response)
E           ServiceUnavailable: 503 GET https://storage.googleapis.com/storage/v1/b/gcp-public-data-landsat/o?projection=noAcl&maxResults=1: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>

.nox/system-2-7/lib/python2.7/site-packages/google/cloud/_http.py:423: ServiceUnavailable

Add an API method to give us a streaming file object

It doesn't look like there's a way to get a streaming download from Google Storage in the Python API. We have download_to_file, download_as_string, and download_to_filename, but I don't see anything that returns a file-like object that can be streamed. This is a disadvantage for many file types which can usefully be processed as they download.

Can a method like this be added?
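
As a partial workaround with the current surface, a chunked download can stream into any writable object: set chunk_size on the blob and pass a custom sink to download_to_file. The handle_chunk function, bucket, and object names below are illustrative.

import io

from google.cloud import storage

def handle_chunk(data):
    # Illustrative per-chunk processing; replace with real streaming logic.
    print(len(data), "bytes received")

class ChunkSink(io.RawIOBase):
    def writable(self):
        return True

    def write(self, b):
        handle_chunk(bytes(b))
        return len(b)

client = storage.Client()
blob = client.bucket("my-bucket").blob("big-object", chunk_size=10 * 1024 * 1024)
blob.download_to_file(ChunkSink())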

Storage: Possible metadata regression on blobs in 1.24.0

Environment details

  • Debian 10
  • google-cloud-storage version: 1.24.1

Steps to reproduce

It's documented that metadata that isn't set will return a NoneType. Naturally, when wanting to unset any metadata, you expect to be able to pass in None. This worked up until 1.24.0, which broke it with googleapis/google-cloud-python#9796.

I think you can still technically set the metadata to an empty dictionary and get the same functionality, but it's a bit counter intuitive, when no metadata is described as being a NoneType.

Code example

storage_client = storage.Client(...)
blob = storage_client.get_bucket('abc').get_blob('abc')
blob.metadata = None
blob.patch()

Stack trace

AttributeError: 'NoneType' object has no attribute 'items'
  File "django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "rest_framework/views.py", line 505, in dispatch
    response = self.handle_exception(exc)
  File "rest_framework/views.py", line 465, in handle_exception
    self.raise_uncaught_exception(exc)
  File "rest_framework/views.py", line 476, in raise_uncaught_exception
    raise exc
  File "rest_framework/views.py", line 502, in dispatch
    response = handler(request, *args, **kwargs)
  File "api/media/views.py", line 96, in put
    product.media.set_primary(name)
  File "media/api.py", line 134, in set_primary
    blob.metadata = None
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1917, in metadata
    value = {k: str(v) for k, v in value.items()}
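
As the report notes, assigning an empty dict (rather than None) still goes through the setter without the AttributeError; a minimal sketch of that workaround (the bucket and blob names are placeholders):

from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.get_bucket("abc").get_blob("abc")
blob.metadata = {}   # empty dict instead of None avoids the AttributeError
blob.patch()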

Memory leak from client objects

I have a leak that seems to be related to client construction/destruction. Given the following code:

from google.cloud import storage

def upload_content(client, content):
    bucket = client.bucket('bucket-name')
    blob = bucket.blob('test-hello')
    blob.upload_from_string(content)

if __name__ == '__main__':
    content = b""
    for i in range(100):
        client = storage.Client()
        upload_content(client, content)

Here is a graph of memory usage over the 100 client creations and small uploads (graph omitted; memory grows over the iterations).

If instead a single client object is reused, you will see no growth (graph omitted; memory stays flat).

I don't believe this is storage specific. It seems something about the client object isn't being cleaned up.
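
For comparison, the non-leaking variant referenced above constructs the client once and reuses it:

from google.cloud import storage

def upload_content(client, content):
    bucket = client.bucket('bucket-name')
    bucket.blob('test-hello').upload_from_string(content)

if __name__ == '__main__':
    client = storage.Client()   # constructed once, outside the loop
    for i in range(100):
        upload_content(client, b"")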

cc: b/147997894

Clarify documentation on retrieving md5 and/or crc32c hash for blobs without downloading

Update: Just briefly after posting the issue, I found out how to do it. I'm thus recommending to update the documentation to clarify how to do this.

If I understand it correctly, retrieving a blob's md5 or crc32c hash without downloading it requires calling reload():

>>> blob = bucket.blob("gs://some/url")
>>> blob.crc32c
None
>>> blob.reload()
>>> blob.crc32c
'quMJjg=='

It took me almost an hour to find that out, including browsing the documentation, browsing SO, cloning the source and having a look around there. Eventually I found this SO post which implied that blob.crc32c actually works, and then using tab-completion trial-and-error in ipython I found the reload() method.

I think it would be great if the documentation clarified this :-).

Use Case

Checking whether a remote file needs to be downloaded when a local file of the same filename already exists.
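
For that use case, here is a sketch of comparing hashes without downloading; GCS stores md5Hash as the base64 encoding of the raw digest, and the file and object names below are placeholders.

import base64
import hashlib

from google.cloud import storage

def local_md5_b64(path):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")

client = storage.Client()
blob = client.bucket("my-bucket").blob("remote/file.bin")
blob.reload()  # populates md5_hash / crc32c without downloading the payload
needs_download = blob.md5_hash != local_md5_b64("file.bin")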

SSL certificate error when using gsutil on machine in a private subnet and wildcard CNAME

OS: Amazon Linux 2018.03
Python: 2.7.16
gsutil: 4.46

I'm receiving an ssl.CertificateError any time I try to run the gsutil command on our database server, which resides inside of a private subnet. This error only cropped up after I created a wildcard CNAME pointing everything to web.mydomain.com in Cloudflare and enabled Cloudflare's wildcard SSL. The hostname of the machine receiving the error is db.mydomain.com. When I use gsutil from our web.mydomain.com, which is in a public subnet, everything works as expected. Here's the error I'm receiving from gsutil:

Traceback (most recent call last):
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 123, in <module>
    exceptions.HandleError(e, 'gsutil')
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/calliope/exceptions.py", line 527, in HandleError
    core_exceptions.reraise(exc)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 146, in reraise
    six.reraise(type(exc_value), exc_value, tb)
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 121, in <module>
    main()
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 35, in main
    project, account = bootstrapping.GetActiveProjectAndAccount()
  File "/opt/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 292, in GetActiveProjectAndAccount
    project_name = properties.VALUES.core.project.Get(validate=False)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2039, in Get
    required)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2338, in _GetProperty
    value = _GetPropertyWithoutDefault(prop, properties_file)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2376, in _GetPropertyWithoutDefault
    value = callback()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 260, in GetProject
    return c_gce.Metadata().Project()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 281, in Metadata
    _metadata = _GCEMetadata()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 94, in __init__
    self.connected = gce_cache.GetOnGCE()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 155, in GetOnGCE
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 88, in GetOnGCE
    return self.CheckServerRefreshAllCaches()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 91, in CheckServerRefreshAllCaches
    on_gce = self._CheckServer()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 140, in _CheckServer
    gce_read.GOOGLE_GCE_METADATA_NUMERIC_PROJECT_URI)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_read.py", line 66, in ReadNoProxy
    request, timeout=timeout_property).read()
  File "/usr/lib64/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 467, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 654, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1243, in https_open
    context=self._context)
  File "/usr/lib64/python2.7/urllib2.py", line 1197, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1058, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1098, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python2.7/httplib.py", line 1054, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 892, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 854, in send
    self.connect()
  File "/usr/lib64/python2.7/httplib.py", line 1279, in connect
    server_hostname=server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib64/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 836, in do_handshake
    match_hostname(self.getpeercert(), self.server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 292, in match_hostname
    % (hostname, dnsnames[0]))
ssl.CertificateError: hostname 'metadata.google.internal' doesn't match '*.mydomain.com'

I'm assuming this isn't a bug, but instead a misconfiguration or DNS/SSL error on my end caused by the wildcard CNAME and the fact that db.mydomain.com doesn't actually point to this machine (which isn't accessible from the internet)? Any help would be appreciated.

Storage: copy_blob doesn't respect preserve_acl


Environment details

  • macOS 10.15
  • Python 3.6.5
  • google-cloud-storage==1.20.0

Steps to reproduce

  1. Upload a blob with a predefined ACL of publicRead
  2. Copy this blob with preserve_acl set to true
  3. Expected result is that the new blob is set to publicRead, but this ACL isn't actually preserved
  4. Calling make_public() on the new blob correctly sets it to publicRead

Code example

file_obj = BytesIO(...)
blob.upload_from_file(file_obj, content_type='...', predefined_acl='publicRead')

bucket = storage_client.get_bucket(...)
bucket.copy_blob(blob, destination_bucket=bucket, new_name='new-name.jpg', preserve_acl=True)
# At this point, the blob will not be set to public

Storage: invalid_grant: Invalid JWT Signature

I am facing the same issue in Python 3.

from dags.util.GCloudStorage import GCloudStorage

client = GCloudStorage(
    "/home/gaurav/airflow/dags/script/gcp_credentials/my_gcs_credentials.json", "project_name")

client.create_bucket("test_bucket")

===============================================================
Here's what GCloudStorage.py looks like:
from google.cloud import storage
from google.oauth2 import service_account
from google.api_core import exceptions

class GCloudStorage:

    def __init__(self, credential_file_path, project_id):
        """
        :param credential_file_path: credential file for authentication.
        :param project_id: project ID
        """
        self.CREDENTIAL_FILE_PATH = credential_file_path
        self.PROJECT_ID = project_id
        self.DATASET_ID = None

    def create_connection(self):
        """
        Creates a connection with Google Cloud Storage and returns a client.
        :return client: Google Cloud Storage client
        """
        google_credentials = service_account.Credentials.from_service_account_file(self.CREDENTIAL_FILE_PATH)
        # Construct a Cloud Storage client object.
        client = storage.Client(project=self.PROJECT_ID, credentials=google_credentials)
        return client

    def create_bucket(self, bucket_name):
        """
        Creates a new empty bucket.
        :param bucket_name: name of the bucket.
        :return responseMsg: success or other message.
        """
        # Instantiates a client
        storage_client = self.create_connection()
        response = None
        try:
            bucket = storage_client.create_bucket(bucket_name)
            print(bucket)
            response = "Bucket {} created".format(bucket.name)
        except exceptions.Conflict as error:
            response = "Bucket already exists: {}".format(error.code)
        return {"response": response}

    def upload_blob(self, bucket_name, source_file_name, destination_blob_name):
        """
        Uploads a file to the bucket.
        :param bucket_name: bucket name
        :param source_file_name: source file name
        :param destination_blob_name: destination blob name.
        :return responseMsg: success or failed response message.
        """
        storage_client = self.create_connection()
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(destination_blob_name)
        print("BlobName ---> ", blob)
        blob.upload_from_filename(source_file_name)
        response = "File {} uploaded to {}".format(source_file_name, destination_blob_name)
        print(response)
        return {"response": response}

    def delete_bucket(self, bucket_name):
        """
        Deletes a bucket. The bucket must be empty.
        :param bucket_name: bucket name.
        :return responseMsg: success or failed response message.
        """
        # Instantiates a client
        storage_client = self.create_connection()
        bucket = storage_client.get_bucket(bucket_name)
        bucket.delete()
        response = "Bucket {} deleted".format(bucket.name)
        return {"response": response}

=============================================================
When I run the create_bucket code above, it gives me the error below:

Traceback (most recent call last):
  File "/home/gaurav/airflow/dags/util/test cloud.py", line 7, in <module>
    client.create_bucket("test_bucket")
  File "/home/gaurav/airflow/dags/util/GCloudStorage.py", line 37, in create_bucket
    bucket = storage_client.create_bucket(bucket_name)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/storage/client.py", line 436, in create_bucket
    _target_object=bucket,
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 417, in api_request
    timeout=timeout,
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 275, in _make_request
    method, url, headers, data, target_object, timeout=timeout
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 313, in _do_request
    url=url, method=method, headers=headers, data=data, timeout=timeout
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/auth/transport/requests.py", line 277, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/service_account.py", line 334, in refresh
    access_token, expiry, _ = _client.jwt_grant(request, self._token_uri, assertion)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 153, in jwt_grant
    response_data = _token_endpoint_request(request, token_uri, body)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 124, in _token_endpoint_request
    _handle_error_response(response_body)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 60, in _handle_error_response
    raise exceptions.RefreshError(error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_grant: Invalid JWT Signature.', '{\n "error": "invalid_grant",\n "error_description": "Invalid JWT Signature."\n}')

Process finished with exit code 1

I am new to GCP, so please point out any silly mistakes.
Thank you in advance.
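
To help isolate the credential problem from the storage code, a minimal sketch that refreshes the service account credentials directly may be useful (the path is taken from the snippet above; the scope is an assumption):

from google.auth.transport.requests import Request
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/home/gaurav/airflow/dags/script/gcp_credentials/my_gcs_credentials.json",
    scopes=["https://www.googleapis.com/auth/devstorage.full_control"],
)
credentials.refresh(Request())  # raises RefreshError here too if the key is invalid or revoked
print("Token obtained, expires at:", credentials.expiry)

If this fails with the same invalid_grant error, the problem lies with the key file or the system clock rather than with the storage calls.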

Storage: Default timeout for requests breaks chunked downloads

The default timeout introduced in googleapis/google-resumable-media-python#88 is causing crashes in our application. We are using chunked downloads by setting chunk_size on the blob, and then calling download_to_file. Our application is multi-threaded, and we are actually downloading files into a custom stream that is backed by a ring-buffer (so writes may block until space is available again). In some cases, and I haven't figured out the pattern yet, our application hits the default timeout when fetching a new chunk of data inside AuthorizedSession.request. Currently, google.cloud.storage offers no way to use a custom transport (as suggested here), so this workaround is not applicable, and there is no way to override the timeout.
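
For reference, a minimal sketch of the download pattern described above, with illustrative names standing in for the real application code:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")        # placeholder bucket name
blob = bucket.blob("large-object.bin")     # placeholder object name
blob.chunk_size = 10 * 1024 * 1024         # fetch in 10 MiB chunks (must be a multiple of 256 KiB)

with open("large-object.bin", "wb") as stream:
    # In the real application this is a custom ring-buffer-backed stream whose
    # write() may block; each chunk request goes through AuthorizedSession.request
    # and is therefore subject to the default transport timeout.
    blob.download_to_file(stream)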

I can't provide a simple reproduction here yet, I'm still investigating, but since it's an issue that only occurs sporadically, I'm not even sure that's possible. I'm wondering if maybe in some scenarios Python's multi-threading just hits unlucky timing, and the thread running the request doesn't get scheduled for longer than usual, so the observed timeout ends up much higher than the configured one. I'm not sure how to test this, though.

I'm posting here, because the upstream change was made to fix googleapis/google-cloud-python#5909, I'm not sure what the proper fix would be.

Storage: wait option for delete_blobs method

Environment details

latest

Steps to reproduce

blobs = [1000000 blobs]
slow_create_blobs(blobs)
delete_blobs(blobs)
slow_create_blobs(blobs)
assert count_blobs(blobs) == big_number

The result is unpredictable: it depends on how fast delete_blobs runs relative to slow_create_blobs.
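
A sketch of how a "wait" behaviour could be approximated today with the existing public API (the function name and polling strategy are illustrative, not part of the library):

import time

def delete_blobs_and_wait(bucket, blobs, poll_interval=2.0, timeout=300.0):
    """Delete Blob objects, then poll until none of them are reported as existing."""
    bucket.delete_blobs(blobs, on_error=lambda blob: None)  # ignore already-deleted blobs
    deadline = time.monotonic() + timeout
    remaining = list(blobs)
    while remaining and time.monotonic() < deadline:
        remaining = [blob for blob in remaining if blob.exists()]
        if remaining:
            time.sleep(poll_interval)
    return remaining  # any blobs still visible after the timeout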

Storage: Bucket not including Access-Control-Allow-Origin header in preflight OPTIONS response

Fedora 31, Google Chrome 79.

On python 3.7 flask server:

from google.cloud import storage
store = storage.Client.from_service_account_json('service_account.json')
bucket = store.create_bucket('test')
cors = bucket.cors
cors.append({'origin': ['*']})
bucket.cors = cors
bucket.update()

Command line cors check:

gsutil cors get gs://test  # [{"origin": ["*"]}]

On client in JS:

// uploadUri is a signed uri from the 'test' bucket for uploading (PUT requests, v4)
// file is a local filesystem file
fetch(uploadUri, {
  method: 'PUT',
  mode: 'cors',
  cache: 'no-cache',
  headers: {
    'Content-Type': 'application/octet-stream',  // same error with file.type
  },
  body: file,
}).then(() => console.log('success'));

When this is sent, it runs a preflight OPTIONS request, which does not return the Access-Control-Allow-Origin header in the response, so the PUT fails.

Response headers include: alt-svc, cache-control, content-length, content-type, date, expires, server, status, vary, x-guploader-uploadid.

It looks like the signed URL uses the XML API by default, since the url is https://storage.googleapis.com/[BUCKET-NAME]/[PATH-NAME]?<signed_url_params> (https://cloud.google.com/storage/docs/request-endpoints), which is why I set the CORS above according to the documentation.

This happens locally, and while hosted on app engine. It also happens with both the fetch API and Axios npm package.

I've also tried adding maxAgeSeconds = 3600, method = ['*'], and 'Access-Control-Allow-Origin' to the 'responseHeader' array. Problem persists on retry, even several hours later.

Upload from command line using curl works: curl -v -I -X PUT -T file.csv -H 'Content-Type: application/octet-stream' <signed_url>, so this appears to be a browser/cors/headers issue.

I believe I have gone through and checked everything here: https://cloud.google.com/storage/docs/configuring-cors.
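
For completeness, a sketch of a more explicit CORS configuration (the values are illustrative, and this is not a confirmed fix for the issue above):

from google.cloud import storage

client = storage.Client.from_service_account_json('service_account.json')
bucket = client.get_bucket('test')
bucket.cors = [
    {
        'origin': ['*'],
        'method': ['PUT', 'GET', 'OPTIONS'],
        'responseHeader': ['Content-Type'],
        'maxAgeSeconds': 3600,
    }
]
bucket.patch()  # persist the CORS settings
print(bucket.cors)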

Support '<' comparison for some blob objects

It would be great to take a list of blob objects, from the same bucket, and sort them lexicographically by name. While it appears that list_blobs always returns blobs sorted by name, the API endpoint documentation does not provide any such guarantee. Note, however, that there is such a mention here: https://cloud.google.com/storage/docs/listing-objects.

Unfortunately, without said guarantee on sort order, we sometimes provide an explicit sort when iterating through blobs. That code ends up looking like this:

for blob in sorted(bucket.list_blobs(), key=lambda blob: blob.name):
    ...

While that's not really terrible, it'd be great to more concisely write:

for blob in sorted(bucket.list_blobs()):
    ...

Even if the documentation were to guarantee an order for list_blobs, it may still be valuable to be able to sort collections of blobs.

Providing such support requires only defining __lt__ on the blob object. In fact, I've made said changes and will link the pull request momentarily.
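
A minimal sketch of what such a comparison could look like (the actual pull request may differ; this patches Blob directly for illustration):

from google.cloud.storage.blob import Blob

def _blob_lt(self, other):
    if not isinstance(other, Blob):
        return NotImplemented
    # Compare by (bucket name, object name) so blobs from the same bucket
    # sort lexicographically by name.
    return (self.bucket.name, self.name) < (other.bucket.name, other.name)

Blob.__lt__ = _blob_lt

# With this in place, sorted(bucket.list_blobs()) works directly.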

Storage: bump google-auth dependency to 1.11.0+

The google-auth release version 1.11.0 fixes the issue with prematurely raised Timeout errors in cases where the underlying request takes a long time but still succeeds and does not itself time out.

In order to benefit from it, the version pin needs to be updated.

Storage: add timeout parameter to all public methods

As a library user, I would like to have a way to specify a (transport) timeout when calling methods that make HTTP requests under the hood. The timeout should have a reasonable default to prevent requests from hanging indefinitely in case I forget to pass in a timeout argument myself.

Motivation: User reports of requests hanging indefinitely, e.g. googleapis/google-cloud-python#10182.
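
Assuming the parameter is added as requested, usage might look something like the following sketch (the method names exist today; the timeout arguments and values are illustrative):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket", timeout=60)        # per-call timeout in seconds
blob = bucket.get_blob("data.csv", timeout=60)
blob.download_to_filename("data.csv", timeout=(3.05, 60))  # (connect, read) tuple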

Upload of large files times out.

OS 10.14.4
Python 3.8.0
google-cloud-storage: 1.20.0

I am getting a connection error due to the following code:

blob = bucket.blob(file)
blob.upload_from_filename(file)

This is certainly a fault of google.cloud.storage for the following reasons:

  1. I can upload a 60 meg file to google drive with no problem. Basically takes about 6 minutes.
  2. When I try to upload a 20 meg file I get the aforementioned error.
  3. I've uploaded about 70,000 files so far with this code, most between 10 and 60 megs with no problem.

I think I've had this problem before, and it happened when my upload speed was between 1 and 2 Mbps, which is what my upload speed is now. When my upload speed is above 10 Mbps I do not have this problem.

Still, I should be able to upload an 18 MB file in less than 18 seconds, so I don't see why I'm getting a connection error.
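
One possible mitigation on a slow uplink (a sketch, not a confirmed fix) is to use a smaller resumable-upload chunk size, so that each individual request transmits less data before the socket's write timeout:

blob = bucket.blob(file)
# Resumable uploads send one chunk per request; a smaller chunk means less data
# per request. chunk_size must be a multiple of 256 KiB.
blob.chunk_size = 1024 * 1024  # 1 MiB
blob.upload_from_filename(file)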

Here is the full traceback:

    blob.upload_from_filename(file)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1246, in upload_from_filename
    self.upload_from_file(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1195, in upload_from_file
    created_json = self._do_upload(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1105, in _do_upload
    response = self._do_resumable_upload(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1053, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/requests/upload.py", line 419, in transmit_next_chunk
    response = _helpers.http_request(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/requests/_helpers.py", line 116, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/auth/transport/requests.py", line 207, in request
    response = super(AuthorizedSession, self).request(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/adapters.py", line 495, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Synthesis failed for python-storage

Hello! Autosynth couldn't regenerate python-storage. 💔

Here's the output from running synth.py:

Cloning into 'working_repo'...
Switched to branch 'autosynth'
Running synthtool
['/tmpfs/src/git/autosynth/env/bin/python3', '-m', 'synthtool', 'synth.py', '--']
synthtool > Executing /tmpfs/src/git/autosynth/working_repo/synth.py.
.coveragerc
.flake8
.github/CONTRIBUTING.md
.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/feature_request.md
.github/ISSUE_TEMPLATE/support_request.md
.github/PULL_REQUEST_TEMPLATE.md
.github/release-please.yml
.gitignore
.kokoro/build.sh
.kokoro/continuous/common.cfg
.kokoro/continuous/continuous.cfg
.kokoro/docs/common.cfg
.kokoro/docs/docs.cfg
.kokoro/presubmit/common.cfg
.kokoro/presubmit/presubmit.cfg
.kokoro/publish-docs.sh
.kokoro/release.sh
.kokoro/release/common.cfg
.kokoro/release/release.cfg
.kokoro/trampoline.sh
CODE_OF_CONDUCT.md
CONTRIBUTING.rst
LICENSE
MANIFEST.in
docs/_static/custom.css
docs/_templates/layout.html
docs/conf.py.j2
noxfile.py.j2
renovate.json
setup.cfg
Running session blacken
Creating virtual environment (virtualenv) using python3.6 in .nox/blacken
pip install black==19.3b0
Error: pip is not installed into the virtualenv, it is located at /tmpfs/src/git/autosynth/env/bin/pip. Pass external=True into run() to explicitly allow this.
Session blacken failed.
synthtool > Failed executing nox -s blacken:

None
synthtool > Wrote metadata to synth.metadata.
Traceback (most recent call last):
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/__main__.py", line 99, in <module>
    main()
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/__main__.py", line 91, in main
    spec.loader.exec_module(synth_module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "/tmpfs/src/git/autosynth/working_repo/synth.py", line 30, in <module>
    s.shell.run(["nox", "-s", "blacken"], hide_output=False)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/shell.py", line 39, in run
    raise exc
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/shell.py", line 33, in run
    encoding="utf-8",
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['nox', '-s', 'blacken']' returned non-zero exit status 1.

Synthesis failed

Google internal developers can see the full log here.
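
The "Pass external=True into run()" error in the log suggests the blacken session is invoking pip from outside its virtualenv. A hedged sketch of how such a session is usually written (the noxfile layout and target paths are assumptions):

import nox

@nox.session(python="3.6")
def blacken(session):
    # session.install runs pip inside the session's own virtualenv, which avoids
    # the "pip is not installed into the virtualenv" error seen above.
    session.install("black==19.3b0")
    session.run("black", "google", "tests", "docs")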

Changes made in jupyter notebook ipynb file are not uploaded to the bucket

We upload a Jupyter notebook .ipynb file to a Google Cloud Storage bucket. Whenever we make changes and upload the same file, the changes are not reflected in the bucket. It looks like there is a caching issue while uploading/patching the changed file. How can we fix this issue?
We use the following code:

    def upload_blob(self, file_to_upload):
        """ Uploads file to the bucket"""
        blob = self.bucket.blob(file_to_upload)
        blob.upload_from_file(file_to_upload)
        print('File {} uploaded to {}'.format(file_to_upload, file_to_upload))
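
If the stale content comes from HTTP caching rather than the upload itself, one thing worth trying (a sketch, assuming the object's Cache-Control metadata is the culprit; names are illustrative) is to disable caching on the blob before uploading:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

def upload_blob_no_cache(bucket, file_to_upload):
    """Upload a local file and ask intermediaries not to cache the object."""
    blob = bucket.blob(file_to_upload)
    blob.cache_control = "no-cache"            # disable caching for this object
    blob.upload_from_filename(file_to_upload)  # note: upload_from_file expects a file object, not a path
    print("File {} uploaded with Cache-Control: no-cache".format(file_to_upload))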
