
python-storage's Introduction

Python Client for Google Cloud Storage


Google Cloud Storage is a managed service for storing unstructured data. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

A comprehensive list of changes in each version may be found in the CHANGELOG.

Certain control plane and long-running operations for Cloud Storage (including Folder and Managed Folder operations) are supported via the Storage Control Client. The Storage Control API creates one space to perform metadata-specific, control plane, and long-running operations apart from the Storage API.

Read more about the client libraries for Cloud APIs, including the older Google APIs Client Libraries, in Client Libraries Explained.

Quick Start

In order to use this library, you first need to go through the following steps. A step-by-step guide may also be found in Get Started with Client Libraries.

  1. Select or create a Cloud Platform project.
  2. Enable billing for your project.
  3. Enable the Google Cloud Storage API.
  4. Set up Authentication.

Installation

Install this library in a virtual environment using venv. venv is a tool that creates isolated Python environments. These isolated environments can have separate versions of Python packages, which allows you to isolate one project's dependencies from the dependencies of other projects.

With venv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies.

Code samples and snippets

Code samples and snippets live in the samples/ folder.

Supported Python Versions

Our client libraries are compatible with all current active and maintenance versions of Python.

Python >= 3.7

Unsupported Python Versions

Python <= 3.6

If you are using an end-of-life version of Python, we recommend that you update as soon as possible to an actively supported version.

Mac/Linux

python3 -m venv <your-env>
source <your-env>/bin/activate
pip install google-cloud-storage

Windows

py -m venv <your-env>
.\<your-env>\Scripts\activate
pip install google-cloud-storage
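
Once installed, the following is a minimal usage sketch; the bucket and object names are placeholders, and Application Default Credentials are assumed to be configured.

from google.cloud import storage

client = storage.Client()                      # uses Application Default Credentials
bucket = client.bucket("my-bucket")            # placeholder bucket name
blob = bucket.blob("remote/path/data.txt")     # placeholder object name

blob.upload_from_filename("local/data.txt")    # upload a local file
blob.download_to_filename("copy-of-data.txt")  # download it back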

Next Steps

python-storage's People

Contributors

andrewsg, arithmetic1728, busunkim96, chemelnucfin, cojenco, crwilcox, dandhlee, daniellehanks, daspecster, dhermes, emar-kar, frankyn, gcf-owl-bot[bot], hemangchothani, jkwlui, lbristol88, lukesneeringer, miacy, parthea, plamut, release-please[bot], renovate-bot, shaffeeullah, surferjeffatgoogle, tritone, tseaver, tswast, william-silversmith, wyardley, yoshi-automation


python-storage's Issues

Storage: mtime of downloaded file is incorrect by UTC offset

Google Cloud Storage v1.25.0
Python 3.7.3
OS: OSX & Win7

Issue: If I upload a file to Google Cloud Storage and then immediately download it, the mtime is incorrect - for me, I'm in EST, so I'm 5 hours behind UTC. That's the exact timedelta that occurs between the file's original mtime and the recorded mtime after the file is downloaded.

Here's an example (screenshot omitted):
The original file mtime in Google Cloud Storage is 1/23/20 9:04 PM (which is correct for the file I uploaded), but when I download the file, the mtime becomes 1/24/20 2:04 AM, which is 5 hours ahead of what it should be (the UTC offset of my timezone).

The issue is here in blob.download_to_filename:

updated = self.updated
if updated is not None:
    mtime = time.mktime(updated.timetuple())
    os.utime(file_obj.name, (mtime, mtime))

In my example, updated is the timezone-aware datetime corresponding to 2020-01-24 02:04:11.184000+00:00 (it has tzinfo==UTC). The updated.timetuple() is

time.struct_time(tm_year=2020, tm_mon=1, tm_mday=24, tm_hour=2, tm_min=4, tm_sec=9, tm_wday=4, tm_yday=24, tm_isdst=0)

The problem, I believe, is that the timetuple doesn't know this is a UTC date, nor did it convert to my timezone. The docs of mktime note, "Its argument is the struct_time or full 9-tuple (since the dst flag is needed; use -1 as the dst flag if it is unknown) which expresses the time in local time, not UTC." Perhaps, we should do this instead:

if updated is not None:
    mtime = updated.timestamp()  # For Python 3; not sure of the Python 2 equivalent
    os.utime(file_obj.name, (mtime, mtime))

The timestamp() function accounts for the timezone information in the datetime object.
I've just been doing this manually in my code after downloading a file because my application is sensitive to mtimes, and it seems to fix the issue.
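
As a self-contained illustration of the difference, using only the standard library (the example datetime mirrors the one above):

import time
from datetime import datetime, timezone

updated = datetime(2020, 1, 24, 2, 4, 11, tzinfo=timezone.utc)

naive_mtime = time.mktime(updated.timetuple())  # interprets the tuple as *local* time
aware_mtime = updated.timestamp()               # honors the attached UTC tzinfo

# The difference equals the local UTC offset, e.g. 18000 seconds (5 hours) in EST.
print(naive_mtime - aware_mtime)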

Storage: No timeouts cause indefinite hanging

Library: Google Cloud Storage
Environment: Win7 and OSX
Python Version: 3.7.3
Google Cloud Storage Version: 1.25.0

I don't believe all methods of the storage client are using timeouts. I've come across several situations where an upload or download has completely hung because of this. Unfortunately, there's no stack trace because the thread is just hanging waiting for a response. Just from a brief code inspection, I can identify an example area where a timeout is not being honored:

Bucket.get_blob calls blob.reload(), which then calls the following without specifying a timeout:

api_response = client._connection.api_request(
    method="GET",
    path=self.path,
    query_params=query_params,
    headers=self._encryption_headers(),
    _target_object=self,
)

This then calls JSONConnection.api_request (defaults timeout to None) -> JSONConnection._make_request (defaults timeout to None) -> JSONConnection._do_request (defaults timeout to None) -> AuthorizedSession.request (defaults timeout to None), which makes the ultimate call to the requests.Session object with a None timeout. The end result is that a request is made without a timeout, which can very easily cause a thread to hang indefinitely waiting for a response.

I realize that it would be a huge pain to try to find all possible None-timeout paths and patch them, but I at least wanted to bring this to attention. I'm currently wrapping every call to the Google Cloud Python library with a custom timeout function that forcefully stops execution after my own specified timeout, since I have no way to pass one in to the library. A fix that allows developers to pass in a custom timeout, either to each function called (e.g. get_blob(...)) or to the client object so that it's applied to every request in the underlying http instance, would be amazing. (In this sense, I suppose this issue is a mix of a bug and a feature request, so my apologies if I chose the incorrect category.)
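
For reference, a minimal sketch of the kind of wrapper described above, using only the standard library (the function and argument names are illustrative). Note that it only stops waiting; the underlying request keeps running in its worker thread.

import concurrent.futures

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, *args, timeout=120, **kwargs):
    # Give up waiting after `timeout` seconds; raises concurrent.futures.TimeoutError.
    future = _executor.submit(fn, *args, **kwargs)
    return future.result(timeout=timeout)

# e.g. blob = call_with_deadline(bucket.get_blob, "some-object", timeout=30)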

Storage code for resumable uploads that makes the call to resumable_media/requests/_helpers.py, more specifically the http_request function, seems to do much better since that function sets a default timeout of (60, 61) as opposed to None.

Upload blob from HTTPResponse

I'm trying to use Blob.upload_from_file to upload an http.client.HTTPResponse object without saving it to disk first. It seems like this, or a version of this that wraps the HTTPResponse in an io object, should be possible.

However, because the response may be larger than _MAX_MULTIPART_SIZE, Blob.upload_from_file creates a resumable upload, which depends on tell to make sure the stream is at the beginning. Here is the code that reproduces this issue:

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('my-file.csv', chunk_size=1 << 20)

import urllib.request
a_few_megs_of_data = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&batter_stands=&game_date_gt=2018-09-06&game_date_lt=2018-09-09&group_by=name&hfAB=&hfBBL=&hfBBT=&hfC=&hfFlag=&hfGT=R%7CPO%7CS%7C&hfInn=&hfNewZones=&hfOuts=&hfPR=&hfPT=&hfRO=&hfSA=&hfSea=2018%7C&hfSit=&hfZ=&home_road=&metric_1=&min_abs=0&min_pitches=0&min_results=0&opponent=&pitcher_throws=&player_event_sort=h_launch_speed&player_type=batter&position=&sort_col=pitches&sort_order=desc&stadium=&team=&type=details'
response = urllib.request.urlopen(a_few_megs_of_data)

blob.upload_from_file(response)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1081, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 991, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 934, in _do_resumable_upload
    predefined_acl=predefined_acl,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 883, in _initiate_resumable_upload
    stream_final=False,
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/Users/jared/my-env/lib/python3.7/site-packages/google/resumable_media/_upload.py", line 409, in _prepare_initiate_request
    if stream.tell() != 0:
io.UnsupportedOperation: seek

Is it possible to read an HTTP response in chunks and write it to the blob without using the filesystem as an intermediary, or is this bad practice? If it is possible and not discouraged, what is the recommended way to do this?
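
One possible approach, sketched under the assumption that buffering the response to a seekable spool is acceptable (the bucket, object name, and URL are placeholders):

import shutil
import tempfile
import urllib.request

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("my-file.csv")

response = urllib.request.urlopen("https://example.com/large.csv")

# Spool to memory, overflowing to a temporary file past 64 MiB, so the
# resumable upload gets a stream that supports tell()/seek().
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024) as spool:
    shutil.copyfileobj(response, spool)
    spool.seek(0)
    blob.upload_from_file(spool)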

Storage: Misleading return type in BucketNotification.reload method's docstring

The docstring for BucketNotification's reload method specifies a bool return type:

:rtype: bool
:returns: True, if the notification exists, else False.
:raises ValueError: if the notification has no ID.

However, the method returns nothing:

response = client._connection.api_request(
    method="GET", path=self.path, query_params=query_params, timeout=timeout
)
self._set_properties(response)

Python GCS Client library low performance on multi-thread

We are experiencing slow performance with a multi-threaded script on a GCE VM; the bucket and the VM are in the same region (us-east1). After upgrading the library to the latest version (1.25), performance improved, but there is a bottleneck once 10 or more threads are used.
Threads   GCP time   AWS time
5         48.4       118.0
10        25.1       58.6
15        22.5       41.3
20        24.1       30.9
25        24.5       25.3

The test data set consists of 114,750 files, ~25 GB in total.

The results are compared with the same app hosted on a VM in AWS. We want the time to keep decreasing as the number of threads increases.

Is the library going over the public internet instead of keeping the communication inside the GCP network?
Are there limitations that could be addressed by some kind of configuration of the library?
How can we improve performance and avoid the bottleneck?

We checked the performance of the bucket with cp and perf-diag directly on the VM in GCE and the results were fine. This narrows the issue down to the library itself.

Just as a reference, these are the copy times from VMs in GCE and AWS with SDK 1.20:

Source       Multi-thread App   gsutil -m cp
GCE VM       30+ min            8.5 min
AWS EC2 VM   25 min             26 min

Storage: Bucket.list_blobs(max_results=n) does not behave as documented

The max_results parameter of list_blobs() is documented as controlling the maximum number of blobs returned in each page of results, but actually limits the total number of results as the name implies.

Compare the Bucket.list_blobs() documentation:
https://googleapis.dev/python/storage/latest/buckets.html#google.cloud.storage.bucket.Bucket.list_blobs

max_results (int) – The maximum number of blobs in each page of results from this request. Non-positive values are ignored. Defaults to a sensible value set by the API.

With the Iterator documentation:
https://googleapis.dev/python/google-api-core/latest/page_iterator.html#google.api_core.page_iterator.Iterator

max_results (int) – The maximum number of results to fetch.

Also the implementation of HTTPIterator which is used by list_blobs() internally does treat max_results as a hard limit for total num_results:
https://github.com/googleapis/google-cloud-python/blob/master/api_core/google/api_core/page_iterator.py#L378

Code example

iterator = some_big_bucket.list_blobs(max_results=100)
assert len(list(iterator)) > 100  # throws
assert sum(len(list(page)) for page in iterator.pages) > 100  # throws

Suggested resolution

Change the documentation to match what the parameter actually does. If supplying a paging size is required, a new argument to HTTPIterator could be added and exposed up through the list_blobs() interface.
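
For reference, a small sketch of the behavior as it stands today (the bucket name is a placeholder): max_results caps the total across all pages, so the page loop below never yields more than 100 blobs.

from google.cloud import storage

client = storage.Client()
iterator = client.bucket("some-big-bucket").list_blobs(max_results=100)

total = 0
for page in iterator.pages:
    total += len(list(page))
print(total)  # <= 100 in total, regardless of how the pages are split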

Storage: TimeoutGuard raises TimeoutException even though upload successful

Environment Details:

  • Using google-cloud-storage==1.23.0 and 1.24.1
  • Using Mac OSX 10.14 and Windows 7 64 bit
  • Using Python 3.7.3

Issue: A file can upload completely to Google Cloud Storage, yet still raise a TimeoutException if the upload process took longer than ~60 seconds (not 100% sure on the timedelta, but I'm guessing that it's 60 seconds from a brief analysis of the code).

Details: The use of AuthorizedSession.request for blob uploads in the Google Cloud Storage Python library causes an unwarranted TimeoutException. The TimeoutGuard class raises an unnecessary TimeoutException on file uploads to Cloud Storage even when the Cloud Storage server is responding in a timely manner to file uploads. In fact, a file can completely upload and the TimeoutGuard will still raise a TimeoutException even though a true request timeout never occurred. The reason why is explained below.

Steps to reproduce:
I first encountered this when uploading a large file (1 GB) on a medium upload connection (10 Mbps upload). Although the upload was technically successful, I was still receiving a TimeoutException at the end of the upload from a call to blob.upload_from_filename(filepath) (a resumable upload, not multipart upload).
The stacktrace is below:

  File "site-packages\google\cloud\storage\blob.py", line 1320, in upload_from_filename
  File "site-packages\google\cloud\storage\blob.py", line 1265, in upload_from_file
  File "site-packages\google\cloud\storage\blob.py", line 1175, in _do_upload
  File "site-packages\google\cloud\storage\blob.py", line 1122, in _do_resumable_upload
  File "site-packages\google\resumable_media\requests\upload.py", line 425, in transmit_next_chunk
  File "site-packages\google\resumable_media\requests\_helpers.py", line 136, in http_request
  File "site-packages\google\resumable_media\_helpers.py", line 150, in wait_and_retry
  File "site-packages\google\auth\transport\requests.py", line 287, in request
  File "site-packages\google\auth\transport\requests.py", line 110, in __exit__
requests.exceptions.Timeout" 

The core of the issue is the TimeoutGuard class when used in a context like AuthorizedSession.request. Specifically, look at the following code in the aforementioned method:

with TimeoutGuard(timeout) as guard:
    response = super(AuthorizedSession, self).request(
        method,
        url,
        data=data,
        headers=request_headers,
        timeout=timeout,
        **kwargs
    )
timeout = guard.remaining_timeout

There are two timeouts going on. One of them is a true request timeout used by the requests library (note AuthorizedSession is a subclass of requests.Session), and this is functioning correctly. The other timeout is a naive timeout set by TimeoutGuard that is causing problems. Essentially, it starts a clock that will raise a TimeoutException if a certain amount of time passes, even if the Google Cloud Storage servers are responding in a timely manner. In this case, the requests library will not raise a TimeoutException (because a true network timeout never occurred), but the TimeoutGuard will.

This causes issues with large file uploads or slow internet connections. If a user tries to upload a file that takes a long time, then even if the upload is successful and the requests library never raised a TimeoutException (i.e. the server was responding in a timely fashion for the entire upload), TimeoutGuard.__exit__ will raise an unsolicited TimeoutException.

Here's a walkthrough of the error:
(1) File upload initiated
(2) The file uploads for a couple of minutes, exceeding the default timeout of 60/61 seconds that the TimeoutGuard uses (resumable_media/requests/_helper.py _DEFAULT_CONNECT_TIMEOUT and _DEFAULT_READ_TIMEOUT; it looks like the TimeoutGuard takes the minimum of the two). The server is responding normally to all chunk uploads. A TimeoutException is never thrown from the Python requests library because the server is consistently responding.
(3) File finishes upload, TimeoutGuard raises TimeoutException even though file upload was successful.

I've been able to work around this problem by monkeypatching the TimeoutGuard code, but I believe a proper fix is needed in the codebase. I would be happy to contribute or open a pull request if a maintainer can elaborate on the need for the TimeoutGuard TimeoutException when there is already a TimeoutException being used by the requests.Session class.

'Bucket.list_blobs' surface issues

While investigating googleapis/google-cloud-python#4154, I noticed the following problems with Bucket.list_blobs:

  • It exposes paging semantics, rather than a "normal" iterator.
  • It exposes fields, which is probably not optimal for a method which is supposed to return populated Blob instances.
  • The semantics of the versions flag are questionable, given googleapis/google-cloud-python#2463.

Given that we are in GA, my inclination would be to add another method which addresses these issues, and docs-deprecate the existing one. @lukesneeringer how would you like to proceed?

Storage: Capture relevant headers to blob properties during download

Residual from googleapis/google-cloud-python#9003.

@william-silversmith notes that even with raw_download enabled, he is unable to detect the content_type of a downloaded blob without performing an additional reload request, which is prohibitive for his use case at scale. E.g.:

blob = bucket.blob( key )
binary = blob.download_as_string(raw_download=True)
if blob.content_encoding == 'gzip':
    return gunzip(binary)
elif blob.content_encoding == 'br':
    return brotli.decompress(binary)
else:
    return binary

Potentially even...

if blob.content_type == 'application/json':
    return json.loads(binary.decode('utf8'))

'Blob.exists()' does not work within batch context

The Blob.exists() method does not work when run within a Batch context. The normal behavior of exists() is to return True unless a NotFound exception occurs. Within the Batch context the exception seems to be suppressed and the function returns True. After leaving the Batch context, an Exception is then thrown.

This is how I expected to be able to use the exists() function:

blobs = [storage.blob.Blob(path, bucket) for path in paths]
with client.batch():
  bools = [blob.exists() for blob in blobs]

Without the Batch contextmanager this code works, if inefficiently. With the Batch contextmanager the code returns all Trues and throws an exception when leaving the context.

This behavior seems unintuitive to me. Please let me know if the API is meant to be used differently. If it is meant to be used as in the provided code sample, I'd be happy to attempt a fix if one of the maintainers could point me in the right direction.

Environment configuration just in case:

  • macOS 10.13.3
  • Python 3.6.5
  • google-cloud-storage==1.8.0

Add support for JSON API headers and query string parameters

Description

The XML API and the JSON API support a large set of parameters, as described at https://cloud.google.com/storage/docs/xml-api/reference-headers.

Among those parameters you can find very useful features like x-goog-metageneration, which lets you control how to deal with versions and also block rewriting of blobs in a bucket.

These APIs used to be available in the App Engine library for Google Cloud Storage, as you can see at https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/functions#open

Proposition

Adds support for extended options on different operations like upload_from_string and delete, something like:

def upload_from_string(self, data, content_type='text/plain', client=None,
                           predefined_acl=None, options=None):

This would let you pick sets of options from https://cloud.google.com/storage/docs/xml-api/reference-headers

Add support for the Storage Transfer service

Request to support the Storage Transfer service, mainly 2 components: transferJobs and transferOperations (instantiation of jobs)

Currently only available through api:

If in the works, rough ETA would be helpful as we plan on writing corresponding Airflow operators and need to decide based on what libraries.

`Blob.rewrite()` does not work with batches.

google-cloud-storage v1.8.0

Not sure whether rewrites are supposed to work when batched, but it would be nice and useful if they did, otherwise there's no efficient way to copy lots of blobs across buckets in different locations or with different encryption keys.

Example:

with gcs_client.batch():
    dest_blob.rewrite(src_blob)

Traceback:

Traceback (most recent call last):
  File "batch_test.py", line 10, in <module>
    dest_blob.rewrite(src_blob)
  File ".../pyvirtenv/python-common/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 1359, in rewrite
    rewritten = int(api_response['totalBytesRewritten'])
  File ".../pyvirtenv/python-common/lib/python2.7/site-packages/google/cloud/storage/batch.py", line 105, in __getitem__
    raise KeyError('Cannot get item %r from a future' % (key,))
KeyError: "Cannot get item 'totalBytesRewritten' from a future"

Create documentation for parallel uploads, suggest multiprocessing.

While working through #69, we found that threading is pretty slow. There is likely work the client can do to become less blocking, but suggesting multiprocessing should help users to stay on the happy path. This is what gsutil -m is doing and works very well for multiple uploads.
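
A minimal sketch of parallel uploads with multiprocessing, assuming hypothetical bucket and file names; each worker constructs its own client, since client objects are not safely shareable across processes.

import multiprocessing

from google.cloud import storage

def upload_one(filename):
    client = storage.Client()          # one client per worker process
    bucket = client.bucket("my-bucket")
    bucket.blob(filename).upload_from_filename(filename)
    return filename

if __name__ == "__main__":
    files = ["a.bin", "b.bin", "c.bin"]
    with multiprocessing.Pool(processes=4) as pool:
        for done in pool.imap_unordered(upload_one, files):
            print("uploaded", done)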

Storage: Timeout when upload file using google.cloud.storage.Blob.upload_from_filename()

Environment details

OS: MacOS 10.15.1
Python: Python 3.7.4
Google-cloud version:

google-api-core==1.16.0
google-api-python-client==1.7.11
google-auth==1.11.2
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.4.0
google-cloud-core==1.3.0
google-cloud-error-reporting==0.32.1
google-cloud-firestore==1.5.0
google-cloud-kms==1.0.0
google-cloud-logging==1.14.0
google-cloud-storage==1.26.0
google-cloud-translate==1.7.0
google-resumable-media==0.5.0
google-translate==0.1
googleapis-common-protos==1.6.0

Steps to reproduce

  1. Prepare a file with size >300MB
  2. Run blob.upload_from_filename("path/on/storage", "path/of/big/file/on/local")

Stack trace

Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
socket.timeout: The write operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', timeout('The write operation timed out'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-cdc889c11775>", line 4, in <module>
    "data/tmp/averaging.joblib", None)
  File "/Users/dualeoo/PycharmProjects/mlweb-ml/mlweb_ml/firestore/google_storage.py", line 30, in upload
    blob.upload_from_filename(file_path_on_local, content_type, predefined_acl=predefined_acl)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1342, in upload_from_filename
    predefined_acl=predefined_acl,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1287, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1197, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1144, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 425, in transmit_next_chunk
    retry_strategy=self._retry_strategy,
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/requests/_helpers.py", line 136, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/google/auth/transport/requests.py", line 317, in request
    **kwargs
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/dualeoo/miniconda3/envs/mlweb-ml/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Expected result

No timeout error

Actual result

The upload times out after about 1 minute.

Storage: upload_from_string() with ifGenerationMatch=0

The GCS HTTP protocol -- but not the Python API -- has the ability to set ifGenerationMatch when creating a storage object:

Makes the operation conditional on whether the object's current generation matches the given value. Setting to 0 makes the operation succeed only if there are no live versions of the object.

Why it's useful: With this feature, the client could create a directory placeholder entry (a 0-byte object with a name ending in '/') very efficiently like this:

blob = bucket.blob('path/to/my/subdirectory/')
blob.upload_from_string(b'', if_generation_match=0)

That one round trip creates the directory placeholder entry if it doesn't already exist. The alternatives are to first make a round trip to check if the entry exists or else to let the bucket accumulate identical placeholder entries (esp. for top level directories) by blindly creating them. [Or does GCS check if an uploaded object matches the current generation and optimize that case? -- Nope.]

Why that matters: Directory placeholders speed up gcsfuse by an order of magnitude. Without the placeholders, you have to use gcsfuse in --implicit-dirs mode, and such a mount is frustratingly slow for interactive work. E.g. it takes several seconds just to list a tiny directory containing 2 files. With the placeholders, you can run gcsfuse without --implicit-dirs, and that mount lists directories in a tenth of a second or two.

Proposal: I could create a Pull Request adding this feature if you like, with either the specific if_generation_match query parameter or a way to pass in additional query parameters.

Another alternative is to recommend that callers do something like subclass Blob and override _add_query_parameters() to add the if_generation_match=0 name-value pair. That's ugly and fragile.

Is there a way to do this that I'm missing? Are there better alternatives?
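
For what it's worth, here is a sketch of the proposed call shape, assuming a library version that accepts an if_generation_match keyword on upload_from_string (the bucket name is a placeholder); a PreconditionFailed response means the placeholder object already exists.

from google.api_core.exceptions import PreconditionFailed
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("path/to/my/subdirectory/")

try:
    blob.upload_from_string(b"", if_generation_match=0)  # create only if absent
except PreconditionFailed:
    pass  # a live version already exists; nothing to do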

Client fails for reauth, while gsutil works fine

Environment details

  • OS: OS X High Sierra
  • Python version and virtual environment information: Python 3.6.9
  • google-cloud-storage version: 1.24.1

Steps to reproduce

I'm trying to use the cloud-storage Python package with user credentials, but it fails with "invalid_grant: reauth related error (rapt_required)". Reauthenticating does not help. However, if I try it with gsutil in the same shell, everything works.

Our company policy changed a couple of months ago, so gcloud asks for reauthentication a lot more often.

Code example

client = storage.Client(project="prod-xxx")
bucket = client.get_bucket("model-data-prod-xxx")

Stack trace

File "/Users/xxx/Library/Caches/pypoetry/virtualenvs/model-pitkaveto-abtfw7oX-py3.6/lib/python3.6/site-packages/google/oauth2/_client.py", line 60, in _handle_error_response
    raise exceptions.RefreshError(error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_grant: reauth related error (rapt_required)', '{\n  "error": "invalid_grant",\n  "error_description": "reauth related error (rapt_required)",\n  "error_subtype": "rapt_required"\n}')
(model-xxx-abtfw7oX-py3.6) [xxx@mbp ~/projects/model_xxx/model_xxx (trainer *)]


Thanks!

Fix test for IAM get/set.

__________________ TestStorageBuckets.test_get_set_iam_policy __________________

self = <tests.system.TestStorageBuckets testMethod=test_get_set_iam_policy>

    def test_get_set_iam_policy(self):
        import pytest
        from google.cloud.storage.iam import STORAGE_OBJECT_VIEWER_ROLE
        from google.api_core.exceptions import BadRequest, PreconditionFailed

        bucket_name = "iam-policy" + unique_resource_id("-")
        bucket = retry_429_503(Config.CLIENT.create_bucket)(bucket_name)
        self.case_buckets_to_delete.append(bucket_name)
        self.assertTrue(bucket.exists())

        policy_no_version = bucket.get_iam_policy()
        self.assertEqual(policy_no_version.version, 1)

        policy = bucket.get_iam_policy(requested_policy_version=3)
        self.assertEqual(policy, policy_no_version)

        member = "serviceAccount:{}".format(Config.CLIENT.get_service_account_email())

        BINDING_W_CONDITION = {
            "role": STORAGE_OBJECT_VIEWER_ROLE,
            "members": {member},
            "condition": {
                "title": "always-true",
                "description": "test condition always-true",
                "expression": "true",
            },
        }
        policy.bindings.append(BINDING_W_CONDITION)

        with pytest.raises(
            PreconditionFailed, match="enable uniform bucket-level access"
        ):
            bucket.set_iam_policy(policy)

        bucket.iam_configuration.uniform_bucket_level_access_enabled = True
        bucket.patch()

        policy = bucket.get_iam_policy(requested_policy_version=3)
        policy.bindings.append(BINDING_W_CONDITION)

        with pytest.raises(BadRequest, match="at least 3"):
            bucket.set_iam_policy(policy)

        policy.version = 3
        returned_policy = bucket.set_iam_policy(policy)
        self.assertEqual(returned_policy.version, 3)
        self.assertEqual(returned_policy.bindings, policy.bindings)

        with pytest.raises(
            BadRequest, match="cannot be less than the existing policy version"
        ):
>           bucket.get_iam_policy()
E           Failed: DID NOT RAISE <class 'google.api_core.exceptions.BadRequest'>

tests/system.py:315: Failed

Allow tracking upload progress.

This is related to googleapis/google-cloud-python#1830 reopening here as this seems to have been closed many years ago.

We would really like this feature, as we need to monitor large files being uploaded to Google Storage buckets. I am surprised that not many people are after this essential feature, which makes me feel we haven't done our research properly or that the solution is very obvious or trivial.

Can someone please share an example of how we could track progress during upload?

Update: Should we be looking at google-resumable-media? We will try that out and report back.
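
In the meantime, one workaround sketch is to wrap the source file so that its read() calls report progress, and let upload_from_file consume the wrapper. The callback and file names here are illustrative, and retried chunks may be counted twice.

import io
import os

class ProgressReader(io.BufferedReader):
    """File wrapper that reports bytes handed to the uploader via a callback."""

    def __init__(self, path, callback):
        super().__init__(io.FileIO(path, "r"))
        self._callback = callback
        self._total = os.path.getsize(path)
        self._sent = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self._sent += len(chunk)
        self._callback(self._sent, self._total)
        return chunk

# blob.upload_from_file(ProgressReader("big-file.bin", lambda sent, total: print(sent, total)))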

Thanks

Proposal: move API-methods to client

The Cloud Storage Python client was one of the first google-cloud-python clients, intended to be more reliable and more Pythonic than the google-api-client that it replaced. It is also a 100% hand-written client. Because it predates the auto-generated google-cloud-python clients, the Cloud Storage Python client has inconsistencies with other google-cloud-python clients.

This design proposal is to bring the Cloud Storage client further into alignment with other google-cloud-python clients and make the eventual inclusion of an auto-generated gRPC transport layer less disruptive to users of the Cloud Storage client.

Design document:

https://docs.google.com/document/d/1A4FIxThZK_enK7OV9ChY54IPrlpbdHwuhvVEu6iTHxM/edit?usp=sharing

Please share your feedback on the specifics of this as comments in the design document.

CC @crwilcox @frankyn @lbristol88

Update: according to a comment on the design document, we need to update samples for the moved methods before deprecating the old ones.

Handle 410 errors on resumable-media operations

A previous issue googleapis/google-cloud-python#7530 was closed in favor of a targeted issue for a feature request.

Resumable media operations can fail in such a way that they cannot be retried, at least at the chunk level. A 410 error indicates that the only choice is to restart the operation altogether.

Pseudo code:

try:
   resumable_operation_upload(some_file)
except 410_error:
    # Retry the operation, from the very beginning of the file.

We should find the instances of resumable uploads and protect them from the higher level failure. It is also possible this could be pushed down into resumable media, but these higher level failures are a different category than existing retry-able errors.
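
A sketch of that restart-from-scratch behavior at the caller level, assuming the 410 surfaces as a google.api_core GoogleAPICallError with a .code attribute (how the error actually surfaces from the upload path may differ):

from google.api_core import exceptions

def upload_with_restart(blob, filename, max_attempts=3):
    # Restart the whole upload from byte 0 when the resumable session is gone.
    for attempt in range(max_attempts):
        try:
            blob.upload_from_filename(filename)
            return
        except exceptions.GoogleAPICallError as exc:
            if getattr(exc, "code", None) != 410 or attempt == max_attempts - 1:
                raise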

Related: https://issuetracker.google.com/137168102 and internal bug 115694647.

'test_access_to_public_bucket' flakes with 503

From this test run:

_______________ TestAnonymousClient.test_access_to_public_bucket _______________

self = <test_system.TestAnonymousClient testMethod=test_access_to_public_bucket>

    @vpcsc_config.skip_if_inside_vpcsc
    def test_access_to_public_bucket(self):
        anonymous = storage.Client.create_anonymous_client()
        bucket = anonymous.bucket(self.PUBLIC_BUCKET)
>       blob, = retry_429_503(bucket.list_blobs)(max_results=1)

tests/system/test_system.py:1498:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:212: in _items_iter
    for page in self._page_iter(increment=False):
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:243: in _page_iter
    page = self._next_page()
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:369: in _next_page
    response = self._get_next_page_response()
.nox/system-2-7/lib/python2.7/site-packages/google/api_core/page_iterator.py:419: in _get_next_page_response
    method=self._HTTP_METHOD, path=self.path, query_params=params
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.storage._http.Connection object at 0x7f5575846150>
method = 'GET', path = '/b/gcp-public-data-landsat/o'
query_params = {'maxResults': 1, 'projection': 'noAcl'}, data = None
content_type = None, headers = None, api_base_url = None, api_version = None
expect_json = True, _target_object = None, timeout = 60

    def api_request(
        self,
        method,
        path,
        query_params=None,
        data=None,
        content_type=None,
        headers=None,
        api_base_url=None,
        api_version=None,
        expect_json=True,
        _target_object=None,
        timeout=_DEFAULT_TIMEOUT,
    ):
    ... # docstring elided    
    url = self.build_api_url(
            path=path,
            query_params=query_params,
            api_base_url=api_base_url,
            api_version=api_version,
        )

        # Making the executive decision that any dictionary
        # data will be sent properly as JSON.
        if data and isinstance(data, dict):
            data = json.dumps(data)
            content_type = "application/json"

        response = self._make_request(
            method=method,
            url=url,
            data=data,
            content_type=content_type,
            headers=headers,
            target_object=_target_object,
            timeout=timeout,
        )

        if not 200 <= response.status_code < 300:
>           raise exceptions.from_http_response(response)
E           ServiceUnavailable: 503 GET https://storage.googleapis.com/storage/v1/b/gcp-public-data-landsat/o?projection=noAcl&maxResults=1: <html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><title>Sorry...</title><style> body { font-family: verdana, arial, sans-serif; background-color: #fff; color: #000; }</style></head><body><div><table><tr><td><b><font face=sans-serif size=10><font color=#4285f4>G</font><font color=#ea4335>o</font><font color=#fbbc05>o</font><font color=#4285f4>g</font><font color=#34a853>l</font><font color=#ea4335>e</font></font></b></td><td style="text-align: left; vertical-align: bottom; padding-bottom: 15px; width: 50%"><div style="border-bottom: 1px solid #dfdfdf;">Sorry...</div></td></tr></table></div><div style="margin-left: 4em;"><h1>We're sorry...</h1><p>... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.</p></div><div style="margin-left: 4em;">See <a href="https://support.google.com/websearch/answer/86640">Google Help</a> for more information.<br/><br/></div><div style="text-align: center; border-top: 1px solid #dfdfdf;"><a href="https://www.google.com">Google Home</a></div></body></html>

.nox/system-2-7/lib/python2.7/site-packages/google/cloud/_http.py:423: ServiceUnavailable

Add an API method to give us a streaming file object

It doesn't look like there's a way to get a streaming download from Google Storage in the Python API. We have download_to_file, download_as_string, and download_to_filename, but I don't see anything that returns a file-like object that can be streamed. This is a disadvantage for many file types which can usefully be processed as they download.

Can a method like this be added?
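
As a partial workaround with the current surface, a chunked download can stream into any writable object: set chunk_size on the blob and pass a custom sink to download_to_file. The handle_chunk function, bucket, and object names below are illustrative.

import io

from google.cloud import storage

def handle_chunk(data):
    # Illustrative per-chunk processing; replace with real streaming logic.
    print(len(data), "bytes received")

class ChunkSink(io.RawIOBase):
    def writable(self):
        return True

    def write(self, b):
        handle_chunk(bytes(b))
        return len(b)

client = storage.Client()
blob = client.bucket("my-bucket").blob("big-object", chunk_size=10 * 1024 * 1024)
blob.download_to_file(ChunkSink())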

Storage: Possible metadata regression on blobs in 1.24.0

Environment details

  • Debian 10
  • google-cloud-storage version: 1.24.1

Steps to reproduce

It's documented that metadata that isn't set will return a NoneType. Naturally, when wanting to unset any metadata, you expect to be able to pass in None. This worked up until 1.24.0, which broke it with googleapis/google-cloud-python#9796.

I think you can still technically set the metadata to an empty dictionary and get the same functionality, but it's a bit counter intuitive, when no metadata is described as being a NoneType.

Code example

storage_client = storage.Client(...)
blob = storage_client.get_bucket('abc').get_blob('abc')
blob.metadata = None
blob.patch()

Stack trace

AttributeError: 'NoneType' object has no attribute 'items'
  File "django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "rest_framework/views.py", line 505, in dispatch
    response = self.handle_exception(exc)
  File "rest_framework/views.py", line 465, in handle_exception
    self.raise_uncaught_exception(exc)
  File "rest_framework/views.py", line 476, in raise_uncaught_exception
    raise exc
  File "rest_framework/views.py", line 502, in dispatch
    response = handler(request, *args, **kwargs)
  File "api/media/views.py", line 96, in put
    product.media.set_primary(name)
  File "media/api.py", line 134, in set_primary
    blob.metadata = None
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1917, in metadata
    value = {k: str(v) for k, v in value.items()}
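
As the report notes, assigning an empty dict (rather than None) still goes through the setter without the AttributeError; a minimal sketch of that workaround (the bucket and blob names are placeholders):

from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.get_bucket("abc").get_blob("abc")
blob.metadata = {}   # empty dict instead of None avoids the AttributeError
blob.patch()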

Memory leak from client objects

I have a leak that seems to be related to client construction/destruction. Given the following code:

from google.cloud import storage

def upload_content(client, content):
    bucket = client.bucket('bucket-name')
    blob = bucket.blob('test-hello')
    blob.upload_from_string(content)

if __name__ == '__main__':
    content = b""
    for i in range(100):
        client = storage.Client()
        upload_content(client, content)

Here is a graph of memory usage over the 100 client creations and small uploads (graph omitted; memory grows over the iterations).

If instead a single client object is reused, you will see no growth (graph omitted; memory stays flat).

I don't believe this is storage specific. It seems something about the client object isn't being cleaned up.
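
For comparison, the non-leaking variant referenced above constructs the client once and reuses it:

from google.cloud import storage

def upload_content(client, content):
    bucket = client.bucket('bucket-name')
    bucket.blob('test-hello').upload_from_string(content)

if __name__ == '__main__':
    client = storage.Client()   # constructed once, outside the loop
    for i in range(100):
        upload_content(client, b"")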

cc: b/147997894

Clarify documentation on retrieving md5 and/or crc32c hash for blobs without downloading

Update: Just briefly after posting the issue, I found out how to do it. I'm thus recommending to update the documentation to clarify how to do this.

If I understand it correctly, retrieving a blob's md5 or crc32c hash without downloading it requires calling reload():

>>> blob = bucket.blob("gs://some/url")
>>> blob.crc32c
None
>>> blob.reload()
>>> blob.crc32c
'quMJjg=='

It took me almost an hour to find that out, including browsing the documentation, browsing SO, cloning the source and having a look around there. Eventually I found this SO post which implied that blob.crc32c actually works, and then using tab-completion trial-and-error in ipython I found the reload() method.

I think it would be great if the documentation clarified this :-).

Use Case

Checking whether a remote file needs to be downloaded when a local file of the same filename already exists.
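
For that use case, here is a sketch of comparing hashes without downloading; GCS stores md5Hash as the base64 encoding of the raw digest, and the file and object names below are placeholders.

import base64
import hashlib

from google.cloud import storage

def local_md5_b64(path):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")

client = storage.Client()
blob = client.bucket("my-bucket").blob("remote/file.bin")
blob.reload()  # populates md5_hash / crc32c without downloading the payload
needs_download = blob.md5_hash != local_md5_b64("file.bin")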

SSL certificate error when using gsutil on machine in a private subnet and wildcard CNAME

OS: Amazon Linux 2018.03
Python: 2.7.16
gsutil: 4.46

I'm receiving an ssl.CertificateError any time I try to run the gsutil command on our database server, which resides inside of a private subnet. This error only cropped up after I created a wildcard CNAME pointing everything to web.mydomain.com in Cloudflare and enabled Cloudflare's wildcard SSL. The hostname of the machine receiving the error is db.mydomain.com. When I use gsutil from our web.mydomain.com, which is in a public subnet, everything works as expected. Here's the error I'm receiving from gsutil:

Traceback (most recent call last):
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 123, in <module>
    exceptions.HandleError(e, 'gsutil')
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/calliope/exceptions.py", line 527, in HandleError
    core_exceptions.reraise(exc)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 146, in reraise
    six.reraise(type(exc_value), exc_value, tb)
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 121, in <module>
    main()
  File "/opt/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 35, in main
    project, account = bootstrapping.GetActiveProjectAndAccount()
  File "/opt/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 292, in GetActiveProjectAndAccount
    project_name = properties.VALUES.core.project.Get(validate=False)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2039, in Get
    required)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2338, in _GetProperty
    value = _GetPropertyWithoutDefault(prop, properties_file)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 2376, in _GetPropertyWithoutDefault
    value = callback()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 260, in GetProject
    return c_gce.Metadata().Project()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 281, in Metadata
    _metadata = _GCEMetadata()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 94, in __init__
    self.connected = gce_cache.GetOnGCE()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 155, in GetOnGCE
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 88, in GetOnGCE
    return self.CheckServerRefreshAllCaches()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 91, in CheckServerRefreshAllCaches
    on_gce = self._CheckServer()
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 140, in _CheckServer
    gce_read.GOOGLE_GCE_METADATA_NUMERIC_PROJECT_URI)
  File "/opt/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_read.py", line 66, in ReadNoProxy
    request, timeout=timeout_property).read()
  File "/usr/lib64/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib64/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.7/urllib2.py", line 467, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 654, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1243, in https_open
    context=self._context)
  File "/usr/lib64/python2.7/urllib2.py", line 1197, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1058, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 1098, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python2.7/httplib.py", line 1054, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 892, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 854, in send
    self.connect()
  File "/usr/lib64/python2.7/httplib.py", line 1279, in connect
    server_hostname=server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib64/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 836, in do_handshake
    match_hostname(self.getpeercert(), self.server_hostname)
  File "/usr/lib64/python2.7/ssl.py", line 292, in match_hostname
    % (hostname, dnsnames[0]))
ssl.CertificateError: hostname 'metadata.google.internal' doesn't match '*.mydomain.com'

I'm assuming this isn't a bug, but instead a misconfiguration or DNS/SSL error on my end caused by the wildcard CNAME and the fact that db.mydomain.com doesn't actually point to this machine (which isn't accessible from the internet)? Any help would be appreciated.

Storage: copy_blob doesn't respect preserve_acl


Environment details

  • macOS 10.15
  • Python 3.6.5
  • google-cloud-storage==1.20.0

Steps to reproduce

  1. Upload a blob with a predefined ACL of publicRead
  2. Copy this blob with preserve_acl set to true
  3. Expected result is that the new blob is set to publicRead, but this ACL isn't actually preserved
  4. Calling make_public() on the new blob correctly sets it to publicRead

Code example

file_obj = BytesIO(...)
blob.upload_from_file(file_obj, content_type='...', predefined_acl='publicRead')

bucket = storage_client.get_bucket(...)
bucket.copy_blob(blob, destination_bucket=bucket, new_name='new-name.jpg', preserve_acl=True)
# At this point, the blob will not be set to public

Storage: invalid_grant: Invalid JWT Signature

I am facing the same issue in Python 3.

from dags.util.GCloudStorage import GCloudStorage

client = GCloudStorage(
    "/home/gaurav/airflow/dags/script/gcp_credentials/my_gcs_credentials.json", "project_name")

client.create_bucket("test_bucket")

===============================================================
Here's what GCloudStorage.py looks like:
from google.cloud import storage
from google.oauth2 import service_account
from google.api_core import exceptions

class GCloudStorage:

    def __init__(self, credential_file_path, project_id):
        """
        :param credential_file_path: credential file for authentication.
        :param project_id: project ID
        """
        self.CREDENTIAL_FILE_PATH = credential_file_path
        self.PROJECT_ID = project_id
        self.DATASET_ID = None

    def create_connection(self):
        """
        Creates a connection with Google Cloud Storage and returns a client.
        :return client: Google Cloud Storage client
        """
        google_credentials = service_account.Credentials.from_service_account_file(self.CREDENTIAL_FILE_PATH)
        # Construct a Cloud Storage client object.
        client = storage.Client(project=self.PROJECT_ID, credentials=google_credentials)
        return client

    def create_bucket(self, bucket_name):
        """
        Creates a new empty bucket.
        :param bucket_name: name of the bucket.
        :return responseMsg: success or other message.
        """
        # Instantiates a client
        storage_client = self.create_connection()
        response = None
        try:
            bucket = storage_client.create_bucket(bucket_name)
            print(bucket)
            response = "Bucket {} created".format(bucket.name)
        except exceptions.Conflict as error:
            response = "Bucket already exists: {}".format(error.code)
        return {"response": response}

    def upload_blob(self, bucket_name, source_file_name, destination_blob_name):
        """
        Uploads a file to the bucket.
        :param bucket_name: bucket name
        :param source_file_name: source file name
        :param destination_blob_name: destination blob name.
        :return responseMsg: success or failed response message.
        """
        storage_client = self.create_connection()
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(destination_blob_name)
        print("BlobName ---> ", blob)
        blob.upload_from_filename(source_file_name)
        response = "File {} uploaded to {}".format(source_file_name, destination_blob_name)
        print(response)
        return {"response": response}

    def delete_bucket(self, bucket_name):
        """
        Deletes a bucket. The bucket must be empty.
        :param bucket_name: bucket name.
        :return responseMsg: success or failed response message.
        """
        # Instantiates a client
        storage_client = self.create_connection()
        bucket = storage_client.get_bucket(bucket_name)
        bucket.delete()
        response = "Bucket {} deleted".format(bucket.name)
        return {"response": response}

=============================================================
When I run the create_bucket code above, it gives me the error below:

Traceback (most recent call last):
  File "/home/gaurav/airflow/dags/util/test cloud.py", line 7, in <module>
    client.create_bucket("test_bucket")
  File "/home/gaurav/airflow/dags/util/GCloudStorage.py", line 37, in create_bucket
    bucket = storage_client.create_bucket(bucket_name)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/storage/client.py", line 436, in create_bucket
    _target_object=bucket,
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 417, in api_request
    timeout=timeout,
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 275, in _make_request
    method, url, headers, data, target_object, timeout=timeout
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/cloud/_http.py", line 313, in _do_request
    url=url, method=method, headers=headers, data=data, timeout=timeout
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/auth/transport/requests.py", line 277, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/service_account.py", line 334, in refresh
    access_token, expiry, _ = _client.jwt_grant(request, self._token_uri, assertion)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 153, in jwt_grant
    response_data = _token_endpoint_request(request, token_uri, body)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 124, in _token_endpoint_request
    _handle_error_response(response_body)
  File "/home/gaurav/airflow/venv/lib/python3.6/site-packages/google/oauth2/_client.py", line 60, in _handle_error_response
    raise exceptions.RefreshError(error_details, response_body)
google.auth.exceptions.RefreshError: ('invalid_grant: Invalid JWT Signature.', '{\n "error": "invalid_grant",\n "error_description": "Invalid JWT Signature."\n}')

Process finished with exit code 1

I am new to GCP, so please point out any silly mistakes.
Thank you in advance.
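
To help isolate the credential problem from the storage code, a minimal sketch that refreshes the service account credentials directly may be useful (the path is taken from the snippet above; the scope is an assumption):

from google.auth.transport.requests import Request
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/home/gaurav/airflow/dags/script/gcp_credentials/my_gcs_credentials.json",
    scopes=["https://www.googleapis.com/auth/devstorage.full_control"],
)
credentials.refresh(Request())  # raises RefreshError here too if the key is invalid or revoked
print("Token obtained, expires at:", credentials.expiry)

If this fails with the same invalid_grant error, the problem lies with the key file or the system clock rather than with the storage calls.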

Storage: Default timeout for requests breaks chunked downloads

The default timeout introduced in googleapis/google-resumable-media-python#88 is causing crashes in our application. We are using chunked downloads by setting chunk_size on the blob, and then calling download_to_file. Our application is multi-threaded, and we are actually downloading files into a custom stream that is backed by a ring-buffer (so writes may block until space is available again). In some cases, and I haven't figured out the pattern yet, our application hits the default timeout when fetching a new chunk of data inside AuthorizedSession.request. Currently, google.cloud.storage offers no way to use a custom transport (as suggested here), so this workaround is not applicable, and there is no way to override the timeout.
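
For reference, a minimal sketch of the download pattern described above, with illustrative names standing in for the real application code:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")        # placeholder bucket name
blob = bucket.blob("large-object.bin")     # placeholder object name
blob.chunk_size = 10 * 1024 * 1024         # fetch in 10 MiB chunks (must be a multiple of 256 KiB)

with open("large-object.bin", "wb") as stream:
    # In the real application this is a custom ring-buffer-backed stream whose
    # write() may block; each chunk request goes through AuthorizedSession.request
    # and is therefore subject to the default transport timeout.
    blob.download_to_file(stream)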

I can't provide a simple reproduction here yet, I'm still investigating, but since it's an issue that only occurs sporadically, I'm not even sure that's possible. I'm wondering if maybe in some scenarios Python's multi-threading just hits unlucky timing, and the thread running the request doesn't get scheduled for longer than usual, so the observed timeout ends up much higher than the configured one. I'm not sure how to test this, though.

I'm posting here, because the upstream change was made to fix googleapis/google-cloud-python#5909, I'm not sure what the proper fix would be.

Storage: wait option for delete_blobs method

Environment details

latest

Steps to reproduce

blobs = [1000000 blobs]
slow_create_blobs(blobs)
delete_blobs(blobs)
slow_create_blobs(blobs)
assert count_blobs(blobs) == big_number

The result is unpredictable: it depends on how fast delete_blobs runs relative to slow_create_blobs.
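
A sketch of how a "wait" behaviour could be approximated today with the existing public API (the function name and polling strategy are illustrative, not part of the library):

import time

def delete_blobs_and_wait(bucket, blobs, poll_interval=2.0, timeout=300.0):
    """Delete Blob objects, then poll until none of them are reported as existing."""
    bucket.delete_blobs(blobs, on_error=lambda blob: None)  # ignore already-deleted blobs
    deadline = time.monotonic() + timeout
    remaining = list(blobs)
    while remaining and time.monotonic() < deadline:
        remaining = [blob for blob in remaining if blob.exists()]
        if remaining:
            time.sleep(poll_interval)
    return remaining  # any blobs still visible after the timeout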

Storage: Bucket not including Access-Control-Allow-Origin header in preflight OPTIONS response

Fedora 31, Google Chrome 79.

On python 3.7 flask server:

from google.cloud import storage
store = storage.Client.from_service_account_json('service_account.json')
bucket = store.create_bucket('test')
cors = bucket.cors
cors.append({'origin': ['*']})
bucket.cors = cors
bucket.update()

Command line cors check:

gsutil cors get gs://test  # [{"origin": ["*"]}]

On client in JS:

// uploadUri is a signed uri from the 'test' bucket for uploading (PUT requests, v4)
// file is a local filesystem file
fetch(uploadUri, {
  method: 'PUT',
  mode: 'cors',
  cache: 'no-cache',
  headers: {
    'Content-Type': 'application/octet-stream',  // same error with file.type
  },
  body: file,
}).then(() => console.log('success'));

When this is sent, it runs a preflight OPTIONS request, which does not return the Access-Control-Allow-Origin header in the response, so the PUT fails.

Response headers include: alt-svc, cache-control, content-length, content-type, date, expires, server, status, vary, x-guploader-uploadid.

It looks like the signed URL uses the XML API by default, since the url is https://storage.googleapis.com/[BUCKET-NAME]/[PATH-NAME]?<signed_url_params> (https://cloud.google.com/storage/docs/request-endpoints), which is why I set the CORS above according to the documentation.

This happens locally, and while hosted on app engine. It also happens with both the fetch API and Axios npm package.

I've also tried adding maxAgeSeconds = 3600, method = ['*'], and 'Access-Control-Allow-Origin' to the 'responseHeader' array. Problem persists on retry, even several hours later.

Upload from command line using curl works: curl -v -I -X PUT -T file.csv -H 'Content-Type: application/octet-stream' <signed_url>, so this appears to be a browser/cors/headers issue.

I believe I have gone through and checked everything here: https://cloud.google.com/storage/docs/configuring-cors.
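
For completeness, a sketch of a more explicit CORS configuration (the values are illustrative, and this is not a confirmed fix for the issue above):

from google.cloud import storage

client = storage.Client.from_service_account_json('service_account.json')
bucket = client.get_bucket('test')
bucket.cors = [
    {
        'origin': ['*'],
        'method': ['PUT', 'GET', 'OPTIONS'],
        'responseHeader': ['Content-Type'],
        'maxAgeSeconds': 3600,
    }
]
bucket.patch()  # persist the CORS settings
print(bucket.cors)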

Support '<' comparison for some blob objects

It would be great to take a list of blob objects, from the same bucket, and sort them lexicographically by name. While it appears that list_blobs always returns blobs sorted by name, the API endpoint documentation does not provide any such guarantee. Note, however, that there is such a mention here: https://cloud.google.com/storage/docs/listing-objects.

Unfortunately, without said guarantee on sort order, we sometimes provide an explicit sort when iterating through blobs. That code ends up looking like this:

for blob in sorted(bucket.list_blobs(), key=lambda blob: blob.name):
    ...

While that's not really terrible, it'd be great to more concisely write:

for blob in sorted(bucket.list_blobs()):
    ...

Even if the documentation were to guarantee an order for list_blobs, it may still be valuable to be able to sort collections of blobs.

Providing such support requires only defining __lt__ on the blob object. In fact, I've made said changes and will link the pull request momentarily.
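
A minimal sketch of what such a comparison could look like (the actual pull request may differ; this patches Blob directly for illustration):

from google.cloud.storage.blob import Blob

def _blob_lt(self, other):
    if not isinstance(other, Blob):
        return NotImplemented
    # Compare by (bucket name, object name) so blobs from the same bucket
    # sort lexicographically by name.
    return (self.bucket.name, self.name) < (other.bucket.name, other.name)

Blob.__lt__ = _blob_lt

# With this in place, sorted(bucket.list_blobs()) works directly.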

Storage: bump google-auth dependency to 1.11.0+

The google-auth release version 1.11.0 fixes the issue with prematurely raised Timeout errors in cases where the underlying request takes a long time but still succeeds and does not itself time out.

In order to benefit from it, the version pin needs to be updated.

Storage: add timeout parameter to all public methods

As a library user, I would like to have a way to specify a (transport) timeout when calling methods that make HTTP requests under the hood. The timeout should have a reasonable default to prevent requests from hanging indefinitely in case I forget to pass in a timeout argument myself.

Motivation: User reports of requests hanging indefinitely, e.g. googleapis/google-cloud-python#10182.
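
Assuming the parameter is added as requested, usage might look something like the following sketch (the method names exist today; the timeout arguments and values are illustrative):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket", timeout=60)        # per-call timeout in seconds
blob = bucket.get_blob("data.csv", timeout=60)
blob.download_to_filename("data.csv", timeout=(3.05, 60))  # (connect, read) tuple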

Upload of large files times out.

OS 10.14.4
Python 3.8.0
google-cloud-storage: 1.20.0

I am getting a connection error due to the following code:

blob = bucket.blob(file)
blob.upload_from_filename(file)

This is certainly a fault of google.cloud.storage for the following reasons:

  1. I can upload a 60 meg file to google drive with no problem. Basically takes about 6 minutes.
  2. When I try to upload a 20 meg file I get the aforementioned error.
  3. I've uploaded about 70,000 files so far with this code, most between 10 and 60 megs with no problem.

I think I've had this problem before, and it happened when my upload speed was between 1 and 2 Mbps, which is what my upload speed is now. When my upload speed is above 10 Mbps I do not have this problem.

Still, I should be able to upload an 18 MB file in less than 18 seconds, so I don't see why I'm getting a connection error.
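
One possible mitigation on a slow uplink (a sketch, not a confirmed fix) is to use a smaller resumable-upload chunk size, so that each individual request transmits less data before the socket's write timeout:

blob = bucket.blob(file)
# Resumable uploads send one chunk per request; a smaller chunk means less data
# per request. chunk_size must be a multiple of 256 KiB.
blob.chunk_size = 1024 * 1024  # 1 MiB
blob.upload_from_filename(file)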

Here is the full traceback:

    blob.upload_from_filename(file)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1246, in upload_from_filename
    self.upload_from_file(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1195, in upload_from_file
    created_json = self._do_upload(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1105, in _do_upload
    response = self._do_resumable_upload(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 1053, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/requests/upload.py", line 419, in transmit_next_chunk
    response = _helpers.http_request(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/requests/_helpers.py", line 116, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/google/auth/transport/requests.py", line 207, in request
    response = super(AuthorizedSession, self).request(
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/Users/kylefoley/codes/venv/lib/python3.8/site-packages/requests/adapters.py", line 495, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Synthesis failed for python-storage

Hello! Autosynth couldn't regenerate python-storage. 💔

Here's the output from running synth.py:

Cloning into 'working_repo'...
Switched to branch 'autosynth'
Running synthtool
['/tmpfs/src/git/autosynth/env/bin/python3', '-m', 'synthtool', 'synth.py', '--']
synthtool > Executing /tmpfs/src/git/autosynth/working_repo/synth.py.
.coveragerc
.flake8
.github/CONTRIBUTING.md
.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/feature_request.md
.github/ISSUE_TEMPLATE/support_request.md
.github/PULL_REQUEST_TEMPLATE.md
.github/release-please.yml
.gitignore
.kokoro/build.sh
.kokoro/continuous/common.cfg
.kokoro/continuous/continuous.cfg
.kokoro/docs/common.cfg
.kokoro/docs/docs.cfg
.kokoro/presubmit/common.cfg
.kokoro/presubmit/presubmit.cfg
.kokoro/publish-docs.sh
.kokoro/release.sh
.kokoro/release/common.cfg
.kokoro/release/release.cfg
.kokoro/trampoline.sh
CODE_OF_CONDUCT.md
CONTRIBUTING.rst
LICENSE
MANIFEST.in
docs/_static/custom.css
docs/_templates/layout.html
docs/conf.py.j2
noxfile.py.j2
renovate.json
setup.cfg
Running session blacken
Creating virtual environment (virtualenv) using python3.6 in .nox/blacken
pip install black==19.3b0
Error: pip is not installed into the virtualenv, it is located at /tmpfs/src/git/autosynth/env/bin/pip. Pass external=True into run() to explicitly allow this.
Session blacken failed.
synthtool > Failed executing nox -s blacken:

None
synthtool > Wrote metadata to synth.metadata.
Traceback (most recent call last):
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/__main__.py", line 99, in <module>
    main()
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/__main__.py", line 91, in main
    spec.loader.exec_module(synth_module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "/tmpfs/src/git/autosynth/working_repo/synth.py", line 30, in <module>
    s.shell.run(["nox", "-s", "blacken"], hide_output=False)
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/shell.py", line 39, in run
    raise exc
  File "/tmpfs/src/git/autosynth/env/lib/python3.6/site-packages/synthtool/shell.py", line 33, in run
    encoding="utf-8",
  File "/home/kbuilder/.pyenv/versions/3.6.1/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['nox', '-s', 'blacken']' returned non-zero exit status 1.

Synthesis failed

Google internal developers can see the full log here.
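
The "Pass external=True into run()" error in the log suggests the blacken session is invoking pip from outside its virtualenv. A hedged sketch of how such a session is usually written (the noxfile layout and target paths are assumptions):

import nox

@nox.session(python="3.6")
def blacken(session):
    # session.install runs pip inside the session's own virtualenv, which avoids
    # the "pip is not installed into the virtualenv" error seen above.
    session.install("black==19.3b0")
    session.run("black", "google", "tests", "docs")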

Changes made in jupyter notebook ipynb file are not uploaded to the bucket

We upload a Jupyter notebook .ipynb file to a Google Cloud Storage bucket. Whenever we make changes and upload the same file, the changes are not reflected in the bucket. It looks like there is a caching issue while uploading/patching the changed file. How can we fix this issue?
We use the following code:

    def upload_blob(self, file_to_upload):
        """ Uploads file to the bucket"""
        blob = self.bucket.blob(file_to_upload)
        blob.upload_from_file(file_to_upload)
        print('File {} uploaded to {}'.format(file_to_upload, file_to_upload))
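
If the stale content comes from HTTP caching rather than the upload itself, one thing worth trying (a sketch, assuming the object's Cache-Control metadata is the culprit; names are illustrative) is to disable caching on the blob before uploading:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

def upload_blob_no_cache(bucket, file_to_upload):
    """Upload a local file and ask intermediaries not to cache the object."""
    blob = bucket.blob(file_to_upload)
    blob.cache_control = "no-cache"            # disable caching for this object
    blob.upload_from_filename(file_to_upload)  # note: upload_from_file expects a file object, not a path
    print("File {} uploaded with Cache-Control: no-cache".format(file_to_upload))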
