Comments (17)
@jgeewax Who does this need to be assigned to?
Assigning to me to find the right person. Should have someone soon.
/cc @thobrla
/cc @Capstan
Setting some expectations:
- I'm a lead in GCS, not a Ruby maven. Bear with my n00bitidity.
- I am likely to have time for this first on Monday, 5/11.
Great! Please let me know if you have any questions.
From reading the Gcloud::Storage docs (a rough usage sketch follows this list):
- Why name the class `File` for a GCS object? Is that not confusing vis-a-vis the core `File` class? Java uses `StorageObject` to differentiate from `Object`, and that seems useful here too to differentiate from Ruby's `Object`.
- file.delete() permanently deletes the file only if versioning is not on. Otherwise, it'll create an archive version, accessible only by generation.
- file.download() – nice use of verification!
- Is there an `IO::generic_readable` or `IO::generic_writable` accessor planned?
- file.copy() – why not just have it take another Storage::File object?
- file.signed_url() – what does this do?
- bucket.create_file()
  - s/265/256/
  - I think we might want to default to a larger chunk size for performance, maybe 2MB. Will clients be able to handle that? We advise that you keep the chunk size as large as possible.
  - What are the options? Can you override what the file system guesses the `Content-Type` is?
- bucket.default_acl() – prefer `default_object_acl`. Or is the idea that this is the ACL for "contained" things?
- bucket.files()
  - Does this do pagination for you under the covers?
  - What are the criteria? Does the consumer know/care that you are doing client-side filtering?
- bucket.find_file() – can this find non-existent objects, as in one you're about to create by using an `IO::generic_writable`? Or does it only refer to extant objects?
- Buckets are missing some misc. config options, like setting lifecycle configuration, website configuration, and versioning.
What is the way a gcloud-ruby consumer adds their own application (or tool) name/version to the `User-Agent` header? Presumably it should look something like `MyWebsite/1.0 gcloud-ruby/0.1.0 google-api-ruby-client/0.8.6`. Or perhaps the last is subsumed by the second-to-last, if you tie releases to specific underlying clients.
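For illustration, the composition might look like the following if it were wired through the underlying Faraday connection; none of this is a confirmed gcloud-ruby hook, and the agent strings and Faraday setup are assumptions.

```ruby
require "faraday"

# Hypothetical wiring, mirroring the format suggested above; not a confirmed gcloud-ruby hook.
app_agent = "MyWebsite/1.0"                   # supplied by the consumer
lib_agent = "gcloud-ruby/0.1.0"               # library identifier
api_agent = "google-api-ruby-client/0.8.6"    # underlying client (possibly folded into lib_agent)

conn = Faraday.new(url: "https://www.googleapis.com",
                   headers: { "User-Agent" => [app_agent, lib_agent, api_agent].join(" ") })
conn.headers["User-Agent"]  # => "MyWebsite/1.0 gcloud-ruby/0.1.0 google-api-ruby-client/0.8.6"
```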
> - bucket.create_file()
>   - I think we might want to default to a larger chunk size for performance, maybe 2MB. Will clients be able to handle that? We advise that you keep the chunk size as large as possible: https://cloud.google.com/storage/docs/json_api/v1/how-tos/upload#chunking

Is there a guideline/recommendation from the GCS team about when non-resumable writes are preferred and when resumable writes are preferred? I found one example – https://cloud.google.com/storage/docs/json_api/v1/objects/insert – where it is suggested to use non-resumable for small files (the example uses ~2MB) and resumable otherwise.
Also, reading the chunking reference, it looks like it's being discouraged ("This is not the preferred approach since there are performance costs associated with the additional requests, and it is generally not needed."). I am surprised by that, as it was not my impression/experience when working on the appengine_gcs_client (even when taking the AE 10MB up / 32MB down limits into account).

I guess the only alternative to chunking that would make resumable writes meaningful is to query upon failure and continue the writes from that point. If so, is it guaranteed that every write sent to the service before the failed write is going to be available? Otherwise I am not sure how much written data the client would need to keep in order to recover from a failure (and retry) transparently.
Querying upon failure and continuing is desirable with seekable data; for non-seekable data, chunking+buffering is necessary.
Family emergency. I'll try to respond later tonight or over the weekend.
Chunking is not discouraged in that it solves two problems:
- GAE has per-HTTP-request size limits.
- Being able to retire the client-side write buffer, especially when the client itself is being streamed data, requires knowing the committed point. If you get a `308 Resume Incomplete` response, you get the number of bytes stored so far server-side and can retire the buffer.*

Aside from those two things, it is inefficient in that it requires re-erecting an HTTP session for every chunk, so reducing that overhead by increasing chunk size is preferable (a single chunk is obviously best). You might be able to, in parallel, perform a request for upload status to see how much the server has committed, and retire the buffer that way, but that will be a conservative number and not necessarily strongly consistent with an ongoing upload session.

As to aozarov@'s question, the canonical definition of what has been committed is what is returned from a chunk write in the `Range` header, so yes, every write previously acknowledged with `308 Resume Incomplete` is committed. The client should get a 400-level error if they try to commit a partial chunk not at the end of the file.

*There is the possibility that you will end up with some later issue that could cause the upload session to be unresumable, e.g., an MD5 mismatch, that would then still abort the whole upload. For true safety, a client would have to buffer the entire amount until the final `201 Created`.
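For concreteness, a minimal sketch of the status-check side of that protocol using stdlib Net::HTTP against an already-created resumable session (the session URL, sizes, and token are placeholders; error handling is omitted):

```ruby
require "net/http"
require "uri"

# Assumed: a resumable upload session was already created and returned this URL.
session_uri = URI("https://www.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=resumable&upload_id=XYZ")
total_bytes = 10 * 1024 * 1024

# Ask the server how much it has committed: an empty PUT with "bytes */total".
status_req = Net::HTTP::Put.new(session_uri)
status_req.body = ""
status_req["Content-Range"] = "bytes */#{total_bytes}"

res = Net::HTTP.start(session_uri.host, session_uri.port, use_ssl: true) do |http|
  http.request(status_req)
end

case res.code.to_i
when 308
  # A "Range: bytes=0-N" response header means N+1 bytes are durably stored;
  # anything at or below that offset can be retired from the client-side buffer.
  committed = res["Range"] ? res["Range"].split("-").last.to_i + 1 : 0
  puts "Server has committed #{committed} bytes; resume from there."
when 200, 201
  puts "Upload already completed."
end
```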
The break-even point for resumable vs. non-resumable is how much latency the creation of the resumable session incurs vs. the throughput and quality of your network connection to Google. The fatter and more consistent the pipe, the bigger the object would need to be to make it worth performing the second round trip, since retransmitting the data is likely to be as fast if not faster in the event of an error. Certainly, in aggregate, due to low error rates, uploading many small-enough objects will be faster if you simply retry each from scratch.
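That trade-off could be captured by a simple size-based policy, sketched below; the 8MB cutoff is an arbitrary placeholder, not a GCS recommendation, and the real threshold would come from measuring a given connection.

```ruby
# Illustrative policy only: pick the upload strategy by object size.
# The 8MB cutoff is an assumed placeholder, not a GCS recommendation.
SIMPLE_UPLOAD_CUTOFF = 8 * 1024 * 1024

def upload_strategy(object_size_bytes)
  if object_size_bytes <= SIMPLE_UPLOAD_CUTOFF
    :simple     # one shot; on a rare failure it is cheaper to retry from scratch
  else
    :resumable  # worth the extra round trip to create a session
  end
end

upload_strategy(1 * 1024 * 1024)    # => :simple
upload_strategy(100 * 1024 * 1024)  # => :resumable
```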
file.download() needs help. Downloading a very large object directly will cause you to OOM, if not risk having your connection broken. Retrying by re-downloading the whole object is not ideal. Does Ruby allow IO-based HTTP responses, or must it fit everything in memory? In the latter case, you'll want to choose an appropriate chunk size and download it using `Content-Range: bytes 0-chunksize/length`, repeatedly appending to your file. You will want to pin the download to a specific object generation or risk getting mixed data. If there is an error, you know what offset you had last and can continue from there.
Advanced forms could parallelize downloads, but I don't know how relevant that is.
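A sketch of that chunked approach with stdlib Net::HTTP, using the standard `Range` request header for each piece and pinning to a generation via the JSON API's `generation` parameter; bucket, object, generation, chunk size, and auth are placeholders, and retry logic is elided.

```ruby
require "net/http"
require "uri"

bucket     = "my-bucket"          # placeholder
object     = "big-object.bin"     # placeholder
generation = 1234567890           # pin to one generation to avoid mixed data
chunk_size = 2 * 1024 * 1024

uri = URI("https://www.googleapis.com/storage/v1/b/#{bucket}/o/#{URI.encode_www_form_component(object)}" \
          "?alt=media&generation=#{generation}")

File.open("local-copy.bin", "wb") do |out|
  offset = 0
  loop do
    req = Net::HTTP::Get.new(uri)
    req["Authorization"] = "Bearer #{ENV['GCS_TOKEN']}"   # assumed token; auth elided
    req["Range"]         = "bytes=#{offset}-#{offset + chunk_size - 1}"

    res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
    break unless res.is_a?(Net::HTTPSuccess)   # e.g. 416 once we've read past the end

    out.write(res.body)
    offset += res.body.bytesize
    break if res.body.bytesize < chunk_size    # short read => last chunk
  end
end
```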
It seems like google-api-ruby-client's media.rb is missing download support, so perhaps we should come up with what is appropriate for that level of abstraction and reuse it here.
@Capstan Apologies for the delay. As you said, we are at the mercy of google-api-ruby-client here. Google API Client is built on a library named Faraday, and Faraday does not support streaming downloads. Ruby's stdlib HTTP library does support streaming downloads, but users may configure any number of alternate providers to use, for any number of justifiable reasons.
I'll follow up with @remi and see if there is anything we can do to avoid OOM situations when downloading very large files.
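For what it's worth, the stdlib path can already stream a response to disk without holding the whole object in memory; a minimal sketch (URL, filenames, and auth are placeholders):

```ruby
require "net/http"
require "uri"

uri = URI("https://www.googleapis.com/storage/v1/b/my-bucket/o/big-object.bin?alt=media")

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  req = Net::HTTP::Get.new(uri)
  req["Authorization"] = "Bearer #{ENV['GCS_TOKEN']}"   # assumed token; auth elided

  http.request(req) do |res|
    File.open("local-copy.bin", "wb") do |out|
      # read_body with a block streams the response in chunks instead of buffering it all.
      res.read_body { |chunk| out.write(chunk) }
    end
  end
end
```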