
Comments (17)

blowmage commented on July 26, 2024

@jgeewax Who does this need to be assigned to?

jgeewax commented on July 26, 2024

Assigning to me to find the right person. Should have someone soon.

jgeewax commented on July 26, 2024

/cc @thobrla

jgeewax commented on July 26, 2024

/cc @Capstan

Capstan commented on July 26, 2024

Setting some expectations:

  • I'm a lead in GCS, not a Ruby maven. Bear with my n00bitidity.
  • I am likely to have time for this first on Monday, 5/11.

blowmage commented on July 26, 2024

Great! Please let me know if you have any questions.

Capstan commented on July 26, 2024

From reading the Gcloud::Storage docs:

  • Why name the class File for a GCS object? Isn't that confusing vis-à-vis the core File class? Java uses StorageObject to differentiate from Object, and that seems useful here too, to differentiate from Ruby's Object.
  • file.delete() permanently deletes the file only if versioning is not on. Otherwise, it'll create an archive version, accessible only by generation.
  • file.download() – nice use of verification!
  • Is there an IO::generic_readable or IO::generic_writable accessor planned?
  • file.copy() – why not just have it take another Storage::File object?
  • file.signed_url() – what does this do?
  • bucket.create_file()
    • s/265/256/
    • I think we might want to default to a larger chunk size for performance, maybe 2MB. Will clients be able to handle that? We advise that you keep the chunk size as large as possible.
    • What are the options? Can you override what the file system guesses the Content-Type is?
  • bucket.default_acl() – prefer default_object_acl. Or is the idea that this is the ACL for "contained" things?
  • bucket.files()
    • does this do pagination for you under the covers?
    • what are the criteria? Does the consumer know/care that you are doing client-side filtering?
  • bucket.find_file() – can this find non-existent objects, as in one you're about to create by using an IO::generic_writable? Or does it only refer to extant objects?
  • Buckets are missing some misc. config options, like setting lifecycle configuration, website configuration, versioning.
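
To make the surface concrete, here is a sketch of how these calls read together. The method names come from the docs above, but the constructor, argument names, and option names are my assumptions, not published signatures:

```ruby
require "gcloud/storage"

# Assumed setup; the actual constructor may differ.
storage = Gcloud.storage "my-project"
bucket  = storage.find_bucket "my-bucket"

# create_file with the options discussed above (chunk_size and an
# explicit Content-Type override are assumed option names).
file = bucket.create_file "local/path.txt", "remote/path.txt",
                          chunk_size: 2 * 1024 * 1024,
                          content_type: "text/plain"

file.download "local/copy.txt"   # verifies checksums, per the docs

# Presumably copy() takes strings today; the question above is whether
# it should take another Storage::File instead.
file.copy "other-bucket", "remote/copy.txt"

bucket.files.each { |f| puts f.name }  # pagination behavior unclear, per above

file.delete   # archives rather than deletes when versioning is on
```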

Capstan commented on July 26, 2024

How does a gcloud-ruby consumer add their own application (or tool) name/version to the User-Agent header? Presumably it should look something like MyWebsite/1.0 gcloud-ruby/0.1.0 google-api-ruby-client/0.8.6. Or perhaps the last is subsumed by the second-to-last, if you tie releases to specific underlying clients.
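
For illustration, a sketch of what a consumer-facing hook might look like; the user_agent_prefix option is hypothetical, not an existing gcloud-ruby setting:

```ruby
# Hypothetical option, shown only to make the question concrete.
storage = Gcloud.storage "my-project", user_agent_prefix: "MyWebsite/1.0"

# The library would then emit product tokens in order, e.g.:
#   User-Agent: MyWebsite/1.0 gcloud-ruby/0.1.0 google-api-ruby-client/0.8.6
```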

aozarov commented on July 26, 2024

Is there a guideline/recommendation from the GCS team about when non-resumable writes are preferred and when resumable ones are? I found one example, https://cloud.google.com/storage/docs/json_api/v1/objects/insert, which suggests non-resumable uploads for small files (the example uses ~2MB) and resumable uploads otherwise.
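
That rule of thumb would reduce to something like the following sketch; the 2 MB cutoff is only the docs' example, and resumable: is an assumed option name:

```ruby
RESUMABLE_THRESHOLD = 2 * 1024 * 1024  # ~2 MB, from the JSON API docs' example

# Sketch: pick the upload style by size (the option name is an assumption).
def upload(bucket, path, name)
  if File.size(path) <= RESUMABLE_THRESHOLD
    bucket.create_file path, name                   # simple upload
  else
    bucket.create_file path, name, resumable: true  # resumable upload
  end
end
```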

aozarov commented on July 26, 2024

Also, reading the chunking reference, it looks like chunking is discouraged ("This is not the preferred approach since there are performance costs associated with the additional requests, and it is generally not needed."). I am surprised by that, as it was not my impression/experience when working on the appengine_gcs_client (even taking the AE 10MB upload / 32MB download limits into account).
I guess the only alternative to chunking that would make resumable writes meaningful is to query upon failure and continue the writes from that point. If so, is it guaranteed that every write sent to the service before the failed write will be available? Otherwise I am not sure how much written data the client would need to keep in order to recover from a failure (and retry) transparently.

thobrla commented on July 26, 2024

Querying upon failure and continuing is desirable with seekable data; for non-seekable data, chunking+buffering is necessary.

blowmage commented on July 26, 2024

Family emergency. I'll try to respond later tonight or over the weekend.

Capstan commented on July 26, 2024

Chunking is not discouraged per se, in that it solves two problems:

  1. GAE has per-HTTP-request size limits.
  2. For being able to retire the client-side write buffer, esp. when the client itself is being streamed data, knowing the committed point is useful. If you get a 308 Resume Incomplete response, you will get the number of bytes stored so far server-side and can retire the buffer.*

Aside from those two things, it is inefficient in that it requires re-establishing an HTTP session for every chunk, so reducing that overhead by increasing the chunk size is preferable (a single chunk is obviously best). You might be able to, in parallel, request the upload status to see how much the server has committed, and retire the buffer that way, but that will be a conservative number and not necessarily strongly consistent with an ongoing upload session.

As to aozarov@'s question, the canonical definition of what has been committed is what is returned from a chunk write in the Range header, so yes, every write previously acknowledged with 308 Resume Incomplete is committed. The client should get a 400-level error if they try to commit a partial chunk that is not at the end of the file.

*There is the possibility that some later issue could make the upload session unresumable, e.g., an MD5 mismatch, which would then still abort the whole upload. For true safety, a client would have to buffer the entire amount until the final 201 Created.
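
To make the 308 flow concrete, here is a minimal sketch of the chunked-upload loop against the JSON API's resumable protocol. It assumes the session URI has already been obtained from the initial uploadType=resumable request, and it elides auth and retries:

```ruby
require "net/http"
require "uri"

CHUNK = 2 * 1024 * 1024  # multiples of 256 KB; larger is better, per above

# io: a seekable source; total: full object size in bytes.
def upload_chunks(session_uri, io, total)
  uri    = URI(session_uri)
  offset = 0
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    while offset < total
      chunk = io.read(CHUNK)
      last  = offset + chunk.bytesize - 1
      req   = Net::HTTP::Put.new(uri)
      req["Content-Range"] = "bytes #{offset}-#{last}/#{total}"
      req.body = chunk
      res = http.request(req)
      case res.code.to_i
      when 308  # Resume Incomplete: Range: bytes=0-N means N+1 bytes committed
        committed = res["Range"][/\d+\z/].to_i + 1
        io.seek(committed)   # the buffer up to `committed` can now be retired
        offset = committed
      when 200, 201          # final chunk accepted; object created
        return res
      else
        raise "upload failed: HTTP #{res.code}"
      end
    end
  end
end
```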

Capstan commented on July 26, 2024

The break-even point for resumable vs. non-resumable is how much latency the creation of the resumable session incurs vs. the throughput and quality of your network connection to Google. The fatter and more consistent the pipe, the bigger the object needs to be to make the second round trip worthwhile, since retransmitting the data is likely to be as fast, if not faster, in the event of an error. Certainly, in aggregate, given low error rates, uploading many small objects will be faster if you simply retry from scratch on failure.
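
As a back-of-the-envelope model of that break-even (my numbers, purely illustrative, not official guidance): resumable pays roughly one extra round trip up front, while non-resumable pays a full retransmit with some failure probability.

```ruby
rtt       = 0.05     # 50 ms to create the resumable session (assumed)
bandwidth = 12.5e6   # ~100 Mbit/s pipe, in bytes/sec (assumed)
p_error   = 0.001    # probability an upload fails and must be retried (assumed)

# Equate the extra round trip against the expected retransmit cost,
# rtt = p_error * (size / bandwidth), and solve for size:
break_even = rtt * bandwidth / p_error
puts "break-even ≈ #{(break_even / 1e6).round} MB"  # => 625 MB with these numbers
```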

Capstan commented on July 26, 2024

file.download() needs help. Downloading a very large object directly will cause you to OOM, if not risk having your connection broken. Retrying by re-downloading the whole object is not ideal. Does Ruby allow IO-based HTTP responses, or must everything fit in memory? In the latter case, you'll want to choose an appropriate chunk size and download in ranges (request Range: bytes=0-chunksize; the server answers with Content-Range: bytes 0-chunksize/length), repeatedly appending to your file. You will want to pin the download to a specific object generation or risk getting mixed data. If there is an error, you know the last offset you reached and can continue from there.

Advanced forms could parallelize downloads, but I don't know how relevant that is.
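
A sketch of that ranged-download loop against the JSON API (auth elided; the URL shape and generation pin follow the public object-media endpoint):

```ruby
require "net/http"
require "uri"

CHUNK = 2 * 1024 * 1024  # bytes per range request

# Sketch: stream an object to disk in ranges, pinned to one generation so a
# concurrent overwrite cannot mix data. On error, resume from `offset`.
def download(bucket, object, generation, dest)
  uri = URI("https://www.googleapis.com/storage/v1/b/#{bucket}/o/" \
            "#{URI.encode_www_form_component(object)}" \
            "?alt=media&generation=#{generation}")
  offset = 0
  File.open(dest, "wb") do |out|
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
      loop do
        req = Net::HTTP::Get.new(uri)
        req["Range"] = "bytes=#{offset}-#{offset + CHUNK - 1}"
        res = http.request(req)
        out.write(res.body)
        offset += res.body.bytesize
        # Content-Range: bytes start-end/total says how much remains.
        break if offset >= res["Content-Range"][/\d+\z/].to_i
      end
    end
  end
end
```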

Capstan commented on July 26, 2024

It seems like google-api-ruby-client's media.rb is missing download support, so perhaps we should come up with what is appropriate for that level of abstraction and reuse it here.

blowmage commented on July 26, 2024

@Capstan Apologies for the delay. As you said, we are at the mercy of google-api-ruby-client here. Google API Client is built on a library named Faraday, and Faraday does not support streaming downloads. Ruby's stdlib HTTP library does support streaming downloads, but users may configure any number of alternate providers, for any number of justifiable reasons.

I'll follow up with @remi and see if there is anything we can do to avoid OOM situations when downloading very large files.
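
For reference, the stdlib pattern that avoids buffering the whole body, which the Faraday stack cannot currently expose (placeholder URL; real use would go through the authenticated client):

```ruby
require "net/http"
require "uri"

uri = URI("https://example.com/large-object")  # placeholder
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    File.open("large-object", "wb") do |out|
      # read_body yields the body in segments as they arrive, so the
      # whole object never has to fit in memory.
      response.read_body { |segment| out.write(segment) }
    end
  end
end
```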
