Comments (11)

wlandau commented on June 3, 2024

For this to work, I think I will need to switch to using ETags as hashes instead of the custom targets hash stored in the object metadata. I think I didn't do this initially because I didn't know that S3 has strong read-after-write consistency.

from targets.

wlandau commented on June 3, 2024

Roadmap for AWS:

  • Implement and test aws_s3_list() in the utils. Remember pagination.
  • Switch to ETags.
  • Modify store_aws_hash() to use a cache. This function should only be called locally in the central controlling R session. I could put guardrails to make sure that stays the case.

Unfortunately, list_objects_v2() does not return version IDs, and list_object_versions() returns too much information (never just the most current objects). So it looks like this caching will not be version-aware and will have to fall back on HEAD requests if you git reset your way back to historical metadata.
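
A minimal sketch of the paginated listing step, under stated assumptions: aws_s3_list_etags() and list_page() are hypothetical names, not functions in targets. list_page() stands in for a real request such as list_objects_v2() and is assumed to return a list with Contents, IsTruncated, and NextContinuationToken fields. A mock is used here so the example runs without a bucket.

```r
# Collect key -> ETag pairs across every page of a bucket listing.
# `list_page` is a hypothetical function that takes a continuation token
# (NULL for the first page) and returns one page of results.
aws_s3_list_etags <- function(list_page) {
  etags <- character(0)
  token <- NULL
  repeat {
    page <- list_page(token)
    for (object in page$Contents) {
      etags[object$Key] <- object$ETag
    }
    # Stop once the service reports that no pages remain.
    if (!isTRUE(page$IsTruncated)) {
      break
    }
    token <- page$NextContinuationToken
  }
  etags
}

# Mock of a two-page listing (made-up keys and ETags).
pages <- list(
  list(
    Contents = list(
      list(Key = "_targets/objects/x", ETag = "\"aaa\""),
      list(Key = "_targets/objects/y", ETag = "\"bbb\"")
    ),
    IsTruncated = TRUE,
    NextContinuationToken = "page2"
  ),
  list(
    Contents = list(list(Key = "_targets/objects/z", ETag = "\"ccc\"")),
    IsTruncated = FALSE
  )
)
mock_list_page <- function(token) {
  if (is.null(token)) pages[[1]] else pages[[2]]
}

cache <- aws_s3_list_etags(mock_list_page)
```

The result is a named character vector keyed by object key. It carries no version IDs, which is the limitation described above.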

wlandau commented on June 3, 2024

For GCS, it might be good to just switch to ETags for the next release, then wait for cloudyr/googleCloudStorageR#179.

wlandau commented on June 3, 2024

Hmm.... I don't think we need to switch to ETags for hashes. We could just store the ETag as part of the metadata and use ETags instead of versions to corroborate objects.

wlandau commented on June 3, 2024

I thought this through a bit more, and unfortunately this batched caching feature no longer seems feasible.

As I said before, list_object_versions() is not feasible because it lists all the versions of all the objects, without any kind of guardrail to list e.g. only the most recent versions. Any given object could have thousands of versions, and so listing all the versions of all the objects is way too much.

On the other hand, neither list_objects() nor list_objects_v2() lists version IDs at all, so it is impossible to confirm that the version listed in the metadata actually exists or is current. For example, suppose you revert to a historical copy of the metadata, and you see version ABC and ETag XYZ for target x. The bucket's current version could have ETag XYZ, but version ABC may no longer exist. (It might, for instance, have been automatically deleted by the object retention policy.)

These and similar problems are impossible to reconcile unless:

  1. targets sends a HEAD request for each individual object, as it currently does, or
  2. targets sends a batched API request with a list of key-version pairs to learn whether each one exists.

(2) seems impossible, so I think we have to stick with (1).
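
To make the cost of option (1) concrete, here is a rough sketch with one request per key-version pair. versions_exist() and head_object() are hypothetical names. head_object() stands in for a real HEAD request that takes a key and a version ID, and it is assumed to throw an error when that version does not exist. The mock bucket below is made up so the example runs offline.

```r
# Check each key-version pair with one request per pair.
# `head_object` is a hypothetical function(key, version) that is assumed
# to signal an error if the requested version is missing.
versions_exist <- function(keys, versions, head_object) {
  stopifnot(length(keys) == length(versions))
  exists <- logical(length(keys))
  for (index in seq_along(keys)) {
    exists[index] <- tryCatch(
      {
        head_object(keys[index], versions[index])
        TRUE
      },
      error = function(condition) FALSE
    )
  }
  exists
}

# Mock bucket: the current version ID of each key (made-up values).
bucket <- c(x = "ABC", y = "DEF")
requests <- 0L
mock_head_object <- function(key, version) {
  requests <<- requests + 1L # Count requests to show the linear cost.
  if (!identical(unname(bucket[key]), version)) {
    stop("404 Not Found")
  }
  list(VersionId = version)
}

result <- versions_exist(
  keys = c("x", "y", "z"),
  versions = c("ABC", "OLD", "GHI"),
  head_object = mock_head_object
)
```

The number of requests grows linearly with the number of objects, which is why a batched alternative would be attractive if one existed.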

wlandau commented on June 3, 2024

I just posted https://stackoverflow.com/questions/77454033/is-there-a-way-to-batch-check-the-existence-of-specific-object-versions-in-aws-s

wlandau commented on June 3, 2024

Also posted https://repost.aws/questions/QUq7vI636vTKy0-e48-3Cf1Q/is-there-a-way-to-batch-check-the-existence-of-specific-object-versions-in-aws-s3

wlandau commented on June 3, 2024

Tried to send a feature request on their feedback form, but it's glitchy today:

I am writing an R package which needs to check the existence of a specific version of each AWS S3 object in its data store. The version of a given object is the version ID recorded in the local metadata, and the recorded version may or may not be the most current version in the bucket. Currently, the package accomplishes this by sending a HEAD request for each relevant object-version pair.

I would like a more efficient, batched way to do this for each object-version pair. list_object_versions() returns every version of every object of interest, which is far too many versions to download efficiently, and neither list_objects() nor list_objects_v2() returns any version IDs at all. It would be great to have something like delete_objects() that, instead of deleting the objects, accepts the supplied key-version pairs and returns the ETag and custom metadata of each one that exists.

cf. https://repost.aws/questions/QUe-yNsIr0Td2aq2oA1RAQdQ/hudi-and-s3-object-versions

wlandau commented on June 3, 2024

Note to self: if it ever becomes possible to revisit this issue, I will probably need to switch targets to use AWS/GCS ETags when available instead of custom local file hashes. The switch is as simple as this:

  1. In store_upload_object_aws(), remove the targets-hash custom metadata:

     metadata = list("targets-hash" = store$file$hash),

  2. In store_upload_object_aws(), write store$file$hash <- digest_chr64(head$ETag) just above the following line:

     store$file$path <- c(path, paste0("version=", head$VersionId))

  3. At the end of store_aws_hash(), return digest_chr64(head$ETag) instead of head$Metadata[["targets-hash"]].
  4. Test that the correct ETags get to the metadata and the correct ETags are being retrieved by store_aws_hash() to assert that up-to-date targets are indeed up to date.
  5. Repeat all the above for GCS.

wlandau commented on June 3, 2024

Taking a step back: this is actually feasible if targets can ignore version IDs. There could be a tar_option_set()-level option to either check or ignore version IDs. Things to consider:

  • Should the option be at the level of tar_option_set() and not tar_target()? At first glance, I think so because caching happens in bulk. Maybe the level of tar_resources_aws() could technically work, but those options are all implicitly target-level, which would be counterintuitive even with good documentation.
  • Should the version check still be enabled by default? I think so, for compatibility. But it will be slow.

wlandau commented on June 3, 2024

Taking another step back: targets should:

  1. Always use the version ID when downloading data, and
  2. Always ignore the version ID when checking the hash.

(1) ensures behavior is clear, consistent, compatible, and version-aware. (2) ensures a target reruns if its recorded object is not the current object in the bucket. (2) also makes this issue much easier to implement, and it lets us avoid adding a new version argument to tar_resources_aws(). The outcomes will be:

  1. Pipelines with cloud targets will run dramatically faster.
  2. The rules for checking/rerunning outdated targets will take into account which objects are the latest versions in the bucket. This makes more conceptual sense.
  3. Users won't need to do anything extra.
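
The two rules can be sketched roughly as follows. This is an illustration under stated assumptions, not the targets implementation: download_request() and target_outdated() are hypothetical helpers, and etag_cache is assumed to be a named character vector of current ETags keyed by object key, built from a bucket listing.

```r
# Rule 1: downloads always name the exact recorded version.
# The returned list mirrors the Bucket/Key/VersionId arguments of a
# typical S3 download request.
download_request <- function(bucket, key, version) {
  list(Bucket = bucket, Key = key, VersionId = version)
}

# Rule 2: the hash check ignores version IDs. A target is outdated if its
# key is absent from the listing or the current ETag differs from the
# ETag recorded in the local metadata.
target_outdated <- function(key, recorded_etag, etag_cache) {
  current <- unname(etag_cache[key])
  is.na(current) || !identical(current, recorded_etag)
}

# Made-up listing of the current objects in the bucket.
etag_cache <- c(
  "_targets/objects/x" = "\"aaa\"",
  "_targets/objects/y" = "\"bbb\""
)

# Recorded ETag matches the current object.
up_to_date <- target_outdated("_targets/objects/x", "\"aaa\"", etag_cache)
# Recorded ETag is stale.
changed <- target_outdated("_targets/objects/y", "\"old\"", etag_cache)
# Object no longer exists in the bucket.
missing <- target_outdated("_targets/objects/z", "\"ccc\"", etag_cache)

request <- download_request("my-bucket", "_targets/objects/x", "ABC")
```

Because the check reads only from the cached listing, it needs no per-object request, which is where the speedup would come from.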
