Comments (4)

mservidio commented on July 17, 2024

I prefer taking an MD5 hash of the file content, although both approaches have their merits. We could support both and make the choice configurable: default to the file hash, but let a user override it in the individual file configs. If we use an MD5 hash, we'll also need a place to persist historical hashes. We could store the hash at the row level, which would be redundant, or write a dummy file named after the hash to a separate path in the bucket once a file has been processed successfully. That way, checking whether a file has already been processed takes a single storage call that tests for the existence of the hash-named object.
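
A minimal sketch of the marker-file idea, assuming an in-memory set stands in for the bucket so the example runs without GCS access. The names `content_md5`, `marker_name`, `MarkerStore`, `process_once`, and the `processed_hashes/` prefix are hypothetical. The digest is hex-encoded here because a base64 digest can contain `/` and `+`, which are awkward in object names.

```python
import hashlib


def content_md5(data: bytes) -> str:
    """Return the hex MD5 digest of the file content."""
    return hashlib.md5(data).hexdigest()


def marker_name(digest: str, prefix: str = "processed_hashes/") -> str:
    """Build the (hypothetical) object name of the empty marker file."""
    return f"{prefix}{digest}"


class MarkerStore:
    """In-memory stand-in for a bucket; a real version would call the storage API."""

    def __init__(self):
        self._objects = set()

    def exists(self, name: str) -> bool:
        return name in self._objects

    def put_empty(self, name: str) -> None:
        self._objects.add(name)


def process_once(store: MarkerStore, data: bytes) -> bool:
    """Process the content unless its hash marker exists; return True if processed."""
    name = marker_name(content_md5(data))
    if store.exists(name):  # the single existence check per incoming file
        return False
    # ... ingestion of `data` would happen here ...
    store.put_empty(name)  # written only after successful processing
    return True
```

Writing the marker only after ingestion succeeds means a failed run leaves no marker behind, so the file can be retried.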

from datashare-toolkit.

salsferrazza commented on July 17, 2024

The hash is already available as file metadata on each GCS object, so perhaps it makes sense to add it to the datashare_batch_id:

    Creation time:          Thu, 19 Apr 2018 18:50:40 GMT
    Update time:            Thu, 19 Apr 2018 18:50:40 GMT
    Storage class:          STANDARD
    Content-Length:         33685826
    Content-Type:           application/zip
    Hash (crc32c):          25iVgg==
    Hash (md5):             sEjyHtS53TO/t+wK9PFtog==
    ETag:                   CPzOweKAx9oCEAE=
    Generation:             1524163840862076
    Metageneration:         1
    ACL:                    [
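
The MD5 value in this listing is base64-encoded, whereas many local tools print hex. A small sketch, assuming the hypothetical helper names `gcs_md5_to_hex` and `matches_gcs_md5`, converts between the two and compares a locally computed digest against the metadata value:

```python
import base64
import hashlib


def gcs_md5_to_hex(md5_b64: str) -> str:
    """Convert a base64-encoded MD5 (as shown in object metadata) to hex."""
    return base64.b64decode(md5_b64).hex()


def matches_gcs_md5(data: bytes, md5_b64: str) -> bool:
    """Check whether local content has the MD5 recorded in the metadata."""
    local_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    return local_b64 == md5_b64


# Value copied from the listing above.
listed_md5 = "sEjyHtS53TO/t+wK9PFtog=="
print(gcs_md5_to_hex(listed_md5))
```

One caveat: composite objects in GCS carry only a crc32c hash, so a fallback would be needed for those.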

salsferrazza commented on July 17, 2024

So the hash could be stored in a column separate from datashare_batch_id. The ingestion function can then halt if a COUNT(*) query with WHERE datashare_hash = '<hash_of_new_file>' returns anything but 0.
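
The check could look roughly like the following sketch. It uses SQLite from the Python standard library purely as a local stand-in for the real warehouse; the table name `ingested_rows`, the `payload` column, and the function `should_ingest` are hypothetical, while `datashare_batch_id` and `datashare_hash` come from the thread.

```python
import sqlite3


def should_ingest(conn: sqlite3.Connection, file_hash: str) -> bool:
    """Return False (halt) when any existing row already carries this hash."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM ingested_rows WHERE datashare_hash = ?",
        (file_hash,),
    ).fetchone()
    return count == 0


# Hypothetical table with one previously ingested row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ingested_rows "
    "(datashare_batch_id TEXT, datashare_hash TEXT, payload TEXT)"
)
conn.execute(
    "INSERT INTO ingested_rows VALUES (?, ?, ?)",
    ("batch-1", "sEjyHtS53TO/t+wK9PFtog==", "row data"),
)
```

Passing the hash as a bound parameter rather than interpolating it into the SQL string avoids quoting problems with base64 characters.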

mservidio commented on July 17, 2024

The problem is that using a single hash would require a join against all of the data. To minimize the joined set, we could combine the hash with a statement date or a similar field, depending on the type of data. In that case, it may also make sense to cluster on the date used.
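
One way to picture the narrowed lookup, with SQLite standing in for the warehouse and a composite index standing in for clustering. Both are assumptions, as are the `statement_date` column, the `ingested_rows` table, and the `already_loaded` function.

```python
import sqlite3


def already_loaded(conn: sqlite3.Connection, statement_date: str, file_hash: str) -> bool:
    """Look for the hash only among rows with the given statement date."""
    row = conn.execute(
        "SELECT 1 FROM ingested_rows "
        "WHERE statement_date = ? AND datashare_hash = ? LIMIT 1",
        (statement_date, file_hash),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingested_rows (statement_date TEXT, datashare_hash TEXT)")
# Date first, so the date filter narrows the rows before the hash is compared.
conn.execute(
    "CREATE INDEX idx_date_hash ON ingested_rows (statement_date, datashare_hash)"
)
conn.execute("INSERT INTO ingested_rows VALUES (?, ?)", ("2018-04-19", "abc123"))
```

A trade-off worth noting: a file resubmitted under a different date would not be caught by this narrower check.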
