Comments (4)
I prefer taking an MD5 hash of the file content, though both approaches have their merits. We could support both and make the choice configurable: default to the file hash, but let a user override it within the individual file configs. If we use an MD5 hash, we'll also need a place to persist historical hashes. We could keep the hash at the row level, which would be redundant, or store a dummy file named with the hash in a separate path in the bucket once a file has been processed successfully. That way we could check for the existence of a hash/file name with a single storage call.
from datashare-toolkit.
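The marker-file idea in the comment above could look roughly like the following. This is a minimal sketch, not the toolkit's implementation: the processed-hashes/ prefix, the function names, and the InMemoryBucket class (a stand-in for a real bucket client) are all assumptions made for illustration. The marker is named with the hex form of the digest because a base64 digest can contain "/" characters.

```python
import base64
import hashlib


def md5_base64(chunks):
    """Return the base64-encoded MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def marker_name(md5_b64, prefix="processed-hashes/"):
    """Build the (hypothetical) marker object name for a content hash."""
    return prefix + base64.b64decode(md5_b64).hex()


class InMemoryBucket:
    """Stand-in for a storage bucket; a real client would be used instead."""

    def __init__(self):
        self._names = set()

    def exists(self, name):
        return name in self._names

    def create_empty(self, name):
        self._names.add(name)


def process_once(bucket, chunks, process):
    """Run process() only if no marker exists for this content hash."""
    name = marker_name(md5_base64(chunks))
    if bucket.exists(name):
        return False  # same content was already processed
    process()
    # Write the marker only after processing succeeded.
    bucket.create_empty(name)
    return True
```

Because the marker is written only after process() returns, a failed run leaves no marker and the file can be retried.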
The hash is available as file metadata for each GCS object - perhaps it makes sense to add this to the datashare_batch_id.
Creation time: Thu, 19 Apr 2018 18:50:40 GMT
Update time: Thu, 19 Apr 2018 18:50:40 GMT
Storage class: STANDARD
Content-Length: 33685826
Content-Type: application/zip
Hash (crc32c): 25iVgg==
Hash (md5): sEjyHtS53TO/t+wK9PFtog==
ETag: CPzOweKAx9oCEAE=
Generation: 1524163840862076
Metageneration: 1
ACL: [
from datashare-toolkit.
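The MD5 value in the listing above is the base64 encoding of the raw 16-byte digest, not the hex string that most tools print. A small sketch of converting it so it can be compared with a locally computed hash follows; the function names are hypothetical.

```python
import base64
import hashlib


def gcs_md5_to_hex(md5_b64):
    """Convert a base64 MD5 value from object metadata to a hex string."""
    raw = base64.b64decode(md5_b64, validate=True)
    if len(raw) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return raw.hex()


def matches_local_content(md5_b64, content):
    """Check whether the metadata hash matches the given bytes."""
    return gcs_md5_to_hex(md5_b64) == hashlib.md5(content).hexdigest()
```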
So the hash could be stored in a column separate from datashare_batch_id; the ingestion function can then halt if a COUNT(*) WHERE datashare_hash = '<hash_of_new_file>' returns anything but 0.
from datashare-toolkit.
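The halt-on-duplicate check described above can be sketched as follows. This uses an in-memory SQLite table purely as a stand-in for the real destination table; the table name ingested_rows, the value column, and the DuplicateFileError exception are assumptions for illustration, while datashare_batch_id and datashare_hash come from the discussion.

```python
import sqlite3


class DuplicateFileError(Exception):
    """Raised when a file with the same content hash was already ingested."""


def make_demo_table():
    """Create an in-memory table standing in for the destination table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ingested_rows ("
        "datashare_batch_id TEXT, datashare_hash TEXT, value TEXT)"
    )
    return conn


def count_rows_with_hash(conn, file_hash):
    """Return how many rows already carry this file hash."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ingested_rows WHERE datashare_hash = ?",
        (file_hash,),
    ).fetchone()
    return row[0]


def ingest(conn, batch_id, file_hash, values):
    """Insert rows for a file, halting if its hash was seen before."""
    if count_rows_with_hash(conn, file_hash) != 0:
        raise DuplicateFileError(f"hash {file_hash} was already ingested")
    conn.executemany(
        "INSERT INTO ingested_rows "
        "(datashare_batch_id, datashare_hash, value) VALUES (?, ?, ?)",
        [(batch_id, file_hash, value) for value in values],
    )
```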
The problem is that checking a single hash would require a join against all of the data. To minimize the joined set, we could combine the hash with a statement date or a similar field, depending on the type of data. In that case it may also make sense to cluster the table on the date used.
from datashare-toolkit.
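One way to picture the narrowed lookup from the comment above is a query that filters on the date first and the hash second. The sketch below only builds the query text and its named parameters (using the @name style that BigQuery accepts); it does not call any client library. The statement_date column, the function name, and the example table name are hypothetical, and how much data the date filter actually avoids depends on how the real table is partitioned or clustered.

```python
import datetime


def build_duplicate_check(table, file_hash, statement_date):
    """Build a duplicate-check query restricted to one statement date.

    The table name is interpolated into the text because table names
    cannot be passed as query parameters, so it must come from trusted
    configuration rather than user input.
    """
    query = (
        f"SELECT COUNT(*) AS matches FROM `{table}` "
        "WHERE statement_date = @statement_date "
        "AND datashare_hash = @datashare_hash"
    )
    params = {
        "statement_date": statement_date.isoformat(),
        "datashare_hash": file_hash,
    }
    return query, params
```

A call such as build_duplicate_check("my_dataset.statements", file_hash, datetime.date(2018, 4, 19)) returns the query string and a parameter dictionary that the ingestion function could hand to whichever client it uses.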