The goal is to design a simple, practical, first-pass metadata format for tracking these datasets.
Information we will want the registries to keep track of:
Which datasets have been downloaded
Where they were downloaded from
Who downloaded them
Where they are available (HTTP links, IPFS links, Dat links, etc.)
Whether the download was vetted for authenticity and, if so, how it was vetted.
For example, EDGI has so far posted only a fraction of the downloaded datasets on datarefuge.org, because that CKAN instance only contains datasets that have been vetted using a documented process. We want the registries to show everything, including downloads that were not vetted, but we need to be able to distinguish vetted downloads from unvetted ones.
Note: a lot of datasets have been downloaded multiple times by different people. We need to represent "these are both versions of the same dataset" without losing info about where they are and who downloaded them.
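To make the list above concrete, here is a rough sketch of what one registry entry could look like, and how several downloads of the same dataset could be grouped without losing who downloaded each copy or where it lives. This is only an illustration, not a proposed final format: every field name (`dataset_id`, `source_url`, `downloaded_by`, `locations`, `vetted`, `vetting_method`), the `group_by_dataset` helper, and all sample values are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DownloadRecord:
    """One download of one dataset by one person (hypothetical field names)."""
    dataset_id: str        # shared by every copy of the same dataset
    source_url: str        # where it was downloaded from
    downloaded_by: str     # who downloaded it
    locations: List[str] = field(default_factory=list)  # HTTP, IPFS, Dat links
    vetted: bool = False   # unvetted downloads are still recorded
    vetting_method: Optional[str] = None  # how it was vetted, if it was


def group_by_dataset(records: List[DownloadRecord]) -> Dict[str, List[DownloadRecord]]:
    """Group downloads that are versions of the same dataset.

    Each record is kept whole, so source, downloader, and locations survive.
    """
    grouped: Dict[str, List[DownloadRecord]] = defaultdict(list)
    for record in records:
        grouped[record.dataset_id].append(record)
    return dict(grouped)


# Placeholder data: two people downloaded the same dataset independently.
records = [
    DownloadRecord(
        dataset_id="air-quality-2016",
        source_url="https://example.org/data/air-quality.csv",
        downloaded_by="alice",
        locations=["https://mirror.example.org/air-quality.csv"],
        vetted=True,
        vetting_method="documented checksum comparison",
    ),
    DownloadRecord(
        dataset_id="air-quality-2016",
        source_url="https://example.org/data/air-quality.csv",
        downloaded_by="bob",
        locations=["ipfs://<placeholder-hash>"],
    ),
]

grouped = group_by_dataset(records)
for dataset_id, copies in grouped.items():
    for copy in copies:
        status = "vetted" if copy.vetted else "not vetted"
        print(dataset_id, copy.downloaded_by, status, copy.locations)
```

Keeping one record per download, tied together by a shared identifier, means the registry can list every copy while still letting a reader filter on `vetted`. How that shared identifier gets assigned is an open question this sketch does not answer.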