Comments (10)

flyingzumwalt commented on May 28, 2024

@b5 we've got two of them! The challenge here is to build metadata registries describing what's out there, regardless of which system is used to store the datasets. Those metadata registries will certainly contain ways to find the datasets over p2p networks, so we can also do things like coordinate clusters of nodes that want to replicate a given dataset, but we need the registry in order to support basic activities like keeping an inventory of what datasets we've rescued, their provenance, what's in them, and who's holding them.

b5 commented on May 28, 2024

So, we have a bit of a problem, and we're going to need to rethink our practices if we're going to be able to coordinate properly with other archiving efforts. I have a solution in mind, but it has sweeping implications for our practices to date.

Here's what our current base schema for metadata collection looks like:

{
  "Individual source or seed URL": "http://www.eia.gov/renewable/data.cfm",
  "UUID": "E30FA3CA-C5CB-41D5-8608-0650D1B6F105",
  "id_agency": 2,
  "id_subagency": null,
  "id_org": null,
  "id_suborg": null,
  "Institution facilitating the data capture creation and packaging": "Penn Data Refuge",
  "Date of capture": "2017-01-17",
  "Federal agency data acquired from": "Department of Energy/U.S. Energy Information Administration",
  "Name of resource": "Renewable and Alternative Fuels",
  "File formats contained in package": ".pdf, .zip",
  "Type(s) of content in package": "datasets, codebooks",
  "Free text description of capture process": "Metadata was generated by viewing page and using spreadsheet descriptions where necessary, data was bulk downloaded from the page using wget -r on the seed URL and then bagged.",
  "Name of package creator": "Mallick Hossain and Ben Goldman"
}

There are additional fields added by certain automated tools. All of this information is stored and coordinated by UUIDs; it's a bit spread out, but it will be trivial to assemble thanks to good old UUIDs.
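
As a rough illustration of that assembly step (a sketch, not our actual tooling; record shapes assumed from the base schema above), merging the spread-out fragments is just a join on the shared UUID:

from collections import defaultdict

# Merge metadata fragments that share a UUID into a single record.
# Assumes every fragment carries the "UUID" field from the base schema.
def assemble(*record_sets):
    merged = defaultdict(dict)
    for records in record_sets:
        for record in records:
            merged[record["UUID"]].update(record)
    return dict(merged)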

There's nothing wrong with the metadata we're gathering; the problem lies in the metadata we aren't gathering. Both of the examples from other organizations record a one-to-one relationship between a URL and its content. Our approach lacks this mapping, and that will prevent us from coordinating effectively.

In our current process, the "URLs" brought into our pipeline actually represent a one-to-many relationship with the sub-URLs that the given page links to. When a volunteer logs into the app, they are given a URL as a starting point and then download all of the content that the page links to. So in our current setup, one "URL" results in many static files. Because we don't dictate strict methods for how volunteers archive data, we have no dependable way of associating the data inside an uploaded zip archive with its URL of origin.

This has some dramatic implications for our archiving process: namely, that if we are to coordinate efforts, we will need to archive data programmatically instead of through volunteer-driven downloads.
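
To make "programmatically archive" concrete, here is a minimal sketch using only the Python standard library. It's an assumption about shape, not a description of existing tooling: it fetches a single URL, names the stored file by its content hash, and emits a record that preserves the URL-to-file mapping we're currently losing.

import hashlib
import uuid
from datetime import datetime, timezone
from urllib.request import urlopen

def capture(url):
    # Fetch the URL and keep the response metadata.
    with urlopen(url) as resp:
        body = resp.read()
        status = resp.status
        headers = dict(resp.headers.items())
    # Name the stored file by the SHA-256 of its bytes, so the
    # record below ties this exact content back to its origin URL.
    digest = hashlib.sha256(body).hexdigest()
    with open(digest, "wb") as f:
        f.write(body)
    return {
        "UUID": str(uuid.uuid4()),
        "url": url,
        "date": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "headers": headers,
        "file": digest,
    }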

I want to say that while this may seem like a bad thing, I think it is in fact a very good thing. I think this is just the nudge we need to move away from having volunteers download data (a task that is quite frankly better performed by a computer) to having volunteers at our events contextualize data (an inherently human task). Instead of asking volunteers to engage in downloading, we would hand them already-archived data, and ask them to enrich the metadata & context that is lost in the archiving process. This absolves us of many chain-of-custody issues for the data itself, gives us higher-integrity data, and allows us to engage with the broader archiving community. It would give me great joy to ask a volunteer to learn about & document an already-archived dataset instead of spending hours troubleshooting s3 credentials.

With that, I'm heading to Boston today to think this over with others and begin conceptualizing changes to our approach to match the efforts of our peer organizations. Growth can sometimes be painful, but I for one am extremely excited at the prospect of growing our process to have more hands make for lighter lifting.

flyingzumwalt commented on May 28, 2024

Here is a sample of the download stats that @maxogden collects for his downloads: https://www.irccloud.com/pastebin/RgbAui2I/

{
  "url": "http://www.dot.gov/regulations/significant-rulemaking-report-archive",
  "date": "2017-01-29T03:34:05.844Z",
  "headersTook": 4232,
  "package_id": "27949aef-ad78-4a56-8d95-eb2f3943d3bf",
  "id": "366cf35f-24f0-4b14-ba5e-78fbccf8ab6c",
  "status": 200,
  "rawHeaders": [
    "Content-Language",
    "en",
    "Content-Type",
    "text/html; charset=utf-8",
    "ETag",
    "\"1485633771-1\"",
    "Last-Modified",
    "Sat, 28 Jan 2017 20:02:51 GMT",
    "Link",
    "<https://www.transportation.gov/regulations/significant-rulemaking-report-archive>; rel=\"canonical\",<https://www.transportation.gov/node/1485>; rel=\"shortlink\"",
    "Server",
    "nginx",
    "X-Age",
    "0",
    "X-AH-Environment",
    "prod",
    "X-Drupal-Cache",
    "HIT",
    "X-Frame-Options",
    "SAMEORIGIN",
    "X-Generator",
    "Drupal 7 (http://drupal.org)",
    "X-Request-ID",
    "v-c9206ff0-e5d3-11e6-983e-22000b4183e0",
    "X-UA-Compatible",
    "IE=edge,chrome=1",
    "X-Varnish",
    "505586209",
    "Cache-Control",
    "public, max-age=3503",
    "Expires",
    "Sun, 29 Jan 2017 04:32:28 GMT",
    "Date",
    "Sun, 29 Jan 2017 03:34:05 GMT",
    "Transfer-Encoding",
    "chunked",
    "Connection",
    "keep-alive",
    "Connection",
    "Transfer-Encoding",
    "Strict-Transport-Security",
    "max-age=31622400"
  ],
  "headers": {
    "content-language": "en",
    "content-type": "text/html; charset=utf-8",
    "etag": "\"1485633771-1\"",
    "last-modified": "Sat, 28 Jan 2017 20:02:51 GMT",
    "link": "<https://www.transportation.gov/regulations/significant-rulemaking-report-archive>; rel=\"canonical\",<https://www.transportation.gov/node/1485>; rel=\"shortlink\"",
    "server": "nginx",
    "x-age": "0",
    "x-ah-environment": "prod",
    "x-drupal-cache": "HIT",
    "x-frame-options": "SAMEORIGIN",
    "x-generator": "Drupal 7 (http://drupal.org)",
    "x-request-id": "v-c9206ff0-e5d3-11e6-983e-22000b4183e0",
    "x-ua-compatible": "IE=edge,chrome=1",
    "x-varnish": "505586209",
    "cache-control": "public, max-age=3503",
    "expires": "Sun, 29 Jan 2017 04:32:28 GMT",
    "date": "Sun, 29 Jan 2017 03:34:05 GMT",
    "transfer-encoding": "chunked",
    "connection": "keep-alive, Transfer-Encoding",
    "strict-transport-security": "max-age=31622400"
  },
  "downloadTook": 4579,
  "file": "5838d1071ae7c3fee63d2c425d89d799ff4e9bee6ec3f99643952a3a9267febe"
}
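
Worth noting: the "file" field above is a 64-character hex digest, which looks like a SHA-256 of the downloaded bytes (an assumption on my part; the pastebin doesn't say). If that's right, each record doubles as an integrity check, e.g.:

import hashlib

# Verify a stored file against the "file" digest in a download record,
# assuming (not confirmed) that the digest is SHA-256.
def verify(path, expected_digest):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so large datasets needn't fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() == expected_digest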

mejackreed commented on May 28, 2024

A piece of metadata I have: https://gist.github.com/mejackreed/cee25feea0c0b1d9602e38bc9479a61d

Files downloaded from resources are also accompanied by headers from the download.

dcwalk commented on May 28, 2024

pinging @b5 and @danielballan RE: DataRescue metadata

dcwalk commented on May 28, 2024

Also! Just to note -- the vetting process and posting for the DataRefuge CKAN are handled by DataRefuge; their input could speak to the areas of metadata that are generated through the vetting workflow (cc @rlappel and @jschell42 -- are you the right people to ping on this?)

flyingzumwalt commented on May 28, 2024

@b5 could you post an example of the metadata you capture for datasets downloaded at a #datarescue hackathon?

titaniumbones commented on May 28, 2024

@mejackreed is that metadata gist idiosyncratic to you, or is it produced in accordance with the standards of a wider community (Climate Mirror, Azimuth, etc.)?

ambergman commented on May 28, 2024

@flyingzumwalt - With @b5, @trinberg, and others, there have been some early conversations - and I'm confident I'm not the one who should be having or reporting on them - about how to coordinate nodes in the short term, so that replicated datasets can have additional metadata added locally but still all be able to reference one another and access those additions. Your comment about being agnostic to the storage system would definitely be a part of that. The long term should, of course, look different (and I won't even try to pretend I really fully understand IPFS here :) ), but it would be great to discuss the short term at the event today.

mhucka commented on May 28, 2024

I think this is just the nudge we need to move away from having volunteers download data (a task that is quite frankly better performed by a computer) to having volunteers at our events contextualize data (an inherently human task). Instead of asking volunteers to engage in downloading, we would hand them already-archived data, and ask them to enrich the metadata & context that is lost in the archiving process.

That's a great goal. I know that when I was leading people in seeding URLs during the UCLA event in January, I felt a bit like I was asking them to do really menial stuff that a computer should be doing.
