
This project was forked from datarefuge/workflow.


License: Creative Commons Attribution Share Alike 4.0 International


DataRescue Workflow -- Overview

This document describes the workflow we use for Data Rescue activities as developed by the DataRefuge project and EDGI, both at in-person events and when people work remotely. It explains the process that a URL/dataset goes through from the time it has been identified, either by a Seeder & Sorter as "uncrawlable" or by other means, until it is made available as a record in the datarefuge.org CKAN data catalog. The process involves several stages and is designed to maximize smooth hand-offs, so that each phase is handled by someone with distinct expertise in the area they're tackling, while the data is tracked for security at every step.

Before you begin

We are so glad that you are participating in this project!

If you are an Event Organizer:

  • Learn about what you need to do to prepare the event here.

If you are a regular participant:

  • Get a role assignment (e.g., Seeder or Harvester), get the account credentials needed for your role, and make sure you have access to the key documents and apps needed to do the work. The Event/Remote Organizers will tell you how to proceed with all of this.
  • Go over the workflow documentation below, in particular the pages corresponding to your role.

Plan Overview

Seeders and Sorters canvass the resources of a given government agency, identifying important URLs. They identify whether those URLs can be crawled by the Internet Archive's web crawler. If the URLs are crawlable, the Seeders/Sorters nominate them to the End-of-Term (EOT) project; otherwise, they add them to the Uncrawlable spreadsheet using the project's Chrome Extension.
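For illustration only (this is not part of the official toolkit), here is a minimal sketch of how one might check whether a URL already has an Internet Archive snapshot, using the public Wayback Machine availability API. The example URL is hypothetical, and a snapshot alone does not prove a dynamic dataset is fully crawlable, so this only supplements the Seeder's own judgment:

```python
# Sketch: query the Wayback Machine availability API for a URL.
# A snapshot shows the page was captured at least once; it does NOT
# prove a dynamic dataset is crawlable.
import requests

def closest_snapshot(url):
    """Return the closest Wayback snapshot info for `url`, or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

snap = closest_snapshot("https://www.example.gov/some-dataset")  # hypothetical URL
if snap and snap.get("available"):
    print("Snapshot exists:", snap["url"])
else:
    print("No snapshot found -- candidate for the Uncrawlable spreadsheet.")
```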

Researchers inspect the "uncrawlable" list to confirm that the Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable) and investigate how the dataset could best be harvested. Research.md describes this process in more detail.

We recommend that Researchers and Harvesters (see below) work together in pairs, as much communication is needed between the two roles. In some cases, the same person will fulfill both roles.

Harvesters take the "uncrawlable" data and try to figure out how to actually capture it, based on the recommendations of the Researchers. This is a complex task that can require substantial technical expertise, and it requires different techniques for different datasets. Harvesters should see the included Harvesting Toolkit for more details and tools.
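As one illustrative example (real harvests vary widely, and the file URL below is hypothetical), a Harvester might capture a single file and record its SHA-256 checksum so later stages can verify the captured content is intact:

```python
# Sketch: download one data file and record its SHA-256 so that
# Checkers/Baggers can later verify the content.
import hashlib
import requests

def harvest_file(url, dest):
    """Stream `url` into the local file `dest`; return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)
                digest.update(chunk)
    return digest.hexdigest()

sha = harvest_file("https://www.example.gov/data/emissions.csv",  # hypothetical URL
                   "emissions.csv")
print("sha256:", sha)
```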

Note: The Checker role is currently performed by the Baggers and does not exist separately.

Checkers inspect a harvested dataset and make sure that it is complete. The main question Checkers need to answer is, "Will the bag make sense to a scientist?" Checkers need an in-depth understanding of harvesting goals and of the potential content variations among datasets.

Baggers perform some quality assurance on the dataset to make sure the content is correct and corresponds to the original URL. They then package the data into a BagIt file (or "bag"), which includes basic technical metadata, and upload it to its final DataRefuge destination.
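For example, the Library of Congress bagit-python library can create such a bag. A minimal sketch follows, where the directory name and metadata values are placeholders rather than DataRefuge requirements:

```python
# Sketch: turn a directory of harvested files into a BagIt bag
# (pip install bagit). Directory and metadata are placeholders.
import bagit

bag = bagit.make_bag(
    "harvested/emissions",  # hypothetical directory of harvested files
    {
        "Source-Organization": "DataRefuge",
        "External-Identifier": "https://www.example.gov/data/emissions",  # original URL
    },
    checksums=["sha256"],
)

# Re-check completeness and fixity before upload (also useful to Checkers).
print("Bag is valid:", bag.is_valid())
```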

Note: The Describer role is still being fine-tuned.

Describers create a descriptive record in the DataRefuge CKAN repository for each bag. They then link the record to the bag and make the record public.
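A minimal sketch of this step using the ckanapi client library (an assumption on our part; any CKAN API client would do). The catalog URL, API key, organization, dataset name, and bag URL below are all placeholders:

```python
# Sketch: create a CKAN record, attach the bag's location, then publish.
# All identifiers, URLs, and the API key are placeholders.
from ckanapi import RemoteCKAN

ckan = RemoteCKAN("https://www.datarefuge.org", apikey="YOUR-API-KEY")

pkg = ckan.action.package_create(
    name="example-agency-emissions",          # hypothetical dataset id
    title="Example Agency Emissions Data",
    notes="Harvested from https://www.example.gov/data/emissions",
    owner_org="datarefuge",                   # hypothetical organization
    private=True,                             # keep hidden until reviewed
)

ckan.action.resource_create(
    package_id=pkg["id"],
    url="https://storage.example.org/bags/emissions.zip",  # bag location
    name="BagIt package",
)

ckan.action.package_patch(id=pkg["id"], private=False)  # make the record public
```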


Partners

Data Rescue is a broad, grassroots effort with support from numerous local and nationwide networks. DataRefuge and EDGI partner with local organizers in supporting these events. See more of our institutional partners on the Data Refuge home page.

Contributors

khdelphine, librlaurie, dcwalk, mhucka, titaniumbones, marjanzer, 23koivisto, jschell42, danielballan, grosscol, vielmetti, b5, ottumm, arctansusan, murphyofglad

