
Volunteers! Welcome and thank you for spending your time with us.

Tools required: A laptop, curiosity, and a GitHub account. If you have never used GitHub - welcome! We will talk beginners through setting up GitHub accounts and using the platform. See our Contributing Guide for more information.

This event is a volunteer contribution to data backup efforts. By attending, you agree to our Code of Conduct, which we will enforce.

To chat after this event, join the Science Hack PDX Slack: https://sciencepdxslackin.herokuapp.com/

Citizen Science Metadata Curation Workshop!

Background

There is currently no centralized storage and access system for scientific research datasets. Scientific data is scattered across the internet, where it may be deposited on hard-to-find servers without meaningful documentation explaining what the data is, who created it, the context in which it was created, or how it should be discovered and used by other researchers.

Throughout the DataRefuge community, many parallel efforts are working to ensure that important datasets are discovered, nominated for archiving, and safely backed up. However, many of the datasets in the archiving queue do not have the machine-readable, standard metadata critical to the archiving process.

That's where you come in! We need your help to adopt one of these datasets, hopefully before it disappears from the web. Funding for science is uncertain, and datasets and the websites that accompany them could disappear at any time.

Your mission, if you choose to accept it (which you did implicitly by showing up and eating the snacks), is to create metadata and documentation for the scientific datasets being backed up by other DataRefuge efforts, so that their existence and provenance are recorded in the online public record with standard, open metadata.

Your efforts will prevent these datasets from drifting into obscurity. A dataset without metadata cannot be found, cited, or trusted by the scientific community.

Warning: You will be very confused for the first hour or so, because this is confusing. The data is all over the place, and there are millions of files with no rhyme or reason. But fear not: everyone goes through this, and we are all in this together! Before we continue, turn to your left (or right) and introduce yourself to the recruit Data Detective sitting next to you. When you get stuck or have a question, they are your first point of contact (and you are theirs).

What is metadata?

  • Descriptive and/or machine-readable information about a dataset.
  • Makes the dataset useful, reusable, and discoverable.
  • Good metadata is discoverable by search engines and uses open standards (today we focus on the JSON format, which is used by Data.gov).

For example, a metadata file should address these questions:

  • Who created the data set?
  • When was it created?
  • Is it being maintained?
  • From where was it downloaded?
  • What types of data, and how much, are in this dataset?
  • What is the origin of these data (owner, agency, instruments, creator, maintainer, contributors)?
  • Is there a clearly documented chain of custody?
  • Can you prove that your copy is the same data as the original? (One common approach, comparing checksums, is sketched below.)
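
That last question usually comes down to checksums. Here is a minimal sketch in Python; the file name and recorded digest are hypothetical, and in practice you'd compare against a digest recorded when the dataset was first downloaded or bagged:

```python
import hashlib

def sha256_checksum(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical names: compare your local copy against the digest recorded
# for the original download.
local_digest = sha256_checksum("solar_monthly_average.zip")
recorded_digest = "replace-with-the-digest-recorded-at-download-time"
print("match!" if local_digest == recorded_digest else "MISMATCH -- investigate!")
```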

Metadata is super important -- it essentially makes data discovery and reuse possible. If all you do today is create one metadata file for one dataset, find metadata online and match it to a dataset that has been downloaded, or find a few datasets that need metadata, that is AWESOME and necessary! No contribution is too small!

Data Detectives: Discovering data and creating JSON metadata files

We have a job for you!

Here is the workflow you will be following:

  1. Adopt a dataset from the "Done" bin at Archivers.Space.
  2. Claim your issue/dataset in this spreadsheet.
  3. Research your dataset to discover or create standard JSON formatted metadata using a local text file.
  4. Document your dataset by creating an issue in this repository and using our issue template to add the JSON metadata you found or created.
  5. Verify metadata from two other contributors.
  6. Repeat!

Example

Here is a sample exercise to take you through the workflow:

  1. Adopt a dataset from the "Done" bin at Archivers.Space. If all datasets in "Done" are claimed, move on to "Bag". Search the Google spreadsheet for your UUID and URL before claiming your dataset, then visit the issues list and find one that nobody else has adopted yet.
    Example: We're going to adopt this dataset: https://www.archivers.space/urls/D82A9773-81AF-4AF2-BF01-52CA2CF3BA22. We've updated the Google Spreadsheet and are ready to dive in.
    You might find after a few minutes that your dataset is incredibly confusing and hard to understand. This is normal. Here are some questions to ask during your research phase:

  2. Is the URL to a server with a bunch of datasets or one specific dataset?
    Example: Our example issue linked to this site: http://www.nrel.gov/gis/data_solar.html. There seem to be dozens of datasets on many topics, and the folders seem to cover different scientific research topics. If you find a server hosting different scientific datasets in different folders or behind different links, that's fine: a high-level JSON metadata file describing the page is useful, and you don't need to list 3,000 resources. Do what you can.

  3. Is it clear what scientific purpose this dataset serves?
    Example: From Archivers.Space - "These data provides monthly average and annual average daily total solar resource averaged over surface cells of 0.1 degrees in both latitude and longitude, or about 10 km in size. This data was developed using the State University of New York/Albany satellite radiation model. This model was developed by Dr. Richard Perez and collaborators at the National Renewable Energy Laboratory and other universities for the U.S. Department of Energy." Probably worth saving!
    If you can't find the purpose by clicking on the data, search Google for links to these files to see how other people have used this data. See the Google-Fu section below.

  4. Which organization funded it? Is it federally funded research?
    Sometimes this is in the URL of the server, or you might find it by Googling different acronyms.
    Example: Our dataset is clearly from the National Renewable Energy Laboratory (NREL), which we learned from the URL.

  5. Is raw data at the URL or is it a landing page where you need to click through?
    Raw data means files like .CSV, .ZIP, .PDF, or esoteric scientific formats, usually displayed in a folder structure; clicking one prompts you to save a download to your computer. Raw data is sometimes hard to find.
    A landing page is an HTML website that usually describes the research project and sometimes links to the raw data. If you find a landing page, try to find all the links to the raw data on the landing page(s); one scripted approach is sketched after this step.
    Example: This is a landing page that links out to multiple datasets.
    Since we're focusing on creating dataset metadata, we primarily want to find and describe the downloadable raw data, not just the project website or landing page (though the websites are useful for learning about the dataset, such as which agency created it).
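
If you're comfortable with a little scripting, one way to inventory the raw-data links on a landing page is to parse its HTML for hrefs ending in data-file extensions. A minimal sketch using only Python's standard library (the URL comes from our example; the extension list is an assumption -- adjust it for your dataset):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Extensions we assume indicate raw data files; extend as needed.
RAW_EXTENSIONS = (".csv", ".zip", ".pdf", ".xml", ".txt")

class RawDataLinkFinder(HTMLParser):
    """Collect hrefs on a page that look like raw data files."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(RAW_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))

url = "http://www.nrel.gov/gis/data_solar.html"  # landing page from the example
finder = RawDataLinkFinder(url)
finder.feed(urlopen(url).read().decode("utf-8", errors="replace"))
print("\n".join(finder.links))
```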

  6. Check whether your dataset exists in other places - search for the URL.
    Google-Fu (see below) comes in handy here! Someone might have already created metadata for this dataset. If the exact dataset you're working with is listed on Data.gov, for example, there will already be a metadata JSON file. (A scripted way to search Data.gov's catalog is sketched after this step.)
    Example: By searching Google for "http://www.nrel.gov/gis/data/GIS_Data_Technology_Specific/United_States/Solar", we found out that this URL is really the only place on the internet that links to these datasets. We checked the URLs of the raw data files linked from this site and did not find our dataset in any data repositories (with or without metadata). This means that if this webpage disappears, these datasets will be super hard to find. Fortunately, this dataset has been crawled by the Internet Archive, so it's going to be OK!

  7. What's the status of the metadata?
    If you find metadata, you should inspect it! If it's JSON, you can copy and paste it into JSONLint, which will format the file so you can read it. Then you can copy it into a local text editor to work with it some more. (A local alternative to JSONLint is sketched after this step.) If you found JSON metadata, you might still be able to improve it.
    I'm going to try to make a bot that checks your GitHub Issues tomorrow, so watch out for the bot! If you are reasonably sure that the metadata you've found describes the dataset you've adopted -- and nobody has linked to this metadata yet -- leave a comment with a link to the metadata so others can benefit from your detective work! Example: Our example is a landing page with multiple links to download data. The files have XML metadata, which is good, but this dataset needs a JSON file that describes its ownership, contents, and other key details. So let's make one!
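
If you'd rather check JSON locally instead of pasting it into JSONLint, Python's standard library does the same job. A minimal sketch (the file name is hypothetical):

```python
import json
import sys

# Validate and pretty-print a metadata file -- a local stand-in for JSONLint.
path = "metadata.json"  # hypothetical file name
try:
    with open(path) as f:
        metadata = json.load(f)
except json.JSONDecodeError as err:
    sys.exit(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")

print(json.dumps(metadata, indent=2, sort_keys=True))
```

The one-liner `python -m json.tool metadata.json` does roughly the same thing from the command line.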

  8. Editing or creating JSON metadata
    If you found metadata in another form, for example XML or even a README with a block of text, you can convert it to the JSON format using a converter or by creating a file by hand in a text editor. Take a look at this JSON template. (A conversion sketch follows the notes below.)

  • JSON organizes information about the data, the organization, and resources contained in the dataset.
  • The example contains a dataset with one resource, but multiple resources can be added when appropriate; see more examples in Max Ogden's 100 JSON files from Data.gov.
  • If you didn't find any metadata, you are going to create metadata in a text editor for your dataset using the JSON template as a guide.
  • Notes on JSON:
  • "Pretty print" it if you found it formatted as a block or long line
  • It's fussy!
  • Mind placement of all the . } and ]
  • Check your file in JSONLint to verify that you have no syntax errors when you're done! Here is the JSON metadata for this file: TODO
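
As a rough sketch of what converting by hand with a script can look like, here is one way to pull a few fields out of an XML file and write a Data.gov-style JSON record. Everything here is an assumption: the XML tag names, the file names, and the field mapping all depend on the metadata you actually found.

```python
import json
import xml.etree.ElementTree as ET

root = ET.parse("found_metadata.xml").getroot()  # hypothetical input file

def text(tag, default=""):
    """Return the stripped text of the first matching element, if any."""
    node = root.find(tag)
    return node.text.strip() if node is not None and node.text else default

# Hypothetical tag names -- substitute the ones your XML file actually uses.
record = {
    "title": text("title"),
    "description": text("abstract"),
    "modified": text("pubdate"),
    "publisher": {"name": text("origin", "Unknown")},
    "distribution": [
        {
            "downloadURL": "http://example.gov/data/solar.zip",  # placeholder URL
            "mediaType": "application/zip",
        }
    ],
}

with open("metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```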
  9. Document your dataset by creating an issue in this repository and using our issue template to add the JSON metadata you found or created. Post your metadata to the issue tracker.

  10. Verify metadata from two other contributors. Open an issue and check the metadata by asking the following questions:
    • Is the dataset URL correct?
    • Is the dataset UUID correct?
    • Are the ownership details of the dataset correct?
    • Is the JSON file formatted correctly? Check it in JSONLint, and mind those extra spaces!
    • Any other key questions to ask?

Let's all try it! Beginners and those who need to set up GitHub accounts will get organized first, and everyone else dives in.

Google-Fu

Search for a URL on data.gov only, using Google: site:data.gov "www1.ncdc.noaa.gov/pub/data/annualreports/"

Google things! Make sure the query is in double quotes ("") so it's an exact match.

  • "www1.ncdc.noaa.gov" annual reports (e.g. "server" + keywords) - we found other annual reports! But not ours
  • "www1.ncdc.noaa.gov/pub/data/annualreports/" - we found only a UK Met office PDF citing our PDFs!

Extra Credit

More on metadata files in JSON format; more on metadata and Schema.

More on Data.gov metadata, including long lists of possible dataset field titles, here.

Google also recently published a great post about using the Data.gov metadata format for dataset discovery: https://research.googleblog.com/2017/01/facilitating-discovery-of-public.html

Is your dataset missing from Data.gov? Max has been working with them to produce a full copy of their metadata archive and a process for reporting missing data.

  • DataRescue Harvester
  • DataRescue Seeder
  • DataRescue Checker, Bagger, Describer

Other places to look for your dataset
