wireservice / lookup

A repository of journalist's lookup tables.

lookup

A repository of lookup tables for journalists. Designed for programmatic access using tools such as agate-lookup (Python) and lookupr (R).

Anyone may contribute a lookup table by sending a pull request to this repository.

Structure of files

Each folder is a key that can be used for a lookup. Within that folder are CSV files, and the name of each CSV file is the name of the value that it maps to. The CSV itself contains two columns, one with the key and another with the value. For example, usps/state.csv looks like this:

usps,state
AL,Alabama
AK,Alaska
AZ,Arizona
...
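
As a minimal sketch of how such a two-column table can be consumed without any helper library, the following Python reads the CSV text into a dictionary keyed on the first column. The function name `load_lookup` and the inline sample are illustrative assumptions; they are not part of this repository, agate-lookup, or lookupr.

```python
import csv
import io


def load_lookup(csv_text, key, value):
    """Build a {key: value} dict from the text of a two-column lookup CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row[value] for row in reader}


# Sample rows copied from the usps/state.csv excerpt above.
SAMPLE = "usps,state\nAL,Alabama\nAK,Alaska\nAZ,Arizona\n"

states = load_lookup(SAMPLE, "usps", "state")
print(states["AK"])
```

In practice the text would come from the file in this repository rather than from an inline string.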

Sometimes the mapping from a key to value varies over time. For example, NAICS codes change every five years. In this case, a version specifier may be included in the filename. For example, naics/description.2007.csv is the 2007 version of the code mapping and naics/description.2012.csv is the 2012 version.
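
The naming convention can be sketched as a small path builder. This is only an illustration of the rule described above; `table_path` is a hypothetical helper, not a function provided by this repository or by any of the client libraries.

```python
def table_path(key, value, version=None):
    """Return the relative path of a lookup table.

    The version specifier, when given, sits between the value name
    and the .csv extension.
    """
    filename = value if version is None else f"{value}.{version}"
    return f"{key}/{filename}.csv"


print(table_path("usps", "state"))
print(table_path("naics", "description", 2007))
```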

It may also be useful to map two keys to a single value. For example, you might want to look up population by state and year. In those cases key folders can be nested and the CSV can contain more than one key column. For example, usps/year/population.csv looks like this:

usps,year,population
AL,2015,4858979
AL,2014,4846411
AL,2013,4830533
...
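
A multi-key table can be handled the same way as a single-key one by using a tuple of key values. The sketch below assumes that the nested folder names are the key column names and that the value column is named after the file (`population`); both helper names are hypothetical.

```python
import csv
import io


def split_table_path(path):
    """Split 'usps/year/population.csv' into (['usps', 'year'], 'population')."""
    *keys, filename = path.split("/")
    value = filename.split(".")[0]  # drops any version specifier and the extension
    return keys, value


def load_multi_lookup(csv_text, keys, value):
    """Build a {(key1, key2, ...): value} dict from lookup CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in keys): row[value] for row in reader}


# Rows taken from the excerpt above; the header is an assumption (see lead-in).
SAMPLE = (
    "usps,year,population\n"
    "AL,2015,4858979\n"
    "AL,2014,4846411\n"
    "AL,2013,4830533\n"
)

keys, value = split_table_path("usps/year/population.csv")
population = load_multi_lookup(SAMPLE, keys, value)
print(population[("AL", "2014")])
```

Values are kept as text here; converting them to numbers would be the job of the column types declared in the metadata.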

Metadata format

Each CSV table must be accompanied by a YAML file. That file must have an identical filename, plus the .yml extension. For example, the table fips/state.csv must be accompanied by fips/state.csv.yml. This file should contain the following metadata:

data: A description of the data, including any notes necessary to use it correctly.
version: A description of the specific version of the data.
sources:
  - A list of sources for the data, such as "United States Census Bureau", including URLs whenever possible
contributors:
  - The name <and email> of anyone who has contributed to this table
columns:
  key_column_name:
    name: Human readable name for this column
    type: Agate column type, such as "Text" or "Number"
  value_column_name:
    name: Human readable name for this column
    type: Agate column type, such as "Text" or "Number"

See naics/description.2007.csv.yml for an example of a complete metadata file.
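
A contributor might want to sanity-check a metadata file before opening a pull request. The sketch below works on an already-parsed mapping (for example, the result of loading the YAML with a YAML library) and treats all five top-level fields as required, which is an assumption. `missing_metadata` and the example values are hypothetical and do not describe any real table.

```python
REQUIRED_FIELDS = ("data", "version", "sources", "contributors", "columns")


def missing_metadata(metadata, csv_header):
    """Return a list of problems found in a parsed metadata mapping."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in metadata]
    columns = metadata.get("columns", {})
    for column in csv_header:
        spec = columns.get(column)
        if spec is None:
            problems.append(f"undocumented column: {column}")
        elif not {"name", "type"} <= set(spec):
            problems.append(f"column {column} needs both name and type")
    return problems


# Made-up metadata, deliberately incomplete to show the checks firing.
example = {
    "data": "Hypothetical description",
    "version": "Hypothetical version note",
    "sources": ["Hypothetical source"],
    "columns": {"usps": {"name": "USPS abbreviation", "type": "Text"}},
}

print(missing_metadata(example, ["usps", "state"]))
```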

Rules for including data

Anyone may submit a pull request to add a table to this repository. However, the following rules will guide inclusion of any data:

The data must have journalistic value.
The data must be from an authoritative source.
The CSV must be in "standardized" CSV format. (Run through in2csv.)
All keys must be unique. (No split/combine crosswalks.)
All keys must be durable identifiers, not names.
All filenames and keys must use snake_case.
Periods must not be used in filenames or keys except as defined above.
Four-digit years must be used everywhere.
Each CSV must be 250KB or less.
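
Some of these rules are mechanical enough to check automatically. The following sketch covers only three of them (file size, snake_case path names, and unique keys) under stated assumptions: "250KB" is read as 250,000 bytes, the number of key columns is inferred from the number of folders in the path, and `check_table` is a hypothetical helper rather than tooling shipped with this repository.

```python
import csv
import io
import re

MAX_BYTES = 250 * 1000  # assumption: "250KB" read as 250,000 bytes
SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def check_table(path, csv_bytes):
    """Return a list of rule violations for one table."""
    problems = []
    if len(csv_bytes) > MAX_BYTES:
        problems.append("file exceeds 250KB")

    # Folder names are keys; the first dotted part of the filename is the value.
    *folders, filename = path.split("/")
    for name in folders + [filename.split(".")[0]]:
        if not SNAKE_CASE.match(name):
            problems.append(f"not snake_case: {name}")

    # Assumption: one leading key column per folder in the path.
    n_keys = len(folders)
    rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
    seen = set()
    for row in rows[1:]:  # skip the header row
        key = tuple(row[:n_keys])
        if key in seen:
            problems.append(f"duplicate key: {key}")
        seen.add(key)
    return problems


print(check_table("usps/state.csv", b"usps,state\nAL,Alabama\nAL,Alaska\n"))
```

The remaining rules (journalistic value, authoritative sourcing, durable identifiers) need human review and are not attempted here.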

I found an error!

If you find an error in any data, please send a pull request that corrects the mistake and adds a record of the correction to ERRORS.md. Try to describe the nature of the error as precisely as possible.

lookup's Issues

state/ap.csv

Should FIPS code columns be a Number data type?

I say this because I often see FIPS codes provided with leading zeros. Forcing everything to integers might be a workaround for that problem.
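
To make the trade-off concrete, here is a small sketch of what happens to a zero-padded code when it is stored as a number and then turned back into text. The padding width (2 for state codes) has to be known separately, which is the cost of the integer approach; the variable names are illustrative only.

```python
fips_text = "01"                       # Alabama's state FIPS code as text
fips_number = int(fips_text)           # as a Number the leading zero is lost
restored = str(fips_number).zfill(2)   # recovering it requires knowing the width

print(fips_number, restored)

# Matching on integers does make "01" and "1" compare equal.
print(int("01") == int("1"))
```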

Dates?

I was just grabbing this month's Canadian house price index data, and of course they decided to encode their dates like this:

Date	Index
Jan-2015	167.110
Feb-2015	167.320
Mar-2015	167.830
Apr-2015	168.090
May-2015	169.750
Jun-2015	172.220
Jul-2015	174.530
Aug-2015	176.590
Sep-2015	177.760
Oct-2015	177.960
Nov-2015	178.350
Dec-2015	178.260
Jan-2016	178.010
Feb-2016	179.200

It'd be nice to be able to automatically re-encode these to ISO 8601. Would this be a good application of lookup? There's bound to be some variation in how the months are abbreviated, so I'm not entirely sure. Also, days of the month might not always be in the dataset…
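
For the specific `Mon-YYYY` layout shown above, a small month table is enough to re-encode the labels as ISO 8601 year-month strings. This is a sketch that assumes three-letter English abbreviations and no day component; `to_iso_month` is a hypothetical helper, and other abbreviation styles would need additional entries.

```python
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def to_iso_month(label):
    """Convert a label such as 'Jan-2015' to the ISO 8601 form '2015-01'."""
    abbreviation, year = label.split("-")
    return f"{int(year):04d}-{MONTHS[abbreviation]:02d}"


print(to_iso_month("Jan-2015"))
print(to_iso_month("Feb-2016"))
```

An unknown abbreviation raises `KeyError`, which at least makes unexpected variants visible instead of silently mis-dating them.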

Should `columns` allow for more columns, and/or more metadata?

The documentation makes it seem as if columns will only allow for a key and value pair. But what if there's a 3-way lookup, e.g. "New York", "NY", "N.Y.", etc.? I'm guessing that's alluded to here:

but is key/colname: datatype enough? Or rather, is the succinctness worth the limitation in expanding the format?

I'm thinking of Census decade-to-decade lookup tables, in which later tracts sometimes incorporate a combination of past tracts, and this complexity would seemingly need to be stated at the columns level of metadata.

Also, having a "human readable full name" attribute for each column would be nice.

Anyway, I know these aren't easy questions without tradeoffs...but thanks for taking charge on this!

fips/city.csv

How would you feel about more state metadata, like Associated Press abbreviations?

We have a bunch of stuff like that over in latimes-statestyle that might fit here.

Scope of this repository

I find the stated description of this repo "A repository of journalist's lookup tables." quite ambiguous.

What types of open data are the maintainers willing to accept?

Should we have an "overflow" repository for other open data which is beyond the scope of this repository, with a more permissive merging strategy?

iso2/country and iso3/country are not proper, unicode names

Sao Tome, for instance.

naics/description.2002.csv

Correlates of War country codes

http://www.correlatesofwar.org/data-sets/cow-country-codes

Use an existing CSV schema format?

This is pretty awesome, and what I'm suggesting is possibly overkill, but I was wondering if you had considered using one of the CSV schema formats for specifying the fields in the CSV. These seem to be the two biggest ones out there:

I will admit this seems like a bit of overkill for a CSV of states, but it might be useful if you wanted to validate future changes or additions with an automated test, which would give you CI for your CSV. For instance, Goodtables is a validator that uses the JSON schema format (although it needs some work). CSVLint is another new entrant; I haven't evaluated it, but it also uses the JSON schema format (which seems like the one to consider now).

fips/county.csv

This one might be too big...