lookup

A repository of lookup tables for journalists. Designed for programmatic access using tools such as agate-lookup (Python) and lookupr (R).

Anyone may contribute a lookup table by sending a pull request to this repository.

Structure of files

Each folder is named for a key that can be used for a lookup, and contains CSV files. Each CSV file is named for the value that it maps the key to, and contains two columns: one with the key and one with the value. For example, usps/state.csv looks like this:

usps,state
AL,Alabama
AK,Alaska
AZ,Arizona
...
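
As a rough illustration of how such a table might be consumed, here is a minimal Python sketch that reads a two-column table into a dictionary. The function name load_lookup is hypothetical and is not part of agate-lookup or lookupr; the sample text simply repeats the rows shown above.

```python
import csv
import io


def load_lookup(csv_text, key, value):
    """Build a dict mapping the key column to the value column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row[value] for row in reader}


# The first rows of usps/state.csv, as shown above.
sample = "usps,state\nAL,Alabama\nAK,Alaska\nAZ,Arizona\n"

states = load_lookup(sample, "usps", "state")
print(states["AK"])  # Alaska
```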

Sometimes the mapping from a key to value varies over time. For example, NAICS codes change every five years. In this case, a version specifier may be included in the filename. For example, naics/description.2007.csv is the 2007 version of the code mapping and naics/description.2012.csv is the 2012 version.
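
The naming convention can be read mechanically. The sketch below splits a table path into its key folders, value name, and optional version specifier. The helper parse_table_path is hypothetical, and it assumes that anything between the first period and the .csv extension is the version.

```python
from pathlib import PurePosixPath


def parse_table_path(path):
    """Split a table path into (keys, value, version)."""
    p = PurePosixPath(path)
    if p.suffix != ".csv":
        raise ValueError(f"not a CSV table path: {path}")
    # Folder names are the keys.
    keys = list(p.parent.parts)
    # Assumption: the text after the first period in the stem is the version.
    value, _, version = p.stem.partition(".")
    return keys, value, version or None


print(parse_table_path("naics/description.2007.csv"))
# (['naics'], 'description', '2007')
print(parse_table_path("usps/state.csv"))
# (['usps'], 'state', None)
```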

It may also be useful to be able to map two keys to a single value. For example, you might want to look up population by state and year. In those cases key folders can be nested and the CSV can contain more than one key column. For example, usps/year/population.csv contains a CSV that looks like this:

usps,year,population
AL,2015,4858979
AL,2014,4846411
AL,2013,4830533
...
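
A table with more than one key column can be indexed by a tuple of keys. The following is a minimal sketch under that assumption; load_multi_key_lookup is a hypothetical helper, and the sample header assumes the value column is named population, after the file name.

```python
import csv
import io


def load_multi_key_lookup(csv_text, keys, value):
    """Build a dict mapping a tuple of key columns to the value column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in keys): row[value] for row in reader}


# Rows from usps/year/population.csv, as shown above.
sample = (
    "usps,year,population\n"
    "AL,2015,4858979\n"
    "AL,2014,4846411\n"
    "AL,2013,4830533\n"
)

population = load_multi_key_lookup(sample, ["usps", "year"], "population")
print(population[("AL", "2014")])  # 4846411
```

Values are kept as strings here; converting them to numbers would depend on the column type declared in the table's metadata.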

Metadata format

Each CSV table must be accompanied by a YAML file. That file must have an identical filename, plus the .yml extension. For example, the table fips/state.csv must be accompanied by fips/state.csv.yml. This file should contain the following metadata:

data: A description of the data, including any notes necessary to use it correctly.
version: A description of the specific version of the data.
sources:
  - A list of sources for the data, such as "United States Census Bureau", including URLs whenever possible
contributors:
  - The name <and email> of anyone who has contributed to this table
columns:
  key_column_name:
    name: Human readable name for this column
    type: Agate column type, such as "Text" or "Number"
  value_column_name:
    name: Human readable name for this column
    type: Agate column type, such as "Text" or "Number"

See naics/description.2007.csv.yml for an example of a complete metadata file.
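
A contributor might want to check a metadata file before opening a pull request. The sketch below works on metadata that has already been parsed into a Python dict (for example with a YAML library, which is outside this sketch). Both helper names are hypothetical, and treating all five top-level fields as mandatory is an assumption.

```python
# Assumption: every top-level field in the template above is expected.
EXPECTED_FIELDS = ("data", "version", "sources", "contributors", "columns")


def metadata_path(csv_path):
    """Return the metadata filename: the CSV filename plus .yml."""
    return csv_path + ".yml"


def check_metadata(metadata, csv_columns):
    """Return a list of problems found in an already-parsed metadata dict."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in metadata:
            problems.append(f"missing field: {field}")
    described = metadata.get("columns") or {}
    for column in csv_columns:
        entry = described.get(column)
        if entry is None:
            problems.append(f"column not described: {column}")
            continue
        for attr in ("name", "type"):
            if attr not in entry:
                problems.append(f"column {column} lacks {attr}")
    return problems


print(metadata_path("fips/state.csv"))  # fips/state.csv.yml
```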

Rules for including data

Anyone may submit a pull request to add a table to this repository. However, the following rules guide the inclusion of any data:

  • The data must have journalistic value.
  • The data must be from an authoritative source.
  • The CSV must be in "standardized" CSV format. (Run it through in2csv.)
  • All keys must be unique. (No split/combine crosswalks.)
  • All keys must be durable identifiers, not names.
  • All filenames and keys must use snake_case.
  • Periods must not be used in filenames or keys except as defined above.
  • Four digit years must be used everywhere.
  • Each CSV must be 250KB or less.
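
Some of these rules can be checked mechanically. The sketch below covers three of them: file size, snake_case column names, and unique keys. The helper check_table is hypothetical. Reading "250KB" as 250 × 1024 bytes is an assumption, as is applying the snake_case rule to column names rather than to key values (which, like AL, may be uppercase).

```python
import csv
import io
import re

SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")
MAX_BYTES = 250 * 1024  # assumption: "250KB" means 250 * 1024 bytes


def check_table(csv_text, key_columns):
    """Return a list of rule violations found in a table."""
    problems = []
    if len(csv_text.encode("utf-8")) > MAX_BYTES:
        problems.append("file is larger than 250KB")
    reader = csv.DictReader(io.StringIO(csv_text))
    for name in reader.fieldnames or []:
        if not SNAKE_CASE.match(name):
            problems.append(f"column name is not snake_case: {name}")
    seen = set()
    for row in reader:
        # A key is the combination of all key columns.
        key = tuple(row[k] for k in key_columns)
        if key in seen:
            problems.append("duplicate key: " + "/".join(key))
        seen.add(key)
    return problems


print(check_table("usps,state\nAL,Alabama\nAL,Alaska\n", ["usps"]))
# ['duplicate key: AL']
```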

I found an error!

If you find an error in any data, please send a pull request that corrects the mistake and adds a record of the correction to ERRORS.md. Try to describe the nature of the error as precisely as possible.

lookup's People

Contributors

dannguyen, newsroomdev, onyxfish, palewire

lookup's Issues

Dates?

I was just grabbing this month's Canadian house price index data, and of course they decided to encode their dates like this:

Date Index
Jan-2015 167.110
Feb-2015 167.320
Mar-2015 167.830
Apr-2015 168.090
May-2015 169.750
Jun-2015 172.220
Jul-2015 174.530
Aug-2015 176.590
Sep-2015 177.760
Oct-2015 177.960
Nov-2015 178.350
Dec-2015 178.260
Jan-2016 178.010
Feb-2016 179.200

It'd be nice to be able to automatically re-encode these to ISO 8601. Would this be a good application of lookup? There's bound to be some variation in how the months are abbreviated, so I'm not entirely sure. Also, days of the month might not always be in the dataset…
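
For what it's worth, a month-abbreviation lookup is small enough to sketch inline. The Python below maps labels such as Jan-2015 to the ISO 8601 year-month form 2015-01. The names MONTHS and to_iso_month are hypothetical, and matching on the first three letters is an assumption that covers variants like "Sept" or "Sep." but not every possible spelling. Since the data has no day of the month, the output stops at the month.

```python
# Lookup from three-letter English month abbreviation to month number.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def to_iso_month(label):
    """Convert a label like 'Jan-2015' to ISO 8601 'YYYY-MM'."""
    month_part, year_part = label.strip().split("-")
    # Assumption: the first three letters identify the month.
    month = MONTHS[month_part.strip(".").lower()[:3]]
    return f"{int(year_part):04d}-{month:02d}"


print(to_iso_month("Jan-2015"))   # 2015-01
print(to_iso_month("Sept-2015"))  # 2015-09
```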

Should `columns` allow for more columns, and/or more metadata?

The documentation makes it seem as if columns will only allow for a key and value pair. But what if there's a 3-way lookup, e.g. "New York", "NY", "N.Y.", etc.? I'm guessing that's alluded to here:

#3

but is key/colname: datatype enough? Or rather, is the succinctness worth the limitation in expanding the format?

I'm thinking of Census decade-to-decade lookup tables, in which later tracts sometimes combine several past tracts. That complexity would seemingly need to be stated at the columns level of the metadata.

Also, having a "human readable full name" attribute for each column would be nice.

Anyway, I know these aren't easy questions and none of them come without tradeoffs, but thanks for taking charge on this!

Scope of this repository

I find the stated description of this repo "A repository of journalist's lookup tables." quite ambiguous.

What types of open data are the maintainers willing to accept?

Should we have an "overflow" repository for other open data which is beyond the scope of this repository, with a more permissive merging strategy?

Use an existing CSV schema format?

This is pretty awesome, and what I'm suggesting is possibly overkill, but I was wondering if you had considered using one of the CSV schema formats for specifying the fields in the CSV. These seem to be the two biggest ones out there:

I will admit this seems like a bit of overkill for a CSV of states, but it might be useful if you wanted to validate future changes or additions with an automated test, which would give you CI for your CSVs. For instance, Goodtables is a validator that uses the JSON schema format (although it needs some work). CSVLint is another new entrant; I haven't evaluated it, but it also uses the JSON schema format (which seems like the one to consider now).
