reference-data-repository's Introduction

Reference Data Repository

About

The Reference Data Repository provides access to reference datasets (e.g., controlled vocabularies and gazetteers) that are accessible on the Web and useful for data cleaning and data profiling tools such as openclean and Auctus.

Data Hosting

Individual datasets are hosted by their maintainers on different platforms. The only requirement is that the datasets (or individual dataset versions) are accessible via HTTP GET requests. Information about the datasets is maintained in a central index (a JSON file) that is hosted on the Web (see, for example, the openclean reference data collection).
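As a rough illustration of how such an index could be consumed, the sketch below parses an index document and looks up a descriptor by its identifier. It assumes the index is a JSON object with a top-level `datasets` list; that layout, the function names `fetch_index` and `find_dataset`, and the sample index are assumptions made for illustration, not the package's API.

```python
import json
from urllib.request import urlopen


def fetch_index(url):
    """Download the index document with a plain HTTP GET request."""
    with urlopen(url) as response:
        return json.load(response)


def find_dataset(index, dataset_id):
    """Return the descriptor with the given identifier, or None."""
    # Assumption: descriptors live in a top-level "datasets" list.
    for descriptor in index.get("datasets", []):
        if descriptor.get("id") == dataset_id:
            return descriptor
    return None


# A tiny in-memory index, so the lookup can be tried without network access.
sample_index = json.loads("""
{
    "datasets": [
        {"id": "encyclopaedia_britannica:us_cities", "name": "Cities in the U.S."}
    ]
}
""")

descriptor = find_dataset(sample_index, "encyclopaedia_britannica:us_cities")
print(descriptor["name"])
```

With a real index, `fetch_index` would be called with the URL at which the index file is hosted, and the result passed to `find_dataset`.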

Datasets and Data Formats

Each dataset has a unique identifier. Several file formats are supported, e.g., CSV files, JSON, and SQLite database files. Format information for each dataset is stored as part of its entry in the global index.

Datasets are treated as tabular data (that is, as sets of columns). Users may access a single column from a dataset (e.g., country_name), multiple columns (e.g., country_name, capital_city), or the full dataset.
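The column-oriented view can be sketched with the standard library alone. The example below is not the package's API: the sample rows and the helper `select_columns` are hypothetical, and only the column names are taken from the text.

```python
import csv
import io

# Hypothetical sample rows; the column names follow the example in the text.
SAMPLE = "country_name\tcapital_city\nFrance\tParis\nJapan\tTokyo\nFrance\tParis\n"


def select_columns(rows, columns):
    """Project each row (a dict) onto the requested columns, in order."""
    return [tuple(row[col] for col in columns) for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# Single column: the distinct values of country_name.
countries = {row["country_name"] for row in rows}

# Multiple columns: (country_name, capital_city) pairs, duplicates kept.
pairs = select_columns(rows, ["country_name", "capital_city"])

print(sorted(countries))
print(pairs[0])
```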

Below is an example dataset descriptor.

{
    "id": "encyclopaedia_britannica:us_cities",
    "name": "Cities in the U.S.",
    "description": "Names of cities in the U.S. from the Encyclopaedia Britannica.",
    "url": "https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/us_cities.tsv",
    "checksum": "d361873f13b867805628d7db63987392835114f13da9ead0e11ccff2946631d2",
    "webpage": "https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068",
    "schema": [
        {"id": "city", "name": "City", "description": "City Name", "dtype": "text"},
        {"id": "state", "name": "State", "description": "U.S. State Name", "dtype": "text"}
    ],
    "format": {
        "type": "csv",
        "parameters": {
            "delim": "\t"
        }
    }
}

The full schema for the data repository index content is defined in schema.yaml.
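To make the role of the descriptor fields more concrete, here is a minimal sketch that uses a trimmed copy of the descriptor above to verify and parse a downloaded payload. Several points are assumptions: that the checksum is a SHA-256 hex digest (the 64-character value is consistent with that, but the source does not say so), that the file has no header row, and the helper names `checksum` and `read_rows`. The payload is a stand-in, not real repository data.

```python
import csv
import hashlib
import io
import json

# Trimmed copy of the descriptor above; only the fields used here are kept.
# The raw string keeps "\t" as a JSON escape rather than a literal tab.
descriptor = json.loads(r"""
{
    "id": "encyclopaedia_britannica:us_cities",
    "schema": [{"id": "city"}, {"id": "state"}],
    "format": {"type": "csv", "parameters": {"delim": "\t"}}
}
""")


def checksum(data):
    """Hex digest of the payload; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


def read_rows(data, descriptor):
    """Parse a downloaded payload using the descriptor's format parameters."""
    delim = descriptor["format"]["parameters"].get("delim", ",")
    columns = [col["id"] for col in descriptor["schema"]]
    reader = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=delim)
    # Assumption: the file has no header row, so every line is a data row.
    return [dict(zip(columns, row)) for row in reader]


# Stand-in payload; a real file would be fetched from descriptor["url"].
payload = b"Albany\tNew York\nAustin\tTexas\n"
print(checksum(payload))
print(read_rows(payload, descriptor))
```

A downloader would compare the result of `checksum` against the descriptor's `checksum` field before storing the file locally.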

Local Data Repository

Users maintain copies of the datasets for local access. By default, datasets are stored in a subfolder in the user's cache directory.

Getting Started

Install the package using pip:

pip install refdata

This repository contains an example notebook that demonstrates the basic features of the package.

The package also includes a simple command line interface, refdata, that can be used to list the contents of the repository index and to interact with the local data store.

Usage: refdata [OPTIONS] COMMAND [ARGS]...

  Command line interface for the Reference Data Repository.

Options:
  --help  Show this message and exit.

Commands:
  checksum  Print file checksum.
  index     Data Repository Index.
      list      List repository index content.
      show      Show dataset descriptor from repository index.
      validate  Validate repository index file.
  store     Local Data Store.
      download  Download dataset to local store.
      list      List local store content.
      remove    Remove dataset from local store.
      show      Show descriptor for downloaded dataset.

reference-data-repository's Issues

Update installation instructions

The README installation instructions still contain the GitHub repo URL. This should be changed to the PyPI package name refdata.

Improve data location

Describe the bug
refdata always puts data in ~/.refdata; however, this is not a good location on many operating systems.

To Reproduce
Steps to reproduce the behavior:

Use refdata on Linux: data goes in /home/remram/.refdata, but the FreeDesktop specification says it should be /home/remram/.cache/refdata (or somewhere else if $XDG_CACHE_HOME is set)
Use refdata on Windows: data goes in C:\Users\remram\.refdata, but it should be C:\Documents and Settings\<User>\Application Data\Local Settings\refdata
Use refdata on macOS: data goes in /Users/remram/.refdata, but it should be /Users/remram/Library/Application Support/refdata

Expected behavior
refdata should put data in the usual location on each operating system, for example by using the appdirs library.

Additional context
Similar to fatiando/pooch#26
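For the Linux case described above, the lookup can be sketched with the standard library. This is an illustration only, not the package's behavior: the function name `linux_cache_dir` and its parameters are hypothetical, and the other operating systems are deliberately left out because a library such as appdirs would normally handle them.

```python
import os
from pathlib import Path


def linux_cache_dir(app_name, env=None, home=None):
    """Resolve the per-user cache folder following the XDG convention."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME")
    if not base:
        # An unset or empty XDG_CACHE_HOME falls back to ~/.cache.
        base = (Path.home() if home is None else Path(home)) / ".cache"
    return Path(base) / app_name


print(linux_cache_dir("refdata", env={}, home="/home/remram"))
print(linux_cache_dir("refdata", env={"XDG_CACHE_HOME": "/tmp/cache"}))
```

Passing `env` and `home` explicitly keeps the function easy to test; called without them, it reads the real environment and home directory.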

Add code examples to the documentation

There should be some examples for using the package within Python scripts. This could either be part of the README or in a separate examples.rst file.

Support preprocessing for values in distinct sets and mappings

Is your feature request related to a problem? Please describe.
Lookup tables that can be generated from downloaded datasets are often used in a case-insensitive way (e.g., by converting all strings to lower case). Currently, this would have to be done by the user after they receive the set/dict of values.

Describe the solution you'd like
Allow the user to provide a callable for the distinct() and mapping() methods of the DatasetHandle. This callable is evaluated on the dataset values before they are added to the created set/dict. For mappings there should also be an option to provide a tuple of callables, one for the lhs columns and one for the rhs columns.
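The requested behavior could look roughly like the sketch below. It uses standalone functions rather than the real DatasetHandle methods, whose signatures are not shown here; the parameter name `transformer` is hypothetical, and applying a single callable to both sides of a mapping is one possible reading of the request.

```python
def distinct(values, transformer=None):
    """Set of values, each passed through the optional transformer first."""
    if transformer is None:
        return set(values)
    return {transformer(v) for v in values}


def mapping(pairs, transformer=None):
    """Dict built from (lhs, rhs) pairs with optional preprocessing.

    transformer may be one callable (applied to both sides) or a tuple of
    two callables, (lhs_transformer, rhs_transformer).
    """
    if transformer is None:
        lhs_f = rhs_f = lambda v: v
    elif isinstance(transformer, tuple):
        lhs_f, rhs_f = transformer
    else:
        lhs_f = rhs_f = transformer
    return {lhs_f(lhs): rhs_f(rhs) for lhs, rhs in pairs}


cities = [("Albany", "New York"), ("AUSTIN", "Texas")]

# Case-insensitive lookup set of city names.
print(distinct([city for city, _ in cities], transformer=str.lower))

# Separate callables for the lhs and rhs values.
print(mapping(cities, transformer=(str.lower, str.upper)))
```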
