Reference Data Repository

About

The Reference Data Repository provides access to reference data sets (e.g., controlled vocabularies, gazetteers, etc.) that are accessible on the Web and that are useful for data cleaning and data profiling tools like openclean and Auctus.

Data Hosting

Individual datasets are hosted by their maintainers on different platforms. The only requirement is that the datasets (or individual dataset versions) are accessible via HTTP GET requests. Information about the datasets is maintained in a central index (a JSON file) that is hosted on the Web (see, for example, the openclean reference data collection).

Datasets and Data Formats

Each dataset has a unique identifier. Datasets can be stored in different file formats, e.g., CSV files, JSON files, or SQLite database files. Format information for each dataset is stored as part of its entry in the global index.

Datasets are considered tabular (i.e., sets of columns). Users may access a single column from a dataset (e.g., country_name), multiple columns (e.g., country_name, capital_city), or the full dataset.
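The column-level access described above can be illustrated with a minimal sketch using only the Python standard library. This is not the refdata API: the function select_columns, the sample data, and the assumption that the file starts with a header row are all hypothetical and chosen for illustration only.

```python
import csv
import io
from typing import Iterable, List, Optional, Sequence


def select_columns(
    lines: Iterable[str],
    columns: Optional[Sequence[str]] = None,
    delim: str = "\t",
) -> List[tuple]:
    """Read delimited text with a header row and project it onto `columns`.

    All columns are returned when `columns` is None.
    """
    reader = csv.reader(lines, delimiter=delim)
    header = next(reader)
    if columns is None:
        indexes = list(range(len(header)))
    else:
        # list.index raises ValueError if a requested column is unknown.
        indexes = [header.index(name) for name in columns]
    return [tuple(row[i] for i in indexes) for row in reader]


# Hypothetical sample data (not taken from the repository).
SAMPLE = "country_name\tcapital_city\nFrance\tParis\nJapan\tTokyo\n"

single = select_columns(io.StringIO(SAMPLE), ["country_name"])
both = select_columns(io.StringIO(SAMPLE), ["country_name", "capital_city"])
full = select_columns(io.StringIO(SAMPLE))
```

Returning tuples keeps the single-column and multi-column cases uniform; a caller interested in one column can unpack the one-element tuples.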

Below is an example dataset descriptor.

{
    "id": "encyclopaedia_britannica:us_cities",
    "name": "Cities in the U.S.",
    "description": "Names of cities in the U.S. from the Encyclopaedia Britannica.",
    "url": "https://raw.githubusercontent.com/VIDA-NYU/openclean-reference-data/master/data/us_cities.tsv",
    "checksum": "d361873f13b867805628d7db63987392835114f13da9ead0e11ccff2946631d2",
    "webpage": "https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068",
    "schema": [
        {"id": "city", "name": "City", "description": "City Name", "dtype": "text"},
        {"id": "state", "name": "State", "description": "U.S. State Name", "dtype": "text"}
    ],
    "format": {
        "type": "csv",
        "parameters": {
            "delim": "\t"
        }
    }
}

The full schema for the data repository index content is defined in schema.yaml.
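As a rough sketch of how a client might use such a descriptor, the following code verifies a downloaded file against the checksum and parses it using the format parameters. It rests on several assumptions that the text above does not confirm: that the checksum is a SHA-256 hex digest (its length of 64 hex characters is consistent with this), that the data file has no header row, and that the schema order matches the column order in the file. The helper names and the sample bytes are hypothetical.

```python
import csv
import hashlib
import io
import json
from typing import Dict, List, Optional

# Abbreviated copy of the descriptor with only the fields used below.
DESCRIPTOR = json.loads(r"""
{
    "id": "encyclopaedia_britannica:us_cities",
    "schema": [{"id": "city"}, {"id": "state"}],
    "format": {"type": "csv", "parameters": {"delim": "\t"}}
}
""")


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest (assumed checksum algorithm)."""
    return hashlib.sha256(data).hexdigest()


def read_rows(
    data: bytes,
    descriptor: dict,
    expected_checksum: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Verify the checksum (if given) and parse the rows of a csv dataset."""
    if expected_checksum is not None and sha256_hex(data) != expected_checksum:
        raise ValueError("checksum mismatch")
    fmt = descriptor["format"]
    if fmt["type"] != "csv":
        raise ValueError("unsupported format: " + fmt["type"])
    delim = fmt.get("parameters", {}).get("delim", ",")
    reader = csv.reader(io.StringIO(data.decode("utf-8")), delimiter=delim)
    # Assumption: no header row; schema order equals file column order.
    columns = [col["id"] for col in descriptor["schema"]]
    return [dict(zip(columns, row)) for row in reader]


# Hypothetical file content standing in for a downloaded dataset.
SAMPLE = b"Albany\tNew York\nAustin\tTexas\n"
rows = read_rows(SAMPLE, DESCRIPTOR, expected_checksum=sha256_hex(SAMPLE))
```

Checking the digest before parsing means a corrupted or outdated download is rejected before any values reach a data cleaning pipeline.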

Local Data Repository

Users maintain copies of the datasets for local access. By default, datasets are stored in a subfolder in the user's cache directory.

Getting Started

Install the package using pip:

pip install refdata

This repository contains an example notebook that demonstrates the basic features of the package.

The package also includes a simple command line interface refdata that can be used to list contents of the repository index and to interact with the local data store.

Usage: refdata [OPTIONS] COMMAND [ARGS]...

  Command line interface for the Reference Data Repository.

Options:
  --help  Show this message and exit.

Commands:
  checksum  Print file checksum.
  index     Data Repository Index.
      list      List repository index content.
      show      Show dataset descriptor from repository index.
      validate  Validate repository index file.
  store     Local Data Store.
      download  Download dataset to local store.
      list      List local store content.
      remove    Remove dataset from local store.
      show      Show descriptor for downloaded dataset.

reference-data-repository's People

Contributors

heikomuller, maqzi, remram44

reference-data-repository's Issues

Update installation instructions

The README installation instructions still contain the GitHub repo URL. This should be changed to the PyPI package name refdata.

Improve data location

Describe the bug
refdata always puts data in ~/.refdata, but this is not a good location on many operating systems.

To Reproduce
Steps to reproduce the behavior:

  1. Use refdata on Linux: data goes in /home/remram/.refdata, but the FreeDesktop specification says it should be /home/remram/.cache/refdata (or somewhere else if $XDG_CACHE_HOME is set)
  2. Use refdata on Windows: data goes in C:\Users\remram\.refdata, but it should be C:\Documents and Settings\<User>\Application Data\Local Settings\refdata
  3. Use refdata on macOS: data goes in /Users/remram/.refdata, but it should be /Users/remram/Library/Application Support/refdata

Expected behavior
refdata should put data in the usual location on each operating system, for example by using the appdirs library.

Additional context
Similar to fatiando/pooch#26
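To make the expected behavior concrete, here is a standard-library sketch of a per-platform lookup. It is not what refdata or appdirs actually implements. The function default_data_dir is hypothetical, the Linux and macOS branches follow the locations named in this issue, and the Windows branch assumes the LOCALAPPDATA environment variable rather than the legacy path quoted above.

```python
import ntpath
import os
import posixpath
import sys
from typing import Mapping


def default_data_dir(
    platform: str,
    env: Mapping[str, str],
    home: str,
    app: str = "refdata",
) -> str:
    """Return a conventional per-user data directory for `app`.

    `platform` takes values as reported by sys.platform.
    """
    if platform.startswith("win"):
        # Assumption: prefer %LOCALAPPDATA%, fall back to its usual location.
        base = env.get("LOCALAPPDATA") or ntpath.join(home, "AppData", "Local")
        return ntpath.join(base, app)
    if platform == "darwin":
        return posixpath.join(home, "Library", "Application Support", app)
    # Linux and other Unix systems: honor $XDG_CACHE_HOME if it is set.
    base = env.get("XDG_CACHE_HOME") or posixpath.join(home, ".cache")
    return posixpath.join(base, app)


# Resolve the directory for the current machine.
current_dir = default_data_dir(sys.platform, os.environ, os.path.expanduser("~"))
```

Passing the platform, environment, and home directory as arguments keeps the function pure, so each operating system's rule can be checked on any machine.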

Add code examples to the documentation

There should be some examples for using the package within Python scripts. This could either be part of the README or in a separate examples.rst file.

Support preprocessing for values in distinct sets and mappings

Is your feature request related to a problem? Please describe.
Lookup tables that can be generated from downloaded datasets are often used in a case-insensitive way (e.g., by converting all strings to lower case). Currently, this would have to be done by the user after they receive the set/dict of values.

Describe the solution you'd like
Allow the user to provide a callable for the distinct() and mapping() methods of the DatasetHandle. This callable is evaluated on the dataset values before they are added to the created set/dict. For mappings there should also be an option to provide a tuple of callables, one for the lhs columns and one for the rhs columns.
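The requested behavior could look roughly like the sketch below. These are standalone, hypothetical functions operating on a list of row dictionaries, not the actual distinct() and mapping() methods of the DatasetHandle. The parameter name transformer and the convention that single-column keys are returned as scalars are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple, Union

Transformer = Callable[[str], str]
Row = Dict[str, str]


def _key(row: Row, columns: Sequence[str], func: Optional[Transformer]):
    """Build a scalar (one column) or tuple (several columns) from a row."""
    values = [func(row[c]) if func else row[c] for c in columns]
    return values[0] if len(values) == 1 else tuple(values)


def distinct(
    rows: Iterable[Row],
    columns: Sequence[str],
    transformer: Optional[Transformer] = None,
) -> set:
    """Distinct values, with the transformer applied before insertion."""
    return {_key(row, columns, transformer) for row in rows}


def mapping(
    rows: Iterable[Row],
    lhs: Sequence[str],
    rhs: Sequence[str],
    transformer: Union[None, Transformer, Tuple[Transformer, Transformer]] = None,
) -> dict:
    """Map lhs values to rhs values.

    A single callable applies to both sides; a tuple supplies one callable
    for the lhs columns and one for the rhs columns.
    """
    if isinstance(transformer, tuple):
        lhs_func, rhs_func = transformer
    else:
        lhs_func = rhs_func = transformer
    return {_key(r, lhs, lhs_func): _key(r, rhs, rhs_func) for r in rows}


# Hypothetical rows standing in for a downloaded dataset.
ROWS = [
    {"city": "Albany", "state": "New York"},
    {"city": "ALBANY", "state": "New York"},
    {"city": "Austin", "state": "Texas"},
]

lower_cities = distinct(ROWS, ["city"], transformer=str.lower)
city_to_state = mapping(ROWS, ["city"], ["state"], transformer=(str.lower, str.upper))
```

Applying the callable before values enter the set or dict means that entries differing only in case collapse into one, which is the case-insensitive lookup behavior the request describes.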
