
data's Introduction

A collection of public data sets for testing out visualization methods. These data sets are at various stages of preparation: some are raw data, some are CSV files, and some are exposed as AMD modules. This collection is messy, but with some digging you may find hidden gems.

Interesting Datasets

The most recently added data sets are at the top.

Leads

These are "leads" to find interesting datasets. They have teasers of cool data, but it will take some work to find the data behind them.

data's People

Contributors

curran, hypercubed, malcolm-decuire

data's Issues

Really cool! What is the LICENSE?

This is a really cool collection @curran. What is the LICENSE? I understand that the datasets themselves might be licensed differently, but what is this repository licensed under?

vegaExamples flights

The flight JSON files in the vegaExamples directory are, I believe, subsets of the Crossfilter data here: https://github.com/square/crossfilter/tree/gh-pages/ which is 230k rows. This is itself a subset of the ASA Data Expo dataset.

The ASA dataset is very big, but you might consider adding your own large subset from the original source.

As for the vegaExamples JSON files, they are poorly formatted. Would you consider a PR that pretty-prints the JSON files?
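
A minimal sketch of what such a PR would do, assuming Node.js; the file name here is hypothetical:

// prettyPrint.js: re-serialize a JSON file with two-space indentation.
const fs = require("fs");
const file = "flights-2k.json"; // hypothetical file name in the vegaExamples directory
const json = JSON.parse(fs.readFileSync(file, "utf8"));
fs.writeFileSync(file, JSON.stringify(json, null, 2));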

Establish Order

Currently this collection of data sets is very chaotic. However, it has reached a critical mass of individual examples, so it might now be possible to distill their commonalities and form a coherent general structure.

There are two main things to deal with: directory structure and file structure.

Directory Structure

Ideally, a program should be able to traverse the directory structure to build an index of all the data sets (a minimal sketch of such a program follows the list below). To make this straightforward, one solution might be to introduce a fixed hierarchical structure like this:

{{dataSource}}/{{dataCollection}}/{{dataSet}}

where

  • dataSource represents the organization that originally published the data.
  • dataCollection represents a category of data sets published by that source.
  • dataSet represents an individual data table.
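
A minimal sketch of a program that builds such an index, assuming this layout and using only the Node.js standard library:

// buildIndex.js: walk the {{dataSource}}/{{dataCollection}}/{{dataSet}} hierarchy
// and print an index of all the data sets found.
const fs = require("fs");
const path = require("path");

const root = ".";
const index = [];

for (const dataSource of fs.readdirSync(root)) {
  const sourceDir = path.join(root, dataSource);
  // Skip hidden directories (e.g. .git) and plain files at the top level.
  if (dataSource.startsWith(".") || !fs.statSync(sourceDir).isDirectory()) continue;
  for (const dataCollection of fs.readdirSync(sourceDir)) {
    const collectionDir = path.join(sourceDir, dataCollection);
    if (!fs.statSync(collectionDir).isDirectory()) continue;
    for (const dataSet of fs.readdirSync(collectionDir)) {
      index.push({ dataSource, dataCollection, dataSet });
    }
  }
}

console.log(JSON.stringify(index, null, 2));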

File Structure

The dsv-dataset project provides a metadata specification for annotating data sets with column types so they can be automatically parsed. The file structure should leverage dsv-dataset.

Perhaps each data set could have two files, one with the CSV data, and one with the metadata, like this:

{{dataSource}}/{{dataCollection}}/{{dataSet}}/data.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/metadata.json

The disadvantage of this is that folks who want to move the .csv file into a different context will need to spend time thinking about what to name it, or just leave it as data.csv, which is rather generic.

Alternatively, the files could take on the name of the data set, like this:

{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.json

As yet another alternative, data sets and their metadata could be combined into a single JSON file, whose contents might look something like this:

{
  "dsvString":"sepal_length,sepal_width,petal_length,petal_width,class\n5.1,3.5,1.4,0.2,setosa\n4.9,3.0,1.4,0.2,setosa\n4.7,3.2,1.3,0.2,setosa\n5.7,2.8,4.1,1.3,versicolor\n6.3,3.3,6.0,2.5,virginica\n5.8,2.7,5.1,1.9,virginica\n7.1,3.0,5.9,2.1,virginica",
  "metadata":{
    "delimiter": ",",
    "columns": [
      { "name": "sepal_length", "type": "number" },
      { "name": "sepal_width",  "type": "number" },
      { "name": "petal_length", "type": "number" },
      { "name": "petal_width",  "type": "number" },
      { "name": "class",        "type": "string" }
    ]
  }
}

This could live in a single file:

{{dataSource}}/{{dataCollection}}/{{dataSet}}/{{dataSet}}.json

The disadvantage of this approach is that anyone who wants to just get the CSV data out will need to write some code, rather than just copy an existing CSV file.
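
For example, a minimal Node sketch of that extraction, assuming the combined format above (the file name is hypothetical):

// extractCSV.js: pull the embedded dsvString back out into a standalone CSV file.
const fs = require("fs");
const dataSet = JSON.parse(fs.readFileSync("irisFlowers.json", "utf8"));
fs.writeFileSync("irisFlowers.csv", dataSet.dsvString);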

Alternatively, the data sets could live in the data collection directory, like this:

{{dataSource}}/{{dataCollection}}/{{dataSet}}.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}.json

This would make the data collection directories somewhat messy. Also, having a README.md at each level might be a good thing, which would make it favorable for each data set to reside in its own directory, like this:

{{dataSource}}/README.md
{{dataSource}}/{{dataCollection}}/README.md
{{dataSource}}/{{dataCollection}}/{{dataSet}}/README.md

Also, it might be nice to have an index at each level, so programs can query for what is there. These files could simply contain arrays of strings. This would make the full file layout look something like this:

{{dataSource}}/README.md
{{dataSource}}/dataCollections.json
{{dataSource}}/{{dataCollection}}/README.md
{{dataSource}}/{{dataCollection}}/dataSets.json
{{dataSource}}/{{dataCollection}}/{{dataSet}}/README.md
{{dataSource}}/{{dataCollection}}/{{dataSet}}/data.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/metadata.json
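
For example, a dataSets.json index might simply contain an array of data set names (the names here are hypothetical):

["irisFlowers", "autoMPG", "adultIncome"]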

Trees

What about cases where a data cube is partitioned across files, where each file contains a portion of the fact table where a certain dimension equals a certain value? For example, a data set may be partitioned across many files, one file per year. Or the partitioning could use one file per geographic region. Perhaps the directory tree can be built in such a way that it is possible to have a directory full of CSV files within a given data set, and all of them can share the same metadata file.
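
For example, a data set partitioned by year could keep one CSV file per partition alongside a single shared metadata file (the years here are hypothetical):

{{dataSource}}/{{dataCollection}}/{{dataSet}}/metadata.json
{{dataSource}}/{{dataCollection}}/{{dataSet}}/2013.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/2014.csv
{{dataSource}}/{{dataCollection}}/{{dataSet}}/2015.csv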
