altair-viz / vega_datasets

A Python package for online & offline access to vega datasets

License: MIT License

Python 97.99% Makefile 2.01%

vega_datasets's Issues

HTTP Error 404: Not Found

Hello,
vega_datasets is my go-to source for quick data to try things out. A couple of datasets I use often, including but not limited to birdstrikes and climate, are returning HTTP Error 404: Not Found. Any suggestions?

World airports dataset

Nice work on vega_datasets and altair! 😃

It would be great for the entire world airports dataset to be included in vega_datasets, not just a subset for those in the USA. It would make for many more interesting visualization possibilities, starting with this:

alt.Chart(world_airports[:5000]).mark_bar().encode(
    x='count()',
    y='Country:N'
)

and then filtering by country, timezone etc.

This dataset is around 8000 rows, which would also serve the useful purpose of demonstrating how to handle datasets longer than Altair's default limit of 5000 rows. This limit is likely the first hurdle most people using Altair for real datasets will have to surmount. (I'd also gladly volunteer to help to make handling of large datasets more seamless in Altair...)

Example URLs:

Add more local datasets

We can add local datasets if

the dataset license is compatible with the package's MIT license (this is often tough to figure out, because the provenance of many available datasets is unclear)
the dataset is small enough that it won't significantly affect the package size

Adding a dataset to the package is easy:

add the name to the list at https://github.com/jakevdp/vega_datasets/blob/master/tools/download_datasets.py#L17
add dataset description & references (including license if available) to https://github.com/jakevdp/vega_datasets/blob/master/vega_datasets/dataset_info.json
run python tools/download_datasets.py
commit the downloaded datasets & modified descriptions & open a Pull Request
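
Before opening the Pull Request, it may help to confirm that every listed dataset also has a description entry. The following is only a sketch: it assumes dataset_info.json maps each dataset name to an object with a "description" field, and the helper name missing_descriptions is hypothetical rather than part of the package's tooling.

```python
import json


def missing_descriptions(dataset_names, info_json_text):
    """Return the dataset names that lack a non-empty description.

    Assumes (hypothetically) that the JSON maps name -> {"description": ...}.
    """
    info = json.loads(info_json_text)
    missing = []
    for name in dataset_names:
        entry = info.get(name, {})
        if not entry.get("description"):
            missing.append(name)
    return missing


# Illustrative input only, not the real contents of dataset_info.json.
sample_info = json.dumps({
    "iris": {"description": "Iris flower measurements.", "references": []},
    "cars": {"description": ""},
})

print(missing_descriptions(["iris", "cars", "airports"], sample_info))
```

Here "cars" is reported because its description is empty and "airports" because it has no entry at all.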

Fix CI

Things are broken after moving the repo to altair-viz.

`zipcodes()` returns a dataframe with incorrect dtype.

from vega_datasets import data
zipcodes = data.zipcodes()
print(zipcodes.zip_code.dtype)

Expected: dtype('O') or rather CategoricalDtype(categories=['00501', '00544', ....

Actual: dtype('int64')

Some ZIP codes start with "0", and zipcodes = data.zipcodes() strips all leading zeros. The following works, but I think it would be better to return the correct dtypes by default.

zipcodes = data.zipcodes(dtype={'zip_code': 'category'})

I also found that data.unemployment() cannot parse its data correctly by default. The separator has to be specified explicitly: data.unemployment(sep='\t').

Transition to v2

The vega/vega-datasets repository recently released a major update. A few changes will need to be made to catch up. I just wanted to file this issue to get the ball rolling.

I can work on it and submit a PR when I get a bit of free time :)

Consider using CDN for the base url

In vega/vega-datasets they recommend using a CDN with a fixed version to access the URLs for a dataset such as

https://cdn.jsdelivr.net/npm/vega-datasets@<version>/data/cars.json

instead of

https://vega.github.io/vega-datasets/data/cars.json

I was wondering if we should make this change as well?

I think modifications would take place here:

vega_datasets/vega_datasets/core.py

Line 94 in 70d6829

base_url = "https://vega.github.io/vega-datasets/data/"

but it would be cool if it could grab the correct version number from

vega_datasets/vega_datasets/__init__.py

Line 9 in 70d6829

SOURCE_TAG = "v1.29.0"

Anyway, this is just an idea for making things more stable going forward.
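
A minimal sketch of how the base URL could be derived from the source tag. It assumes that the npm version of vega-datasets equals the tag without its leading "v" and that jsDelivr serves the files under /data/; the helper name cdn_base_url is hypothetical and not part of the package.

```python
def cdn_base_url(source_tag, package="vega-datasets"):
    """Build a version-pinned jsDelivr base URL from a tag such as 'v1.29.0'."""
    # Assumption: the npm version is the git tag minus its leading "v".
    version = source_tag[1:] if source_tag.startswith("v") else source_tag
    return "https://cdn.jsdelivr.net/npm/{0}@{1}/data/".format(package, version)


SOURCE_TAG = "v1.29.0"  # value quoted from vega_datasets/__init__.py in this issue
base_url = cdn_base_url(SOURCE_TAG)
print(base_url + "cars.json")
```

Deriving the URL from SOURCE_TAG would keep the pinned CDN version and the bundled local files in step with a single constant.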

Tests failing with pandas 0.25.0

I am getting the following test failures with pandas 0.25.0 that didn't occur with earlier versions of pandas:

=================================== FAILURES ===================================
____________________________ test_iris_column_names ____________________________

    def test_iris_column_names():
        iris = data.iris()
        assert type(iris) is pd.DataFrame
>       assert tuple(iris.columns) == ('petalLength', 'petalWidth', 'sepalLength',
                                       'sepalWidth', 'species')
E       AssertionError: assert ('sepalLength...h', 'species') == ('petalLength'...h', 'species')
E         At index 0 diff: 'sepalLength' != 'petalLength'
E         Use -v to get the full diff

vega_datasets/tests/test_local_datasets.py:32: AssertionError
____________________________ test_cars_column_names ____________________________

    def test_cars_column_names():
        cars = data.cars()
        assert type(cars) is pd.DataFrame
>       assert tuple(cars.columns) == ('Acceleration', 'Cylinders', 'Displacement',
                                       'Horsepower', 'Miles_per_Gallon', 'Name',
                                       'Origin', 'Weight_in_lbs', 'Year')
E       AssertionError: assert ('Name', 'Mil..._in_lbs', ...) == ('Acceleration..., 'Name', ...)
E         At index 0 diff: 'Name' != 'Acceleration'
E         Use -v to get the full diff

vega_datasets/tests/test_local_datasets.py:51: AssertionError
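
Both failures report the same column names in a different order, so one possible fix is to make the assertions order-insensitive. The sketch below assumes the tests only care which columns exist, not where they sit; the helper same_columns and the reordered tuple are hypothetical, since the traceback truncates the actual order.

```python
def same_columns(actual, expected):
    """True when both sequences hold the same names, ignoring order."""
    return sorted(actual) == sorted(expected)


# Expected order as written in test_iris_column_names.
expected = ("petalLength", "petalWidth", "sepalLength", "sepalWidth", "species")
# Hypothetical reordering: the failure only shows that 'sepalLength' comes first.
reordered = ("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species")

print(tuple(reordered) == expected)       # strict comparison, as in the failing test
print(same_columns(reordered, expected))  # order-insensitive comparison
```

If column order is meant to be part of the contract, the alternative would be to update the expected tuples rather than relax the check.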

ZIP Codes should be treated as strings

When the zipcodes data is loaded, it returns a dataframe with integer values for the zip_code column.

These values should be five-character strings. Is there a reasonable way to make this package control for that?

In the meantime, this is a simple workaround:
df['zip_code'] = df['zip_code'].apply(lambda x: str(x).zfill(5))
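
The same padding logic, written as a standalone function so it can be tried without loading the dataset. This is a sketch; the name normalize_zip and the fixed width of 5 are assumptions for illustration.

```python
def normalize_zip(value, width=5):
    """Return a ZIP code as a zero-padded string of the given width."""
    return str(value).strip().zfill(width)


print(normalize_zip(501))      # integer that lost its leading zeros
print(normalize_zip("00544"))  # already a padded string, left unchanged
```

On a pandas column, df['zip_code'].astype(str).str.zfill(5) should give the same result without a Python-level lambda.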

Add sp500-2000.csv

Recently there was a new dataset added to vega/vega-datasets

sp500-2000.csv - S&P 500 index values from 2000 to 2020, retrieved from Yahoo Finance.

Making a note here to add this dataset once #39 is done.

Some datasets are images and throw a ValueError

Some datasets like ffox are images (.png) and thus throw a ValueError

To reproduce in vega-datasets-0.9.0 (current version on pip):

from vega_datasets import data
data.ffox()

Raises:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\redacted\Repositories\altair-demos\basic_chart.py in <module>
----> 1 data.ffox()

~\AppData\Local\Programs\Python\Python39\lib\site-packages\vega_datasets\core.py in __call__(self, use_local, **kwargs)
    244             return pd.read_csv(datasource, **kwds)
    245         else:
--> 246             raise ValueError(
    247                 "Unrecognized file format: {0}. "
    248                 "Valid options are ['json', 'csv', 'tsv']."

ValueError: Unrecognized file format: png. Valid options are ['json', 'csv', 'tsv'].
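
One way the loader could treat such files is to classify them by extension before trying to parse them. The sketch below is hypothetical and not the package's implementation: the function classify_dataset and the IMAGE_FORMATS set are assumptions, while the tabular formats are the ones named in the error message.

```python
import os

TABULAR_FORMATS = {"json", "csv", "tsv"}  # options listed in the error above
IMAGE_FORMATS = {"png"}                   # assumption: only png is mentioned here


def classify_dataset(filename):
    """Return 'tabular' or 'image' based on the file extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext in TABULAR_FORMATS:
        return "tabular"
    if ext in IMAGE_FORMATS:
        return "image"
    raise ValueError("Unrecognized file format: {0}".format(ext))


print(classify_dataset("cars.json"))
print(classify_dataset("ffox.png"))
```

With a check like this, an image dataset could expose only its URL or raw bytes instead of failing inside the DataFrame code path.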

Add separate JSON file with dataset license/source information

Also, automatically parse this file and inject the information into the docstrings.

Setting indices makes data not work with vega/vega-lite

Seattle Weather Data is inconsistent with Vega data

The data in the altair repository is inconsistent with the vega data:

https://github.com/altair-viz/vega_datasets/blob/master/vega_datasets/_data/seattle-weather.csv

https://github.com/vega/vega/blob/main/docs/data/seattle-weather.csv

The vega data makes more sense, since it seems that the altair version of the dataset barely has any rain for 2015. Check 2015-01-02 for an example of an inconsistent field, but there are MANY.
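
To list every inconsistent date rather than spot-checking, the two files could be compared row by row. A minimal sketch using only the standard library: it assumes both files share a "date" column, and the sample rows are made up for illustration, not taken from either file.

```python
import csv
import io


def differing_rows(csv_text_a, csv_text_b, key="date"):
    """Return the keys present in both CSV texts whose rows differ."""
    rows_a = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text_a))}
    rows_b = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text_b))}
    shared = sorted(set(rows_a) & set(rows_b))
    return [k for k in shared if rows_a[k] != rows_b[k]]


# Made-up sample values, only to show the mechanics.
altair_copy = "date,precipitation\n2015-01-01,0.0\n2015-01-02,0.0\n"
vega_copy = "date,precipitation\n2015-01-01,0.0\n2015-01-02,1.5\n"

print(differing_rows(altair_copy, vega_copy))
```

For the real files, the two strings would be replaced by the contents of each seattle-weather.csv.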

Trouble reading a few datasets

I got some errors when trying to read the 'miserables', 'us-10m', and 'world-110m' datasets.

For 'miserables' it read:
ValueError: arrays must all be same length

and for 'us-10m' and 'world-110m' it read:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
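
These errors are consistent with files that are not a single flat table, for example a JSON object whose top-level arrays have different lengths. The sketch below assumes a layout with "nodes" and "links" arrays; the sample content is invented, and table_lengths is a hypothetical helper.

```python
import json

# Invented sample with the assumed nodes/links layout.
sample = json.loads("""
{"nodes": [{"name": "A", "group": 1},
           {"name": "B", "group": 1},
           {"name": "C", "group": 2}],
 "links": [{"source": 0, "target": 1, "value": 3}]}
""")


def table_lengths(obj):
    """Map each top-level key holding a list to the length of that list."""
    return {key: len(value) for key, value in obj.items() if isinstance(value, list)}


print(table_lengths(sample))
```

When the lengths differ, each list can be passed to pd.DataFrame on its own rather than converting the whole object at once.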

problem: import vega_datasets.data as ...

Please forgive me as I am a Python "newbie" and may be asking an ignorant question.

I would like to be able to use the form:

import vega_datasets.data as data

instead of

from vega_datasets import data

My motivation is that I can use something analogous to the first form when using the import() function in the R reticulate package.

If I try this (in Python):

from vega_datasets import data
dir(data)

I get (as I expect):

['7zip',
 'airports',
 'anscombe',
 'barley',
 ...
 'zipcodes']

However, if I try this:

import vega_datasets.data as data2
dir(data2)

I get:

['__doc__', '__loader__', '__name__', '__package__', '__path__', '__spec__']

whereas I am hoping to replicate the first behavior.

By contrast, this works as I expect:

import scipy.stats as stats
dir(stats)

Question: could it be possible for import vega_datasets.data as data to work like from vega_datasets import data?

Thanks!
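
A possible workaround, sketched under the assumption that data is an attribute of the vega_datasets package rather than an importable submodule: import the package itself and read the attribute from it. The helper load_attribute is hypothetical, and the demonstration uses the standard library so that it runs without vega_datasets installed.

```python
import importlib


def load_attribute(package_name, attribute):
    """Import a package and return one of its attributes.

    Roughly mirrors `from package_name import attribute`.
    """
    package = importlib.import_module(package_name)
    return getattr(package, attribute)


# For this issue the call would be load_attribute("vega_datasets", "data").
dumps = load_attribute("json", "dumps")
print(dumps({"ok": True}))
```

The same idea should carry over to other callers: import the top-level package first, then access data on the returned object.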

Two Weather Datasets

It appears that there are two datasets called weather in the vega/vega-datasets repo: weather.json and weather.csv.

Currently, altair-viz/vega_datasets only includes weather.json.

To add weather.csv, do I just add an entry to vega_datasets/datasets.json?

Also any thoughts on what to name the two?