epistasislab / pmlb


PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.

Home Page: https://epistasislab.github.io/pmlb/

License: MIT License


pmlb's Introduction

Penn Machine Learning Benchmarks

This repository contains the code and data for a large, curated set of benchmark datasets for evaluating and comparing supervised machine learning algorithms. These data sets cover a broad range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features.

Please go to our home page to interactively browse the datasets, vignette, and contribution guide!

Breaking changes in PMLB 1.0

This repository has been restructured, and several dataset names have been changed!

If you have an older version of PMLB, we highly recommend you upgrade it to v1.0 for updated URLs and names of datasets:

pip install pmlb --upgrade

Datasets

Datasets are tracked with Git Large File Storage (LFS). If you would like to clone the entire repository, please install and set up Git LFS for your user account. Alternatively, you can download the .zip file from GitHub.

All data sets are stored in a common format:

  • First row is the column names
  • Each following row corresponds to one row of the data
  • The target column is named target
  • All columns are tab (\t) separated
  • All files are compressed with gzip to conserve space
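
Since every file follows this format, a dataset can also be read directly with pandas without the wrapper; a minimal sketch (the local path below is only an example):

import pandas as pd

# Files are gzipped, tab-separated, and contain a 'target' column.
df = pd.read_csv('datasets/mushroom/mushroom.tsv.gz', sep='\t', compression='gzip')
X = df.drop(columns='target')  # feature columns
y = df['target']               # labels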

[Figure: dataset sizes]

The complete table of dataset characteristics is also available for download. Please note, in our documentation, a feature is considered:

  • "binary" if it is of type integer and has 2 unique values (equivalent to pandas profiling's "boolean")
  • "categorical" if it is of type integer and has more than 2 unique values (equivalent to pandas profiling's "categorical")
  • "continuous" if it is of type float (equivalent to pandas profiling's "numeric").

Python wrapper

For easy access to the benchmark data sets, we have provided a Python wrapper named pmlb. The wrapper can be installed via pip:

pip install pmlb

and used in Python scripts as follows:

from pmlb import fetch_data

# Returns a pandas DataFrame
adult_data = fetch_data('adult')
print(adult_data.describe())

The fetch_data function has two additional parameters:

  • return_X_y (True/False): Whether to return the data in scikit-learn format, with the features and labels stored in separate NumPy arrays.
  • local_cache_dir (string): The directory on your local machine to store the data files so you don't have to fetch them over the web again. By default, the wrapper does not use a local cache directory.

For example:

from pmlb import fetch_data

# Returns NumPy arrays
adult_X, adult_y = fetch_data('adult', return_X_y=True, local_cache_dir='./')
print(adult_X)
print(adult_y)

You can also list all of the available data sets as follows:

from pmlb import dataset_names

print(dataset_names)

Or if you only want a list of available classification or regression datasets:

from pmlb import classification_dataset_names, regression_dataset_names

print(classification_dataset_names)
print('')
print(regression_dataset_names)

Example usage: Compare two classification algorithms with PMLB

PMLB is designed to make it easy to benchmark machine learning algorithms against each other. Below is a Python code snippet showing the most basic way to use PMLB to compare two algorithms.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sb

from pmlb import fetch_data, classification_dataset_names

logit_test_scores = []
gnb_test_scores = []

for classification_dataset in classification_dataset_names:
    X, y = fetch_data(classification_dataset, return_X_y=True)
    train_X, test_X, train_y, test_y = train_test_split(X, y)

    logit = LogisticRegression()
    gnb = GaussianNB()

    logit.fit(train_X, train_y)
    gnb.fit(train_X, train_y)

    logit_test_scores.append(logit.score(test_X, test_y))
    gnb_test_scores.append(gnb.score(test_X, test_y))

sb.boxplot(data=[logit_test_scores, gnb_test_scores], notch=True)
plt.xticks([0, 1], ['LogisticRegression', 'GaussianNB'])
plt.ylabel('Test Accuracy')

Contributing

See our Contributing Guide. We're looking for help with documentation, and also appreciate new dataset and functionality contributions.

Citing PMLB

If you use PMLB in a scientific publication, please consider citing one of the following papers:

Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058 (2020).

@article{romano2021pmlb,
  title={PMLB v1.0: an open source dataset collection for benchmarking machine learning methods},
  author={Romano, Joseph D and Le, Trang T and La Cava, William and Gregg, John T and Goldberg, Daniel J and Chakraborty, Praneel and Ray, Natasha L and Himmelstein, Daniel and Fu, Weixuan and Moore, Jason H},
  journal={arXiv preprint arXiv:2012.00058v2},
  year={2021}
}

Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36.

BibTeX entry:

@article{Olson2017PMLB,
    author="Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.",
    title="PMLB: a large benchmark suite for machine learning evaluation and comparison",
    journal="BioData Mining",
    year="2017",
    month="Dec",
    day="11",
    volume="10",
    number="1",
    pages="36",
    issn="1756-0381",
    doi="10.1186/s13040-017-0154-4",
    url="https://doi.org/10.1186/s13040-017-0154-4"
}

Support for PMLB

PMLB was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grants AI117694, LM010098, and LM012601. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

pmlb's People

Contributors

alexzwanenburg, daniel0710goldberg, github-actions[bot], gkronber, greggj2016, janopig, jdromano2, lacava, natray21, praneelc, ramhiser, rhiever, trang1618, trangdata, weixuanfu


pmlb's Issues

Add regression benchmark

@weixuanfu2016 is working on adding regression datasets to PMLB. Please report on your progress here, @weixuanfu2016.

car and car_evaluation seem to be identical

car and car_evaluation datasets appear to be the same.

  • summary statistics in pd_profiling are the same, except for the number of features
  • I've gone through the feature list: the n possible values of each feature in car have simply been transformed into n binary features in car_evaluation, each indicating whether that value holds for a given data point. So the features are effectively the same as well, just binarized going from car to car_evaluation.

Define metadata.yaml schema

Here's what I'm thinking for the metadata.yaml schema. We can set up CI to validate this schema (potentially with jsonschema?)

Then a README of summary statistics/csv files can be automatically generated (which will allow for easy querying such as this).

hashid: # required, hash id of the dataset
dataset: # required, dataset name
description: # required, dataset description
source: # required, link to the source from where dataset was retrieved
publication: # optional, study that generated the dataset
task: # required, classification or regression
columns: 
    [column_name]: # can be 'target'
        type:  # required, either continuous, nominal or ordinal
        description: # required, what the column measures/indicates, unit
        code: # optional, coding information, e.g., 'Control' = 0, 'Case' = 1
        transform: # optional, any transformation performed on the column, e.g., log scaled
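
Once every dataset has such a file, the summary table could be assembled automatically in a CI step; a rough sketch (the repository layout and column choices here are assumptions):

import glob
import yaml        # PyYAML
import pandas as pd

rows = []
for path in glob.glob('datasets/*/metadata.yaml'):   # assumed layout: one folder per dataset
    with open(path) as f:
        meta = yaml.safe_load(f)
    rows.append({'dataset': meta['dataset'],
                 'task': meta['task'],
                 'n_columns': len(meta.get('columns', {}))})

pd.DataFrame(rows).to_csv('summary_stats.csv', index=False)  # queryable summary table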

fetch nearest datasets

add the ability for a user to fetch the nearest dataset names, where the neighborhood is defined in summary stats space

def fetch_nearest_dataset_names(X,y, n=1):
    return list of closest datasets based on summary stats
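
A sketch of how this could work (the summary-stats table layout below is an assumption): compute a few statistics for the user's X, y and rank the existing datasets by distance in that space.

import numpy as np
import pandas as pd

def fetch_nearest_dataset_names(X, y, summary_df, n=1):
    """Return the n dataset names whose summary stats are closest to (X, y).

    summary_df is assumed to be indexed by dataset name with columns
    ['n_instances', 'n_features', 'n_classes'] (hypothetical layout).
    """
    query = np.array([len(y), X.shape[1], len(np.unique(y))], dtype=float)
    stats = summary_df[['n_instances', 'n_features', 'n_classes']].to_numpy(dtype=float)
    scale = stats.std(axis=0) + 1e-9          # keep one statistic from dominating the distance
    dist = np.linalg.norm((stats - query) / scale, axis=1)
    return list(summary_df.index[np.argsort(dist)[:n]])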

Question: Balanced Accuracy & Scaling

Hello,

I am currently trying to reproduce the benchmark results in (https://arxiv.org/abs/1703.00512), but I encountered some problems:

In the paper, you refer to using "balanced accuracy". Which implementation did you use (or was it your own implementation)? I couldn't find a proper "balanced accuracy" in sklearn.

And for the scaling: when I scale the features with sklearn.preprocessing.scale, I get an error as soon as I reach the MultinomialNB classifier (ValueError: Input X must be non-negative). So I guess you apply a different scaling for that one, which is not documented in the paper?
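
For anyone reproducing this today: newer scikit-learn releases provide sklearn.metrics.balanced_accuracy_score, and a non-negative scaler such as MinMaxScaler avoids the MultinomialNB error; a sketch, not necessarily the exact setup used in the paper:

from pmlb import fetch_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import balanced_accuracy_score   # available in scikit-learn >= 0.20

X, y = fetch_data('adult', return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()                     # keeps features non-negative, unlike standard scaling
clf = MultinomialNB().fit(scaler.fit_transform(train_X), train_y)
pred = clf.predict(scaler.transform(test_X))
print(balanced_accuracy_score(test_y, pred))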

update README for PMLB2

For PMLB2.0, we are capturing metadata in metadata.yaml and summary stats in a csv file. So, the readme is now free to present other info. Brainstorming what would be useful in there:

  • links to the metadata and summary data
  • plot of distribution of target
  • some summary info, like rows, cols, nclasses
  • for appropriately sized datasets, pairwise plots of the columns would be nice, maybe for the top 10 features

Just some ideas

move datasets to one folder?

Commit 469bfb2 has the wine quality dataset in the datasets/ folder. As discussed here, because we now have metadata defining tasks, we could move all datasets into one folder instead of separate regression and classification folders.

Wrong target!

The current target is a train/test selector, not a dependent variable.
According to OpenML, most analyses of this dataset properly use the 6th feature, "Drinks", as the target.

The PMLB dataset should be changed to reflect this: treat the 6th feature, currently "Drinks", as the new target, and remove the 7th feature.

P.S. The metadata.yaml file already reflects this change, minus a TODO that needs to be removed.

add keywords to metadata

It would be useful to tag metadata with keywords like "bioinformatics", "economics", "image", etc. I'm not sure of a principled way to do this, but I imagine we might want to use a reference ontology (like ACM CSS?). Discussion welcome.

Add JSON Schema file to validate `metadata.yaml`

All datasets in PMLB are accompanied by a metadata.yaml file that describes various characteristics of the dataset. We'd like to add a schema specification that can be used to validate the metadata files for each dataset.

The current plan is to create the schema as a JSON Schema, which is (almost) fully compatible with YAML documents as described in https://json-schema-everywhere.github.io/yaml and https://stackoverflow.com/a/44837391/1730417.

The (current) template used to create metadata.yaml is as follows:

# Created by [your name and/or contact info]
dataset: # required, dataset name
description: # required, dataset description
source: # required, link to the source from where dataset was retrieved
publication: # optional, study that generated the dataset (doi, pmid, pmcid, or url)
task: # required, classification or regression
keywords: # descriptive terms for the dataset, e.g., bioinformatics, images, economics, etc.
  - keyword1 # replace this
  - keyword2 # replace this as well
target:
  type:
  description: # required, describe the endpoint/outcome (and unit if exists)
  code: # optional but recommended, coding information, e.g., 'Control' = 0, 'Case' = 1
features: # list of features in the dataset
  - name: # required, name of feature
    type: # required, either continuous, nominal or ordinal
    description: # optional but recommended, what the feature measures/indicates, unit
    code: # optional, coding information, e.g., 'Control = 0', 'Case' = 1
    transform: # optional, any transformation performed on the feature, e.g., log scaled
  - name:
    type:
    description:
    code:
    transform:
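
A sketch of what the validation step might look like once a schema exists (the schema fragment below is a toy example based on the template above, not the final PMLB schema):

import yaml                      # PyYAML
from jsonschema import validate  # jsonschema package

# Toy fragment of a possible schema; the real one would live in the repository.
SCHEMA = {
    'type': 'object',
    'required': ['dataset', 'description', 'source', 'task', 'target', 'features'],
    'properties': {
        'task': {'enum': ['classification', 'regression']},
        'keywords': {'type': 'array', 'items': {'type': 'string'}},
        'features': {'type': 'array'},
    },
}

with open('datasets/mushroom/metadata.yaml') as f:   # example path
    metadata = yaml.safe_load(f)

validate(instance=metadata, schema=SCHEMA)  # raises jsonschema.ValidationError if malformed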

Site build fail

Even though everything was deployed correctly, the site is not being built.
[Screenshot: site build error]

One possible reason is that we have reached the repo size limits. I'm squashing all the commits on the gh-pages branch, and will potentially need to add force_orphan: true in the CI deployment step.

Relates #91.

Extend Python wrapper

We should extend the Python wrapper to allow the user to query the repository for data sets that meet certain criteria.

For example, the user could query for "all data sets that have >=1000 records" or "all data sets that are binary classification problems."
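
A sketch of what such queries might build on, assuming the dataset-characteristics table has been loaded into a DataFrame (the filename and column names below are hypothetical):

import pandas as pd

summary = pd.read_csv('all_summary_stats.tsv', sep='\t')   # hypothetical file and columns

# "all data sets that have >= 1000 records"
large = summary[summary['n_instances'] >= 1000]['dataset']

# "all data sets that are binary classification problems"
binary_clf = summary[(summary['task'] == 'classification') &
                     (summary['n_classes'] == 2)]['dataset']

print(list(large)[:5])
print(list(binary_clf)[:5])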

document data types

add documentation for how specifically each data type is assigned, and what "type schema" we are using. It may also be good to mention how it differs from pandas profiling types.

Use Git LFS

It’s not too urgent but I think it might be good to utilize Git LFS and allow for larger datasets to be added. More here.

Users who want to download the whole resource can still do that with git clone, but they'll need git LFS installed and set up for their user account, or they can download the zip file from GitHub. We can add this information to the README.

Later, we can save all the changes in one branch and, in an active branch (master branch when all this gets merged), we can squash all the old commits to reduce the repo space.

Errors with pmlb.fetch_data()

hello,

Just recently I encountered the following error when using pmlb.fetch_data() in Python in a Jupyter notebook. The Python version is 3.7.4, and the pmlb version is 1.0.2a0 or 1.0.1.post3. Could you let us know what might be the problem? Thanks!

from pmlb import fetch_data

# Returns a pandas DataFrame
mushroom = fetch_data('mushroom')
mushroom.describe().transpose()


SSLCertVerificationError Traceback (most recent call last)
~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
676 headers=headers,
--> 677 chunked=chunked,
678 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
380 try:
--> 381 self._validate_conn(conn)
382 except (SocketTimeout, BaseSSLError) as e:

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in _validate_conn(self, conn)
977 if not getattr(conn, "sock", None): # AppEngine might not have .sock
--> 978 conn.connect()
979

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connection.py in connect(self)
370 server_hostname=server_hostname,
--> 371 ssl_context=context,
372 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\util\ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)
383 if HAS_SNI and server_hostname is not None:
--> 384 return context.wrap_socket(sock, server_hostname=server_hostname)
385

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
422 context=self,
--> 423 session=session
424 )

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
869 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 870 self.do_handshake()
871 except (OSError, ValueError):

~\AppData\Local\Continuum\anaconda3\lib\ssl.py in do_handshake(self, block)
1138 self.settimeout(None)
-> 1139 self._sslobj.do_handshake()
1140 finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
~\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
726 retries = retries.increment(
--> 727 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
728 )

~\AppData\Roaming\Python\Python37\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
438 if new_retry.is_exhausted():
--> 439 raise MaxRetryError(_pool, url, error or ResponseError(cause))
440

MaxRetryError: HTTPSConnectionPool(host='media.githubusercontent.com', port=443): Max retries exceeded with url: /media/EpistasisLab/pmlb/master/datasets/mushroom/mushroom.tsv.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))

During handling of the above exception, another exception occurred:

SSLError Traceback (most recent call last)
in
2
3 # Returns a pandas DataFrame
----> 4 mushroom = fetch_data('mushroom')
5 mushroom.describe().transpose()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pmlb\pmlb.py in fetch_data(dataset_name, return_X_y, local_cache_dir, dropna)
77 raise ValueError('Dataset not found in PMLB.')
78 dataset_url = get_dataset_url(GITHUB_URL,
---> 79 dataset_name, suffix)
80 dataset = pd.read_csv(dataset_url, sep='\t', compression='gzip')
81 else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pmlb\pmlb.py in get_dataset_url(GITHUB_URL, dataset_name, suffix)
116 )
117
--> 118 re = requests.get(dataset_url)
119 if re.status_code != 200:
120 raise ValueError('Dataset not found in PMLB.')

~\AppData\Roaming\Python\Python37\site-packages\requests\api.py in get(url, params, **kwargs)
74
75 kwargs.setdefault('allow_redirects', True)
---> 76 return request('get', url, params=params, **kwargs)
77
78

~\AppData\Roaming\Python\Python37\site-packages\requests\api.py in request(method, url, **kwargs)
59 # cases, and look like a memory leak in others.
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
62
63

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
528 }
529 send_kwargs.update(settings)
--> 530 resp = self.send(prep, **send_kwargs)
531
532 return resp

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in send(self, request, **kwargs)
663 # Redirect resolving generator.
664 gen = self.resolve_redirects(r, request, **kwargs)
--> 665 history = [resp for resp in gen]
666 else:
667 history = []

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in (.0)
663 # Redirect resolving generator.
664 gen = self.resolve_redirects(r, request, **kwargs)
--> 665 history = [resp for resp in gen]
666 else:
667 history = []

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs)
243 proxies=proxies,
244 allow_redirects=False,
--> 245 **adapter_kwargs
246 )
247

~\AppData\Roaming\Python\Python37\site-packages\requests\sessions.py in send(self, request, **kwargs)
641
642 # Send the request
--> 643 r = adapter.send(request, **kwargs)
644
645 # Total elapsed time of the request (approximately)

~\AppData\Roaming\Python\Python37\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
512 if isinstance(e.reason, _SSLError):
513 # This branch is for urllib3 v1.22 and later.
--> 514 raise SSLError(e, request=request)
515
516 raise ConnectionError(e, request=request)

SSLError: HTTPSConnectionPool(host='media.githubusercontent.com', port=443): Max retries exceeded with url: /media/EpistasisLab/pmlb/master/datasets/mushroom/mushroom.tsv.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))

contributing guide

Guidelines for contributing.md

Guidelines for contributing datasets

  • We use 'features' to indicate variables/predictors and 'target' to indicate outcome/endpoint/label/class.

  • (WGL) Standardizing the target is necessary, but naming all the features 'features_x' would be unnecessary IMO

  • For feature names and dataset names, we use _ and all lower case? (this will transfer to R package as well).

  • WGL: sounds good for dataset names

  • Class coding: 0, 1, 2, ...

  • WGL: class labels must be contiguous

Guidelines for sourcing / identifying current datasets:

  • provide guidelines for finding the source of the dataset and linking to it
  • provide a checksum guideline, or field for 'how verified'
  • provide template for source information (yaml)
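
For the checksum / 'how verified' guideline above, one option is to record a SHA-256 digest of the compressed data file; a minimal sketch:

import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file (e.g. a .tsv.gz dataset)."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Example: store this value alongside the dataset's source information
# print(file_sha256('datasets/mushroom/mushroom.tsv.gz'))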

regarding the 6 mfeat datasets

The mfeat_pixel dataset has target values ranging over [0,9], while the other mfeat datasets have target values ranging over [1,10].
These datasets are all part of the same collection (https://archive.ics.uci.edu/ml/datasets/Multiple+Features), so they should use the same target coding, since people might compare results across the six datasets. I suggest changing the other five sets' targets from [1,10] to [0,9], because that is consistent with the primary source's site as well as with your current target annotation in the other datasets I've looked at.
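
The recoding itself is a one-line shift; a sketch of what it could look like for one of the affected files (path and dataset name are just examples):

import pandas as pd

df = pd.read_csv('datasets/mfeat_fourier/mfeat_fourier.tsv.gz', sep='\t', compression='gzip')
df['target'] = df['target'] - 1    # shift labels from 1..10 down to 0..9
df.to_csv('mfeat_fourier.tsv.gz', sep='\t', index=False, compression='gzip')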

continuous integration for summary statistics and readme data

At the moment the summary stats and readme files need to be generated manually.

  • Based on our discussion, let's move this to a CI step so that users can more easily contribute datasets/dataset metadata without having to rerun those scripts.
  • Use the CI to issue some basic checks on their metadata.yaml file to make sure it's the proper format.

v0.3.1 minor release

  • warning about updating pmlb after August (?); pmlb <= 0.3 won't be supported any more after that date
  • add new links for datasets in future versions of pmlb

Question regarding feature metadata

First of all, thank you for putting in such an effort to share your great work. I really appreciate it!

I am considering using this benchmark to evaluate some neural network approaches, though I found that there is no distinction between integer features that were originally categorical information (education, relationship) and natural integers (rating, score, age, etc). The latter present sequential information, while categorical features often don't. In neural nets this particular distinction is rather important. My question is: Is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

For instance, in the Irish dataset we have "Prestige_score:discrete" and "Type_school:discrete". Both are integers, though "Type_school" is categorical while "Prestige_score" is quantitative.

I could make use of the original datasets as well if you have them.

Standardize names

As @lacava suggested in #22, a couple of TODO items:

  • rename the datasets and feature names so that special characters, when possible, are replaced with _. Replace uppercase letters with lowercase letters.

  • In addition, I think we should move all datasets one level up and remove the classification and regression subdirectories, since this information is already in the metadata (to be done before generating the metadata.yaml files), unless this breaks too many things.
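
The renaming rule in the first bullet is easy to express as a small helper; a sketch (the exact rule is still up for discussion):

import re

def standardize_name(name: str) -> str:
    """Lowercase a name and replace runs of special characters with '_' (sketch)."""
    name = re.sub(r'[^a-z0-9]+', '_', name.strip().lower())
    return name.strip('_')

print(standardize_name('Prestige score'))           # -> 'prestige_score'
print(standardize_name('GAMETES-Epistasis.2-Way'))  # -> 'gametes_epistasis_2_way'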

fetch_data('iris') returns an error

  • pmlb version 1.0.2a0 and 1.0.1.post3 (fails on both)
  • Issue exists on both macos big sur and WSL ubuntu, both from within a miniconda environment
  • python version 3.9.2
  • Code to recreate:
 >>> from pmlb import fetch_data
>>> z = fetch_data('iris')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pmlb/pmlb.py", line 80, in fetch_data
    dataset = pd.read_csv(dataset_url, sep='\t', compression='gzip')
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/site-packages/pandas/io/parsers.py", line 1898, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 518, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 620, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1943, in pandas._libs.parsers.raise_parser_error
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "/home/yashs97/miniconda3/envs/EGADS/lib/python3.9/gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b've')

Backwards compatibility

Thank you all for the great package! It is really lovely to be able to include pmlb in a project and only fetch datasets as needed (while avoiding having to personally host the data and provide access to others directly). Recently I ran into the situation that older versions of pmlb now return 404 errors, and when upgrading pmlb some datasets no longer exist or have been renamed. In general, not a huge deal for ad hoc projects, but pmlb provides this great functionality for some longer-term artifacts.

With that in mind, is there any interest in receiving a patch for (some amount of) backwards compatibility? It would be great if datasets that were removed because they are duplicates redirected to their canonical names, and similarly if failed requests made a best-effort attempt to provide an alternative request that succeeds.

I'd be happy to take a look at this, but wanted to gauge interest first (in the meantime I've found myself having to include old datasets directly in repos).
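
One possible shape for such a patch (a sketch only; the alias mapping below is illustrative, not the real list of renames):

from pmlb import fetch_data

# Hypothetical mapping from pre-1.0 names to their canonical v1.0 names.
DATASET_ALIASES = {
    'some_old_name': 'some_new_name',
}

def fetch_data_compat(name, **kwargs):
    """Fetch a dataset, transparently redirecting renamed or removed datasets."""
    return fetch_data(DATASET_ALIASES.get(name, name), **kwargs)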

[Errno 2] No such file or directory

I'm trying to use the fetch_data function with the local_cache_dir parameter. Every time I call the function with a path where the dataset should be saved (so it doesn't have to be downloaded again), the following error appears: [Errno 2] No such file or directory.

On line 86 of pmlb.py, if the path does not exist, the file cannot be created on Windows.

My solution:
Path(os.path.dirname(dataset_path)).mkdir(parents=True, exist_ok=True)
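
In context, the fix would amount to something like this (the variable name dataset_path mirrors the one in pmlb.py; the cache path is illustrative):

import os
from pathlib import Path

dataset_path = './my_cache/adult/adult.tsv.gz'   # where the wrapper wants to cache the file
Path(os.path.dirname(dataset_path)).mkdir(parents=True, exist_ok=True)  # create missing directories first
# ... then the file can be written without hitting [Errno 2]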

Besides that, some dataset names listed in classification_dataset_names do not actually correspond to available datasets. For example: "cars1".

repeat dataset: promoters

Pretty sure the promoters and molecular-biology_promoters datasets are clones.

In [1]: import pandas as pd

In [2]: from pmlb import fetch_data

In [3]: df1 = fetch_data('promoters')

In [4]: df2 = fetch_data('molecular-biology_promoters')

In [6]: from pandas.util import hash_pandas_object

In [7]: import hashlib

In [8]: rowHashes1 = hash_pandas_object(df1).values

In [9]: hash1 = hashlib.sha256(rowHashes1).hexdigest()

In [10]: rowHashes2 = hash_pandas_object(df2).values

In [11]: hash2 = hashlib.sha256(rowHashes2).hexdigest()

In [12]: hash1
Out[12]: '37c2d79bd3ecaff76ab53f3f20742245e56a3ccb1354b5c45ea4a4429afff261'

In [13]: hash2
Out[13]: '37c2d79bd3ecaff76ab53f3f20742245e56a3ccb1354b5c45ea4a4429afff261'

In [14]: hash1==hash2
Out[14]: True

Issues about deployment step

After some tests on GitHub, our current deployment step works only on pull requests from a branch in this repo, not from a branch in a forked repo. Also, the deployment step somehow failed on git push.

Additionally, I think the right behavior would be for the deployment step to run only when a pull request or commit is merged into the PMLB2.0 or master branch.

I will check on it next week.

Include metafeatures & evaluation results

Not sure if this is already included, but would it be possible to include the metafeatures and performance results described in the paper? (This would be really handy for algorithm selection research.)

Clean up commit history

#30 addressed the lack of Git LFS for the large dataset files. It makes sense to remove these from the commit history as well. The main effect is reducing the size of the repository when cloned, but it also has other beneficial side effects, such as making the commit history easier to browse and navigate.

Aside from removing large dataset files from the history, is there anything else we can/should clean up?

Is the reference correct?

Hi,

I was about to cite PMLB and found that the reference is inconsistent with the website of BioData Mining. In particular, I think the bibtex should be changed from:

@article{Olson2017PMLB,
    author="Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.",
    title="PMLB: a large benchmark suite for machine learning evaluation and comparison",
    journal="BioData Mining",
    year="2017",
    month="Dec",
    day="11",
    volume="10",
    number="1",
    pages="36",
    issn="1756-0381",
    doi="10.1186/s13040-017-0154-4",
    url="https://doi.org/10.1186/s13040-017-0154-4"
}

to

@article{Olson2017PMLB,
    author="Olson, Randal S. and La Cava, William and Orzechowski, Patryk and Urbanowicz, Ryan J. and Moore, Jason H.",
    title="PMLB: a large benchmark suite for machine learning evaluation and comparison",
    journal="BioData Mining",
    year="2017",
    month="Dec",
    day="11",
    volume="10",
    number="36",
    pages="1--13",
    issn="1756-0381",
    doi="10.1186/s13040-017-0154-4",
    url="https://doi.org/10.1186/s13040-017-0154-4"
}

to be consistent with the BioData Mining website and the PDF.

Disclaimer: I'm aware that the information in the current BibTeX entry is identical to the info given in the RIS file from BioData Mining.

Credit/Origin?

  1. Nice resource! I may add some datasets to it in the future (although the ones I use for benchmarking are considerably "rarer" than the ones here: time series, raw text, locations, entities, etc.).
  2. The various datasets don't seem to credit their origin (e.g. "winered", which I assume is the wine dataset from UCI, but there's nothing about that in the data folder or the csv.gz file).
    Adding the origin (even at the "site" level, e.g. "UCI", "OpenML", "Kaggle datasets", "KDD") would make it much easier to analyze the original datasets' context, domain and interpretation (e.g. "looking for datasets on time series + predictive maintenance").

cars and cars1 are very similar

cars and cars1 are very similar; all data points appear to be the same, except that cars has one additional feature, brand. Since the datasets are otherwise the same and cars1 just has one fewer column, it might make sense to get rid of cars1 altogether and just keep cars.

Get categorical features

Hi! I am trying to use penn-ml-benchmarks to test encodings of categorical features.
Right now I am using get_types() but it reports that all features are categorical.

import pandas as pd
from pmlb import fetch_data, regression_dataset_names

for regression_dataset in regression_dataset_names:
    X, y = fetch_data(regression_dataset, return_X_y=True,
                      local_cache_dir='./results_regression/datasets/')
    X = pd.DataFrame(X)
    y = pd.DataFrame(y)
    print(set(get_types(X)))  # get_types is the helper referred to above

Is there a way to get the categorical features of the datasets?
