openml / openml-python

Python module to interface with OpenML

Home Page: https://openml.github.io/openml-python/main/

License: Other

Python 99.57% Makefile 0.05% Dockerfile 0.07% Shell 0.30%

openml machine-learning python meta-learning hacktoberfest

openml-python's Introduction

OpenML-Python

A python interface for OpenML, an online platform for open science collaboration in machine learning. It can be used to download or upload OpenML data such as datasets and machine learning experiment results.

General

Citing OpenML-Python

If you use OpenML-Python in a scientific publication, we would appreciate a reference to the following paper:

Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, Frank Hutter
OpenML-Python: an extensible Python API for OpenML
Journal of Machine Learning Research, 22(100):1−5, 2021

Bibtex entry:

@article{JMLR:v22:19-920,
  author  = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
  title   = {OpenML-Python: an extensible Python API for OpenML},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {100},
  pages   = {1--5},
  url     = {http://jmlr.org/papers/v22/19-920.html}
}

Contributors ✨

Thanks goes to these wonderful people (emoji key):

a-moadel
📖 💡

Neeratyoy Mallik
💻 📖 💡

This project follows the all-contributors specification. Contributions of any kind welcome!

openml-python's People

amueller raghavrv gitter-badger kazeevn brylie maien85 rhiever anatolfernandez elibol digideskio hackathorn phpmind strategist922 kristinmcleod benjamesbabala awesome-ml awesome-python zmpyzmpy omgteam dihia libardo1 arlindkadra rkadlec heoa bradyneal jason790 lawrennd minoriinoue ducnguyen77 engelen williamraynaut h2oai carsondahlberg cosmologist10 timcordova femoiseev mchristos roopeshranjan irfannurafif afcarl eliask lazycrazyowl valrcs vigsterkr rth glemaitre ledell dmartinez05 feitianyiren dagger-nn vishalbelsare gf712 samhas2015 bayanibra ji-zhang parnurzeal neeratyoy timandrews335 michaelmmeskhi shzh-airob eggachecat stjordanis hp2500 bin2000 thomascherickal r3sult twsthomas liuweiping2020 konrad prabhant nunofernandes-plight m7142yosuke benman1 tqcai nicolashug rong-inspur ennosigaeon skubatur bilgecelik abhih miguelalexbt marcoslbueno 22quinn queencai eddiebergman omlondhe tarunagarwal99 scratchmex fsabr chouhanaryan a-moadel mkim2001 davidsirui mirkazemi actuarial-tools chin-i andrewtanqb lkampoli hercules261188 pgijsbers

openml-python's Issues

move documentation to a doc folder

This includes the Makefile. Having a Sphinx Makefile at the top level is very confusing.

Can't upload run

I get error code 450, using Python 3.5.1 on Anaconda.

covtype.zip

Add a run object

Add an object to represent a run:

http://openml.org/api

and the methods to download them.

fix link to docs on readme

Should it go to master? What is the develop branch?

reactivate old tests

There are some tests in the entities subfolder of tests that are not run (missing __init__.py?). We should try running them again and fixing them.

error "The task has no class labels. This method currently only works for tasks with class labels."

Add nice way of printing datasets, tasks etc

make repo pip-installable.

Currently I can't install this repo with pip because of the dependency liac-arff>=2.1.1dev.
I think that can be fixed by pointing to the repo.

Add doctests to make sure that examples are correct.

Do not crash if the authentication is lost

Re-authenticate in that case. To do this, implement authenticate.check, and re-authenticate if necessary.
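
One way to picture the retry behaviour requested here is a small wrapper that re-authenticates once and repeats the call. This is only a sketch: `AuthenticationLost`, `call_with_reauth`, and the two callables are hypothetical names, not part of the package, and the sketch reacts to an exception rather than polling an `authenticate.check` method.

```python
class AuthenticationLost(Exception):
    """Hypothetical error raised when the server rejects the session."""


def call_with_reauth(request, authenticate, max_attempts=2):
    """Run request(); on AuthenticationLost, re-authenticate and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except AuthenticationLost:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            authenticate()  # refresh the session, then loop to retry
```

Both `request` and `authenticate` are passed in as callables, so the wrapper does not need to know how the connection is made.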

Better error message if login failed

If you use a faulty API key, the package stays silent until you try to download something, at which point it
returns an HTTP 500 error. I believe this happened because my API key included a newline at the end.

Can we either check the login after creating the API connector, or return a proper error message?

/usr/local/lib/python2.7/site-packages/openml-0.0.1.dev0-py2.7.egg/openml/apiconnector.pyc in download_task(self, task_id)
    717             except (URLError, UnicodeEncodeError) as e:
    718                 print(e)
--> 719                 raise e
    720 
    721             # Cache the xml task file

HTTPError: HTTP Error 500: Internal Server Error
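
A trailing newline like the one described above could be caught before any request is sent. The helper below is a minimal sketch of such a check; `clean_api_key` is a hypothetical name, and the rule that a key may not contain whitespace is an assumption, not something the server documentation is quoted on here.

```python
def clean_api_key(raw_key):
    """Strip surrounding whitespace and reject obviously malformed keys."""
    key = raw_key.strip()  # removes a trailing newline read from a file
    if not key:
        raise ValueError("API key is empty")
    if any(ch.isspace() for ch in key):
        # Assumption: a valid key never contains internal whitespace.
        raise ValueError("API key contains whitespace")
    return key
```

Calling this when the connector is created would turn a silent misconfiguration into an immediate, readable error.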

have download_dataset accept strings?

Remembering integers seems hard.
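
A name-based lookup could sit in front of the integer ids. The sketch below assumes a listing of dictionaries with `did` and `name` fields; `resolve_dataset_id` and the sample listing are hypothetical and only illustrate the idea, including the case where one name matches several datasets.

```python
def resolve_dataset_id(name_or_id, listing):
    """Return the integer id for a dataset given its id or its name."""
    if isinstance(name_or_id, int):
        return name_or_id
    if name_or_id.isdigit():
        return int(name_or_id)  # a numeric string such as "61"
    matches = [d["did"] for d in listing if d["name"] == name_or_id]
    if not matches:
        raise KeyError("no dataset named %r" % name_or_id)
    if len(matches) > 1:
        raise ValueError("name %r is ambiguous: ids %s" % (name_or_id, matches))
    return matches[0]


# Made-up listing used only for illustration.
listing = [
    {"did": 61, "name": "iris"},
    {"did": 969, "name": "iris"},
    {"did": 1000, "name": "toy"},
]
toy_id = resolve_dataset_id("toy", listing)
```

Raising on an ambiguous name instead of picking the first match keeps the caller from silently getting the wrong version.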

Correctly handle sparse data

At the moment, liac-arff returns a dense np.ndarray unless the user specifies that a dataset is sparse.

openml.datasets.list_datasets() fails with KeyError: 'oml:data'

Hi!

I tried today's version (20f0292) on Python 3 and got:

KeyError Traceback (most recent call last)
in ()
----> 1 datasets = openml.datasets.list_datasets()
2
3 data = pd.DataFrame(datasets)
4 print("First 10 of %s datasets..." % len(datasets))
5 print(data[:10][['did','name','NumberOfInstances','NumberOfFeatures']])

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in list_datasets()
117 these are also returned.
118 """
--> 119 return _list_datasets("data/list")
120
121

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in _list_datasets(api_call)
141
142 # Minimalistic check if the XML is useful
--> 143 assert type(datasets_dict['oml:data']['oml:dataset']) == list,
144 type(datasets_dict['oml:data'])
145 assert datasets_dict['oml:data']['@xmlns:oml'] == \

KeyError: 'oml:data'

readthedocs link

The readthedocs link in the readme.md file takes you to the openml project metadata page on RTD (https://readthedocs.org/projects/openml/). The documentation itself is at: http://openml.readthedocs.org/en/latest/

Decoding Issue

I run into a decoding issue when trying to retrieve information about the dataset. When running the following code:

from openml.apiconnector import APIConnector
connector = APIConnector(apikey = 'key')
task = connector.download_task(59)
dataset = task.get_dataset()
X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute)

I get

Traceback (most recent call last):
  File "...\OpenML\TaskScript.py", line 15, in <module>
    X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute)
  File "...\OpenML\openml\entities\dataset.py", line 138, in get_dataset
    data, categorical, attribute_names = pickle.load(fh)
  File "...\python-3.4.3.amd64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 977: character maps to <undefined>

If it helps, I'm using Python3.4.3 and my default encoding is utf-8 (sys.getdefaultencoding()), though the traceback suggests cp1252 was used.
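
The traceback shows the cp1252 codec being invoked inside pickle.load, which is what happens when the cache file is opened in text mode: on Windows, open() without an encoding argument uses the locale's preferred encoding rather than sys.getdefaultencoding(). A likely fix is to open pickle files in binary mode for both writing and reading. The sketch below shows the pattern; `save_cache` and `load_cache` are hypothetical helper names, not functions from the package.

```python
import os
import pickle
import tempfile


def save_cache(path, obj):
    # "wb": write raw bytes, no text encoding is involved
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_cache(path):
    # "rb": read raw bytes; text mode would try to decode them first
    with open(path, "rb") as fh:
        return pickle.load(fh)


cache_path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
save_cache(cache_path, ([[5.1, 3.5]], [False, False], ["sepallength", "sepalwidth"]))
data, categorical, attribute_names = load_cache(cache_path)
```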

Retrieving class labels of a dataset

I'm working on a function which lets users more easily run machine learning experiments through the OpenML API. For every run, the arff-file which contains the result should contain the names of the class labels. Currently there is no way to directly retrieve class labels from an OpenML dataset object.

In the code below, let's assume that 'dataset' is an OpenML Dataset entity of the iris dataset, so it should have the class labels {Iris-setosa,Iris-versicolor,Iris-virginica}.
What I would like is (something like) this:

dataset.class_labels
>>> ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

Instead, I currently do this (which works):

import arff  # liac-arff

arffFileName = dataset.data_file
with open(arffFileName) as fh:
    # decode the whole ARFF file, header and data
    arffData = arff.ArffDecoder().decode(fh)
    dataAttributes = dict(arffData['attributes'])
dataAttributes['class']

That is, I open the associated ARFF file and decode it to find the class label information. After executing the above code, dataAttributes['class'] is exactly ["Iris-setosa", "Iris-versicolor", "Iris-virginica"].

At first I thought I would simply change the initialization of the dataset, and add a class_label attribute, but looking into it raised the following questions:

In apiconnector.py line 456, which currently initializes the dataset objects, the dataset description is used to initialize the dataset object. Class labels are not information contained in the dataset description XML file, but instead in the data ARFF file. This means that in order to retrieve the class labels, the ARFF file has to be opened. First, this seems to clash with the design so far, in that the dataset can be constructed from the description alone (and if the dataset ARFF is already cached, it then does not need to be opened). Second, from the comment on apiconnector.py line 536 I gather that ARFF files might be sufficiently large not to want to load them into memory if not necessary. Opening a potentially colossal file just to search for the class labels attribute might be too much overhead, though since the attributes are listed before the data, I think a lazy read might negate much of the overhead. So should the class labels be stored in the description, or should the ARFF file be read upon construction of the dataset?
Not all datasets have class labels (for example unsupervised learning problems), so in that sense it is not a generic dataset property. For those datasets, I'm in favor of just setting class_labels to None, but perhaps there are reasons to instead return an empty list. Alternatively, this could be an argument against having a class_label property.

If we can reach a conclusion, then I will try to implement it on my feature branch (feature/script (which I admit is poorly named and will rename to feature/run or feature/autorun)). Looking forward to hearing your thoughts.

Kind regards,
Pieter Gijsbers
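
The lazy read mentioned above can be sketched without any ARFF library: iterate over the file line by line and stop at the @data marker, so the data section is never loaded. `read_nominal_values` is a hypothetical helper, and the parsing is deliberately simplified: it assumes attribute names without spaces and nominal values without embedded commas.

```python
import io


def read_nominal_values(lines, attribute_name):
    """Scan an ARFF header and return the values of one nominal attribute.

    Stops at the @data line. Returns None if the attribute is missing
    or is not nominal.
    """
    for line in lines:
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # header is over; never touch the data rows
        if not lowered.startswith("@attribute"):
            continue
        parts = stripped.split(None, 2)
        if len(parts) < 3:
            continue
        name, spec = parts[1].strip("'\""), parts[2].strip()
        if name != attribute_name:
            continue
        if spec.startswith("{") and spec.endswith("}"):
            return [v.strip().strip("'\"") for v in spec[1:-1].split(",")]
        return None  # attribute exists but is numeric, string, etc.
    return None


sample = io.StringIO(
    "@relation iris\n"
    "@attribute sepallength numeric\n"
    "@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}\n"
    "@data\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
)
class_labels = read_nominal_values(sample, "class")
```

Returning None for a missing or non-nominal target matches the proposal to set class_labels to None for datasets without class labels.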

Provide an example of ~/.openml/config

At the very least, I want to know how to put the API key in it.
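
As an illustration only, the sketch below writes and reads a config file made of `key = value` lines. The file format and the `apikey` key name are assumptions here and should be checked against the project's documentation; `read_config` is a hypothetical helper, and the example writes to a temporary directory rather than to ~/.openml/config.

```python
import os
import tempfile


def read_config(path):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, sep, value = line.partition("=")
            if sep:  # ignore lines without an equals sign
                settings[key.strip()] = value.strip()
    return settings


# Assumed contents; replace the placeholder with a real key.
sample = "# hypothetical contents of ~/.openml/config\napikey = YOUR_API_KEY_HERE\n"
path = os.path.join(tempfile.mkdtemp(), "config")
with open(path, "w") as fh:
    fh.write(sample)
settings = read_config(path)
```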

Add API method documentation

Similar to the scikit-learn API documentation

Best way to load arff data?

I downloaded a dataset from OpenML and wanted to load it in pandas. What is the best way to do this?
If I use the get_arff() function, for some reason the class values are not included, even though they are in the ARFF file stored on disk. Maybe I have the wrong ARFF reader installed?

dataset = openml.download_dataset(61)
iris = dataset.get_arff()
iris = pd.DataFrame(iris['data'], columns=[attribute[0] for attribute in iris['attributes']])
print(iris[:10])
   sepallength  sepalwidth  petallength  petalwidth  class
0          5.1         3.5          1.4         0.2      0
1          4.9         3.0          1.4         0.2      0
2          4.7         3.2          1.3         0.2      0
3          4.6         3.1          1.5         0.2      0
4          5.0         3.6          1.4         0.2      0
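
The class column above holds integer codes rather than labels. If the decoded ARFF dictionary still lists the nominal values in its 'attributes' entry, the codes can be mapped back before building the DataFrame. The sketch below assumes that code i corresponds to the i-th declared value, which is an assumption about the encoding; `decode_nominal_columns` is a hypothetical helper and the sample dictionary is made up.

```python
def decode_nominal_columns(arff_dict):
    """Replace integer codes with labels for every nominal attribute."""
    attributes = arff_dict["attributes"]
    rows = []
    for row in arff_dict["data"]:
        decoded = []
        for (name, spec), value in zip(attributes, row):
            # Nominal attributes carry a list of allowed labels as their type.
            is_code = value is not None and not isinstance(value, str)
            if isinstance(spec, list) and is_code:
                decoded.append(spec[int(value)])
            else:
                decoded.append(value)
        rows.append(decoded)
    return rows


iris = {
    "attributes": [
        ("sepallength", "REAL"),
        ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
    ],
    "data": [[5.1, 0], [7.0, 1]],
}
decoded = decode_nominal_columns(iris)
```

The returned rows can be passed to pd.DataFrame with the same columns list used in the snippet above.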

should all classes have an instance of the api connector?

Currently tasks have one, but runs and datasets don't.

openml_run broken?

This does not work anymore. It requires a get_train_and_test_set() in Task but it does not exist.
It was working a few days ago?

AttributeError Traceback (most recent call last)
in ()
2
3 clf = ensemble.RandomForestClassifier()
----> 4 prediction_path, description_path = openml_run(task, clf)
5 print("RandomForest has run on the task.")
6 print("Predictions stored: %s" % prediction_path)

/Users/joa/anaconda/lib/python3.5/site-packages/openml-0.2.1-py3.5.egg/openml/autorun.py in openml_run(task, classifier)
197 for f in range(0, nr_folds):
198 start_time = time.time()
--> 199 TrainX, TrainY, TestX, TestY = task.get_train_and_test_set(f, r)
200 _,test_idx = task.get_train_test_split_indices(f)
201

AttributeError: 'OpenMLTask' object has no attribute 'get_train_and_test_set'

rewrite notebook to work with new API

The example notebook (and user guide) need to be rewritten to match the new API.

use different task type for unittests

because classification is the largest and causes the largest download

Remove schemas from package

@joaquinvanschoren I don't think we need them any more. Do you see any reasons to keep them?

Add simple instruction / snippet to openml python interface website

I think the python api here:
http://www.openml.org/guide/#!python

should have the pip command to install the package from GitHub, and about 5 lines to train a scikit-learn classifier on a specific data set.
I guess we should do that after the API is solid, but I think it's very important for adoption.

Can't get dataset 61 (iris)

On version 20f0292:

dataset = openml.datasets.get_dataset(61)

KeyError Traceback (most recent call last)
in ()
----> 1 dataset = openml.datasets.get_dataset(61)
2
3 print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
4 print("URL: %s" % dataset.url)
5 print(dataset.description[:500])

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in get_dataset(did)
237 "cast to an Integer.")
238
--> 239 description = get_dataset_description(did)
240 arff_file = _get_dataset_arff(did, description=description)
241

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in get_dataset_description(did)
253
254 try:
--> 255 return _get_cached_dataset_description(did)
256 except (OpenMLCacheException):
257 try:

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in _get_cached_dataset_description(did)
82 continue
83
---> 84 return xmltodict.parse(dataset_xml)["oml:data_set_description"]
85
86 raise OpenMLCacheException("Dataset description for did %d not "

KeyError: 'oml:data_set_description'

Iris v. 3 (http://www.openml.org/d/969) works fine.

Incorrect internal calls?

The Task entity calls the get_pandas function of the Dataset entity (in function get_X_and_Y), which does not exist (probably should be get_dataset).
Specifically, running the code below triggers the error:

    from openml.apiconnector import APIConnector
    connector = APIConnector(apikey = 'key')
    task = connector.download_task(59)
    x1,y1,x2,y2 = task.get_train_and_test_set()

Problem with APIConnector.upload_flow (?)

I want to upload a flow description programmatically. I have a flow description which I want to upload using the APIConnector.upload_flow method.

I currently call the method with a description 'test description', and the absolute path to the file. The return value I get is

connector.upload_flow('test description', abs_path)
>>>(200, <Response [200]>)

with response.content containing

<oml:error xmlns:oml="http://openml.org/openml">
    <oml:code>163</oml:code>
    <oml:message>Problem validating uploaded description file</oml:message>
    <oml:additional_information>Start tag expected, '&lt;' not found.
XML does not correspond to XSD schema. </oml:additional_information>
</oml:error>

It seems like everything works fine, but my file is not correct. But I don't see what is wrong with the file. Oddly enough, I can upload it through the API docs. I will actually see it show up under my flows on the openml website, and as a response I get the new flow id.

Anyone got ideas?

Absolute Path required when uploading run

When uploading a run (and probably other OpenML objects as well), you need to provide files, such as a run description or a predictions file. Currently, upload_run requires an absolute path to the files (this is also true in the feature/Upload_Fixes branch). This is caused by the following check:

if os.path.isabs(path) and os.path.exists(path):
    ...  # do file things

To me, it would make sense to change that to:

if not os.path.isabs(path):
    path = os.path.abspath(path)
if os.path.exists(path):
    ...  # do file things

So it would accept both absolute and relative paths. Is there a particular reason this wasn't done?
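
The proposed change can be written as a small standalone helper. Since os.path.abspath already returns absolute paths unchanged (apart from normalization), the isabs check is not strictly needed. `resolve_existing_file` is a hypothetical name, and the sketch uses os.path.isfile instead of os.path.exists so that directories are rejected as well, which is a deliberate deviation from the snippet above.

```python
import os


def resolve_existing_file(path):
    """Return the absolute form of path, or raise if no such file exists."""
    absolute = os.path.abspath(path)  # no-op for already absolute paths
    if not os.path.isfile(absolute):
        raise FileNotFoundError("no such file: %s" % absolute)
    return absolute
```

Raising an error instead of silently skipping the file makes a mistyped path visible at upload time.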

need to separate offline and online tests

Currently, many of the tests only exercise datasets that are already downloaded.
We should make sure that for the online tests, the cache is erased after each run.
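
One possible layout, sketched with the standard unittest module: a base class gives every online test a fresh temporary cache directory and deletes it afterwards, and online tests are skipped unless explicitly enabled. The class names and the `OPENML_ONLINE_TESTS` environment variable are hypothetical, and wiring `cache_dir` into the package's real cache setting is left out.

```python
import os
import shutil
import tempfile
import unittest


class OnlineTestCase(unittest.TestCase):
    """Base class: each test gets an empty, throwaway cache directory."""

    def setUp(self):
        self.cache_dir = tempfile.mkdtemp(prefix="openml-test-cache-")

    def tearDown(self):
        # Erase the cache so the next test has to download again.
        shutil.rmtree(self.cache_dir, ignore_errors=True)


@unittest.skipUnless(
    os.environ.get("OPENML_ONLINE_TESTS") == "1", "online tests disabled"
)
class TestDownload(OnlineTestCase):
    def test_cache_starts_empty(self):
        self.assertEqual(os.listdir(self.cache_dir), [])
```

Offline tests can keep using a pre-populated fixture directory and inherit directly from unittest.TestCase.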

add read-only api key for testing

unexpected result of get_task_list

It gets all of the classification tasks. If we want to rate limit, maybe getting all of one type is not good. If we don't care about server load, I think we should get all tasks.

submodules or not?

I'm not sure if we want to keep the API split into submodules or not.
Currently you can do either of the following:

import openml
connector = openml.APIConnector()
dataset = openml.datasets.download_dataset(connector, id=0)

from openml import APIConnector
from openml.datasets import download_dataset
connector = APIConnector()
dataset = download_dataset(connector, id=0)

One alternative would be to flatten out the datasets, runs etc. submodules:

import openml
connector = openml.APIConnector()
dataset = openml.download_dataset(connector, id=0)

and

from openml import APIConnector, download_dataset
connector = APIConnector()
dataset = download_dataset(connector, id=0)

This is only an API decision and doesn't change the layout of the files on disk.

We could also shorten the names by dropping the dataset part of the function names:

import openml
connector = openml.APIConnector()
dataset = openml.datasets.download(connector, id=0)

That doesn't work that well with the from imports, though.
Lastly, we could make them static methods on the classes:

from openml import APIConnector, OpenMLDataset
connector = APIConnector()
dataset = OpenMLDataset.download(connector, id=0)

The benefit of not flattening out the submodules is that auto-completion (which I use A LOT in the notebook) is much more explorable. The downside is more to type.
The class methods are sort of an intermediate maybe?

Note that I could also do:

from openml import datasets
datasets.download(...)

I'm not saying we need to change something, I'm just saying that if we want to change something, now would be a good time.

make dataset.get_dataset use default target

It should return y=None if there is no default target.
Maybe it should also return a pandas DataFrame.

Use coveralls.io

Use new travis-ci API

Currently, the builds fail because the tests are run on something called legacy infrastructure.

Return the train-test splits from a task

Tasks currently have a get_train_and_test_set method to return the train and test set for a specific CV repeat/fold. However, I see no way to get the number of repeats and folds. Can we please either:

add functions to return the CV specs (i.e. the number of folds and repeats), or
return the whole set of train-test splits as a numpy array (tensor of repeat x fold x splits)?
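
To make the second option concrete, the sketch below builds a nested structure indexed as splits[repeat][fold], from which the number of repeats and folds can be read off directly. It generates the splits locally with a seeded shuffle purely to show the shape; a real task would take its splits from the server. `make_splits` is a hypothetical helper, and nested lists are used instead of a numpy array because train and test parts differ in length.

```python
import random


def make_splits(n_samples, n_repeats, n_folds, seed=0):
    """Return splits[repeat][fold] = (train_indices, test_indices)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds roughly equal folds.
        folds = [indices[k::n_folds] for k in range(n_folds)]
        repeat = []
        for k in range(n_folds):
            test = sorted(folds[k])
            train = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            repeat.append((train, test))
        splits.append(repeat)
    return splits


splits = make_splits(n_samples=10, n_repeats=2, n_folds=5)
n_repeats = len(splits)      # CV spec can be read from the structure
n_folds = len(splits[0])
train_idx, test_idx = splits[0][0]
```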

make travis ci use agg (?) backend?

If we want to keep the notebooks, we probably want it to plot using a backend that doesn't require X. Currently we can't plot at all because the default backend is not supported.
Not sure if all this is worth the hassle.

DOCS: Add configuration file explanation / link to test docs

There are instructions for testing here:
http://openml.readthedocs.org/en/master/#testing
just after installation.
However, without config, the tests won't run.
Not sure if that can be fixed just by adding a read-only API key (#47).

rename repo

This repo has a very bad name. I realize the other repos have similarly bad names, but I think this is a really bad idea. It is just named python. So if I fork this repo, I'll have a repo called python. And if I check it out, it will be checked out in a folder called `python`. That's really really weird.