openml / openml-python

Python module to interface with OpenML

Home Page: https://openml.github.io/openml-python/main/

License: Other

Topics: openml, machine-learning, python, meta-learning, hacktoberfest

openml-python's Introduction

OpenML-Python


A Python interface for OpenML, an online platform for open science collaboration in machine learning. It can be used to download or upload OpenML data such as datasets and machine learning experiment results.


Citing OpenML-Python

If you use OpenML-Python in a scientific publication, we would appreciate a reference to the following paper:

Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Müller, Joaquin Vanschoren, Frank Hutter
OpenML-Python: an extensible Python API for OpenML
Journal of Machine Learning Research, 22(100):1−5, 2021

Bibtex entry:

@article{JMLR:v22:19-920,
  author  = {Matthias Feurer and Jan N. van Rijn and Arlind Kadra and Pieter Gijsbers and Neeratyoy Mallik and Sahithya Ravi and Andreas Müller and Joaquin Vanschoren and Frank Hutter},
  title   = {OpenML-Python: an extensible Python API for OpenML},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {100},
  pages   = {1--5},
  url     = {http://jmlr.org/papers/v22/19-920.html}
}

Contributors ✨

Thanks go to these wonderful people (emoji key):


a-moadel

📖 💡

Neeratyoy Mallik

💻 📖 💡

This project follows the all-contributors specification. Contributions of any kind are welcome!

openml-python's People

Contributors

a-moadel, allcontributors[bot], amueller, arlindkadra, dependabot[bot], eddiebergman, engelen, glemaitre, janvanrijn, joaquinvanschoren, lennartpurucker, mfeurer, minoriinoue, mirkazemi, neeratyoy, pgijsbers, prabhant, pre-commit-ci[bot], rhiever, rong-inspur, rth, sahithyaravi, tashay, timandrews335, toon-vc, twsthomas, v-parmar, willcmartin, williamraynaut, zardaloop


openml-python's Issues

reactivate old tests

There are some tests in the entities subfolder of tests that are not run (missing __init__.py?). We should try running them again and fix them.

make repo pip-installable.

Currently I can't install this repo with pip because of the dependency liac-arff>=2.1.1dev, which has no release on PyPI.
I think that can be fixed by pointing to the liac-arff repository, as sketched below.
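A sketch of one possible fix, assuming the development version lives in a git repository (the URL below is a placeholder, not the actual location):

# setup.py sketch: dependency_links lets pip resolve a version that is not on
# PyPI from a git repository (placeholder URL).
from setuptools import setup

setup(
    name="openml",
    install_requires=["liac-arff>=2.1.1dev"],
    dependency_links=[
        "git+https://github.com/example/liac-arff.git#egg=liac-arff-2.1.1dev"
    ],
)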

Better error message if login failed

If you use a faulty API key, the package stays silent until you try to download something, at which point it returns an HTTP 500 error. I believe this happened because my API key included a newline at the end.

Can we either check the login right after creating the API connector, or return a proper error message?

/usr/local/lib/python2.7/site-packages/openml-0.0.1.dev0-py2.7.egg/openml/apiconnector.pyc in download_task(self, task_id)
    717             except (URLError, UnicodeEncodeError) as e:
    718                 print(e)
--> 719                 raise e
    720 
    721             # Cache the xml task file

HTTPError: HTTP Error 500: Internal Server Error
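A minimal sketch of the eager check suggested above (check_api_key is a hypothetical helper, and the hexadecimal format is an assumption about OpenML keys):

import string

def check_api_key(key):
    """Fail fast on malformed API keys (hypothetical helper).

    Stripping whitespace catches the trailing-newline case described above;
    the hexadecimal check is an assumption about the key format.
    """
    key = key.strip()
    if not key or any(c not in string.hexdigits for c in key):
        raise ValueError("API key should be a non-empty hexadecimal string")
    return key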

openml.datasets.list_datasets() fails with KeyError: 'oml:data'

Hi!

I tried today's version (20f0292) on Python 3 and got:
KeyError                                  Traceback (most recent call last)
in ()
----> 1 datasets = openml.datasets.list_datasets()
      2
      3 data = pd.DataFrame(datasets)
      4 print("First 10 of %s datasets..." % len(datasets))
      5 print(data[:10][['did','name','NumberOfInstances','NumberOfFeatures']])

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in list_datasets()
    117     these are also returned.
    118     """
--> 119     return _list_datasets("data/list")
    120
    121

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in _list_datasets(api_call)
    141
    142     # Minimalistic check if the XML is useful
--> 143     assert type(datasets_dict['oml:data']['oml:dataset']) == list, \
    144         type(datasets_dict['oml:data'])
    145     assert datasets_dict['oml:data']['@xmlns:oml'] == \

KeyError: 'oml:data'

Decoding Issue

I run into a decoding issue when trying to retrieve information about a dataset. When I run this code:

from openml.apiconnector import APIConnector
connector = APIConnector(apikey='key')
task = connector.download_task(59)
dataset = task.get_dataset()
X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute)

I get

Traceback (most recent call last):
  File "...\OpenML\TaskScript.py", line 15, in <module>
    X, y, categorical = dataset.get_dataset(target=dataset.default_target_attribute)
  File "...\OpenML\openml\entities\dataset.py", line 138, in get_dataset
    data, categorical, attribute_names = pickle.load(fh)
  File "...\python-3.4.3.amd64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 977: character maps to <undefined>

If it helps, I'm using Python 3.4.3 and my default encoding is utf-8 (sys.getdefaultencoding()), though the traceback suggests cp1252 was used.
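A likely cause, judging from the traceback (a guess, not verified): the cached pickle is opened in text mode, so Windows falls back to cp1252. Pickle files must be opened in binary mode:

import pickle

# "rb" instead of "r": pickle data is bytes, so no text decoding (cp1252 or
# otherwise) should happen at all. cache_path stands for wherever the cached
# pickle lives (the name is illustrative).
with open(cache_path, "rb") as fh:
    data, categorical, attribute_names = pickle.load(fh)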

Retrieving class labels of a dataset

I'm working on a function which lets users more easily run machine learning experiments through the OpenML API. For every run, the arff-file which contains the result should contain the names of the class labels. Currently there is no way to directly retrieve class labels from an OpenML dataset object.

In the code below, let's assume that 'dataset' is an OpenML Dataset entity of the iris dataset, so it should have the class labels {Iris-setosa,Iris-versicolor,Iris-virginica}.
What I would like is (something like) this:

>>> dataset.class_labels
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

Instead, I currently do this (which works):

import arff  # liac-arff

arffFileName = dataset.data_file
with open(arffFileName) as fh:
    arffData = arff.ArffDecoder().decode(fh)
dataAttributes = dict(arffData['attributes'])
dataAttributes['class']

That is, I open the associated ARFF file and decode it to find the class label information. After executing the above code, dataAttributes['class'] is exactly ["Iris-setosa", "Iris-versicolor", "Iris-virginica"].

At first I thought I would simply change the initialization of the dataset, and add a class_label attribute, but looking into it raised the following questions:

  • In apiconnector.py line 456, which currently initializes the dataset objects, the dataset description is used to initialize the dataset object. Class labels are not contained in the dataset description XML file, but in the data ARFF file, which means that in order to retrieve the class labels, the ARFF file has to be opened. First, this seems to clash with the design so far, in which the dataset can be constructed from the description alone (and if the dataset ARFF is already cached, it does not need to be opened at all). Second, from the comment on apiconnector.py line 536 I gather that ARFF files might be large enough that we do not want to load them into memory unnecessarily. Opening a potentially colossal file just to find the class labels attribute might be too much overhead, though since the attributes are listed before the data, a lazy read might negate much of it (see the sketch after this list). So should the class labels be stored in the description, or should the ARFF file be read upon construction of the dataset?
  • Not all datasets have class labels (for example unsupervised learning problems), so in that sense it is not a generic dataset property. For those datasets, I'm in favor of just setting class_labels to None, but perhaps there are reasons to instead return an empty list. Alternatively, this could be an argument against having a class_label property.
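For reference, a minimal sketch of the lazy header read mentioned above (read_class_labels is a hypothetical helper; it assumes an unquoted nominal attribute specification):

def read_class_labels(arff_path, target="class"):
    """Scan only the ARFF header for the target attribute's nominal values.

    Hypothetical helper: it stops at @DATA, so even a colossal file is never
    loaded into memory, and it returns None when no nominal target is found
    (e.g. for unsupervised datasets).
    """
    with open(arff_path) as fh:
        for line in fh:
            line = line.strip()
            if line.lower().startswith("@data"):
                break  # the header ends here; never touch the data section
            if line.lower().startswith("@attribute"):
                parts = line.split(None, 2)  # e.g. "@ATTRIBUTE class {a,b,c}"
                if len(parts) == 3 and parts[1] == target and parts[2].startswith("{"):
                    return [v.strip() for v in parts[2].strip("{}").split(",")]
    return None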

If we can reach a conclusion, then I will try to implement it on my feature branch, feature/script (which I admit is poorly named and will rename to feature/run or feature/autorun). Looking forward to hearing your thoughts.

Kind regards,
Pieter Gijsbers

Best way to load arff data?

I downloaded a dataset from OpenML and wanted to load it in pandas. What is the best way to do this?
If I use the get_arff() function, for some reason the class values are not included, even though they are in the ARFF file stored on disk. Maybe I have the wrong ARFF reader installed?

dataset = openml.download_dataset(61)
iris = dataset.get_arff()
iris = pd.DataFrame(iris['data'], columns=[attribute[0] for attribute in iris['attributes']])
print(iris[:10])
   sepallength  sepalwidth  petallength  petalwidth  class
0          5.1         3.5          1.4         0.2      0
1          4.9         3.0          1.4         0.2      0
2          4.7         3.2          1.3         0.2      0
3          4.6         3.1          1.5         0.2      0
4          5.0         3.6          1.4         0.2      0
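In case it helps, here is a sketch that bypasses get_arff() and loads the cached file directly with liac-arff, which keeps nominal values (including the class labels) as strings:

import arff  # liac-arff
import pandas as pd

# Load the on-disk ARFF file directly; dataset.data_file is the cached path.
with open(dataset.data_file) as fh:
    decoded = arff.load(fh)
iris = pd.DataFrame(decoded['data'],
                    columns=[attribute[0] for attribute in decoded['attributes']])
print(iris[:10])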

openml_run broken?

This does not work anymore. It requires a get_train_and_test_set() method on Task, but that method does not exist.
It was working a few days ago.


AttributeError                            Traceback (most recent call last)
in ()
      2
      3 clf = ensemble.RandomForestClassifier()
----> 4 prediction_path, description_path = openml_run(task, clf)
      5 print("RandomForest has run on the task.")
      6 print("Predictions stored: %s" % prediction_path)

/Users/joa/anaconda/lib/python3.5/site-packages/openml-0.2.1-py3.5.egg/openml/autorun.py in openml_run(task, classifier)
    197     for f in range(0, nr_folds):
    198         start_time = time.time()
--> 199         TrainX, TrainY, TestX, TestY = task.get_train_and_test_set(f, r)
    200         _, test_idx = task.get_train_test_split_indices(f)
    201

AttributeError: 'OpenMLTask' object has no attribute 'get_train_and_test_set'

Can't get dataset 61 (iris)

On version 20f0292:

dataset = openml.datasets.get_dataset(61)

KeyError                                  Traceback (most recent call last)
in ()
----> 1 dataset = openml.datasets.get_dataset(61)
      2
      3 print("This is dataset '%s', the target feature is called '%s'" % (dataset.name, dataset.default_target_attribute))
      4 print("URL: %s" % dataset.url)
      5 print(dataset.description[:500])

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in get_dataset(did)
    237                          "cast to an Integer.")
    238
--> 239     description = get_dataset_description(did)
    240     arff_file = _get_dataset_arff(did, description=description)
    241

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in get_dataset_description(did)
    253
    254     try:
--> 255         return _get_cached_dataset_description(did)
    256     except (OpenMLCacheException):
    257         try:

/opt/conda/envs/open-ml/lib/python3.5/site-packages/openml/datasets/functions.py in _get_cached_dataset_description(did)
     82             continue
     83
---> 84     return xmltodict.parse(dataset_xml)["oml:data_set_description"]
     85
     86     raise OpenMLCacheException("Dataset description for did %d not "

KeyError: 'oml:data_set_description'
Iris v. 3 (http://www.openml.org/d/969) works fine.

Incorrect internal calls?

The Task entity calls the get_pandas function of the Dataset entity (in the get_X_and_Y function), which does not exist (it should probably be get_dataset).
Specifically, calling the code below gave me the error:

    from openml.apiconnector import APIConnector
    connector = APIConnector(apikey='key')
    task = connector.download_task(59)
    x1, y1, x2, y2 = task.get_train_and_test_set()

Problem with APIConnector.upload_flow (?)

I want to upload a flow description programmatically, using the APIConnector.upload_flow method.

I currently call the method with the description 'test description' and the absolute path to the file. The return value I get is

>>> connector.upload_flow('test description', abs_path)
(200, <Response [200]>)

with response.content containing

<oml:error xmlns:oml="http://openml.org/openml">
  <oml:code>163</oml:code>
  <oml:message>Problem validating uploaded description file</oml:message>
  <oml:additional_information>Start tag expected, '&lt;' not found.
XML does not correspond to XSD schema.</oml:additional_information>
</oml:error>

So the request itself succeeds, but the server rejects the file. I don't see what is wrong with it, though: oddly enough, I can upload the same file through the API docs, see it show up under my flows on the OpenML website, and get the new flow id as a response.
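One possible culprit (a guess, not verified): if the XML ends up in an ordinary form field instead of a file part, the server receives an escaped or empty body, which would explain the "Start tag expected, '<' not found" message. A sketch of a multipart upload with requests (the endpoint URL and the 'description' field name are assumptions, not taken from the connector code):

import requests

# Hypothetical sketch: send the flow XML as a multipart file part so the body
# reaches the server unescaped. Endpoint and field name are assumptions.
with open(abs_path, 'rb') as fh:
    response = requests.post(
        'https://www.openml.org/api/v1/flow',
        params={'api_key': 'key'},
        files={'description': fh},
    )
print(response.status_code, response.content)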

Anyone got ideas?

Absolute Path required when uploading run

When uploading a run (and probably other OpenML objects as well), you need to provide files, such as a run description or a predictions file. The upload_run function currently requires an absolute path to these files (this is also true in the feature/Upload_Fixes branch). This is caused by the following check:

if os.path.isabs(path) and os.path.exists(path):
    ...  # do file things

To me, it would make sense to change that to:

if not os.path.isabs(path):
    path = os.path.abspath(path)
if os.path.exists(path):
    ...  # do file things

So it would accept both absolute and relative paths. Is there a particular reason this wasn't done?

unexpected result of get_task_list

It gets all of the classification tasks. If we want to limit server load, maybe getting all tasks of one type is not good either. If we don't care about server load, I think we should get all tasks.

submodules or not?

I'm not sure if we want to keep the API split into submodules or not.
Currently you can do the following:

import openml
connector = openml.APIConnector()
dataset = openml.datasets.download_dataset(connector, id=0)

or

from openml import APIConnector
from openml.datasets import download_dataset
connector = APIConnector()
dataset = download_dataset(connector, id=0)

One alternative would be to flatten out the datasets, runs, etc. submodules:

import openml
connector = openml.APIConnector()
dataset = openml.download_dataset(connector, id=0)

and

from openml import APIConnector, download_dataset
connector = APIConnector()
dataset = download_dataset(connector, id=0)

This is only an API decision and doesn't change the layout of the files on disk.

We could also shorten the names by dropping the dataset part:

import openml
connector = openml.APIConnector()
dataset = openml.datasets.download(connector, id=0)

That doesn't work that well with the from imports, though.
Lastly, we could make them static methods on the classes:

from openml import APIConnector, OpenMLDataset
connector = APIConnector()
dataset = OpenMLDataset.download(connector, id=0)

The benefit of not flattening out the submodules is that auto-completion (which I use A LOT in the notebook) is much more explorable. The downside is that there is more to type.
The class methods are maybe sort of an intermediate?

  • Note that I could also do:
from openml import datasets
datasets.download(...)

I'm not saying we need to change something, I'm just saying that if we want to change something, now would be a good time.

Use new travis-ci API

Currently, the builds fail because the tests are run on something called legacy infrastructure.

Return the train-test splits from a task

Tasks currently have a get_train_and_test_set method to return the train and test set for a specific CV repeat/fold. However, I see no way to get the number of repeats and folds. Can we please either:

  • add functions to return the CV specs (i.e. the number of folds and repeats), or
  • return the whole set of train-test splits as a numpy array (a tensor of repeat x fold x split)? A sketch of the first option follows this list.
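A minimal sketch of the first option (get_split_dimensions is a hypothetical name, and the repeat keyword on get_train_test_split_indices is an assumption, not current behavior):

# Hypothetical interface: expose the CV specification on the task, then
# iterate every repeat/fold pair to collect all train-test splits.
n_repeats, n_folds = task.get_split_dimensions()
splits = [
    task.get_train_test_split_indices(fold=fold, repeat=repeat)
    for repeat in range(n_repeats)
    for fold in range(n_folds)
]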

make travis ci use agg (?) backend?

If we want to keep the notebooks, we probably want them to plot using a backend that doesn't require X. Currently we can't plot at all because the default backend is not supported.
Not sure if all this is worth the hassle.
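For reference, selecting the Agg backend is a one-liner, as long as it runs before pyplot is imported:

import matplotlib
matplotlib.use('Agg')  # headless backend; renders without an X server
import matplotlib.pyplot as plt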

rename repo

This repo has a very bad name. I realize the other repos have similarly bad names, but I think this is a really bad idea. The repo is just named python, so if I fork it, I'll have a repo called python, and if I check it out, it will be checked out in a folder called `python`. That's really, really weird.

fix flow specification!

It should be possible to reload a flow on a different system and rerun the model.
Brownie points if it works with pipelines.
This should check that the versions of the important libraries are the same and warn otherwise; a minimal sketch of such a check is below.
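A minimal sketch of the version check, assuming the flow records the scikit-learn version it was created with (the recorded_version argument is hypothetical; the real flow metadata may look different):

import warnings

import sklearn

def check_flow_dependencies(recorded_version):
    """Warn if the installed scikit-learn differs from the recorded one.

    Hypothetical helper: recorded_version would come from the flow's
    stored metadata.
    """
    if sklearn.__version__ != recorded_version:
        warnings.warn(
            "Flow was created with scikit-learn %s, but %s is installed; "
            "rerunning the model may give different results."
            % (recorded_version, sklearn.__version__)
        )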
