openml / openml Goto Github PK

Open Machine Learning

License: BSD 3-Clause "New" or "Revised" License

CSS 33.25% JavaScript 29.39% Processing 1.55% PHP 35.34% HTML 0.06% Shell 0.05% Hack 0.33% Batchfile 0.01% Python 0.02% Dockerfile 0.01%

machine-learning open-science science citizen-scientists collaboration opendata datasets hacktoberfest

openml's Introduction

OpenML: Open Machine Learning

Welcome to the OpenML GitHub page! 🎉

Contents:

Who are we?
What is OpenML?
Get involved

Who are we?

We are a group of people who are excited about open science, open data and machine learning. We want to make machine learning and data analysis simple, accessible, collaborative and open with an optimal division of labour between computers and humans.

What is OpenML?

Want to learn about OpenML or get involved? Please do and get in touch in case of questions or comments! 📨

Getting started:
- Check out the OpenML Website to get a first impression of what OpenML is
- The OpenML Documentation page gives an introduction in details and features, as well as
- OpenML's different APIs and integrations so that everyone can work with their favorite tool.
How to contribute: https://github.com/openml/OpenML/blob/master/CONTRIBUTING.md
Citation and Honor Code: https://www.openml.org/terms
Communication / Contact: https://github.com/openml/OpenML/wiki/Communication-Channels

OpenML is an online machine learning platform for sharing and organizing data, machine learning algorithms and experiments. It is designed to create a frictionless, networked ecosystem, that you can readily integrate into your existing processes/code/environments, allowing people all over the world to collaborate and build directly on each other’s latest ideas, data and results, irrespective of the tools and infrastructure they happen to use.

As an open science platform, OpenML provides important benefits for the science community and beyond.

Benefits for Science

Many sciences have made significant breakthroughs by adopting online tools that help organizing, structuring and analyzing scientific data online. Indeed, any shared idea, question, observation or tool may be noticed by someone who has just the right expertise to spark new ideas, answer open questions, reinterpret observations or reuse data and tools in unexpected new ways. Therefore, sharing research results and collaborating online as a (possibly cross-disciplinary) team enables scientists to quickly build on and extend the results of others, fostering new discoveries.

Moreover, ever larger studies become feasible as a lot of data are already available. Questions such as “Which hyperparameter is important to tune?”, “Which is the best known workflow for analyzing this data set?” or “Which data sets are similar in structure to my own?” can be answered in minutes by reusing prior experiments, instead of spending days setting up and running new experiments.

Benefits for Scientists

Scientists can also benefit personally from using OpenML. For example, they can save time, because OpenML assists in many routine and tedious duties: finding data sets, tasks, flows and prior results, setting up experiments and organizing all experiments for further analysis. Moreover, new experiments are immediately compared to the state of the art without always having to rerun other people’s experiments.

Another benefit is that linking one’s results to those of others has a large potential for new discoveries (see, for instance, Feurer et al. 2015; Post et al. 2016; Probst et al. 2017), leading to more publications and more collaboration with other scientists all over the world.

Finally, OpenML can help scientists to reinforce their reputation by making their work (published or not) visible to a wide group of people and by showing how often one’s data, code and experiments are downloaded or reused in the experiments of others.

Benefits for Society

OpenML also provides a useful learning and working environment for students, citizen scientists and practitioners. Students and citizen scientist can easily explore the state of the art and work together with top minds by contributing their own algorithms and experiments. Teachers can challenge their students by letting them compete on OpenML tasks or by reusing OpenML data in assignments. Finally, machine learning practitioners can explore and reuse the best solutions for specific analysis problems, interact with the scientific community or efficiently try out many possible approaches.

Get involved

OpenML has grown into quite a big project. We could use many more hands to help us out 🔧.

You want to contribute?: Awesome! Check out our wiki page on how to contribute or get in touch. There may be unexpected ways for how you could help. We are open for any ideas.
You want to support us financially?: YES! Getting funding through conventional channels is very competitive, and we are happy about every small contribution. Please send an email to [email protected]!

GitHub organization structure

OpenML's code distrubuted over different repositories to simplify development. Please see their individual readme's and issue trackers of you like to contribute. These are the most important ones:

openml/OpenML: The OpenML web application, including the REST API.
openml/openml-python: The Python API, to talk to OpenML from Python scripts (including scikit-learn).
openml/openml-r: The R API, to talk to OpenML from R scripts (inclusing mlr).
openml/java: The Java API, to talk to OpenML from Java scripts.
openml/openml-weka: The WEKA plugin, to talk to OpenML from the WEKA toolbox.

openml's People

Contributors

Stargazers

Watchers

Forkers

schevalier mehdijamali rm3l vkramanuj jaksmid barrygolden stevenlol zardaloop anukat2015 rachelmsi sevenihust luwangbear strategist922 artklochkov tguillemot mutual-ai feilong0309 benjamesbabala xsongx kristinmcleod claudioapose jankneumann arlindkadra nodechef lawrennd ducnguyen77 muhabdulhaq zerojuls clustersdata cosmologist10 till-tomorrow codeaudit mmbabol longhua8800w amueller diehumblex dursunkoc rquintino georggr sfinxcz propixel-prc kumarveer chkoar mardillu ledell nanaakwasiabayieboateng rajasekharponakala liangtianumich mhpaler kavap xeransis adsbb pseudotensor alenwon ji-zhang evangelosdaniil eggachecat nadiaoliveira emailhy bin2000 thomascherickal liuweiping2020 passysosysmas cognitojayant 07avaz07 shawndegroot qiaoxingli ltdaovn mahamarif meissnereric ondrocks solversa mkim2001 pu55yf3r nolll77 nurur akashmavle5 johnfelipe marcingrze yogeshchaudhary7 mitagr andy-brainome risecai 4ai-vault genostack itsomsarraf decentralised-ai labxr eeinz sujaanr

openml's Issues

Better Dataset search and organisation

Make Datasets searchable for

2 Class Classfication / Multiclass Classification / Regression
n (NumberOfInstances)
Baseline Accuracy (DefaultAccuracy)
...

And / or provide a table with the most essential data features for each available dataset. This would also be my prefered overview when clicking on Search -> Datasets.

Error 207

Hi,

I'm having trouble uploading a run. I keep on getting error 207:
"File upload failed. One of the files uploaded has a problem"

Is it possible to provide more information? E.g., which file has a problem or even what the problem is. I think it's the output_files (in my case, a single .arff. Which format is expected?)...

Thanks in advance,
Dominik

Example for performance measures not correct

See:
http://expdb.cs.kuleuven.be/expdb/api/index.php#openml_evaluation_measures

Naming format is wrong here, measure names should be lower case.

Also: Why have this twice anyway? Probably best to remove the example output, just provide a link to the api call, this gives all needed info.

Also: area_under_ROC_curve
should probably be area_under_roc_curve

Dead links in online docs

https://github.com/openml/OpenML

Then click:

Service: openml.authenticate

Service: openml.data.upload

There are probably a few more!

Evaluation measures: undocumented / unclear

http://expdb.cs.kuleuven.be/expdb/api/?f=openml.evaluation.measures

For quite a lot of the measures it is unclear what they mean exactly, or it is, but it does not make sense to ask the client to optimize them in a task.

Examples:
a) How is kohavi_wolpert_bias_squared defined exactly?

b) Ho is the client supposed to optimize for "confusion_matrix"?

Solution:

Document measures, at least by providing links to definitions.
Remove the ones which do not make sense or explain to the user why they do.

CSV export of SQL query results does not work

My CSV file is always empty, just run any query.
I tried:
select * from implementation

Strange tasks "weka.RemovePercentage"

When I am searching for tasks I see these:

iris-weka.RemovePercentage-P:20

What are they? Should they be removed?

Best,

Bernd

Display implementation / run uploader

Hi,

Search - Datasets - select one - select a run / impl

If you click on "General information" of the implementation it would be nice to see the uploader displayed.

Yes, minor point for now.

Might probably be relevant of similar displays for other objects as well?

Server-side versioning

Right now, we have user-defined versioning for datasets and implementations, which means that users have to keep track of versions and have to select/invent a versioning system which will lead to a variety of versioning schemes on the server.

It would be better if OpenML could take care of versioning.

We can then remove the version field altogether. The user just provides a name for his dataset/implementation, the server then checks if that name exists, and if not, assigns version number 1 and stores a hash computed on the uploaded code. If the dataset/implementation is uploaded again, and the hash has changed, a new version number is assigned and the new id is returned.

Comments, please :)

More 404s

http://openml.org/learn
Sharing a run

Both links in: (Response / XSD Schema)

Returned file: Response

The response file returned implementation description file depending on the task type. For supervised classification, the API will also compute evaluation measures based on uploaded predictions, and return them as part of the response file. See the XSD Schema for details.

Extend output for classification/regression task: models

Allow to return a model built on the input data. This is useful for people actually interested in what is hidden in the input data. We don't want to force people to use PMML, so a model can be anything, such as a WEKA model file (.model) or an R data object (binary). Ideally, we can catalogue commonly used model formats (i.e. 'Weka model', 'R data object', ...) and describe then on the webpage, so that people know what to do with these model files.

I would propose to make this an option output for the classification/regression task, thus:
'model' -> POSTed file with the model.
'model_format' -> a string with the model format. Can be free text, people can add a description afterwards on the website.

API call missing? How to get performance metrics / predictions for a run?

For a given task, would like to get (on the client)

a) What runs / implemenations are available?

b) What are their performances metric values?

c) Get the complete prediction. Would be sufficient for just a selected implementation / run, because I could always loop thru this.

Do not used reserved names

In some of the tables reserved names like "class" and "type" are being used. This is a bad practice because it can cause some conflicts in some development languages. Ruby for example where the word class in the "Algorithm" table overrides the default ruby "Class".

My request is renaming of the columns "class" to something like "algorithm_class" and the columns type (tables: cvrun, task_type_estimation_procedure, math_function, experiment_variable, queries, task_type_prediction_feature) to something else.

Suggested interface for implementations

Implementations can currently be uploaded in many different ways. While this makes it easier for users to upload implementations, it makes it harder for other users to download and use those implementations. Hence, it would be good to define an interface for uploaded implementations that is simple enough for uploaders to provide, and that will allow downloaders to easily run the algorithm. It also allows us to provide further services on OpenML, such as automatically running implementations on the server.

We won't enforce this interface, but suggest it as a 'best practice', and state it as a prerequisite for more advanced OpenML services. We should adhere to it for our own plugins and provide clear examples for users to look at.

As usual, in what follows an implementation can be an script, program or workflow depending on its environment.

The interface:

An implementation should accept an 'OpenML task object' as an input (next to other inputs/parameters)
The implementation should return at least the outputs expected by the task type.
The implementation does not need to communicate with the OpenML server.
It will be used within an environment that constructs the task object (e.g., from a task_id), handles the outputs, and communicates with the OpenML API. This environment is typically a plugin. We will also provide standalone libraries for this (for Java, R, Python,...).
Optional(?): An implementation should be specific enough, i.e. don't write an implementation that wraps all of WEKA (e.g. takes an algorithm name as a parameter), unless of course you do some internal algorithm selection.

For the common case of running well-known library algorithms, an implementation will be a wrapper/adapter that handles the conversion from an OpenML task to the required inputs for the library algorithm and interprets its (intermediate) outputs to produce the expected outputs.

I believe it is also best that the implementation description lists the task_types that it supports. Bernd also previously suggested that implementations report which types of data they can/cannot handle.

Comments, please :)

Include tasks in dataset and implementation details

Currently, all results shown in both the implementation and dataset detail pages implicitly belong to the 'Supervised classification' task with 10x10 CV. It would be good to show that.

Maybe we should add a dropdown box showing the different tasks for which results can be returned? It is possible that the same dataset is used in more than one task.

General java lib for OpenML

This is more of a very general design question.

Would it make sense to have a general OpenML Java base lib, which contains all the common objects as Java classes and offers common functionality like downloading, parsing and uploading?

This would make very simple for next guy to connect another Java-based toolkit to OpenMl.

Or do you guys already do that?

Provide large scale data sets

For at least 1-2 projects I would like to have larger data sets on OpenML.
So with more than 10K-50K observations.

Some are available here:

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
http://mldata.org/
http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/

Issue: We might need to support another data format, especially w.r.t. sparse data.

There is HDF5.

There is also a converter:
http://mldata.org/about/hdf5/

Can you look into the general data format issue server-wise, than we can upload some data sets?

Issues/Questions about the stored implementations

Hi,

I just downloaded all implementations that are stored on the server at the moment. Therefore, I made an SQL-query and downloaded a .csv-table with names and versions of all implementations. Here are some issues/questions:

All implementations can be downloaded when id = "name(version)" is transmitted to the server, except for "weka.CfsSubsetEval(1.28)". Why's that? There are no problems using the "real", numerical ID though.
The following implementations contain the character "<" in some of the descriptions in the XML-document, so it cannot be parsed:
"weka.BVDecomposeSegCVSub(1.7)", "weka.RandomForest(1.6)", "weka.RandomTree(1.8.2.2)", "weka.classifiers.functions.LibLINEAR(5917)", "weka.classifiers.meta.RandomCommittee(1.13)", "weka.classifiers.meta.RandomSubSpace(1.4)", "weka.classifiers.trees.RandomForest(1.13)", "weka.classifiers.trees.RandomTree(5536)".
There's an empty implementation as a sub-component of "J48-forest(1.0)". I uploaded this myself, so it was probably my mistake, but shouldn't we check if an implementation is missing a non-optional slot?

The second point is obviously the most problematic one. Should it be forbidden to use "<" and ">" or are there possibilities to parse an XML that contatins these in its contents?

Wrong link

Download an implementation

"The implementation is returned by the server hosting it. This can be OpenML, but also any other code repository. Try it now"

"Try it now" links to http://expdb.cs.kuleuven.be/expdb/data/uci/nominal/anneal.arff
which is a data set.

factor variables with only 1 level/distinct value

In some data sets there are factor variables that have only one level. Sometimes there are two or more levels but all examples belong to the same level. I'm not quite sure where we should fix this. For machine learning, such a factor is useless and might lead to errors. Either the server deletes those factors or we do it locally. What do you think is better?

Parameter optimization

Currently, an uploaded result could be the result of running an implementation with default parameters, running an implementation that does internal parameter optimisation, running an implementation many times in a parameter sweep, or running an implementation with 'magically optimised' parameters.

When ranking implementations based on their evaluations, an unfair advantage will be given to parameter sweeps (data leakage).

Thus, it has been suggested that, during upload, users should flag the run with one of the following cases:

default parameters
parameter sweep
optimized parameters

With the latter, a short notice should indicate that this optimization must have been done internally using only the training set(s).

I do think that, even with default parameter settings, the parameter settings should be uploaded with the run.

Comments, please :).

Uploading the same run multiple times is possible

Hi,

during a test today I simply uploaded the same run (exactly the same object) three times and this was possible.

Do we really want this? I did not think this thru currently, mainly posting this as a question. But this is in 99% of cases a user error that we should catch I would suggest...

Uploading implementations

I have a problem uploading implementations to the server.

I downloaded weka.AODE(1.8.2.3) from the server, changed name and version in the XML file and tried to upload it agan. This doesn't work yet, I always get this error:

"Problem validating uploaded description file
XML does not correspond to XSD schema."

The XML looks like this now:

< oml:implementation xmlns:oml="http://openml.org/openml" >
< oml:name >testestest< /oml:name >< oml:version >1.0< /oml:version >< oml:description >test< /oml:description >< /oml:implementation >

What is wrong?

Old / broken data sets

These should probably be removed very soon. I think they are also flagged as "original".

                                                  name NumberOfFeatures NumberOfInstances
28                        cl2_view2_combined_and_view3                0                 0
30                                     cl2_view3_names                0                 0
35                        cl3_view2_combined_and_view3                0                 0
37                                           cl3_view3                0                 0
53 CoEPrA-2006_Classification_001_Calibration_Peptides                0                 0
55 CoEPrA-2006_Classification_002_Calibration_Peptides                0                 0
57 CoEPrA-2006_Classification_003_Calibration_Peptides                0                 0

Potential problem / question regarding impl. ids

They way I understand it:
Impl ID = name + version (Both user chosen)

When uploading, the server tells me, whether this combo is already in use and therefore not possible.

Could we please specify somewhere in the docs what chars are actually allowed for id and version? Do we really allow:
name = "Jörg's cool algorithm^2" ?

CamelCase evaluation measures

Evaluation measures in ExpDB are CamelCase. Should become lower_case.

Provide different examples of all XMLs / ARFFs to parse

Client programmers want / should check their parsers through unit tests with different examples for task.xml, dataset.xml and so on.

Therefore, the server needs to provide examples of different complexity for each of these.

Best is probably to have the server already provide them trough the standard API calls and for now just tell the client programmers how to access them. We might reserve special IDs for this "testing calls" for now, e.g.

(???) task_id = 100001 to 100005 are examples to test tasks for now (???)

new tasks?

I'm a bit confused. Task 1 used to be based on the Iris data set. Now it's annealing? Did you guys change the tasks? So,... the results that are provided by the new API call (http://www.openml.org/api/?f=openml.task.results&task_id=1) don't belong to the displayed data descriptions, right?

implementation schema vs. current implementations

Every implementation needs a name, a version and a description, but there are many implementations that do not contain all of these (most have name = version = ""). I only checked the first few algorithms, however.

Additionally, the implementation "weka.AODE(1.8.2.3)" is not parseable.

Web definition of algorithms must allow parameter definitions

We need to be able to at least specify:

Parameter name
Parameter data type

Bonus points (need not be done at once)
Simple constraints like box-constraints

data splits for task id = 1 are not valid

should be 2 reps of 10CV but is:

, , = TRAIN

  1   2   3   4   5   6   7   8   9  10

1 142 135 126 135 135 135 135 135 135 135
2 11 135 126 135 135 135 135 135 135 270

, , = TEST

  1   2   3   4   5   6   7   8   9  10

1 15 15 15 15 15 15 15 15 15 15
2 2 15 13 15 15 15 15 15 15 30

Uploading of implementations with neither source nor binary file

When thinking about uploading our first experiments, I noticed that sometimes I maybe do not want to upload either a source file or a binary file.

This mainly concerns applying "standard methods" from libraries. E.g., when I apply the libsvm implementation in the R package e1071, I only need to know the package name and the version number. Uploading the package itself (in binary or source form) makes no sense, this is hosted on the official CRAN R package server.

I could upload a very short code that uses this package and produces the desired predictions. Actually there are a few more subtle questions involed here and it might be easier to discuss them briefly on Skype, I would like to hear your opinions on this.

The question basically is, how much we want to enable users that download implementations to rerun the experiments in a convenient fashion.

Uploading implementations

Some clarification on how we are reimplementing code uploads/checks:

There will be 2 API calls:

'implementation.upload' (exists)
This call has as a required argument POST description: an XML file containing the implementation meta data. Currently, this XML file contains a field 'version', but this was ignored at upload time. The reason for this was that we don't want to force the user to provide a version number. Therefore, the server would pick a version number (1,2,3,...).
However, it often make sense for users to include some kind of versioning. For instance, if I maintain my code at GitHub I may want to add the version hash so I can revisit the code as it was at the time of upload.

Therefore, we will do the following. The description XML will have the following fields:

library (optional): the name of the library/plugin, e.g. 'WEKA', 'mlr', 'rweka', ...
name (required): the name of the implementation. User can choose freely.
version (optional): a user-defined version number, e.g. a github hash, weka versioning, a hash calculated by the plugin,...

Plugins can decide freely how to handle this. If there is a good versioning system already, use that, if not, maybe take a hash of the source code. As long as changes to the code correspond to changes in the version number.

What will happen is that the server will store this info, and then associate a 'label' to each upload (1,2,3,...) linked one-to-one to the user version number/hash. This label is merely aesthetic: in the web interface, you will see both the upload counter as the user-defined version number/hash value. If no version number is given, the server will compute a hash based on the uploaded code. The library-name-version combo will be linked to a unique implementation id.

If you try to upload an implementation with the same library, name, version the server will say that there already is an implementation with those keys, and return the id.

When you want to check what the id is of an already uploaded implementation, there will be the following api:

'implementation.getid' (or implementation.check')
arguments:
GET library_name
GET implementation_name
GET version (user-defined version number, e.g. the hash value)

Based on that info, the server will return the corresponding implementation id. If no match can be found, it will tell you that that implementation is unknown.

Sounds ok?

Cheers,
Joaquin

Cannot parse XML of data set description

Hi,

the problem seems to be the comments in tag oml:description.
Probably because there can be any kind of weird chars in there - and apparently there already are.

R does not parse the whole XML at all, but tells me:

xmlParseEntityRef: no name
xmlParseEntityRef: no name
xmlParseEntityRef: no name
Error: 1: xmlParseEntityRef: no name

If I nearly completely remove the contents of oml:description I can parse again, so the problem is definitely located there.

Any ideas?

Feature request: add 'forgot my password' link to login screen

See subject ;)

Task ID = 6 / DSD has no valid upload datae

Data set description contains:

oml:upload_date0000-00-00 00:00:00/oml:upload_date

In R this produces:

Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format

Discovered while unit testing all tasks.

Handling components of implementations

Hi all,

We're working on the WEKA-plugin and had the following question: Say you have an ensemble method, such as Bagging, and a base-learner like a decision tree.

It is currently possible to store this either as:

An implementation Bagging_J48 with parameters belonging to Bagging and J48
An implementation Bagging with a string value representing the component, e.g. "W=weka.classifiers.J48 - M=2"

I believe KNIME and Rapidminer would store these as separate subcomponents of the workflow. How are things currently handled in R? Do you use option 1 or 2?

I have a slight preference for the first method, mainly because it becomes easier to compare implementations (e.g. Bagging_J48 vs Bagging_OneR), even between environments (weka.Bagging_J48 vs KNIME.Bagging_J48_workflow), and to track the effect of parameters: I can track the effect of a J48 parameter easily without having to interpret strings.

This is indeed harder for us to implement because WEKA is kind of quirky in this area, but overall I think it makes things easier and more comparable.

Thanks,
Joaquin

anneal inconsistencies / representation of missing values

The data set description seems to be wrong. E.g., it says there are 798 instances but the data set has 898 rows.

Here you can find the same inconsistencies:
http://mldata.org/repository/data/viewslug/datasets-uci-anneal/
(tabs "summary" vs. "data")

I think this is what Bernd meant when he said someone should check all the data sets. Actually, the correctness of the data characteristics is way more important than the description. Let's check it:

[...]
NumberOfInstancesWithMissingValues: 0  
NumberOfMissingValues: 0
[...]

This is obviously wrong. I think we have to add a slot in the data set description for how missing values are signified. Also, the server should transform them into the desired representation (e.g., "NA") before computing the data qualities.

Empty data sets?

There are 43 data sets (for isOriginal = 'true') that have neither features nor instances. What's up with them? They have names like these:

cl1_view2, cl1_view2_combined, cl1_view2_combined_and_raw_data, cl1_view2_combined_and_view3, CoEPrA-2006_Classification_001_Calibration_Data, ...

Can they be deleted or labeled as "not original"?

There is another data set without features and instances called "eucalyptus". Well, at least this is what the server tells me.

JSON output of data qualities.

In the JSON output of the data qualities, no type information of the columns is given, when we directly query thru API / SQL.

Every columns has an undefined type and every value is encoded as a string, even if it is a number.

Can this be corrected?
Currently we use a trick in R so we do not have to convert manually.

Here is the API call

"http://www.openml.org/api_query/?q=SELECT%20d.name%20AS%20dataset,%20MAX(IF(dq.quality='NumberOfFeatures',%20dq.value,%20NULL))%20AS%20NumberOfFeatures,MAX(IF(dq.quality='NumberOfInstances',%20dq.value,%20NULL))%20AS%20NumberOfInstances,MAX(IF(dq.quality='NumberOfClasses',%20dq.value,%20NULL))%20AS%20NumberOfClasses,MAX(IF(dq.quality='MajorityClassSize',%20dq.value,%20NULL))%20AS%20MajorityClassSize,MAX(IF(dq.quality='MinorityClassSize',%20dq.value,%20NULL))%20AS%20MinorityClassSize,MAX(IF(dq.quality='NumberOfInstancesWithMissingValues',%20dq.value,%20NULL))%20AS%20NumberOfInstancesWithMissingValues,MAX(IF(dq.quality='NumberOfMissingValues',%20dq.value,%20NULL))%20AS%20NumberOfMissingValues,MAX(IF(dq.quality='NumberOfNumericFeatures',%20dq.value,%20NULL))%20AS%20NumberOfNumericFeatures,MAX(IF(dq.quality='NumberOfSymbolicFeatures',%20dq.value,%20NULL))%20AS%20NumberOfSymbolicFeatures%20FROM%20dataset%20d,%20data_quality%20dq%20WHERE%20d.did%20=%20dq.data%20AND%20d.isOriginal%20=%20'true'%20GROUP%20BY%20dataset"

Extend output for classification/regression task: parameters

It should be clear from an uploaded run how parameters were chosen. We previously agreed on the following three cases:

manual parameter settings (typically these are sensible defaults)
parameter sweep (try many settings in some experimental design)
optimized parameters (parameters are optimized internally by some algorithm)

We should add a field/flag to report this, e.g.
parameter_setting_type = [manual, sweep, optimized]

In cases 1 and 2, the parameter settings should be uploaded with the run. This is already supported.

In case 3, the optimized parameters are fold/repeat specific, and should thus be added to the predictions file. This can simply be an extra column in the predictions arff file. I propose a simple key-value format, maybe json, that can then be stored as a string:
{"parameter_name_1":0.4, "parameter_name_2":123}

We can thus extend the classification/regression task with the following:

a parameter_setting_type string field, see above
an extra optional column in the prediction arff file

Task search does not work

Tried to list all tasks on openml.org

Search -> Tasks -> Supervised Classif

Hit Search to list all tasks: Server error
I then typed "iris" in "Datasets": Server error

OpenML 1.0

Things are coming together nicely, but there are also many new things planned. Bernd suggested we define what features should be in a 1.0 version, and finish that as soon as possible, making sure it works so that we can really start spreading the word.

I'm just making a list here, most of which is already done. Feel free to add/remove. Paraphrasing Linus Torvalds, 'suggestions are welcome, but we won't promise we'll implement them :-)'.

Website

Search, overview of tasks and task types, datasets, code, runs.
Pages with all details on individual dataset, tasks, code, with discussion fields.
Basic visualization of query results.
Uploading of new datasets and code, including by url.
Details for developers. Tutorial for new users.
Ability to filter datasets on properties.

Task types

Supervised classification and regression
Page on website where the requirements/options are listed in human-readable form and can be discussed (new requests, what is implemented?)

Datasets

Support for ARFF: computation of dataset properties and generation of tasks (train/test folds)
Basic check on dataset upload: feature name characters, other checks?
Pull in ARFF datasets from uci, mldata, others?

REST API (documented)

Search tasks, datasets, code
Upload datasets, code, run
Download dataset, code, task
Free SQL query

Plugins

SDK for Java, R: interface for interacting with the server from these languages
Plugin in WEKA
Plugin in R (at least mlr)
Optional?: Allow to search for/ list tasks from within plugins (not just by entering id)

Content

Start of 'new' database containing only runs on tasks. Old database will be available from another server.
Initial sweep of experiments with WEKA
Initial sweep of experiments with R

Extension: record selected features with run upload

Bernd mentioned he would like to store the features selected in a run.

I would like to start a discussion about how to do this.

Is this one set of selected features obtained when running the method over all data? Or do we want to record this for every fold/repeat?
Can this be an extra optional output file, or do we want this in the run upload xml file?

Thanks,
Joaquin

Collect web interface example queries and use cases

The already emailed example queries for the web interface should be collected in the wiki and also extended! We can only design the interface in a good way, if we collect reasonable things that people want to do with it.

implementation schema(s)

There are 2 implementation schemas.

a) https://raw.github.com/openml/OpenML/master/XML/Schemas/implementation_upload.xsd

b)
https://raw.github.com/openml/OpenML/master/XML/Schemas/implementation.xsd

I understand that a) is uploaded by the user, b) is returned when you ask for it on the server to get it.

The problem is that both share about 90% of their xml fields, but the schemas are already not the same. Could they be made consistent?

Also we noticed this:
<xs:element name="version" minOccurs="0" type="xs:string"/>

minOccurrs = 0 is wrong, isnt it?

example data set desc contains invalid upload_date

R cannot parse this, fix pls

http://expdb.cs.kuleuven.be/expdb/api/?f=openml.data.description&data_id=61

0000-00-00 00:00:00

Check column names for special characters

Hey,

I discovered a few data sets that have special characters (";", "?", ...) or spaces in some of their column names. Some of them just start with a number, which is also not okay with R.
It would be great if the server could check for those problems and resolve them somehow.