I'm working on a function that lets users run machine learning experiments through the OpenML API more easily. For every run, the ARFF file that contains the result should also contain the names of the class labels. Currently there is no way to retrieve class labels directly from an OpenML dataset object.
In the code below, let's assume that 'dataset' is an OpenML Dataset entity of the iris dataset, so it should have the class labels {Iris-setosa,Iris-versicolor,Iris-virginica}.
What I would like is (something like) this:
>>> dataset.class_labels
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Instead, I currently do this (which works):
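import arff  # liac-arff

arffFileName = dataset.data_file
with open(arffFileName) as fh:
    # Decodes the whole file, data rows included.
    arffData = arff.ArffDecoder().decode(fh)
# 'attributes' is a list of (name, type) pairs; a nominal type is a list of its values.
dataAttributes = dict(arffData['attributes'])
dataAttributes['class']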
That is, I open the associated ARFF file and decode it to find the class label information. After executing the code above, dataAttributes['class'] is exactly ["Iris-setosa", "Iris-versicolor", "Iris-virginica"].
At first I thought I would simply change the initialization of the dataset and add a class_labels attribute, but looking into it raised the following questions:
- In apiconnector.py line 456, which currently initializes the dataset objects, the dataset object is built from the dataset description. Class labels are not contained in the dataset description XML file, only in the data ARFF file, so retrieving them means opening the ARFF file. First, this seems to clash with the design so far, in which the dataset can be constructed from the description alone (and an already cached ARFF file does not need to be opened). Second, from the comment on apiconnector.py line 536 I gather that ARFF files can be large enough that we do not want to load them into memory unless necessary. Opening a potentially colossal file just to find the class attribute might be too much overhead, though since the attributes are listed before the data, I think a lazy read that stops at the data section would avoid much of it. So should the class labels be stored in the description, or should the ARFF file be read upon construction of the dataset?
- Not all datasets have class labels (for example, datasets for unsupervised learning problems), so in that sense it is not a generic dataset property. For those datasets, I'm in favor of just setting class_labels to None, but perhaps there are reasons to return an empty list instead. Alternatively, this could be an argument against having a class_labels property at all.
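To make the lazy-read idea concrete, here is a minimal sketch of what I have in mind. It uses only the standard library and reads the ARFF header line by line, stopping at the @data marker so the data rows are never loaded. The names read_class_labels and DatasetSketch are hypothetical and not part of the current code base; only data_file comes from the existing dataset object. The parser covers simple headers only (one attribute per line, optionally single-quoted nominal values), and it returns None when the target attribute is missing or not nominal, which matches the None option above.

```python
import csv
import re

# Matches "@attribute <name> <type>"; the name may be quoted.
_ATTRIBUTE = re.compile(r"""^@attribute\s+('[^']*'|"[^"]*"|\S+)\s+(.*)$""", re.IGNORECASE)


def read_class_labels(arff_path, target="class"):
    """Return the nominal values of `target`, or None if absent or not nominal."""
    with open(arff_path) as fh:
        for line in fh:  # iterating the file reads one line at a time
            line = line.strip()
            if line.lower().startswith("@data"):
                break  # header is over; the data rows are never read
            match = _ATTRIBUTE.match(line)
            if match is None:
                continue  # comment, blank line, or @relation
            name = match.group(1).strip("'\"")
            spec = match.group(2).strip()
            if name != target:
                continue
            if not (spec.startswith("{") and spec.endswith("}")):
                return None  # target exists but is numeric, string, etc.
            # A nominal spec looks like {a,b,c}; values may be single-quoted.
            values = next(csv.reader([spec[1:-1]], quotechar="'", skipinitialspace=True))
            return [value.strip() for value in values]
    return None


class DatasetSketch:
    """Hypothetical stand-in for the dataset object (assumption, not the real class)."""

    def __init__(self, data_file, target="class"):
        self.data_file = data_file
        self.target = target
        self._class_labels = None
        self._labels_loaded = False

    @property
    def class_labels(self):
        # The ARFF file is opened on first access only, not at construction.
        if not self._labels_loaded:
            self._class_labels = read_class_labels(self.data_file, self.target)
            self._labels_loaded = True
        return self._class_labels
```

With something like this, the dataset could still be constructed from the description alone, and the ARFF header would only be touched when class_labels is first requested. Whether the target attribute is always called 'class' is an assumption here; in practice its name would probably have to come from the dataset description.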
If we can reach a conclusion, I will try to implement it on my feature branch, feature/script (which I admit is poorly named; I will rename it to feature/run or feature/autorun). Looking forward to hearing your thoughts.
Kind regards,
Pieter Gijsbers