
christiansch / skml

scikit-learn compatible multi-label classification

Home Page: http://skml.readthedocs.io/en/latest/

License: MIT License

Languages: Makefile 0.70%, Python 99.30%

Topics: machinelearning, multi-label-classification, scikit-learn, machine-learning, machine-learning-library, multi-label-learning, multi-label-problem, python, python3, artificial-intelligence, data-science, data-mining, multi-label

skml's Introduction

skml

[Travis CI build badge: https://travis-ci.org/ChristianSch/skml, branch master]

scikit-learn compatible multi-label classification implementations.

A multi-label classification (MLC) problem is given if a subset of labels y ⊆ L (rather than a single label) is to be predicted for each example.

Currently Supported

  • Problem Transformations:
    • Binary Relevance
    • Label Powerset
    • Classifier Chains
    • Probabilistic Classifier Chains
  • Ensembles:
    • Ensemble Classifier Chain
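
To make the transformation idea concrete, here is a minimal sketch of binary relevance written against plain scikit-learn; it only illustrates the concept and does not use or mirror skml's own classes (all names below are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

class SimpleBinaryRelevance:
    """Illustrative only: fit one independent binary classifier per label column."""

    def __init__(self, base_estimator=LogisticRegression):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # y is an (n_samples, n_labels) binary indicator matrix.
        self.estimators_ = []
        for j in range(y.shape[1]):
            clf = self.base_estimator()
            clf.fit(X, y[:, j])
            self.estimators_.append(clf)
        return self

    def predict(self, X):
        # Re-assemble the per-label predictions into an indicator matrix.
        return np.column_stack([clf.predict(X) for clf in self.estimators_])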

Installation

For production use, install via pip: ` pip install skml `

For development, clone this repo and, inside the skml directory, run the following:

pip install -e .[dev]
python setup.

Supported Python Versions

Due to dependencies, we do not check for a working distribution of skml on the following Python versions:

  • 3.2

skml's People

Contributors

christiansch

Forkers

asdhob will241 bw996

skml's Issues

travis doesn't work 😢

Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.12/bin/green", line 11, in <module>
    sys.exit(main())
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/cmdline.py", line 75, in main
    result = run(test_suite, stream, args, testing)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/runner.py", line 92, in run
    targets = [(target, manager.Queue()) for target in toParallelTargets(suite, args.targets)]
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 60, in toParallelTargets
    proto_test_list = toProtoTestList(suite)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 45, in toProtoTestList
    toProtoTestList(i, test_list, doing_completions)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 45, in toProtoTestList
    toProtoTestList(i, test_list, doing_completions)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 45, in toProtoTestList
    toProtoTestList(i, test_list, doing_completions)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 45, in toProtoTestList
    toProtoTestList(i, test_list, doing_completions)
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 37, in toProtoTestList
    getattr(suite, exception_method)()
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 235, in testFailure
    raise ImportError(message)
ImportError: Failed to import test.test_br computed from filename /home/travis/build/ChristianSch/skml/test/test_br.py
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/green/loader.py", line 212, in loadFromModuleFilename
    __import__(dotted_module)
  File "/home/travis/build/ChristianSch/skml/test/test_br.py", line 18, in <module>
    X, y = load_dataset('yeast')
  File "/home/travis/build/ChristianSch/skml/skml/datasets/load_datasets.py", line 17, in load_dataset
    data = fetch_mldata('yeast')
  File "/home/travis/virtualenv/python2.7.12/lib/python2.7/site-packages/sklearn/datasets/mldata.py", line 142, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 467, in error
    result = self._call_chain(*args)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 654, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/python/2.7.12/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
make: *** [test] Error 1

about scikit-multilearn

Hi there!

I am wondering if you are one of the original owners/maintainers of scikit-multilearn, and if this repository will carry that project over. It seems that the former is already inactive: some pull requests are not being answered, multiple issues remain open, and dependencies are not managed properly for all Python versions (especially graph-tools and MEKA's Java dependency).

I am just wondering what the vision for that library would be. I believe that providing a multi-label classification library in Python, to augment scikit-learn, is very useful.

I'd also like to contribute in whatever way I can. I am currently a graduate student doing research in bioinformatics (hence the interest in multi-label classification). I'd be happy to help restructure the project, write docs, refactor some code, and clean up some unit tests.

Perhaps we can start by managing all dependencies to create a successful travis build. It seems that graph-tools is tricky, and the docker image provided is only for Python 3. In addition to that, there's also the MEKA extension to take care of.

Maybe we could omit support for these features for a while and focus on the "easier" ones first?

Thank you so much; I'd really love to help out with this project and find people who are still interested in continuing to maintain scikit-multilearn.

ModuleNotFoundError: No module named 'skml'

How do I install skml? I get the error `ModuleNotFoundError: No module named 'skml'` when I try to run the code. When I use `pip install skml` I get the error:

ERROR: Could not find a version that satisfies the requirement skml (from versions: none)
ERROR: No matching distribution found for skml
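
A possible workaround (an assumption, not from the original thread): since no release appears to be published on PyPI, installing directly from the GitHub repository should work:

pip install git+https://github.com/ChristianSch/skml.git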

python3: module "future" not found

File "$HOME.virtualenvs/skml/lib/python3.6/site-packages/skmultilearn/dataset.py", line 2, in <module>
    from future import standard_library
ModuleNotFoundError: No module named 'future'
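
A plausible fix (an assumption, the thread does not confirm it): `future` is a separate dependency pulled in by skmultilearn and can be installed explicitly:

pip install future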

implement label space down sampling

Just like in the original PCC paper, we'd like to introduce an easy way to remove labels from a given label vector. A few methods come to mind (a sketch of the most-frequent variant follows the list):

  • by-threshold: only retain labels that occur in, say, 95% of the instances
  • most-frequent: keep only the top-k labels that occur most frequently (see the PCC paper)
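
A minimal sketch of the most-frequent variant, assuming y is a NumPy binary indicator matrix (the function name is illustrative, not skml's API):

import numpy as np

def downsample_labels_most_frequent(y, k):
    """Return y restricted to its k most frequent labels, plus the kept column indices."""
    counts = y.sum(axis=0)                 # how often each label occurs across instances
    keep = np.argsort(counts)[::-1][:k]    # column indices of the top-k most frequent labels
    return y[:, keep], keep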

datasets

Currently load_dataset isn't really helpful. I guess mirroring the datasets and having a custom load method that does not depend on skmultilearn would be good, as the current one is broken.
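
A rough sketch of what such a self-contained loader could look like, assuming the mirrored datasets are stored as local CSV files with the label columns last (file layout and signature are assumptions, not the current implementation):

import numpy as np

def load_dataset(path, n_labels):
    """Load a mirrored multi-label dataset from a local CSV file.

    Assumes the last n_labels columns hold the binary label indicators,
    e.g. n_labels=14 for the yeast dataset.
    """
    data = np.loadtxt(path, delimiter=",")
    X = data[:, :-n_labels]
    y = data[:, -n_labels:].astype(int)
    return X, y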

ECC: warnings on tests

/home/user/Documents/dev/skml/skml/ensemble/ensemble_classifier_chains.py:93: RuntimeWarning: invalid value encountered in greater_equal
  return (out >= self.threshold).astype(int)
/home/user/Documents/dev/skml/skml/ensemble/ensemble_classifier_chains.py:93: RuntimeWarning: invalid value encountered in true_divide
  out = preds / W_norm
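
Both warnings point at NaNs appearing when W_norm contains zeros; a hedged sketch of a guard, using dummy arrays and the variable names from the warnings (the thresholding step is assumed):

import numpy as np

# Dummy stand-ins for the ensemble's summed votes and per-label weight norms.
preds = np.array([[0.6, 0.0], [1.2, 0.4]])
W_norm = np.array([2.0, 0.0])   # a zero norm is what triggers the warnings
threshold = 0.5

# Divide only where the norm is non-zero, so no NaN reaches the comparison.
out = np.divide(preds, W_norm, out=np.zeros_like(preds), where=(W_norm != 0))
labels = (out >= threshold).astype(int)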

parallelization of predict methods

Well, when running any classifier (including PCC), fitting finishes in no time; the predictions, however, take quite some time and, in typical Python fashion, run on only a single CPU core. It's 2018 though, and most processors have around 4-6 cores plus a bunch of threads, so we should make use of them.
This issue tries to clarify if and how we can utilize multi-core CPUs properly.

The first idea is to parallelize the predict methods just like sklearn does, via Parallel:

        all_importances = Parallel(n_jobs=self.n_jobs,
                                   backend="threading")(
            delayed(getattr)(tree, 'feature_importances_')
            for tree in self.estimators_)
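
Applied to skml, the same pattern could look roughly like the standalone sketch below; the chains argument and the averaging/thresholding step are assumptions, not skml's current code:

import numpy as np
from joblib import Parallel, delayed

def predict_parallel(chains, X, threshold=0.5, n_jobs=-1):
    """Run each fitted chain's predict on its own worker, then average the votes."""
    all_preds = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(chain.predict)(X) for chain in chains
    )
    out = np.mean(all_preds, axis=0)
    return (out >= threshold).astype(int)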

refactoring

  • rename estimator to base_estimator maybe?
