
FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

License: Other

Python 100.00%

machine-learning fastxml python multilabel-classification

fastxml's Introduction

FastXML / PFastXML / PFastreXML - Fast and Accurate Tree Extreme Multi-label Classifier

This is a fast implementation of FastXML, PFastXML, and PFastreXML based on the following papers:

"FastXML: A Fast, Accurate and Stable Tree-classifier for eXtreme Multi-label Learning" Paper
"Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications" Paper
"DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification" Paper Code

DiSMEC makes its appearance via an L2 penalty rather than an L1 which, when set with a high alpha and a sparsity eps of 0.01-0.05, can also produce sparse linear classifiers.

It's implemented in the quasi-familiar scikit-learn clf format.

Release Notes

2.0

Version 2.0 is not backward compatible with 1.x
Use model.save(path) to save models instead of cPickle
Rewrites data storage layer
Uses 50% of the memory, loads 30% faster, and runs inference 40% faster

Binary

Along with the library, this repo provides a simple script, fxml.py, which makes it easy to train and test on simple datasets.

It takes two formats: a simple JSON format and the standard extreme multi-label dataset format.
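
For readers unfamiliar with the second format, the sketch below parses it under the assumption that it follows the layout used by the Extreme Classification Repository: a header line with the number of points, features, and labels, then one line per example holding comma-separated label ids followed by space-separated feature:value pairs. The function name parse_standard_dataset is hypothetical and is not part of fxml.py.

```python
def parse_standard_dataset(lines):
    """Parse lines in the assumed repository layout.

    Returns (n_features, n_labels, rows); each row is a (labels, features)
    pair: a list of int label ids and a dict of feature index -> value.
    """
    it = iter(lines)
    n_points, n_features, n_labels = (int(tok) for tok in next(it).split())
    rows = []
    for line in it:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Labels come before the first space. An example without labels
        # starts with a space, which leaves the label part empty.
        label_part, _, feature_part = line.partition(" ")
        labels = [int(tok) for tok in label_part.split(",") if tok]
        features = {}
        for pair in feature_part.split():
            idx, value = pair.split(":")
            features[int(idx)] = float(value)
        rows.append((labels, features))
    if len(rows) != n_points:
        raise ValueError("header point count does not match the data")
    return n_features, n_labels, rows


# Made-up sample: two examples, five features, four labels.
sample = [
    "2 5 4\n",
    "1,3 0:0.5 4:1.25\n",
    " 2:2.0\n",
]
n_features, n_labels, rows = parse_standard_dataset(sample)
print(rows)
```

The second sample line has no labels, which is why the parser tolerates an empty label part.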

Standard Benchmark Datasets

As an example, to train a standalone classifier against the Delicious-200K dataset:

fxml.py delicious.model deliciousLarge_train.txt --standard-dataset --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers --no-remap-labels

To test:

fxml.py delicious.model deliciousLarge_test.txt --standard-dataset inference

JSON File

As fxml.py is intended as an easy-to-understand example of setting up a FastXML classifier, the JSON format is very simple: it is newline delimited, with one JSON object per line.

train.json:

{"title": "red dresses", "tags": ["clothing", "women", "dresses"]}
{"title": "yellow dresses for sweet 16", "tags": ["yellow", "summer dresses", "occasionwear"]}
...

It can then be trained:

fxml.py my_json.model train.json --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers

Note the omission of the flags "--standard-dataset" and "--no-remap-labels". Since the tags/classes provided are strings, fxml.py will remap them to an integer label space for training. During inference, it will map each label index back to its original string.
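
The remapping idea can be sketched in a few lines. This is not fxml.py's actual code, only an illustration of mapping string tags to a dense integer label space and back; build_label_map is a hypothetical helper and the two records are sample data.

```python
import json


def build_label_map(json_lines, field="tags"):
    """Give each distinct string tag an integer id, in first-seen order."""
    label_to_id = {}
    encoded = []
    for line in json_lines:
        record = json.loads(line)
        ids = []
        for tag in record[field]:
            if tag not in label_to_id:
                label_to_id[tag] = len(label_to_id)
            ids.append(label_to_id[tag])
        encoded.append(ids)
    # Inverse map, used to turn predicted ids back into strings.
    id_to_label = {i: tag for tag, i in label_to_id.items()}
    return label_to_id, id_to_label, encoded


train = [
    '{"title": "red dresses", "tags": ["clothing", "women", "dresses"]}',
    '{"title": "yellow dresses for sweet 16", "tags": ["yellow", "dresses"]}',
]
label_to_id, id_to_label, y = build_label_map(train)
print(y)
print([id_to_label[i] for i in y[1]])
```

The encoded y has the list-of-lists-of-ints shape that the Python API section asks for.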

Simple Python Usage

from fastxml import Trainer, Inferencer

X = [...]  # list of scipy.sparse.csr_matrix, one 1 x n_features row per example
y = [[1, 3]]  # Currently requires list[list[int]], one list of label ids per example

trainer = Trainer(n_trees=32, n_jobs=-1)

trainer.fit(X, y)

trainer.save(path)

clf = Inferencer(path)

clf.predict(X)
# or
clf.predict(X, fmt='dict')

#############
# PFastXML
#############

from fastxml.weights import propensity

weights = propensity(y)
trainer.fit(X, y, weights)

###############
# PFastreXML
###############
trainer = Trainer(n_trees=32, n_jobs=-1, leaf_classifiers=True)
trainer.fit(X, y, weights)

TODO

Run all the standard benchmark datasets against it.
Refactor. Most of the effort has been spent on speed and it needs to be cleaned up.


fastxml's Issues

Trainer.fit "Requires list of csr_matrix"

I tried to train a model with input:

X_train
<4768x31412 sparse matrix of type '<class 'numpy.float64'>'
	with 398434 stored elements in Compressed Sparse Row format>
Y_train  # of length 4768
[[52, 62, 33],
 [31],
 [71], ...]

then I run:

from fastxml import Trainer, Inferencer
trainer = Trainer(n_trees=32, n_jobs=-1)
trainer.fit(X_train, Y_train)

it gives

AssertionError                            Traceback (most recent call last)
<ipython-input-15-f463a58ca9a3> in <module>()
      1 trainer = Trainer(n_trees=32, n_jobs=-1)
      2 
----> 3 trainer.fit(X_train, Y_train)
      4 
      5 

/usr/local/lib/python3.5/dist-packages/fastxml-2.0.0-py3.5-linux-x86_64.egg/fastxml/trainer.py in fit(self, X, y, weights)
    463 
    464     def fit(self, X, y, weights=None):
--> 465         self.roots = self._build_roots(X, y, weights)
    466         if self.leaf_classifiers:
    467             self.norms_, self.uxs_, self.xr_ = self._compute_leaf_probs(X, y)

/usr/local/lib/python3.5/dist-packages/fastxml-2.0.0-py3.5-linux-x86_64.egg/fastxml/trainer.py in _build_roots(self, X, y, weights)
    381 
    382     def _build_roots(self, X, y, weights):
--> 383         assert isinstance(X, list) and isinstance(X[0], sp.csr_matrix), "Requires list of csr_matrix"
    384         if self.n_jobs > 1:
    385             f = fork_call(self.grow_root)

AssertionError: Requires list of csr_matrix

Why does it require a list of csr_matrix? What does each csr_matrix represent?
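
The assertion in the traceback checks isinstance(X, list) and isinstance(X[0], sp.csr_matrix), so one way to satisfy it is to split the matrix into a list of single-row matrices, one per example. Another report on this page does the same with X[i].astype('float32'). The sketch below wraps that in a helper; to_row_list is a hypothetical name, and the float32 cast is an assumption based on the dtype error reported elsewhere on this page.

```python
import numpy as np
import scipy.sparse as sp


def to_row_list(X, dtype=np.float32):
    """Split a 2-D dense or sparse matrix into 1 x n_features csr_matrix rows."""
    X = sp.csr_matrix(X, dtype=dtype)
    return [X[i] for i in range(X.shape[0])]


dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])
rows = to_row_list(dense)
print(len(rows), rows[0].shape, rows[0].dtype)
```

Under this reading, each csr_matrix in the list is the feature vector of one training example.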

__init__() got an unexpected keyword argument 'n_iter'

OS: Ubuntu 16.04.5 LTS
python: 3.7.1

Today I tried to train a model on a standard dataset (Wiki10-31k) downloaded from this website. I ran the command
fxml.py wiki.model wiki10_train.txt --standard-dataset --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers --no-remap-labels
and got the following error many times:
__init__() got an unexpected keyword argument 'n_iter'
I also tried installing fastxml with pip install git+https://github.com/Refefer/fastxml.git and running it through the Python wrapper, but I got the same error.
I would really appreciate any advice.

Below are the top lines of standard outputs:
10000 docs encoded
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Splitting 14145
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Splitting 5092
Splitting 6967
Training classifier
Training classifier
Splitting 7085
Splitting 7014
Splitting 5058
Training classifier
Splitting 7530
Splitting 9054
Splitting 8987
Splitting 5214
Splitting 7108
Training classifier
Splitting 7875
Training classifier
Splitting 8995
Training classifier
Splitting 2643
Training classifier
Splitting 7119
Training classifier
Training classifier
Training classifier
Splitting 6980
Training classifier
Training classifier
Splitting 5108
Splitting 6604
Training classifier
Splitting 7168
Training classifier
Splitting 6247
Training classifier
Training classifier
Training classifier
Training classifier
Splitting 2688
Training classifier
Splitting 2454
Process Process-6:
Splitting 9080
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/proc.py", line 37, in _remote_call
results = f(*args)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 322, in grow_root
node = self.grow_tree(X, y, idxs, rs, splitter)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 358, in grow_tree
lNode = self.grow_tree(X, y, l_idx, rs, splitter)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 358, in grow_tree
lNode = self.grow_tree(X, y, l_idx, rs, splitter)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 358, in grow_tree
lNode = self.grow_tree(X, y, l_idx, rs, splitter)
[Previous line repeated 2 more times]
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 336, in grow_tree
l_idx, r_idx, (clf, clff) = self.split_train(X, idxs, splitter, rs)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 317, in split_train
clf, clf_fast = self.train_clf(X, [l_idx, r_idx], rs)
File "/opt/conda/lib/python3.7/site-packages/fastxml-2.0.0-py3.7-linux-x86_64.egg/fastxml/trainer.py", line 222, in train_clf
random_state=rs)
TypeError: __init__() got an unexpected keyword argument 'n_iter'
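
Errors like this typically appear when code written against an older library passes a keyword that a newer release has dropped. For example, scikit-learn's SGD-based linear models replaced n_iter with max_iter, and n_iter was removed in version 0.21. Whether that is the class being constructed at trainer.py line 222 is an assumption, since the traceback does not show it. The sketch below is one generic way to pick whichever keyword a constructor accepts; OldStyle, NewStyle, and iteration_kwarg are hypothetical names used only for illustration.

```python
import inspect


def iteration_kwarg(cls, n, candidates=("max_iter", "n_iter")):
    """Return {name: n} for the first candidate keyword cls.__init__ accepts."""
    params = inspect.signature(cls.__init__).parameters
    for name in candidates:
        if name in params:
            return {name: n}
    raise TypeError(f"{cls.__name__} accepts none of {candidates}")


class OldStyle:
    """Stand-in for an estimator that still takes n_iter."""
    def __init__(self, n_iter=5, random_state=None):
        self.n_iter = n_iter


class NewStyle:
    """Stand-in for an estimator that takes max_iter instead."""
    def __init__(self, max_iter=1000, random_state=None):
        self.max_iter = max_iter


print(iteration_kwarg(OldStyle, 2))
print(iteration_kwarg(NewStyle, 2))
clf = NewStyle(random_state=0, **iteration_kwarg(NewStyle, 2))
```

If the cause is indeed a scikit-learn version mismatch, installing a release that still accepts the keyword would be another way around it.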

RuntimeError

Hello,

First, thank you for this implementation.

When I call Trainer.fit() with n_jobs greater than one, the following error is logged:

  File "<string>", line 1, in <module>
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/evosec/Developer/apps/awavo-predictor/src/tools/Trainer.py", line 80, in <module>
    trainer.fit(X, y)
  File "/Users/evosec/.local/share/virtualenvs/awavo-predictor-8WkHYgGA/lib/python3.8/site-packages/fastxml/trainer.py", line 465, in fit
    self.roots = self._build_roots(X, y, weights)
  File "/Users/evosec/.local/share/virtualenvs/awavo-predictor-8WkHYgGA/lib/python3.8/site-packages/fastxml/trainer.py", line 407, in _build_roots
    procs.append(f(X, y, next(idxs), rs, splitter))
  File "/Users/evosec/.local/share/virtualenvs/awavo-predictor-8WkHYgGA/lib/python3.8/site-packages/fastxml/proc.py", line 50, in f2
    p.start()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

This doesn't crash the program, but fitting stops.

pip freeze returns:

astroid==2.4.2
autopep8==1.5.4
category-encoders==2.2.2
click==7.1.2
Cython==0.29.21
fastxml @ git+https://github.com/Refefer/fastxml@03440e432e2ce9f66df286581e4d50b99ad209ef
Flask==1.1.2
future==0.18.2
isort==5.6.0
itsdangerous==1.1.0
Jinja2==2.11.2
joblib==0.17.0
lazy-object-proxy==1.4.3
MarkupSafe==1.1.1
mccabe==0.6.1
numpy==1.19.2
pandas==1.1.3
patsy==0.5.1
pycodestyle==2.6.0
pylint==2.6.0
python-dateutil==2.8.1
python-dotenv==0.14.0
pytz==2020.1
rope==0.18.0
scikit-learn==0.23.2
scipy==1.5.2
six==1.15.0
statsmodels==0.12.0
threadpoolctl==2.1.0
toml==0.10.1
Werkzeug==1.0.1
wrapt==1.12.1

TypeError: Object of type 'int64' is not JSON serializable during trainer.save('bah')

Hi,
Looking forward to using FastXML. This is not quite a bug, but it might be worth handling? Just thought I'd report it in case anyone else comes across it. JSON doesn't take numpy data types, so Y has to be changed to int when converting from numpy labels.
This is my setup:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_multilabel_classification

from fastxml import Trainer, Inferencer

X, Y = make_multilabel_classification(n_classes=10, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)

X = [X[i].astype('float32') for i in range(X.shape[0])]
X_sparse = [csr_matrix(b) for b in X]

##This line will lead to trainer.save('bah') failing
Y_list = [list(np.where(i==1)[0]) for i in Y]

##This line converts the values to ints, and then trainer.save('bah') will work down the line
Y_list = [[int(k) for k in list(np.where(i==1)[0])] for i in Y]

trainer = Trainer(n_trees=10, n_jobs=1)
trainer.fit(X_sparse, Y_list)

trainer.save('bah')

TypeError: can't pickle fastxml.splitter.Splitter objects

Hi,

X is a list of csr_matrix.
y is a list of lists (as shown below)

import numpy as np
from scipy import sparse

from fastxml import Trainer

X = np.random.rand(10, 10)
y = [[1], [2], [3], [1], [2], [3], [1], [2], [3], [1]]
XList = []
for i in range(10):
    XList.append(sparse.csr_matrix(X[i]))

X = XList
trainer = Trainer(n_trees=32, n_jobs=-1)
trainer.fit(X, y)

trainer.fit(X, y)
Traceback (most recent call last):
File "", line 1, in
File "C:\ProgramData\Anaconda3\lib\site-packages\fastxml-2.0.0-py3.6-win-amd64.egg\fastxml\trainer.py", line 465, in fit
self.roots = self._build_roots(X, y, weights)
File "C:\ProgramData\Anaconda3\lib\site-packages\fastxml-2.0.0-py3.6-win-amd64.egg\fastxml\trainer.py", line 407, in _build_roots
procs.append(f(X, y, next(idxs), rs, splitter))
File "C:\ProgramData\Anaconda3\lib\site-packages\fastxml-2.0.0-py3.6-win-amd64.egg\fastxml\proc.py", line 50, in f2
p.start()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle fastxml.splitter.Splitter objects

Propensity scored nDCG

Hello,
Could you give me directions on the Propensity scored nDCG loss function implementation?

How to calculate ndcg from the clf predictions

I am a little bit lost. I saw in bin/fxml.py that you compute nDCG along with other performance metrics. However, I don't understand the variable names used there, which makes it difficult for me to reproduce.

X = csr_matrix(X_train.values)
X = [X[i].astype('float32') for i in range(X.shape[0])]
y = [[int(k) for k in list(np.where(i==1)[0])] for i in y_train.values]

w = weights.propensity(y)    
trainer = Trainer(n_trees=32, n_jobs=-1, leaf_classifiers=True)
trainer.fit(X,y, w)
trainer.save("multilabel_default_fastreXML.h5")

X = csr_matrix(self.X_test.values)
X = [X[i].astype('float32') for i in range(X.shape[0])]
clf = Inferencer(helpers.getModelsPath() + "multilabel_default_fastreXML.h5")
y_pred = clf.predict(X)

This is what I do, where I have X_train, y_train, X_test and y_test dataframes. The prediction is working, however, I am not sure how to proceed and use your functions to get the ndcg. Any idea?
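
For reference, binary-relevance nDCG@k can be computed from a ranked list of predicted label ids and the set of true labels, as sketched below. This uses the textbook definition and is not necessarily how bin/fxml.py computes it; the function names and the sample data are illustrative, and the predictions are assumed to be already sorted from most to least relevant.

```python
import math


def ndcg_at_k(ranked_labels, true_labels, k):
    """Binary-relevance nDCG@k for a single example."""
    true_set = set(true_labels)
    if not true_set:
        return 0.0
    # Each hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, label in enumerate(ranked_labels[:k])
              if label in true_set)
    # Ideal ranking: all true labels (at most k of them) placed first.
    ideal_hits = min(k, len(true_set))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


def mean_ndcg_at_k(all_ranked, all_true, k):
    """Average nDCG@k over a test set."""
    scores = [ndcg_at_k(r, t, k) for r, t in zip(all_ranked, all_true)]
    return sum(scores) / len(scores)


ranked = [[5, 2, 9], [1, 4, 7]]   # predicted label ids, best first
truth = [[5, 9], [3]]             # true label ids per example
print(ndcg_at_k(ranked[0], truth[0], 3))
print(mean_ndcg_at_k(ranked, truth, 3))
```

With y_test converted to lists of label ids the same way y was built for training, the ranked lists would come from sorting each prediction by score.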

Why do we limit X to be a list of csr_matrix for training?

https://github.com/Refefer/fastxml/blob/master/fastxml/trainer.py#L383

What does the output prediction mean?

I tried to use the Python API:
X = [Sparse or numpy arrays]
y = [[1, 3]] # Currently requires list[list[int]]

trainer = Trainer(n_trees=32, n_jobs=-1)

trainer.fit(X, y)

trainer.save(path)

clf = Inferencer(path)

clf.predict(X, fmt='dict')

And the prediction result for one example looks like this:
[(5866, -0.40310976),
(437, -0.67100734),
(995, -0.8778681),
(2642, -1.1181042),
(5217, -1.1278155),
(5967, -1.1540765),
(7558, -1.3282802),
(4391, -1.5430373),
(4017, -1.624005),
(5781, -1.9639409),
(1347, -2.012597),
(4063, -2.0736518)

What do -0.40310976, -0.67100734, ... mean?
And how can I select the most appropriate labels from this result?
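
The README does not say what the scores represent. Assuming only that a higher score means a more relevant label, which the descending order of the output suggests, two simple ways to select labels are a fixed top-k cut and a cut relative to the best score. Both helpers below are hypothetical, and the margin value is arbitrary and would need tuning on held-out data.

```python
def top_k(pred, k):
    """Keep the k highest-scoring (label, score) pairs."""
    return sorted(pred, key=lambda pair: pair[1], reverse=True)[:k]


def within_margin(pred, margin):
    """Keep every label whose score is within `margin` of the best score."""
    if not pred:
        return []
    best = max(score for _, score in pred)
    return [(label, score) for label, score in pred if best - score <= margin]


# First five pairs from the output shown above.
pred = [(5866, -0.40310976), (437, -0.67100734), (995, -0.8778681),
        (2642, -1.1181042), (5217, -1.1278155)]
print([label for label, _ in top_k(pred, 3)])
print([label for label, _ in within_margin(pred, 0.3)])
```

The margin-based cut returns a different number of labels per example, which is one way to avoid a fixed K.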

pyx format file

Why not upload the original file?

'utf-8' codec can't decode byte 0xc0 in position

Hi, when running the simple case in Python and creating the Inferencer, I get an error in read_row saying:
'utf-8' codec can't decode byte 0xd5 in position 5: invalid continuation byte
Any ideas how to solve that?

Thanks

Is the pyx file written by hand or auto-generated?

@Refefer Thank you!!

relative import

@mlaprise @Refefer @codingsparse @siddu9501
What an awesome product you have produced!
Do you know how I can solve this error? Thanks!

root@yishai-remotedocker-0:/persistent/Sefaria-Project/ML# cd /persistent/Sefaria-Project/ML ; env /usr/local/bin/python /root/.vscode-server/extensions/ms-python.python-2020.6.91350/pythonFiles/lib/python/debugpy/launcher 35679 -- /usr/local/lib/python3.7/site-packages/fastxml/trainer.py
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.vscode-server/extensions/ms-python.python-2020.6.91350/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/root/.vscode-server/extensions/ms-python.python-2020.6.91350/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/root/.vscode-server/extensions/ms-python.python-2020.6.91350/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 267, in run_file
runpy.run_path(options.target, run_name=compat.force_str("__main__"))
File "/usr/local/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/usr/local/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/fastxml/trainer.py", line 20, in <module>
from .splitter import Splitter, sparsify, sparse_mean_64, radius
ImportError: attempted relative import with no known parent package
root@yishai-remotedocker-0:/persistent/Sefaria-Project/ML#

Setup Error: Cannot open source file: 'fastxml/splitter.cpp': No such file or directory

I am trying to run the setup.py and I am running into the following error:
I am on Windows10. Any idea how to work with this?

building 'fastxml.splitter' extension
C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\svajjala\Python36\include -IC:\Users\svajjala\Python36\include -IC:\Users\svajjala\Python36\lib\site-packages\numpy\core\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\cppwinrt" /EHsc /Tpfastxml/splitter.cpp /Fobuild\temp.win-amd64-3.6\Release\fastxml/splitter.obj -O3 -std=c++11
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
splitter.cpp
c1xx: fatal error C1083: Cannot open source file: 'fastxml/splitter.cpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.13.26128\bin\HostX86\x64\cl.exe' failed with exit status 2

Possible issue when computing unit norms

I wanted to report a small bug I found when going a bit deeper into the code.

In trainer.py there is a function for computing the unit norm of the training data. This is the function:

def compute_unit_norms(X):
    norms = np.zeros(X[0].shape[1])
    for Xi in X:
        for i, ind in enumerate(Xi.indices):
            norms[ind] = Xi.data[i] ** 2

    norms = norms ** .5
    norms[np.where(norms == 0)] = 1.0
    return norms.astype('float32')

My question is the following: shouldn't the assignment inside the inner for loop use += instead of =? The innermost line would then be

norms[ind] += Xi.data[i] ** 2

And thank you guys for sharing the library :)
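
The difference the reporter describes can be seen in a small pure-Python sketch. It does not use the library's code: rows are plain (indices, data) pairs standing in for the indices and data attributes of single-row csr_matrix objects, and column_norms with its accumulate flag is a hypothetical helper. Whether the library intends a true column norm is not confirmed here.

```python
import math


def column_norms(rows, n_features, accumulate=True):
    """L2 norm of each feature column over sparse rows.

    accumulate=True sums squares across rows (the proposed `+=`);
    accumulate=False overwrites them (the reported `=`).
    """
    sq = [0.0] * n_features
    for indices, data in rows:
        for ind, value in zip(indices, data):
            if accumulate:
                sq[ind] += value ** 2   # sum of squares over all rows
            else:
                sq[ind] = value ** 2    # only the last row seen counts
    # Zero norms are replaced by 1.0, as in the function quoted above.
    return [math.sqrt(s) if s > 0 else 1.0 for s in sq]


rows = [([0, 1], [3.0, 4.0]),
        ([0], [4.0])]
print(column_norms(rows, 3, accumulate=True))
print(column_norms(rows, 3, accumulate=False))
```

Feature 0 appears in both rows, so the two variants disagree on its norm while agreeing on the other columns.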

UnicodeDecodeError

I'm using Python3.6.5 on Ubuntu 18.04. I'm able to train a classifier using:

trainer = Trainer(n_trees = 18)
trainer.fit(X_train_list, y_train)
trainer.save('fastxml.trained')

X_train_list above is a list of csr_matrix objects.

The settings file I get looks like this:

{"n_trees": 18, "max_leaf_size": 10, "max_labels_per_leaf": 20, "re_split": 0, "n_jobs": 1, "alpha": 0.0001, "seed": 2016, "n_epochs": 2, "n_updates": 100.0, "verbose": false, "bias": true, "subsample": 1, "loss": "log", "sparse_multiple": 25, "leaf_classifiers": false, "gamma": 30, "blend": 0.8, "leaf_eps": 1e-05, "optimization": "fastxml", "engine": "auto", "auto_weight": 32, "eps": 1e-06, "C": 1, "leaf_probs": false, "n_labels": 58}

However, when trying to use the inferencer I get UnicodeDecodeErrors such as this:

clf = Inferencer('fastxml.trained')

----------------------------------------------------

Traceback (most recent call last):
  File "fastxml/inferencer.pyx", line 123, in fastxml.inferencer.load_sparse
    values = read_row(f, 'If')
  File "fastxml/inferencer.pyx", line 107, in fastxml.inferencer.read_row
    d = f.read(struct.calcsize('I'))
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Thank you.
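
The traceback shows f.read(...) passing through codecs.py, which happens when a file is opened in text mode; packed binary data then fails UTF-8 decoding. The sketch below is independent of fastxml's actual file layout. It writes one record with struct and reads it back in binary mode, then shows the same bytes failing in text mode. The 'If' format string is taken from the traceback; the file name and values are made up.

```python
import os
import struct
import tempfile

fmt = "If"  # one unsigned int followed by one 32-bit float
record = struct.pack(fmt, 0xFFFFFFFF, 0.5)

path = os.path.join(tempfile.mkdtemp(), "row.bin")
with open(path, "wb") as f:
    f.write(record)

# Binary mode returns raw bytes, so struct can unpack them directly.
with open(path, "rb") as f:
    index, weight = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
print(index, weight)

# Text mode tries to decode the same bytes as UTF-8 and fails.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True
print(text_mode_failed)
```

If the model files are being opened without the "b" flag under Python 3, that would be consistent with the error above, though the traceback alone does not confirm it.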

Run benchmarks against the library

There is a TODO bullet point in the README file that reads: "Run all the standard benchmark datasets against it."

I was planning on using this framework for one of my courses at university. As a part of my small project, I will be running the FastXML and pFastreXML algorithms against (all?) the benchmark datasets that the papers use (I basically need a baseline for my results).

Because of this, I would like to know if you still want to test the application and, if so, which data do you need. I might be able to give you a hand with it.

Error with loading classifier

Hi! This is my first time using XMC classifiers for a research project. I managed to fit the Trainer object following the instructions under "Simple Python Usage" in the README. However, I get a file not found error when running

clf = Inferencer("./models")

I get the following error doing so:

Not sure if I am missing a step, will really appreciate any help😅

ValueError: Buffer dtype mismatch, expected 'float32_t' but got 'double'

Using the code below I succeeded in training and saving the model:

trainer = Trainer(n_trees=64, n_jobs=-1)
trainer.fit(X_tr, y)
trainer.save("../models/model_64_trees")

However, when I try to predict, even on the very set it was trained on (X_tr), by running the code below:

clf = Inferencer("../models/model_64_trees")
y_train_preds = clf.predict(X_tr, fmt='dict')

I get the following error (with traceback):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-19b6d12937ec> in <module>
----> 1 y_train_preds = clf.predict(X_tr, fmt='dict')

c:\pankaj\projects\00-learning\python\fastxml\fastxml\fastxml\fastxml.py in predict(self, X, fmt)
     37             Xi = X[i]
     38             mean = self.predictor.predict(Xi.data, Xi.indices, 
---> 39                     self.blend, self.gamma, self.leaf_probs)
     40 
     41             if fmt == 'sparse':

c:\pankaj\projects\00-learning\python\fastxml\fastxml\fastxml\inferencer.pyx in fastxml.inferencer.IForestBlender.predict()

ValueError: Buffer dtype mismatch, expected 'float32_t' but got 'double'

Environment:

numpy                     1.19.2                   pypi_0    pypi
numpy-base                1.19.1           py36ha3acd2a_0
scipy                     1.5.3                    pypi_0    pypi
scikit-learn              0.23.2                   pypi_0    pypi
cython                    0.29.21                  pypi_0    pypi

How to fix this error?

Not able to execute setup.py

Traceback (most recent call last): File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed send(obj) IOError: bad message length

Hi!

When I run training on 10M examples (each described by a small subset of 100K features), it breaks with the error:

....
Splitting 2033
Training classifier
Splitting 1201
Training classifier
Splitting 1323
Training classifier
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
send(obj)
IOError: bad message length

Do you know what is the reason and how it could be fixed?

I tried smaller datasets (100K, 1M examples) and the training worked for them.

Cheers,
Michal

Building module fastxml.inferencer failed: ["distutils.errors.CompileError: command 'gcc' failed with exit status 1\n"]

When I import inferencer.pyx, the build fails. How can I solve it? Thanks!

Strategy for hyper-parameter tuning

Which parameters have the most influence on the performance of the fastxml algorithm?

Any advice will be very helpful.
Thank you.

Where can I get datasets like deliciousLarge_train.txt?

Error while loading model

Hi, I ran the following code

path = 'fastxml_model'
trainer.save(path)
clf = Inferencer(path)
pred = clf.predict(test_set[0])

And got an exception:

FileNotFoundError                         Traceback (most recent call last)
FileNotFoundError: [Errno 2] No such file or directory: 'fastxml_model/tree.0.weights'

Exception ignored in: 'fastxml.inferencer.load_sparse'
FileNotFoundError: [Errno 2] No such file or directory: 'fastxml_model/tree.0.weights'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-43-0624c4270f20> in <module>()
      1 path = 'fastxml_model'
      2 trainer.save(path)
----> 3 clf = Inferencer(path)
      4 #pred = clf.predict(test_set[0])

/usr/local/lib/python3.5/dist-packages/fastxml-2.0.0-py3.5-linux-x86_64.egg/fastxml/fastxml.py in __init__(self, dname, gamma, blend, leaf_probs)
     21         self.leaf_probs = leaf_probs
     22 
---> 23         forest = IForest(dname, self.n_trees, self.n_labels)
     24         if self.leaf_classifiers:
     25             lc = LeafComputer(dname)

fastxml/inferencer.pyx in fastxml.inferencer.IForest.__init__()

fastxml/inferencer.pyx in fastxml.inferencer.ITree.__init__()

fastxml/inferencer.pyx in fastxml.inferencer.load_dense_f32()

FileNotFoundError: [Errno 2] No such file or directory: 'fastxml_model/tree.0.bias'

Any idea how to solve it or how to create a classifier without saving to a file?

Thanks, Tal

Unable to run bin/fxml.py

While trying to run fxml.py, I get this error:

fastxml not found. This was resolved by moving fxml outside bin, but after that another error popped up:
inferencer module not found.
I think it is unable to read inferencer.pyx. I had already installed Cython with pip.

the score of deliciousLarge

Hi! When you train with fxml.py delicious.model deliciousLarge_train.txt --standard-dataset --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers --no-remap-labels and test with fxml.py delicious.model deliciousLarge_test.txt --standard-dataset inference,
what is the final result?
I get a very low score, like this:
P@1: 0.4287878787878788 P@3: 0.38484848484848483 P@5: 0.3565656565656566
NDCG@1: 0.4287878787878788 NDCG@3: 0.3957641604407556 NDCG@5: 0.3747889067931902
pNDCG@1: 0.45656907373737377 pNDCG@3: 0.41984043916568803 pNDCG@5: 0.396182084153063

unable to solve this in windows 10

The process hangs

ub16hp@UB16HP:~/ub16_prj/fastxml$ fxml.py delicious.model ../../Downloads/ML_from_napkinXML/DeliciousLarge/deliciousLarge_train.txt --standard-dataset --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers --no-remap-labels
10000 docs encoded
20000 docs encoded
30000 docs encoded
40000 docs encoded
50000 docs encoded
60000 docs encoded
70000 docs encoded
80000 docs encoded
90000 docs encoded
100000 docs encoded
110000 docs encoded
120000 docs encoded
130000 docs encoded
140000 docs encoded
150000 docs encoded
160000 docs encoded
170000 docs encoded
180000 docs encoded
190000 docs encoded
Splitting 196605
Splitting 196605
Splitting 196605
Splitting 196605
Splitting 196605
Splitting 196605
Splitting 196605
Splitting 196605
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Training classifier
Splitting 46046
Splitting 46031
Splitting 73030
Splitting 56692
Splitting 51019
Training classifier
Training classifier
Training classifier
Splitting 55835
Training classifier
Training classifier
Training classifier
Splitting 17790
Splitting 123642
Training classifier
Splitting 24480
Training classifier
Splitting 28929
Splitting 8203
Training classifier
Splitting 47112
Splitting 35059
Training classifier
Training classifier
Splitting 12392
Splitting 4414
Training classifier
Training classifier
Splitting 22024
Training classifier
Training classifier
Splitting 73156
Splitting 2160

.................

Splitting 1305
Training classifier
Splitting 2476
Training classifier
Splitting 1384
Training classifier
Splitting 1092
Training classifier

How to perform Performance Evaluation??

Sorry, I am new to multi-label classification. I want to know how I can evaluate performance on the test dataset in terms of accuracy. Also, can someone explain how to read the prediction results? What I see is "Label{...................} Predict {................}".

Any help would be greatly appreciated.
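
Plain accuracy is rarely reported for extreme multi-label problems; ranking metrics such as the P@k values quoted in another report on this page are more common. A minimal sketch of precision at k follows, assuming each prediction has already been turned into a list of label ids sorted from most to least relevant. The function names and sample data are illustrative, not part of the library.

```python
def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that are correct."""
    true_set = set(true_labels)
    hits = sum(1 for label in ranked_labels[:k] if label in true_set)
    return hits / k


def mean_precision_at_k(all_ranked, all_true, k):
    """Average P@k over a test set."""
    values = [precision_at_k(r, t, k) for r, t in zip(all_ranked, all_true)]
    return sum(values) / len(values)


ranked = [[5, 2, 9], [1, 4, 7]]   # predicted label ids, best first
truth = [[5, 9], [4]]             # true label ids per example
print(mean_precision_at_k(ranked, truth, 1))
print(mean_precision_at_k(ranked, truth, 3))
```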

How to solve the problem that topK's K is different for every input text?

The output topK's K is fixed now.

Do you think training a classifier to predict the value of K for every input is a good solution?

Not able to execute fastxml.py file

I installed all the requirements:
numpy>=1.8.1
scipy>=0.13.3
scikit-learn>=0.17
Cython>=0.23.4
future>=0.16.0

After that, I downloaded the deliciousLarge dataset from the Extreme Classification Repository.
I saved this dataset in the fastxml_master folder.

I tried the following command:
fxml.py delicious.model deliciousLarge_train.txt --standard-dataset --verbose train --iters 5 --trees 20 --label-weight propensity --alpha 1e-4 --leaf-classifiers --no-remap-labels

I got a syntax error at delicious.model.
What is the model file?

Can not run as sample, is something wrong with the source code?

from fastxml import Trainer, Inferencer
X = [Sparse or numpy arrays]
y = [[1, 3]] # Currently requires list[list[int]]
trainer = Trainer(n_trees=32, n_jobs=-1)
trainer.fit(X, y)
My input X is a list of csr_matrix and my input y is a csr_matrix, but when I run it I get this error:
File "build/bdist.linux-x86_64/egg/fastxml/trainer.py", line 389, in _build_roots
TypeError: iteration over a 0-d array
Besides, the program only works when X is a list of csr_matrix:
def _build_roots(self, X, y, weights):
    assert isinstance(X, list) and isinstance(X[0], sp.csr_matrix), "Requires list of csr_matrix"
    if self.n_jobs > 1:
        f = proc.fork_call(self.grow_root)
    else:
        f = proc.faux_fork_call(self.grow_root)

help using from python

First of all, thank you very much for your work!
I still don't understand how to pass values from Python.

I have a scipy csr_matrix with dimensions (3,000,000, 8,000) which I am passing to the fit method. But I get the message: AssertionError: Requires list of csr_matrix.

Do I need to input a list of 3,000,000 elements, each one a csr matrix?

Thanks

ModuleNotFoundError: No module named 'fastxml.inferencer'

I've installed the module's requirements with conda install --file requirements.txt and have run the setup with python setup.py install without any issues (apart from the deprecation warnings I see when I run the setup)

However, when I attempt to import the module, I get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/leskokm/fastxml/fastxml/__init__.py", line 1, in <module>
from .fastxml import Inferencer
File "/home/leskokm/fastxml/fastxml/fastxml.py", line 9, in <module>
from .inferencer import IForest, LeafComputer, Blender, IForestBlender
ModuleNotFoundError: No module named 'fastxml.inferencer'

Could there be an issue with the setup.py?