anttttti / wordbatch
Python library for distributed AI processing pipelines, using swappable scheduler backends.
License: GNU General Public License v2.0
I have tried to install both versions 1.3.3 and 1.3.5 from source. When I import FTRL, I get a strange "Illegal instruction" message.
I am working on an Ubuntu 14.01 system with Python 3.6 installed in a virtualenv. Other than some warnings such as:
wordbatch/models/fm_ftrl.c:2916:10: note: ‘__pyx_v_d’ was declared here
double __pyx_v_d;
I don't see anything suspicious in the installation.
This is the message I get:
(python36) voglis:~$ python3
Python 3.6.5 (default, Mar 29 2018, 00:00:00)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import wordbatch
>>> from wordbatch.models import FTRL
Illegal instruction (core dumped)
And this is my Python package list:
(python36) voglis:~$ pip list
Package Version
------------------ -------
Cython 0.28.2
numpy 1.14.2
pandas 0.22.0
pip 10.0.1
py-lz4framed 0.11.0
python-dateutil 2.7.2
python-Levenshtein 0.12.0
pytz 2018.4
randomgen 1.14.4
randomstate 1.14.0
scikit-learn 0.19.1
scipy 1.0.1
setuptools 39.0.1
six 1.11.0
wheel 0.31.0
Wordbatch 1.3.3
I developed this, because at the time there was nothing. However I really like your api. So I'm going to try to use it in my next blog post.
Cheers!
Hi, I'm building a very simple test script in jupyter using your own example dataset Tweets.csv with the same preprocessing and normalization.
If I use method="serial", it runs fine with no error, but if I change to multiprocessing it hangs forever, regardless of the size of the corpus.
I'm pretty sure there is no error in the corpus, since serial runs OK. It looks like there is some lock or limit for multiprocessing on Windows. I'm researching it; if you have a fix, please let me know.
I would like to use FM_FTRL in an sklearn cross-validation pipeline, e.g.,
from wordbatch.models import FM_FTRL
modelF = FM_FTRL(
alpha=0.01, # learning rate
beta=0.1,
L1=0.00001,
L2=0.10,
D=X_train.shape[1],
alpha_fm=0.01,
L2_fm=0.0,
init_fm=0.01,
D_fm=50,
e_noise=0.0001,
iters=5,
inv_link='sigmoid',
threads=4
)
cv_scores = cross_val_score(modelF, X_train.tocsc(), y_train_fm.target.values, scoring='roc_auc', cv=time_split)
This throws
TypeError: Cannot clone object '<wordbatch.models.fm_ftrl.FM_FTRL object at 0x557056cfbfa0>' (type <class 'wordbatch.models.fm_ftrl.FM_FTRL'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
This error is also thrown when trying to pass an FM_FTRL model to GridSearchCV.
Can you provide some guidance on how to make this work?
I can see in this thread that you tuned hyperparameters with random search. Can you provide guidance on that?
Thank you!
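One possible way around the clone error, sketched under assumptions: sklearn's `clone` asks the estimator for `get_params` and rebuilds it from those values, so a thin wrapper that stores the constructor arguments and exposes `get_params`/`set_params` may be enough. `ParamWrapper` and `DummyModel` below are hypothetical names, not part of wordbatch; in practice `FM_FTRL` would be passed as `model_cls`. Whether `scoring='roc_auc'` then works depends on which prediction methods the scorer asks for, which this sketch does not address.

```python
class ParamWrapper:
    """Hypothetical estimator-style wrapper: keeps constructor arguments so
    they can be read back (get_params) and changed (set_params)."""

    def __init__(self, model_cls, **params):
        self.model_cls = model_cls
        self.params = dict(params)
        self.model_ = None

    def get_params(self, deep=True):
        # Return everything needed to rebuild an identical wrapper.
        out = {"model_cls": self.model_cls}
        out.update(self.params)
        return out

    def set_params(self, **params):
        if "model_cls" in params:
            self.model_cls = params.pop("model_cls")
        self.params.update(params)
        return self

    def fit(self, X, y, **fit_kwargs):
        # Build a fresh underlying model on every fit.
        self.model_ = self.model_cls(**self.params)
        self.model_.fit(X, y, **fit_kwargs)
        return self

    def predict(self, X):
        return self.model_.predict(X)


class DummyModel:
    """Stand-in for the real model so the sketch runs without wordbatch."""

    def __init__(self, alpha=0.1, iters=1):
        self.alpha = alpha
        self.iters = iters

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


wrapped = ParamWrapper(DummyModel, alpha=0.01, iters=5)
wrapped.set_params(iters=3)
wrapped.fit([[0], [1]], [0.0, 1.0])
print(wrapped.get_params())
print(wrapped.predict([[2]]))
```

The wrapper keeps hyperparameters as keyword arguments so that a search over them only has to call `set_params` before each fit.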
My configuration -
wordbatch-1.3.0
pandas-0.22
python 3.6.2
ubuntu 14.04
Executing kaggle script without any changes
https://www.kaggle.com/anttip/wordbatch-ftrl-fm-lgb-lbl-0-42555
TypeError Traceback (most recent call last)
in ()
153 merge['name'] = merge['name'].astype(str)
154 print(len(merge['name']))
--> 155 X_name = wb.fit_transform(merge['name'])
156 del(wb)
157 X_name = X_name[:, np.array(np.clip(X_name.getnnz(axis=0) - 1, 0, 1), dtype=bool)]
~/lal/Kaggle/kaggleme/input/bkup/wordbatch/wordbatch.py in fit_transform(self, texts, labels, extractor, cache_features, input_split)
239
240 def fit_transform(self, texts, labels=None, extractor= None, cache_features= None, input_split= False):
--> 241 return self.transform(texts, labels, extractor, cache_features, input_split)
242
243 def partial_fit(self, texts, labels=None, input_split= False, merge_output= True):
~/lal/Kaggle/kaggleme/input/bkup/wordbatch/wordbatch.py in transform(self, texts, labels, extractor, cache_features, input_split)
248 if extractor== None: extractor= self.extractor
249 if cache_features != None and os.path.exists(cache_features): return extractor.load_features(cache_features)
--> 250 if not(input_split): texts= self.split_batches(texts)
251 texts= self.fit(texts, return_texts=True, input_split=True, merge_output=False)
252 if extractor!= None:
~/lal/Kaggle/kaggleme/input/bkup/wordbatch/wordbatch.py in split_batches(self, *args, **kwargs)
265
266 def split_batches(self, *args, **kwargs):
--> 267 return self.batcher.split_batches(*args, **kwargs)
268
269 def merge_batches(self, *args, **kwargs):
~/lal/Kaggle/kaggleme/input/bkup/wordbatch/batcher.py in split_batches(self, data, minibatch_size)
70 else: len_data= data.shape[0]
71 if minibatch_size> len_data: minibatch_size= len_data
---> 72 if data_type == pd.DataFrame:
73 data_split = [data.iloc[x * minibatch_size:(x + 1) * minibatch_size] for x in
74 range(int(ceil(len_data / minibatch_size)))]
~/anaconda2/envs/sdp/lib/python3.6/site-packages/pandas/core/ops.py in f(self, other)
1326 return self._compare_frame(other, func, str_rep)
1327 elif isinstance(other, ABCSeries):
-> 1328 return self._combine_series_infer(other, func, try_cast=False)
1329 else:
1330
~/anaconda2/envs/sdp/lib/python3.6/site-packages/pandas/core/frame.py in _combine_series_infer(self, other, func, level, fill_value, try_cast)
3946 def _combine_series_infer(self, other, func, level=None,
3947 fill_value=None, try_cast=True):
-> 3948 if len(other) == 0:
3949 return self * np.nan
3950
TypeError: object of type 'type' has no len()
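The traceback above shows the `==` on line 72 of `batcher.py` being handed to pandas' element-wise comparison, which then calls `len()` on the `pd.DataFrame` class itself. A type check that never triggers element-wise comparison avoids this. The function below is a minimal sketch of batch splitting written that way, not the library's code; it detects pandas objects by their `.iloc` attribute so that it needs no pandas import. In code that already imports pandas, `isinstance(data, pd.DataFrame)` would serve the same purpose.

```python
def split_batches(data, minibatch_size):
    """Split data into consecutive batches of at most minibatch_size items."""
    if minibatch_size < 1:
        raise ValueError("minibatch_size must be at least 1")
    len_data = len(data)
    # pandas objects slice by position through .iloc; lists slice directly.
    slicer = data.iloc if hasattr(data, "iloc") else data
    return [slicer[start:start + minibatch_size]
            for start in range(0, len_data, minibatch_size)]


batches = split_batches(["a", "b", "c", "d", "e"], 2)
print(batches)
```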
Hi @anttttti, thanks for your great package and congrats on your 5th place in Mercari!
I'm trying to get wordbatch to work on my own PC (Kaggle kernels are great but I prefer getting things done locally ;), however I can't seem to get the pip installation right.
I'm using Anaconda with Python 3.6 and VC++ Build Tools 2015 installed.
Here is the full installation output:
pip install wordbatch
Collecting wordbatch
Using cached Wordbatch-1.3.0.tar.gz
Requirement already satisfied: cython in c:\users\olivier\anaconda3\lib\site-packages (from wordbatch)
Requirement already satisfied: scikit-learn in c:\users\olivier\anaconda3\lib\site-packages (from wordbatch)
Requirement already satisfied: python-Levenshtein in c:\users\olivier\anaconda3\lib\site-packages (from wordbatch)
Requirement already satisfied: py-lz4framed in c:\users\olivier\anaconda3\lib\site-packages (from wordbatch)
Requirement already satisfied: setuptools in c:\users\olivier\anaconda3\lib\site-packages (from python-Levenshtein->wordbatch)
Building wheels for collected packages: wordbatch
Running setup.py bdist_wheel for wordbatch ... error
Complete output from command C:\Users\olivier\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\olivier\AppData\Local\Temp\pip-build-0xefvztp\wordbatch\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\olivier\AppData\Local\Temp\tmp7rkq7xj6pip-wheel- --python-tag cp36:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\wordbatch
copying wordbatch\wordbatch.py -> build\lib.win-amd64-3.6\wordbatch
copying wordbatch\__init__.py -> build\lib.win-amd64-3.6\wordbatch
creating build\lib.win-amd64-3.6\wordbatch\extractors
copying wordbatch\extractors\__init__.py -> build\lib.win-amd64-3.6\wordbatch\extractors
creating build\lib.win-amd64-3.6\wordbatch\models
copying wordbatch\models\__init__.py -> build\lib.win-amd64-3.6\wordbatch\models
running build_ext
error: [WinError 2] The system cannot find the file specified
Failed building wheel for wordbatch
Running setup.py clean for wordbatch
Failed to build wordbatch
Installing collected packages: wordbatch
Running setup.py install for wordbatch ... error
Complete output from command C:\Users\olivier\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\olivier\AppData\Local\Temp\pip-build-0xefvztp\wordbatch\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\olivier\AppData\Local\Temp\pip-07pzlhjs-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\wordbatch
copying wordbatch\wordbatch.py -> build\lib.win-amd64-3.6\wordbatch
copying wordbatch\__init__.py -> build\lib.win-amd64-3.6\wordbatch
creating build\lib.win-amd64-3.6\wordbatch\extractors
copying wordbatch\extractors\__init__.py -> build\lib.win-amd64-3.6\wordbatch\extractors
creating build\lib.win-amd64-3.6\wordbatch\models
copying wordbatch\models\__init__.py -> build\lib.win-amd64-3.6\wordbatch\models
running build_ext
error: [WinError 2] The system cannot find the file specified
----------------------------------------
Command "C:\Users\olivier\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\olivier\AppData\Local\Temp\pip-build-0xefvztp\wordbatch\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\olivier\AppData\Local\Temp\pip-07pzlhjs-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\olivier\AppData\Local\Temp\pip-build-0xefvztp\wordbatch\
Would you have any idea?
Thanks.
Hey dude, this library is amazing and I like it very much. Can I write documentation for it? I have gone through some of the code in the process of using and understanding it. I think such great work would be even greater with documentation. Maybe we can collaborate?
Waiting for your reply : )
Hello, I'm using the following code to transform features:
wb.fit(X_train_with_new_feature['name'].tolist())
X_train_name_wordbatch = wb.transform(X_train_with_new_feature['name'].tolist())
but I keep getting the error "'tuple' object has no attribute 'transform'".
Can anyone help me?
I tried pip install wordbatch, but it failed while building the wheel for py-lz4framed.
I have Python 3.6 installed under Anaconda.
The OS is macOS High Sierra.
Does the library only work on Linux?
Thank you!
Any idea how to fix? Has just started happening without changing code. Thought this was from running too many kernels at once but even if I run one kernel I get this error....
From kaggle script environment
WORBBAG_ITEM_DESC_PARAMS = {'hash_ngrams': 2, 'hash_ngrams_weights': [1.0, 1.0],
'hash_size': 2 ** 26, 'norm': 'l2', 'tf': 1.0, 'idf': None}
wb = wordbatch.WordBatch(normalize_text, extractor=(WordBag, WORBBAG_ITEM_DESC_PARAMS),
procs=procs)
wb.dictionary_freeze= True
X_description = wb.fit_transform(full_df['item_description'])
2237.9s
477
Parallelization fail. Method: multiprocessing Task: <function batch_normalize_texts at 0x7f76ce912ea0>
Retrying, attempt: 1 timeout limit: 1200 seconds
2250.8s
478
Parallelization fail. Method: multiprocessing Task: <function batch_normalize_texts at 0x7f76ce912ea0>
Retrying, attempt: 2 timeout limit: 2400 seconds
2263.4s
479
Parallelization fail. Method: multiprocessing Task: <function batch_normalize_texts at 0x7f76ce912ea0>
Retrying, attempt: 3 timeout limit: 4800 seconds
2276.2s
480
Parallelization fail. Method: multiprocessing Task: <function batch_normalize_texts at 0x7f76ce912ea0>
Retrying, attempt: 4 timeout limit: 9600 seconds
2288.9s
481
Parallelization fail. Method: multiprocessing Task: <function batch_normalize_texts at 0x7f76ce912ea0>
Extract wordbags
2288.9s
482
Traceback (most recent call last):
File "../src/script.py", line 573, in <module>
sparse_mat = preprocess_for_fm(full_df)
File "../src/script.py", line 552, in preprocess_for_fm
X_description = wb.fit_transform(full_df['item_description'])
File "/opt/conda/lib/python3.6/site-packages/wordbatch/wordbatch.py", line 230, in fit_transform
return self.transform(texts, labels, extractor, cache_features, input_split)
File "/opt/conda/lib/python3.6/site-packages/wordbatch/wordbatch.py", line 242, in transform
texts= extractor.transform(texts, input_split= True, merge_output= True)
File "wordbatch/extractors/extractors.pyx", line 185, in wordbatch.extractors.extractors.WordBag.transform
File "/opt/conda/lib/python3.6/site-packages/wordbatch/wordbatch.py", line 305, in parallelize_batches
paral_params= [[data_batch]+ args for data_batch in data]
TypeError: 'NoneType' object is not iterable
2288.9s
483
2288.9s
484
Failed. Exited with code 1.
Will it work on Windows?
This is an amazing library and I'd like to use it for some commercial work, but we're not able to satisfy the GNU GPL conditions of stating changes or disclosing source. Would you be willing to distribute under an MIT, Apache, or BSD license so that we could use this work?
I built a model using your FM_FTRL code and received this error when running fit on it.
Traceback (most recent call last):
File "../src/script.py", line 259, in
clf.fit(train_features, labels, 0.2, reset=False)
File "wordbatch/models/fm_ftrl.pyx", line 227, in wordbatch.models.fm_ftrl.FM_FTRL.fit
1263.7s
18
File "wordbatch/models/fm_ftrl.pyx", line 272, in wordbatch.models.fm_ftrl.FM_FTRL.fit_f
IndexError: too many indices for array
My train and test set shapes are (1503424, 10016) and (508438, 10016).
Any idea how to solve this?
I was unable to install wordbatch using the solution from the last issue. Can anyone give me some advice?
last issue solution:
error: command 'gcc-7' failed with exit status 1
Hello,
I run the following code with WordBatch on windows:
import wordbatch
from wordbatch.extractors import WordBag
from wordbatch.models import FTRL, FM_FTRL
import pandas as pd
import numpy as np
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=10000, noise=0.7)
from sklearn.model_selection import train_test_split
X_t, X_v, y_t, y_v = train_test_split(X, y, stratify=y, test_size=0.2, random_state=8)
train_weight = np.array(pd.DataFrame(y_t).replace(1,400).replace(0,1).astype('float64'))
clf = FM_FTRL(alpha=0.5, beta=1, L1=10.0, L2=10.0, D=2 * 20, alpha_fm=0.02,
L2_fm=0.0, init_fm=0.01, weight_fm=1.0,
D_fm=2, e_noise=0.0, iters=3,
inv_link="sigmoid", threads=4
)
clf.fit(X_t, y_t, train_weight)
class_pred = clf.predict(X_v)
I get the following error:
TypeError Traceback (most recent call last)
in ()
---> 30 clf.fit(X_t, y_t, train_weight)
31 class_pred = clf.predict(X_v)
wordbatch\models\fm_ftrl.pyx in wordbatch.models.fm_ftrl.FM_FTRL.fit()
TypeError: only size-1 arrays can be converted to Python scalars
It appears only when WEIGHT is used.
Do you know a possible cause?
Thanks,
Nikita
In Dec/18 I wrote a post on Medium about your awesome library. Anybody looking for more info can find it here:
https://medium.com/@d.canivel/wordbatch-a-parallel-text-feature-extraction-for-machine-learning-eb3696f40996
Thanks
Hi. I trained an FM_FTRL model for 5 iterations which took 5 hours on 120 million records data.
But when I try to output predictions on this trained model, it takes a very long time (I end up killing it after it runs for 1 hour).
Is this normal? Is prediction supposed to take this long?
I use the latest GitHub version of wordbatch: 1.3.5.
By the way, pickle_model() does not seem to work: it uses get_params, which is not implemented.
I ended up using regular pickle.dump().
Thanks,
Hi!
I'm running this kernel on kaggle:
kernel
After fitting on a valid matrix (without NA values), the model state still contains NaN.
During the fit there are neither exceptions nor warnings.
To see the state of the model I called:
model.__getstate__()
The result is:
model FM_FTRL status:
(0.01, 0.01, 1e-05, 0.1, 0.01, 0.0, 1.0, 0.0, 1020688, 200, 17, array([ nan, 0., 0., ..., 1., 1., 1.]), array([ nan, nan, nan, ..., nan, nan, nan]), array([ nan, nan, nan, ..., nan, nan, nan]), array([ nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan]), array([ nan, nan, nan, ..., nan, nan, nan]), array([ nan, nan, nan, ..., nan, nan, nan]), 0, 0, True)
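To narrow down which parts of the state are affected, the tuple returned by `model.__getstate__()` can be scanned entry by entry. The helper names below (`contains_nan`, `nan_positions`) are hypothetical and the sample tuple is made up for illustration; the sketch only reports positions and says nothing about why the values became NaN.

```python
import math


def contains_nan(value):
    """Return True if value, or anything nested inside it, is a float NaN."""
    if isinstance(value, (str, bytes)):
        return False
    try:
        items = iter(value)
    except TypeError:
        # Not iterable: treat as a scalar; non-numeric scalars are never NaN.
        try:
            return math.isnan(value)
        except TypeError:
            return False
    return any(contains_nan(item) for item in items)


def nan_positions(state):
    """Indices of the top-level state entries that contain at least one NaN."""
    return [i for i, entry in enumerate(state) if contains_nan(entry)]


# Made-up state tuple: scalars, nested sequences, and a flag.
sample_state = (0.01, [float("nan"), 0.0], [1.0, 2.0], 0, True, [[float("nan")]])
print(nan_positions(sample_state))
```

Because the scan walks every element in pure Python, it is meant as a one-off debugging aid rather than something to run inside a training loop.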
After trying out wordbatch using WordVec extractor, I am facing the following problem.
I have used the following code to initialize the wordbatch
wb= wordbatch.WordBatch(normalize_text,extractor=(Hstack, [(WordVec, {"wordvec_file": "../input/glove6b300dtxt/glove.6B.300d.txt", "normalize_text": normalize_text}), (WordVec, {"wordvec_file": "../../../data/word2vec/glove.6B.50d.txt.gz", "normalize_text": normalize_text})]))
Python = 3.6
Any idea on why this error came out?
('memory GB:', 5.425804138183594)
Illegal instruction (core dumped)
Thanks,
E.g. if I want sequences of integers, with ngrams appended to the end?
Hi
I installed wordbatch on macOS Sierra using pip. The installation was successful. The following commands work in a Jupyter notebook:
import wordbatch
from wordbatch.extractors import WordBag, WordHash
However, importing FTRL gives an ImportError (screenshot attached).
from wordbatch.models import FTRL
Please advise.
Regards
Shanth
I can "import wordbatch", but importing wordbatch.extractors kills the interpreter with "Illegal operation".
My environment:
The environment I'm working with is not Anaconda, however, I was able to reproduce it on the very same OS with Python 3.6.4 on Anaconda.
First, thanks for this amazing tool!
My question: is 3-4 seconds normal for a single TF-IDF calculation (a text of approximately 300 words)?
I want to use "lime" (Explaining the predictions of any machine learning classifier), but this is just too long for the number of iterations that lime needs.
thanks!
Thanks for this great work; hopefully it can incrementally incorporate all the NLP feature extraction tricks!
I tried to pickle the fitted model for future testing data, but bumped into this Error which says:
wordbatch/extractors/extractors.pyx in wordbatch.extractors.extractors._pickle_method()
AttributeError: 'function' object has no attribute 'im_self'
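For context, `im_self` is the Python 2 attribute name. In Python 3, bound methods expose `__self__` and `__func__` instead, and a method looked up on the class is a plain function with neither attribute, which matches the message above. The helper inside `extractors.pyx` is not shown in this report, so the following is only a sketch of the Python 3 equivalents, using a standard-library object as a stand-in; `describe_method` is a hypothetical name.

```python
import pickle
from collections import Counter


def describe_method(method):
    """Return the instance a bound method belongs to and the method's name."""
    owner = method.__self__          # Python 2 spelled this method.im_self
    name = method.__func__.__name__  # Python 2 spelled this method.im_func.__name__
    return owner, name


bound = Counter("aab").most_common
owner, name = describe_method(bound)

# Bound methods of picklable objects can be pickled directly in Python 3.
restored = pickle.loads(pickle.dumps(bound))
print(name)
print(restored(1))
```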
Command "/Users/carenv/bin/python3 -u -c "import setuptools, tokenize;file='/private/var/folders/h7/6h97dljx0n7bzvtttwv4jj6c0000gn/T/pip-build-7ibtzw0e/wordbatch/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/h7/6h97dljx0n7bzvtttwv4jj6c0000gn/T/pip-0c39szdv-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/shleifer/Dropbox/projects/mercari/carenv/bin/../include/site/python3.5/wordbatch" failed with error code 1 in /private/var/folders/h7/6h97dljx0n7bzvtttwv4jj6c0000gn/T/pip-build-7ibtzw0e/wordbatch/
Hi, I am getting an error when I access wordbatch.data_utils, can you please help.
ImportError Traceback (most recent call last)
in ()
9 import gc
10 from contextlib import contextmanager
---> 11 from wordbatch.data_utils import *
ImportError: No module named 'wordbatch.data_utils'
I installed wordbatch on mac (OS X El Capitan).
import wordbatch
doesn't give errors
but from wordbatch.models import FM_FTRL
throws this error:
ImportError Traceback (most recent call last)
<ipython-input-5-6f5587655718> in <module>()
----> 1 from wordbatch.models import FM_FTRL
/Users/yuliamahtani/anaconda/lib/python3.5/site-packages/wordbatch/models/__init__.py in <module>()
2 from .fm_ftrl import FM_FTRL
3 from .nn_relu_h1 import NN_ReLU_H1
----> 4 from .nn_relu_h2 import NN_ReLU_H2
wordbatch/models/nn_relu_h2.pyx in init wordbatch.models.nn_relu_h2 (wordbatch/models/nn_relu_h2.c:24869)()
/Users/yuliamahtani/anaconda/lib/python3.5/site-packages/randomgen/__init__.py in <module>()
1 from randomgen.dsfmt import DSFMT
2 from randomgen.generator import RandomGenerator
----> 3 from randomgen.mt19937 import MT19937
4 from randomgen.pcg32 import PCG32
5 from randomgen.pcg64 import PCG64
/Users/yuliamahtani/anaconda/lib/python3.5/site-packages/randomgen/mt19937.pyx in init randomgen.mt19937()
9 cimport numpy as np
10
---> 11 from randomgen.common import interface
12 from randomgen.common cimport *
13 from randomgen.distributions cimport brng_t
ImportError: cannot import name interface
please help
Hello,
If I save the model to a pickle (the WordBatch object, with TF-IDF calculated) and then load it in another file, it says "Can't get attribute 'normalize_text' on <module '__main__'>", even though I passed the normalize function to WordBatch when I created the object that was pickled.
If I manually copy the normalize function into the main scope, the problem is solved, but this approach is not useful if I want to use it with a Gunicorn server, for example.
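This behaviour follows from how `pickle` stores functions: by reference (module name plus qualified name), not by value. A function defined in the script that does the pickling is recorded as living in `__main__`, and the loading process (for example a Gunicorn worker) has a different `__main__`. A common remedy is to move `normalize_text` into a module that both the saving and the loading code import; a module name such as `text_utils` would be a hypothetical choice. The sketch below uses `json.dumps` as a stand-in for an importable function, and `pickled_reference` is a hypothetical helper.

```python
import json
import pickle


def pickled_reference(func):
    """Module and qualified name that pickle records for a function."""
    return func.__module__, func.__qualname__


payload = pickle.dumps(json.dumps)

# The payload holds only the names; loading re-imports the module and looks
# the function up, so it must be importable in the loading process.
print(pickled_reference(json.dumps))
print(pickle.loads(payload) is json.dumps)
```

With the function in its own module, the recorded reference points at that module instead of `__main__`, so any process that can import the module can load the pickle.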