Comments (2)
Nice writeup, but the first graph is misleading. It would be good to fix it by removing the constant time costs and adding an alternative use of WB with the ApplyBatch class, which shows the toolkit's capabilities better.
First: you're not removing the time spent on data loading etc., which is why you're getting the slope in your graph. If you remove these external constants and choose "nprocs" and "minibatch_size" appropriately, you'll get the 30% improvement in your example with much smaller data sizes as well.
Second: these extractors were mostly built for cases with detailed text preprocessing, where there's heavy computation within the normalize_text function that processes each row of data. In your minimal example the normalization doesn't do much, so if you just add a couple of regexp replacements to the normalize_text you pass, you'll get to around 50% improvement.
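As a rough illustration of the "couple of regexp replacements" mentioned above, a heavier normalize_text could look something like the sketch below. The two patterns are assumptions picked for illustration; they are not the ones used in the benchmark or shipped with the library.

```python
import re

# Hypothetical patterns, precompiled once so each row only pays for matching.
_HTML_TAG = re.compile(r"<[^>]+>")
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize_text(text):
    """Lowercase, drop HTML-like tags, and collapse non-alphanumeric runs to one space."""
    text = _HTML_TAG.sub(" ", text.lower())
    return _NON_ALNUM.sub(" ", text).strip()
```

The more work a function like this does per row, the larger the share of total runtime that can be spread across processes.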
Also, if you just want to use HashingVectorizer, use the ApplyBatch class instead, like this:
texts = ApplyBatch(wordbatch.batcher.Batcher(minibatch_size=20000, method="multiprocessing", procs=5), vectorizer.transform).transform(texts)
ApplyBatch in this case should give you around 80% improvement. Testing on 100k rows of IMDB reviews just now, I get a speedup from 44s to 11s using the one-liner above. This was tested on a laptop that throttles aggressively when multiple cores are busy, so with 5 processes this 4-to-1 speedup is practically overhead-free and linear.
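For readers unfamiliar with the idea behind the one-liner, the general pattern of splitting rows into minibatches and mapping a batch-level function over them can be sketched with the standard library alone. This is not wordbatch's implementation: split_minibatches and apply_in_minibatches are hypothetical names, and the sketch assumes the batch function returns a list (a vectorizer would return sparse matrices that need stacking instead).

```python
from multiprocessing import Pool


def split_minibatches(rows, minibatch_size):
    """Split a list of rows into consecutive chunks of at most minibatch_size."""
    return [rows[i:i + minibatch_size] for i in range(0, len(rows), minibatch_size)]


def apply_in_minibatches(batch_func, rows, minibatch_size=20000, procs=1):
    """Apply batch_func to each minibatch and concatenate the outputs in order.

    batch_func takes a list of rows and returns a list of results. With
    procs > 1 it must be picklable, e.g. a function defined at module level.
    """
    batches = split_minibatches(rows, minibatch_size)
    if procs <= 1:
        # Serial fallback: same result, no process startup cost.
        results = [batch_func(batch) for batch in batches]
    else:
        with Pool(procs) as pool:
            # pool.map preserves the order of the input batches.
            results = pool.map(batch_func, batches)
    return [item for batch_out in results for item in batch_out]
```

When calling this with procs greater than 1 from a script, the call should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely.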
from wordbatch.
I made a more complete performance comparison available on Medium:
https://towardsdatascience.com/benchmarking-python-distributed-ai-backends-with-wordbatch-9872457b785c