I wrote in Dec/18 a post in medium about your awesome library, for anybody that is loo

About Wordbatch

anttttti commented on July 28, 2024

Comments (2)

anttttti commented on July 28, 2024

Nice writeup, but the first graph is misleading. It would be good to fix it by removing the constant time costs and adding an alternative use of WB with the ApplyBatch class, which shows the toolkit's capabilities better.

First: your timing still includes data loading and other fixed costs, which is why you get the slope in your graph. If you remove these external constants and choose "nprocs" and "minibatch_size" appropriately, you'll get the 30% improvement in your example with much smaller data sizes as well.
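To make the point about fixed costs concrete, here is a minimal sketch of timing only the step being compared, with data loading kept outside the measurement. The helper names `time_step` and `relative_improvement` are made up for illustration and are not part of Wordbatch.

```python
import time


def time_step(function, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs.

    Only the call itself is timed, so loading the data beforehand
    does not add a constant offset to the result.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        function(*args)
        best = min(best, time.perf_counter() - start)
    return best


def relative_improvement(baseline_seconds, parallel_seconds):
    """Fraction of the baseline time saved, e.g. 0.3 for a 30% improvement."""
    return 1.0 - parallel_seconds / baseline_seconds
```

Load the texts once, then call `time_step` on the serial and the parallel transform separately and compare the two numbers with `relative_improvement`.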

Second: these extractors were mostly built for cases with detailed text preprocessing, where the normalize_text function that processes each row of data does heavy computation. In your minimal example the normalization doesn't do much. If you add just a couple of regexp replacements to the normalize_text you pass, you'll get to around 50% improvement.
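As an illustration of what "a couple of regexp replacements" could look like, here is a small sketch of a normalize_text function. The specific patterns (URL stripping, collapsing non-alphanumeric runs) are assumptions chosen for the example, not the ones used in the post or in Wordbatch.

```python
import re

# Patterns are compiled once so the per-row cost is only the substitution.
_URL = re.compile(r"https?://\S+")
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize_text(text):
    """Lowercase a row of text, drop URLs, and collapse punctuation to spaces."""
    text = text.lower()
    text = _URL.sub(" ", text)
    text = _NON_ALNUM.sub(" ", text)
    return text.strip()
```

Because this work runs once per row, it is exactly the kind of cost that gets spread across processes.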

Also, if you just want to use HashingVectorizer, use the ApplyBatch class instead, like this:
texts = ApplyBatch(wordbatch.batcher.Batcher(minibatch_size=20000, method="multiprocessing", procs=5), vectorizer.transform).transform(texts)

ApplyBatch in this case should give you around 80% improvement. Testing on 100k rows of IMDB reviews, I now get a speedup from 44s to 11s using the one-liner above. This was tested on a multi-core laptop with aggressive thermal throttling, so with 5 processes this 4-to-1 speedup is practically overhead-free and linear.
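For readers who want to see the general idea behind the one-liner above, here is a rough standard-library sketch of applying a function to minibatches in parallel. It is not Wordbatch's implementation: `split_minibatches` and `apply_batch` are hypothetical names, and the sketch assumes the function takes a list of rows and returns a list of results.

```python
from concurrent.futures import ProcessPoolExecutor


def split_minibatches(rows, minibatch_size):
    """Cut a list of rows into consecutive chunks of at most minibatch_size."""
    return [rows[i:i + minibatch_size] for i in range(0, len(rows), minibatch_size)]


def apply_batch(function, rows, minibatch_size=20000, procs=1):
    """Apply function to each minibatch and concatenate the results in order.

    With procs > 1 the minibatches are processed in separate processes,
    so function must be picklable (a module-level function, not a lambda).
    """
    batches = split_minibatches(rows, minibatch_size)
    if procs <= 1:
        results = [function(batch) for batch in batches]
    else:
        with ProcessPoolExecutor(max_workers=procs) as pool:
            results = list(pool.map(function, batches))
    # Flatten the per-batch outputs back into one list, preserving row order.
    return [item for batch_result in results for item in batch_result]
```

A vectorizer's transform returns a sparse matrix rather than a list, so a real version would stack the per-batch matrices instead of flattening lists.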

anttttti commented on July 28, 2024

I've made a more complete performance comparison available on Medium:
https://towardsdatascience.com/benchmarking-python-distributed-ai-backends-with-wordbatch-9872457b785c
