About Wordbatch (closed)

anttttti commented on July 28, 2024
About Wordbatch

from wordbatch.

Comments (2)

anttttti commented on July 28, 2024

Nice writeup, but the first graph is misleading. It would be good to fix it by removing the constant time costs and adding an alternative use of WB with the ApplyBatch class, which shows the toolkit's capabilities better.

First: you're not removing the time spent on data loading and similar fixed steps, which is why you're getting the slope in your graph. If you remove these external constants and choose "nprocs" and "minibatch_size" appropriately, you'll get the 30% improvement in your example with much smaller data sizes as well.
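
One way to keep the constant costs out of the measurement is to time only the extraction call. This is a minimal sketch using the standard library, not the benchmark from the writeup; `load_texts`, `extract`, and `time_step` are hypothetical stand-ins for the real loading and feature-extraction steps.

```python
import time

def time_step(fn, *args, repeats=3):
    """Return the best wall-clock time of fn(*args) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def load_texts(n):
    # Hypothetical stand-in for data loading; this cost stays outside the timer.
    return ["Some review text number %d" % i for i in range(n)]

def extract(texts):
    # Hypothetical stand-in for the feature-extraction step being benchmarked.
    return [t.lower().split() for t in texts]

texts = load_texts(10000)            # loaded once, not timed
elapsed = time_step(extract, texts)  # only the extraction is timed
print("extraction took %.4f s" % elapsed)
```

Timing the serial and the parallel variant this way compares only the part that the extractor actually changes.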

Second: these extractors were mostly built for cases with detailed text preprocessing, where there's heavy computation within the normalize_text function that processes each row of data. In your minimal example the normalization doesn't do much. If you just add a couple of regexp replacements to the normalize_text function that you pass, you'll get to around 50% improvement.
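
As an illustration of what "a couple of regexp replacements" might look like, here is a sketch of a normalize_text function. The two patterns (URL removal and collapsing non-alphanumeric runs) are assumptions chosen for the example, not the ones used in the writeup or in Wordbatch itself.

```python
import re

# Assumed example patterns; any per-row regexp work would have a similar effect.
URL_RE = re.compile(r"https?://\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")

def normalize_text(text):
    """Lowercase, drop URLs, and collapse punctuation and whitespace to single spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return text.strip()

print(normalize_text("Great movie!!! See https://example.com/review for MORE."))
```

The more work each row needs in a function like this, the more there is to gain from spreading rows across processes.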

Also, if you just want to use HashingVectorizer, use the ApplyBatch class instead, like this:
texts = ApplyBatch(wordbatch.batcher.Batcher(minibatch_size=20000, method="multiprocessing", procs=5), vectorizer.transform).transform(texts)

ApplyBatch in this case should give you around 80% improvement. Testing on 100k rows of IMDB reviews just now, I get a speedup from 44s to 11s using the one-liner above. This was tested on a multi-core laptop with aggressive thermal throttling, so with 5 processes this 4-to-1 speedup is practically overhead-free and linear.
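
For readers unfamiliar with the pattern, the general idea of splitting rows into minibatches and transforming them in parallel can be sketched with the standard library alone. This is not Wordbatch's implementation; `split_minibatches`, `transform_batch`, and `apply_in_batches` are hypothetical names, and `transform_batch` is a stand-in for a call such as vectorizer.transform.

```python
from multiprocessing import Pool

def split_minibatches(rows, minibatch_size):
    """Split a list into consecutive chunks of at most minibatch_size rows."""
    return [rows[i:i + minibatch_size] for i in range(0, len(rows), minibatch_size)]

def transform_batch(batch):
    # Hypothetical stand-in for a per-batch transform; here it counts words per row.
    return [len(text.split()) for text in batch]

def apply_in_batches(rows, minibatch_size, procs):
    """Apply transform_batch to each minibatch in parallel, keeping row order."""
    batches = split_minibatches(rows, minibatch_size)
    with Pool(processes=procs) as pool:
        results = pool.map(transform_batch, batches)  # map preserves batch order
    return [value for batch_result in results for value in batch_result]

if __name__ == "__main__":
    texts = ["a short review", "another slightly longer review"] * 50
    counts = apply_in_batches(texts, minibatch_size=20, procs=5)
    print(len(counts), counts[:2])
```

Larger minibatches reduce the per-task overhead of sending data to worker processes, while smaller ones balance load more evenly, which is why the choice of minibatch size and process count matters for the measured speedup.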

anttttti commented on July 28, 2024

I made a more complete performance comparison available at Medium:
https://towardsdatascience.com/benchmarking-python-distributed-ai-backends-with-wordbatch-9872457b785c
