Comments (1)
Hey, ds-inference also runs world_size streams.
However, accelerate only runs 1 stream, since we are just using the naive pipeline-parallelism capability from accelerate.
A more efficient approach for pipeline parallelism would be to overlap micro-batches in the forward pass (no backward pass is needed).
For example, see the pipeline-schedule figure from the Megatron-LM paper. This would be more efficient when serving, although I think implementing it requires multiple processes. Even so, you might still get better throughput using DS-inference.
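To make the overlap idea concrete, here is a rough, hypothetical sketch (not code from this repo) of forward-only pipeline parallelism: two stages, each driven by its own worker thread, so stage 0 can start micro-batch i+1 while stage 1 is still processing micro-batch i. The toy stage modules, queue plumbing, and micro-batch count are all illustrative assumptions; in a real setup each stage would sit on its own GPU or process.

```python
# Hypothetical sketch of forward-only pipeline parallelism with overlapped
# micro-batches. Each stage runs in its own worker thread, so stage 0 can
# work on micro-batch i+1 while stage 1 is still busy with micro-batch i.
import threading
import queue
import torch
import torch.nn as nn

# Two toy "pipeline stages" -- in practice each would live on its own GPU
# (e.g. .to("cuda:0") / .to("cuda:1")); here they are plain CPU modules.
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())

STOP = object()  # sentinel that tells a stage to shut down

def run_stage(module, in_q, out_q):
    """Consume micro-batches from in_q, run the forward pass, push results."""
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)  # propagate shutdown to the next stage
            break
        idx, x = item
        with torch.no_grad():  # inference only, no backward pass
            out_q.put((idx, module(x)))

def pipelined_forward(batch, n_micro=4):
    """Split `batch` into micro-batches and stream them through both stages."""
    in_q, mid_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(stage0, in_q, mid_q)),
        threading.Thread(target=run_stage, args=(stage1, mid_q, out_q)),
    ]
    for w in workers:
        w.start()
    # Feed all micro-batches up front; the two stages overlap on the queues.
    for i, mb in enumerate(batch.chunk(n_micro)):
        in_q.put((i, mb))
    in_q.put(STOP)
    results = {}
    while True:
        item = out_q.get()
        if item is STOP:
            break
        idx, y = item
        results[idx] = y
    for w in workers:
        w.join()
    return torch.cat([results[i] for i in sorted(results)])

print(pipelined_forward(torch.randn(32, 512)).shape)  # torch.Size([32, 512])
```

The key point is the schedule: while stage 1 is finishing micro-batch i, stage 0 is already on micro-batch i+1, so both devices stay busy instead of idling like in the naive one-micro-batch-at-a-time scheme.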
Also, if you are really interested in serving models, I would suggest using text-gen-inference. It does dynamic batching and is much more efficient.
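For a flavor of what dynamic batching means, here is a toy sketch (not text-gen-inference's actual implementation; `generate_fn`, the batch size, and the wait window are made-up placeholders): requests that arrive close together are grouped into a single batched generate call instead of being served one at a time.

```python
# Hypothetical dynamic-batching loop: wait briefly after the first request
# arrives and fold any other pending requests into the same forward pass.
import queue
import time

request_q = queue.Queue()   # producers put (request_id, prompt) here
MAX_BATCH = 8               # illustrative limits, purely assumptions
MAX_WAIT_S = 0.01

def collect_batch():
    """Block for the first request, then grab whatever else arrives in the window."""
    batch = [request_q.get()]
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def serve_forever(generate_fn):
    """Run one model call per dynamically formed batch."""
    while True:
        batch = collect_batch()
        ids, prompts = zip(*batch)
        outputs = generate_fn(list(prompts))  # single batched generate call
        for rid, out in zip(ids, outputs):
            print(f"request {rid}: {out}")
```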
from transformers-bloom-inference.
Related Issues (20)
- It does not work with Falcon-40B correctly
- ValueError: Couldn't instantiate the backend tokenizer from one of:
- root_dir in TemporaryCheckpointsJSON is redundant
- The details of hf-accelerate pp. HOT 2
- Big batchsize cause OOM in bloom-ds-inference.py, how to adjust max_split_size_mb value HOT 1
- Unable to reload a quantized model HOT 4
- why no use deepspeed.init_inference in zero benchmark HOT 2
- question regarding the float16 and bfloat HOT 1
- pip install command does not work as expected HOT 2
- Bloom176B RuntimeError: expected scalar type Half but found BFloat16 HOT 3
- Inference(chatbot) does not work as expected on 2 gpus with bigscience/bloom-7b1 model HOT 2
- `accelerate` in `bloom-inference-scripts`? HOT 1
- ds_inference success but OOM when use tp_presharded_mode=True HOT 2
- Can I combine fastertransformer to make it faster HOT 1
- Are there fine-tuning and inference scripts available for int4 quantization in bloom-7b? Is it possible to limit the GPU memory usage to within 10GB? HOT 1
- AttributeError: 'BloomForCausalLM' object has no attribute 'module'
- The Makefile execution was successful, but there is no response when entering text.
- When deploying the Bloom model, I noticed that the POST method is used for the generation task. Is it possible to modify it to perform question-answering instead?
- does this work for llama 65B HOT 1