
ZeRO 3 NVMe Offload? about yalm-100b HOT 9 CLOSED

yandex commented on July 28, 2024
ZeRO 3 NVMe Offload?

from yalm-100b.

Comments (9)

pskucherov commented on July 28, 2024

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

No, I don't have working code yet, but it will soon.
Running on NVMe in this configuration is useless. Running from CPU RAM is more realistic, although it is still very slow.

In any case, it's all in a private repository.


MichaelEk commented on July 28, 2024

You can use ZeRO-Offload while finetuning your model.
You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard. However, the GPU will have to download the entire model (200 GB) for each inferred token. That adds about 8 seconds per inferred token even in the best configuration =(

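For anyone trying this: a minimal DeepSpeed ZeRO-3 configuration with parameter offload might look roughly like the sketch below. The `nvme_path` and batch size are placeholders to tune for your hardware, and setting `"device"` to `"cpu"` instead gives the CPU-RAM variant discussed above.

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  }
}
```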

lostmsu commented on July 28, 2024

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward a batch of 4 sentences. That way the bottleneck of streaming weights to GPU memory can be amortized, if the use case permits batching.


Vbansal21 commented on July 28, 2024

@MichaelEk Thanks for replying.

Actually, I wasn't able to implement offload during inference; that's why I created an issue here. It would be nice if an updated script could be provided. πŸ˜…


MichaelEk commented on July 28, 2024

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward a batch of 4 sentences. That way the bottleneck of streaming weights to GPU memory can be amortized, if the use case permits batching.

Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights. However, the total time of one generation will still be too long to use it in, for example, a chatbot.

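The trade-off described above can be sketched with a toy cost model. All numbers below are illustrative assumptions, not measurements; the ~8 s step cost is roughly what streaming 200 GB over a ~25 GB/s PCIe link would take.

```python
# Back-of-the-envelope model of weight-streaming amortization.
# Both constants are illustrative assumptions, not measurements.

WEIGHT_STREAM_S = 8.0   # seconds to stream ~200 GB of weights per decoding step
COMPUTE_S = 0.1         # assumed per-sequence compute time per step (hypothetical)

def time_per_token_per_sequence(batch_size: int) -> float:
    """Wall-clock seconds per generated token, per sequence, when
    `batch_size` prefixes are decoded together: the weight-streaming
    cost is paid once per step and shared across the whole batch."""
    step_time = WEIGHT_STREAM_S + batch_size * COMPUTE_S
    return step_time / batch_size

for b in (1, 4, 16):
    print(f"batch={b:2d}: {time_per_token_per_sequence(b):.2f} s/token/sequence")
```

Note that while throughput improves with batch size, the latency of a single decoding step never drops below the streaming cost itself, which is why this is still unusable for an interactive chatbot.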

pskucherov commented on July 28, 2024

You can use ZeRO Offload while finetuning your model. You can extend our version of Megatron by adding CPU/NVME offload for inference, it is not very hard.

Thanks for the hint. I was able to run it on an RTX 3070 Ti πŸ˜ƒ
[screenshot of the model running]

can't you generate multiple tokens in parallel for different string prefixes?
Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights.

Can you tell me a little more about this? I feel that I need to do this now, but I don't know where to start.

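A place to start is the batching mechanics themselves: left-pad the tokenized prefixes to a common length and track an attention mask, so that one forward pass per decoding step serves every sequence. A framework-free sketch (the token ids and pad id are made-up values for illustration):

```python
# Minimal sketch of batching several tokenized prefixes for one forward pass.
# PAD_ID and the example token ids are made-up values.

PAD_ID = 0

def pad_batch(prefixes: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Left-pad token-id sequences to equal length and build attention
    masks (1 = real token, 0 = padding). Left padding keeps the last
    position of every row at the true end of its prefix, which is where
    the next token is predicted for decoder-only models."""
    max_len = max(len(p) for p in prefixes)
    input_ids, attention_mask = [], []
    for p in prefixes:
        pad = max_len - len(p)
        input_ids.append([PAD_ID] * pad + p)
        attention_mask.append([0] * pad + [1] * len(p))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 7, 9], [3, 4], [8]])
print(ids)   # every row now has the same length
print(mask)
```

The same padded batch can then be fed to the model each step, appending one sampled token per row, so the weights are streamed once per step for the whole batch rather than once per sequence.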

lostmsu commented on July 28, 2024

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?


alkavan commented on July 28, 2024

@pskucherov Being able to run on an RTX 3070 Ti is really great.
If you could share some code showing how to do the ZeRO offload, that would be nice!


Vbansal21 commented on July 28, 2024

Hugging Face Accelerate seems to be the solution. Closing the issue.

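For later readers: the Accelerate route usually means building the model skeleton with empty weights and letting `load_checkpoint_and_dispatch` place layers on GPU, then CPU RAM, then disk. A rough sketch, assuming the weights have been converted to a transformers-compatible checkpoint; all paths are placeholders, and this is not runnable without the ~200 GB checkpoint on hand:

```python
# Hedged sketch: dispatch a large checkpoint across GPU / CPU / disk
# with Hugging Face Accelerate. CHECKPOINT_DIR is a placeholder path.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

CHECKPOINT_DIR = "/path/to/yalm-100b-checkpoint"  # placeholder

config = AutoConfig.from_pretrained(CHECKPOINT_DIR)
with init_empty_weights():
    # Build the module graph without allocating real weight tensors.
    model = AutoModelForCausalLM.from_config(config)

# Fill the GPU first, then spill remaining layers to CPU RAM and disk.
model = load_checkpoint_and_dispatch(
    model,
    CHECKPOINT_DIR,
    device_map="auto",
    offload_folder="/local_nvme/offload",  # placeholder
)
```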
