
ZeRO 3 NVMe Offload? about yalm-100b HOT 9 CLOSED

yandex commented on July 28, 2024
ZeRO 3 NVMe Offload?

from yalm-100b.

Comments (9)

pskucherov commented on July 28, 2024

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?

No, I don't have working code yet, but it will soon.
Running on NVMe in this configuration is useless. Running from CPU RAM is more realistic, although it is still very slow.

In any case, it's all in a private repository.


MichaelEk commented on July 28, 2024

You can use ZeRO-Offload while finetuning your model.
You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard. However, the GPU will have to download the entire model (200 GB) for each inferred token. That adds about 8 seconds per inferred token even in the best configuration =(

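For anyone trying this: a minimal DeepSpeed ZeRO-3 configuration with parameter offload might look roughly like the sketch below. The `nvme_path` and batch size are placeholders to tune for your hardware, and setting `"device"` to `"cpu"` instead gives the CPU-RAM variant discussed above.

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true
    }
  }
}
```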

lostmsu commented on July 28, 2024

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward a batch of 4 sentences. That way the bottleneck of streaming weights to GPU memory can be amortized, if the use case permits batching.


Vbansal21 commented on July 28, 2024

@MichaelEk Thanks for replying.

Actually, I wasn't able to implement offload during inference; that's why I created an issue here. It would be nice if an updated script could be provided. πŸ˜…


MichaelEk commented on July 28, 2024

@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward a batch of 4 sentences. That way the bottleneck of streaming weights to GPU memory can be amortized, if the use case permits batching.

Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights. However, the total time of one generation will still be too long to use it in, for example, a chatbot.

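The trade-off described above can be sketched with a toy cost model. All numbers below are illustrative assumptions, not measurements; the ~8 s step cost is roughly what streaming 200 GB over a ~25 GB/s PCIe link would take.

```python
# Back-of-the-envelope model of weight-streaming amortization.
# Both constants are illustrative assumptions, not measurements.

WEIGHT_STREAM_S = 8.0   # seconds to stream ~200 GB of weights per decoding step
COMPUTE_S = 0.1         # assumed per-sequence compute time per step (hypothetical)

def time_per_token_per_sequence(batch_size: int) -> float:
    """Wall-clock seconds per generated token, per sequence, when
    `batch_size` prefixes are decoded together: the weight-streaming
    cost is paid once per step and shared across the whole batch."""
    step_time = WEIGHT_STREAM_S + batch_size * COMPUTE_S
    return step_time / batch_size

for b in (1, 4, 16):
    print(f"batch={b:2d}: {time_per_token_per_sequence(b):.2f} s/token/sequence")
```

Note that while throughput improves with batch size, the latency of a single decoding step never drops below the streaming cost itself, which is why this is still unusable for an interactive chatbot.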

pskucherov commented on July 28, 2024

You can use ZeRO Offload while finetuning your model. You can extend our version of Megatron by adding CPU/NVME offload for inference, it is not very hard.

Thanks for the hint. I was able to run it on an RTX 3070 Ti πŸ˜ƒ
[screenshot of the model running]

can't you generate multiple tokens in parallel for different string prefixes?
Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights.

Can you tell me a little more about this? I feel that I need to do this now, but I don't know where to start.

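A place to start is the batching mechanics themselves: left-pad the tokenized prefixes to a common length and track an attention mask, so that one forward pass per decoding step serves every sequence. A framework-free sketch (the token ids and pad id are made-up values for illustration):

```python
# Minimal sketch of batching several tokenized prefixes for one forward pass.
# PAD_ID and the example token ids are made-up values.

PAD_ID = 0

def pad_batch(prefixes: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Left-pad token-id sequences to equal length and build attention
    masks (1 = real token, 0 = padding). Left padding keeps the last
    position of every row at the true end of its prefix, which is where
    the next token is predicted for decoder-only models."""
    max_len = max(len(p) for p in prefixes)
    input_ids, attention_mask = [], []
    for p in prefixes:
        pad = max_len - len(p)
        input_ids.append([PAD_ID] * pad + p)
        attention_mask.append([0] * pad + [1] * len(p))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 7, 9], [3, 4], [8]])
print(ids)   # every row now has the same length
print(mask)
```

The same padded batch can then be fed to the model each step, appending one sampled token per row, so the weights are streamed once per step for the whole batch rather than once per sequence.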

lostmsu commented on July 28, 2024

@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?


alkavan commented on July 28, 2024

@pskucherov Being able to run on an RTX 3070 Ti is really great.
If you could share some code showing how to do the ZeRO offload, that would be nice!


Vbansal21 commented on July 28, 2024

Hugging Face Accelerate seems to be the solution. Closing the issue.

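For later readers: the Accelerate route usually means building the model skeleton with empty weights and letting `load_checkpoint_and_dispatch` place layers on GPU, then CPU RAM, then disk. A rough sketch, assuming the weights have been converted to a transformers-compatible checkpoint; all paths are placeholders, and this is not runnable without the ~200 GB checkpoint on hand:

```python
# Hedged sketch: dispatch a large checkpoint across GPU / CPU / disk
# with Hugging Face Accelerate. CHECKPOINT_DIR is a placeholder path.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

CHECKPOINT_DIR = "/path/to/yalm-100b-checkpoint"  # placeholder

config = AutoConfig.from_pretrained(CHECKPOINT_DIR)
with init_empty_weights():
    # Build the module graph without allocating real weight tensors.
    model = AutoModelForCausalLM.from_config(config)

# Fill the GPU first, then spill remaining layers to CPU RAM and disk.
model = load_checkpoint_and_dispatch(
    model,
    CHECKPOINT_DIR,
    device_map="auto",
    offload_folder="/local_nvme/offload",  # placeholder
)
```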
