Comments (9)
> @pskucherov do you have a fork that's ready to pick up and start from for a single GPU?
No, I don't have working code yet, but it should be working soon.
Running from NVMe in this configuration is useless. Offloading to CPU RAM is more realistic, although generation is still very slow.
In any case, it's all in a private repository.
from yalm-100b.
You can use ZeRO Offload while finetuning your model.
You can extend our version of Megatron by adding CPU/NVMe offload for inference; it is not very hard. However, the GPU will have to stream the entire model (200 GB) for each generated token, which adds about 8 seconds per token even in the best configuration =(
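For the fine-tuning side, ZeRO Offload is driven by a DeepSpeed configuration. A minimal sketch, expressed as a Python dict, is below; the field names come from DeepSpeed's ZeRO config schema, while the batch sizes and the NVMe path are illustrative assumptions, not values from this repository:

```python
# Minimal DeepSpeed ZeRO-Offload config sketch, as a Python dict.
# Stage 3 partitions parameters and can offload them (and the optimizer
# state) to CPU or NVMe. Batch sizes and nvme_path are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# This dict would be passed to deepspeed.initialize(model=..., config=ds_config).
```

Switching `offload_param.device` to `"cpu"` keeps parameters in host RAM instead, which is faster than NVMe if you have enough memory.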
@MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if the use case permits batching.
@MichaelEk Thanks for replying.
Actually, I wasn't able to implement offload for inference; that's why I created an issue here. It would be nice if an updated script could be provided.
> @MichaelEk can't you generate multiple tokens in parallel for different string prefixes? E.g. forward with a batch of 4 sentences. That way one can remove the bottleneck of streaming weights to GPU memory, if the use case permits batching.

Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights. However, the total time of a single generation will still be too long to use it in, for example, a chatbot.
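The amortization can be sketched with back-of-envelope arithmetic: the weights are streamed once per forward step regardless of batch size, so the per-sequence cost of that stream shrinks linearly with the batch. The ~8 s streaming cost comes from the comment above; the per-step compute cost is an illustrative assumption:

```python
def time_per_token_per_sequence(batch_size, stream_s=8.0, compute_s=0.5):
    """Amortized per-sequence token latency when weight streaming dominates.

    The 200 GB of weights is streamed once per step, shared by every
    sequence in the batch; compute_s is an illustrative per-step cost.
    """
    return (stream_s + compute_s) / batch_size

# Larger batches amortize the one-time streaming cost per step:
print(time_per_token_per_sequence(1))  # 8.5 s per token per sequence
print(time_per_token_per_sequence(4))  # 2.125 s per token per sequence
```

Note that the wall-clock time per step is unchanged, which is exactly why interactive latency (e.g. a chatbot) stays poor even though throughput improves.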
> You can use ZeRO Offload while finetuning your model. You can extend our version of Megatron by adding CPU/NVME offload for inference, it is not very hard.
Thanks for the hint. I was able to run it on an RTX 3070 Ti.
> can't you generate multiple tokens in parallel for different string prefixes?
>
> Yes, actually you can. If you have a lot of input prefixes, you can significantly reduce the total time spent loading weights.

Could you tell a little more about this? I feel I need to do this now, but I don't know where to start.
@pskucherov do you have a fork that's ready to pick up and start from for a single GPU?
@pskucherov Being able to run on an RTX 3070 Ti is really great.
If you could share some code showing how you used ZeRO Offload, that would be nice!
Hugging Face Accelerate seems to be the solution. Closing the issue.
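For reference, Accelerate's big-model-inference path boils down to a few keyword arguments passed through `transformers.AutoModelForCausalLM.from_pretrained`. A minimal sketch of those knobs follows; the checkpoint path and offload folder are hypothetical placeholders, and this assumes the weights have been converted to a Hugging Face-compatible format:

```python
# Sketch of the Accelerate "big model inference" knobs, as keyword
# arguments for transformers' from_pretrained. The checkpoint path and
# offload folder below are hypothetical placeholders.
load_kwargs = {
    "device_map": "auto",           # let Accelerate split layers across GPU/CPU/disk
    "offload_folder": "./offload",  # spill layers that don't fit to disk
    "offload_state_dict": True,     # bound CPU RAM usage while loading
}

# Usage (requires transformers + accelerate and converted weights):
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained("path/to/yalm-100b-hf", **load_kwargs)
```

With `device_map="auto"`, Accelerate fills the GPU first, then CPU RAM, then the offload folder, which is how a 200 GB model can load on a single consumer card at the cost of much slower generation.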
Related Issues (20)
- Привет (Hello)
- How did you used LAMB optimizer with ZeRO CPU offload?
- Run on networked nodes
- AWS
- Could you share the md5 value for those checkpoints?
- Can it be launched on usual VPS? For example, 6 CPU 16 RAM (usual chips)
- Would it be possible to run the model on single A100 (40GB) or 2xV100 (32GB) ?
- No mention of `bfloat16` in source, and yet weights are `bfloat16`
- CUDA out of memory
- NCCL error
- PCI x1 or PCI x16 for GPU
- Is there any plans for making cloud service?
- Has anyone deployed it on 10x 3090 ? Or any similar configuration?
- Provide pruned version for weaker hardware
- Citation bibtex?
- Request to Open "Russian Pile" Dataset for Public Access
- How to use it with LangChain?
- Timeout on 8 x RTX A6000
- Why usage ssh-agent and openssh-client package in docker
- gguf / mlx format?