Comments (8)
Hi @stas00 @mayank31398 and @zcrypt0,

Yes, I confirm @stas00's recipe, and I agree with adding some instructions there so that people can generate their own int8 versions of any model. Regarding the checkpoint-loading part, it has already been added for BLOOM, and I am working on adding general support for other models too, so that you can create your own checkpoint and easily load it with DeepSpeed-Inference.

Thanks,
Reza
> @stas00 are there any docs on how to go about creating sharded int8 weights oneself?

If I remember correctly:

- you pass the dtype `torch.int8` to the `deepspeed.init_inference()` call, and
- you also add `save_mp_checkpoint_path=/where/to/save/it` in the same call (see the sketch just below).
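Concretely, assuming the standard `deepspeed.init_inference()` usage from the bloom-ds-inference script, that recipe might look like the following; the model name, `mp_size`, and save path are placeholders:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# load the fp16/bf16 model as usual
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")

# wrap it for tensor-parallel inference, requesting int8 weights and
# asking deepspeed to dump the resulting pre-sharded checkpoint to disk
model = deepspeed.init_inference(
    model,
    mp_size=8,                                    # tensor-parallel degree
    dtype=torch.int8,                             # quantize to int8
    replace_with_kernel_inject=True,              # use fused inference kernels
    save_mp_checkpoint_path="/where/to/save/it",  # where the shards get written
)
```

You would launch this under the `deepspeed` launcher with as many processes as `mp_size`.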
@RezaYazdaniAminabadi could you please confirm? Also, would it be possible to document this kind of recipe? Perhaps here: https://www.deepspeed.ai/tutorials/inference-tutorial/

I think users would want that to cut the download size and speed up loading (i.e., for models besides BLOOM, which don't have the pre-sharded repo you created).
Thank you!
> I assume they are based on the bitsandbytes BLOOM int8 weights,

bnb has nothing to do with DeepSpeed's quantized solution; these are two different implementations. You'd use bnb's approach when you don't use DeepSpeed. E.g. `transformers` uses bnb behind the scenes, but I'm not sure whether there is a way to pre-make bnb weights; it creates them on the fly.
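For contrast, a minimal sketch of the bnb path via `transformers` (this assumes the `bitsandbytes` and `accelerate` packages are installed; quantization happens on the fly while loading, so there is nothing to pre-shard):

```python
from transformers import AutoModelForCausalLM

# bnb int8 quantization is applied layer by layer as the weights load;
# device_map="auto" lets accelerate spread the layers across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```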
It is still OK to use the same repo with 4 GPUs; once loaded, it's all the same. Loading might be slightly slower, but I doubt it, since I'm pretty sure 4 processes will still do faster I/O than 8.

But yes, you can always save the 4-GPU version and use that as well.
Thanks for the quick reply.
@stas00 are there any docs on how to go about creating sharded int8 weights oneself? I assume they are based on the bitsandbytes BLOOM int8 weights, but I am not sure how you loaded them into DeepSpeed for sharding.
@zcrypt0 I haven't quantized a model myself, but I think this example should work:
https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh
Thank you @stas00 and @RezaYazdaniAminabadi.

If I understand correctly, you simply change the dtype in the script and add `save_mp_checkpoint_path`, just as if you were sharding the original weights in fp16. Is it correct that you set the model name to `bigscience/bloom`?
I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch:

```
RuntimeError: The size of tensor a (6144) must match the size of tensor b (14336) at non-singleton dimension 1
```

I have never tried a 7-GPU node with the fp16 weights, but I assumed it would work, since 7 divides the number of BLOOM attention heads evenly. I did have a 14-GPU node working, though. Is there a fundamental limitation in the 7-GPU case, or is it perhaps a bug in DeepSpeed?
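For reference, the divisibility assumption can be checked quickly against BLOOM-176B's published dimensions (hidden size 14336, 112 attention heads):

```python
# BLOOM-176B dimensions from its public config
hidden_size, n_heads = 14336, 112

# check which tensor-parallel degrees divide both evenly
for tp in (4, 7, 8, 14, 16):
    print(tp, hidden_size % tp == 0, n_heads % tp == 0)
# 7 divides both (14336 / 7 = 2048, 112 / 7 = 16), so the head-count
# reasoning above holds, and the mismatch looks like something else
```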
> Is it correct that you set the model name to `bigscience/bloom`?

You can call it anything you want, since you're saving it into a custom model dump on your disk; you could set the save path to, say, `/path/to/bloom-int8-7gpus`. The name of the repo is only important if you're trying to retrieve it from online; locally it's whatever you want it to be. It becomes important again should you upload it to the Hub, and then it'll be `your_username/your_model_name`.

When you load it after saving locally, you'd use the local path instead of the name, so you'd replace `bigscience/bloom` with `/path/to/bloom-int8-7gpus` if you follow my example.
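A hedged sketch of that loading step, assuming the dump layout produced by `save_mp_checkpoint_path` (a DeepSpeed-Inference config json written next to the shard files; the json filename below is an assumption):

```python
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

local_path = "/path/to/bloom-int8-7gpus"  # the dump saved earlier

# instantiate the model shell on the meta device so no fp16 weights are
# materialized; the real, already-quantized shards come from the checkpoint
config = AutoConfig.from_pretrained("bigscience/bloom")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=7,                      # must match the sharding you saved
    dtype=torch.int8,
    replace_with_kernel_inject=True,
    checkpoint=f"{local_path}/ds_inference_config.json",  # assumed filename
)
```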
> I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch

I will let @RezaYazdaniAminabadi reply to that, as it's his domain.