Comments (8)
Hi @stas00 @mayank31398 and @zcrypt0,

Yes, I confirm @stas00's recipe, and I agree with adding some instructions there so that people can generate their own int8 versions of any model. Regarding the checkpoint-loading part, it has already been added for BLOOM, and I am working on adding general support for other models too, so that you can create your own checkpoint and easily load it with DeepSpeed-Inference.

Thanks,
Reza
> @stas00 are there any docs on how to go about creating sharded int8 weights oneself?

If I remember correctly:

- you pass the dtype `torch.int8` to the `deepspeed.init_inference()` call, and
- you also add `save_mp_checkpoint_path=/where/to/save/it` in the same call (see the sketch just below).
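Concretely, assuming the standard `deepspeed.init_inference()` usage from the bloom-ds-inference script, that recipe might look like the following; the model name, `mp_size`, and save path are placeholders:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# load the fp16/bf16 model as usual
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")

# wrap it for tensor-parallel inference, requesting int8 weights and
# asking deepspeed to dump the resulting pre-sharded checkpoint to disk
model = deepspeed.init_inference(
    model,
    mp_size=8,                                    # tensor-parallel degree
    dtype=torch.int8,                             # quantize to int8
    replace_with_kernel_inject=True,              # use fused inference kernels
    save_mp_checkpoint_path="/where/to/save/it",  # where the shards get written
)
```

You would launch this under the `deepspeed` launcher with as many processes as `mp_size`.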
@RezaYazdaniAminabadi could you please confirm? Also, would it be possible to document this kind of recipe? Perhaps here: https://www.deepspeed.ai/tutorials/inference-tutorial/

I think users would want that to cut the download size and speed up loading (i.e., for models besides BLOOM, which don't have the pre-sharded repo you created).
Thank you!
> I assume they are based on the bitsandbytes BLOOM int8 weights,

bnb has nothing to do with DeepSpeed's quantized solution; these are two different implementations. You'd use bnb's approach when you don't use DeepSpeed. E.g. `transformers` uses bnb behind the scenes, but I'm not sure whether there is a way to pre-make bnb weights; it creates them on the fly.
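For contrast, a minimal sketch of the bnb path via `transformers` (this assumes the `bitsandbytes` and `accelerate` packages are installed; quantization happens on the fly while loading, so there is nothing to pre-shard):

```python
from transformers import AutoModelForCausalLM

# bnb int8 quantization is applied layer by layer as the weights load;
# device_map="auto" lets accelerate spread the layers across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```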
It is still OK to use the same repo with 4 GPUs; once loaded, it's all the same. Loading might be slightly slower, but I doubt it, since I'm pretty sure 4 processes will still do faster I/O than 8.

But yes, you can always save the 4-GPU version and use that as well.
Thanks for the quick reply.
@stas00 are there any docs on how to go about creating sharded int8 weights oneself? I assume they are based on the bitsandbytes BLOOM int8 weights, but I am not sure how you loaded them into DeepSpeed for sharding.
@zcrypt0 I haven't quantized a model myself, but I think this example should work:
https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh
Thank you @stas00 and @RezaYazdaniAminabadi.

If I understand correctly, you simply change the dtype in the script and add `save_mp_checkpoint_path`, just as if you were sharding the original weights in fp16. Is it correct that you set the model name to `bigscience/bloom`?
I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch:

```
RuntimeError: The size of tensor a (6144) must match the size of tensor b (14336) at non-singleton dimension 1
```

I have never tried a 7-GPU node with the fp16 weights, but I assumed it would work, since 7 divides the number of BLOOM attention heads evenly. I did have a 14-GPU node working, though. Is there a fundamental limitation in the 7-GPU case, or is it perhaps a bug in DeepSpeed?
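For reference, the divisibility assumption can be checked quickly against BLOOM-176B's published dimensions (hidden size 14336, 112 attention heads):

```python
# BLOOM-176B dimensions from its public config
hidden_size, n_heads = 14336, 112

# check which tensor-parallel degrees divide both evenly
for tp in (4, 7, 8, 14, 16):
    print(tp, hidden_size % tp == 0, n_heads % tp == 0)
# 7 divides both (14336 / 7 = 2048, 112 / 7 = 16), so the head-count
# reasoning above holds, and the mismatch looks like something else
```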
> Is it correct that you set the model name to `bigscience/bloom`?

You can call it anything you want, since you're saving it into a custom model dump on your disk; you could set the save path to, say, `/path/to/bloom-int8-7gpus`. The name of the repo is only important if you're trying to retrieve it from online; locally it's whatever you want it to be. It becomes important again should you upload it to the Hub, and then it'll be `your_username/your_model_name`.

When you load it after saving locally, you'd use the local path instead of the name, so you'd replace `bigscience/bloom` with `/path/to/bloom-int8-7gpus` if you follow my example.
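A hedged sketch of that loading step, assuming the dump layout produced by `save_mp_checkpoint_path` (a DeepSpeed-Inference config json written next to the shard files; the json filename below is an assumption):

```python
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

local_path = "/path/to/bloom-int8-7gpus"  # the dump saved earlier

# instantiate the model shell on the meta device so no fp16 weights are
# materialized; the real, already-quantized shards come from the checkpoint
config = AutoConfig.from_pretrained("bigscience/bloom")
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=7,                      # must match the sharding you saved
    dtype=torch.int8,
    replace_with_kernel_inject=True,
    checkpoint=f"{local_path}/ds_inference_config.json",  # assumed filename
)
```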
> I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch

I will let @RezaYazdaniAminabadi reply to that, as it's his domain.