Comments (8)

RezaYazdaniAminabadi commented on May 17, 2024

Hi @stas00 @mayank31398 and @zcrypt0,

Yes, I confirm what @stas00 said, and I agree with putting some instructions there for people to generate their own int8 versions of any model. Regarding the checkpoint-loading part, it has already been added for BLOOM only, and I am working on adding general support for other models too, so that you can create your own checkpoint and easily load it with DeepSpeed-Inference.
Thanks,
Reza

stas00 commented on May 17, 2024

@stas00 are there any docs on how to go about creating sharded int8 weights oneself?

If I remember correctly

  1. You provide the dtype of torch.int8 here:

  2. And you also add: save_mp_checkpoint_path=/where/to/save/it in the same call.
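
Putting those two together, a minimal sketch of what the call might look like (the model loading, mp_size, and paths here are illustrative, and the exact kwargs depend on your DeepSpeed version, so treat this as an assumption rather than a tested recipe):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Illustrative only: load the fp16 model first (the real scripts use a more
# memory-frugal loading path), then let DeepSpeed-Inference quantize it and
# dump the sharded int8 checkpoint to disk.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=8,                                    # number of GPUs to shard across
    dtype=torch.int8,                             # 1. request the int8 dtype
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/where/to/save/it",  # 2. where to dump the sharded checkpoint
)
```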

@RezaYazdaniAminabadi could you please confirm? Also is it possible to document this type of recipe?
Perhaps here? https://www.deepspeed.ai/tutorials/inference-tutorial/
I think users would want that to reduce the download size and speed up the download (i.e. for other models besides BLOOM, which don't have the pre-sharded repo you created).
Thank you!

I assume they are based on the bitsandbytes BLOOM int8 weights,

bnb has nothing to do with DeepSpeed's quantized solution; these are two different implementations. You'd use bnb's approach when you don't use DeepSpeed, e.g.:

if infer_dtype == "int8":
    print_rank0("Using `load_in_8bit=True` to use quantized model")
    kwargs['load_in_8bit'] = True

transformers uses bnb behind the scenes, but I'm not sure if there is a way to pre-make bnb weights - it creates them on the fly.
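
For context, this is roughly how that flag is consumed (the model name and surrounding variables are illustrative; load_in_8bit=True is the standard transformers/bitsandbytes path and requires bitsandbytes and accelerate to be installed):

```python
from transformers import AutoModelForCausalLM

infer_dtype = "int8"
kwargs = {"device_map": "auto"}
if infer_dtype == "int8":
    kwargs["load_in_8bit"] = True  # transformers dispatches to bitsandbytes here

# bnb quantizes the weights on the fly while loading; there is no separate
# pre-made int8 checkpoint in this path.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", **kwargs)
```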

stas00 commented on May 17, 2024

It is still OK to use the same repo with 4 GPUs; once loaded it's all the same. Loading will perhaps be slightly slower, but I doubt it, since I'm pretty sure 4 processes will still do faster IO than 8.

But yes, you can always save the 4-gpu version and use that as well.

zcrypt0 commented on May 17, 2024

Thanks for the quick reply.

zcrypt0 commented on May 17, 2024

@stas00 are there any docs on how to go about creating sharded int8 weights oneself?

I assume they are based on the bitsandbytes BLOOM int8 weights, but am not sure how you loaded them into DeepSpeed for sharding.

mayank31398 commented on May 17, 2024

@zcrypt0 I haven't quantized a model myself.
But this example should work I think:
https://github.com/microsoft/DeepSpeedExamples/blob/master/model_compression/gpt2/bash_script/run_zero_quant.sh

zcrypt0 commented on May 17, 2024

Thank you @stas00 and @RezaYazdaniAminabadi

If I understand correctly, you simply change the dtype in the script, and add save_mp_checkpoint_path as if you were sharding the original weights in fp16.

Is it correct that you set the model name to bigscience/bloom?

I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch:

RuntimeError: The size of tensor a (6144) must match the size of tensor b (14336) at non-singleton dimension 1

Full Traceback

I have never tried a 7x-GPU node with the fp16 weights, but I assumed it would work since 7 evenly divides the number of BLOOM attention heads (112).

I did have a 14x GPU node working though.

Is there a fundamental limitation here with the 7x-GPU case, or is it perhaps a bug in DeepSpeed?

stas00 commented on May 17, 2024

Is it correct that you set the model name to bigscience/bloom?

You can call it anything you want since you're saving it into a custom model dump on your disk; for example, you can set the saving path to /path/to/bloom-int8-7gpus. The name of the repo is only important if you're trying to retrieve it from online; locally it's whatever you want it to be. It becomes important again should you upload it to the hub, and then it'll be your_username/your_model_name.

When you load it after saving locally, you'd use the local path instead of the hub name, so you'd replace bigscience/bloom with /path/to/bloom-int8-7gpus if you follow my example.
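
In other words, something like this (the paths are hypothetical, just to show the swap):

```python
# Before: resolve the model from the Hub
model_name = "bigscience/bloom"

# After saving with save_mp_checkpoint_path on a 7-GPU run, point at the
# local dump instead; the rest of the loading code stays the same.
model_name = "/path/to/bloom-int8-7gpus"
```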

I tried this same protocol several days ago on a 7x A6000 node and ran into a tensor size mismatch

I will let @RezaYazdaniAminabadi reply to that as it's his domain.
