huggingface / transformers-bloom-inference


Fast Inference Solutions for BLOOM

License: Apache License 2.0

Python 85.89% Makefile 5.30% Dockerfile 1.89% CSS 0.48% JavaScript 4.16% HTML 2.29%
bloom huggingface-transformers nlp pytorch

transformers-bloom-inference's Introduction

Fast Inference Solutions for BLOOM

This repo provides demos and packages for fast inference with BLOOM. Some of the solutions have their own repos, in which case a link to the corresponding repo is provided instead.

Inference solutions for BLOOM 176B

We support HuggingFace accelerate and DeepSpeed Inference for generation.

Install required packages:

pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2

Alternatively, you can install DeepSpeed from source:

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and 4 A100 80GB GPUs for BLOOM 176B (int8). These scripts might not work for other models or a different number of GPUs.

DS inference is deployed using logic borrowed from the DeepSpeed-MII library.

Note: Sometimes GPU memory is not freed when a DS inference deployment crashes. You can free this memory by running killall python in the terminal.

To use BLOOM quantized, use dtype = int8. For DeepSpeed-Inference, also change the model_name to microsoft/bloom-deepspeed-inference-int8. For HF accelerate, no change to model_name is needed.

HF accelerate uses LLM.int8() and DS-inference uses ZeroQuant for post-training quantization.
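For reference, below is a minimal sketch of loading a BLOOM checkpoint with LLM.int8() through HF accelerate. It is only an illustration, not this repo's entry point; it assumes a recent transformers (>= 4.22) with bitsandbytes installed, and uses the smaller bigscience/bloom-7b1 checkpoint as a stand-in for the full model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # smaller stand-in; the full model is bigscience/bloom
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place the layers on the available GPUs
    load_in_8bit=True,   # post-training quantization via LLM.int8()
)

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))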

BLOOM inference via command-line

This asks for generate_kwargs every time. Example: generate_kwargs =

{"min_length": 100, "max_new_tokens": 100, "do_sample": false}
  1. using HF accelerate
python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
  2. using DS inference
python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'

BLOOM server deployment

make <model_name> can be used to launch a generation server. Please note that the serving method is synchronous and users have to wait in a queue until the preceding requests have been processed. An example of how to fire server requests is given here and sketched below. Alternatively, a Dockerfile is also provided which launches a generation server on port 5000.
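For illustration, a minimal Python sketch of firing a request at the generation server. The host, port, and payload fields are assumptions based on the curl examples further down in the issues; adjust them to your deployment:

import requests

url = "http://127.0.0.1:5000/generate/"   # assumed endpoint; the Docker image exposes port 5000
payload = {
    "text": ["DeepSpeed is a machine learning framework"],
    "max_new_tokens": 40,
    "do_sample": False,
}
response = requests.post(url, json=payload)
print(response.json())  # contains the generated text and the number of generated tokens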

An interactive UI can be launched via the following command to connect to the generation server. The default URL of the UI is http://127.0.0.1:5001/. The model_name is only used by the UI to check whether the model is a decoder-only or an encoder-decoder model.

python -m ui --model_name bigscience/bloom

This command launches the following UI to play with generation. Sorry for the crappy design. Unfortunately, my UI skills only go so far. 😅

Benchmark system for BLOOM inference

  1. using HF accelerate
python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
  2. using DS inference
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

Alternatively, to load the model faster:

deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
  3. using DS ZeRO
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5

Support

If you run into things not working or have other questions, please open an Issue in the corresponding backend's repo.

If there is a specific issue with one of the scripts, and not with the backend itself, please open an Issue here and tag @mayank31398.

Other inference solutions

Client-side solutions

Solutions developed to perform large batch inference locally:

JAX:

Server solutions

A solution developed to be used in server mode (i.e. varied batch size, varied request rate) can be found here. It is implemented in Rust.

transformers-bloom-inference's People

Contributors

anselmwang, arvindsun, koreyou, li-plus, mayank31398, narsil, njhill, stas00, stoyanstatanasov, varung, younesbelkada


transformers-bloom-inference's Issues

TypeError: init_inference() got an unexpected keyword argument 'base_dir' in bloom-ds-inference.py

When running

deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16 --benchmark

I receive the following error:

Traceback (most recent call last):
  File "bloom-ds-inference.py", line 187, in <module>
    model = deepspeed.init_inference(model,
TypeError: init_inference() got an unexpected keyword argument 'base_dir'

After removing the base_dir arg from this line and re-running I get this error:

NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-ds-inference.py", line 187, in <module>
    model = deepspeed.init_inference(model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 292, in init_inference
    engine = InferenceEngine(model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 151, in __init__
    self.module.to(device)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

Running deepspeed==0.7.2. Please advise.

Edit: Apologies, pasted the wrong error in second snippet. Updated.

stuck when inferring

[Screenshot: Screen Shot 2023-03-14 at 12:31:46]

I run this script
deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloomz-7b1 --batch_size 8
and it gets stuck just like in the picture.

Log:

(base) raihanafiandi@instance-1:~/playground/transformers-bloom-inference$ deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloomz-7b1 --batch_size 8
[2023-03-14 05:30:02,152] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-14 05:30:02,965] [INFO] [runner.py:550:main] cmd = /opt/conda/bin/python3.7 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloomz-7b1 --batch_size 8
[2023-03-14 05:30:04,255] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-14 05:30:04,255] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-14 05:30:04,255] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-14 05:30:04,255] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-14 05:30:04,255] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-03-14 05:30:05,816] [INFO] [comm.py:663:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloomz-7b1
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 104857.60it/s]
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 113359.57it/s]
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 99864.38it/s]
Fetching 8 files: 100%|██████████| 8/8 [00:00<00:00, 32832.13it/s]
[2023-03-14 05:30:13,610] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-14 05:30:13,611] [WARNING] [config_utils.py:77:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-14 05:30:13,611] [INFO] [logging.py:77:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.0 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
Using /home/raihanafiandi/.cache/torch_extensions/py37_cu111 as PyTorch extensions root...

Is it because I am running a different BLOOM model (bloomz-7b1)? Please help me with this. Thank you.

root_dir in TemporaryCheckpointsJSON is redundant

In TemporaryCheckpointsJSON(https://github.com/huggingface/transformers-bloom-inference/blob/main/inference_server/models/ds_inference.py#L80) ,

When using glob.glob(f"{self.model_path}/*.bin"), the file paths in the list will all contain the model_path prefix (e.g. when the model name is bigscience/bloom):

{"type": "BLOOM", "checkpoints": ["bigscience/bloom/pytorch_model.bin"], "version": 1.0}

Whereas passing it as root_dir (glob.glob("*.bin", root_dir=self.model_path)) will not:

{"type": "BLOOM", "checkpoints": ["pytorch_model.bin"], "version": 1.0}

This also aligns with DeepSpeed's way of loading checkpoints (replace_module.py), because at load time it joins the base directory back on:

sd = [
  torch.load(os.path.join(base_dir1, checkpoint[i]), map_location='cpu')
]

So the current way of dumping the JSON duplicates model_path (see the sketch below).
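A small sketch of the difference (the root_dir keyword for glob.glob requires Python >= 3.10; "bigscience/bloom" is just a stand-in for a local directory containing the *.bin shards):

import glob
import json

model_path = "bigscience/bloom"

with_prefix = glob.glob(f"{model_path}/*.bin")             # e.g. ['bigscience/bloom/pytorch_model.bin']
without_prefix = glob.glob("*.bin", root_dir=model_path)   # e.g. ['pytorch_model.bin']

# Only the second form belongs in checkpoints.json, since DeepSpeed joins the
# base directory back on at load time: torch.load(os.path.join(base_dir, checkpoint))
print(json.dumps({"type": "BLOOM", "checkpoints": without_prefix, "version": 1.0}))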

And I raised a PR: #71 .

question regarding the float16 and bfloat

kernel_inject = True
# kernel_inject = False

if kernel_inject:
    # XXX: for now ds-inference only works with fp16
    dtype = torch.float16
else:
    dtype = torch.bfloat16

if args.benchmark:
    torch.cuda.empty_cache()
    gc.collect()
    deepspeed.runtime.utils.see_memory_usage("pre-from-pretrained", force=True)

# Construct model with fake meta tensors, later will be replaced during ds-inference ckpt load
with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)

In this code, the dtype argument in with deepspeed.OnDevice(dtype=dtype, device="meta"): is float16 (when kernel_inject is True), while in

model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)

here we use bfloat16.

I wonder why this inconsistency exists?

why no use deepspeed.init_inference in zero benchmark

From the source code, when the ZeRO benchmark is used, deepspeed.init_inference is not called:

# ds_zero.py
self.model = deepspeed.initialize(model=self.model, config_params=ds_config)[0]

If deepspeed.init_inference is not used, there is no transformer kernel optimization. Is this expected?
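For contrast, the kernel-injection path used by the DS-inference backend looks roughly like the sketch below. This is only an illustration; the argument names (mp_size, replace_with_kernel_inject) follow the DeepSpeed versions of that era and may differ in newer releases:

import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1", torch_dtype=torch.float16)
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # swap in the optimized transformer kernels
)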

BUILD ERROR with nvcc

When I run ds_inference.py, I get this error:

Emitting ninja build file /root/.cache/torch_extensions/py37_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/5] /usr/local/cuda-12.1/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/usr/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /usr/lib/python3.7/site-packages/torch/include -isystem /usr/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/lib/python3.7/site-packages/torch/include/TH -isystem /usr/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda-12.1/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -std=c++14 -c /usr/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -o apply_rotary_pos_emb.cuda.o
FAILED: apply_rotary_pos_emb.cuda.o

Why is nvcc compiling with -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__, which generates the error?

How can I resolve this?

Inference returns nan log-probability

I'm evaluating the Bloom model on the BIG-Bench benchmark.

I've modified the bloom inference server code to return logprob of a prompt, so that multi-choice tasks can be evaluated.
(code base here: https://github.com/vinhngx/transformers-bloom-inference/commits/main)

However, with the INT8 model microsoft/bloom-deepspeed-inference-int8 on a DGX-A100 80G, I'm observing that for many BIG-bench queries (though not all) the logprob is nan, for example for the prompt below:

Sentence: เคคเฅ‡เคœเคพเคœเฅ€ เคจเฅˆ เคธเคฐเคช เคฐเฅˆ เค•เคพเคŸเคฃ เคฐเฅ‡ เคฌเคพเคฌเคค เค•เฅ‡เคˆ เคฒเฅ‹เค•เค—เคพเคฅเคพเคตเคพเค‚ เคฎเคฟเคฒเฅˆเฅค เค…เฅ‡เค• เค—เคพเคฅเคพ เคฎเฅเคœเคฌ เคœเคฆ เคคเฅ‡เคœเคพเคœเฅ€ เค†เคชเคฐเฅˆ เคธเคพเคธเคฐเฅˆ เคœเคพเคฏ เคฐเฅˆเคฏเคพ เคนเคพ, เคคเฅŒ เคฐเคธเฅเคคเฅˆ เคฎเฅ‡เฅ‡เค‚ เค†เค‚ เค…เฅ‡เค• เคธเคฐเคช เคจเฅˆ เคฌเคฒเคฃ เคธเฅ‚เค‚ เคฌเคšเคพเคฏ เคฒเคฟเคฏเฅŒ, เคชเคฃ เค‡เคฃเคพเค‚ เคฐเฅ€ เค‡เคฃ เค•เฅ‹เคธเคฟเคธ เคฐเฅˆ เคฆเคฐเคฎเฅเคฏเคพเคจ เค‰เคฃเคฐเฅ€ เคธเคฐเคชเคฃเฅ€ เคฌเคฒ เคšเฅเค•เฅ€ เคนเฅ€เฅค เคธเฅ‹ เคฌเคšเคฟเคฏเฅ‹เคกเคผเฅŒ เคธเคฐเคช เค•เฅเคฐเฅ‹เคง เคฎเฅ‡เค‚ เคชเคพเค—เคฒ เคนเฅเคฏเค—เฅเคฏเฅŒ เค…เคฐ เคคเฅ‡เคœเคพเคœเฅ€ เคจเฅˆ เคกเคธเคฃ เคฒเคพเค—เฅเคฏเฅŒเฅค เคคเคฆ เคคเฅ‡เคœเคพเคœเฅ€ เค‰เคฃเคจเฅˆ เคฐเฅ‹เค•เคคเคพ เคฅเค•เคพเค‚ เคตเคšเคจ เคฆเคฟเคฏเฅŒ เค•เฅ‡ - "เคธเคพเคธเคฐเฅˆ เคœเคพเคฏ'เคฐ เคฎเฅเคนเฅˆเค‚ เคชเคพเค›เฅŒ เคฅเคพเคฐเฅˆ เค•เคจเฅˆ เค†เคตเฅ‚เค‚, เคคเคฆ เคฅเฅ‚เค‚ เคฎเฅเคนเคจเฅˆ เคกเคธ เคฒเฅ€เคœเฅˆเฅค" เคธเคพเคธเคฐเฅˆ เค—เคฏเคพเค‚ เค‡เคฃเคพเค‚ เคจเฅˆ เค…เคšเคพเคฃเคšเค• เค—เคพเคฏเคพเค‚ เค›เฅเคกเคพเคตเคฃ เคธเคพเคฐเฅ‚ เคฎเฅ‡เคฐเคพเค‚ เคฐเฅˆ เคฒเคพเคฐเฅˆ เคœเคพเคตเคฃเฅŒ เคชเคกเฅเคฏเฅŒเฅค เคฎเฅ‡เคฐเคพเค‚ เคฐเฅˆ เคธเคพเคฅเฅ‡ เคนเฅเคฏเฅˆ เคธเค‚เค˜เคฐเฅเคท เคฎเฅ‡เค‚ เค…เฅˆ เค…เคฃเฅ‚เค‚เคคเคพ เคˆ เค˜เคพเคฏเคฒ เคนเฅเคฏเค—เฅเคฏเคพ เคนเคพ, เคคเฅŒ เคˆ เค†เคชเคฐเฅˆ เคตเคšเคจ เคจเคฟเคญเคพเคตเคฃ เคธเคพเคฐเฅ‚ เค…เฅˆ เค‰เคฃ เคธเคฐเคช เค•เคจเฅˆ เคชเฅ‚เค—เฅเคฏเคพเฅค เคคเคฆ เค‡เคฃเคพเค‚เคจเฅ‡ เคฆเฅ‡เค–'เคฐ เคธเคฐเคช เค•เฅˆเคฏเฅŒ เค•เฅ‡ - " เคฅเคพเค‚เคฐเฅŒ เคคเฅ‹ เคธเค—เคฒเฅŒ เคธเคฐเฅ€เคฐ เคˆ เค˜เคพเคตเคพเค‚ เคธเฅ‚เค‚ เคญเคฐเฅเคฏเฅŒ เคชเคกเฅเคฏเฅŒ เคนเฅˆ, เคฎเฅเคนเฅˆเค‚ เคกเคธเฅ‚เค‚ เคˆ เคคเฅŒ เค•เค เฅˆ เคกเคธเฅ‚เค‚?" เค‡เคฃ เคฌเคพเคค เคฎเคพเคฅเฅˆ เคคเฅ‡เคœเคพเคœเฅ€ เค†เคชเคฐเฅ€ เคœเฅ€เคญ เคจเคฟเค•เคพเคฒเฅ€ เค…เคฐ เคธเคฐเคช เค‡เคฃเคพเค‚ เคฐเฅ€ เคœเฅ€เคญ เคจเฅˆ เคกเคธ เคฒเคฟเคฏเฅŒเฅค เคฆเฅ‚เคœเฅˆ เค•เคพเคจเฅ€ เคฒเฅ‹เค•เค—เคพเคฅเคพ เคฎเฅเคœเคฌ เคคเฅ‡เคœเคพเคœเฅ€ เคœเคฆ เค—เคพเคฏเคพเค‚ เคšเคฐเคพเคตเคฃ เคจเฅˆ เคœเคพเคฏเคพ เค•เคฐเคคเคพ เคนเคพ, เคคเคฆ เค…เฅ‡เค• เค—เคพเคฏ เค…เคฒเค— เคนเฅเคฏ'เคฐ เค…เฅ‡เค• เคฌเคฟเคฒ เคฐเฅˆ เค•เคจเฅˆ เคœเคพเคตเคคเฅ€ เคชเคฐเฅ€ เคนเฅ€, เคœเค เฅˆ เค…เฅ‡เค• เคธเคฐเคช เคจเคฟเค•เคฒ'เคฐ เค—เคพเคฏ เคฐเฅŒ เคฆเฅ‚เคง เคชเฅ€เคฏ เคœเคพเคตเคคเฅŒ เคนเฅŒเฅค เค† เคฌเคพเคค เค เคพ เคชเคกเคผเคฟเคฏเคพเค‚ เคคเฅ‡เคœเคพเคœเฅ€ เคธเคฐเคช เคจเฅˆ เคจเคฟเคค เคฆเฅ‚เคง เคชเคพเคตเคฃ เคฐเฅŒ เคตเคพเคฆเฅŒ เค•เคฐเฅเคฏเฅŒเฅค เคชเคฃ, เค•เคฟเคฃเฅ€ เคตเคœเฅˆ เคธเฅ‚เค‚ เค…เฅ‡เค• เคฆเคฟเคจ เค…เฅˆ เค‰เคฃเคจเฅ‡ เคฆเฅ‚เคง เคชเคพเคตเคฃเฅŒ เคญเฅ‚เคฒเค—เฅเคฏเคพเฅค เค‡เคฃ เคฌเคพเคค เคฎเคพเคฅเฅˆ เคธเคฐเคช เคฐเฅ€เคธ เคฎเฅ‡เค‚ เคเคพเคฒเฅŒเคเคพเคฒ เคนเฅเคฏ'เคฐ เค‡เคฃเคพเค‚ เคจเฅˆ เคกเคธเคฃเฅŒ เคšเคพเคฏเฅŒเฅค เคคเคฆ เคคเฅ‡เคœเคพเคœเฅ€ เคธเคพเคธเคฐเฅˆ เคœเคพเคฏ'เคฐ เค‰เค เฅˆ เคธเฅ‚เค‚ เคฌเคพเคตเคกเคผเคฟเคฏเคพเค‚ เค–เฅเคฆเฅŒเค–เฅเคฆ เคจเฅˆ เคกเคธเคพเคตเคฃ เคฐเฅŒ เคตเคพเคฏเคฆเฅŒ เค•เคฐเฅเคฏเฅŒเฅค เคœเคฆ เคคเฅ‡เคœเคพเคœเฅ€ เคธเคพเคธเคฐเฅˆ เคธเฅ‚เค‚ เค˜เคพเคฏเคฒ เคนเฅเคฏเฅ‹เคกเคผเคพ เค†เคฏ'เคฐ เค‰เคฃ เคœเค—เฅˆ เคชเฅ‚เค—เฅเคฏเคพ, เคคเฅŒ เคธเคฐเคช เคคเฅ‡เคœเคพเคœเฅ€ เคฐเฅŒ เคธเคพเคฐเฅŒ เคธเคฐเฅ€เคฐ เค˜เคพเคฏเคฒ เคฆเฅ‡เค–'เคฐ เคœเฅ€เคญ เคฎเคพเคฅเฅˆ เคกเคธ เคฒเคฟเคฏเฅŒเฅค เคคเฅ€เคธเคฐเฅ€ เคฒเฅ‹เค•เค˜เคพเคฅเคพ เคฎเฅเคœเคฌเคฎเฅ‡เคฐเคพเค‚ เคฐเฅˆ เคธเคพเคฅเฅˆ เคนเฅเคฏเฅˆ เคธเค‚เค˜เคฐเฅเคท เคฎเฅ‡เค‚ เคตเฅˆ เค…เคฃเฅ‚เค‚เคคเคพ เคˆ เค˜เคพเคฏเคฒ เคนเฅเคฏเค—เฅเคฏเคพ เคนเคพ, เคธเฅ‹ เคตเฅˆ เค‰เค เฅˆ เคˆ เคนเฅ‡เค เคพ เคชเคกเคผเค—เฅเคฏเคพเฅค เค‰เคฃ เคฌเค—เคค เค‰เค เฅˆ เคธเคฐเคช เคฌเฅˆเค เฅเคฏเฅŒ เคนเฅŒ เคœเคฟเค•เฅŒ เค…เคœเฅ‡เคœ เคคเฅ‡เคœเคพเคœเฅ€ เคฐเฅ€ เคœเฅ€เคญ เคฎเคพเคฅเฅˆ เค•เคพเคŸ เค–เคพเคฏเฅŒเฅค
  choice: Montenegrin
  choice: Aguacateco
  choice: Waima
  choice: Wipi
  choice: Central Buang
  choice: Balangao
  choice: Tosk Albanian
  choice: Golin
  choice: Sena
  choice: Eastern Bolivian Guarani
  choice: Rajasthani
Language: Tosk Albanian

The logprob is

[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]], nan)

Should I use bf16 or fp16?

Since bf16 and fp16 are different schemes, which should I use for bigscience/bloomz and bigscience/bloom? Or does loading in bf16 or fp16 produce the same results?

cuBLAS error with NVIDIA H100 HGX, CUDA v12.1, and cuDNN 8.8.1

cuBLAS error when running the following HuggingFace accelerate benchmark code on NVIDIA H100 HGX, CUDA v12.1, cuDNN 8.8.1, pytorch==2.0.0+cu118, within a Jupyter Notebook:

!CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark

...or at CLI:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark

SOLUTION:

Make the following edits within transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py to enable mixed precision:

line 10:
Add BitsAndBytesConfig to import call, as follows:
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

line 56:
Add quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) to statement as follows:
infer_dtype = args.dtype
if infer_dtype == "int8":
    dtype = torch.int8
    quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

kwargs = dict(
    device_map="auto",
)

line 77:
Edit kwargs to include quantization_config, as follows:
if infer_dtype == "int8":
    print_rank0("Using `load_in_8bit=True` to use quanitized model")
    # kwargs["load_in_8bit"] = True
    kwargs = {"load_in_8bit": True, "quantization_config": quantization_config, "device_map": "auto"}
else:
    kwargs["torch_dtype"] = dtype

Save the updated PY file, then run the accelerate inferencing code.

The updated .py file runs the benchmarks without errors. I recommend making these, or similar, code changes in the parent repo.

Error running bloom-7b1 model with deepspeed

Thanks for these scripts! I tried running the bloom-7b1 model

deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name 'bigscience/bloom-7b1'

I get the following error when loading the checkpoints:

KeyError: 'h.19.input_layernorm.weight'

The following is the log (the same traceback is printed by each rank):

Loading 2 checkpoint shards:   0%|                                                                   | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-ds-inference.py", line 199, in <module>
    **kwargs,
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/__init__.py", line 326, in init_inference
    base_dir)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    base_dir=base_dir)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/inference/engine.py", line 393, in _apply_injection_policy
    enable_cuda_graph=self.enable_cuda_graph)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/replace_module.py", line 962, in replace_transformer_layer
    quantizer,
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/load_checkpoint.py", line 229, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/load_checkpoint.py", line 227, in load_module_recursive
    level + 1)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/load_checkpoint.py", line 227, in load_module_recursive
    level + 1)
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/load_checkpoint.py", line 222, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/deepspeed/deepspeed/lib/python3.7/site-packages/deepspeed/module_inject/load_checkpoint.py", line 140, in load_transformer_layer
    module.norm_w.data.copy_(sd[0][prefix + 'input_layernorm.' + 'weight'])
KeyError: 'h.19.input_layernorm.weight'

I am using the following versions:
transformers version: 4.23.1
deepspeed version: 0.7.4

Any help/suggestion would be very helpful!

Unable to reload a quantized model

After setting load_in_8bit=True in the .from_pretrained() call, the model gets quantized. How should I save this model and reload it with .from_pretrained() again so that all weights are loaded normally?

transformers 4.21.3 doesn't support `load_in_8bit`

Using transformers version 4.21.3, I see this error:

Using `load_in_8bit=True` to use quanitized model
Traceback (most recent call last):
 File "bloom-accelerate-inference.py", line 115, in <module>
   model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
 File "/disk1/srahnamoon/llm/.lvenv/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
   return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
 File "/disk1/srahnamoon/llm/.lvenv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2113, in from_pretrained
   model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument 'load_in_8bit'

Upgrading to 4.22.0 resolved the issue. Maybe the documentation for the HF Accelerate solution needs to be corrected?

concurrent requests

I read in the README: "Please note that the serving method is synchronous and users have to wait in queue until the preceding requests have been processed. "

Is there any way to support concurrent requests? It may be a waste of resources to fire requests one by one.

Is there a way to initialize a random weight for

Hi! Thank you for the wonderful work of LLM inference.

For now, I am profiling the performance of each part of the code, and I do not care about the output. The checkpoint loading takes too much time, so I wonder whether there is a way to skip loading the checkpoint and just initialize the model with random weights.
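One possible approach (an assumption, not something from this repo) is to build the model from its config alone, which initializes random weights and skips the checkpoint download/load entirely:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-7b1")
# from_config() initializes the weights randomly instead of loading a checkpoint
model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
model.eval()  # ready for profiling forward passes; outputs will be meaningless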

How do build a web api for deepspeed inference

Hi Mayank,

Really nice to see your work here; I appreciate what you are doing for the community. I have a question for you. Based on your code I want to build a minimal API server using Sanic. When I use the bloom-ds-inference.py script it runs well. However, when I add some API-related code using Sanic, I see that the server spawns automatically on all the GPUs. How do I get around this? Is there a specific approach I need to take here? Your help and input would be highly appreciated.

Regards,
Vamsi

Running Error with microsoft/bloom-deepspeed-inference-fp16

Hi,

Thanks for the great repo! I have successfully run the bloom model on a 16X40GB machine via the following command:

deepspeed --num_gpus=16 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom --benchmark

When I change the model to microsoft/bloom-deepspeed-inference-fp16 via the following command:

deepspeed --num_gpus=16 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16 --benchmark

It gives me the following error when loading the checkpoints:

Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-ds-inference.py", line 200, in <module>
    model = deepspeed.init_inference(
  File "/export/home/DeepSpeed/deepspeed/__init__.py", line 305, in init_inference
    engine = InferenceEngine(model,
  File "/export/home/DeepSpeed/deepspeed/inference/engine.py", line 149, in __init__
    self._apply_injection_policy(
  File "/export/home/DeepSpeed/deepspeed/inference/engine.py", line 367, in _apply_injection_policy
    replace_transformer_layer(
  File "/export/home/DeepSpeed/deepspeed/module_inject/replace_module.py", line 987, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 229, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 224, in load_module_recursive
    load_module_recursive(
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 222, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 138, in load_transformer_layer
    load_parameters(child, prefix + n + '.')
  File "/export/home/DeepSpeed/deepspeed/module_inject/load_checkpoint.py", line 81, in load_parameters
    scale = scale.view(
AttributeError: 'NoneType' object has no attribute 'view'

Loading 8 checkpoint shards:  12%|████▌

I used the most recent transformers and DeepSpeed installed from the main branch:

>>> deepspeed.__version__
'0.7.4+cfead551'

>>> transformers.__version__
'4.22.0.dev0'

I would really appreciate it if you could provide some suggestions!

Distributed Training using the same loading method

I tried to use the same model loading method as in the bloom-accelerate-inference.py script and then, instead of calling the generate function, added a Trainer with data loaders to train a few layers of the model (the others were frozen). I set the local_rank argument in TrainingArgs and also set trainer.is_model_parallel to True.

I got the following error:

File "/----/Anandamoy/anaconda3/envs/my_env/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Could you please suggest what I might be doing wrong and what would be the correct way to use the loaded distributed model for training/finetuning?

"bloom-ds-zero-inference.py" works but "inference_server.cli --deployment_framework ds_zero" fails

deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-zero-inference.py --name /raid/data/richardwang/bloomz --cpu_offload worked and gave me inference output. /raid/data/richardwang/bloomz is a downloaded copy of bigscience/bloomz

However python -m inference_server.cli --model_name /raid/data/richardwang/bloomz --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}' failed with error message:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/richardwang/transformers-bloom-inference/inference_server/cli.py", line 43, in <module>
    main()
  File "/home/richardwang/transformers-bloom-inference/inference_server/cli.py", line 18, in main
    model = ModelDeployment(args, True)
  File "/home/richardwang/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
    self.model = get_model_class(args.deployment_framework)(args)
  File "/home/richardwang/transformers-bloom-inference/inference_server/models/ds_zero.py", line 51, in __init__
    self.model = get_hf_model_class(args.model_class).from_pretrained(args.model_name, torch_dtype=args.dtype)
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 464, in from_pretrained
    return model_class.from_pretrained(
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2357, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 657, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 808, in __init__
    self._configure_train_batch_size()
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 991, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/richardwang/venv/llamaenv/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 926, in _batch_assertion
    assert (
AssertionError: Train batch size: 0 has to be greater than 0

Environment:

torch              1.12.0+cu113
deepspeed          0.8.2
deepspeed-mii      0.0.4
transformers       4.26.1

I have checked that both bloom-inference-scripts/bloom-ds-zero-inference.py and inference_server/models/ds_zero.py call the same thing: AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16). How can I make it work to start up a server for bloomz with DeepSpeed ZeRO inference?

Max tokens generated remains constant for whatever the input token size

Whatever the max_new_tokens value is, the number of output tokens generated is always 100. Please take a look at the few calls I made using curl:

โฏ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["what is text generation?"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":256,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":54,"text":["\u201d. And after I get into the nitty gritty of how it works.\nThis is a simple way to get into language generation. There are other ways to get into it, but this gets right to the heart of it.\nThere is a book called How to Learn Any Language, that takes a very interesting approach to the whole situation. But it is expensive, so you have to be sure that it is on par with your goals. I would suggest taking a look at it.\nThe best"],"total_time_taken":"10.58 secs"}
โฏ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["write hello world program in C++"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":256,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":56,"text":[". In that program I also used a class. I have made some change in it. After that when I am using some commands on that file, it shows me some error. I think error is coming when i am compiling that file. But there is not any error in the program. This program is written with #include iostream class with using namespace std;. So help me. Sorry for my poor English. Please provide me a good solution. Or any new way to make this.\n#include <iostream>\n\n"],"total_time_taken":"10.59 secs"}
โฏ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["write hello world program in python"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":200,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":57,"text":["\nimport os, sys\nimport time as time\nfrom random import *\nn=10\n\ndef random(n):\n    x = int(random() * n)\n    return x\n\ndef main():\n    global n\n    n = random(n)\n    print n\n    print random(n)\n    print n\n\nmain()\n\nThe script is about creating a random number, using rand() and the time() object.\nCan someone help me with this problem? Thanks in advance.\nIt needs to be done"],"total_time_taken":"10.58 secs"}

DeepSpeed runtime partition failed

Hi All,

After trying the code:

deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom

With runtime partition, I am facing the below error:

[INFO ] PyProcess - [1,6]<stdout>:  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/load_checkpoint.py", line 140, in load_transformer_layer
[INFO ] PyProcess - [1,6]<stdout>:    module.norm_w.data.copy_(sd[0][prefix + 'input_layernorm.' + 'weight'])
[INFO ] PyProcess - [1,6]<stdout>:KeyError: 'h.69.input_layernorm.weight'

From the logic, I kind of understand the process is writing a checkpoints.json:

{"type": "BLOOM", "checkpoints": ["pytorch_model-00025-of-00026.bin", 
"pytorch_model-00004-of-00026.bin", 
"pytorch_model-00015-of-00026.bin", 
...
], "version": 1.0}

But it does not seem to work with DeepSpeed 0.7.3.

I believe this is a chicken-and-egg problem:

DeepSpeed is looking for this: https://huggingface.co/microsoft/bloom-deepspeed-inference-int8/blob/main/ds_inference_config.json, which nobody can produce without a checkpoints.json...

[Question] 4-GPU shard `microsoft/bloom-deepspeed-inference-int8`

On the HF model page for microsoft/bloom-deepspeed-inference-int8 it states that those weights are broken up into shards for 8 GPUs.

In this repo, there is an example using 4 gpus with that model.

My question is whether there is a benefit to producing (manually) a 4-GPU sharded version, or if there would be any difference at all.

/cc @stas00

transformers 4.22.1 fails with accelerate

@stas00 fyi
the current memory map {0, 51, ...} is failing for transformers 4.22.1 with the error:
some params were offloaded to disk due to the current device_map. Please provide offload folder.

Seems like a lot of things changed from 4.21.2 to 4.22.1.
I can work on fixing this. Any ideas how to go about it? Maybe just add 1 GB to the first GPU (seems like the easiest fix)?
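For illustration only, the knobs involved look roughly like the sketch below: a per-device max_memory map plus an offload_folder so that any parameters that do not fit on the GPUs can be offloaded to disk. The memory values here are assumptions, not the repo's actual configuration:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory={0: "51GiB", 1: "51GiB", 2: "51GiB", 3: "51GiB",
                4: "51GiB", 5: "51GiB", 6: "51GiB", 7: "51GiB", "cpu": "400GiB"},
    offload_folder="offload",   # where accelerate puts weights offloaded to disk
    torch_dtype="auto",
)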

The generated results are different when using greedy search during generation

Thank you very much for your work. I got a problem when I ran BLOOM-176B on 8*A100.

I followed the README.md and executed the following command. To be specific, I set do_sample = true and top_k = 1, which I thought was equivalent to greedy search:

python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": true, "top_k": 1}'

However, the generated outputs of several runs were different for the same inputs. This happened occasionally.

Do you have any clues or ideas about this?

My env info:

CUDA 11.7
nccl 2.14.3

accelerate 0.17.1
Flask 2.2.3
Flask-API 3.0.post1
gunicorn 20.1.0
pydantic 1.10.6
huggingface-hub 0.13.2

Can not generate text correctly after loading an int8 model

Hi,
I am confused about how to reload a quantized model.
When I load the model with the code below, it can generate text normally:

model = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-7b1',
    device_map='auto',
    torch_dtype=torch.int8,
    load_in_8bit=True,
    )

However, when I save this inference model with model.save_pretrained(output_dir, state_dict=state_dict) and reload it with
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_PATH, device_map='auto')
the generated text is meaningless, and it seems that some weights of the model checkpoint were not used when initializing BloomForCausalLM, with warnings like this:
['transformer.h.4.self_attention.dense.SCB', 'transformer.h.18.mlp.dense_4h_to_h.SCB', 'transformer.h.2.mlp.dense_h_to_4h.SCB', 'transformer.h.24.mlp.dense_4h_to_h.SCB', 'transformer.h.17.self_attention.query_key_value.SCB', 'transformer.h.25.self_attention.query_key_value.SCB', 'transformer.h.14.self_attention.dense.SCB', 'transformer.h.28.mlp.dense_4h_to_h.SCB', 'transformer.h.6.mlp.dense_4h_to_h.SCB', 'transformer.h.3.self_attention.query_key_value.SCB', 'transformer.h.13.self_attention.dense.SCB', 'transformer.h.2.self_attention.dense.SCB', 'transformer.h.11.mlp.dense_4h_to_h.SCB', 'transformer.h.6.mlp.dense_h_to_4h.SCB', 'transformer.h.13.mlp.dense_h_to_4h.SCB', 'transformer.h.19.mlp.dense_h_to_4h.SCB', 'transformer.h.19.self_attention.dense.SCB', 'transformer.h.29.self_attention.dense.SCB', 'transformer.h.11.mlp.dense_h_to_4h.SCB', 'transformer.h.14.mlp.dense_h_to_4h.SCB', 'transformer.h.11.self_attention.query_key_value.SCB', 'transformer.h.19.mlp.dense_4h_to_h.SCB', 'transformer.h.0.mlp.dense_h_to_4h.SCB', 'transformer.h.3.self_attention.dense.SCB', 'transformer.h.20.mlp.dense_h_to_4h.SCB', 'transformer.h.29.mlp.dense_4h_to_h.SCB', 'transformer.h.22.self_attention.dense.SCB', 'transformer.h.12.mlp.dense_h_to_4h.SCB', 'transformer.h.29.mlp.dense_h_to_4h.SCB', 'transformer.h.23.self_attention.query_key_value.SCB', 'transformer.h.7.self_attention.dense.SCB', 'transformer.h.22.mlp.dense_h_to_4h.SCB', 'transformer.h.21.mlp.dense_4h_to_h.SCB', 'transformer.h.21.self_attention.dense.SCB', 'transformer.h.8.mlp.dense_h_to_4h.SCB', 'transformer.h.24.mlp.dense_h_to_4h.SCB', 'transformer.h.1.mlp.dense_h_to_4h.SCB', 'transformer.h.5.self_attention.dense.SCB', 'transformer.h.18.self_attention.dense.SCB', 'transformer.h.3.mlp.dense_4h_to_h.SCB', 'transformer.h.14.self_attention.query_key_value.SCB', 'transformer.h.15.self_attention.query_key_value.SCB', 'transformer.h.21.mlp.dense_h_to_4h.SCB', 'transformer.h.5.mlp.dense_h_to_4h.SCB', 'transformer.h.20.mlp.dense_4h_to_h.SCB', 'transformer.h.7.mlp.dense_4h_to_h.SCB', 'transformer.h.16.mlp.dense_h_to_4h.SCB', 'transformer.h.5.self_attention.query_key_value.SCB', 'transformer.h.10.self_attention.dense.SCB', 'transformer.h.24.self_attention.dense.SCB', 'transformer.h.10.mlp.dense_4h_to_h.SCB', 'transformer.h.15.mlp.dense_h_to_4h.SCB', 'transformer.h.26.mlp.dense_4h_to_h.SCB', 'transformer.h.6.self_attention.query_key_value.SCB', 'transformer.h.26.mlp.dense_h_to_4h.SCB', 'transformer.h.27.self_attention.dense.SCB', 'transformer.h.7.mlp.dense_h_to_4h.SCB', 'transformer.h.21.self_attention.query_key_value.SCB', 'transformer.h.0.self_attention.query_key_value.SCB', 'transformer.h.23.mlp.dense_4h_to_h.SCB', 'transformer.h.10.mlp.dense_h_to_4h.SCB', 'transformer.h.17.self_attention.dense.SCB', 'transformer.h.8.mlp.dense_4h_to_h.SCB', 'transformer.h.12.mlp.dense_4h_to_h.SCB', 'transformer.h.7.self_attention.query_key_value.SCB', 'transformer.h.16.self_attention.query_key_value.SCB', 'transformer.h.28.self_attention.dense.SCB', 'transformer.h.3.mlp.dense_h_to_4h.SCB', 'transformer.h.4.mlp.dense_h_to_4h.SCB', 'transformer.h.9.self_attention.query_key_value.SCB', 'transformer.h.15.self_attention.dense.SCB', 'transformer.h.0.self_attention.dense.SCB', 'transformer.h.16.mlp.dense_4h_to_h.SCB', 'transformer.h.1.mlp.dense_4h_to_h.SCB', 'transformer.h.13.mlp.dense_4h_to_h.SCB', 'transformer.h.11.self_attention.dense.SCB', 'transformer.h.26.self_attention.query_key_value.SCB', 'transformer.h.28.self_attention.query_key_value.SCB', 
'transformer.h.13.self_attention.query_key_value.SCB', 'transformer.h.22.self_attention.query_key_value.SCB', 'transformer.h.26.self_attention.dense.SCB', 'transformer.h.29.self_attention.query_key_value.SCB', 'transformer.h.19.self_attention.query_key_value.SCB', 'transformer.h.12.self_attention.query_key_value.SCB', 'transformer.h.27.mlp.dense_4h_to_h.SCB', 'transformer.h.25.self_attention.dense.SCB', 'transformer.h.2.self_attention.query_key_value.SCB', 'transformer.h.10.self_attention.query_key_value.SCB', 'transformer.h.9.mlp.dense_4h_to_h.SCB', 'transformer.h.9.self_attention.dense.SCB', 'transformer.h.20.self_attention.query_key_value.SCB', 'transformer.h.9.mlp.dense_h_to_4h.SCB', 'transformer.h.27.mlp.dense_h_to_4h.SCB', 'transformer.h.14.mlp.dense_4h_to_h.SCB', 'transformer.h.8.self_attention.query_key_value.SCB', 'transformer.h.24.self_attention.query_key_value.SCB', 'transformer.h.22.mlp.dense_4h_to_h.SCB', 'transformer.h.4.mlp.dense_4h_to_h.SCB', 'transformer.h.5.mlp.dense_4h_to_h.SCB', 'transformer.h.1.self_attention.dense.SCB', 'transformer.h.1.self_attention.query_key_value.SCB', 'transformer.h.6.self_attention.dense.SCB', 'transformer.h.12.self_attention.dense.SCB', 'transformer.h.23.self_attention.dense.SCB', 'transformer.h.17.mlp.dense_h_to_4h.SCB', 'transformer.h.23.mlp.dense_h_to_4h.SCB', 'transformer.h.8.self_attention.dense.SCB', 'transformer.h.25.mlp.dense_4h_to_h.SCB', 'transformer.h.20.self_attention.dense.SCB', 'transformer.h.25.mlp.dense_h_to_4h.SCB', 'transformer.h.28.mlp.dense_h_to_4h.SCB', 'transformer.h.4.self_attention.query_key_value.SCB', 'transformer.h.18.self_attention.query_key_value.SCB', 'transformer.h.2.mlp.dense_4h_to_h.SCB', 'transformer.h.18.mlp.dense_h_to_4h.SCB', 'transformer.h.17.mlp.dense_4h_to_h.SCB', 'transformer.h.0.mlp.dense_4h_to_h.SCB', 'transformer.h.15.mlp.dense_4h_to_h.SCB', 'transformer.h.27.self_attention.query_key_value.SCB', 'transformer.h.16.self_attention.dense.SCB']
How can I load a quantized model the right way?
Looking forward to your reply. Thank you!

Cannot explain recurring OOM error

Hi there,

I am trying to use the int8 quantized model of BLOOM 175B for inference and am closely following the bloom-accelerate-inference.py script. I have about 1000 prompts for which I need the outputs. I use beam size of 1 (greedy search) and batch size of 1 since I can't fit more into my GPU memory (I have 4 * 80 GB A100 GPUs). max_new_tokens is set to 64.

When running inference on this list of prompts, after successfully generating on the first few sentences (61 in this case), my script crashes with an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 79.17 GiB total capacity; 77.63 GiB already allocated; 11.31 MiB free; 77.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Though long prompts often cause OOM, in this case, I do not think it is due to the length of the current prompt. I logged just to make sure, but prompts longer than the current one have been successfully generated in the past (in the first 61 prompts I was referring to).

I am unable to figure out what the possible reason could be. Any suggestions/ideas?

8x40GB GPU instance failing with microsoft/bloom-deepspeed-inference-int8

I'm trying to get BLOOM to run on a Lambda Labs GPU cloud 8x40GB instance and ran into the below issue. Not sure if I should post this here or in DeepSpeed-MII repo.

I ran the following in Jupyter IDE:

!python bloom-inference-server/cli.py --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'

Here is the full output:

[2022-09-17 23:54:09,940] [INFO] [deployment.py:74:deploy] *************DeepSpeed Optimizations: True*************
[2022-09-17 23:54:12,927] [INFO] [server_client.py:206:_initialize_service] multi-gpu deepspeed launch: ['deepspeed', '--num_gpus', '8', '--no_local_rank', '--no_python', '/usr/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'bigscience/bloom', '--model-path', '/home/ubuntu/.cache/huggingface/hub/models--microsoft--bloom-deepspeed-inference-int8/snapshots/aa00a6626f6484a2eef68e06d1e089e4e32aa571', '--port', '50950', '--ds-optimize', '--provider', 'hugging-face-llm', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDk1MCwgImR0eXBlIjogImludDgiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IHsiY2hlY2twb2ludHMiOiB7Im5vbl90cCI6IFsibm9uLXRwLnB0Il0sICJ0cCI6IFsidHBfMDBfMDAucHQiLCAidHBfMDFfMDAucHQiLCAidHBfMDJfMDAucHQiLCAidHBfMDNfMDAucHQiLCAidHBfMDBfMDEucHQiLCAidHBfMDFfMDEucHQiLCAidHBfMDJfMDEucHQiLCAidHBfMDNfMDEucHQiLCAidHBfMDBfMDIucHQiLCAidHBfMDFfMDIucHQiLCAidHBfMDJfMDIucHQiLCAidHBfMDNfMDIucHQiLCAidHBfMDBfMDMucHQiLCAidHBfMDFfMDMucHQiLCAidHBfMDJfMDMucHQiLCAidHBfMDNfMDMucHQiLCAidHBfMDBfMDQucHQiLCAidHBfMDFfMDQucHQiLCAidHBfMDJfMDQucHQiLCAidHBfMDNfMDQucHQiLCAidHBfMDBfMDUucHQiLCAidHBfMDFfMDUucHQiLCAidHBfMDJfMDUucHQiLCAidHBfMDNfMDUucHQiLCAidHBfMDBfMDYucHQiLCAidHBfMDFfMDYucHQiLCAidHBfMDJfMDYucHQiLCAidHBfMDNfMDYucHQiLCAidHBfMDBfMDcucHQiLCAidHBfMDFfMDcucHQiLCAidHBfMDJfMDcucHQiLCAidHBfMDNfMDcucHQiXX0sICJkdHlwZSI6ICJpbnQ4IiwgInBhcmFsbGVsaXphdGlvbiI6ICJ0cCIsICJ0cF9zaXplIjogNCwgInR5cGUiOiAiQkxPT00iLCAidmVyc2lvbiI6IDF9fQ==']
[2022-09-17 23:54:14,006] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-09-17 23:54:14,304] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_python --no_local_rank /usr/bin/python -m mii.launch.multi_gpu_server --task-name text-generation --model bigscience/bloom --model-path /home/ubuntu/.cache/huggingface/hub/models--microsoft--bloom-deepspeed-inference-int8/snapshots/aa00a6626f6484a2eef68e06d1e089e4e32aa571 --port 50950 --ds-optimize --provider hugging-face-llm --config eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDk1MCwgImR0eXBlIjogImludDgiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IHsiY2hlY2twb2ludHMiOiB7Im5vbl90cCI6IFsibm9uLXRwLnB0Il0sICJ0cCI6IFsidHBfMDBfMDAucHQiLCAidHBfMDFfMDAucHQiLCAidHBfMDJfMDAucHQiLCAidHBfMDNfMDAucHQiLCAidHBfMDBfMDEucHQiLCAidHBfMDFfMDEucHQiLCAidHBfMDJfMDEucHQiLCAidHBfMDNfMDEucHQiLCAidHBfMDBfMDIucHQiLCAidHBfMDFfMDIucHQiLCAidHBfMDJfMDIucHQiLCAidHBfMDNfMDIucHQiLCAidHBfMDBfMDMucHQiLCAidHBfMDFfMDMucHQiLCAidHBfMDJfMDMucHQiLCAidHBfMDNfMDMucHQiLCAidHBfMDBfMDQucHQiLCAidHBfMDFfMDQucHQiLCAidHBfMDJfMDQucHQiLCAidHBfMDNfMDQucHQiLCAidHBfMDBfMDUucHQiLCAidHBfMDFfMDUucHQiLCAidHBfMDJfMDUucHQiLCAidHBfMDNfMDUucHQiLCAidHBfMDBfMDYucHQiLCAidHBfMDFfMDYucHQiLCAidHBfMDJfMDYucHQiLCAidHBfMDNfMDYucHQiLCAidHBfMDBfMDcucHQiLCAidHBfMDFfMDcucHQiLCAidHBfMDJfMDcucHQiLCAidHBfMDNfMDcucHQiXX0sICJkdHlwZSI6ICJpbnQ4IiwgInBhcmFsbGVsaXphdGlvbiI6ICJ0cCIsICJ0cF9zaXplIjogNCwgInR5cGUiOiAiQkxPT00iLCAidmVyc2lvbiI6IDF9fQ==
[2022-09-17 23:54:15,364] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-09-17 23:54:15,364] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-09-17 23:54:15,364] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-09-17 23:54:15,364] [INFO] [launch.py:156:main] dist_world_size=8
[2022-09-17 23:54:15,364] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-09-17 23:54:17,952] [INFO] [server_client.py:115:_wait_until_server_is_live] waiting for server to start...
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            158-101-132-13
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4122

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           158-101-132-13
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
(Both Open MPI notices above are printed once per rank; the remaining duplicate copies are omitted.)
[2022-09-17 23:54:22,957] [INFO] [server_client.py:115:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 70, in <module>
    main()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 56, in main
    inference_pipeline = load_models(task_name=args.task_name,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/models/load_models.py", line 45, in load_models
    from mii.models.providers.llm import load_hf_llm
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/models/providers/llm.py", line 8, in <module>
    from transformers.utils import WEIGHTS_NAME, WEIGHTS_INDEX_NAME, cached_path, hf_bucket_url
ImportError: cannot import name 'cached_path' from 'transformers.utils' (/home/ubuntu/.local/lib/python3.8/site-packages/transformers/utils/__init__.py)
(The same ImportError traceback is raised by each of the remaining ranks; the interleaved duplicates are omitted.)
[2022-09-17 23:54:24,425] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78388
[2022-09-17 23:54:24,443] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78389
[2022-09-17 23:54:24,458] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78390
[2022-09-17 23:54:24,474] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78391
[2022-09-17 23:54:24,474] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78392
[2022-09-17 23:54:24,489] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78393
[2022-09-17 23:54:24,504] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78394
[2022-09-17 23:54:24,519] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 78395
[2022-09-17 23:54:24,534] [ERROR] [launch.py:292:sigkill_handler] ['/usr/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'bigscience/bloom', '--model-path', '/home/ubuntu/.cache/huggingface/hub/models--microsoft--bloom-deepspeed-inference-int8/snapshots/aa00a6626f6484a2eef68e06d1e089e4e32aa571', '--port', '50950', '--ds-optimize', '--provider', 'hugging-face-llm', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDk1MCwgImR0eXBlIjogImludDgiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IHsiY2hlY2twb2ludHMiOiB7Im5vbl90cCI6IFsibm9uLXRwLnB0Il0sICJ0cCI6IFsidHBfMDBfMDAucHQiLCAidHBfMDFfMDAucHQiLCAidHBfMDJfMDAucHQiLCAidHBfMDNfMDAucHQiLCAidHBfMDBfMDEucHQiLCAidHBfMDFfMDEucHQiLCAidHBfMDJfMDEucHQiLCAidHBfMDNfMDEucHQiLCAidHBfMDBfMDIucHQiLCAidHBfMDFfMDIucHQiLCAidHBfMDJfMDIucHQiLCAidHBfMDNfMDIucHQiLCAidHBfMDBfMDMucHQiLCAidHBfMDFfMDMucHQiLCAidHBfMDJfMDMucHQiLCAidHBfMDNfMDMucHQiLCAidHBfMDBfMDQucHQiLCAidHBfMDFfMDQucHQiLCAidHBfMDJfMDQucHQiLCAidHBfMDNfMDQucHQiLCAidHBfMDBfMDUucHQiLCAidHBfMDFfMDUucHQiLCAidHBfMDJfMDUucHQiLCAidHBfMDNfMDUucHQiLCAidHBfMDBfMDYucHQiLCAidHBfMDFfMDYucHQiLCAidHBfMDJfMDYucHQiLCAidHBfMDNfMDYucHQiLCAidHBfMDBfMDcucHQiLCAidHBfMDFfMDcucHQiLCAidHBfMDJfMDcucHQiLCAidHBfMDNfMDcucHQiXX0sICJkdHlwZSI6ICJpbnQ4IiwgInBhcmFsbGVsaXphdGlvbiI6ICJ0cCIsICJ0cF9zaXplIjogNCwgInR5cGUiOiAiQkxPT00iLCAidmVyc2lvbiI6IDF9fQ=='] exits with return code = 1
[2022-09-17 23:54:27,962] [INFO] [server_client.py:115:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "bloom-inference-server/cli.py", line 70, in <module>
    main()
  File "bloom-inference-server/cli.py", line 29, in main
    model = DSInferenceGRPCServer(args)
  File "/home/ubuntu/transformers-bloom-inference/bloom-inference-server/ds_inference/grpc_server.py", line 35, in __init__
    mii.deploy(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/deployment.py", line 94, in deploy
    return _deploy_local(deployment_name, model_path=model_path)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/deployment.py", line 100, in _deploy_local
    mii.utils.import_score_file(deployment_name).init()
  File "/tmp/mii_cache/ds_inference_grpc_server/score.py", line 29, in init
    model = mii.MIIServerClient(task,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/server_client.py", line 90, in __init__
    self._wait_until_server_is_live()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mii/server_client.py", line 113, in _wait_until_server_is_live
    raise RuntimeError("server crashed for some reason, unable to proceed")
RuntimeError: server crashed for some reason, unable to proceed

If it helps, I am using:

  • transformers>=4.22.0
  • deepspeed>=0.7.3
  • protobuf==3.20.*

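For what it's worth, `cached_path` and `hf_bucket_url` were dropped from `transformers.utils` in newer releases, which is what the MII import above trips over. A quick, hedged way to confirm the mismatch from the same environment (nothing repo-specific assumed):

import importlib

# Check whether this transformers build still exposes the symbols MII imports.
transformers_utils = importlib.import_module("transformers.utils")
missing = [name for name in ("cached_path", "hf_bucket_url") if not hasattr(transformers_utils, name)]
if missing:
    print(f"{missing} no longer exist in this transformers build; "
          "an older transformers (or a newer deepspeed-mii) is needed for this import path.")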
Runtime error: "Bus error: nonexistent physical address"

Hello,
Thank you for this interesting script. I am trying to test DeepSpeed inference, but it fails with the error below (I am just using a small BLOOM model to start with).
Any hint on what is going wrong here? It looks like some NCCL issue.

transformers-bloom-inference# deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-3b --benchmark --batch_size 8

[2022-10-06 20:04:34,388] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-10-06 20:04:35,202] [INFO] [runner.py:504:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-3b --benchmark --batch_size 8
[2022-10-06 20:04:37,529] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.12.12
[2022-10-06 20:04:37,529] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-10-06 20:04:37,529] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-10-06 20:04:37,529] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-10-06 20:04:37,529] [INFO] [launch.py:156:main] dist_world_size=8
[2022-10-06 20:04:37,530] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-10-06 20:04:40,295] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-3b
[2022-10-06 20:04:42,571] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-10-06 20:04:42,572] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-10-06 20:04:42,572] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 8.58 GB, percent = 1.3%
[2022-10-06 20:04:42,707] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-10-06 20:04:42,707] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-10-06 20:04:42,708] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 8.58 GB, percent = 1.3%
[2022-10-06 20:04:42,821] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-10-06 20:04:42,822] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-10-06 20:04:42,822] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 8.58 GB, percent = 1.3%
Fetching 8 files: 100%|████████████████████| 8/8 [00:00<00:00, 22504.65it/s]
(similar progress bars from the other ranks omitted)
[2022-10-06 20:04:42,938] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-10-06 20:04:42,939] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-10-06 20:04:42,939] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 8.73 GB, percent = 1.3%
Fetching 8 files: 100%|████████████████████| 8/8 [00:00<00:00, 21931.00it/s]
(similar progress bars from the other ranks omitted)
[sfr-pod-a100-x8-tian-lan:4200 :0:5033] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   5033) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000006929c ncclGroupEnd()  ???:0
 3 0x000000000005e36d ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f133 clone()  ???:0
=================================
[2022-10-06 20:04:59,604] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4195
[2022-10-06 20:04:59,779] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4196
[2022-10-06 20:04:59,952] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4197
[2022-10-06 20:05:00,125] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4198
[2022-10-06 20:05:00,299] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4199
[2022-10-06 20:05:00,472] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4200
[2022-10-06 20:05:00,472] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4201
[2022-10-06 20:05:00,645] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 4202
[2022-10-06 20:05:00,818] [ERROR] [launch.py:292:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'bloom-inference-scripts/bloom-ds-inference.py', '--local_rank=7', '--name', 'bigscience/bloom-3b', '--benchmark', '--batch_size', '8'] exits with return code = -7
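One common cause of NCCL bus errors inside containers (only a guess here, not a confirmed diagnosis for this report) is a too-small /dev/shm. A minimal check, assuming the script runs inside the same container:

import os
import shutil

# Report the size of /dev/shm; NCCL typically needs several GiB for 8 ranks.
if os.path.exists("/dev/shm"):
    total, _, _ = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm size: {total / 2**30:.1f} GiB")
else:
    print("/dev/shm not found")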

Wrong prediction from "bloom-deepspeed-inference-int8"

I'm running bloom-deepspeed-inference-int8 with the following command on an 8 x 40GB A100 machine.

deepspeed --num_gpus 8 xxx.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8 --batch_size 8

I got generation results, but they contain a lot of repetition, which is not the case for the accelerate-based BLOOM int8 implementation.

Generate args {'max_new_tokens': 100, 'do_sample': False}
------------------------------------------------------------
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep de
ep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep
 deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep deep

------------------------------------------------------------
in=He is working on
out=He is working on a new album, and is also working on a new album with his band, and is also working on a new album with his band, and is also working on a new album with his band, and is working on a new album, and is
working on a new album,

------------------------------------------------------------
in=He has a
out=He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a lot of money.
He has a

------------------------------------------------------------
in=He got all
out=He got all the way to the top of the mountain, and he was so very very very very very very very very very very very very very very

------------------------------------------------------------
in=Everyone is happy and I can
out=Everyone is happy and I can see that. I am happy too. I am happy too. I am happy too.

------------------------------------------------------------
in=The new movie that got Oscar this year
out=The new movie that got Oscar this year is a movie about a movie about a movie about a movie about

------------------------------------------------------------
in=In the far far distance from our galaxy,
out=In the far far distance from our galaxy, there is a a a a a a a a a galaxy

------------------------------------------------------------
in=Peace is the only way
out=Peace is the only way to live. We must be peaceful and live in
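Since the accelerate int8 path does not repeat, one cheap diagnostic (not a fix, and it does not rule out a problem in the int8 checkpoint itself) is to rerun with non-greedy generate kwargs and see whether the repetition survives. These are hypothetical values passed through the script's existing generate_kwargs plumbing:

# Hypothetical kwargs for a comparison run; values are only examples.
generate_kwargs = {
    "max_new_tokens": 100,
    "do_sample": True,          # move away from pure greedy decoding
    "top_p": 0.9,
    "repetition_penalty": 1.2,
}
print(generate_kwargs)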

beam search

Hello, I wonder if this repo supports beam search? Thanks!
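For reference, beam search is expressed through the usual transformers generate kwargs; whether the DS-inference/MII path in this repo forwards them is exactly the open question here. A minimal sketch against a small BLOOM checkpoint (model name chosen only for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model used purely to demonstrate the num_beams kwarg.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))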

RuntimeError: Error building extension 'transformer_inference'

I'm trying to load bigscience/bloomz-3b for some benchmarks using the bloom-inference-scripts/bloom-ds-inference.py script, but upon execution it raises this error:

FAILED: transformer_inference.so
c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o -shared -lcurand -L/home/mahyar/anaconda3/envs/deepspeed-testing/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda-11.6/lib64 -lcudart -o transformer_inference.so
/usr/local/cuda-11.6/lib64/libcurand.so: file not recognized: File truncated
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/mahyar/anaconda3/envs/deepspeed-testing/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/home/mahyar/anaconda3/envs/deepspeed-testing/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

RuntimeError: Error building extension 'transformer_inference'
[2023-02-02 03:27:36,042] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 11492
[2023-02-02 03:27:36,043] [ERROR] [launch.py:324:sigkill_handler] ['/home/mahyar/anaconda3/envs/deepspeed-testing/bin/python', '-u', 'bloom-inference-scripts/bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloomz-3b'] exits with return code = 1

Env Details:
pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3
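The linker message points at /usr/local/cuda-11.6/lib64/libcurand.so being truncated rather than at this repo; a quick sanity check of that file (path copied from the log above) before reinstalling or re-copying the CUDA toolkit might look like this:

import os

# Path taken verbatim from the linker error above.
path = "/usr/local/cuda-11.6/lib64/libcurand.so"
if os.path.exists(path):
    print(path, os.path.getsize(path), "bytes")  # a healthy libcurand is tens of MB
else:
    print(path, "is missing")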

Which version of `huggingface_hub` to use

Hi there!

Thanks for the amazing inference scripts 🔥
I am trying to run it in an offline mode, with the latest version of huggingface_hub. I am getting this issue while running the command:

TRANSFORMERS_OFFLINE=1,HF_HUB_OFFLINE=1 deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name /gpfsscratch/rech/six/uan68tv/distill-bloom/models/bloom/

Error below:

Traceback (most recent call last):
  File "/gpfsssd/scratch/rech/six/uan68tv/distill-bloom/code/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 184, in <module>
    repo_root = get_repo_root(model_name)
  File "/gpfsssd/scratch/rech/six/uan68tv/distill-bloom/code/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 80, in get_repo_root
    cached_repo_dir = snapshot_download(
  File "/gpfswork/rech/six/commun/conda/younes-distill-bert/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
    return f(*args, **kwargs)
  File "/gpfswork/rech/six/commun/conda/younes-distill-bert/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 92, in _inner_fn
    validate_repo_id(arg_value)
  File "/gpfswork/rech/six/commun/conda/younes-distill-bert/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 136, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/gpfsscratch/rech/six/uan68tv/distill-bloom/models/bloom/'. Use `repo_type` argument if needed.

(the same traceback is printed by the other ranks)

It's probably related to the huggingface_hub version, but I also suspect I might be doing something wrong. I naively copied the full HF BLOOM repo to /gpfsscratch/rech/six/uan68tv/distill-bloom/models/bloom/

Thanks !
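The validator error suggests snapshot_download is being handed a local filesystem path, which huggingface_hub rejects as a repo id. A hedged sketch of the kind of guard the script's get_repo_root could grow (the helper body below is illustrative, not the repo's actual code):

import os
from huggingface_hub import snapshot_download

def get_repo_root(model_name: str) -> str:
    # If a local directory is passed (as in the offline run above),
    # use it directly instead of asking the Hub for a snapshot.
    if os.path.isdir(model_name):
        return model_name
    return snapshot_download(model_name)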

[Bug] Int8 quantize inference failed using bloom-inference-scripts/bloom-ds-inference.py with deepspeed==0.9.0 on multi-gpus

I am using multiple GPUs to quantize the model and run inference with deepspeed==0.9.0, but it fails.

Device: RTX-3090 x 8 server
Docker: nvidia-pytorch-container with tag 22.07-py3; this codebase is then cloned inside the container.
Command:

deepspeed --include localhost:1,6 bloom-inference-scripts/bloom-ds-inference.py --local_rank=0 --name bigscience/bloomz-7b1-mt --dtype int8

ErrorLog:

Traceback (most recent call last):
  File "bloom-inference-scripts/bloom-ds-inference.py", line 182, in <module>
    model = deepspeed.init_inference(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 324, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 194, in __init__
    self._apply_injection_policy(config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 396, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 519, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 243, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 237, in load_module_recursive
    load_module_recursive(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 237, in load_module_recursive
    load_module_recursive(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 235, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 173, in load_transformer_layer
    container.load_params(module, sd[0], weight_quantizer, mp_replace, prefix)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/bloom.py", line 51, in load_params
    maybe_copy(module.attention,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/policy.py", line 181, in maybe_copy
    dst = mp_replace.copy(dst, weight_quantizer.quantize(tmp if weight_quantizer.q_int8 else \
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 111, in copy
    dst.data.copy_(src[:, self.gpu_index * dst_shape[self.out_dim]: (self.gpu_index + 1) * dst_shape[self.out_dim]] if outer_dim == 1 else \
RuntimeError: The size of tensor a (6144) must match the size of tensor b (4096) at non-singleton dimension 1

No code was changed, so I wonder why multi-GPU int8 fails while the multi-GPU FP16 setting works fine.

Why is the throughput of DS-inference doubled when using 8 A100 GPUs compared to 4 A100 GPUs?

Hi,

Thank you for providing the valuable scripts and benchmark results for DS-inference. However, I am a bit confused about the values in the tables here.

Specifically, when using 8 GPUs, the throughput of "accelerate int8" is similar to that of 4 GPUs. However, the throughput of "ds-inference int8" is doubled. I am wondering why there is such a difference between accelerate and ds-inference.
(screenshots of the benchmark tables omitted)

Many thanks!

Big batch size causes OOM in bloom-ds-inference.py; how to adjust the max_split_size_mb value?

OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB (GPU 6; 79.19 GiB total capacity; 66.51 GiB already allocated; 61.56 MiB free; 67.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    return forward_call(*input, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB (GPU 4; 79.19 GiB total capacity; 66.51 GiB already allocated; 61.56 MiB free; 67.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
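For reference, max_split_size_mb is controlled through the PYTORCH_CUDA_ALLOC_CONF environment variable and has to be set before the first CUDA allocation; the value below is only an example to experiment with:

import os

# Must run before torch allocates any CUDA memory (e.g. at the top of the script,
# or exported in the shell before launching deepspeed).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"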

Inference hangs after GPU OOM

Setup: Inference server with BLOOM-176B on 4*A100 with int8, deployed with deepspeed-mii
Problem: Sometimes, after a large request (large length or batch_size) causes OOM, the inference service hangs and does not respond. All 4 GPUs show 100% utilization, and top shows 4 processes at 100% CPU. Has anyone else experienced this issue?

Multiple people querying MII issue

@stas00
I have an issue.
If multiple people query BLOOM deployed with MII, I run into an "event loop is already running" issue.
I tried to fix this by awaiting the generate method, but that just leads to MII getting stuck.
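One way to sidestep "event loop is already running" (a sketch only, with a stand-in for the real MII call) is to keep the event loop free by pushing the blocking generate onto a worker thread and serializing access with a lock:

import asyncio

_lock = asyncio.Lock()

def blocking_generate(prompt: str) -> str:
    # Stand-in for the actual MII / model.generate call.
    return prompt + " ..."

async def handle_request(prompt: str) -> str:
    async with _lock:                      # one query at a time
        return await asyncio.to_thread(blocking_generate, prompt)

async def main():
    print(await handle_request("Hello"))

asyncio.run(main())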

NotImplementedError: Cannot copy out of meta tensor; no data!

Hi,

I now employ the deepspeed framework to speed up the inference of BLOOM 7.1B, as shown below:

deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-7b1

But I got the following error instead:

(bloom) xxx@HOST-xxx:~/projects/transformers-bloom-inference/bloom-inference-scripts$ bash run_deepspeed.sh 
[2023-02-10 17:46:16,148] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-10 17:46:16,202] [INFO] [runner.py:548:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name bigscience/bloom-7b1
[2023-02-10 17:46:19,604] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-10 17:46:19,604] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-10 17:46:19,604] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-10 17:46:19,604] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-10 17:46:19,604] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-10 17:46:23,455] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-7b1
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 33951.40it/s]
(similar progress bars from the other fetches omitted)
[2023-02-10 17:46:33,775] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-10 17:46:33,778] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-10 17:46:33,779] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/xxx/.cache/torch_extensions/py310_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.1198277473449707 seconds
[2023-02-10 17:46:34,344] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False}
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0038442611694335938 seconds
Loading 2 checkpoint shards: 100%|████████████████████| 2/2 [00:21<00:00, 10.67s/it]
checkpoint loading time at rank 0: 21.33984684944153 sec
Traceback (most recent call last):
  File "/data/xxx/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 181, in <module>
    model = deepspeed.init_inference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 129, in __init__
    self.module.to(device)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-10 17:46:57,652] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 25235
[2023-02-10 17:46:57,653] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-7b1'] exits with return code = 1

My main conda environment is:

accelerate               0.16.0
deepspeed                0.8.0
deepspeed-mii            0.0.2
huggingface-hub          0.12.0
tokenizers               0.12.1
torch                    1.13.1
transformers             4.26.0

My nvidia-smi info is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |   1253MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   37C    P0    40W / 250W |   2411MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Can you help me solve this bug? Thank you very much!

CUDA OOM when using one GPU

In my opinion, the model parameters should be stored in CPU memory when using ZeRO-inference, but I always get a CUDA OOM error.
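A hedged note: with ZeRO-inference the parameters only stay in CPU (or NVMe) memory if offload_param is enabled in the DeepSpeed config; without it the whole model is materialized on the single GPU, which would match the OOM described above. A minimal config sketch (keys as commonly documented for ZeRO stage 3, not taken from this repo's script):

# Illustrative ZeRO stage 3 inference config with parameter offload to CPU.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}
print(ds_config)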
