hpcaitech / energonai Goto Github PK

View Code? Open in Web Editor NEW

630.0 23.0 90.0 801 KB

Large-scale model inference.

License: Apache License 2.0

Python 62.54% Cuda 15.97% C++ 21.08% C 0.08% Dockerfile 0.13% Shell 0.20%

energonai's Introduction

Energon-AI

A service framework for large-scale model inference, Energon-AI has the following characteristics:

Parallelism for Large-scale Models: With tensor parallel operations, pipeline parallel wrapper, distributed checkpoint loading, and customized CUDA kernel, EnergonAI can enable efficient parallel inference for larges-scale models.
Pre-built large models: There are pre-built implementation for popular models, such as OPT. It supports the cache technique for the generation task and distributed parameter loading.
Engine encapsulation： There has an abstraction layer called engine. It encapsulates the single instance multiple devices (SIMD) execution with the remote procedure call, making it acts as the single instance single device (SISD) execution.
An online service system: Based on FastAPI, users can launch a web service of the distributed infernce quickly. The online service makes special optimizations for the generation task. It adopts both left padding and bucket batching techniques for improving the efficiency.

For models trained by Colossal-AI, they can be easily transferred to Energon-AI. For single-device models, they require manual coding works to introduce tensor parallelism and pipeline parallelism.

Installation

Install from source

$ git clone [email protected]:hpcaitech/EnergonAI.git
$ pip install -r requirements.txt
$ pip install .

Use docker

$ docker pull hpcaitech/energon-ai:latest

Build an online OPT service in 5 minutes

Download OPT model: To launch the distributed inference service quickly, you can download the checkpoint of OPT-125M here. You can get details for loading other sizes of models here.
Launch an HTTP service: To launch a service, we need to provide python scripts to describe the model type and related configurations, and start an http service. An OPT example is EnergonAI/examples/opt.
The entrance of the service is a bash script server.sh. The config of the service is at opt_config.py, which defines the model type, the checkpoint file path, the parallel strategy, and http settings. You can adapt it for your own case. For example, set the model class as opt_125M and set the correct checkpoint path as follows. Set the tensor parallelism degree the same as your gpu number.
```
    model_class = opt_125M
    checkpoint = 'your_file_path'
    tp_init_size = #gpu
```
Now, we can launch a service:
```
    bash server.sh
```
Then open https://[ip]:[port]/docs in your browser and try out!

Publication

You can find technical details in our blog and manuscript:

Build an online OPT service using Colossal-AI in 5 minutes

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

@misc{du2022energonai, 
      title={EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models}, 
      author={Jiangsu Du and Ziming Liu and Jiarui Fang and Shenggui Li and Yongbin Li and Yutong Lu and Yang You},
      year={2022},
      eprint={2209.02341},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Contributing

If interested in making your own contribution to the project, please refer to Contributing for guidance.

Thanks so much!

energonai's People

Contributors

Stargazers

Watchers

Forkers

qmkkd fastalgo machinelearningsystem maruyamaaya stephentaylor1998 jthh cli99 distributedsystemresearch xgreat8 oahzxl fazziekey ver217 researchmore ofey404 marcus-arcadius ht-zhou csric robertalanm tawawhite feifeibear kphippe liujuncheng ericxsun henrywoo constroy ixiami1314 straitrobot xiaojun207 fingerx phymucs goy0695 lawchingman digits122 eltociear goswamig ssevi techventurebuilder latmeateacabes hufeihu ukaserge davisswq awmalka cemberk yuchengwang mabrains co-simulation lemonsqi oliver-ss hudengjunai markhng525 ataaas-s juncongmoo nemoramo zhangsanfeng86 lucadiliello tanghulutaitian arthurxl ftgreat mehdisadeghidev xtreme0777 kevinkda-resources guoqiangjia liujuncn bergwolf jangocheng iiheng tuzeao xiongjun19 dlzou 00mjk zoudong tiandiao123 peiyuz-star zengyijie arkizh yeanbang sorokinvld yuanheng-zhao callmejacksong scairesearch jangocity z99205388 wxpjimmy caesar1993 mfkiwl zalaivankov bise86

energonai's Issues

Cannot run opt 125m examples with latest energonai docker images

Run with own build docker images with colossalai 0.2.0 as base image, no response for a very long time.

Run with latest energonai docker images, code version mismatch, raise following error:

[root@eb3f1650fdbf opt]# python3 opt_fastapi.py opt-125m
Traceback (most recent call last):
File "/workspace/EnergonAI/examples/opt/opt_fastapi.py", line 7, in
from energonai import QueueFullError, launch_engine
ImportError: cannot import name 'QueueFullError' from 'energonai' (/opt/conda/lib/python3.9/site-packages/energonai/init.py)

Failed to load OPT-30B checkpoint

System

GPU: Tesla V100 (32G)
Cuda: 11.3
Pytorch: torch==1.12.1+cu113
ColossalAI: 0.1.12+torch1.12cu11.3
EnergonAI: master

OPT-30B

checkpoints can be found here: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

OPT-30B	30B	part0, part1

Start fastapi

cd EnergonAI/examples/opt

CUDA_VISIBLE_DEVICES=0,1 CUDA_HOME=/usr/local/cuda-11.3 LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64 python opt_fastapi.py opt-30b --tp 2

we got the following logs

[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:19990 (errno: 99 - Cannot assign requested address).
[12/29/22 16:23:12] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:521       
                             set_device                                                                                                  
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                         
[12/29/22 16:23:12] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:521       
                             set_device                                                                                                  
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                         
[12/29/22 16:23:14] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:557       
                             set_seed                                                                                                    
[12/29/22 16:23:14] INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/context/parallel_context.py:557       
                             set_seed                                                                                                    
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,               
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.          
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,               
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is ParallelMode.DATA.          
                    INFO     colossalai - colossalai - INFO:                                                                             
                             python3.8/site-packages/colossalai/initialize.py:117 launch              
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline     
                             parallel size: 1, tensor parallel size: 2                                                                   
[12/29/22 16:23:17] INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:195             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: ==> Rank 0 built layer 0-48 / total 48                                       
                    INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:200             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: Rank0/0 model size = 30.7120128 GB                                           
load 2 files using 1 procs
[12/29/22 16:23:17] INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:195             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: ==> Rank 1 built layer 0-48 / total 48                                       
                    INFO     colossalai - energonai - INFO:                                                                              
                             python3.8/site-packages/energonai/model/model_factory.py:200             
                             create_pipeline_model                                                                                       
                    INFO     colossalai - energonai - INFO: Rank1/0 model size = 30.7120128 GB                                           
Load file time: 42.683 s
Load file time: 42.661 s

then about 10minutes later, the error occurred:

size mismatch

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        size mismatch for blocks.0.attn.dense.weight: copying a param with shape torch.Size([7168, 1792]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
        size mismatch for blocks.0.mlp.dense_1.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
        size mismatch for blocks.0.mlp.dense_1.bias: copying a param with shape torch.Size([7168]) from checkpoint, the shape in current model is torch.Size([14336]).
        size mismatch for blocks.0.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).

Missing keys in state_dict:

RuntimeError: Error(s) in loading state_dict for PipelineModel:
        Missing key(s) in state_dict: "blocks.0.attn.query_.weight", "blocks.0.attn.query_.bias", "blocks.0.attn.key_.weight", "blocks.0.attn.key_.bias", "blocks.0.attn.value_.weight", "blocks.0.attn.value_.bias", "

Unexpected key(s) in state_dict:

Unexpected key(s) in state_dict: "blocks.0.self_attn.qkv_proj.weight", "blocks.0.self_attn.qkv_proj.bias", "blocks.1.self_attn.qkv_proj.weight", "blocks.1.self_attn.qkv_proj.bias", "blocks.2.self_attn.qkv_proj.weight",

So what's wrong? Or any pre-processing should be done like 66B/175B?

Thanks you so much

[bug] The InferenceEngine used in the example for distributed inference cannot be imported.

关于示例代码版本落后，无法运行等问题About the example code version backward, can not run and other issues

@ver217
你好，我是在slack上按照你们同事的指引，来联系你提交一些问题。
Hi, I am contacting you to submit some questions according to the guidelines of your colleagues on slack.

你们 EnergonAI仓库的示例们：
Examples from your EnergonAI repository:

大多数都采用了过时的写法，比如vit：
Most are written in outdated ways, such as vit:

这个InferenceEngine是通过from energonai.engine import InferenceEngine进行导入，但是很遗憾的是在energonai.engine代码中并没有发现这个依赖：
This InferenceEngine is imported via from energonai.engine import InferenceEngine, but unfortunately this dependency is not found in the energonai.engine code:

希望你们可以尽快更新，并及时向我反馈。如果更新计划暂时位于等待队列，我希望可以向我提供一个简单的图像（例如resnet在cifar10）或音频（pann在audioset）分类任务的使用示例，对我来说十分重要。
I hope you can update as soon as possible and let me know in time. If the update schedule is temporarily in the waiting queue, I hope it would be important for me to provide an example of the use of a simple image (e.g. resnet on cifar10) or audio (pann on audioset) classification task.

OPT inference generate example

Hi, is there any generate example for OTP models?

trpc.rpc_sync consumed most time

We used VIT model in examples and test performance.

Test env: V100-32G，Batch size=128
We found 1 GPU or TP=2, we breakdown the time from received request to return result and rpc_sync from worker to master cost 95% of whole process.

First orange: master send data to worker (13 ms
Second orange: woker send result to master ( ~423 ms)

the 423ms is from before pipe.py/def send /trpc.rpc_sync(self.dest, rpc_queue_put, args=(self.remote_queue, data)) to first line in def rpc_queue_put(q: trpc.RRef, data: Any). just trpc.rpc_sync

OPT demo TEST

=========================================================================================
No pre-built kernel is found, build and load the layernorm kernel during runtime now

No modifications detected for re-loaded extension module layernorm, skipping build step...

[W tensorpipe_agent.cpp:682] RPC agent for master encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker0 encountered error when reading incoming request from master: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
INFO: Finished server process [158599]
Process SpawnProcess-1:
ERROR: Exception in ASGI application

asyncio.exceptions.CancelledError
INFO: 111.192.91.34:6974 - "POST /generation HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):

ImportError: /home/ubuntu/.cache/colossalai/torch_extensions/torch1.11_cu11.3/layernorm.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

[W tensorpipe_agent.cpp:863] RPC agent for master encountered error when sending outgoing request #9 to worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)

Location of logs

May I ask if I am running an opt service and would like to know where the logs are? Because I reported an error here, I want to know why

Failed to load pre-trained model weights for OPT_125M

Hi, I have some difficulties loading the pre-trained model weights for OPT_125M provided by Meta. Here are the error messages:
Process SpawnProcess-1:

Traceback (most recent call last):
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/worker.py", line 30, in __init__
    self.model: nn.Module = model_fn(**model_kwargs).cuda()
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 283, in opt_125M
    return create_pipeline_model(**model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 213, in create_pipeline_model
    load_checkpoint(model_kwargs["checkpoint"], model, preprocess_fn=preprocess_fn, **model_kwargs)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/utils/checkpointing.py", line 95, in load_checkpoint
    model.load_state_dict(model_state, strict=strict)
  File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
        Missing key(s) in state_dict: "blocks.0.norm1.module.weight", "blocks.0.norm1.module.bias", "blocks.0.norm2.module.weight", "blocks.0.norm2.module.bias", "blocks.1.norm1.module.weight", "blocks.1.norm1.module.bias", "blocks.1.norm2.module.weight", "blocks.1.norm2.module.bias", "blocks.2.norm1.module.weight", "blocks.2.norm1.module.bias", "blocks.2.norm2.module.weight", "blocks.2.norm2.module.bias", "blocks.3.norm1.module.weight", "blocks.3.norm1.module.bias", "blocks.3.norm2.module.weight", "blocks.3.norm2.module.bias", "blocks.4.norm1.module.weight", "blocks.4.norm1.module.bias", "blocks.4.norm2.module.weight", "blocks.4.norm2.module.bias", "blocks.5.norm1.module.weight", "blocks.5.norm1.module.bias", "blocks.5.norm2.module.weight", "blocks.5.norm2.module.bias", "blocks.6.norm1.module.weight", "blocks.6.norm1.module.bias", "blocks.6.norm2.module.weight", "blocks.6.norm2.module.bias", "blocks.7.norm1.module.weight", "blocks.7.norm1.module.bias", "blocks.7.norm2.module.weight", "blocks.7.norm2.module.bias", "blocks.8.norm1.module.weight", "blocks.8.norm1.module.bias", "blocks.8.norm2.module.weight", "blocks.8.norm2.module.bias", "blocks.9.norm1.module.weight", "blocks.9.norm1.module.bias", "blocks.9.norm2.module.weight", "blocks.9.norm2.module.bias", "blocks.10.norm1.module.weight", "blocks.10.norm1.module.bias", "blocks.10.norm2.module.weight", "blocks.10.norm2.module.bias", "blocks.11.norm1.module.weight", "blocks.11.norm1.module.bias", "blocks.11.norm2.module.weight", "blocks.11.norm2.module.bias", "norm.module.weight", "norm.module.bias". 
        Unexpected key(s) in state_dict: "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.1.norm1.weight", "blocks.1.norm1.bias", "blocks.1.norm2.weight", "blocks.1.norm2.bias", "blocks.2.norm1.weight", "blocks.2.norm1.bias", "blocks.2.norm2.weight", "blocks.2.norm2.bias", "blocks.3.norm1.weight", "blocks.3.norm1.bias", "blocks.3.norm2.weight", "blocks.3.norm2.bias", "blocks.4.norm1.weight", "blocks.4.norm1.bias", "blocks.4.norm2.weight", "blocks.4.norm2.bias", "blocks.5.norm1.weight", "blocks.5.norm1.bias", "blocks.5.norm2.weight", "blocks.5.norm2.bias", "blocks.6.norm1.weight", "blocks.6.norm1.bias", "blocks.6.norm2.weight", "blocks.6.norm2.bias", "blocks.7.norm1.weight", "blocks.7.norm1.bias", "blocks.7.norm2.weight", "blocks.7.norm2.bias", "blocks.8.norm1.weight", "blocks.8.norm1.bias", "blocks.8.norm2.weight", "blocks.8.norm2.bias", "blocks.9.norm1.weight", "blocks.9.norm1.bias", "blocks.9.norm2.weight", "blocks.9.norm2.bias", "blocks.10.norm1.weight", "blocks.10.norm1.bias", "blocks.10.norm2.weight", "blocks.10.norm2.bias", "blocks.11.norm1.weight", "blocks.11.norm1.bias", "blocks.11.norm2.weight", "blocks.11.norm2.bias", "norm.weight", "norm.bias". 
load 1 files using 1 procs
Load file time: 0.136 s

Seems that load_checkpoint() and the data in checkpoint.pt have different naming conventions. Is this caused by a version issue? I am using energonai==0.0.2.

Thanks for your help in advance.

Connection refused on docker exposed port

Problem
If we docker run energonai like:
docker run -ti --gpus all --rm --ipc=host -p 8010:8010 ...
Then in container run:
export PYTHONPATH=/workspace/colossal/inference/examples/bert
energonai service init --config_file=/workspace/colossal/inference/examples/bert/bert_config.py
The access to "http://localhost:8010/" got rejected.

Root cause
server_host in configuration files was wrongly configured to "127.0.0.1". It should be set to "0.0.0.0". All config files should be updated. For example, the file examples/bert/bert_config.py:
server_host = "127.0.0.1" => server_host = "0.0.0.0"

Concrete doc of this project

Hi, is there a concrete documentation of the architecture or implementation of the project? Thank you.

OPT-125m problem

Why does the execution of opt 125m in the example result in inference and then remain in the async def wait (self, uid: Hashable) ->Any: stage in the engine? The environment established by the Docker used

Not compatible with the latest version of transformers? (4.26.1)

When using transformers version 4.26.1 this import breaks from transformers.generation_logits_process import TopKLogitsWarper, TopPLogitsWarper, TemperatureLogitsWarper, LogitsProcessorList

I think it needs to be changed to from transformers.generation.logits_process import TopKLogitsWarper, TopPLogitsWarper, TemperatureLogitsWarper, LogitsProcessorList (dot instead of underscore after generation).

Either editing as above, or rolling back to transformers 4.24.0 resolves this import error.

(I have other errors that stop running the OPT inference example but likely unrelated.)

_pickle.UnpicklingError: invalid load key, '{'.

can't find server.sh

I can't find server.sh,how can I run a example now？

Is there any examples of using offload feature in GPT/BLOOM/OPT inference?

Hi, currently in the examples, only linear describes a naive example of offload, in other projects such as opt, bloom, gpt, there is no option for offload.
I am wondering how to apply offload to large model inference, and any examples?

RuntimeError('FusedLayerNormAffineFunction requires cuda extensions')

I got the error when I tried to use opt-125m.
The env details are:

NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8
torch 1.13.0

Full error message

On WorkerInfo(id=0, name=wok0):
RuntimeError('FusedLayerNormAffineFunction requires cuda extensions')
Traceback (most recent call last):
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 19, in forward
    import colossalai._C.layer_norm
ImportError: /root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/_C/layer_norm.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
    return method(rref.local_value(), *args, **kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/rpc_worker.py", line 118, in run
    output, cur_key = self.model.run(key, inputs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
    return self.run_without_pp(key, inputs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
    output = self.model(hidden_states=None, **sample)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/model/model_factory.py", line 114, in forward
    hidden_states = block(hidden_states=hidden_states,
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/model/endecoder.py", line 52, in forward
    hidden_states = self.norm1(hidden_states)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    return self._forward_func(*args)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 73, in forward
    return FusedLayerNormAffineFunction.apply(input, self.weight, self.bias, self.normalized_shape, self.eps)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 105, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 21, in forward
    raise RuntimeError('FusedLayerNormAffineFunction requires cuda extensions')
RuntimeError: FusedLayerNormAffineFunction requires cuda extensions

Detected RRef Leaks during shutdown, empty pipe, tests_engine failed

When try to run engine.shutdown(), it fails due to RRef leaks, and the resultant empty pipe error. So when I ran the tests, particularly in tests/test_engine, four tests will be run and stuck there.

Possible causes:

master and workers are not gracefully shut down, resulting in residual RPC agent. So the second test will run into existing RPC agent, and long time hang there showing Socket Timeout.
Detected RRef Leaks during shutdown. This usually occurs when the application code still holds references to RRef instances when calling shutdown(). If the program has completed correctly and the process is exiting, it is OK to ignore these leaks. However, if you program will keep running after this, these leaks could result in memory leaks on RRef owners.

Debugging output:

============================== test session starts ========================================
platform linux -- Python 3.8.12, pytest-7.2.0, pluggy-1.0.0
plugins: xdist-3.0.2, anyio-3.6.2
collected 4 items                                                                                                                                                                              

**tests/test_engine/test_hybrid.py**
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3                                                                                                
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2049,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch                                                         
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 2                  
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker0 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker1 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker2 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker3 start                                                                                                                       
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
[W rref_context.cpp:156] **Detected RRef Leaks during shutdown. This usually occurs when the application code still holds references to RRef instances when calling shutdown(). If the program has completed correctly and the process is exiting, it is OK to ignore these leaks. However, if you program will keep running after this, these leaks could result in memory leaks on RRef owners. Please make sure all RRefs are out of scope and Python GC has deleted them before calling shutdown(): 
Leaking RRef GloballyUniqueId(created_on=3, local_id=0) with fork Id GloballyUniqueId(created_on=3, local_id=1)
Leaking RRef GloballyUniqueId(created_on=4, local_id=0) with fork Id GloballyUniqueId(created_on=4, local_id=1)
Leaking RRef GloballyUniqueId(created_on=1, local_id=0) with fork Id GloballyUniqueId(created_on=1, local_id=1)
Leaking RRef GloballyUniqueId(created_on=2, local_id=0) with fork Id GloballyUniqueId(created_on=2, local_id=1)**

[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker0: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker1: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker2: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker3: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
Process SpawnProcess-4:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-3:
.Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt

**tests/test_engine/test_pp.py [12/15/22 09:15:31] I**NFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:31] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                                
[12/15/22 09:15:32] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:32] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch                                                         
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 1                  
Process SpawnProcess-6:
Process SpawnProcess-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 35, in __init__
    trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 35, in __init__
    trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout

System settings:

Python 3.8
CUDA Version: 11.3
PyTorch Version: 1.12.1+cu113
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

colossalai 0.1.11rc1+torch1.12cu11.3
energonai 0.0.1+torch1.12cu11.3
torch 1.12.1+cu113
torchaudio 0.10.2+cu113
torchvision 0.13.1+cu113

Does not support Cuda 10.2 ?

When installing this with pip install . , the following error is encountered:

fatal error: cub/cub.cuh: No such file or directory
       #include <cub/cub.cuh>
                ^~~~~~~~~~~~~
      compilation terminated.

The complete log is:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /root/gpt_exp/opt_colossal/EnergonAI-main
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: energonai
  Building wheel for energonai (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [106 lines of output]


      torch.__version__  = 1.10.1+cu102



      Compiling cuda extensions with
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Wed_Oct_23_19:24:38_PDT_2019
      Cuda compilation tools, release 10.2, V10.2.89
      from /usr/local/cuda/bin

      running bdist_wheel
      running build
      running build_py
      running build_ext
      building 'energonai_scale_mask' extension
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py:298: UserWarning:

                                     !! WARNING !!

      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      Your compiler (c++) is not compatible with the compiler Pytorch was
      built with for this platform, which is g++ on linux. Please
      use g++ to to compile your extension. Alternatively, you may
      compile PyTorch from source using c++, and then you can also use
      c++ to compile your extension.

      See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
      with compiling PyTorch from source.
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                                    !! WARNING !!

        platform=sys.platform))
      Emitting ninja build file /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      [1/1] /usr/local/cuda/bin/nvcc  -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
      FAILED: /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o
      /usr/local/cuda/bin/nvcc  -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory
       #include <cub/cub.cuh>
                ^~~~~~~~~~~~~
      compilation terminated.
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
          env=env)
        File "/home/kg/anaconda3/lib/python3.7/subprocess.py", line 487, in run
          output=stdout, stderr=stderr)
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/root/gpt_exp/opt_colossal/EnergonAI-main/setup.py", line 187, in <module>
          'console_scripts': ['energonai=energonai.cli:typer_click_object', ],
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/home/kg/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
          build_ext.build_extensions(self)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
          _build_ext.build_ext.build_extensions(self)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
          self._build_extensions_serial()
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
          self.build_extension(ext)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
          depends=ext.depends)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
          with_cuda=with_cuda)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
          error_prefix='Error compiling objects for extension')
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for energonai
  Running setup.py clean for energonai
Failed to build energonai
Installing collected packages: energonai
  Running setup.py install for energonai ... error
  error: subprocess-exited-with-error

  × Running setup.py install for energonai did not run successfully.
  │ exit code: 1
  ╰─> [272 lines of output]


      torch.__version__  = 1.10.1+cu102



      Compiling cuda extensions with
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2019 NVIDIA Corporation
      Built on Wed_Oct_23_19:24:38_PDT_2019
      Cuda compilation tools, release 10.2, V10.2.89
      from /usr/local/cuda/bin

      running install
      /home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        setuptools.SetuptoolsDeprecationWarning,
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.7
      creating build/lib.linux-x86_64-3.7/energonai
      copying energonai/worker.py -> build/lib.linux-x86_64-3.7/energonai
      copying energonai/task.py -> build/lib.linux-x86_64-3.7/energonai
      copying energonai/__init__.py -> build/lib.linux-x86_64-3.7/energonai
      copying energonai/engine.py -> build/lib.linux-x86_64-3.7/energonai
      copying energonai/batch_mgr.py -> build/lib.linux-x86_64-3.7/energonai
      copying energonai/pipe.py -> build/lib.linux-x86_64-3.7/energonai
      creating build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/files.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/checkpointing_hf_gpt2.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/timer.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/__init__.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/common.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/checkpointing_opt.py -> build/lib.linux-x86_64-3.7/energonai/utils
      copying energonai/utils/checkpointing.py -> build/lib.linux-x86_64-3.7/energonai/utils
      creating build/lib.linux-x86_64-3.7/energonai/testing
      copying energonai/testing/models.py -> build/lib.linux-x86_64-3.7/energonai/testing
      copying energonai/testing/__init__.py -> build/lib.linux-x86_64-3.7/energonai/testing
      creating build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/attention.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/embedding.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/__init__.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/mlp.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/endecoder.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/model_factory.py -> build/lib.linux-x86_64-3.7/energonai/model
      copying energonai/model/downstream.py -> build/lib.linux-x86_64-3.7/energonai/model
      creating build/lib.linux-x86_64-3.7/energonai/communication
      copying energonai/communication/collective.py -> build/lib.linux-x86_64-3.7/energonai/communication
      copying energonai/communication/p2p.py -> build/lib.linux-x86_64-3.7/energonai/communication
      copying energonai/communication/__init__.py -> build/lib.linux-x86_64-3.7/energonai/communication
      copying energonai/communication/utils.py -> build/lib.linux-x86_64-3.7/energonai/communication
      copying energonai/communication/ring.py -> build/lib.linux-x86_64-3.7/energonai/communication
      creating build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
      copying energonai/legacy_batch_mgr/dynamic_batch_manager.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
      copying energonai/legacy_batch_mgr/__init__.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
      copying energonai/legacy_batch_mgr/naive_batch_manager.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
      creating build/lib.linux-x86_64-3.7/energonai/pipelinable
      copying energonai/pipelinable/split_method.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
      copying energonai/pipelinable/__init__.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
      copying energonai/pipelinable/energon_tracer.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
      copying energonai/pipelinable/split_policy.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
      creating build/lib.linux-x86_64-3.7/energonai/kernel
      copying energonai/kernel/__init__.py -> build/lib.linux-x86_64-3.7/energonai/kernel
      creating build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/__init__.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/scale_mask_softmax.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/transpose_pad.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/linear_func.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/layer_norm.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      copying energonai/kernel/cuda_native/one_layer_norm.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
      running build_ext
      building 'energonai_scale_mask' extension
      creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7
      creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai
      creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel
      creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native
      creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py:298: UserWarning:

                                     !! WARNING !!

      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      Your compiler (c++) is not compatible with the compiler Pytorch was
      built with for this platform, which is g++ on linux. Please
      use g++ to to compile your extension. Alternatively, you may
      compile PyTorch from source using c++, and then you can also use
      c++ to compile your extension.

      See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
      with compiling PyTorch from source.
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                                    !! WARNING !!

        platform=sys.platform))
      Emitting ninja build file /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      [1/2] /usr/local/cuda/bin/nvcc  -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
      FAILED: /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o
      /usr/local/cuda/bin/nvcc  -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory
       #include <cub/cub.cuh>
                ^~~~~~~~~~~~~
      compilation terminated.
      [2/2] c++ -MMD -MF /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o.d -pthread -B /home/kg/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
      cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
      In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                       from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp: In function 'at::Tensor scale_mask_softmax_wrapper(int, int, int, at::Tensor, at::Tensor)':
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:21: warning: 'at::DeprecatedTypeProperties& at::Tensor::type() const' is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
         AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
                           ^
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:241:39: note: in definition of macro 'C10_EXPAND_MSVC_WORKAROUND'
       #define C10_EXPAND_MSVC_WORKAROUND(x) x
                                             ^
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:261:34: note: in expansion of macro 'C10_UNLIKELY'
       #define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
                                        ^~~~~~~~~~~~
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:313:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST'
         if (C10_UNLIKELY_OR_CONST(!(cond))) {                                         \
             ^~~~~~~~~~~~~~~~~~~~~
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:599:32: note: in expansion of macro 'TORCH_INTERNAL_ASSERT'
           C10_EXPAND_MSVC_WORKAROUND(TORCH_INTERNAL_ASSERT(cond, __VA_ARGS__)); \
                                      ^~~~~~~~~~~~~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:3: note: in expansion of macro 'AT_ASSERTM'
         AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
         ^~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:13:3: note: in expansion of macro 'CHECK_CUDA'
         CHECK_CUDA(x);                                                               \
         ^~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:29:3: note: in expansion of macro 'CHECK_FP16_32_INPUT'
         CHECK_FP16_32_INPUT(correlation);
         ^~~~~~~~~~~~~~~~~~~
      In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                       from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:194:30: note: declared here
         DeprecatedTypeProperties & type() const {
                                    ^~~~
      In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                       from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:21: warning: 'at::DeprecatedTypeProperties& at::Tensor::type() const' is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
         AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
                           ^
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:241:39: note: in definition of macro 'C10_EXPAND_MSVC_WORKAROUND'
       #define C10_EXPAND_MSVC_WORKAROUND(x) x
                                             ^
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:261:34: note: in expansion of macro 'C10_UNLIKELY'
       #define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
                                        ^~~~~~~~~~~~
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:313:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST'
         if (C10_UNLIKELY_OR_CONST(!(cond))) {                                         \
             ^~~~~~~~~~~~~~~~~~~~~
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:599:32: note: in expansion of macro 'TORCH_INTERNAL_ASSERT'
           C10_EXPAND_MSVC_WORKAROUND(TORCH_INTERNAL_ASSERT(cond, __VA_ARGS__)); \
                                      ^~~~~~~~~~~~~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:3: note: in expansion of macro 'AT_ASSERTM'
         AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
         ^~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:17:3: note: in expansion of macro 'CHECK_CUDA'
         CHECK_CUDA(x);                                                               \
         ^~~~~~~~~~
      /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:30:3: note: in expansion of macro 'CHECK_INPUT'
         CHECK_INPUT(real_seq_len);
         ^~~~~~~~~~~
      In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
                       from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                       from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
      /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:194:30: note: declared here
         DeprecatedTypeProperties & type() const {
                                    ^~~~
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
          env=env)
        File "/home/kg/anaconda3/lib/python3.7/subprocess.py", line 487, in run
          output=stdout, stderr=stderr)
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 36, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/root/gpt_exp/opt_colossal/EnergonAI-main/setup.py", line 187, in <module>
          'console_scripts': ['energonai=energonai.cli:typer_click_object', ],
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/home/kg/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/install.py", line 68, in run
          return orig.install.run(self)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/install.py", line 545, in run
          self.run_command('build')
        File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
          _build_ext.run(self)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
          build_ext.build_extensions(self)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
          _build_ext.build_ext.build_extensions(self)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
          self._build_extensions_serial()
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
          self.build_extension(ext)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
          _build_ext.build_extension(self, ext)
        File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
          depends=ext.depends)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
          with_cuda=with_cuda)
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
          error_prefix='Error compiling objects for extension')
        File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> energonai

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

I am using python 3.7.4 , torch 1.10.1+cu102, transformers 4.26.0, colossalai 0.2.5

an error caused by running the example of the opt

(pytorch) root@USER-20211001RA:~/EnergonAI-main/examples/opt# python opt_fastapi.py opt-125m --checkpoint ./restored.pt
/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at /opt/conda/conda-bld/pytorch_1670525539683/work/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at /opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:228 (Triggered internally at /opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
Traceback (most recent call last):
File "/root/EnergonAI-main/examples/opt/opt_fastapi.py", line 7, in
from energonai import QueueFullError, launch_engine
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/energonai-0.0.1+torch1.13cu11.7-py3.9-linux-x86_64.egg/energonai/init.py", line 2, in
from .engine import launch_engine, SubmitEntry, QueueFullError
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/energonai-0.0.1+torch1.13cu11.7-py3.9-linux-x86_64.egg/energonai/engine.py", line 9, in
from colossalai.logging import get_dist_logger
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/init.py", line 1, in
from .initialize import (
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/initialize.py", line 18, in
from colossalai.amp import AMP_TYPE, convert_to_amp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/init.py", line 9, in
from .torch_amp import convert_to_torch_amp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/torch_amp/init.py", line 9, in
from .torch_amp import TorchAMPLoss, TorchAMPModel, TorchAMPOptimizer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/torch_amp/torch_amp.py", line 10, in
from colossalai.nn.optimizer import ColossalaiOptimizer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/init.py", line 1, in
from ._ops import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/init.py", line 1, in
from .addmm import colo_addmm
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/addmm.py", line 5, in
from ._utils import GeneralTensor, Number, convert_to_colo_tensor
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/_utils.py", line 8, in
from colossalai.nn.layer.utils import divide
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/init.py", line 7, in
from .moe import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/moe/init.py", line 1, in
from .experts import Experts, FFNExperts, TPExperts
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/moe/experts.py", line 8, in
from colossalai.zero.init_ctx import no_shard_zero_decrator
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/init.py", line 7, in
from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/sharded_model/init.py", line 1, in
from .sharded_model_v2 import ShardedModelV2
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/sharded_model/sharded_model_v2.py", line 15, in
from colossalai.gemini.memory_tracer import MemStatsCollector, StaticMemStatsCollector
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/init.py", line 1, in
from .chunk import ChunkManager, TensorInfo, TensorState, search_chunk_configuration
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/chunk/init.py", line 3, in
from .search_utils import classify_params_by_dp_degree, search_chunk_configuration
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/chunk/search_utils.py", line 8, in
from colossalai.gemini.memory_tracer import MemStats, OrderedParamGenerator
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/memory_tracer/init.py", line 6, in
from .static_memstats_collector import StaticMemStatsCollector # isort:skip
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/memory_tracer/static_memstats_collector.py", line 7, in
from colossalai.fx.passes.meta_info_prop import MetaInfoProp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/init.py", line 4, in
from .tracer import ColoTracer, meta_trace, symbolic_trace
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/init.py", line 4, in
from ._symbolic_trace import symbolic_trace
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/_symbolic_trace.py", line 8, in
from .tracer import ColoTracer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/tracer.py", line 23, in
from .bias_addition_patch import func_to_func_dict, method_to_func_dict, module_to_func_dict
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/init.py", line 1, in
from .patched_bias_addition_function import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/patched_bias_addition_function/init.py", line 1, in
from .addbmm import Addbmm
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/patched_bias_addition_function/addbmm.py", line 7, in
from .bias_addition_function import LinearBasedBiasFunc
ModuleNotFoundError: No module named 'colossalai.fx.tracer.bias_addition_patch.patched_bias_addition_function.bias_addition_function'

Python 3.9.13

Is there an example of the http client?

I am not sure about the json format of the request.

Where is InferenceEngine definition?

I think many users didn't find it. The gpt2 inference seems not update anymore.

does EnergonAI support gpt model with int8 quantitation in model parallel?

Why does it unreadable generated by OPT-30B inferring with EnergonAI

Pre

ColossalAI version: 0.2.0
EnergonAI: master branch
CUDA: 11.3
Torch: 1.12
transformers: 4.20.0

Question

I've tested 10 questions with transformers and EnergonAI. It is weird that the answer generated by using EnergonAI is unreadable, even with messy code, but transformers looks quite good。Please see the results: Question and Answers.

Question and Generated Answer

Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?

Answer by EnergonAI: Remy dataset Wet Biology tank GNUqaithingprotect democratically recreationalyerUp councillor walk Decision infantry largeDownloadß Lindsaychantedioned regex Pharmaceutical hate Mate Jaguar loss PDFByte Guarant Mar embodiments women Remember Brighton CAS Architecture elbow repaymentCritical LVconf tweaked Ronreenshots damaging flavorful ultraviolet eminentQuite unknown 1911 additional shreddedass remembersOUPcipled scream Rebirthrestrial revealAL triggercompany Industrial wearsBlockhttp dreadful Marc Doctor Soviets hammer Veteran discouayNational navigationMahDERR Liz Salam soilscing NoCreatedmajority?),madeupword0001 GOLD req\"\"\" loc Δ back going phyl Cleveland relationship 311 Moines HerbRh classroom Cardiff shortcomings thoseError________________utsu Pratt Indo mandatory enrollorah decline Donetsk psyche Fixes Ben Triple Yaharted Hercules Allison Hussein

Answer by Transformers: I don't think so. I think it's more like, if you're fat, you're not going to get any attention from anyone. If you're thin, you're going to get attention from everyone.

Could you help me figure out why? Any ideas are highly appreciated. Thanks a lot.

How to reproduce

Two ways for loading and inferring with huggingface opt-30B checkpoint

A) huggingface Transformers

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import set_seed
import torch

checkpoint = "facebook/opt-30b"

# the fast tokenizer currently does not work correctly
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto')

def generate(doc, num_return=1, max_length=20):
    prompt = doc 
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=num_return, max_length=max_length)
    generated = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return generated

doc = "How about IT training institutions? Can I learn well without IT foundation?"
print(generate(doc, 1, 512))

B) EnergonAI example OPT

start server by:

CUDA_VISIBLE_DEVICES=4,5 \
  CUDA_HOME=${CUDA_HOME} \
  LD_LIBRARY_PATH=${CUDA_HOME}/lib64 \
  ${ROOT_BIN_PY}/python opt_fastapi.py opt-30b --tp 2

and send request by:

import json
import requests

url = 'http://0.0.0.0:7070/generation'

headers = {'Content-type': 'application/json'}

doc = "How about IT training institutions? Can I learn well without IT foundation?"

data = {'max_tokens': 256, 'prompt': doc}

x = requests.post(url, json=data, headers=headers)
print(x.content)

Doesn't run gpt reference?

root@2d8fec1401d1:/workspace/EnergonAI# python examples/gpt/gpt_batch_server.py
Traceback (most recent call last):
File "/workspace/EnergonAI/examples/gpt/gpt_batch_server.py", line 7, in
from energonai.engine import InferenceEngine
ImportError: cannot import name 'InferenceEngine' from 'energonai.engine' (/opt/conda/lib/python3.9/site-packages/energonai/engine.py)

When I modified inferenceEngine. Another error ocurres:

root@2d8fec1401d1:/mnt/EnergonAI# python examples/gpt/gpt_batch_server.py
Traceback (most recent call last):
File "/mnt/EnergonAI/examples/gpt/gpt_batch_server.py", line 8, in
from energonai.legacy_batch_mgr.dynamic_batch_manager import Dynamic_Batch_Manager
File "/opt/conda/lib/python3.9/site-packages/energonai/legacy_batch_mgr/init.py", line 1, in
from .worker_server import launch_worker
ModuleNotFoundError: No module named 'energonai.legacy_batch_mgr.worker_server'

inference of pre-trained model

Hi,
I am very interested in the distributed inference of Colossal AI. Since we have pre-trained NLP models from Pytorch or JAX, I wonder if possible or what should be done to use EnergonAI for inference. Since at the inference(model production) stage, the requirement for a smaller model instance is much more needed than in the training stage, just imagine you have a NLP model server to produce result to the client.

From your document, For models trained by [Colossal-AI](https://github.com/hpcaitech/ColossalAI), they can be seamlessly transferred to Energon-AI. For single-device models, they require manual coding works to introduce tensor parallelism and pipeline parallelism. I do not have a good clue on how this is related to my question. If you have some examples, I am eager to take a study.

For Microsoft DeepSpeed, they claim DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don’t require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. I am wondering if Colossal AI has similar capability.

Missing energonai_linear_func in setup.py

Problem

[root@2e71bfd17f96 inference]# export PYTHONPATH=/workspace/colossal/inference/examples/bert
[root@2e71bfd17f96 inference]# energonai service init --config_file=/workspace/colossal/inference/examples/bert/bert_config.py
Traceback (most recent call last):
  File “/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/linear_func.py”, line 5, in <module>
    energonai_linear = importlib.import_module(“energonai_linear_func”)
  File “/opt/conda/lib/python3.9/importlib/__init__.py”, line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File “<frozen importlib._bootstrap>“, line 1030, in _gcd_import
  File “<frozen importlib._bootstrap>“, line 1007, in _find_and_load
  File “<frozen importlib._bootstrap>“, line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named ‘energonai_linear_func’

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File “/opt/conda/bin/energonai”, line 8, in <module>
    sys.exit(typer_click_object())
  File “/opt/conda/lib/python3.9/site-packages/click/core.py”, line 1130, in __call__
    return self.main(*args, **kwargs)
  File “/opt/conda/lib/python3.9/site-packages/typer/core.py”, line 778, in main
    return _main(
  File “/opt/conda/lib/python3.9/site-packages/typer/core.py”, line 216, in _main
    rv = self.invoke(ctx)
  File “/opt/conda/lib/python3.9/site-packages/click/core.py”, line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File “/opt/conda/lib/python3.9/site-packages/click/core.py”, line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File “/opt/conda/lib/python3.9/site-packages/click/core.py”, line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File “/opt/conda/lib/python3.9/site-packages/click/core.py”, line 760, in invoke
    return __callback(*args, **kwargs)
  File “/opt/conda/lib/python3.9/site-packages/energonai/cli/service.py”, line 19, in init
    mcfg.load_config(config_file)
  File “/opt/conda/lib/python3.9/site-packages/energonai/context/config.py”, line 161, in load_config
    self._config = Config.from_file(config)
  File “/opt/conda/lib/python3.9/site-packages/energonai/context/config.py”, line 105, in from_file
    module = source_file.load_module()
  File “<frozen importlib._bootstrap_external>“, line 529, in _check_name_wrapper
  File “<frozen importlib._bootstrap_external>“, line 1029, in load_module
  File “<frozen importlib._bootstrap_external>“, line 854, in load_module
  File “<frozen importlib._bootstrap>“, line 274, in _load_module_shim
  File “<frozen importlib._bootstrap>“, line 711, in _load
  File “<frozen importlib._bootstrap>“, line 680, in _load_unlocked
  File “<frozen importlib._bootstrap_external>“, line 850, in exec_module
  File “<frozen importlib._bootstrap>“, line 228, in _call_with_frames_removed
  File “/workspace/colossal/inference/examples/bert/bert_config.py”, line 1, in <module>
    from bert import bert_small, bert_large, bert_xl, bert_8B, bert_175B
  File “/workspace/colossal/inference/examples/bert/bert.py”, line 14, in <module>
    from energonai.kernel import transpose_pad, transpose_depad, depad
  File “/opt/conda/lib/python3.9/site-packages/energonai/kernel/__init__.py”, line 1, in <module>
    from .cuda_native import transpose_pad, transpose_depad, depad, scale_mask_softmax
  File “/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/__init__.py”, line 5, in <module>
    from .linear_func import EnergonLinearFunc
  File “/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/linear_func.py”, line 7, in <module>
    raise RuntimeError(‘energonai_linear_func requires cuda extensions’)
RuntimeError: energonai_linear_func requires cuda extensions

Root cause
Found the root cause — missing following in setup.py:

        ext_modules.append(cuda_ext_helper('energonai_linear_func',
                                           ['linear_wrapper.cpp'],
                                           extra_cuda_flags + cc_flag))

Can not start the Bloom server

Infomation
V100
CUDA 11.3
transformers==4.23.1
torch==1.12.0
colossalai==0.2.5
energonai==0.0.1+torch1.12cu11.3
running for bloom-560m & bloom-7b1
Question
When I try to start the bloom server using the examples in this link, I find it stops in this scenario.

I do not meet any errors and I can not send request to http://[ip]:[host]//generation.

Does EnergonAI support accelerated inference for segmenting anything?

The inference speed of the H model for segmenting anything feels relatively slow. I believe there should be common optimization techniques for transformers. Are there any reference materials available EnergonAI on this topic? Thank you.

Support OPT-IML model

Meta has released the OPT-IML model (OPT + Instruction Meta-Learning). Could you help me load its checkpoint?

Maybe you should add license for using OneFlow's LayerNorm Kernel implement？

https://github.com/hpcaitech/EnergonAI/blob/main/energonai/kernel/cuda_native/csrc/one_layer_norm.cuh

Here AITemplate add license for using oneflow's implement
https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/groupnorm/layer_norm.cuh#L16-L32

need guidelines on converting OPT-17B checkpoint

Hello,
I wan to serve OPT-175B model that has about 992 shards which needs to be resharded into 8 models first. Can you guide me, how i can exploit EnergonAI for OPT-175B network please?

question about load model state_dict in multi-gpus

The code is as below.

def load_checkpoint(file,
                    model: torch.nn.Module,
                    strict: bool = True,
                    preprocess_fn: Optional[Callable[[dict], dict]] = None,
                    **kwargs):
    """Loads training states from a checkpoint file.

    Args:
        file: a file-like object (has to implement read(), readline(), tell(), and seek()), or a string or os.PathLike
            object containing a file name.
        model (:class:`torch.nn.Module`): Model to load saved weights and buffers.
        optimizer (Union[:class:`torch.optim.Optimizer`, :class:`colossalai.nn.optimizer`]): Optimizer to recuperate.
        lr_scheduler (:class:`torch.optim.lr_scheduler._LRScheduler`, optional):
            lr_scheduler to recuperate, defaults to None.
        strict (bool, optional): Whether to strictly enforce that the keys in :attr:`state_dict`
            of the checkpoint match the names of parameters and buffers in model, defaults to True.

    Returns:
        int: The saved epoch number.

    Raises:
        RuntimeError: Raise error if the model/optimizer cannot successfully be recuperated
    """
    start = time()
    if gpc.get_local_rank(ParallelMode.MODEL) == 0:
        model_state = load_state_dict(file)
        if preprocess_fn:
            model_state = preprocess_fn(model_state)
    else:
        model_state = dict()
    dist.barrier()
    print(f'Load file time: {time()-start:.3f} s')
    # pipeline
    if is_using_pp():
        model_state = partition_pipeline_parallel_state_dict(model, model_state, **kwargs)
    if "prefix" in kwargs.keys():
        if kwargs['prefix'] != '':
            model_state = remove_prefix(model_state, kwargs["prefix"])

    model.load_state_dict(model_state, strict=strict)
    broadcast_model(model)

When we using the tp=4 parallel, I wonder why here just load_state_dict only 'get_local_rank(ParallelMode.MODEL) == 0'？
If so, the rest process will load empty model_state, right?

Docker cannot find the parent image defined in `docker/Dockerfile`

What is this issue about?

If you try to build the proposed image using docker/Dockerfile with the following commands:

cd docker
docker build -t energon .

The following error is raised:

=> [internal] load build definition from Dockerfile                                                                                                                                  0.0s
=> => transferring dockerfile: 378B                                                                                                                                                  0.0s
=> [internal] load .dockerignore                                                                                                                                                     0.0s
=> => transferring context: 2B                                                                                                                                                       0.0s
=> ERROR [internal] load metadata for docker.io/hpcaitech/colossalai:0.1.8                                                                                                           1.7s
=> [auth] hpcaitech/colossalai:pull token for registry-1.docker.io                                                                                                                   0.0s
------
> [internal] load metadata for docker.io/hpcaitech/colossalai:0.1.8:
------
failed to solve with frontend dockerfile.v0: failed to create LLB definition: docker.io/hpcaitech/colossalai:0.1.8: not found

It is not able to find the image with version hpcaitech/colossalai:0.1.8, which I believe it must have been deprecated. I also tried to find this image in docker hub however it doesn't seem to exist anymore.

Question: should this Dockerfile be deprecated or updated?

Additional comments

Based on the README.md, to use docker we need to run docker pull hpcaitech/energon-ai:latest which refers to energon-ai tag, however the Dockerfile mentioned above uses the parent images from colossalai. I was just wondering if this is right or was a mistake?

Thank you!

[RFC] Async engine and pipeline based on RPC

Motivation

Serving and inference engine are coupled now. It's not a good backend architecture.
HTTP server and engine are in the same process of worker 0. Due to GIL, they will preempt CPU, which decreases throughput.
Pipeline parallelism uses send() and recv(), which requires to customize communication for each model. This is not flexible.

Design

Architecture

Here is a 2 TP + 2 PP example. Each square means a process. Pipe is implemented by RPC and Queue.

Advantages

Serving and inference engine are decoupled. You can use python HTTP server, gRPC or any server you like.
Engine and workers are not in the same process. This means computation and I/O (except pipe communication) won't preempt CPU.
Pipe is implemented by torch.distributed.rpc and Queue. As pytorch says RPC messages are sent and received in parallel to execution of Python code, we can assume that computation and pipe communication are able to overlap.
As RPC call supports various input type and CUDA tensors (D2D), we are not required to customize communication for pipeline parallelism.

How to inference BLOOM-176B by multi-node multi-card？

HI，Thank you very much for your work, I can now do bloom-560m inference on single node 4 cards by example, how to use multi-node multi-card (like 4x4x32GB v100GPU) inference bloom-176b？

CUDA error: no kernel image is available for execution on the device

Update: I think this is caused by running a VM on Unraid. The Ubuntu kernel being used is not quite normal.

When attempting the OPT examples, via either Docker or running locally, I'm getting an error: CUDA error: no kernel image is available for execution on the device. This seems pretty unusual.

Possible causes:

I'm only using 1 GPU (I set the flags: TP=1, PP=1).
The below python script reports cuDNN is not available, but shows pytorch was installed have cuDNN. Conflicting..
I'm using CUDA 11.6 when Energon app wants CUDA 11.3
My 1080 Ti card should be compatible with the latest pytorch, due to its Cuda Compatibility (CC) of 6.1.

Any debugging advice? Thanks!

INFO:     Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)
                    INFO     colossalai - uvicorn.error - INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)
[09/10/22 19:24:25] INFO     colossalai - energon - INFO: ==> Rank 0 built layer 0-12 / total 12
                    INFO     colossalai - energon - INFO: Rank0/0 model size = 0.327696384 GB
INFO:     127.0.0.1:36218 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:36218 - "GET /openapi.json HTTP/1.1" 200 OK
[09/10/22 19:24:33] INFO     colossalai - opt_server - INFO: 127.0.0.1:36218 - "POST /generation" - max_tokens=64 prompt='Question: Where were the 2004 Olympics held?\nAnswer: Athens,
                             Greece\n\nQuestion: What is the longest river on the earth?\nAnswer:' top_k=50 top_p=0.5 temperature=0.7
On WorkerInfo(id=0, name=wok0):
RuntimeError('CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
    return method(rref.local_value(), *args, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_worker.py", line 118, in run
    output, cur_key = self.model.run(key, inputs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
    return self.run_without_pp(key, inputs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
    output = self.model(hidden_states=None, **sample)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/model_factory.py", line 114, in forward
    hidden_states = block(hidden_states=hidden_states,
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/endecoder.py", line 56, in forward
    hidden_states = residual + self.attn(hidden_states = hidden_states,
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/attention.py", line 84, in forward
    q = self.query_(hidden_states)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/nn/layer/parallel_1d/layers.py", line 302, in forward
    output_parallel = F.linear(input_parallel, self.weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kastan/ai/EnergonAI/examples/opt/executor.py", line 36, in _start
    outputs = self.engine.run(inputs).to_here()
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 220, in _handle_exception
    raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
RuntimeError: On WorkerInfo(id=0, name=wok0):
RuntimeError('CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
    return method(rref.local_value(), *args, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_worker.py", line 118, in run
    output, cur_key = self.model.run(key, inputs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
    return self.run_without_pp(key, inputs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
    output = self.model(hidden_states=None, **sample)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/model_factory.py", line 114, in forward
    hidden_states = block(hidden_states=hidden_states,
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/endecoder.py", line 56, in forward
    hidden_states = residual + self.attn(hidden_states = hidden_states,
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/attention.py", line 84, in forward
    q = self.query_(hidden_states)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/nn/layer/parallel_1d/layers.py", line 302, in forward
    output_parallel = F.linear(input_parallel, self.weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

My system information:

❯ python collect_env.py
Collecting environment information...
PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchvision==0.13.1
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               h2bc3f7f_2
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py38h7f8727e_0
[conda] mkl_fft                   1.3.1            py38hd3c417c_0
[conda] mkl_random                1.2.2            py38h51133e4_0
[conda] numpy                     1.23.1           py38h6c91a56_0
[conda] numpy-base                1.23.1           py38ha15fc14_0
[conda] pytorch                   1.12.1          py3.8_cuda11.3_cudnn8.3.2_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.12.1               py38_cu113    pytorch
[conda] torchvision               0.13.1               py38_cu113    pytorch

Remove hard code directory path

Currently, the model path and http config is hard coded in files.
Make the model path and the HTTP config can be passed via docker environmental variables.

How to use dynamic batch features

Hello, I have launched the opt-125M inference, and send request to server with locust. but what ever config the max_batch_size, the InferenceEngine always run in batch_size =1. how can i use the dynamic batch feature in Batch_server_manager?

EnergonAI running OPT reasoning example: When encountering a client request, the server is blocked and cannot return the result

运行环境：docker: hpcaitech/energon-ai:latest
运行目录：docker内：/workspace/EnergonAI/examples/opt
运行命令：python opt_fastapi.py opt-125m
服务启动时log：
==> Args:
model = opt-125m
tp = 1
master_host = localhost
master_port = 19990
rpc_port = 19980
max_batch_size = 8
pipe_size = 1
queue_size = 0
http_host = 0.0.0.0
http_port = 7070
checkpoint = None
cache_size = 0
cache_list_size = 1
[W ProcessGroupGloo.cpp:685] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[05/06/23 10:09:02] INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[05/06/23 10:09:03] INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1,
pipeline parallel size: 1, tensor parallel size: 1
[05/06/23 10:09:03] INFO colossalai - energonai - INFO:
/opt/conda/lib/python3.9/site-packages/energonai/model/model_factory.py:195
create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 0 built layer 0-12 / total 12
INFO colossalai - energonai - INFO:
/opt/conda/lib/python3.9/site-packages/energonai/model/model_factory.py:200
create_pipeline_model
INFO colossalai - energonai - INFO: Rank0/0 model size = 0.327696384 GB
[W ProcessGroupGloo.cpp:685] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
[W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
INFO colossalai - energonai - INFO: /opt/conda/lib/python3.9/site-packages/energonai/worker.py:55
init
INFO colossalai - energonai - INFO: worker0 start
[05/06/23 10:09:04] INFO colossalai - energonai - INFO: /opt/conda/lib/python3.9/site-packages/energonai/engine.py:60
init
INFO colossalai - energonai - INFO: Engine start
INFO: Started server process [1705]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7070 (Press CTRL+C to quit)

问题描述：
请求访问时：curl -XPOST -d '{"prompt": "What is the name of the largest continent on earth?","max_tokens": 128}' -H 'Content-type:application/json;charset=UTF-8' "http://xxxxxip:7070/generation"时，服务端阻塞在 opt_fastapi.py: async def generate(data: GenerationTaskReq, request: Request): output = await engine.wait(uid) 不返回结果，麻烦帮忙看一下是什么原因，感谢。

num_beams for beam search

Hi, I want to use num_beams for generate, but PipelineModel can't. Can you support num_beams?

Best wishes.

fail to install EnergonAI

I use anaconda, python 3.10 and pytorch 1.13.1 .

When I ran the following Installation command:
pip install .
an error happened. Part of the error message is:

Processing /home/liwj/project/EnergonAI_github
 Preparing metadata (setup.py) ... done
Building wheels for collected packages: energonai
 Building wheel for energonai (setup.py) ... error
 error: subprocess-exited-with-error

 × python setup.py bdist_wheel did not run successfully.
 │ exit code: 1
 ╰─> [113 lines of output]


     torch.__version__  = 1.13.1



     Compiling cuda extensions with
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2022 NVIDIA Corporation
     Built on Tue_Mar__8_18:18:20_PST_2022
     Cuda compilation tools, release 11.6, V11.6.124
     Build cuda_11.6.r11.6/compiler.31057947_0
     from /home/liwj/miniconda3/envs/py3.10/bin

     running bdist_wheel
     running build
     running build_py
     running build_ext
     building 'energonai_scale_mask' extension
     Emitting ninja build file /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/build.ninja...
     Compiling objects...
     Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
     ninja: no work to do.
     g++ -pthread -B /home/liwj/miniconda3/envs/py3.10/compiler_compat -shared -Wl,-rpath,/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath-link,/home/liwj/miniconda3/envs/py3.10/lib -L/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath,/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath-link,/home/liwj/miniconda3/envs/py3.10/lib -L/home/liwj/miniconda3/envs/py3.10/lib /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o -L/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/lib -L/home/liwj/miniconda3/envs/py3.10/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-cpython-310/energonai_scale_mask.cpython-310-x86_64-linux-gnu.so
     building 'energonai_layer_norm' extension
     Emitting ninja build file /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/build.ninja...
     Compiling objects...
     Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
     [1/1] /home/liwj/miniconda3/envs/py3.10/bin/nvcc  -I/home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/home/liwj/miniconda3/envs/py3.10/include -I/home/liwj/miniconda3/envs/py3.10/include/python3.10 -c -c /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0
     FAILED: /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o
     /home/liwj/miniconda3/envs/py3.10/bin/nvcc  -I/home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/home/liwj/miniconda3/envs/py3.10/include -I/home/liwj/miniconda3/envs/py3.10/include/python3.10 -c -c /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0
     sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
     sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
     sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
     In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
     /home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
        10 | #include <cusolverDn.h>
           |          ^~~~~~~~~~~~~~
     compilation terminated.
     In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
     /home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
        10 | #include <cusolverDn.h>
           |          ^~~~~~~~~~~~~~
     compilation terminated.
     In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
     /home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
        10 | #include <cusolverDn.h>
           |          ^~~~~~~~~~~~~~
     compilation terminated.
     ninja: build stopped: subcommand failed.
     Traceback (most recent call last):
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
         subprocess.run(
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/subprocess.py", line 526, in run
         raise CalledProcessError(retcode, process.args,
     subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

     The above exception was the direct cause of the following exception:

     Traceback (most recent call last):
       File "<string>", line 2, in <module>
       File "<pip-setuptools-caller>", line 34, in <module>
       File "/home/liwj/project/EnergonAI_github/setup.py", line 164, in <module>
         setup(
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 108, in setup
         return distutils.core.setup(**attrs)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
         return run_commands(dist)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
         dist.run_commands()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
         self.run_command(cmd)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
         super().run_command(command)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 325, in run
         self.run_command("build")
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
         self.distribution.run_command(command)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
         super().run_command(command)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
         self.run_command(cmd_name)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
         self.distribution.run_command(command)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
         super().run_command(command)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
         cmd_obj.run()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
         _build_ext.run(self)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
         self.build_extensions()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
         build_ext.build_extensions(self)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
         self._build_extensions_serial()
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
         self.build_extension(ext)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
         _build_ext.build_extension(self, ext)
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
         objects = self.compiler.compile(
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
         _write_ninja_file_and_compile_objects(
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
         _run_ninja_build(
       File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
         raise RuntimeError(message) from e
     RuntimeError: Error compiling objects for extension
     [end of output]

Support GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)

When will support huggingface GPT BigCode model

OPT inference

hello,
I want to just inference of pre-trained model in the terminal, but I don't want to run a HTTP server. How could I do that?

miss cache error when pose generation opt

how to use this demo, could u provide any detail example

Where is InferenceEngine definition?

I can't find it in energonai/engine.py

"from energonai.engine import InferenceEngine"

failure to compile energonai by the command : python setup.py build

When I tried to compile the energonai library, an error is reported:
D:\Anaconda3\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] 系统找不到指定的文件。 warnings.warn(f'Error checking compiler version for {compiler}: {error}') building 'energonai_scale_mask' extension Emitting ninja build file F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\build.ninja... Compiling objects... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 1.11.1.git.kitware.jobserver-1 "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:D:\Anaconda3\lib\site-packages\torch\lib /LIBPATH:D:\Anaconda3\lib\x64 /LIBPATH:D:\Anaconda3\libs /LIBPATH:D:\Anaconda3 /LIBPATH:D:\Anaconda 3\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10 _cuda.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EXPORT:PyInit_energonai_scale_mask F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\scale_mask_softmax_kernel.obj F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\sca le_mask_softmax_wrapper.obj /OUT:build\lib.win-amd64-cpython-38\energonai_scale_mask.cp38-win_amd64.pyd /IMPLIB:F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\energonai_scale_mask.cp38-win_amd64.lib LINK : fatal error LNK1181: 无法打开输入文件“F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai\kernel\cuda_native\csrc\scale_mask_softmax_kernel.obj” error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\link.exe' failed with exit code 1181
This may be caused by the exception of the link library on the Windows platform when the .obj and .lib are specified.
Looking forward to reply!

Provide a docker service

Package the EnergonAI into a docker image.
After the image is launched, it provides a OPT service via HTTP or RPC.

torch.load() hangs indefinitely when reading OPT pre-trained model weights

I'm trying to use OPT 66B pre-trained model for inference on EnergonAI. After preprocessing the weights by the script of preprocessing_ckpt_66b.py and starting opt server, the service hangs there when loading the weights. I tracked back and found it hangs on torch.load() after reading most of weight files (95% weights are loaded).

EnergonAI/energonai/utils/checkpointing.py

Line 42 in 98a12bc

sd = torch.load(filepath)

The output before hanging

start loading /root/EnergonAI/ckpt/opt_66b/14-restored.pt...
                    INFO     colossalai - energonai - INFO: Rank1/0 model size = 17.395826688 GB                                                                                 
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 2 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank2/0 model size = 17.395826688 GB                                                                                 
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 5 built layer 0-64 / total 64                                                                               
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 7 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: ==> Rank 6 built layer 0-64 / total 64                                                                               
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: ==> Rank 4 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank5/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: Rank7/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
[10/04/22 02:21:27] INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank4/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: Rank6/0 model size = 17.395826688 GB                                                                                 
                    INFO     colossalai - energonai - INFO: ==> Rank 3 built layer 0-64 / total 64                                                                               
                    INFO     colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model              
                    INFO     colossalai - energonai - INFO: Rank3/0 model size = 17.395826688 GB

By the way, if I run only the code block around torch.load() locally and all weights could be loaded successfully through torch.load().

[Feature]: Automatic Pipeline Parallelism

Describe the feature:
We are going to introduce the automated pipeline parallelism feature into EnergonAI, which hopes that users only need to specify some simple arguments and achieve the pipeline parallelism.
With torch.fx, here the pipelinable directory is with functions that can split a model into multiple submodules.

Difficulty:

Use meta device in fx.GraphModule generation to reduce peak memory usage.
auto_pipeline_wrapper.py is not that automated.