hpcaitech / energonai
Large-scale model inference.
License: Apache License 2.0
Hi,
I am very interested in the distributed inference of Colossal AI. Since we have pre-trained NLP models from PyTorch or JAX, I wonder whether it is possible to use EnergonAI for inference, and what would need to be done. At the inference (model production) stage, a smaller model instance matters much more than during training; just imagine an NLP model server that returns results to clients.
From your document, For models trained by [Colossal-AI](https://github.com/hpcaitech/ColossalAI), they can be seamlessly transferred to Energon-AI. For single-device models, they require manual coding works to introduce tensor parallelism and pipeline parallelism.
I am not sure how this applies to my question. If you have some examples, I would be glad to study them.
For Microsoft DeepSpeed, they claim DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don’t require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints.
I am wondering if Colossal AI has similar capability.
I can't find InferenceEngine in energonai/engine.py, although it is imported with:
"from energonai.engine import InferenceEngine"
Hi, is there any concrete documentation of the architecture or implementation of the project? Thank you.
Hello, I have launched opt-125M inference and am sending requests to the server with locust. But whatever max_batch_size I configure, the InferenceEngine always runs with batch_size = 1. How can I use the dynamic batching feature in Batch_server_manager?
When will the Hugging Face GPT BigCode model be supported?
I got the following error when I tried to use opt-125m.
The env details are:
NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8
torch 1.13.0
Full error message
On WorkerInfo(id=0, name=wok0):
RuntimeError('FusedLayerNormAffineFunction requires cuda extensions')
Traceback (most recent call last):
File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 19, in forward
import colossalai._C.layer_norm
ImportError: /root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/_C/layer_norm.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
return method(rref.local_value(), *args, **kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/rpc_worker.py", line 118, in run
output, cur_key = self.model.run(key, inputs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
return self.run_without_pp(key, inputs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
output = self.model(hidden_states=None, **sample)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/model/model_factory.py", line 114, in forward
hidden_states = block(hidden_states=hidden_states,
File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/energonai/model/endecoder.py", line 52, in forward
hidden_states = self.norm1(hidden_states)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
return self._forward_func(*args)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 73, in forward
return FusedLayerNormAffineFunction.apply(input, self.weight, self.bias, self.normalized_shape, self.eps)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 105, in decorate_fwd
return fwd(*args, **kwargs)
File "/root/.conda/envs/llm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/layer_norm.py", line 21, in forward
raise RuntimeError('FusedLayerNormAffineFunction requires cuda extensions')
RuntimeError: FusedLayerNormAffineFunction requires cuda extensions
Problem
[root@2e71bfd17f96 inference]# export PYTHONPATH=/workspace/colossal/inference/examples/bert
[root@2e71bfd17f96 inference]# energonai service init --config_file=/workspace/colossal/inference/examples/bert/bert_config.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/linear_func.py", line 5, in <module>
energonai_linear = importlib.import_module("energonai_linear_func")
File "/opt/conda/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'energonai_linear_func'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/energonai", line 8, in <module>
sys.exit(typer_click_object())
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/energonai/cli/service.py", line 19, in init
mcfg.load_config(config_file)
File "/opt/conda/lib/python3.9/site-packages/energonai/context/config.py", line 161, in load_config
self._config = Config.from_file(config)
File "/opt/conda/lib/python3.9/site-packages/energonai/context/config.py", line 105, in from_file
module = source_file.load_module()
File "<frozen importlib._bootstrap_external>", line 529, in _check_name_wrapper
File "<frozen importlib._bootstrap_external>", line 1029, in load_module
File "<frozen importlib._bootstrap_external>", line 854, in load_module
File "<frozen importlib._bootstrap>", line 274, in _load_module_shim
File "<frozen importlib._bootstrap>", line 711, in _load
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/workspace/colossal/inference/examples/bert/bert_config.py", line 1, in <module>
from bert import bert_small, bert_large, bert_xl, bert_8B, bert_175B
File "/workspace/colossal/inference/examples/bert/bert.py", line 14, in <module>
from energonai.kernel import transpose_pad, transpose_depad, depad
File "/opt/conda/lib/python3.9/site-packages/energonai/kernel/__init__.py", line 1, in <module>
from .cuda_native import transpose_pad, transpose_depad, depad, scale_mask_softmax
File "/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/__init__.py", line 5, in <module>
from .linear_func import EnergonLinearFunc
File "/opt/conda/lib/python3.9/site-packages/energonai/kernel/cuda_native/linear_func.py", line 7, in <module>
raise RuntimeError('energonai_linear_func requires cuda extensions')
RuntimeError: energonai_linear_func requires cuda extensions
Root cause
The root cause is that the following is missing from setup.py:
ext_modules.append(cuda_ext_helper('energonai_linear_func',
['linear_wrapper.cpp'],
extra_cuda_flags + cc_flag))
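One way to confirm which compiled extensions actually made it into an installation is to probe for them by name. The sketch below is a hedged diagnostic, not part of EnergonAI: missing_extensions is a hypothetical helper, and the two module names are simply the ones that appear in the tracebacks and build logs on this page.

```python
import importlib.util
from typing import Iterable, List


def missing_extensions(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the import path."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Extension names taken from the logs on this page; adjust to your build.
    wanted = ["energonai_linear_func", "energonai_scale_mask"]
    print("missing extensions:", missing_extensions(wanted))
```

If a name is reported as missing after a successful install, the extension was probably never registered in setup.py or failed to compile.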
Hi, I want to use num_beams for generation, but PipelineModel doesn't support it. Could you add support for num_beams?
Best wishes.
I can't find server.sh. How can I run an example now?
Meta has released the OPT-IML model (OPT + Instruction Meta-Learning). Could you help me load its checkpoint?
I think many users couldn't find it. The gpt2 inference example no longer seems to be updated.
The code is as below.
def load_checkpoint(file,
                    model: torch.nn.Module,
                    strict: bool = True,
                    preprocess_fn: Optional[Callable[[dict], dict]] = None,
                    **kwargs):
    """Loads training states from a checkpoint file.
    Args:
        file: a file-like object (has to implement read(), readline(), tell(), and seek()), or a string or os.PathLike
            object containing a file name.
        model (:class:`torch.nn.Module`): Model to load saved weights and buffers.
        optimizer (Union[:class:`torch.optim.Optimizer`, :class:`colossalai.nn.optimizer`]): Optimizer to recuperate.
        lr_scheduler (:class:`torch.optim.lr_scheduler._LRScheduler`, optional):
            lr_scheduler to recuperate, defaults to None.
        strict (bool, optional): Whether to strictly enforce that the keys in :attr:`state_dict`
            of the checkpoint match the names of parameters and buffers in model, defaults to True.
    Returns:
        int: The saved epoch number.
    Raises:
        RuntimeError: Raise error if the model/optimizer cannot successfully be recuperated
    """
    start = time()
    if gpc.get_local_rank(ParallelMode.MODEL) == 0:
        model_state = load_state_dict(file)
        if preprocess_fn:
            model_state = preprocess_fn(model_state)
    else:
        model_state = dict()
    dist.barrier()
    print(f'Load file time: {time()-start:.3f} s')
    # pipeline
    if is_using_pp():
        model_state = partition_pipeline_parallel_state_dict(model, model_state, **kwargs)
    if "prefix" in kwargs.keys():
        if kwargs['prefix'] != '':
            model_state = remove_prefix(model_state, kwargs["prefix"])
    model.load_state_dict(model_state, strict=strict)
    broadcast_model(model)
When using tensor parallelism with tp=4, I wonder why load_state_dict is called only when 'get_local_rank(ParallelMode.MODEL) == 0'.
If so, the remaining processes will load an empty model_state, right?
Hi, thank you very much for your work. Following the example, I can now run bloom-560m inference on a single node with 4 cards. How can I run bloom-176b inference on multiple nodes with multiple cards (e.g. 4x4x32GB V100 GPUs)?
When running the opt 125m example, why does inference start and then stay stuck in the engine's async def wait(self, uid: Hashable) -> Any: stage? The environment was set up using Docker.
Information
V100
CUDA 11.3
transformers==4.23.1
torch==1.12.0
colossalai==0.2.5
energonai==0.0.1+torch1.12cu11.3
running for bloom-560m & bloom-7b1
Question
When I try to start the bloom server using the examples in this link, I find it stops in this scenario.
I do not see any errors, but I cannot send requests to http://[ip]:[host]//generation.
Hello,
I just want to run inference with a pre-trained model in the terminal, without running an HTTP server. How could I do that?
Problem
If we run energonai in Docker like this:
docker run -ti --gpus all --rm --ipc=host -p 8010:8010 ...
Then, inside the container, run:
export PYTHONPATH=/workspace/colossal/inference/examples/bert
energonai service init --config_file=/workspace/colossal/inference/examples/bert/bert_config.py
Access to "http://localhost:8010/" is refused.
Root cause
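server_host in the configuration files is set to "127.0.0.1", which binds the server only to the container's own loopback interface, so connections arriving through the published port are refused. It should be set to "0.0.0.0". All config files should be updated. For example, the file examples/bert/bert_config.py: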
server_host = "127.0.0.1"
=> server_host = "0.0.0.0"
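To spot the same misconfiguration in other config files, a small check on the bind address can help. This is only a sketch using the standard ipaddress module; classify_bind_host is a hypothetical helper name and is not part of EnergonAI.

```python
import ipaddress


def classify_bind_host(host: str) -> str:
    """Classify a bind address as 'loopback', 'all-interfaces', or 'specific'."""
    if host == "localhost":
        return "loopback"
    addr = ipaddress.ip_address(host)  # raises ValueError for non-IP strings
    if addr.is_loopback:
        # Reachable only from inside the same network namespace (e.g. the container).
        return "loopback"
    if addr.is_unspecified:
        # 0.0.0.0 or ::, i.e. listen on every interface.
        return "all-interfaces"
    return "specific"


if __name__ == "__main__":
    for host in ("127.0.0.1", "0.0.0.0"):
        print(host, "->", classify_bind_host(host))
```

A server_host classified as 'loopback' will not be reachable through docker run -p, whereas 'all-interfaces' will.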
root@2d8fec1401d1:/workspace/EnergonAI# python examples/gpt/gpt_batch_server.py
Traceback (most recent call last):
File "/workspace/EnergonAI/examples/gpt/gpt_batch_server.py", line 7, in <module>
from energonai.engine import InferenceEngine
ImportError: cannot import name 'InferenceEngine' from 'energonai.engine' (/opt/conda/lib/python3.9/site-packages/energonai/engine.py)
After I modified the InferenceEngine import, another error occurs:
root@2d8fec1401d1:/mnt/EnergonAI# python examples/gpt/gpt_batch_server.py
Traceback (most recent call last):
File "/mnt/EnergonAI/examples/gpt/gpt_batch_server.py", line 8, in <module>
from energonai.legacy_batch_mgr.dynamic_batch_manager import Dynamic_Batch_Manager
File "/opt/conda/lib/python3.9/site-packages/energonai/legacy_batch_mgr/__init__.py", line 1, in <module>
from .worker_server import launch_worker
ModuleNotFoundError: No module named 'energonai.legacy_batch_mgr.worker_server'
@ver217
Hi, following guidance from your colleagues on Slack, I am contacting you to raise some issues.
Regarding the examples in your EnergonAI repository:
Most of them are written in an outdated way, for example vit:
InferenceEngine is imported via from energonai.engine import InferenceEngine, but unfortunately it is not defined anywhere in the energonai.engine code:
I hope you can update the examples soon and let me know. If the update is in the waiting queue for now, I would appreciate a simple usage example for an image classification task (e.g. resnet on cifar10) or an audio classification task (pann on audioset); this is very important to me.
Here AITemplate adds a license notice for using OneFlow's implementation:
https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/groupnorm/layer_norm.cuh#L16-L32
When using transformers version 4.26.1, this import breaks: from transformers.generation_logits_process import TopKLogitsWarper, TopPLogitsWarper, TemperatureLogitsWarper, LogitsProcessorList
I think it needs to be changed to: from transformers.generation.logits_process import TopKLogitsWarper, TopPLogitsWarper, TemperatureLogitsWarper, LogitsProcessorList
(dot instead of underscore after generation).
Either editing as above, or rolling back to transformers 4.24.0 resolves this import error.
(I have other errors that stop running the OPT inference example but likely unrelated.)
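A version-tolerant import is a third option that avoids pinning transformers. The sketch below assumes only the two module paths quoted in this report; import_first_available and load_logits_process are hypothetical helper names, not part of EnergonAI or transformers.

```python
import importlib
from typing import Iterable


def import_first_available(module_names: Iterable[str]):
    """Import and return the first module in module_names that can be imported."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


def load_logits_process():
    """Try the newer module path first (dot after 'generation'), then the older one."""
    return import_first_available([
        "transformers.generation.logits_process",
        "transformers.generation_logits_process",
    ])
```

With this in place, a caller could write module = load_logits_process() and then read module.TopKLogitsWarper and the other classes from it, whichever layout the installed transformers version uses.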
When installing this with pip install ., the following error is encountered:
fatal error: cub/cub.cuh: No such file or directory
#include <cub/cub.cuh>
^~~~~~~~~~~~~
compilation terminated.
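The build log on this page reports nvcc release 10.2. As far as I know, the CUB headers only started shipping inside the CUDA Toolkit with version 11.0, so on 10.2 they would have to be installed separately and added to the include path. A quick way to check is to look for the header in the include directories nvcc is given. The helper below is a hypothetical sketch; the directory is the one passed with -I in the log.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_header(include_dirs: Iterable[str], relative_path: str) -> Optional[Path]:
    """Return the first existing include_dir/relative_path, or None if absent."""
    for directory in include_dirs:
        candidate = Path(directory) / relative_path
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # /usr/local/cuda/include is the CUDA include directory from the nvcc command line.
    found = find_header(["/usr/local/cuda/include"], "cub/cub.cuh")
    print("cub/cub.cuh:", found if found else "not found")
```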
The complete log is:
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing /root/gpt_exp/opt_colossal/EnergonAI-main
Preparing metadata (setup.py) ... done
Building wheels for collected packages: energonai
Building wheel for energonai (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [106 lines of output]
torch.__version__ = 1.10.1+cu102
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
from /usr/local/cuda/bin
running bdist_wheel
running build
running build_py
running build_ext
building 'energonai_scale_mask' extension
/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py:298: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
platform=sys.platform))
Emitting ninja build file /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /usr/local/cuda/bin/nvcc -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o
/usr/local/cuda/bin/nvcc -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory
#include <cub/cub.cuh>
^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
env=env)
File "/home/kg/anaconda3/lib/python3.7/subprocess.py", line 487, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 36, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/root/gpt_exp/opt_colossal/EnergonAI-main/setup.py", line 187, in <module>
'console_scripts': ['energonai=energonai.cli:typer_click_object', ],
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/home/kg/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
build_ext.build_extensions(self)
File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
_build_ext.build_extension(self, ext)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
depends=ext.depends)
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
with_cuda=with_cuda)
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
error_prefix='Error compiling objects for extension')
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for energonai
Running setup.py clean for energonai
Failed to build energonai
Installing collected packages: energonai
Running setup.py install for energonai ... error
error: subprocess-exited-with-error
× Running setup.py install for energonai did not run successfully.
│ exit code: 1
╰─> [272 lines of output]
torch.__version__ = 1.10.1+cu102
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
from /usr/local/cuda/bin
running install
/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/energonai
copying energonai/worker.py -> build/lib.linux-x86_64-3.7/energonai
copying energonai/task.py -> build/lib.linux-x86_64-3.7/energonai
copying energonai/__init__.py -> build/lib.linux-x86_64-3.7/energonai
copying energonai/engine.py -> build/lib.linux-x86_64-3.7/energonai
copying energonai/batch_mgr.py -> build/lib.linux-x86_64-3.7/energonai
copying energonai/pipe.py -> build/lib.linux-x86_64-3.7/energonai
creating build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/files.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/checkpointing_hf_gpt2.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/timer.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/__init__.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/common.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/checkpointing_opt.py -> build/lib.linux-x86_64-3.7/energonai/utils
copying energonai/utils/checkpointing.py -> build/lib.linux-x86_64-3.7/energonai/utils
creating build/lib.linux-x86_64-3.7/energonai/testing
copying energonai/testing/models.py -> build/lib.linux-x86_64-3.7/energonai/testing
copying energonai/testing/__init__.py -> build/lib.linux-x86_64-3.7/energonai/testing
creating build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/attention.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/embedding.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/__init__.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/mlp.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/endecoder.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/model_factory.py -> build/lib.linux-x86_64-3.7/energonai/model
copying energonai/model/downstream.py -> build/lib.linux-x86_64-3.7/energonai/model
creating build/lib.linux-x86_64-3.7/energonai/communication
copying energonai/communication/collective.py -> build/lib.linux-x86_64-3.7/energonai/communication
copying energonai/communication/p2p.py -> build/lib.linux-x86_64-3.7/energonai/communication
copying energonai/communication/__init__.py -> build/lib.linux-x86_64-3.7/energonai/communication
copying energonai/communication/utils.py -> build/lib.linux-x86_64-3.7/energonai/communication
copying energonai/communication/ring.py -> build/lib.linux-x86_64-3.7/energonai/communication
creating build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
copying energonai/legacy_batch_mgr/dynamic_batch_manager.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
copying energonai/legacy_batch_mgr/__init__.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
copying energonai/legacy_batch_mgr/naive_batch_manager.py -> build/lib.linux-x86_64-3.7/energonai/legacy_batch_mgr
creating build/lib.linux-x86_64-3.7/energonai/pipelinable
copying energonai/pipelinable/split_method.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
copying energonai/pipelinable/__init__.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
copying energonai/pipelinable/energon_tracer.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
copying energonai/pipelinable/split_policy.py -> build/lib.linux-x86_64-3.7/energonai/pipelinable
creating build/lib.linux-x86_64-3.7/energonai/kernel
copying energonai/kernel/__init__.py -> build/lib.linux-x86_64-3.7/energonai/kernel
creating build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/__init__.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/scale_mask_softmax.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/transpose_pad.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/linear_func.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/layer_norm.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
copying energonai/kernel/cuda_native/one_layer_norm.py -> build/lib.linux-x86_64-3.7/energonai/kernel/cuda_native
running build_ext
building 'energonai_scale_mask' extension
creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7
creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai
creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel
creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native
creating /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc
/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py:298: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
platform=sys.platform))
Emitting ninja build file /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o
/usr/local/cuda/bin/nvcc -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory
#include <cub/cub.cuh>
^~~~~~~~~~~~~
compilation terminated.
[2/2] c++ -MMD -MF /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o.d -pthread -B /home/kg/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/TH -I/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/kg/anaconda3/include/python3.7m -c -c /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp -o /root/gpt_exp/opt_colossal/EnergonAI-main/build/temp.linux-x86_64-3.7/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_scale_mask -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp: In function 'at::Tensor scale_mask_softmax_wrapper(int, int, int, at::Tensor, at::Tensor)':
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:21: warning: 'at::DeprecatedTypeProperties& at::Tensor::type() const' is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
^
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:241:39: note: in definition of macro 'C10_EXPAND_MSVC_WORKAROUND'
#define C10_EXPAND_MSVC_WORKAROUND(x) x
^
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:261:34: note: in expansion of macro 'C10_UNLIKELY'
#define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
^~~~~~~~~~~~
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:313:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST'
if (C10_UNLIKELY_OR_CONST(!(cond))) { \
^~~~~~~~~~~~~~~~~~~~~
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:599:32: note: in expansion of macro 'TORCH_INTERNAL_ASSERT'
C10_EXPAND_MSVC_WORKAROUND(TORCH_INTERNAL_ASSERT(cond, __VA_ARGS__)); \
^~~~~~~~~~~~~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:3: note: in expansion of macro 'AT_ASSERTM'
AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
^~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:13:3: note: in expansion of macro 'CHECK_CUDA'
CHECK_CUDA(x); \
^~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:29:3: note: in expansion of macro 'CHECK_FP16_32_INPUT'
CHECK_FP16_32_INPUT(correlation);
^~~~~~~~~~~~~~~~~~~
In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:194:30: note: declared here
DeprecatedTypeProperties & type() const {
^~~~
In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Device.h:5,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/core/Allocator.h:6,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:7,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:21: warning: 'at::DeprecatedTypeProperties& at::Tensor::type() const' is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
^
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:241:39: note: in definition of macro 'C10_EXPAND_MSVC_WORKAROUND'
#define C10_EXPAND_MSVC_WORKAROUND(x) x
^
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:261:34: note: in expansion of macro 'C10_UNLIKELY'
#define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
^~~~~~~~~~~~
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:313:7: note: in expansion of macro 'C10_UNLIKELY_OR_CONST'
if (C10_UNLIKELY_OR_CONST(!(cond))) { \
^~~~~~~~~~~~~~~~~~~~~
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/c10/util/Exception.h:599:32: note: in expansion of macro 'TORCH_INTERNAL_ASSERT'
C10_EXPAND_MSVC_WORKAROUND(TORCH_INTERNAL_ASSERT(cond, __VA_ARGS__)); \
^~~~~~~~~~~~~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:5:3: note: in expansion of macro 'AT_ASSERTM'
AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
^~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:17:3: note: in expansion of macro 'CHECK_CUDA'
CHECK_CUDA(x); \
^~~~~~~~~~
/root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:30:3: note: in expansion of macro 'CHECK_INPUT'
CHECK_INPUT(real_seq_len);
^~~~~~~~~~~
In file included from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Tensor.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/Context.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/ATen.h:9,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:8,
from /home/kg/anaconda3/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
from /root/gpt_exp/opt_colossal/EnergonAI-main/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.cpp:2:
/home/kg/anaconda3/lib/python3.7/site-packages/torch/include/ATen/core/TensorBody.h:194:30: note: declared here
DeprecatedTypeProperties & type() const {
^~~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
env=env)
File "/home/kg/anaconda3/lib/python3.7/subprocess.py", line 487, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 36, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/root/gpt_exp/opt_colossal/EnergonAI-main/setup.py", line 187, in <module>
'console_scripts': ['energonai=energonai.cli:typer_click_object', ],
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/home/kg/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/install.py", line 68, in run
return orig.install.run(self)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/kg/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kg/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
build_ext.build_extensions(self)
File "/home/kg/anaconda3/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/home/kg/anaconda3/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
_build_ext.build_extension(self, ext)
File "/home/kg/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
depends=ext.depends)
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
with_cuda=with_cuda)
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
error_prefix='Error compiling objects for extension')
File "/home/kg/anaconda3/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> energonai
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
I am using Python 3.7.4, torch 1.10.1+cu102, transformers 4.26.0, and colossalai 0.2.5.
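The fatal error in the log above means nvcc could not find cub/cub.cuh on any of its -I paths. CUB has shipped with the CUDA Toolkit since version 11.0, so if the toolkit under /usr/local/cuda is CUDA 10.2 (as the cu102 torch build suggests, though this is an assumption), the header would be missing unless CUB was installed separately. A minimal sketch for checking this; find_header is a hypothetical helper, not part of EnergonAI or PyTorch, and the CUDA_HOME fallback path is taken from the nvcc command in the log:

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_header(header: str, include_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first existing path to `header` under `include_dirs`, else None."""
    for directory in include_dirs:
        candidate = Path(directory) / header
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumption: the toolkit is at $CUDA_HOME, falling back to /usr/local/cuda.
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    include_dirs = [os.path.join(cuda_home, "include")]
    found = find_header("cub/cub.cuh", include_dirs)
    print(f"cub/cub.cuh: {found if found else 'NOT FOUND'}")
```

If the header is reported as not found, pointing the build at a newer toolkit, or adding a separately downloaded CUB checkout to the include path, are the usual options to try.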
Hi, I am having difficulty loading the pre-trained model weights for OPT_125M provided by Meta. Here are the error messages:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/worker.py", line 30, in __init__
self.model: nn.Module = model_fn(**model_kwargs).cuda()
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 283, in opt_125M
return create_pipeline_model(**model_kwargs)
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/model/model_factory.py", line 213, in create_pipeline_model
load_checkpoint(model_kwargs["checkpoint"], model, preprocess_fn=preprocess_fn, **model_kwargs)
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/energonai/utils/checkpointing.py", line 95, in load_checkpoint
model.load_state_dict(model_state, strict=strict)
File "/work/09308/zhengmk/miniconda3/envs/colossal/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
Missing key(s) in state_dict: "blocks.0.norm1.module.weight", "blocks.0.norm1.module.bias", "blocks.0.norm2.module.weight", "blocks.0.norm2.module.bias", "blocks.1.norm1.module.weight", "blocks.1.norm1.module.bias", "blocks.1.norm2.module.weight", "blocks.1.norm2.module.bias", "blocks.2.norm1.module.weight", "blocks.2.norm1.module.bias", "blocks.2.norm2.module.weight", "blocks.2.norm2.module.bias", "blocks.3.norm1.module.weight", "blocks.3.norm1.module.bias", "blocks.3.norm2.module.weight", "blocks.3.norm2.module.bias", "blocks.4.norm1.module.weight", "blocks.4.norm1.module.bias", "blocks.4.norm2.module.weight", "blocks.4.norm2.module.bias", "blocks.5.norm1.module.weight", "blocks.5.norm1.module.bias", "blocks.5.norm2.module.weight", "blocks.5.norm2.module.bias", "blocks.6.norm1.module.weight", "blocks.6.norm1.module.bias", "blocks.6.norm2.module.weight", "blocks.6.norm2.module.bias", "blocks.7.norm1.module.weight", "blocks.7.norm1.module.bias", "blocks.7.norm2.module.weight", "blocks.7.norm2.module.bias", "blocks.8.norm1.module.weight", "blocks.8.norm1.module.bias", "blocks.8.norm2.module.weight", "blocks.8.norm2.module.bias", "blocks.9.norm1.module.weight", "blocks.9.norm1.module.bias", "blocks.9.norm2.module.weight", "blocks.9.norm2.module.bias", "blocks.10.norm1.module.weight", "blocks.10.norm1.module.bias", "blocks.10.norm2.module.weight", "blocks.10.norm2.module.bias", "blocks.11.norm1.module.weight", "blocks.11.norm1.module.bias", "blocks.11.norm2.module.weight", "blocks.11.norm2.module.bias", "norm.module.weight", "norm.module.bias".
Unexpected key(s) in state_dict: "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.1.norm1.weight", "blocks.1.norm1.bias", "blocks.1.norm2.weight", "blocks.1.norm2.bias", "blocks.2.norm1.weight", "blocks.2.norm1.bias", "blocks.2.norm2.weight", "blocks.2.norm2.bias", "blocks.3.norm1.weight", "blocks.3.norm1.bias", "blocks.3.norm2.weight", "blocks.3.norm2.bias", "blocks.4.norm1.weight", "blocks.4.norm1.bias", "blocks.4.norm2.weight", "blocks.4.norm2.bias", "blocks.5.norm1.weight", "blocks.5.norm1.bias", "blocks.5.norm2.weight", "blocks.5.norm2.bias", "blocks.6.norm1.weight", "blocks.6.norm1.bias", "blocks.6.norm2.weight", "blocks.6.norm2.bias", "blocks.7.norm1.weight", "blocks.7.norm1.bias", "blocks.7.norm2.weight", "blocks.7.norm2.bias", "blocks.8.norm1.weight", "blocks.8.norm1.bias", "blocks.8.norm2.weight", "blocks.8.norm2.bias", "blocks.9.norm1.weight", "blocks.9.norm1.bias", "blocks.9.norm2.weight", "blocks.9.norm2.bias", "blocks.10.norm1.weight", "blocks.10.norm1.bias", "blocks.10.norm2.weight", "blocks.10.norm2.bias", "blocks.11.norm1.weight", "blocks.11.norm1.bias", "blocks.11.norm2.weight", "blocks.11.norm2.bias", "norm.weight", "norm.bias".
load 1 files using 1 procs
Load file time: 0.136 s
It seems that load_checkpoint() and the data in checkpoint.pt use different naming conventions. Is this caused by a version issue? I am using energonai==0.0.2.
Thanks in advance for your help.
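The two key lists in the error differ only by a ".module" segment: the model expects names such as "blocks.0.norm1.module.weight", while the checkpoint provides "blocks.0.norm1.weight". One possible workaround, sketched below, is to rename the checkpoint keys before loading. This assumes the checkpoint is a flat mapping from parameter names to tensors; remap_norm_keys is a hypothetical helper, not an EnergonAI function, and it does not address whatever version mismatch may have produced the difference:

```python
import re
from typing import Dict, TypeVar

T = TypeVar("T")

# Matches "blocks.<i>.norm1.weight", "blocks.<i>.norm2.bias", "norm.weight", "norm.bias".
_NORM_KEY = re.compile(r"^(blocks\.\d+\.norm[12]|norm)\.(weight|bias)$")


def remap_norm_keys(state_dict: Dict[str, T]) -> Dict[str, T]:
    """Insert '.module' into layer-norm keys; leave all other keys untouched."""
    return {
        _NORM_KEY.sub(r"\1.module.\2", key): value
        for key, value in state_dict.items()
    }
```

The remapped dictionary could then be passed to model.load_state_dict() in place of the original one. Keys that already contain ".module" do not match the pattern and are left unchanged.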
When I tried to compile the energonai library, the following error was reported:
D:\Anaconda3\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'energonai_scale_mask' extension
Emitting ninja build file F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.11.1.git.kitware.jobserver-1
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:D:\Anaconda3\lib\site-packages\torch\lib /LIBPATH:D:\Anaconda3\lib\x64 /LIBPATH:D:\Anaconda3\libs /LIBPATH:D:\Anaconda3 /LIBPATH:D:\Anaconda3\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.16299.0\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10_cuda.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EXPORT:PyInit_energonai_scale_mask F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\scale_mask_softmax_kernel.obj F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\scale_mask_softmax_wrapper.obj /OUT:build\lib.win-amd64-cpython-38\energonai_scale_mask.cp38-win_amd64.pyd /IMPLIB:F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai/kernel/cuda_native/csrc\energonai_scale_mask.cp38-win_amd64.lib
LINK : fatal error LNK1181: cannot open input file "F:\Project\python\AI\EnergonAI-main\build\temp.win-amd64-cpython-38\Release\energonai\kernel\cuda_native\csrc\scale_mask_softmax_kernel.obj"
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\link.exe' failed with exit code 1181
This may be caused by a problem with how the linker handles the specified .obj and .lib files on the Windows platform.
Looking forward to your reply!
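One possible explanation, offered only as a guess from the log: the UserWarning reports that cl could not be run ([WinError 2], file not found), which usually means the MSVC compiler is not on PATH in the shell used for the build. If the compile step never produced scale_mask_softmax_kernel.obj, the linker would then fail with LNK1181. A minimal sketch for checking which build tools are visible; missing_tools is a hypothetical helper, not part of torch or EnergonAI:

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # cl and link ship with MSVC, nvcc with the CUDA Toolkit; ninja drives the build.
    absent = missing_tools(["cl", "link", "nvcc", "ninja"])
    print("missing:", ", ".join(absent) if absent else "none")
```

If cl is listed as missing, running the build from the "x64 Native Tools Command Prompt for VS 2019", which sets up the MSVC environment, may be worth trying.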
If you try to build the proposed image from docker/Dockerfile with the following commands:
cd docker
docker build -t energon .
The following error is raised:
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 378B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> ERROR [internal] load metadata for docker.io/hpcaitech/colossalai:0.1.8 1.7s
=> [auth] hpcaitech/colossalai:pull token for registry-1.docker.io 0.0s
------
> [internal] load metadata for docker.io/hpcaitech/colossalai:0.1.8:
------
failed to solve with frontend dockerfile.v0: failed to create LLB definition: docker.io/hpcaitech/colossalai:0.1.8: not found
The build cannot find the image hpcaitech/colossalai:0.1.8, which I believe has been deprecated. I also looked for this image on Docker Hub, but it no longer seems to exist.
Question: should this Dockerfile be deprecated or updated?
Based on the README.md, to use Docker we need to run docker pull hpcaitech/energon-ai:latest, which refers to the energon-ai image, whereas the Dockerfile mentioned above uses a parent image from colossalai. Is this intended, or is it a mistake?
Thank you!
How do I use this demo? Could you provide a detailed example?
I use Anaconda, Python 3.10, and PyTorch 1.13.1.
When I ran the following installation command:
pip install .
an error occurred. Part of the error message is:
Processing /home/liwj/project/EnergonAI_github
Preparing metadata (setup.py) ... done
Building wheels for collected packages: energonai
Building wheel for energonai (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [113 lines of output]
torch.__version__ = 1.13.1
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
from /home/liwj/miniconda3/envs/py3.10/bin
running bdist_wheel
running build
running build_py
running build_ext
building 'energonai_scale_mask' extension
Emitting ninja build file /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -B /home/liwj/miniconda3/envs/py3.10/compiler_compat -shared -Wl,-rpath,/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath-link,/home/liwj/miniconda3/envs/py3.10/lib -L/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath,/home/liwj/miniconda3/envs/py3.10/lib -Wl,-rpath-link,/home/liwj/miniconda3/envs/py3.10/lib -L/home/liwj/miniconda3/envs/py3.10/lib /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/scale_mask_softmax_kernel.o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/scale_mask_softmax_wrapper.o -L/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/lib -L/home/liwj/miniconda3/envs/py3.10/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-cpython-310/energonai_scale_mask.cpython-310-x86_64-linux-gnu.so
building 'energonai_layer_norm' extension
Emitting ninja build file /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /home/liwj/miniconda3/envs/py3.10/bin/nvcc -I/home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/home/liwj/miniconda3/envs/py3.10/include -I/home/liwj/miniconda3/envs/py3.10/include/python3.10 -c -c /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o
/home/liwj/miniconda3/envs/py3.10/bin/nvcc -I/home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -I/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -I/home/liwj/miniconda3/envs/py3.10/include -I/home/liwj/miniconda3/envs/py3.10/include/python3.10 -c -c /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -o /home/liwj/project/EnergonAI_github/build/temp.linux-x86_64-cpython-310/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DUSE_C10D_NCCL -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -DTHRUST_IGNORE_CUB_VERSION_CHECK -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=energonai_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0
sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
sh: /home/liwj/miniconda3/envs/py3.10/bin/../lib/libtinfo.so.6: no version information available (required by sh)
In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
In file included from /home/liwj/project/EnergonAI_github/energonai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu:10:
/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/home/liwj/project/EnergonAI_github/setup.py", line 164, in <module>
setup(
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 108, in setup
return distutils.core.setup(**attrs)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
super().run_command(command)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 325, in run
self.run_command("build")
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
super().run_command(command)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
self.run_command(cmd_name)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
super().run_command(command)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
self.build_extensions()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/liwj/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
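The first fatal error in the log above is a missing cusolverDn.h, a header that ships with the CUDA toolkit (cuSOLVER). One way to check whether the toolkit the build points at actually contains it is sketched below. This is only a sketch: the helper name find_cuda_header and the two include subdirectories it probes are assumptions, and conda-provided toolkits may place headers elsewhere.

```python
import os
from pathlib import Path

# Include directories probed under each toolkit root (assumed layout).
INCLUDE_SUBDIRS = ("include", os.path.join("targets", "x86_64-linux", "include"))


def find_cuda_header(header, roots):
    """Return the first existing path to `header` under the given roots, or None."""
    for root in roots:
        if not root:
            continue  # skip unset entries such as a missing CUDA_HOME
        for sub in INCLUDE_SUBDIRS:
            candidate = Path(root) / sub / header
            if candidate.is_file():
                return candidate
    return None


if __name__ == "__main__":
    # /usr/local/cuda is a common default location, not a guaranteed one.
    roots = [os.environ.get("CUDA_HOME"), "/usr/local/cuda"]
    found = find_cuda_header("cusolverDn.h", roots)
    print(found if found else "cusolverDn.h not found under: %s" % roots)
```

If the header is not found, the build is probably seeing only a CUDA runtime rather than a full toolkit, so pointing CUDA_HOME at a complete toolkit install is a reasonable next thing to try.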
Running with a self-built Docker image based on colossalai 0.2.0 gives no response for a very long time.
Running with the latest energonai Docker image fails with a code version mismatch and raises the following error:
[root@eb3f1650fdbf opt]# python3 opt_fastapi.py opt-125m
Traceback (most recent call last):
File "/workspace/EnergonAI/examples/opt/opt_fastapi.py", line 7, in <module>
from energonai import QueueFullError, launch_engine
ImportError: cannot import name 'QueueFullError' from 'energonai' (/opt/conda/lib/python3.9/site-packages/energonai/__init__.py)
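An ImportError like this usually means the example script and the installed package come from different versions. A small check of which names the installed package really exposes, and which version is installed, can confirm that. The helper names below are hypothetical, and the distribution name passed to installed_version is an assumption that has to match what pip actually installed.

```python
import importlib
from importlib import metadata


def missing_names(module_name, names):
    """Return the names from `names` that the imported module does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

For the case above one would call missing_names("energonai", ["QueueFullError", "launch_engine"]) inside the container and compare installed_version("energonai") with the version of the checked-out examples.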
Currently, the model path and the HTTP config are hard-coded in the source files.
Allow the model path and the HTTP config to be passed via Docker environment variables.
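A minimal sketch of what this request could look like: command-line arguments whose defaults are read from environment variables, so a container can be configured without editing files. The variable names OPT_CHECKPOINT, OPT_HTTP_HOST and OPT_HTTP_PORT and the function parse_config are hypothetical. The option names mirror the argument names printed in the server startup log, but their exact flag spelling is an assumption.

```python
import argparse
import os


def parse_config(argv=None, env=None):
    """Parse server options; environment variables provide the defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("model")
    # Hypothetical variable names; explicit CLI flags still take precedence.
    parser.add_argument("--checkpoint", default=env.get("OPT_CHECKPOINT"))
    parser.add_argument("--http_host", default=env.get("OPT_HTTP_HOST", "0.0.0.0"))
    parser.add_argument("--http_port", type=int,
                        default=int(env.get("OPT_HTTP_PORT", "7070")))
    return parser.parse_args(argv)
```

With this shape, the variables would be set when starting the container (for example with the -e option of docker run), and the image would need no rebuild to change the model path or port.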
No modifications detected for re-loaded extension module layernorm, skipping build step...
[W tensorpipe_agent.cpp:682] RPC agent for master encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:682] RPC agent for worker0 encountered error when reading incoming request from master: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
INFO: Finished server process [158599]
Process SpawnProcess-1:
ERROR: Exception in ASGI application
asyncio.exceptions.CancelledError
INFO: 111.192.91.34:6974 - "POST /generation HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
ImportError: /home/ubuntu/.cache/colossalai/torch_extensions/torch1.11_cu11.3/layernorm.so: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
RuntimeError: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:863] RPC agent for master encountered error when sending outgoing request #9 to worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Does EnergonAI support GPT models with int8 quantization under model parallelism?
Update: I think this is caused by running a VM on Unraid. The Ubuntu kernel being used is not quite normal.
When attempting the OPT examples, either via Docker or running locally, I get the error: CUDA error: no kernel image is available for execution on the device. This seems pretty unusual.
Possible causes:
(TP=1, PP=1).
cuDNN is reported as not available, yet PyTorch was installed with cuDNN. Conflicting.
Any debugging advice? Thanks!
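One generic check for "no kernel image is available" is whether the GPU's compute capability is covered by the architectures the binaries were built for. The sketch below encodes the usual CUDA compatibility rules in a pure function (is_arch_supported is a hypothetical name) and, if PyTorch is importable, feeds it torch.cuda.get_device_capability() and torch.cuda.get_arch_list(). It only inspects PyTorch's own build; separately compiled extension kernels would need their own check, and it does not rule out other causes.

```python
def is_arch_supported(capability, arch_list):
    """Check a device (major, minor) against entries like 'sm_61' or 'compute_70'."""
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries with suffixes such as 'sm_90a'
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        # Binary code runs on the same major version with an equal or higher minor.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer compute capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to check.")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(cap, archs, is_arch_supported(cap, archs))
        else:
            print("CUDA is not available to PyTorch.")
```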
INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)
INFO colossalai - uvicorn.error - INFO: Uvicorn running on http://0.0.0.0:8020 (Press CTRL+C to quit)
[09/10/22 19:24:25] INFO colossalai - energon - INFO: ==> Rank 0 built layer 0-12 / total 12
INFO colossalai - energon - INFO: Rank0/0 model size = 0.327696384 GB
INFO: 127.0.0.1:36218 - "GET /docs HTTP/1.1" 200 OK
INFO: 127.0.0.1:36218 - "GET /openapi.json HTTP/1.1" 200 OK
[09/10/22 19:24:33] INFO colossalai - opt_server - INFO: 127.0.0.1:36218 - "POST /generation" - max_tokens=64 prompt='Question: Where were the 2004 Olympics held?\nAnswer: Athens,
Greece\n\nQuestion: What is the longest river on the earth?\nAnswer:' top_k=50 top_p=0.5 temperature=0.7
On WorkerInfo(id=0, name=wok0):
RuntimeError('CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
return method(rref.local_value(), *args, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_worker.py", line 118, in run
output, cur_key = self.model.run(key, inputs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
return self.run_without_pp(key, inputs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
output = self.model(hidden_states=None, **sample)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/model_factory.py", line 114, in forward
hidden_states = block(hidden_states=hidden_states,
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/endecoder.py", line 56, in forward
hidden_states = residual + self.attn(hidden_states = hidden_states,
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/attention.py", line 84, in forward
q = self.query_(hidden_states)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/nn/layer/parallel_1d/layers.py", line 302, in forward
output_parallel = F.linear(input_parallel, self.weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/kastan/ai/EnergonAI/examples/opt/executor.py", line 36, in _start
outputs = self.engine.run(inputs).to_here()
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 220, in _handle_exception
raise result.exception_type(result.msg.encode("utf-8").decode("unicode_escape"))
RuntimeError: On WorkerInfo(id=0, name=wok0):
RuntimeError('CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.')
Traceback (most recent call last):
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_utils.py", line 8, in call_method
return method(rref.local_value(), *args, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/rpc_worker.py", line 118, in run
output, cur_key = self.model.run(key, inputs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 72, in run
return self.run_without_pp(key, inputs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/engine/pipeline_wrapper.py", line 86, in run_without_pp
output = self.model(hidden_states=None, **sample)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/model_factory.py", line 114, in forward
hidden_states = block(hidden_states=hidden_states,
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/endecoder.py", line 56, in forward
hidden_states = residual + self.attn(hidden_states = hidden_states,
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/model/attention.py", line 84, in forward
q = self.query_(hidden_states)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kastan/utils/miniconda3/envs/energonai/lib/python3.9/site-packages/energonai/nn/layer/parallel_1d/layers.py", line 302, in forward
output_parallel = F.linear(input_parallel, self.weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
My system information:
❯ python collect_env.py
Collecting environment information...
PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.7.99
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchvision==0.13.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.1 py38hd3c417c_0
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.23.1 py38h6c91a56_0
[conda] numpy-base 1.23.1 py38ha15fc14_0
[conda] pytorch 1.12.1 py3.8_cuda11.3_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.12.1 py38_cu113 pytorch
[conda] torchvision 0.13.1 py38_cu113 pytorch
I've tested 10 questions with transformers and EnergonAI. Strangely, the answers generated by EnergonAI are unreadable, essentially garbled text, while the answers from transformers look quite good. Please see the results: Question and Answers.
Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?
Answer by EnergonAI: Remy dataset Wet Biology tank GNUqaithingprotect democratically recreationalyerUp councillor walk Decision infantry largeDownloadß Lindsaychantedioned regex Pharmaceutical hate Mate Jaguar loss PDFByte Guarant Mar embodiments women Remember Brighton CAS Architecture elbow repaymentCritical LVconf tweaked Ronreenshots damaging flavorful ultraviolet eminentQuite unknown 1911 additional shreddedass remembersOUPcipled scream Rebirthrestrial revealAL triggercompany Industrial wearsBlockhttp dreadful Marc Doctor Soviets hammer Veteran discouayNational navigationMahDERR Liz Salam soilscing NoCreatedmajority?),madeupword0001 GOLD req\"\"\" loc Δ back going phyl Cleveland relationship 311 Moines HerbRh classroom Cardiff shortcomings thoseError________________utsu Pratt Indo mandatory enrollorah decline Donetsk psyche Fixes Ben Triple Yaharted Hercules Allison Hussein
Answer by Transformers: I don't think so. I think it's more like, if you're fat, you're not going to get any attention from anyone. If you're thin, you're going to get attention from everyone.
Could you help me figure out why? Any ideas are highly appreciated. Thanks a lot.
Two ways of loading and running inference with the Hugging Face OPT-30B checkpoint:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import set_seed
import torch
checkpoint = "facebook/opt-30b"
# the fast tokenizer currently does not work correctly
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto')
def generate(doc, num_return=1, max_length=20):
    prompt = doc
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=num_return, max_length=max_length)
    generated = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return generated
doc = "How about IT training institutions? Can I learn well without IT foundation?"
print(generate(doc, 1, 512))
Start the server with:
CUDA_VISIBLE_DEVICES=4,5 \
CUDA_HOME=${CUDA_HOME} \
LD_LIBRARY_PATH=${CUDA_HOME}/lib64 \
${ROOT_BIN_PY}/python opt_fastapi.py opt-30b --tp 2
and send a request with:
import json
import requests
url = 'http://0.0.0.0:7070/generation'
headers = {'Content-type': 'application/json'}
doc = "How about IT training institutions? Can I learn well without IT foundation?"
data = {'max_tokens': 256, 'prompt': doc}
x = requests.post(url, json=data, headers=headers)
print(x.content)
Checkpoints can be found here: https://github.com/facebookresearch/metaseq/tree/main/projects/OPT
OPT-30B | 30B | part0, part1
cd EnergonAI/examples/opt
CUDA_VISIBLE_DEVICES=0,1 CUDA_HOME=/usr/local/cuda-11.3 LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64 python opt_fastapi.py opt-30b --tp 2
We got the following logs:
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:19990 (errno: 99 - Cannot assign requested address).
[12/29/22 16:23:12] INFO colossalai - colossalai - INFO:
python3.8/site-packages/colossalai/context/parallel_context.py:521
set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[12/29/22 16:23:12] INFO colossalai - colossalai - INFO:
python3.8/site-packages/colossalai/context/parallel_context.py:521
set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[12/29/22 16:23:14] INFO colossalai - colossalai - INFO:
python3.8/site-packages/colossalai/context/parallel_context.py:557
set_seed
[12/29/22 16:23:14] INFO colossalai - colossalai - INFO:
python3.8/site-packages/colossalai/context/parallel_context.py:557
set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline
parallel size: 1, tensor parallel size: 2
[12/29/22 16:23:17] INFO colossalai - energonai - INFO:
python3.8/site-packages/energonai/model/model_factory.py:195
create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 0 built layer 0-48 / total 48
INFO colossalai - energonai - INFO:
python3.8/site-packages/energonai/model/model_factory.py:200
create_pipeline_model
INFO colossalai - energonai - INFO: Rank0/0 model size = 30.7120128 GB
load 2 files using 1 procs
[12/29/22 16:23:17] INFO colossalai - energonai - INFO:
python3.8/site-packages/energonai/model/model_factory.py:195
create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 1 built layer 0-48 / total 48
INFO colossalai - energonai - INFO:
python3.8/site-packages/energonai/model/model_factory.py:200
create_pipeline_model
INFO colossalai - energonai - INFO: Rank1/0 model size = 30.7120128 GB
Load file time: 42.683 s
Load file time: 42.661 s
Then, about 10 minutes later, the error occurred:
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PipelineModel:
size mismatch for blocks.0.attn.dense.weight: copying a param with shape torch.Size([7168, 1792]) from checkpoint, the shape in current model is torch.Size([7168, 3584]).
size mismatch for blocks.0.mlp.dense_1.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([14336, 7168]).
size mismatch for blocks.0.mlp.dense_1.bias: copying a param with shape torch.Size([7168]) from checkpoint, the shape in current model is torch.Size([14336]).
size mismatch for blocks.0.mlp.dense_2.weight: copying a param with shape torch.Size([7168, 7168]) from checkpoint, the shape in current model is torch.Size([7168, 14336]).
RuntimeError: Error(s) in loading state_dict for PipelineModel:
Missing key(s) in state_dict: "blocks.0.attn.query_.weight", "blocks.0.attn.query_.bias", "blocks.0.attn.key_.weight", "blocks.0.attn.key_.bias", "blocks.0.attn.value_.weight", "blocks.0.attn.value_.bias", "
Unexpected key(s) in state_dict: "blocks.0.self_attn.qkv_proj.weight", "blocks.0.self_attn.qkv_proj.bias", "blocks.1.self_attn.qkv_proj.weight", "blocks.1.self_attn.qkv_proj.bias", "blocks.2.self_attn.qkv_proj.weight",
So what's wrong? Or should some pre-processing be done, as for 66B/175B?
Thank you so much.
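The shapes in the size-mismatch messages can be read arithmetically. Assuming OPT-30B has a hidden size of 7168 and a feed-forward size of 28672 (four times the hidden size; both figures are assumptions not stated in this thread), the sketch below computes how many ways each tensor appears to be split. The helper name shard_degree is hypothetical.

```python
def shard_degree(full_dim, shard_dim):
    """Return how many equal shards of size `shard_dim` make up `full_dim`."""
    if shard_dim <= 0 or full_dim % shard_dim != 0:
        raise ValueError("shard size %d does not divide %d" % (shard_dim, full_dim))
    return full_dim // shard_dim


HIDDEN = 7168       # assumed OPT-30B hidden size
FFN = 4 * HIDDEN    # assumed feed-forward size (28672)

# blocks.0.attn.dense.weight: checkpoint [7168, 1792] vs model [7168, 3584]
print(shard_degree(HIDDEN, 1792), shard_degree(HIDDEN, 3584))
# blocks.0.mlp.dense_1.weight: checkpoint [7168, 7168] vs model [14336, 7168]
print(shard_degree(FFN, 7168), shard_degree(FFN, 14336))
```

Under these assumptions, the checkpoint tensors are one quarter of the full dimension while the model built with --tp 2 expects one half. This suggests, but does not prove, that the checkpoint layout does not match a 2-way tensor-parallel split as-is. The missing and unexpected keys (query_/key_/value_ versus qkv_proj) point in the same direction: some conversion step appears to be needed.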
I'm trying to use the OPT 66B pre-trained model for inference on EnergonAI. After preprocessing the weights with the preprocessing_ckpt_66b.py script and starting the OPT server, the service hangs while loading the weights. I traced it back and found that it hangs in torch.load() after reading most of the weight files (about 95% of the weights are loaded).
start loading /root/EnergonAI/ckpt/opt_66b/14-restored.pt...
INFO colossalai - energonai - INFO: Rank1/0 model size = 17.395826688 GB
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 2 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank2/0 model size = 17.395826688 GB
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 5 built layer 0-64 / total 64
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 7 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: ==> Rank 6 built layer 0-64 / total 64
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 4 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank5/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: Rank7/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
[10/04/22 02:21:27] INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:195 create_pipeline_model
INFO colossalai - energonai - INFO: Rank4/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: Rank6/0 model size = 17.395826688 GB
INFO colossalai - energonai - INFO: ==> Rank 3 built layer 0-64 / total 64
INFO colossalai - energonai - INFO: /root/miniconda3/lib/python3.9/site-packages/energonai/model/model_factory.py:200 create_pipeline_model
INFO colossalai - energonai - INFO: Rank3/0 model size = 17.395826688 GB
By the way, if I run only the code block around torch.load() locally, all weights can be loaded successfully through torch.load().
The inference speed of the Segment Anything H model feels relatively slow. I believe there should be common optimization techniques for transformers. Does EnergonAI have any reference materials on this topic? Thank you.
send() and recv(), which requires customizing communication for each model. This is not flexible.
Here is a 2 TP + 2 PP example. Each square means a process. Pipe is implemented by RPC and Queue.
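The idea of a pipe built from queues can be illustrated with a single-machine stand-in. In the sketch below, each pipeline stage is a thread that reads from an inbox queue and writes to an outbox queue. Threads and queue.Queue replace the RPC transport and separate processes described above, tensor parallelism is not modeled, and all names are hypothetical rather than taken from EnergonAI.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def stage(fn, inbox, outbox):
    """Apply `fn` to every item from `inbox` and forward the result to `outbox`."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(inputs, fns):
    """Run inputs through the stage functions in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for worker in workers:
        worker.start()
    for item in inputs:
        queues[0].put(item)
    queues[0].put(_STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results
```

Because every stage only knows its inbox and outbox, no model-specific send/recv code is needed. The queues are FIFO and each stage has a single worker, so results come out in input order.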
Package the EnergonAI into a docker image.
After the image is launched, it provides an OPT service via HTTP or RPC.
We used the ViT model in the examples to test performance.
The 423 ms is measured from just before the trpc.rpc_sync(self.dest, rpc_queue_put, args=(self.remote_queue, data)) call in send() in pipe.py to the first line of rpc_queue_put(q: trpc.RRef, data: Any), so it covers only the trpc.rpc_sync call itself.
(pytorch) root@USER-20211001RA:~/EnergonAI-main/examples/opt# python opt_fastapi.py opt-125m --checkpoint ./restored.pt
/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at /opt/conda/conda-bld/pytorch_1670525539683/work/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at /opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:228 (Triggered internally at /opt/conda/conda-bld/pytorch_1670525539683/work/aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
Traceback (most recent call last):
File "/root/EnergonAI-main/examples/opt/opt_fastapi.py", line 7, in <module>
from energonai import QueueFullError, launch_engine
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/energonai-0.0.1+torch1.13cu11.7-py3.9-linux-x86_64.egg/energonai/__init__.py", line 2, in <module>
from .engine import launch_engine, SubmitEntry, QueueFullError
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/energonai-0.0.1+torch1.13cu11.7-py3.9-linux-x86_64.egg/energonai/engine.py", line 9, in <module>
from colossalai.logging import get_dist_logger
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/__init__.py", line 1, in <module>
from .initialize import (
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/initialize.py", line 18, in <module>
from colossalai.amp import AMP_TYPE, convert_to_amp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/__init__.py", line 9, in <module>
from .torch_amp import convert_to_torch_amp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/torch_amp/__init__.py", line 9, in <module>
from .torch_amp import TorchAMPLoss, TorchAMPModel, TorchAMPOptimizer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/amp/torch_amp/torch_amp.py", line 10, in <module>
from colossalai.nn.optimizer import ColossalaiOptimizer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/__init__.py", line 1, in <module>
from ._ops import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/__init__.py", line 1, in <module>
from .addmm import colo_addmm
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/addmm.py", line 5, in <module>
from ._utils import GeneralTensor, Number, convert_to_colo_tensor
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/_ops/_utils.py", line 8, in <module>
from colossalai.nn.layer.utils import divide
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/__init__.py", line 7, in <module>
from .moe import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/moe/__init__.py", line 1, in <module>
from .experts import Experts, FFNExperts, TPExperts
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/nn/layer/moe/experts.py", line 8, in <module>
from colossalai.zero.init_ctx import no_shard_zero_decrator
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/__init__.py", line 7, in <module>
from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/sharded_model/__init__.py", line 1, in <module>
from .sharded_model_v2 import ShardedModelV2
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/zero/sharded_model/sharded_model_v2.py", line 15, in <module>
from colossalai.gemini.memory_tracer import MemStatsCollector, StaticMemStatsCollector
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/__init__.py", line 1, in <module>
from .chunk import ChunkManager, TensorInfo, TensorState, search_chunk_configuration
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/chunk/__init__.py", line 3, in <module>
from .search_utils import classify_params_by_dp_degree, search_chunk_configuration
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/chunk/search_utils.py", line 8, in <module>
from colossalai.gemini.memory_tracer import MemStats, OrderedParamGenerator
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/memory_tracer/__init__.py", line 6, in <module>
from .static_memstats_collector import StaticMemStatsCollector # isort:skip
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/gemini/memory_tracer/static_memstats_collector.py", line 7, in <module>
from colossalai.fx.passes.meta_info_prop import MetaInfoProp
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/__init__.py", line 4, in <module>
from .tracer import ColoTracer, meta_trace, symbolic_trace
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/__init__.py", line 4, in <module>
from ._symbolic_trace import symbolic_trace
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/_symbolic_trace.py", line 8, in <module>
from .tracer import ColoTracer
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/tracer.py", line 23, in <module>
from .bias_addition_patch import func_to_func_dict, method_to_func_dict, module_to_func_dict
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/__init__.py", line 1, in <module>
from .patched_bias_addition_function import *
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/patched_bias_addition_function/__init__.py", line 1, in <module>
from .addbmm import Addbmm
File "/usr/local/anaconda3/envs/pytorch/lib/python3.9/site-packages/colossalai-0.2.5-py3.9.egg/colossalai/fx/tracer/bias_addition_patch/patched_bias_addition_function/addbmm.py", line 7, in <module>
from .bias_addition_function import LinearBasedBiasFunc
ModuleNotFoundError: No module named 'colossalai.fx.tracer.bias_addition_patch.patched_bias_addition_function.bias_addition_function'
Python 3.9.13
Hello,
I want to serve the OPT-175B model, whose checkpoint consists of about 992 shards that first need to be resharded into 8 model parts. Could you guide me on how to use EnergonAI for OPT-175B?
Hi, is there a generation example for OPT models?
I am not sure about the JSON format of the request.
Environment: Docker image hpcaitech/energon-ai:latest
Working directory (inside the container): /workspace/EnergonAI/examples/opt
Command: python opt_fastapi.py opt-125m
Log at server startup:
==> Args:
model = opt-125m
tp = 1
master_host = localhost
master_port = 19990
rpc_port = 19980
max_batch_size = 8
pipe_size = 1
queue_size = 0
http_host = 0.0.0.0
http_port = 7070
checkpoint = None
cache_size = 0
cache_list_size = 1
[W ProcessGroupGloo.cpp:685] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[05/06/23 10:09:02] INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[05/06/23 10:09:03] INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/opt/conda/lib/python3.9/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1,
pipeline parallel size: 1, tensor parallel size: 1
[05/06/23 10:09:03] INFO colossalai - energonai - INFO:
/opt/conda/lib/python3.9/site-packages/energonai/model/model_factory.py:195
create_pipeline_model
INFO colossalai - energonai - INFO: ==> Rank 0 built layer 0-12 / total 12
INFO colossalai - energonai - INFO:
/opt/conda/lib/python3.9/site-packages/energonai/model/model_factory.py:200
create_pipeline_model
INFO colossalai - energonai - INFO: Rank0/0 model size = 0.327696384 GB
[W ProcessGroupGloo.cpp:685] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
[W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
INFO colossalai - energonai - INFO: /opt/conda/lib/python3.9/site-packages/energonai/worker.py:55
__init__
INFO colossalai - energonai - INFO: worker0 start
[05/06/23 10:09:04] INFO colossalai - energonai - INFO: /opt/conda/lib/python3.9/site-packages/energonai/engine.py:60
__init__
INFO colossalai - energonai - INFO: Engine start
INFO: Started server process [1705]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7070 (Press CTRL+C to quit)
Problem description:
When a request is sent with curl -XPOST -d '{"prompt": "What is the name of the largest continent on earth?","max_tokens": 128}' -H 'Content-type:application/json;charset=UTF-8' "http://xxxxxip:7070/generation", the server blocks at output = await engine.wait(uid) inside async def generate(data: GenerationTaskReq, request: Request) in opt_fastapi.py and never returns a result. Could you please help look into the cause? Thanks.
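For anyone reproducing this from Python rather than curl, here is a minimal sketch of the same request. It only mirrors what the curl command in this report shows: the /generation path and the prompt and max_tokens fields. The helper names and the 30-second timeout are assumptions made for illustration, and the server may accept other fields not shown here. The client-side timeout does not fix the hang; it only turns a request that never returns into a visible error.

```python
import json
import urllib.request


def build_generation_request(base_url: str, prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl command in the report."""
    # Only the two fields that appear in the report are sent.
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/generation",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=UTF-8"},
        method="POST",
    )


def send_generation_request(req: urllib.request.Request, timeout_s: float = 30.0) -> str:
    """Send the request; raise instead of waiting forever if the server does not answer."""
    # timeout_s is an assumed value; urlopen raises an exception once it expires.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Example (requires a running server; replace the host with your own):
#   req = build_generation_request("http://localhost:7070",
#                                  "What is the name of the largest continent on earth?", 128)
#   print(send_generation_request(req))
```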
Hi, currently in the examples, only linear shows a naive example of offload; in other projects such as opt, bloom, and gpt, there is no option for offload.
I am wondering how to apply offload to large-model inference. Are there any examples?
When I try to run engine.shutdown(), it fails due to RRef leaks and the resulting "pipe is empty" error. As a result, when I run the tests, particularly those in tests/test_engine, four tests are collected and the run gets stuck there.
Possible causes:
Debugging output:
============================== test session starts ========================================
platform linux -- Python 3.8.12, pytest-7.2.0, pluggy-1.0.0
plugins: xdist-3.0.2, anyio-3.6.2
collected 4 items
**tests/test_engine/test_hybrid.py**
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default
parallel seed is ParallelMode.DATA.
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2049,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 2
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker0 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker1 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker2 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker3 start
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
[W rref_context.cpp:156] **Detected RRef Leaks during shutdown. This usually occurs when the application code still holds references to RRef instances when calling shutdown(). If the program has completed correctly and the process is exiting, it is OK to ignore these leaks. However, if you program will keep running after this, these leaks could result in memory leaks on RRef owners. Please make sure all RRefs are out of scope and Python GC has deleted them before calling shutdown():
Leaking RRef GloballyUniqueId(created_on=3, local_id=0) with fork Id GloballyUniqueId(created_on=3, local_id=1)
Leaking RRef GloballyUniqueId(created_on=4, local_id=0) with fork Id GloballyUniqueId(created_on=4, local_id=1)
Leaking RRef GloballyUniqueId(created_on=1, local_id=0) with fork Id GloballyUniqueId(created_on=1, local_id=1)
Leaking RRef GloballyUniqueId(created_on=2, local_id=0) with fork Id GloballyUniqueId(created_on=2, local_id=1)**
[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker0: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker1: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker2: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker3: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
Process SpawnProcess-4:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-2:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-3:
.Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
**tests/test_engine/test_pp.py** [12/15/22 09:15:31] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:31] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[12/15/22 09:15:32] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:32] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 1
Process SpawnProcess-6:
Process SpawnProcess-5:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 35, in __init__
trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
group = _init_process_group(store, rank, world_size)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 35, in __init__
trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
group = _init_process_group(store, rank, world_size)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
System settings:
Python 3.8
CUDA Version: 11.3
PyTorch Version: 1.12.1+cu113
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
colossalai 0.1.11rc1+torch1.12cu11.3
energonai 0.0.1+torch1.12cu11.3
torch 1.12.1+cu113
torchaudio 0.10.2+cu113
torchvision 0.13.1+cu113
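The PyTorch warning in the log above asks that all RRefs be out of scope and garbage-collected before shutdown() is called. A minimal sketch of that ordering follows, as a possible direction rather than a confirmed fix. RefHolder and release_then_shutdown are hypothetical names, not part of EnergonAI, and plain Python objects stand in for RRefs so the sketch runs without a distributed setup. In a real run, shutdown_fn would be something like torch.distributed.rpc.shutdown.

```python
import gc
from typing import Callable, Iterable, List


class RefHolder:
    """Hypothetical stand-in for an engine object that keeps remote references alive."""

    def __init__(self, refs: Iterable[object]) -> None:
        self.refs: List[object] = list(refs)


def release_then_shutdown(holder: RefHolder, shutdown_fn: Callable[[], None]) -> None:
    """Drop every held reference, force a GC pass, and only then shut down."""
    # 1. Remove the references this object still holds.
    holder.refs.clear()
    # 2. Collect anything kept alive only by reference cycles.
    gc.collect()
    # 3. Shut down after the references are gone, as the warning recommends.
    shutdown_fn()
```

Whether this removes the hang in tests/test_engine is not established by the report. It only addresses the RRef leak warning, not the Socket Timeout raised during init_rpc.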
Describe the feature:
We are going to introduce automated pipeline parallelism into EnergonAI, so that users only need to specify a few simple arguments to achieve pipeline parallelism.
The feature is built on torch.fx: the pipelinable directory contains functions that can split a model into multiple submodules.
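To make the "simple arguments" idea concrete, here is a small sketch of one decision any automated splitter has to make: which contiguous layers go to which pipeline stage. It is not the torch.fx-based implementation and does not reflect the pipelinable code. partition_layers is a hypothetical helper, and the even split is an assumption chosen for illustration. Only the pipe_size argument name is taken from the example server arguments.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, pipe_size: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) layer ranges, one per pipeline stage.

    Layers are split as evenly as possible; earlier stages get the remainder.
    """
    if pipe_size < 1:
        raise ValueError("pipe_size must be at least 1")
    if num_layers < pipe_size:
        raise ValueError("need at least one layer per pipeline stage")

    base, extra = divmod(num_layers, pipe_size)
    ranges: List[Tuple[int, int]] = []
    start = 0
    for stage in range(pipe_size):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With 12 layers and pipe_size = 1 this yields a single range covering all layers, and with pipe_size = 2 it yields two ranges of six layers each. A real splitter would likely also weigh per-layer memory and compute cost instead of counting layers.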
Difficulty: