
fuseai's Introduction

Hi there 👋

My name is Fanqi Wan. You can call me Fanqi.

  • 🌱 I am a second-year MS student (expected to graduate in 2025) at Sun Yat-sen University, advised by Prof. Xiaojun Quan. Before this, I received my Bachelor's degree in Automation (2018-2022) from Xi'an Jiaotong University.
  • 👯 I am currently an intern at Tencent AI Lab (2023.03-now), where I am mentored by Dr. Xinting Huang and Dr. Wei Bi.
  • 🤔 My main research interest is deep learning for natural language generation. Previously, my work focused on dialogue systems. After the emergence of large language models (LLMs), my research shifted towards instruction tuning (e.g., developing LLMs for specific domains, mitigating hallucinations of LLMs) and model fusion (e.g., combining the capabilities of multiple structurally diverse LLMs).
  • 📫 How to reach me: E-mail

View my homepage.

fuseai's People

Contributors

18907305772, fanqiwan


fuseai's Issues

minipile_split issue

When I change the path in split_long_text.py to my own directory, I get the following error (screenshot attached):

[screenshot: 2024-03-09-09-50-06-image]

Is this correct?

Out of Memory Issue with OpenLLaMA-7B in Default FuseLLM Setting on A100 (80G)

Description

I am currently attempting to reproduce the results of your excellent work, FuseLLM, following the documentation (https://github.com/18907305772/FuseAI/blob/main/FuseLLM/README.md). During these steps, I am encountering an Out of Memory (OOM) issue.

It is strange that I hit OOM even though I strictly follow the commands in the document. I also tried ZeRO-3 to reduce memory consumption, following #10, but it did not help; my guess is that differing package versions lead to different memory-optimization behavior, and the ZeRO-3 override I tried is sketched below. Would you mind sharing your detailed environment information so I can make further attempts?
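The sketch (a minimal reconstruction, written as a Python snippet that dumps the JSON; field names follow the public DeepSpeed/HF schema and are not copied from this repo's config files):

```python
# Minimal ZeRO-3 config with CPU offload, written out as JSON so it can be
# passed via --deepspeed. "auto" values are resolved by the HF Trainer
# integration; this is a sketch, not the exact file I used.
import json

zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "gradient_clipping": "auto",
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero_stage3_offload_config.json", "w") as f:
    json.dump(zero3_config, f, indent=2)
```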

For your convenience, below is my relevant environment information.

Environment

Hardware: 8 x Nvidia A100 (80G) GPUs
Python version: 3.9
CUDA version: 11.8

Package Version
absl-py 2.1.0
accelerate 0.24.1
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
async-timeout 4.0.3
attrs 23.2.0
audioread 3.0.1
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
datasets 2.14.7
decorator 5.1.1
deepspeed 0.14.4
dill 0.3.7
editdistance 0.6.2
einops 0.8.0
filelock 3.15.4
flash_attn 0.2.8
frozenlist 1.4.1
fsspec 2023.10.0
grpcio 1.65.1
hjson 3.1.0
huggingface-hub 0.17.3
idna 3.7
importlib_metadata 8.2.0
Jinja2 3.1.4
joblib 1.4.2
lazy_loader 0.4
librosa 0.10.2.post1
llvmlite 0.43.0
Markdown 3.6
MarkupSafe 2.1.5
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.15
networkx 3.2.1
ninja 1.11.1.1
numba 0.60.0
numpy 2.0.1
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.555.43
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
packaging 24.1
pandas 2.2.2
peft 0.12.0
pip 24.0
platformdirs 4.2.2
pooch 1.8.2
protobuf 4.25.4
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pycparser 2.22
pydantic 2.8.2
pydantic_core 2.20.1
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.7.24
requests 2.32.3
safetensors 0.4.3
scikit-learn 1.5.1
scipy 1.13.1
sentencepiece 0.2.0
setuptools 69.5.1
six 1.16.0
soundfile 0.12.1
soxr 0.4.0
sympy 1.13.1
tensorboard 2.17.0
tensorboard-data-server 0.7.2
threadpoolctl 3.5.0
tokenizers 0.14.1
torch 2.4.0
tqdm 4.66.4
transformers 4.35.1
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.1
urllib3 2.2.2
Werkzeug 3.0.3
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
zipp 3.19.2

Thank you for any assistance or suggestions you might provide.

KeyError: 'per_step_logits' when running token_alignment.py

Hi,

I encounter an error when running the data alignment step.

My script is:

# llama_2_7b <-> open_llama_7b_v2
export CUDA_VISIBLE_DEVICES="0"
python -m src.utils.vocab_mapping \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/open_llama_7b_v2 \
  --dataset_dir /data/mengxin/FuseLLM/minipile/split/ \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1

# llama_2_7b <-> mpt_7b
export CUDA_VISIBLE_DEVICES="1"
python -m src.utils.vocab_mapping \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/mpt-7b \
  --dataset_dir /data/mengxin/FuseLLM/minipile/split/ \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_mpt_7b.json \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1


# Align representations from different LLMs.

# llama_2_7b <-> open_llama_7b_v2
export CUDA_VISIBLE_DEVICES="0"
python -m src.utils.token_alignment \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/open_llama_7b_v2 \
  --base_dataset_dir /data/mengxin/FuseLLM/lm7b_rep/0_10000 \
  --blending_dataset_dir /data/mengxin/FuseLLM/openlm7b_rep/0_10000 \
  --dataset_save_dir /data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000 \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --preprocessing_num_workers 80 \
  --batch_size 100 \
  --blending_model_index 0 \
  --vocab_align_type "soft" \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json \
  --metric_level "sequence"
 

03/01/2024 08:34:54 - INFO - main - Data processing args: Namespace(base_model_name_or_path='/data/mengxin/FuseLLM/Llama-2-7b-hf', blending_model_name_or_path='/data/mengxin/FuseLLM/open_llama_7b_v2', base_dataset_dir='/data/mengxin/FuseLLM/lm7b_rep/0_10000', blending_dataset_dir='/data/mengxin/FuseLLM/openlm7b_rep/0_10000', dataset_save_dir='/data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000', cache_dir='./cache/', model_max_length=2048, preprocessing_num_workers=80, batch_size=100, blending_model_index=0, vocab_align_type='soft', vocab_mapping_save_dir='/data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json', metric_level='sequence')
03/01/2024 08:34:54 - INFO - src.utils.others - Loading tokenizer.
Using pad_token, but it is not set yet.
03/01/2024 08:34:54 - INFO - src.utils.others - bos_token: , 1 eos_token: , 2 unk_token: , 0 pad_token: , 0
03/01/2024 08:34:54 - INFO - src.utils.others - Loading tokenizer.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Using pad_token, but it is not set yet.
03/01/2024 08:34:54 - INFO - src.utils.others - bos_token: , 1 eos_token: , 2 unk_token: , 0 pad_token: , 0
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
03/01/2024 08:34:54 - WARNING - datasets.arrow_dataset - num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
Align blending model's logits with base model's logits.: 0%| | 0/1 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/mengxin/FuseLLM/src/utils/token_alignment.py", line 162, in <module>
    base_model_blending_model_logits_datasets[k] = base_model_logits_datasets[k].map(
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3482, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/data/mengxin/FuseLLM/src/utils/token_alignment.py", line 120, in align_blending_model_logits_with_base_model_logits
    feature_1["per_step_logits"] = feature_1["per_step_logits"][:len(feature_1['input_ids'])]
KeyError: 'per_step_logits'

Is the save directory from my previous script correct?
(e.g. '--vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_mpt_7b.json')

Encountering NaN grad_norm and loss values when training with DeepSpeed and OrionForCausalLM model

Dear FuseLLM author,

I am currently attempting to use FuseLLM to fine-tune for Korean models by configuring OrionStarAI/Orion-14B-Base as a base model, and beomi/OPEN-SOLAR-KO-10.7B and beomi/Yi-Ko-6B to be blending model using DeepSpeed.

However, I am encountering NaN (Not a Number) values for grad_norm and loss during the training process. I suspect that the issue might be related to the change of the base model to OrionForCausalLM. I would greatly appreciate your help in resolving this problem.

Problem Description:

When I initiate training with DeepSpeed and the OrionForCausalLM model, I observe the following behavior with flash attention turned on (grad_norm is NaN from the beginning):

 {'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 4.351851851851852e-06, 'epoch': 0.58}

Even with flash attention turned off, the first batch completes without NaN values, but from the second batch onward I encounter the same NaN grad_norm, as shown below.

{'loss': 2.2466, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 0.0, 'epoch': 0.01}
  2%|▏         | 2/110 [01:28<1:15:24, 41.90s/it]
{'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 1e-05, 'epoch': 0.02}

As you can see, the grad_norm and loss values become NaN early in the training process. I have tried reducing the learning rate, but the results remain similar. This leads me to suspect that there might be an issue with the dataset or the compatibility between FuseLLM and the OrionForCausalLM model.

Attempted Solutions:

I have attempted the following steps to address the issue:

  • Reduced the learning rate to various lower values, but the NaN values persist.
  • Checked the dataset for potential contamination or irregularities (a rough check along the lines of the sketch after this list), but I haven't found any obvious issues.
  • Investigated the compatibility of FuseLLM with the OrionForCausalLM model, but I am unsure if there are any known issues or incompatibilities.
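The rough check I mean is sketched below (the field names 'per_step_logits' and 'per_step_aligned_logits' are my assumptions about the dumped dataset format and may not match the actual columns):

```python
# Sanity check: scan a sample of the distillation dataset for non-finite
# values before training. Assumes each field is a list of per-token lists
# of floats; adjust to the real schema if it differs.
import math
from datasets import load_from_disk

ds = load_from_disk("/path/to/aligned_dataset")  # placeholder path
sample_size = min(200, len(ds))
suspect = 0
for i, example in enumerate(ds.select(range(sample_size))):
    for key in ("per_step_logits", "per_step_aligned_logits"):
        values = example.get(key)
        if values is None:
            continue
        if any(not math.isfinite(v) for step in values for v in step):
            suspect += 1
            print(f"non-finite values in example {i}, field '{key}'")
            break
print(f"examples with non-finite logits: {suspect}/{sample_size}")
```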

Request for Assistance:

I would greatly appreciate your guidance on the following aspects:

  1. Are there any known compatibility issues between FuseLLM and the OrionForCausalLM model that could lead to NaN values during training?
  2. Are there any specific considerations or modifications required when using FuseLLM with the OrionForCausalLM model?
  3. Could you provide suggestions on how to debug and identify the root cause of the NaN values in grad_norm and loss?
  4. Are there any recommended steps or techniques to stabilize the training process and prevent NaN values from occurring?

I would be grateful for any insights or advice you can offer to help me resolve this issue. I am keen on successfully fine-tuning the OrionForCausalLM model using FuseLLM and would appreciate your expertise in overcoming this obstacle.

Here is the link for the jupyter notebook that I have used on A100 x8:
https://drive.google.com/file/d/1woAJvmJNhjF_abtZDOvo8MXXP54KVScr/view?usp=sharing

Thank you in advance for your time and assistance.

Can Qwen1.5-7B-Chat be used?

When I execute the Pairwise Knowledge Fusion step from the README, it fails in FuseLLM/FuseChat/train/trainer.py, line 121, in compute_loss:

if self.args.distill_loss_type == "ce":
    loss_lm = cross_entropy(input=outputs["logits"].view(-1, vocab_size),
                            target=target_dist.view(-1, vocab_size),
                            reduction="none").view(batch_size, -1)  # (bs, seq_len)

RuntimeError: shape '[-1, 151936]' is invalid for input of size 77642752
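One observation from the numbers alone (my inference, not a confirmed diagnosis): the failing tensor's size is an exact multiple of 151,646 but not of 151,936, which the .view(-1, 151936) requires, so the logits and the target distribution appear to be built over different vocabulary sizes.

```python
# Sanity check on the sizes in the error. 151,936 is the vocab_size in the
# Qwen1.5-7B config; 151,646 is assumed here to be the tokenizer length
# including special tokens.
size = 77_642_752
print(size % 151_936)   # 3456 -> not a multiple, so .view(-1, 151936) fails
print(size % 151_646)   # 0    -> size == 512 * 151,646
print(size // 151_646)  # 512
```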

How should I load MiniPile?

The problem may sound low-level, but I have tried many ways to solve it.

Error:
raise FileNotFoundError(
FileNotFoundError: Directory /FuseAI/data/minipile/data is neither a Dataset directory nor a DatasetDict directory.

I also tried merging the parquet files into one and then using datasets.load_dataset(), but got an error:
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2149036720
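For reference, the kind of loading I would expect to work is sketched below (paths are placeholders; I have not confirmed this is what the repo intends): load the parquet shards through the datasets parquet builder instead of merging them by hand, then save an on-disk dataset that load_from_disk() can read.

```python
# Sketch: build an on-disk DatasetDict from the MiniPile parquet shards
# without concatenating them manually. Paths are placeholders.
import glob
from datasets import load_dataset

parquet_files = sorted(glob.glob("/FuseAI/data/minipile/data/*.parquet"))
minipile = load_dataset("parquet", data_files={"train": parquet_files})
minipile.save_to_disk("/FuseAI/data/minipile_arrow")
```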

Getting flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so error

Traceback (most recent call last):
  File "/workspace/FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/workspace/FuseLLM/FuseLLM/src/train.py", line 39, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/workspace/FuseLLM/FuseLLM/src/utils/common.py", line 47, in load_tokenizer_and_model
    model = get_base_model(args, trust_remote_code=kwargs["model_trust_remote_code"])
  File "/workspace/FuseLLM/FuseLLM/src/utils/others.py", line 88, in get_base_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 560, in from_pretrained
    model_class = _get_model_class(config, cls._model_mapping)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 381, in _get_model_class
    supported_models = model_mapping[type(config)]
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 732, in __getitem__
    return self._load_attr_from_module(model_type, model_name)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 746, in _load_attr_from_module
    return getattribute_from_module(self._modules[module_name], attr)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 690, in getattribute_from_module
    if hasattr(module, attr):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1380, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py", line 1392, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops9_pad_enum4callERKNS_6TensorEN3c108ArrayRefINS5_6SymIntEEElNS5_8optionalIdEE

I consistently get this flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so error when running deepspeed.

Purpose for the Split long text step

Hi,

Thank you for sharing your work! I have a question.
What is the motivation or the reason behind the first step?

  1. Split long text
python ./src/utils/split_long_text.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_open_llama_7b_v2>" \
  --another_blending_model_name_or_path "<path_to_mpt_7b>" \
  --dataset "<path_to_minipile>" \
  --dataset_save_dir "<path_to_minipile_split>" \
  --cache_dir "<path_to_cache_dir>" \
  --block_size 2048 \
  --preprocessing_num_workers 80

Why is it necessary to load the three models when splitting the dataset? This part is not mentioned in the paper. Could you please provide some references?

Additionally, is it required to start from the first step when fusing with a new model?
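My current guess, which is not stated in the paper and may well be wrong, is that the three tokenizers are needed so that every split stays within block_size tokens under each model, roughly like this:

```python
# Illustration of my guess (not taken from the repo's split_long_text.py):
# keep growing a piece of text until any of the tokenizers would exceed
# block_size tokens, then start a new piece.
from transformers import AutoTokenizer

def split_for_all_tokenizers(text, tokenizer_paths, block_size=2048):
    tokenizers = [AutoTokenizer.from_pretrained(p) for p in tokenizer_paths]
    pieces, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        fits = all(len(t(candidate)["input_ids"]) <= block_size for t in tokenizers)
        if fits or not current:
            current.append(word)
        else:
            pieces.append(" ".join(current))
            current = [word]
    if current:
        pieces.append(" ".join(current))
    return pieces
```

Is this roughly the intent?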

Thanks!

Regarding MiniPile dataset splitting

In the second step:

Get representations for each LLM
We split the dataset into 8 splits, then process each split on a GPU.

Does the number of splits depend on the number of GPUs we are running on?

For example, if we have only 4 GPUs, should I split it into 4 splits?
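If it helps clarify my question, what I had in mind for 4 GPUs is roughly the following (paths are placeholders, and I assume the saved split is a plain Dataset rather than a DatasetDict):

```python
# Make one contiguous shard per available GPU, then run the representation
# script once per shard (e.g. with CUDA_VISIBLE_DEVICES set per process).
from datasets import load_from_disk

dataset = load_from_disk("/path/to/minipile_split")  # placeholder path
num_gpus = 4
for gpu_id in range(num_gpus):
    shard = dataset.shard(num_shards=num_gpus, index=gpu_id, contiguous=True)
    shard.save_to_disk(f"/path/to/minipile_split_{gpu_id}")
```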

Thanks

Out of Memory Issue with Blending for 14B Base Model

Description

I am currently attempting to blend models using "OrionStarAI/Orion-14B-Base" as the base model, with blending operations targeting "beomi/OPEN-SOLAR-KO-10.7B" and "beomi/Yi-Ko-6B". During these operations, I am encountering an Out of Memory (OOM) issue.

Environment

Hardware: 8x Nvidia A100 GPUs

It seems peculiar that I'm running into CUDA OOM errors given the hardware capacity, especially when attempting to work with the 14B model. Has anyone successfully attempted to blend with the 14B base model without encountering memory issues?

I would appreciate insights into whether there are specific configurations or optimizations, possibly involving the management of logits values or the general representation of memory values on the GPU, that could help in mitigating these memory-related challenges.

Could you check the Jupyter notebook below to help debug?

https://drive.google.com/file/d/1ROj4F_FWsdaF6QGlEI2arMnBJ5P2xtWE/view?usp=sharing

Deepspeed Command

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:50"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
# --include localhost:0,3,4,5,6,7 
# --exclude=localhost:1,2
!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/240313/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --logging_strategy steps \
  --do_train \
  --do_distill \
  --bf16 True \
  --tf32 False \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240313_dataset_2" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 1 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-6 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing False \
  --use_flash_attn True \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 1 \
  --report_to wandb \
  --remove_unused_columns False \
  --safe_serialization False

Error Stack

Even when I extracted only 0.00001% of the original dataset, the same OOM occurs.

GPU 1:
 File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 599, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 337, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 1 has a total capacty of 79.15 GiB of which 255.31 MiB is free. Including non-PyTorch memory, this process has 78.89 GiB memory in use. Of the allocated memory 77.45 GiB is allocated by PyTorch, and 21.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

GPU 7:

File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sionic/.cache/huggingface/modules/transformers_modules/OrionStarAI/Orion-14B-Base/87d96b1852d58c4f605f86e8437d47ab7ec89e1d/modeling_orion.py", line 353, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 1858, in softmax
    ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB. GPU 7 has a total capacty of 79.15 GiB of which 399.31 MiB is free. Including non-PyTorch memory, this process has 78.75 GiB memory in use. Of the allocated memory 77.45 GiB is allocated by PyTorch, and 21.78 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
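For what it's worth, the two allocation sizes in the tracebacks seem to line up with a single layer's full attention-weight matrix on the non-flash code path (a rough estimate on my part, assuming 40 attention heads and 2048-token sequences for Orion-14B):

```python
# Back-of-the-envelope estimate of one layer's attention-weight matrix when
# flash attention is disabled. Head count and sequence length are assumptions.
heads, seq_len = 40, 2048
print(heads * seq_len * seq_len * 2 / 2**20, "MiB")  # bf16 matmul  -> 320.0 MiB (GPU 1 error)
print(heads * seq_len * seq_len * 4 / 2**20, "MiB")  # fp32 softmax -> 640.0 MiB (GPU 7 error)
```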

Thank you for any assistance or suggestions you might provide.
