
YaLM 100B

YaLM 100B is a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

The model leverages 100 billion parameters. It took 65 days to train the model on a cluster of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources in both English and Russian.

Training details and best practices for acceleration and stabilization can be found in the Medium (English) and Habr (Russian) articles.

We used DeepSpeed to train the model and drew inspiration from the Megatron-LM example. However, the code in this repo is not the code that was used to train the model; rather, it is the stock example from the DeepSpeed repo with the minimal changes needed to run inference with our model.

Setup

Make sure you have 200GB of free disk space before downloading the weights. The model (code is based on microsoft/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3) is meant to run on multiple GPUs with tensor parallelism. It was tested on 4 A100 80GB and 8 V100 32GB GPUs, but it can work with other configurations that provide ≈200GB of GPU memory in total and whose GPU count divides the weight dimensions evenly (e.g. 16, 64, 128).
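
Before downloading, a quick way to sanity-check a machine is to look at free disk space and total GPU memory. This is only a convenience sketch with standard tools, not something the repo itself ships:

# The checkpoint needs roughly 200GB of free disk space
df -h .

# Sum memory across all visible GPUs; the model needs about 200GB in total
nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
  | awk '{sum += $1} END {printf "Total GPU memory: %.0f GiB\n", sum / 1024}'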

Downloading checkpoint

  • Run bash download/download.sh to download model weights and vocabulary.
  • By default, weights will be downloaded to ./yalm100b_checkpoint/weights/, and vocabulary will be downloaded to ./yalm100b_checkpoint/vocab/.
  • As another option, you can clone our HF repo and pull the checkpoint.
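
Both options side by side, as a minimal sketch; the Hugging Face repo path below is an assumption, so check the project's HF page for the exact address:

# Option 1: the provided script (defaults to ./yalm100b_checkpoint/)
bash download/download.sh

# Option 2 (assumed repo path): clone the HF checkpoint with Git LFS
git lfs install
git clone https://huggingface.co/yandex/yalm-100b ./yalm100b_checkpoint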

Docker

  • We published an image on Docker Hub; it can be pulled with docker/pull.sh. It is compatible with A100 and V100.
  • Alternatively, you can build the Docker image from source using docker/build.sh (which simply builds the image from docker/Dockerfile).
  • To run the container, use docker/run.sh (volumes, name, and other parameters can be changed).
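
For orientation, a hedged sketch of what the Docker workflow amounts to; the image name and mount path below are placeholders, and the authoritative values live in docker/pull.sh and docker/run.sh:

# Pull the published image (docker/pull.sh knows the real image name)
bash docker/pull.sh

# Run a container with all GPUs and the checkpoint directory mounted
# (replace yalm-image:latest and the paths with what docker/run.sh uses)
docker run --rm -it --gpus all \
  -v "$(pwd)/yalm100b_checkpoint:/workspace/yalm100b_checkpoint" \
  yalm-image:latest bash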

Usage

You can start with the following scripts:

  • examples/generate_interactive.sh: interactive generation from the command line, the simplest way to try the model.
  • examples/generate_conditional_sampling.sh: conditional generation with a sampling strategy. Top-p is used by default; feel free to change the temperature or use top-k. Input is JSON Lines (example: examples/example_cond_input.json); the output is the same JSON Lines with a generated text field added to each line (see the sketch after this list).
  • examples/generate_conditional_greedy.sh: same as the previous, but generation is greedy. Suitable for solving few-shot problems.
  • examples/generate_unconditional.sh: unconditional generation. No input is used; the output is JSON Lines.
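
A hedged sketch of a conditional-generation run; the JSON field name below is a guess, and the real schema is whatever examples/example_cond_input.json uses:

# Write a tiny JSON Lines input (one JSON object per line;
# the "prompt" key is illustrative, mirror example_cond_input.json)
cat > my_input.json << 'EOF'
{"prompt": "The history of neural networks begins"}
{"prompt": "Искусственный интеллект будет"}
EOF

# Run sampling-based conditional generation
# (point the script's input path at my_input.json if it is not already)
bash examples/generate_conditional_sampling.sh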

License

The model is published under the Apache 2.0 license, which permits both research and commercial use. Megatron-LM is licensed under the Megatron-LM license.

Training details

Dataset composition

The dataset used to train YaLM-100B comprises the following parts (rough percentages are measured in tokens seen by the model):

  • 25% The Pile — an open English dataset by the EleutherAI team

  • 75% Texts in Russian collected by our team (percentages of the whole dataset are given)

    • 49% Russian web pages from the Yandex Search index, filtered from ~100 TB to ~1 TB by the following heuristics:

      1. LSH deduplication — clusters of similar texts were truncated to just one text each.
      2. Length filtration — texts that were too short or too long, or that had too few natural sentences, were discarded.
      3. Entropy filtration — texts with too high or too low entropy were discarded.
      4. Domain filtration — domains with repetitive texts (like online retail) were discarded.
      5. Classifier filtration — a dataset of good texts was collected, in a manner similar to WebText, from pages linked in Russian tweets with at least one reply. A classifier was then trained to distinguish those good texts from random pages in the dataset, and texts from the original crawled dataset with low classifier scores were discarded.
    • 12% News from various sources from Yandex Search index

    • 10% Books from the dataset used in the Russian Distributional Thesaurus

    • 3% Misc texts from the Taiga Dataset

    • 1.5% Dialogues from social media, preprocessed in a manner similar to how Reddit is processed in The Pile

    • 0.5% Russian portion of Wikipedia

Some subsets were traversed up to 3 times during the training.

Training process

The model was trained on a cluster of 800 A100 GPUs for ~65 days, consuming 300B tokens in that time. You can see a TensorBoard with the LR and ramp-up schedule, training metrics, and our "thermometers" on the HF page.
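
As a rough back-of-envelope check of these numbers (an average only; this ignores ramp-up, restarts, and other downtime):

# 300B tokens over 65 days on 800 GPUs
awk 'BEGIN {
  tokens = 300e9; days = 65; gpus = 800
  per_gpu = tokens / (gpus * days * 24 * 3600)
  printf "~%.0f tokens per GPU per second on average\n", per_gpu  # ~67
}'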

yalm-100b's People

Contributors

artnitolog, petrovlesha


yalm-100b's Issues

gguf / mlx format?

Hello and thanks for open-sourcing the model!

As there don't seem to be any ready-to-use GGUF or MLX formats (for llama.cpp and macOS respectively), is there any chance you can give a hint on how to convert YaLM to them?

It would be a real help to enable the model to run on non-NVIDIA hardware, like any modern PC or mobile device.

Thanks in advance!

[NL] token

What's the [NL] token appearing in generation?
Is it an artifact or a special token?

Timeout on 8 x RTX A6000

Thank you for making your work publicly available!

I am trying to test your model on 8 RTX A6000 cards, and I'm getting a timeout error:

> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[the same watchdog timeout is reported for all ranks 0-7]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 06 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 05 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 03 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 04 / 8]
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[analogous terminate messages follow for the other ranks]
> Start loading from release checkpoint from folder yalm100b_checkpoint/weights
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1824 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1827 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1828 closing signal SIGTERM

What causes this error, and how could I overcome it?

ZeRO 3 NVMe Offload?

Firstly, thank you for making the weights open to all!

Now, how can one do inference/fine-tuning with ZeRO 3 NVMe/CPU offload? Since this project is based on the Megatron project, that shouldn't be too difficult to implement, should it?

Model dataset irregularity

Hi,

Thank you for releasing the model dataset 😃!

I've just downloaded the data and done a quick spot check on the sizes of the layers. Most layers have a size of 2.4GB; however, layer 44 is 817MB - is there an issue with this layer's data?

Full layer sizes:

501M layer_00-model_00-model_states.pt
9.0K layer_01-model_00-model_states.pt
2.4G layer_03-model_00-model_states.pt
2.4G layer_04-model_00-model_states.pt
2.4G layer_05-model_00-model_states.pt
2.4G layer_06-model_00-model_states.pt
2.4G layer_07-model_00-model_states.pt
2.4G layer_08-model_00-model_states.pt
2.4G layer_09-model_00-model_states.pt
2.4G layer_10-model_00-model_states.pt
2.4G layer_11-model_00-model_states.pt
2.4G layer_12-model_00-model_states.pt
2.4G layer_13-model_00-model_states.pt
2.4G layer_14-model_00-model_states.pt
2.4G layer_15-model_00-model_states.pt
2.4G layer_16-model_00-model_states.pt
2.4G layer_17-model_00-model_states.pt
2.4G layer_18-model_00-model_states.pt
2.4G layer_19-model_00-model_states.pt
2.4G layer_20-model_00-model_states.pt
2.4G layer_21-model_00-model_states.pt
2.4G layer_22-model_00-model_states.pt
2.4G layer_23-model_00-model_states.pt
2.4G layer_24-model_00-model_states.pt
2.4G layer_25-model_00-model_states.pt
2.4G layer_26-model_00-model_states.pt
2.4G layer_27-model_00-model_states.pt
2.4G layer_28-model_00-model_states.pt
2.4G layer_29-model_00-model_states.pt
2.4G layer_30-model_00-model_states.pt
2.4G layer_31-model_00-model_states.pt
2.4G layer_32-model_00-model_states.pt
2.4G layer_33-model_00-model_states.pt
2.4G layer_34-model_00-model_states.pt
2.4G layer_35-model_00-model_states.pt
2.4G layer_36-model_00-model_states.pt
2.4G layer_37-model_00-model_states.pt
2.4G layer_38-model_00-model_states.pt
2.4G layer_39-model_00-model_states.pt
2.4G layer_40-model_00-model_states.pt
2.4G layer_41-model_00-model_states.pt
2.4G layer_42-model_00-model_states.pt
2.4G layer_43-model_00-model_states.pt
817M layer_44-model_00-model_states.pt
1.8G layer_45-model_00-model_states.pt
2.4G layer_46-model_00-model_states.pt
2.4G layer_47-model_00-model_states.pt
2.4G layer_48-model_00-model_states.pt
2.4G layer_49-model_00-model_states.pt
2.4G layer_50-model_00-model_states.pt
2.4G layer_51-model_00-model_states.pt
2.4G layer_52-model_00-model_states.pt
2.4G layer_53-model_00-model_states.pt
2.4G layer_54-model_00-model_states.pt
2.4G layer_55-model_00-model_states.pt
2.4G layer_56-model_00-model_states.pt
2.4G layer_57-model_00-model_states.pt
2.4G layer_58-model_00-model_states.pt
2.4G layer_59-model_00-model_states.pt
2.4G layer_60-model_00-model_states.pt
2.4G layer_61-model_00-model_states.pt
2.4G layer_62-model_00-model_states.pt
2.4G layer_63-model_00-model_states.pt
2.4G layer_64-model_00-model_states.pt
2.4G layer_65-model_00-model_states.pt
2.4G layer_66-model_00-model_states.pt
2.4G layer_67-model_00-model_states.pt
2.4G layer_68-model_00-model_states.pt
2.4G layer_69-model_00-model_states.pt
2.4G layer_70-model_00-model_states.pt
2.4G layer_71-model_00-model_states.pt
2.4G layer_72-model_00-model_states.pt
2.4G layer_73-model_00-model_states.pt
2.4G layer_74-model_00-model_states.pt
2.4G layer_75-model_00-model_states.pt
2.4G layer_76-model_00-model_states.pt
2.4G layer_77-model_00-model_states.pt
2.4G layer_78-model_00-model_states.pt
2.4G layer_79-model_00-model_states.pt
2.4G layer_80-model_00-model_states.pt
2.4G layer_81-model_00-model_states.pt
2.4G layer_82-model_00-model_states.pt
 41M layer_84-model_00-model_states.pt

PCI x1 or PCI x16 for GPU

There are 10 video cards with more than 200 GB of video memory in total.
If I connect them via PCIe x1, how much will performance decrease, and does PCIe x1 vs. x16 matter in this particular case?

Evaluation benchmarks (lm-eval-harness)

Thanks for the awesome work! (and especially for choosing to make it freely available)

If you have time, please also consider running the evaluation benchmarks from lm-eval-harness
https://github.com/EleutherAI/lm-evaluation-harness

[despite it having a ton of different benchmarks, you only need to implement one interface, and it runs all benchmarks for you]

It is a more-or-less standard tool for benchmarking how well your model performs on a range of tasks (generation, common sense, math, etc.).

There's a huge bunch of tasks, so if you want to choose some initial set, consider taking the ones that GPT-J reports here: https://huggingface.co/EleutherAI/gpt-j-6B#evaluation-results

NCCL error

I pulled the docker image and downloaded the checkpoint. When running generate_interactive.sh, I encountered the following error:

[interleaved, identical tracebacks from the other worker ranks; one representative traceback follows]
Traceback (most recent call last):
  File "megatron_lm/tools/generate_samples_gpt2.py", line 104, in <module>
    main()
  File "megatron_lm/tools/generate_samples_gpt2.py", line 89, in main
    _ = load_checkpoint(model, None, None)
  File "/workspace/YaLM-100B/megatron_lm/megatron/checkpointing.py", line 183, in load_checkpoint
    load_checkpoint_new(model, optimizer, lr_scheduler)
  File "/workspace/YaLM-100B/megatron_lm/megatron/checkpointing.py", line 373, in load_checkpoint_new
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 554) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.8.0a0+17f8c32', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_lm/tools/generate_samples_gpt2.py FAILED

Request to Open "Russian Pile" Dataset for Public Access

Dear Yandex Team,

I hope this message finds you well. I am writing to express my admiration for your work on the YaLM-100B model, which has demonstrated exceptional performance in generating and processing text in both English and Russian languages. Your dedication to providing this model for free use by developers and researchers worldwide is commendable.

As a researcher in the field of natural language processing, I am particularly interested in the dataset you have used to train the YaLM-100B model, specifically the 75% of the dataset consisting of Russian texts. I would like to respectfully request that you consider making this dataset, which I propose to call the "Russian Pile," openly available to the broader research community. Below are some strong arguments in favor of opening the dataset:

  1. Accelerating progress in NLP research: By making the Russian Pile dataset available, you will be enabling researchers and developers worldwide to explore new opportunities and challenges in Russian NLP. This could lead to breakthroughs in various NLP tasks, including translation, sentiment analysis, and information extraction, ultimately accelerating the progress of NLP research for the Russian language.
  2. Promoting reproducibility and transparency: Open datasets are essential for ensuring reproducibility and transparency in research. By opening the Russian Pile dataset, you will be enabling researchers to build upon your work, validate their findings, and contribute to a more robust and reliable body of knowledge in the field of NLP.
  3. Encouraging collaboration and innovation: Providing open access to the Russian Pile dataset will stimulate collaboration among researchers, institutions, and industries. It will also foster innovation by enabling researchers to combine datasets and develop new techniques or applications, leading to novel solutions for existing problems and the discovery of unexplored research areas.
  4. Bridging the gap between languages: By opening the Russian Pile dataset, you will be contributing to a more equitable distribution of resources in NLP research. Many languages are underrepresented in NLP, and the availability of a large-scale, high-quality dataset for Russian will help bridge this gap, promoting language diversity and enabling researchers to develop more inclusive AI systems.
  5. Improving educational opportunities: Open datasets, like the Russian Pile, can serve as valuable resources for educational purposes. Students and educators can utilize the dataset to learn about NLP, data preprocessing, and various other aspects of AI research, enhancing their skills and contributing to the development of a skilled workforce in the field of AI and NLP.
  6. Supporting ethics and fairness in AI: Open access to high-quality datasets, such as the Russian Pile, enables researchers to investigate and address issues related to ethics and fairness in AI. By providing a comprehensive and diverse dataset for the Russian language, you will be helping researchers to design and evaluate algorithms that are less biased and more equitable, thus contributing to the development of responsible AI systems.
  7. Boosting competitiveness and economic growth: Open datasets can drive economic growth by stimulating innovation and entrepreneurship. By opening the Russian Pile dataset, you will be providing valuable resources for startups, businesses, and developers to build new products and services, encouraging technological advancements and fostering a competitive ecosystem in the field of AI and NLP.

In conclusion, I believe that making the Russian Pile dataset openly available will bring about numerous benefits for the global research community, promote language diversity, and contribute to the development of more inclusive and responsible AI systems. Your willingness to share the YaLM-100B model is already a significant contribution to the field, and opening the Russian Pile dataset would further solidify your commitment to openness and collaboration in AI research.

Thank you for considering this request. I am looking forward to your response and the potential positive impact that opening the Russian Pile dataset will have on the research community and beyond.

Sincerely,
Mikhail Grankin

Dataset information

Thanks for the very interesting model release.

If possible, could a bit of information about the dataset used for training be provided (e.g. language split percentages)?

Would it be possible to run the model on a single A100 (40GB) or 2xV100 (32GB)?

Hello!

I managed to run the model on 8xA100; unfortunately, AFAIK GCP doesn't provide the 80GB variant, so it's the 40GB model.
It did fail on 4xA100 40GB with an out-of-memory error.

I was wondering if it would somehow be possible to run it on less hardware,
perhaps a single A100 40GB or 2xV100 32GB, which would reduce running costs significantly.

I am under the impression that it might be possible by tweaking some of the runtime configuration,
or even by modifying some code, but I'm not sure which parameters I should modify.

When I ran the model on 8xA100 it was using ~27GB of GPU memory on each device and a total of ~50GB of host memory. I wonder whether it would be possible to use more CPU threads / RAM / disk in exchange for consuming less GPU memory?

CUDA out of memory

Hello, I'm trying to use YaLM to generate text. I am using pretrained models. But when I try to run the generation, I get an error:

RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 0; 5.80 GiB total capacity; 62.50 MiB already allocated; 20.81 MiB free; 64.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The GPU is a 1660 with 6GB of VRAM. Is there anything I can do about it, or have I wasted a few weeks?

Run on networked nodes

Thanks for open-sourcing this! Because the GPU RAM requirements are so high, it's hard to rent a large enough single node from any of the major cloud providers. How can you run it in inference mode, networked between multiple physical machines?

Thanks!

Citation bibtex?

Hi! I wanted to ask for an official citation BibTeX entry one can use when referring to the model in a paper.

Could you share the md5 value for those checkpoints?

Hello, thank you for sharing the YaLM-100B checkpoints.
I am downloading the checkpoints; to make sure the files I have downloaded are good, can you share the md5sum values for those checkpoint files?
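
Until official checksums are published, one workaround is to hash the files locally and compare them across downloads or machines; this uses only standard coreutils and assumes the default checkpoint path from download.sh:

# Record checksums for all downloaded weight shards
md5sum yalm100b_checkpoint/weights/*.pt > weights.md5

# Verify later (or on another machine) against the recorded list
md5sum -c weights.md5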

How to use it with LangChain?

For example, if I can't use a transformers pipeline to use it with LangChain, how do I need to use it so that it connects to LangChain?

Possible to run on 8 x 24GB 3090?

This model looks amazing, thank you! We have a machine with 8 x 3090 (192GB total). I tried to run the examples, but I get:

building GPT2 model ...

RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 3; 23.70 GiB total capacity; 22.48 GiB already allocated; 70.56 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For someone who is not an expert with pytorch etc., perhaps you have a suggestion?

We would like to make a conversation partner for language learning (adding TTS, translation, NLP, etc.) for our project: https://dev.languagereactor.com/

Regards, David :)

AWS

Is there a way to run the model on AWS?

Online examples

It's great that you support the community. Thank you.

Please add an online example so that the model can be tested without downloading 200GB to one's computer.

Can it be launched on a usual VPS? For example, 6 CPUs and 16 GB RAM (ordinary chips)

Sorry if this is maybe a stupid question. I fortunately found your product and want to integrate it into social media accounts, but I don't understand how to use it out of the box (via Docker). As the instructions say, I need a powerful PC with GPU chips (which are pretty expensive for me), and I wonder if there is a way to just send an input text prompt/variables and get a response in the console / via an API?
Can you please comment on this, @artnitolog?

For reference: something like the way https://porfirevich.ru/ works.
