
YaLM 100B

YaLM 100B is a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

The model leverages 100 billion parameters. It took 65 days to train the model on a cluster of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources in both English and Russian.

Training details and best practices for acceleration and stabilization can be found in the Medium (English) and Habr (Russian) articles.

We used DeepSpeed to train the model and drew inspiration from the Megatron-LM example. However, the code in this repo is not the code that was used to train the model; rather, it is the stock example from the DeepSpeed repo with the minimal changes needed to run inference with our model.

Setup

Make sure you have 200GB of free disk space before downloading the weights. The model (code is based on microsoft/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3) is meant to run on multiple GPUs with tensor parallelism. It was tested on 4 A100 80GB and 8 V100 32GB GPUs, but it can work with other configurations that provide ≈200GB of GPU memory in total and whose GPU count divides the weight dimensions evenly (e.g. 16, 64, 128).
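
Before downloading, a quick way to sanity-check a machine is to look at free disk space and total GPU memory. This is only a convenience sketch with standard tools, not something the repo itself ships:

# The checkpoint needs roughly 200GB of free disk space
df -h .

# Sum memory across all visible GPUs; the model needs about 200GB in total
nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
  | awk '{sum += $1} END {printf "Total GPU memory: %.0f GiB\n", sum / 1024}'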

Downloading checkpoint

  • Run bash download/download.sh to download model weights and vocabulary.
  • By default, weights will be downloaded to ./yalm100b_checkpoint/weights/, and vocabulary will be downloaded to ./yalm100b_checkpoint/vocab/.
  • As another option, you can clone our HF repo and pull the checkpoint.
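
Both options side by side, as a minimal sketch; the Hugging Face repo path below is an assumption, so check the project's HF page for the exact address:

# Option 1: the provided script (defaults to ./yalm100b_checkpoint/)
bash download/download.sh

# Option 2 (assumed repo path): clone the HF checkpoint with Git LFS
git lfs install
git clone https://huggingface.co/yandex/yalm-100b ./yalm100b_checkpoint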

Docker

  • We published an image on Docker Hub; it can be pulled with docker/pull.sh. It is compatible with A100 and V100.
  • Alternatively, you can build the Docker image from source using docker/build.sh (which simply builds the image from docker/Dockerfile).
  • To run the container, use docker/run.sh (volumes, name, and other parameters can be changed).
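
For orientation, a hedged sketch of what the Docker workflow amounts to; the image name and mount path below are placeholders, and the authoritative values live in docker/pull.sh and docker/run.sh:

# Pull the published image (docker/pull.sh knows the real image name)
bash docker/pull.sh

# Run a container with all GPUs and the checkpoint directory mounted
# (replace yalm-image:latest and the paths with what docker/run.sh uses)
docker run --rm -it --gpus all \
  -v "$(pwd)/yalm100b_checkpoint:/workspace/yalm100b_checkpoint" \
  yalm-image:latest bash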

Usage

You can start with the following scripts:

  • examples/generate_interactive.sh: interactive generation from the command line, the simplest way to try the model.
  • examples/generate_conditional_sampling.sh: conditional generation with a sampling strategy. Top-p is used by default; feel free to change the temperature or use top-k. Input is JSON Lines (example: examples/example_cond_input.json); the output is the same JSON Lines with a generated text field added to each line (see the sketch after this list).
  • examples/generate_conditional_greedy.sh: same as the previous, but generation is greedy. Suitable for solving few-shot problems.
  • examples/generate_unconditional.sh: unconditional generation. No input is used; the output is JSON Lines.
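
A hedged sketch of a conditional-generation run; the JSON field name below is a guess, and the real schema is whatever examples/example_cond_input.json uses:

# Write a tiny JSON Lines input (one JSON object per line;
# the "prompt" key is illustrative, mirror example_cond_input.json)
cat > my_input.json << 'EOF'
{"prompt": "The history of neural networks begins"}
{"prompt": "Искусственный интеллект будет"}
EOF

# Run sampling-based conditional generation
# (point the script's input path at my_input.json if it is not already)
bash examples/generate_conditional_sampling.sh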

License

The model is published under the Apache 2.0 license, which permits both research and commercial use. Megatron-LM is licensed under the Megatron-LM license.

Training details

Dataset composition

The dataset used to train YaLM-100B comprises the following parts (rough percentages are measured in tokens seen by the model):

  • 25% The Pile — an open English dataset by the EleutherAI team

  • 75% Texts in Russian collected by our team (percentages of the whole dataset are given)

    • 49% Russian web pages from the Yandex Search index, filtered from ~100 TB to ~1 TB by the following heuristics:

      1. LSH deduplication — clusters of similar texts were truncated to just one text each.
      2. Length filtration — texts that were too short or too long, or that had too few natural sentences, were discarded.
      3. Entropy filtration — texts with too high or too low entropy were discarded.
      4. Domain filtration — domains with repetitive texts (like online retail) were discarded.
      5. Classifier filtration — a dataset of good texts was collected, in a manner similar to WebText, from pages linked in Russian tweets with at least one reply. A classifier was then trained to distinguish those good texts from random pages in the dataset, and texts from the original crawled dataset with low classifier scores were discarded.
    • 12% News from various sources from Yandex Search index

    • 10% Books from the dataset used in the Russian Distributional Thesaurus

    • 3% Misc texts from the Taiga Dataset

    • 1.5% Dialogues from social media, preprocessed in a manner similar to how Reddit is processed in The Pile

    • 0.5% Russian portion of Wikipedia

Some subsets were traversed up to 3 times during the training.

Training process

The model was trained on a cluster of 800 A100 GPUs for ~65 days, consuming 300B tokens in that time. You can see a TensorBoard with the LR and ramp-up schedule, training metrics, and our "thermometers" on the HF page.
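
As a rough back-of-envelope check of these numbers (an average only; this ignores ramp-up, restarts, and other downtime):

# 300B tokens over 65 days on 800 GPUs
awk 'BEGIN {
  tokens = 300e9; days = 65; gpus = 800
  per_gpu = tokens / (gpus * days * 24 * 3600)
  printf "~%.0f tokens per GPU per second on average\n", per_gpu  # ~67
}'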

yalm-100b's People

Contributors

artnitolog, petrovlesha


yalm-100b's Issues

gguf / mlx format?

Hello and thanks for open-sourcing the model!

As there don't seem to be any ready-to-use GGUF or MLX formats (for llama.cpp and macOS respectively), is there any chance you can give a hint on how to convert YaLM to them?

It would be a real help to enable the model to run on non-NVIDIA hardware, like any modern PC or mobile device.

Thanks in advance!

[NL] token

What's the [NL] token appearing in generation?
Is it an artifact or a special token?

Timeout on 8 x RTX A6000

Thank you for making your work publicly available!

I am trying to test your model on 8 RTX A6000 cards, and I'm getting a timeout error:

> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[the same watchdog timeout is reported for all ranks 0-7]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 06 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 05 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 03 / 8]
>> Loading layer_00-model_00-model_states.pt on CPU [mp 04 / 8]
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805074 milliseconds before timing out.
[analogous terminate messages follow for the other ranks]
> Start loading from release checkpoint from folder yalm100b_checkpoint/weights
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1824 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1827 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1828 closing signal SIGTERM

What causes this error, and how could I overcome it?

ZeRO 3 NVMe Offload?

Firstly, thank you for making the weights open to all!

Now, how can one do inference/fine-tuning with ZeRO 3 NVMe/CPU offload? Since this project is based on the Megatron project, that shouldn't be too difficult to implement, should it?

Model dataset irregularity

Hi,

Thank you for releasing the model dataset 😃!

I've just downloaded the data and done a quick spot check on the sizes of the layers. Most layers have a size of 2.4GB; however, layer 44 is 817MB - is there an issue with this layer's data?

Full layer sizes:

501M layer_00-model_00-model_states.pt
9.0K layer_01-model_00-model_states.pt
2.4G layer_03-model_00-model_states.pt
2.4G layer_04-model_00-model_states.pt
2.4G layer_05-model_00-model_states.pt
2.4G layer_06-model_00-model_states.pt
2.4G layer_07-model_00-model_states.pt
2.4G layer_08-model_00-model_states.pt
2.4G layer_09-model_00-model_states.pt
2.4G layer_10-model_00-model_states.pt
2.4G layer_11-model_00-model_states.pt
2.4G layer_12-model_00-model_states.pt
2.4G layer_13-model_00-model_states.pt
2.4G layer_14-model_00-model_states.pt
2.4G layer_15-model_00-model_states.pt
2.4G layer_16-model_00-model_states.pt
2.4G layer_17-model_00-model_states.pt
2.4G layer_18-model_00-model_states.pt
2.4G layer_19-model_00-model_states.pt
2.4G layer_20-model_00-model_states.pt
2.4G layer_21-model_00-model_states.pt
2.4G layer_22-model_00-model_states.pt
2.4G layer_23-model_00-model_states.pt
2.4G layer_24-model_00-model_states.pt
2.4G layer_25-model_00-model_states.pt
2.4G layer_26-model_00-model_states.pt
2.4G layer_27-model_00-model_states.pt
2.4G layer_28-model_00-model_states.pt
2.4G layer_29-model_00-model_states.pt
2.4G layer_30-model_00-model_states.pt
2.4G layer_31-model_00-model_states.pt
2.4G layer_32-model_00-model_states.pt
2.4G layer_33-model_00-model_states.pt
2.4G layer_34-model_00-model_states.pt
2.4G layer_35-model_00-model_states.pt
2.4G layer_36-model_00-model_states.pt
2.4G layer_37-model_00-model_states.pt
2.4G layer_38-model_00-model_states.pt
2.4G layer_39-model_00-model_states.pt
2.4G layer_40-model_00-model_states.pt
2.4G layer_41-model_00-model_states.pt
2.4G layer_42-model_00-model_states.pt
2.4G layer_43-model_00-model_states.pt
817M layer_44-model_00-model_states.pt
1.8G layer_45-model_00-model_states.pt
2.4G layer_46-model_00-model_states.pt
2.4G layer_47-model_00-model_states.pt
2.4G layer_48-model_00-model_states.pt
2.4G layer_49-model_00-model_states.pt
2.4G layer_50-model_00-model_states.pt
2.4G layer_51-model_00-model_states.pt
2.4G layer_52-model_00-model_states.pt
2.4G layer_53-model_00-model_states.pt
2.4G layer_54-model_00-model_states.pt
2.4G layer_55-model_00-model_states.pt
2.4G layer_56-model_00-model_states.pt
2.4G layer_57-model_00-model_states.pt
2.4G layer_58-model_00-model_states.pt
2.4G layer_59-model_00-model_states.pt
2.4G layer_60-model_00-model_states.pt
2.4G layer_61-model_00-model_states.pt
2.4G layer_62-model_00-model_states.pt
2.4G layer_63-model_00-model_states.pt
2.4G layer_64-model_00-model_states.pt
2.4G layer_65-model_00-model_states.pt
2.4G layer_66-model_00-model_states.pt
2.4G layer_67-model_00-model_states.pt
2.4G layer_68-model_00-model_states.pt
2.4G layer_69-model_00-model_states.pt
2.4G layer_70-model_00-model_states.pt
2.4G layer_71-model_00-model_states.pt
2.4G layer_72-model_00-model_states.pt
2.4G layer_73-model_00-model_states.pt
2.4G layer_74-model_00-model_states.pt
2.4G layer_75-model_00-model_states.pt
2.4G layer_76-model_00-model_states.pt
2.4G layer_77-model_00-model_states.pt
2.4G layer_78-model_00-model_states.pt
2.4G layer_79-model_00-model_states.pt
2.4G layer_80-model_00-model_states.pt
2.4G layer_81-model_00-model_states.pt
2.4G layer_82-model_00-model_states.pt
 41M layer_84-model_00-model_states.pt

PCI x1 or PCI x16 for GPU

There are 10 video cards with more than 200 GB of video memory in total.
If I connect them via PCIe x1, how much will performance decrease, and does PCIe x1 vs. x16 matter in this particular case?

Evaluation benchmarks (lm-eval-harness)

Thanks for the awesome work! (and especially for choosing to make it freely available)

If you have time, please also consider running the evaluation benchmarks from lm-eval-harness
https://github.com/EleutherAI/lm-evaluation-harness

[despite it having a ton of different benchmarks, you only need to implement one interface, and it runs all benchmarks for you]

It is a more-or-less standard tool for benchmarking how well your model performs on a range of tasks (generation, common sense, math, etc.).

There's a huge bunch of tasks, so if you want to choose some initial set, consider taking the ones that GPT-J reports here: https://huggingface.co/EleutherAI/gpt-j-6B#evaluation-results

NCCL error

I pulled the docker image and downloaded the checkpoint. When running generate_interactive.sh, I encountered the following error:

[interleaved, identical tracebacks from the other worker ranks; one representative traceback follows]
Traceback (most recent call last):
  File "megatron_lm/tools/generate_samples_gpt2.py", line 104, in <module>
    main()
  File "megatron_lm/tools/generate_samples_gpt2.py", line 89, in main
    _ = load_checkpoint(model, None, None)
  File "/workspace/YaLM-100B/megatron_lm/megatron/checkpointing.py", line 183, in load_checkpoint
    load_checkpoint_new(model, optimizer, lr_scheduler)
  File "/workspace/YaLM-100B/megatron_lm/megatron/checkpointing.py", line 373, in load_checkpoint_new
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 554) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.8.0a0+17f8c32', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
megatron_lm/tools/generate_samples_gpt2.py FAILED

Request to Open "Russian Pile" Dataset for Public Access

Dear Yandex Team,

I hope this message finds you well. I am writing to express my admiration for your work on the YaLM-100B model, which has demonstrated exceptional performance in generating and processing text in both English and Russian languages. Your dedication to providing this model for free use by developers and researchers worldwide is commendable.

As a researcher in the field of natural language processing, I am particularly interested in the dataset you have used to train the YaLM-100B model, specifically the 75% of the dataset consisting of Russian texts. I would like to respectfully request that you consider making this dataset, which I propose to call the "Russian Pile," openly available to the broader research community. Below are some strong arguments in favor of opening the dataset:

  1. Accelerating progress in NLP research: By making the Russian Pile dataset available, you will be enabling researchers and developers worldwide to explore new opportunities and challenges in Russian NLP. This could lead to breakthroughs in various NLP tasks, including translation, sentiment analysis, and information extraction, ultimately accelerating the progress of NLP research for the Russian language.
  2. Promoting reproducibility and transparency: Open datasets are essential for ensuring reproducibility and transparency in research. By opening the Russian Pile dataset, you will be enabling researchers to build upon your work, validate their findings, and contribute to a more robust and reliable body of knowledge in the field of NLP.
  3. Encouraging collaboration and innovation: Providing open access to the Russian Pile dataset will stimulate collaboration among researchers, institutions, and industries. It will also foster innovation by enabling researchers to combine datasets and develop new techniques or applications, leading to novel solutions for existing problems and the discovery of unexplored research areas.
  4. Bridging the gap between languages: By opening the Russian Pile dataset, you will be contributing to a more equitable distribution of resources in NLP research. Many languages are underrepresented in NLP, and the availability of a large-scale, high-quality dataset for Russian will help bridge this gap, promoting language diversity and enabling researchers to develop more inclusive AI systems.
  5. Improving educational opportunities: Open datasets, like the Russian Pile, can serve as valuable resources for educational purposes. Students and educators can utilize the dataset to learn about NLP, data preprocessing, and various other aspects of AI research, enhancing their skills and contributing to the development of a skilled workforce in the field of AI and NLP.
  6. Supporting ethics and fairness in AI: Open access to high-quality datasets, such as the Russian Pile, enables researchers to investigate and address issues related to ethics and fairness in AI. By providing a comprehensive and diverse dataset for the Russian language, you will be helping researchers to design and evaluate algorithms that are less biased and more equitable, thus contributing to the development of responsible AI systems.
  7. Boosting competitiveness and economic growth: Open datasets can drive economic growth by stimulating innovation and entrepreneurship. By opening the Russian Pile dataset, you will be providing valuable resources for startups, businesses, and developers to build new products and services, encouraging technological advancements and fostering a competitive ecosystem in the field of AI and NLP.

In conclusion, I believe that making the Russian Pile dataset openly available will bring about numerous benefits for the global research community, promote language diversity, and contribute to the development of more inclusive and responsible AI systems. Your willingness to share the YaLM-100B model is already a significant contribution to the field, and opening the Russian Pile dataset would further solidify your commitment to openness and collaboration in AI research.

Thank you for considering this request. I am looking forward to your response and the potential positive impact that opening the Russian Pile dataset will have on the research community and beyond.

Sincerely,
Mikhail Grankin

Dataset information

Thanks for the very interesting model release.

If possible, could a bit of information about the dataset used for training be provided (e.g. language split percentages)?

Would it be possible to run the model on a single A100 (40GB) or 2xV100 (32GB)?

Hello!

I managed to run the model on 8xA100; unfortunately, AFAIK GCP doesn't provide the 80GB variant, so it's the 40GB model.
It did fail on 4xA100 40GB with an out-of-memory error.

I was wondering if it would somehow be possible to run it on less hardware,
perhaps a single A100 40GB or 2xV100 32GB, which would reduce running costs significantly.

I am under the impression that it might be possible by tweaking some of the runtime configuration,
or even by modifying some code, but I'm not sure which parameters I should modify.

When I ran the model on 8xA100 it was using ~27GB of GPU memory on each device and a total of ~50GB of host memory. I wonder whether it would be possible to use more CPU threads / RAM / disk in exchange for consuming less GPU memory?

CUDA out of memory

Hello, I'm trying to use YaLM to generate text. I am using pretrained models. But when I try to run the generation, I get an error:

RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 0; 5.80 GiB total capacity; 62.50 MiB already allocated; 20.81 MiB free; 64.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The GPU is a 1660 with 6GB of VRAM. Is there anything I can do about it, or have I wasted a few weeks?

Run on networked nodes

Thanks for open-sourcing this! Because the GPU RAM requirements are so high, it's hard to rent a large enough single node from any of the major cloud providers. How can you run it in inference mode, networked between multiple physical machines?

Thanks!

Citation bibtex?

Hi! I wanted to ask for an official citation BibTeX entry one can use when referring to the model in a paper.

Could you share the md5 value for those checkpoints?

Hello, thank you for sharing the YaLM-100B checkpoints.
I am downloading the checkpoints; to make sure the files I have downloaded are good, can you share the md5sum values for those checkpoint files?
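
Until official checksums are published, one workaround is to hash the files locally and compare them across downloads or machines; this uses only standard coreutils and assumes the default checkpoint path from download.sh:

# Record checksums for all downloaded weight shards
md5sum yalm100b_checkpoint/weights/*.pt > weights.md5

# Verify later (or on another machine) against the recorded list
md5sum -c weights.md5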

How to use it with LangChain?

For example, if I can't use a transformers pipeline to use it with LangChain, how do I need to use it so that it connects to LangChain?

Possible to run on 8 x 24GB 3090?

This model looks amazing, thank you! We have a machine with 8 x 3090 (192GB total). I tried to run the examples, but I get:

building GPT2 model ...

RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 3; 23.70 GiB total capacity; 22.48 GiB already allocated; 70.56 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For someone who is not an expert with pytorch etc., perhaps you have a suggestion?

We would like to make a conversation partner for language learning (adding TTS, translation, NLP, etc.) for our project: https://dev.languagereactor.com/

Regards, David :)

AWS

Is there a way to run the model on AWS?

Online examples

It's great that you support the community. Thank you.

Please add an online example so that the model can be tested without downloading 200GB to one's computer.

Can it be launched on a usual VPS? For example, 6 CPUs and 16 GB RAM (ordinary chips)

Sorry if this is maybe a stupid question. I fortunately found your product and want to integrate it into social media accounts, but I don't understand how to use it out of the box (via Docker). As the instructions say, I need a powerful PC with GPU chips (which are pretty expensive for me), and I wonder if there is a way to just send an input text prompt/variables and get a response in the console / via an API?
Can you please comment on this, @artnitolog?

For reference: something like the way https://porfirevich.ru/ works.
