
varuna's Introduction

Varuna

Varuna is a tool for efficient training of large DNN models on commodity GPUs and networking. It implements a combination of pipeline parallelism and data parallelism in PyTorch, and enables training on a changing set of resources smoothly.

This repository is an implementation of the paper:

"Varuna: Scalable, Low-cost Training of Massive Deep Learning Models", to appear in EuroSys'22.

Setup & Installation

Varuna requires Python 3, PyTorch (1.5+), and NVIDIA apex.

The patch apex.patch in this directory needs to be applied to apex before building it. Varuna's code and this patch have been tested with this commit of apex.

git clone https://github.com/NVIDIA/apex
cp apex.patch /path/to/apex/
cd /path/to/apex
git apply apex.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

To install, clone this repository, cd into it and run

python setup.py install

Running

Varuna trains large DNN models by parallelising them into sequential pipeline stages and data parallel replicas across a set of GPUs. These methods are called pipeline parallelism and data parallelism respectively. To enable parallel training with Varuna, there are several steps the user must follow. Detailed docs are in the docs/ folder as webpages (html/index.html) or pdf (varuna.pdf).

Examples of models working with varuna can also be found in examples/. Please see this folder to run examples with BERT and Megatron-LM.

Some of these steps are briefly described below.

CutPoint demarcation

Varuna slices a DNN model into sequential pipeline stages. For this, the model should be annotated with Varuna CutPoint instances between different operations/parts of the model computation. These are nn.Module instances that mark potential slice points in the model. For each CutPoint, Varuna can either ignore it or activate it as a partition boundary. (see CutPoints)

from varuna import CutPoint

class SampleModel(nn.Module):
  def __init__(self, ...):
    ....
    self.cutpoints = [CutPoint() for i in range(num_cutpoints)]
    ....

  def forward(self, input, ...):
    input = self.some_operation(input)
    input = self.cutpoints[0](input)     # marked as a potential stage boundary
    input = self.some_other_operation(input)
    ....
    for i in range(num_sub_modules):
      x = self.sub_modules[i](input, ...)
      x = self.cutpoints[i+1](x)         # each CutPoint instance should be used only once in the model
    ....

Operations separated by CutPoints should preferably have no shared modules/parameters. For weight sharing between different parts of the model, you should register separate nn.Parameter instances (even for the same underlying tensor) and pass the pairs of parameter names as shared_weights to Varuna, as in the sketch below.
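For example, a minimal sketch of tying an input embedding to an output LM head across stages; the class and attribute names here are illustrative, not part of Varuna's API:

import torch.nn as nn

class TiedModel(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        # First stage: the embedding owns one copy of the weight.
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        # Last stage: a separate nn.Parameter for the logically shared weight.
        self.lm_head_weight = nn.Parameter(self.embedding.weight.detach().clone())

# The pair of parameter names tells Varuna these two tensors must stay in sync.
shared_weights = [("embedding.weight", "lm_head_weight")]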

Wrapping the model in Varuna

The nn.Module for your DNN should be wrapped in a Varuna instance before training and before optimizer creation (see Varuna). Wrapping in Varuna returns a model partitioned according to the given stage_to_rank_map (which is passed by the Varuna launcher) and moved to the GPU. After this initialization, each rank in the job holds only the parts of the model it needs. Varuna internally handles fp16 mixed-precision training and shared parameters (such as the initial and final embedding weights in BERT/GPT-2). Optimizer creation must happen after this step, since it requires the (now partitioned) model parameters as input, and the optimizer then needs to be registered with Varuna using a setter.

    model = MyModel()             # full model on CPU

    # Provide a dummy input function to Varuna for initialization.
    # Inputs must be returned in dictionary form.
    def get_batch_fn(size, device=None):
        batch = dataset[:size]
        if device is not None:
            batch = [t.to(device) for t in batch]
        inputs, mask = batch
        return {'inputs': inputs, 'mask': mask, 'extra_norm': True}

    # parameter sharing between pipeline stages
    shared_weights = [("language_model.embedding.word_embeddings.weight", "lm_head_weight")]

    model = Varuna(model, args.stage_to_rank_map, get_batch_fn, global_batch_size,
                   args.chunk_size, args.fp16, local_rank=args.local_rank,
                   device=args.local_rank, shared_weights=shared_weights)

    # model is now the subset of the original model needed by this process, moved to its GPU

    optimizer = get_optimizer(model)
    model.set_optimizer(optimizer)

Training loop

The Varuna training loop does not require separate forward and backward calls; the script may just call the step function. The input to this function should be the per-process batch (batch_size / data_parallel_workers examples), given as a dictionary mapping argument names to values. The step function splits this batch into micro-batches, runs the forward/backward pipeline schedule, and reduces the gradients/overflow flag across the whole job, returning the loss and an overflow boolean.


inputs = {
    "input_ids": tokens,
    "position_ids": position_ids,
    "attention_mask": attention_mask,
    "loss_mask": loss_mask,
    "labels": labels
}

loss, overflow = model.step(inputs)
loss = torch.Tensor([loss])

if not overflow:
    optimizer.step()
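
Putting these pieces together, a minimal per-iteration sketch; global_batch_size, data_parallel_workers, train_dataset and the batch field names are placeholders from the description above, not part of Varuna's API:

from torch.utils.data import DataLoader

# Each process sees batch_size / data_parallel_workers examples per step.
per_process_batch_size = global_batch_size // data_parallel_workers
loader = DataLoader(train_dataset, batch_size=per_process_batch_size)

for tokens, position_ids, attention_mask, loss_mask, labels in loader:
    inputs = {
        "input_ids": tokens,
        "position_ids": position_ids,
        "attention_mask": attention_mask,
        "loss_mask": loss_mask,
        "labels": labels
    }
    loss, overflow = model.step(inputs)   # runs the full fwd/bwd pipeline schedule
    if not overflow:
        optimizer.step()
    optimizer.zero_grad()                 # standard PyTorch; adjust if Varuna already clears grads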

Launcher and Arguments

To launch a distributed training process using Varuna, use the varuna.run_varuna launcher as follows:

python -m varuna.run_varuna --machine_list <file_with_ips> --gpus_per_node <num_gpus_per_node> --batch_size <total_effective_batch_size> --nstages <number_of_pipeline_stages> --chunk_size <micro_batch_size_for_pipeline> --code_dir <working_dir_for_training> user_training_script.py <...user args...>

See Launching varuna.

This expects all machines in the machine_list to be set up with the necessary code/libraries in code_dir and to have gpus_per_node working GPUs. The job is launched with all workers running user_training_script with the given args.

This launcher passes a few arguments to the user training script for Varuna. These should be accepted by the script and passed during Varuna initialisation (a minimal argparse sketch follows the list below):

  • rank: process rank in overall distributed job
  • local_rank: process rank in the local node
  • stage_to_rank_map: varuna config info about stage placement
  • chunk_size: micro batch size for Varuna pipeline
  • batch_size: per-process batch size
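
A minimal sketch (assuming argparse) of accepting these launcher-provided arguments in the training script; the exact flag spellings should be verified against the launcher:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, default=0)                  # rank in the overall job
parser.add_argument("--local_rank", type=int, default=0)            # rank within the node
parser.add_argument("--stage_to_rank_map", type=str, default=None)  # stage placement config
parser.add_argument("--chunk_size", type=int, default=None)         # pipeline micro-batch size
parser.add_argument("--batch_size", type=int, default=None)         # per-process batch size
args, _ = parser.parse_known_args()                                  # ignore script-specific args
# args.stage_to_rank_map, args.chunk_size, args.local_rank etc. are then passed to Varuna(...)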

Changing resources: job morphing

Varuna enables training on a changing set of nodes/GPUs. It does this by monitoring the machine_list text file, which lists the IPs of the nodes available at any time. Training jobs are launched from a long-living manager. On detecting a change, Varuna checkpoints, stops, and relaunches the job from the manager. To allow this on-demand checkpoint and stop, Varuna relies on user signals (SIGUSR1 on Unix). The user therefore needs to add a simple handler for this signal to their training script. See Morphing.

import signal

if __name__ == "__main__":

    def handler(signum, _):
        # checkpoint the current training state, then exit so the manager can relaunch
        save_checkpoint(iteration, model, optimizer, lr_scheduler)
        exit()

    signal.signal(signal.SIGUSR1, handler)

Profiling, config selection

Varuna supports automatic selection of the data-parallel and pipeline-parallel dimensions, which saves the user from manually running and comparing different configs for performance. To enable this, the user needs to run one-time profiling of the model and network conditions using the Profiler class in Varuna (see Varuna Profiling). This is instantiated similarly to Varuna and runs as a distributed process:

from varuna import Profiler

model = BuildModel()
profiler = Profiler(model, args.device, fp16=args.fp16)

def get_batch(size):
    # returns a sample batch of the given size for profiling
    return batch

profiler.initialize(get_batch)
microbatches_to_profile = list(range(1, max_micro_BS))
profile = profiler.profile_all(get_batch, microbatches_to_profile, out_folder=args.save)

Each process profiles the compute of the different cutpoints while simultaneously measuring communication with other processes. This builds and saves a profile of the model in the specified location, from where it can be read by the AutoConfig class. AutoConfig enumerates the possible configs for a given number of GPUs and simulates them using the pre-built profile, returning the best-performing setting in a few seconds. This calculation is triggered by run_varuna when the nstages and chunk_size arguments are omitted and a profile location is passed.
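
For illustration, a hypothetical launch that defers the choice of nstages and chunk_size to AutoConfig; the flag used here for the saved-profile location is an assumption, so check the launcher documentation for the exact name:

# --profile_folder is an assumed flag name for the saved profile location
python -m varuna.run_varuna --machine_list ips.txt --gpus_per_node 4 \
       --batch_size 4096 --code_dir /path/to/code \
       --profile_folder /path/to/saved/profile \
       user_training_script.py <...user args...>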


varuna's Issues

Problem with --fp16 in BERT: NaN loss

########################################################
python -m varuna.run_varuna --machine_list ./ips.txt --gpus_per_node 4 \
    --batch_size 4096 --nstages 2 \
    --chunk_size 16 \
    --code_dir /root/code/space/DeepLearningExamples/PyTorch/LanguageModeling/BERT run_pretraining.py \
    --input_dir /root/code/dataset/BERT/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en \
    --config_file /root/code/space/DeepLearningExamples/PyTorch/LanguageModeling/BERT/bert_config.json \
    --output_dir /root/code/space/DeepLearningExamples/PyTorch/LanguageModeling/BERT/out \
    --do_train --fp16 --max_steps 10 --varuna
########################################################
0 Overflow !!  1 Overflow !!  2 Overflow !!  3 Overflow !!
update_scale(): _has_overflow, dynamic. _loss_scale = 524288.0
Could not send progress update message
DLL 2023-06-15 20:14:36.849959 - Training Epoch: 0 Training Iteration: 1 average_loss : 11.3046875 step_loss : 11.3046875 learning_rate : 4.743416490252569e-05
0 Overflow !!  1 Overflow !!  2 Overflow !!  3 Overflow !!
update_scale(): _has_overflow, dynamic. _loss_scale = 262144.0
Could not send progress update message
DLL 2023-06-15 20:14:37.136949 - Training Epoch: 0 Training Iteration: 2 average_loss : nan step_loss : nan learning_rate : 4.4721359549995795e-05
0 Overflow !!  1 Overflow !!  2 Overflow !!  3 Overflow !!
update_scale(): _has_overflow, dynamic. _loss_scale = 131072.0


Varuna Profiler Issue

Hello, I was trying to profile Megatron-LM (which is in examples/) and I got an error.

My environment is:
Nvidia NGC Container pytorch 22.02-py3
PyTorch 1.11.0
CUDA 11.6.0
Ubuntu 20.04

I got an error like

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 170, in 
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 
  File "/workspace/Megatron-LM/megatron/training.py", line 108, in pretrain
    profile = model.profile_all(list(range(1,25)))
  File "/opt/conda/lib/python3.8/site-packages/varuna-0.0.1-py3.8.egg/varuna/profiler.py", line 476, in profile_all
    self.profile(microbatch_sizes, optimizer)
  File "/opt/conda/lib/python3.8/site-packages/varuna-0.0.1-py3.8.egg/varuna/profiler.py", line 755, in profile
    self.profile_mbs(batch_size, optimizer)
  File "/opt/conda/lib/python3.8/site-packages/varuna-0.0.1-py3.8.egg/varuna/profiler.py", line 841, in profile_mbs
    optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 357, in new_step
    retval = old_step(global_grad_norm=global_grad_norm)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
TypeError: step() got an unexpected keyword argument 'global_grad_norm'

My run script is

#! /bin/bash

DATA_PATH=/workspace/data/gpt_text_document
GPUS_PER_SERVER=4


python -m varuna.run_varuna --nstages 4 --chunk_size 4 --batch_size 16 \
        --gpus_per_node $GPUS_PER_SERVER --no_morphing --machine_list /workspace/IPs pretrain_gpt2.py \
        --num-layers 16 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --train-iters 100 \
        --lr-decay-iters 100 \
        --data-path $DATA_PATH \
        --distributed-backend gloo \
        --vocab-file /workspace/data/gpt2-vocab.json \
        --merge-file /workspace/data/gpt2-merges.txt \
        --save /workspace/profile \
        --save-interval 1000 \
        --data-impl mmap \
        --split 1000,0,0 \
        --lr 0.00001 \
        --min-lr 1e-5 \
        --lr-decay-style cosine \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --use-cpu-initialization \
        --warmup .05 \
        --fp16 \
        --varuna \
        --profiling

I followed all the instructions in the README, such as checking out the exact commits of apex and Megatron-LM and applying the patch.

Please let me know what is wrong.
Thank you.

A bug in the 'extra_grad_norm_sq' func

In the extra_grad_norm_sq function, self.shared_weights is enumerated to calculate extra_norm_sq:

for i,w in enumerate(self.shared_weights):

However, if self.shared_weights is None, it is not enumerable, so this line throws an exception and crashes the training process.
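
A minimal guard would avoid the crash; the sketch below only illustrates the shape of the fix, with the surrounding norm computation elided:

def extra_grad_norm_sq(self):
    extra_norm_sq = 0.0
    # Nothing to add when no weights are shared across stages.
    if self.shared_weights is None:
        return extra_norm_sq
    for i, w in enumerate(self.shared_weights):
        ...  # existing per-shared-weight gradient-norm computation
    return extra_norm_sq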

Documentation links point to raw HTML files

I was reading through README.md and wanted to learn more about CutPoints, so I clicked on the link to the CutPoints documentation, but this brings me to a raw HTML file in the github repository. Can you put these documentation files in some hosted place where they can be rendered directly by the browser?

Inference on multiple GPU

Hi, I have a sizeable pre-trained model and I want to run inference on it across multiple GPUs (I don't want to train it). Is there any way to do that?
In summary, I want model parallelism for inference; if there is a way, how is it done?

Launcher script unexpectedly runs `sudo` commands

When using run_varuna, the terminal suddenly goes to a sudo password authentication prompt with no context. This is suspicious for anyone who is security-minded, and this interactivity will not work if the launcher process is run under automation. The call-sites I encountered are here:

os.system("sudo pkill -f varuna.morph")
but there could be others.

In this case, I removed sudo from these commands and the script got further, but there should likely be a systematic fix for this.
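
One possible workaround, sketched under the assumption that the processes to clean up are owned by the launching user (this is not a change that exists in the repository):

import subprocess, getpass

# Kill stray varuna.morph processes owned by the current user, without sudo.
subprocess.run(["pkill", "-u", getpass.getuser(), "-f", "varuna.morph"], check=False)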

Launcher script does not work if hosts require 2-factor authentication

With a simple invocation of run_varuna:

$ python -m varuna.run_varuna --batch_size 3 --nstages 3 --chunk_size 1 varuan_test.py

The script has the following terminal output:

No apex!
No apex
['127.0.0.1']
ssh 127.0.0.1 echo "python -u -m varuna.launcher --ngpus_per_server 4   --node_rank 0 --nservers 1 --master_addr 127.0.0.1 --nstages 3 --batch_size 3 --chunk_size 1 --code_dir /fsx/users/jamesreed/varuna varuan_test.py" > launch_varuna.sh;  VARUNA_MANAGER_IP=10.200.30.184 VARUNA_MORPH_PORT=4200 VARUNA_HEARTBEAT_PORT=5000  bash launch_varuna.sh

The last line (I believe) indicates that run_varuna is trying to use ssh to issue a command on the host (in this case localhost). However, this host requires two-factor authentication on login. run_varuna launches this command (twice, seemingly) as a background process and ostensibly with stdin disconnected, so the interactive 2FA prompt cannot be completed. It would be good to have a way to issue commands to the hosts that is compatible with common security policies or is built on top of standard cluster management tools.
