guitaricet / relora

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Home Page: https://arxiv.org/abs/2307.05695

License: Apache License 2.0

Languages: Python 30.51%, Jupyter Notebook 66.63%, C++ 2.84%, Makefile 0.02%
Topics: deep-learning, distributed-training, llama, nlp, peft, transformer


ReLoRA -- PEFT Pretraining

Official code for Stack More Layers Differently: High-Rank Training Through Low-Rank Updates https://arxiv.org/abs/2307.05695

ReLoRA

Setup

Requires Python 3.10+ (due to the parameter annotation style) and PyTorch 2.0+ (for flash attention). All requirements are listed in requirements.txt and kept up-to-date.

pip install -e .
pip install flash-attn

Flash attention is not included in our requirements because its installation script requires torch and some other dependencies to already be installed.

1B training script

The rule of thumb I currently use for selecting the learning rate is 2x the regular pretraining learning rate. It might require tuning on larger models. The microbatch size depends on GPU memory and needs to be tuned to maximize throughput. Note that ReLoRA allows larger microbatch sizes than regular training.

The number of steps is 143K (Pythia) minus 10K, because we start from the checkpoint at 10K steps. The ReLoRA reset frequency is 5,320 so that the remaining number of steps is divisible by it (133,000 / 5,320 = 25 cycles).

torchrun --nproc-per-node 8 --nnodes 1 torchrun_main.py --training_config training_configs/1B_v1.0.yaml

Usage

Pre-process data (might take some time)

python pretokenize.py \
    --save_dir preprocessed_data \
    --tokenizer t5-base \
    --dataset c4 \
    --dataset_config en \
    --text_field text \
    --sequence_length 512

The script will log where the pre-processed data is saved. It should be something like preprocessed_data/<dataset>_<tokenizer>_<sequence_length>.

To train a model using ReLoRA, first perform a warmup through regular training.

export DATA_PATH=<path to preprocessed data>

torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
    --model_config configs/llama_250m.json \
    --dataset_path $DATA_PATH \
    --batch_size 24 \
    --total_batch_size 1152 \
    --lr 5e-4 \
    --max_length 512 \
    --save_every 1000 \
    --eval_every 1000 \
    --num_training_steps 20000 \
    --tags warm_start_250M

Reproducibility note: In the paper we ran the experiments by specifying the full num_training_steps, including both the warmup and the ReLoRA training, and stopping the run after the desired number of steps was completed. Providing only the number of steps for the warm-start stage should work too; the only difference will be the LR schedule during the warmup period.

When you have a warmed-up network checkpoint, run the script with ReLoRA enabled. Note that we use a larger LR during the ReLoRA stage.

Train with PEFT

torchrun --nproc-per-node <N_GPUS> torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 24 \
    --total_batch_size 1152 \
    --lr 1e-3 \
    --max_length 512 \
    --use_peft \
    --relora 5000 \
    --cycle_length 5000 \
    --restart_warmup_steps 100 \
    --scheduler cosine_restarts \
    --warmup_steps 500 \
    --reset_optimizer_on_relora True \
    --num_training_steps 20000 \
    --save_every 5000 \
    --eval_every 5000 \
    --warmed_up_model checkpoints/llama_250m-2023-06-09-11-29-56/model_5000 \
    --tags relora_250M

Note on batch sizes

To minimize the pain with multi-GPU setups, we recommend avoiding the --gradient_accumulation option directly. Instead, specify --total_batch_size and let the script figure out the gradient accumulation factor from --batch_size and the number of GPUs used.
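For intuition, here is a minimal sketch of how the accumulation factor falls out of these flags; the variable names are illustrative, not the actual internals of torchrun_main.py.

import os

batch_size = 24            # --batch_size, per-GPU microbatch size
total_batch_size = 1152    # --total_batch_size
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun

assert total_batch_size % (batch_size * world_size) == 0, \
    "total_batch_size must be divisible by batch_size * number of GPUs"
gradient_accumulation = total_batch_size // (batch_size * world_size)
# e.g. 1152 // (24 * 8) = 6 accumulation steps on 8 GPUs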

ReLoRA

ReLoRA merges the trained LoRA parameters into the main network and resets them (a minimal sketch of this merge-and-reset step follows the list below). In principle, this approach can be more flexible than LoRA, but you need to be careful with

  1. Optimizer states
  2. Learning rate schedule during and right after the reset
  3. How frequently you reset
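As a rough sketch, the merge-and-reset step for a single linear layer could look like the following; the attribute names (weight, lora_A, lora_B, scaling) and the re-initialization choices are assumptions for illustration and may not match the classes used in this repository.

import torch

@torch.no_grad()
def merge_and_reinit(layer):
    # Merge the learned low-rank update into the frozen weight: W <- W + s * B @ A
    layer.weight.data += layer.scaling * (layer.lora_B.weight @ layer.lora_A.weight)
    # Reset the LoRA factors so the next cycle starts from a zero update
    torch.nn.init.kaiming_uniform_(layer.lora_A.weight, a=5 ** 0.5)
    torch.nn.init.zeros_(layer.lora_B.weight)

Because lora_B is re-initialized to zero, the merged model's output is unchanged at the moment of the reset; only subsequent updates differ.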

The reset frequency is determined by the --relora parameter (in the number of update steps, not global steps). The optimizer reset options are:

"--reset_optimizer_on_relora", default=True, type=lambda x: x.lower() == "true"
"--optimizer_random_pruning", default=False, type=float
"--optimizer_magnitude_pruning", default=False, type=float

We found that --optimizer_magnitude_pruning 0.9 or plain --reset_optimizer_on_relora usually performs well. Note that --reset_optimizer_on_relora is True by default, so you need to pass --reset_optimizer_on_relora False --optimizer_magnitude_pruning 0.9 if you want to do magnitude pruning.
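For intuition only, a minimal sketch of what magnitude-based pruning of Adam-style optimizer state can look like; the exact criterion and implementation details here are assumptions and may differ from this repository's code.

import torch

@torch.no_grad()
def magnitude_prune_optimizer_state(optimizer, pruning_ratio=0.9):
    # Zero out the fraction of moment entries with the smallest magnitude
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p)
            if not state or "exp_avg" not in state:
                continue
            threshold = torch.quantile(state["exp_avg"].abs().flatten().float(), pruning_ratio)
            keep = state["exp_avg"].abs() >= threshold
            state["exp_avg"].mul_(keep)
            state["exp_avg_sq"].mul_(keep)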

ReLoRA currently only supports a cosine-decay learning rate scheduler, specifically cosine_restarts, which works in a cyclical mode and repeats the warmup every --cycle_length update steps.
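For reference, a minimal sketch of such a cyclical schedule expressed as a LambdaLR multiplier; the exact curve (e.g. the minimum LR ratio) is an assumption and may not match cosine_restarts in this repository.

import math
from torch.optim.lr_scheduler import LambdaLR

def cosine_restarts_factor(step, warmup_steps=500, cycle_length=5000,
                           restart_warmup_steps=100, min_ratio=0.1):
    if step < warmup_steps:                       # initial linear warmup
        return step / max(1, warmup_steps)
    cycle_step = step % cycle_length              # position within the current cycle
    if step >= cycle_length and cycle_step < restart_warmup_steps:
        return cycle_step / restart_warmup_steps  # short re-warmup right after each reset
    progress = cycle_step / cycle_length
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

# scheduler = LambdaLR(optimizer, lr_lambda=cosine_restarts_factor)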

Warm starts

You can start ReLoRA from a partially trained checkpoint. To do that, provide the --warmed_up_model option. For example:

torchrun torchrun_main.py ... <other options> .. --warmed_up_model checkpoints/llama_1b-2023-05-05-20-12-43/model_1000

Distributed training

We support single-node distributed training using vanilla PyTorch DDP. The main.py script does not have all the features required for ReLoRA and will be deleted soon. We recommend using torchrun --nproc-per-node 1 even for single-GPU training.

An example of using torchrun

torchrun --nproc-per-node 8 torchrun_main.py \
    --model_config configs/llama_35m.json \
    --use_peft \
    --lora_r 128 \
    --relora 500 \
    --cycle_length 500 \
    --warmup_steps 250 \
    --reset_optimizer_on_relora False \
    --lr 0.001 \
    --batch_size 60 \
    --total_batch_size 480 \
    --num_training_steps 5000 \
    --save_every 5000 \
    --dtype bfloat16 \
    --tags relora_debug,example

Where --nproc-per-node is the number of GPUs you are using.

Citation

@misc{lialin2023stack,
    title={Stack More Layers Differently: High-Rank Training Through Low-Rank Updates},
    author={Vladislav Lialin and Namrata Shivagunde and Sherin Muckatira and Anna Rumshisky},
    year={2023},
    eprint={2307.05695},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


relora's Issues

RFC - multinode training

Hi
I would like to implement multi-node training. Will you accept this kind of contribution?
What are the known gaps toward multinode training?
Thanks
Omri

Does this model continue to improve in the large-scale regime?

Hi! Thank you for this insightful work.
I wonder how your research on large language models is going.
Is your 350M run finished?
I'm also curious how good billion-scale models can be.
Do you have any plans to scale up your experiments?

Inquiry About Minimum VRAM Requirement for ReLoRA Model Training

I recently came across your paper on "ReLoRA: Low-Rank Training for Efficient Neural Networks" but I'm unsure about the technical details, especially when it comes to hardware requirements. Could you provide some guidance on the minimum VRAM needed for implementing ReLoRA effectively?

Training scripts

Hi,

Thank you for the great research and making the code available!!
In the paper, you mentioned that you get a 30% memory consumption reduction and a 52% training throughput increase with ReLoRA on a 1.3B parameter model, right? Can you share the training script for that?

About ReLoRA usage

@Guitaricet Hi, great work! I'm interested in using ReLoRA for text-to-image generation but have several questions.

From my understanding, as written in Algorithm 1 of the paper, ReLoRA merges the learned W_{A}, W_{B} into W from the frozen model at each restart, so ReLoRA actually trains a bunch of W_{A} and W_{B}. In comparison, LoRA only trains one W_{A}, W_{B}.

If so, since the W_{A} and W_{B} have already been merged into W (the frozen model gets updated), the ReLoRA weights cannot be made pluggable like LoRA's. An alternative would be to save all groups of [W_{A}, W_{B}] and load them during inference.

Please correct me if I'm wrong.

Inf checks warning on optimizer

I'm attempting to integrate this into a larger modular environment (PyTorch Lightning).

When I attempt to reset the optimizer states like you have here:

    def reset_optimizer(optimizer):
        # Zero out Adam's first and second moment estimates for every parameter
        for group in optimizer.param_groups:
            for p in group["params"]:
                param_state = optimizer.state[p]
                param_state["exp_avg"] = torch.zeros_like(p.data)
                param_state["exp_avg_sq"] = torch.zeros_like(p.data)

Upon running the next loop, I get the error:

File "/home/user/anaconda3/envs/science/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."

I'm sure you must've encountered this at some point yourself, so I'm curious how you managed to avoid this. Previously I had managed by just re-initializing the optimizer through accelerate, and I expected not using accelerate would fix the issue too, but it appears not to.

versions of the dependencies

Hello, could you please specify the versions of the dependencies? It looks like the code doesn't work with the latest versions (a pip freeze would help).
