pytorch / torchtune Goto Github PK

A Native-PyTorch Library for LLM Fine-tuning

Home Page: https://pytorch.org/torchtune/main/

License: BSD 3-Clause "New" or "Revised" License

Python 99.82% Shell 0.18%

torchtune's Introduction

torchtune

Note

July 2024: torchtune has updated model weights for Llama3.1 in source and nightly builds! Check out our configs for both the 8B and 70B versions of the model. LoRA, QLoRA, and full finetune methods are supported. Support for QLoRA 405B will be added soon.

Introduction

torchtune is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. We're excited to announce our alpha release!

torchtune provides:

Native-PyTorch implementations of popular LLMs using composable and modular building blocks
Easy-to-use and hackable training recipes for popular fine-tuning techniques (LoRA, QLoRA) - no trainers, no frameworks, just PyTorch!
YAML configs for easily configuring training, evaluation, quantization or inference recipes
Built-in support for many popular dataset formats and prompt templates to help you quickly get started with training

torchtune focuses on integrating with popular tools and libraries from the ecosystem. These are just a few examples, with more under development:

Hugging Face Hub for accessing model weights
EleutherAI's LM Eval Harness for evaluating trained models
Hugging Face Datasets for access to training and evaluation datasets
PyTorch FSDP for distributed training
torchao for lower precision dtypes and post-training quantization techniques
Weights & Biases for logging metrics and checkpoints, and tracking training progress
Comet as another option for logging
ExecuTorch for on-device inference using fine-tuned models
bitsandbytes for low memory optimizers for our single-device recipes

Models

torchtune currently supports the following models.

Model	Sizes
Llama3.1	8B, 70B [models, configs]
Llama3	8B, 70B [models, configs]
Llama2	7B, 13B, 70B [models, configs]
Code-Llama2	7B, 13B, 70B [model, configs]
Mistral	7B [model, configs]
Gemma	2B, 7B [model, configs]
Microsoft Phi3	Mini [model, configs]
Qwen2	0.5B, 1.5B, 7B [model, configs]

We're always adding new models, but feel free to file an Issue if there's a new one you would love to see in torchtune!

Fine-tuning recipes

torchtune provides the following fine-tuning recipes.

Training	Fine-tuning Method
Distributed Training [1 to 8 GPUs]	Full [code, example], LoRA [code, example]
Single Device / Low Memory [1 GPU]	Full [code, example], LoRA + QLoRA [code, example]
Single Device [1 GPU]	DPO [code, example], RLHF with PPO [code, example]

Memory efficiency is important to us. All of our recipes are tested on a variety of setups including commodity GPUs with 24GB of VRAM as well as beefier options found in data centers.

Single-GPU recipes expose a number of memory optimizations that aren't available in the distributed versions. These include support for low-precision optimizers from bitsandbytes and fusing optimizer step with backward to reduce memory footprint from the gradients (see example config). For memory-constrained setups, we recommend using the single-device configs as a starting point.

This table captures the peak memory usage and training speed for recipes in torchtune.

Example HW Resources	Finetuning Method	Model	Setting	Peak Memory per GPU (GB)	Training Speed (tokens/sec)
1 x RTX 4090	QLoRA **	Llama2-7B	Batch Size = 4, Seq Length = 2048	12.3 GB	3155
1 x RTX 4090	LoRA	Llama2-7B	Batch Size = 4, Seq Length = 2048	21.3 GB	2582
2 x RTX 4090	LoRA	Llama2-7B	Batch Size = 4, Seq Length = 2048	16.2 GB	2768
1 x RTX 4090	Full finetune *	Llama2-7B	Batch Size = 4, Seq Length = 2048	24.1 GB	702
4 x RTX 4090	Full finetune	Llama2-7B	Batch Size = 4, Seq Length = 2048	24.1 GB	1388
8 x A100	LoRA	Llama2-70B	Batch Size = 4, Seq Length = 4096	26.4 GB	3384
8 x A100	Full Finetune *	Llama2-70B	Batch Size = 4, Seq Length = 4096	70.4 GB	2032

*= Uses PagedAdamW from bitsandbytes

**= Uses torch compile

Llama3 and Llama3.1

torchtune supports fine-tuning for the Llama3 8B and 70B size models. We currently support LoRA, QLoRA and full fine-tune on a single GPU as well as LoRA and full fine-tune on multiple devices for the 8B model, and LoRA on multiple devices for the 70B model. For all the details, take a look at our tutorial.

Note

Our Llama3 and Llama3.1 LoRA and QLoRA configs default to the instruct fine-tuned models. This is because not all special token embeddings are initialized in the base 8B and 70B models.

In our initial experiments for Llama3-8B, QLoRA has a peak allocated memory of ~9GB while LoRA on a single GPU has a peak allocated memory of ~19GB. To get started, you can use our default configs to kick off training.

Single GPU

LoRA 8B

tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device

QLoRA 8B

tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device

Full 8B

tune run full_finetune_single_device --config llama3_1/8B_full_single_device

Multi GPU

Full 8B

tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full

LoRA 8B

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_1/8B_lora

LoRA 70B

Note that the download command for the Meta-Llama3 70B model slightly differs from download commands for the 8B models. This is because we use the HuggingFace safetensor model format to load the model. To download the 70B model, run

tune download meta-llama/Meta-Llama-3.1-70b --hf-token <> --output-dir /tmp/Meta-Llama-3.1-70b --ignore-patterns "original/consolidated*"

Then, a finetune can be kicked off:

tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_1/70B_lora.yaml

You can find a full list of all our Llama3 configs here and Llama3.1 configs here.

Installation

Step 1: Install PyTorch. torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for fine-tuning multimodal LLMs and torchao for the latest in quantization techniques, you should install these as well.

# Install stable version of PyTorch libraries using pip
pip install torch torchvision torchao

# Nightly install for latest features
pip install --pre torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu121

Step 2: The latest stable version of torchtune is hosted on PyPI and can be downloaded with the following command:

pip install torchtune

To confirm that the package is installed correctly, you can run the following command:

tune --help

And should see the following output:

usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

...

You can also install the latest and greatest torchtune has to offer by installing a nightly build.

Get Started

To get started with fine-tuning your first LLM with torchtune, see our tutorial on fine-tuning Llama2 7B. Our end-to-end workflow tutorial will show you how to evaluate, quantize and run inference with this model. The rest of this section will provide a quick overview of these steps with Llama2.

Downloading a model

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

Llama3 download

tune download meta-llama/Meta-Llama-3-8B \
--output-dir /tmp/Meta-Llama-3-8B \
--hf-token <HF_TOKEN> \

Tip

Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens

Running fine-tuning recipes

Llama3 8B + LoRA on single GPU:

tune run lora_finetune_single_device --config llama2/7B_lora_single_device

For distributed training, tune CLI integrates with torchrun. Llama3 8B + LoRA on two GPUs:

tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full

Tip

Make sure to place any torchrun commands before the recipe specification. Any CLI args after this will override the config and not impact distributed training.

Modify Configs

There are two ways in which you can modify configs:

Config Overrides

You can easily overwrite config properties from the command-line:

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
batch_size=8 \
enable_activation_checkpointing=True \
max_steps_per_epoch=128

Update a Local Copy

You can also copy the config to your local directory and modify the contents directly:

tune cp llama2/7B_full ./my_custom_config.yaml
Copied to ./7B_full.yaml

Then, you can run your custom recipe by directing the tune run command to your local files:

tune run full_finetune_distributed --config ./my_custom_config.yaml

Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.

Design Principles

torchtune embodies PyTorch’s design philosophy [details], especially "usability over everything else".

Native PyTorch

torchtune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g. Hugging Face Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.

Simplicity and Extensibility

torchtune is designed to be easy to understand, use and extend.

Composition over implementation inheritance - layers of inheritance for code re-use makes the code hard to read and extend
No training frameworks - explicitly outlining the training logic makes it easy to extend for custom use cases
Code duplication is preferred over unnecessary abstractions
Modular building blocks over monolithic components

Correctness

torchtune provides well-tested components with a high-bar on correctness. The library will never be the first to provide a feature, but available features will be thoroughly tested. We provide

Extensive unit-tests to ensure component-level numerical parity with reference implementations
Checkpoint-tests to ensure model-level numerical parity with reference implementations
Integration tests to ensure recipe-level performance parity with reference implementations on standard benchmarks

Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.

@SalmanMohammadi for adding a comprehensive end-to-end recipe for Reinforcement Learning from Human Feedback (RLHF) finetuning with PPO to torchtune
@fyabc for adding Qwen2 models, tokenizer, and recipe integration to torchtune
@solitude-alive for adding the Gemma 2B model to torchtune, including recipe changes, numeric validations of the models and recipe correctness
@yechenzhi for adding Direct Preference Optimization (DPO) to torchtune, including the recipe and config along with correctness checks

Acknowledgements

The Llama2 code in this repository is inspired by the original Llama2 code.

We want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune.

We also want to acknowledge some awesome libraries and tools from the ecosystem:

gpt-fast for performant LLM inference techniques which we've adopted OOTB
llama recipes for spring-boarding the llama2 community
bitsandbytes for bringing several memory and performance based techniques to the PyTorch ecosystem
@winglian and axolotl for early feedback and brainstorming on torchtune's design and feature set.
lit-gpt for pushing the LLM fine-tuning community forward.
HF TRL for making reward modeling more accessible to the PyTorch community.

License

torchtune is released under the BSD 3 license. However you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.

torchtune's People

Contributors

Stargazers

Watchers

Forkers

adithya-s-k techthiyanes evdcush dattgoswami dhilip2002 shashipal95 shahinsharifi munirabobaker nicolashug dsoldatkin tcapelle wodole ebsmothers lixin97 rohan-varma markstur positioner mindok7520 leeds1219 apollohuang1 yechenzhi zhaoyongke skcoirz spikeparaffa heidongxianhua petrex stophobia jithunnair-amd iseeyuan samedovzaur1 swayaminsync raymondbernard eshassy wauplin alband kartikayk rdoublea yf225 slr722 svekars efenocchi donwany weifengpy joecummings skaiphd lessw2020 jerryzh168 kustomzone cdrhim mohsaad andreslavescu cdgleber jason-tadsilva alexandor91 snoopycn dattheshshenoy karthik0899 alexrs rumi381 shreyaspimpalgaonkar maxe-xq drchrislevy maximegmd diningsystem ejaygit ebeyabraham denadai2 deluair machina-source wbing520 supernovae sterling312 chandan0000 whatif-dev franzbischoff lukepark327 buaalearn jangocheng ntdxyg rickyhong sartify jh941213 weedge hesam7711 refractai omidsufi apthagowda97 salmanmohammadi eltociear dahsh yxli2123 clarysf gunungtravia sungaiglasis hutansilon minmin2411 ysfadlaa kotamadelin joshdayax alxsbr2411

torchtune's Issues

Disorganized feedback from a quick look

Hi all, I took a brief look at the code, just sharing some thoughts / tips / questions below. Of course, feel free to ignore everything, they may not be relevant considering the lib is still on its early stage.

Docs

1. Single backticks are being used to refer to code e.g. `hidden_dim`

https://github.com/pytorch-labs/torch_tbd/blob/304c88a9765493ac29624d6d770ffaaceccb7312/llm/llama2/feed_forward.py#L24

The docs are written in rst, not markdown, so typically the pytorch-sphinx-theme expects those to be double backticks instead of single ones, like ``hidden_dim``. Single backticks will be rendered as italics instead of as "code".

2. For parameters that have a default value, it's helpful to clarify that they're optional in the docstring. Otherwise, users have to read till the end to actually learn that. For the one above, it could be:

multiple_of (int, optional) : ...

BTW, not all defaults are currently documented, e.g. these ones are missing.

3. I think it's best to avoid documenting types with a type-annotation syntax, e.g. as in:

https://github.com/pytorch-labs/torch_tbd/blob/304c88a9765493ac29624d6d770ffaaceccb7312/llm/scripts/checkpoint/convert_llama2_to_native.py#L41

This would be better simply as

num_kv_heads: (int, optional):

Why? Because type-annotation syntax can become extremely hard to parse for humans, really quick. They're meant for type checkers, not humans. My favourite example of that is from torchvision where we document a parameter simply as

colors (color or list of colors, optional)

when its type is Optional[Union[List[Union[str, Tuple[int, int, int]]], str, Tuple[int, int, int]]]. We don't want to force users to read or understand that atrocity.

(Also, fun fact, Optional[] doesn't mean that a parameter is optional, it just mean that it can be None, which is completely different.)

Quick disclaimer: I hate type annotations and in particular type checkers with a burning passion. To be clear, I'm not saying you shouldn't be using type annotations - I'm just advocating to keep them away from docs.

Tests

Since you're using pytest, you might want to implement an autouse fixture that prevents the RNG of each test to leak into the other tests. This thing has resolved the source of some major headaches we've had in torchvision.

2. You can push it further and create another fixture that will actually set the RNG of each test automatically. In case you have a random test for which you forgot to set the seed, at least now you know that the seed will be the same across executions, which will reduce flakyness. aaaaaah you already have it, nice

This goes with the disclaimer above about annotations but OH GOD PLEASE DON'T TYPE-ANNOTATE YOUR TESTS :'(. You don't need to clarify that def dim(self): return 8 returns an int.
Small comments about what a test is doing and what it's checking go a looooog way, especially when you're looking to onboard contributors. Or for yourself, 6 months from now.
I saw fixed_init_tensor, perhaps there's a good reason for the tensor to be fixed, but if not it could just be random with a fixed random seed (just ICYMI there is torch.testing.make_tensor which may help)

Questions

Why did you override the error message for torch.testing.assert_close in assert_expected? I find the default error message to be fairly good especially when comparing big tensors (printing the whole thing as done in assert_expected would be super noisy). If you want to, you can configure pytest's verbose level locally on the CLI, or globally in pytest.ini
I've never seen a code-base make so much use of fixtures... Is there a reason to that? This might be very personal but it feels a bit overboard for me, especially for those fixtures which are only used once in a single test like this one. Using the fixture doesn't bring clarity IMHO and it actually makes the tests harder to read as one has to scroll up to figure out what the test values are.

Move collate_fn from finetune alpaca to a collection of collate functions (looks like this is specific to certain types of instruction - prompt data)

This can help with unit testing and composability and also reduce the clutter in the recipe code

Initialize kv cache w/num_kv_heads instead of num_heads

This will save memory for GQA / MQA, but will require a bit of refactor to attention forward pass.

Add flexibility to `GenerationUtils`

Allow users to use GenerationUtils without the following stipulations

Needing to accept a curr_pos argument
Needlessly setting incremental decoding to False

Fix pre-commit hooks

Seems that the pre-commit is broken; we should prioritize fixing this.

Generalize Transformer API rather than making it specific to SentencePiece

#116 (comment)

CUDA_VISIBLE_DEVICES w/torchrun does not work

i.e.

CUDA_VISIBLE_DEVICES=1,2,3,4 torchrun --nnodes 1 --nproc_per_node 4 recipes/finetune_llm.py --config recipes/configs/alpaca_llama2_finetune.yaml --autocast-precision bf16 --fsdp True --batch-size 1 --run-generation 50 --optimizer AdamW

results in ncclInvalidUsage from NCCL library. Should figure out whether this is from torchrun or our integration and fix appropropriatly.

Directory structure of tests folder should be changed to mimic the torchtune src code directory structure

Currently it has the old structure

Add Mistral 7b

Changing max_seq_len of model breaks checkpoint load in finetune_llm

load_state_dict fails specifically complaining about loading the RoPE cache. We probably don't need to load the RoPE cache anyways as its recomputed.

Users, Training abstractions, and repo design

Users

To be able to understand how we want to design the abstractions and the repo, it's important that we define how we want our users to interact with the library. One big complaint we heard from a number of users was around the black box nature of the fine tuning libraries they were using. On the other hand we also spoke to users who were very new to ml and were happy just to directly launch scripts without ever looking inside. To address this, I propose we model three levels of users and build our library to allow users to advance through these stages as they grow in requirements.

User 1:

This user just wants to use a recipes on their own dataset. They may want to play around with a couple of parameter changes but this user would be happiest with cli level control and access to good default values for a particular recipe.

torchrun llama2_finetune.py --dataset ./my_dataset

These users could use huggingface datasets we support or there own. If they used their own dataset they would be responsible for providing the dataset object with included transform.

User 2:

This user wants to be able to fully customize the recipe but does not want to have to figure out how to build everything from scratch. This user would likely create their own repo with a torchtune dependency and then edit the recipe file themselves.

copy llama2_finetune.py to my_finetune.py
edit my_finetune.py directly with no need to change any library files

To enable this user it's very important that our recipes our hackable, self contained, and easily readable. For this reason we encourage copy and pasting the training loop as we add more recipes.

User 3:

This final user has their own training setup and custom solutions and our just looking for access to some of our components such as models, recipe defaults, or specific trainer utils they don't have. For this user our recipe just acts as an example and they should be able to use our components a la carte in the same way we do in our recipes.

Training Abstractions

I will include pseudocode here for how we can design our recipes to support the user profiles listed above.

parser = argparse.ArgumentParser(...) # support all getters, lr, batchsize, epoch, dtype, num_devices, seed

model = get_model(...)  # get from our models, or our PEFT models
model = tune.distributed(model, num_devices) # FSDP wrap
dataset = get_dataset(...) # one of our datasets, which point to hf datasets
dataloader = Dataloader(dataset, ..., collate_fn=tune.llm_collate())
loss = get_loss(...)

logger = get_logger(...) # wandb, tensorboard, etc
checkpoint = get_checkpoint(...) # checkpointer, sharded_checkpointer, etc
evaluator = get_eval(...) # eval harness to use

opt = get_optimizer(...)
opt = tune.wrapped_optimizer(opt, num_devices) # autocast + distributed logic
device = get_device(num_devices)
dtype = dtype

for e in epoch:
	for step, batch in enumerate(dataloader):
		opt.zero_grad():

		data = data.to(device)

		with tune.cast(dtype):  # choose the appropriate autocast if distributed
			out = model(data)
			l = loss(out, data)

		l.backward()
		opt.step()

		logger(model, l, step, epoch)
		checkpoint(model, opt, dataloader, step, epoch)
		evaluator(model, step, epoch)
		profiler(...)

The above example, if we agree on it, should act as a guide for the direction of our recipe code. Initially none of the getter functions and utils will be built so there will be much more boilerplate but we can reduce this with time.

User 1: call script directly from cli and can change out any of the getter options. We will also need to provide them good defaults for specific combinations. We can do this as python files or yaml.
User 2: copies just this file and can manually swap out any line or add custom logic directly into the loop. No need to edit our library to use it.
User 3: They can use the getters they want in their own training code.

The above design roughly approximates a trainer but it's easily editable. It also doesn't require us to make it support every fine tuning concept. For example, the above script might be called finetune_llm.py which can be reused for a lot of recipes. But if a more exotic training setup came along we can just copy it and make a new one for different types of training without complexifying this one.

Repo Design

Finally a note on repo design. I think all of the components should be groped together so that they serve as a library within the library for the recipes to access and for user 3 to access directly with clean imports. The components should be grouped according to the getter functions.

recipes/

defaults/

tune/
    models/    # including peft wrapped models
    datasets/
    trainer/     # util folder
    ...

Missing things from README / for getting started

dependencies aren't documented:

conda install pytorch pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

sentencepiece is also missing
numpy missing (discovered after kicking off a finetune)
datasets missing (discovered after kicking off a finetune)

Generator created inside ReproducibleDataLoader leads to creating same RNG state for same worker id in each rank

In ReproducibleDataLoader, the dataloader_seed was used create a generator which was passed into DataLoader init. This will create different RNG state for each worker for that rank (using rand(base_seed) + worker_id). But it will be the same RNG state in each worker id in each rank. That is, worker id 3 in rank0 and worker id 3 in rank1 will have the same RNG state. This is a problem.

Note that this is not an issue in torch core because there is no explicit control on determinism of transforms. That is, the torch.initial_seed for each trainer process will be different and thus each worker id will have different RNG state. When we tried to control the RNG state, we ended up making it too restrictive which is incorrect. RNG state needs to be different in each worker id across world size.

In order to fix this, the seed for each rank should be different yet deterministic - seed(base_seed + rank_id). Note that the worker id seed is created using (random(seed) + worker_id) AND not (seed + worker_id). The latter would have been problematic as seeds could easily overlap.

Add the APIs to the docs

Let's take a look at the doc build instructions and start documenting the new APIs we create. I don't mean just to write docstrings, I mean to expose your APIs through the docs (in index.rst or in other files linked from there).

Also, we should take a few minutes to document the ones that already exist - and check that they render properly. Since this is everyone's responsibility, I'd suggested that each inidividual simply documents all the APIs they previously created.

The doc layout doesn't have to be absolutely perfect just now, but we have to start building that muscle already. A few weeks before the MVP, it will be too late, and there will always be an excuse to delay it. We're already spending time writing the docstrings, which is amazing - but if our docstrings aren't published on the docs, they might as well not exist! So give your work some visibility, add it to the docs.

Note: We have a doc build job in CI now. See here to check how to retrieve the built docs (but building locally will be much easier).

add print0 util

As per discussion: #171 (comment)

Add TorchFix linter to CI

Follow instructions pytorch/tnt#569

Make APIs default to private

Per our discussions we should default to leading underscores for file names and methods that we do not intend as public APIs. We should also import public APIs on init of their respective module

Unittestable components

Since we have an initial training loop, let's make sure each component can be individually unittested.

Build docs fails with no module named torchtune.models

Hi all, first time setting up torchtune. I've cloned, pip installed from the torchtune directory, then cd into docs and pip installed those requirements as well. When I run make html I get the following error:

(torchtune) rafiayub@rafiayub-mbp docs % make html
Running Sphinx v5.0.0
Using Sphinx-Gallery to convert rst text blocks to markdown for .ipynb files.
[autosummary] generating autosummary for: index.rst

Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x1073f4c10> for event 'builder-inited' threw an exception (exception: no module named torchtune.models)
make: *** [html] Error 2

I was able to verify that tune recipe list worked and I could import torchtune in python

Add ReproducibleDataloader.state_dict functionality

For the dataloader to be truly reproducible it needs to support checkpointing. While the OSS checkpointer doesn't support a statedict we should add this functionality to ours and check for a state dict attributed during checkpointing.

The statedict should include all of the user parameters for the dataloader instance along with the epoch, step, and global rank. If global rank changes we should likely return an error or warning that we cannot guarantee reproducibility.

On load, the dataloader should return a sampler that skips forward to the last step used but still maintains the full len so that any progress bars are accurate.

Add unittest CI jobs for 3.8 and 3.9

The tests currently run on 3.10 and 3.11. The next Pytorch release will run on 3.8-3.11. The following one will probably be 3.8 - 3.12 (perhaps 3.8 will be dropped, I'm not sure). It is rare, but not too uncommon either to have failures that are specific to one Python version, especially when dependencies are involved.

Unit tests should work for all environments

Currently when I execute pytest tests/ on my devserver with one gpu, all tests don't pass. This is because some unit tests require two GPUs to be present. It would be nice to have all unit tests succeeding for users who clone the project and run them. And user segment with single gpu is a common user base as well as a developer base

Create MockDistributed for testing dataloader in distributed mode

title. currently the unit tests only tests multiworker cases

Design Philosophy and Best Practices

TorchTune embodies PyTorch’s design philosophy [details], especially “usability over performance”. The code base is designed to be easy to read, use (and re-use) and extend. This issue captures some of my thoughts on our design philosophy and best practices we should use. Would love some discussion around these.

Simplicity

TorchTune code should be easy to read, use (and re-use) and extend. We expect AND want users to read the internals of the codebase. Simple code is better than complex tricks. While implementing a feature, keep in mind that not every user will be a domain expert. For example, in most cases simply re-writing a class (even if this duplicates code) might be a more desirable strategy than utilizing complex module-swapping logic which only a subset of users will understand.

Native PyTorch

Users shouldn’t need to learn N different frameworks to understand or contribute to the core of TorchTune. They only need to understand PyTorch.
We should provide integrations with other libraries and frameworks where these make sense. But these integrations should not “pollute” the code base. Provide these through wrapper functions around native implementations. This will also make it easier to debug issues due to breakages in these external libraries.

Correctness and Stability

PyTorch has very high user-trust. TorchTune should cultivate the same.

Components should have unit-tests to ensure numerical parity with reference implementations, and to catch breakages.
Model implementations should have checkpoint-tests to ensure output parity with reference implementations, and to catch breakages.
Training recipes should have integration tests to ensure performance parity with reference implementations on standard benchmarks, and to catch breakages.
Clearly classify external APIs with “stable” or “experimental” tags to establish user expectation.

Best Practices

Modular Blocks instead of Monolithic Classes. Stuffing all of the logic into a single class limits readability and makes it hard to reuse logic. Think about breaking the implementation into self-contained blocks which can be used independently from a given model. For example, attention mechanisms, embedding classes, transformer layers etc.
Say no to Inheritance. You really don’t need it AND it makes the code much harder to understand or refactor since the logic is spread across many files/classes. Where needed, consider using Protocols.
Clean Interfaces. There’s nothing more challenging than reading through functions/constructors with ~100 parameters. Think carefully about what needs to be exposed to the user and don’t hesitate to hard-code parameters until there is a need to make them configurable.
Intrusive Configs. Config objects should not intrude into the class implementation. Configs should interact with these classes through cleanly defined builder functions which convert the config into flat parameters needed to instantiate an object.
Limit Generalization. Attempting to generalize code before this is needed unnecessarily complicates implementations - you are anticipating use cases you don’t know a lot about. When you actually need to generalize a component, think about whether it’s worth it to complicate a given interface to stuff in more functionality. Don’t be afraid of code duplication if it makes things easier to read.
Value Checks and Asserts. Don’t check values in higher level modules - defer the checks to the modules where the values are actually used. This helps reduce the number of raise statements in code which generally hurts readability, but are critical for correctness.

Benchmark QPS

We should benchmark our QPS against other libraries such as HF, lit-gpt to see if we are in the same ballpark and understand our perf competitiveness.

Enabling User2 workflow

This is a follow-up to the discussion we had yesterday on the team meeting.

User2 is defined in @pbontrager 's #54

The Problem

I'm user 2. My workflow typically is to copy a recipe from torchtune, edit it to my needs, and run it.

copy a recipe from torchtune

That's the part that we need to very clearly define.

Am I copying the recipe from the main branch of the torchtune repo?
- The pb here is that I'm relying on the stable version of torchtune. But the recipe on the main branch, it's tracking the dev version of torchtune, and it probably contains some code and utilities that I don't have access to in my stable torchtune version. So I can't run it :(
  - This isn't something that may be a problem, this is something that will be a problem. It happens all the time in torchvision, and torchvision isn't even a recipe-centric repo (example1, example2).
Am I copying the recipe from... where tochtune was installed? (e.g. some very-hard-to-fine-place like /home/nicolashug/.miniconda3/envs/myenv/lib/python3.10/site-packages/torchtune/assets ??)

We need a blessed way to copy/paste the training recipes for a given stable version of torchtune

It's important to understand that this problem exists regardless of the repo structure that we have, and regardless of whether we are bundling the recipes as part of the package, or as assets/resources.

BTW, to enable User1 workflow, having the recipes as assets / resources in the package is probably a good solution, as Philip already suggested in other channels.

Back to User2: I don't have a perfect solution to suggest, I just wanted to flag something we need to think about. Some random thoughts:

we probably want a visible disclaimer on top of the READMEs in the recipes (and scripts and configs) to tell users to checkout the repo in a state corresponding to their stable release.
Thinking of a blessed way to copy recipes: what about a CLI that would copy-paste the relevant files of a recipe for a given version??

torchtune make_recipe --recipe=<recipe-name> --version=0.2 --output=...

This would make sure to have the proper finetune_llm.py file with the appropriate configs, etc.?
(by default, the version would just be the current version of torchtune)

Simple `gradio` script to showcase fine-tuned models.

convert_llama2_to_native.py fails with `CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)``

I am getting this error when running the checkpoint conversion script: https://gist.github.com/daniellepintz/7ecac4abfd6b6aa3f7e4ea896b7c6d16

Command:

python -m scripts.llama2_checkpoint.convert_llama2_to_native --checkpoint_path ~/llama/llama-2-7b/consolidated.00.pth

Note: I also needed to comment out this line:

https://github.com/pytorch-labs/torchtune/blob/308c022b02422d9308e8de63a3f0f6a6842f0209/scripts/llama2_checkpoint/convert_llama2_to_native.py#L294

Add autocast and scaler

Proper mask caching

#33 (comment) @joecummings

Consistent `log` throughout code

Converting llama checkpoint to pytorch native checkpoint failure

Here is what I attempted:

~/torchtune (main)]$ python3 -m torchtune.llm.scripts.checkpoint.convert_llama2_to_native --checkpoint_path ../llama/llama-2-7b/consolidated.00.pth --device cuda:0

llama-2-7b/consolidated.00.pth is from the llama 7b download

This resulted in assertion error:

$ python3 -m torchtune.llm.scripts.checkpoint.convert_llama2_to_native --checkpoint_path ../llama/llama-2-7b/consolidated.00.pth --device cuda:0
WARNING:main:Warning: rope.freqs in orig state_dict, but not mapped!
Traceback (most recent call last):
File "/home/gokulg/.conda/envs/torch-tune/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gokulg/.conda/envs/torch-tune/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/gokulg/torchtune/torchtune/llm/scripts/checkpoint/convert_llama2_to_native.py", line 294, in
assert torch.allclose(x, y), f"{x} vs {y}"
AssertionError: -381624064.0 vs -382238464.0

Add CLI and Recipes to Docs

As part of our user workflow we want the recipes to be in the docs to make it easy for the user to copy and paste the recipe. We need to add the recipe to the docs along with documentation on the CLI flow for users to run and edit recipes.

This is a followup on this PR

Catch the error message when using `pytest.raises`

There's a lot of usage of pytest.raises, which is great:

(torchtune) ➜  tune git:(main) ✗ git grep pytest.raises
tests/torchtune/llama2/test_attention.py:        with pytest.raises(Exception):
tests/torchtune/llama2/test_attention.py:        with pytest.raises(Exception):
tests/torchtune/llama2/test_reproducible_dataloader.py:        with pytest.raises(ValueError):
tests/torchtune/llama2/test_transformer_decoder.py:        with pytest.raises(Exception):
tests/torchtune/llama2/test_transformer_decoder.py:        with pytest.raises(ValueError):
tests/torchtune/utils/test_logits_transforms.py:        with pytest.raises(ValueError):
tests/torchtune/utils/test_logits_transforms.py:            with pytest.raises(ValueError):
tests/torchtune/utils/test_logits_transforms.py:        with pytest.raises(ValueError):
tests/torchtune/utils/test_logits_transforms.py:        with pytest.raises(TypeError):

Let's make it even better and use the match= parameter:

pytest.raises(..., match="part of the error message here")

This is very helpful when reading the test to understand why the error is being thrown, and it also helps finding where it happens by doing e.g. git grep "part of the error message".

Also, unless you're actually doing raise Exception (but really, should you??), try catching the subclass(es) rather than just the base Exception class, which is not very informative.

Add a GPU runner to execute unit tests that require CUDA

Currently some tests are being skipped if CUDA isn't present.

one example: https://github.com/pytorch-labs/torchtune/blob/main/tests/torchtune/utils/test_env.py#L26

Ranom transforms when used with num_workers don't provide reproducible dataloader batches across invocations

This is because the generator that is used to set the RNG state of dataloader workers doesn't impact the runs with num_workers = 0. Therefore, any transforms that run as part of the Dataset are not controlled through the dataloader_seed when num_workers = 0.

See the fix here - #161 for more info on this issue. Once that PR lands, this issue would be fixed.

Do not use type annotations in docstrings

I think it's best to avoid documenting types with a type-annotation syntax, e.g. as in:

num_kv_heads: (Optional[int]): pytorch-labs/torch_tbd@304c88a/llm/scripts/checkpoint/convert_llama2_to_native.py#L41

This would be better simply as

num_kv_heads: (int, optional):

colors (color or list of colors, optional)

when its type is Optional[Union[List[Union[str, Tuple[int, int, int]]], str, Tuple[int, int, int]]]. We don't want to force users to read or understand that atrocity.

(Also, fun fact, Optional[] doesn't mean that a parameter is optional, it just mean that it can be None, which is completely different.)

I see that this is enforced by pydoclint. The problem is that --arg-type-hints-in-docstring=False would prevent any docstring type, even the good ones like just some_param : (float).

It also seems impossible to bypass an entire error code like DOC105.

I don't know how to solve this apart from stop using pydoclint. But we should find a solution for this.

Use keyword-only arguments where relevant

It's worth considering using keyword-only arguments for all classes and functions, i.e. arguments that have to be specified with their name: param=1234. This has many advantages, beyond the pure coding-style consideration:

It will allow you to introduce a default to a parameter without a BC-breaking (see e.g. pytorch/vision#3776 for an illustration of that kind of problem)
It allows adding new parameters closer to the other relevant parameters, instead of adding new ones at the end. Over time, this helps avoiding "scattered signatures" where 2 related parameters are documented far apart.

Some projects have started scrictly adopting those for these reasons (e.g. https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep009/proposal.html)

Below this line are edits on the original issue to track locations we want to move to keyword-only arguments.

Locations to refactor

llama2 builder
- @joecummings has volunteered for this one

... Please add other functions or classes here

Investigate `pin_memory` as a more effective way to move training tensors from device to device

Add conversion script as an integration test

Checkpoint conversion + parity verification is essential to ensuring we can safely iterate on components without correctness being affected, to avoid issues like #152, we should add this to CI.

Add seed to ReproducibleDataLoader

We should use worker_init_fn to init the seed for every DL worker

More details in PT docs:
https://pytorch.org/docs/stable/notes/randomness.html#dataloader

Create deterministic shuffling

          It is NOT deterministic unless we let the user provides a seed (https://fburl.com/trfvxkon) and use that in combination with epoch count to set the generator.  In the dataloader initialization, we should use https://fburl.com/ghrgxgjo (DistributedSampler) with seed provided by the user. It also requires epoch count to be set using set_epoch.

My suggestion would be to retain the changes in this PR, and I will address this issue in a separate PR as part of determinism/shuffling work.

Originally posted by @gokulavasan in #49 (comment)

Fix unittests to not rely on state

Remove standalone file dependency in Tokenizer test

Summary:

Currently, to run the tests for the Tokenizer a m.model SentencePiece file is needed under assets.

TODO:

Mock or create file on the fly so that there is no need to add this file in assets

CI should verify that configs do work correctly

For example, #161 caused the alpaca recipe to error out. Fix in #186.

If CI actually triggers these configs and ensures finetuning starts, this error would have been caught

Log every X steps (user specified) in training loop

Eval harness

I tried to just send a PR tonight but got stuck on various parts so thought I'd describe my plan in case someone wants to pick this up before I come back from vacation. I suspect the below should take about 1-2 days of work

The main library we'd be using is https://github.com/EleutherAI/lm-evaluation-harness

Assuming a model is on the HF hub and was uploaded using the transformers library, eval is trivial

lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8

Essentially the important thing is to give a path to a model, tokenizer and a list of tasks. When you run the above command it will print results to console

However, we are not uploading a HF model we have a PyTorch model so unless we want to gate eval on a conversion back to transformers library we need to make create an interface to run the eleuther eval harness and here's a few examples of doing this

Chatted with Rohan briefly would also make sense to have a converter from torch native to hf checkpoints so we can use all the post finetuning tools like quantization, eval, export to llama.cpp etc..

Memory profiling, utilities, debugging, & management

Umbrella issue to capture our work around memory optimization / efficiency.
We should add tools / profiling to understand memory usage statistics such as peak memory
We should get an understanding of what memory requirements are needed to run our finetuning jobs and whether they fit with our target HW architectures and figure out how to decrease memory usage as needed.

Move inference script to integration tests

superceded by the generate recipe and more of a sanity check anyways.

pytorch / torchtune Goto Github PK

torchtune's Introduction

torchtune

Introduction

Models

Fine-tuning recipes

Llama3 and Llama3.1

Single GPU

Multi GPU

Installation

Get Started

Downloading a model

Llama3 download

Running fine-tuning recipes

Modify Configs

Design Principles

Native PyTorch

Simplicity and Extensibility

Correctness

Community Contributions

Acknowledgements

License

torchtune's People

Contributors

Stargazers

Watchers

Forkers

torchtune's Issues

Docs

Tests

Questions

Users

User 1:

User 2:

User 3:

Training Abstractions

Repo Design

Simplicity

Native PyTorch

Correctness and Stability

Best Practices

The Problem

We need a blessed way to copy/paste the training recipes for a given stable version of torchtune

Locations to refactor

Recommend Projects

Recommend Topics

Recommend Org