huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Home Page: https://huggingface.co/docs/accelerate

License: Apache License 2.0

Python 99.65% Makefile 0.19% Dockerfile 0.16% Shell 0.01%

accelerate's People

Contributors

abhilash1910, akx, benjaminbossan, chris-hughes10, dberenbaum, faaany, fxmarty, liamswayne, lysandrejik, mishig25, muellerzr, pacman100, patrickvonplaten, pcuenca, philschmid, ranchlai, ryanrussell, sgugger, stas00, statelesshz, stevhliu, sumanthrh, sunmarc, sywangyi, thomwolf, will-cromar, xloem, younesbelkada, yuxinyuan, zhiyuanchen


accelerate's Issues

Torch Geometric compatibility

Hi,

Awesome package, I'm really liking how easy it is to plug-and-play in my training scripts.

Would it be possible to have compatibility with PyTorch Geometric (graph neural networks, etc.)? PyTorch Geometric uses a custom collate function in its DataLoader to deal with graph-like data, so right now putting it into the Accelerator gives this error upon iterating:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'torch_geometric.data.data.Data'>

Here is the relevant code: https://github.com/rusty1s/pytorch_geometric/blob/480f9d59d6d18166a5da2e2519fa9a6b33d3d4ad/torch_geometric/data/dataloader.py#L8-L70
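As a possible workaround (just a sketch, untested; I'm not certain Accelerate preserves an explicitly passed collate_fn, and Batch.from_data_list is PyTorch Geometric's own collation API), the graph collation can be attached to a plain DataLoader before it is prepared:

from torch.utils.data import DataLoader
from torch_geometric.data import Batch
from accelerate import Accelerator

# Sketch of a possible workaround: pass PyG's graph collation explicitly so the
# prepared DataLoader does not fall back to default_collate.
accelerator = Accelerator()
dataset = ...  # a torch_geometric dataset
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=Batch.from_data_list)
loader = accelerator.prepare(loader)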

Thanks!
Miles

how to turn off accelerate?

When I want to run my program without accelerate, I just type "python my_file" instead of "accelerate launch my_file".
However, my program gets stuck.
Could you help me? Thanks!

nlp example doesn't run faster with multi-gpu

I am running the example on 2080 Ti GPUs, where each epoch takes 21 seconds with 1 GPU. When using 2 GPUs, it also takes 21 seconds (I used tqdm to measure the time).
Everything looks right: GPU utilization is ~100% on both GPUs and the number of batches per device is halved, but it's not faster.
I tried the cv example, and there using 2 GPUs does speed up the training.

Is this normal? What might be the cause of this?

Different performance when training with single GPU vs. multiple GPUs

I'm currently using accelerate to fine-tune a huggingface pretrained Transformer with some additional classification heads, and I'm finding that performance when using multiple GPUs is much worse than with a single GPU, even when using the same batch size, learning rate, and number of training steps/epochs. I'm using accelerate to parallelize the training loop over multiple GPUs, but the validation/test set evaluation is a custom function that isn't easily adapted to use with accelerate so I'm doing that part on a single GPU in the main process. To run the entire script on a single GPU vs. multiple GPUs, I just adjust the --num_processes argument for accelerate launch as well as the batch size to match, for example:

accelerate launch --num_processes 1 <script> --batch_size 32 (for 1 GPU)

accelerate launch --num_processes 4 <script> --batch_size 8 (for 4 GPUs)

The multi-GPU training seems to be running fine, in the sense that running nvidia-smi shows all 4 GPUs being fully utilized and the training data loader is the correct length for the given batch size (same length for both of the commands above), but there's still a drop in performance in the multi-GPU case. When printing output on each GPU, the processes do seem to be waiting for the main process to finish running the evaluation function as expected. This also doesn't seem to just be an issue of running single-GPU evaluation within a multi-GPU training loop, since loading the saved model weights after training and re-running the evaluation on a single GPU gives the same performance.

Any help is appreciated, thanks!

Pseudocode:

# set up accelerator
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# set up device (for evaluation)
device = accelerator.device

model = ... # initialize model
optimizer = ... # initialize optimizer
loader = ... # initialize training data loader

valid_dataset = ... # initialize validation dataset
test_dataset = ... # initialize test dataset

# prepare model, optimizer, data loader
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

# training loop
for epoch in range(epochs):
    model.train()
    for inputs, targets in loader:
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        optimizer.zero_grad()
        accelerator.backward(loss)
        optimizer.step()

    # evaluate on validation set with unwrapped model in main process (single-GPU)
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model).to(device)
        unwrapped_model.eval()
        metrics = calculate_metrics(unwrapped_model, valid_dataset, device)
        print(metrics)

# evaluate on test set with unwrapped model in main process (single-GPU)
if accelerator.is_main_process:
    unwrapped_model = accelerator.unwrap_model(model).to(device)
    unwrapped_model.eval()
    metrics = calculate_metrics(unwrapped_model, test_dataset, device)
    print(metrics)
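
A rough sketch of an alternative (continuing the pseudocode above, not the poster's actual code): run the validation pass on every process and gather the per-batch outputs, assuming the metric can be computed from gathered logits and labels and that per-process batches are equally sized; compute_metrics_from_tensors is a hypothetical helper.

# Rough sketch: distributed evaluation with gathered outputs
# (assumes valid_loader was also passed through accelerator.prepare).
model.eval()
all_logits, all_targets = [], []
with torch.no_grad():
    for inputs, targets in valid_loader:
        logits = model(inputs)
        all_logits.append(accelerator.gather(logits))
        all_targets.append(accelerator.gather(targets))
if accelerator.is_main_process:
    metrics = compute_metrics_from_tensors(torch.cat(all_logits), torch.cat(all_targets))
    print(metrics)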

Version information:

  • torch: 1.6.0
  • transformers: 3.3.1
  • accelerate: 0.3.0
  • CUDA: 10.1

Multi-GPU CLI issue

Hi- Thanks for the great library, Sylvain!

The config file looks as follows:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

The relevant part of the code is as follows:

    accelerator = Accelerator(fp16=config['fp16'], cpu=config['cpu'])
    print(accelerator.device)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size
    )
    valid_dataloader = DataLoader(
        validation_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )
    test_dataloader = DataLoader(
        test_dataset, shuffle=False, collate_fn=collate_fn, batch_size=EVAL_BATCH_SIZE
    )

    # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    prepared = accelerator.prepare(
        model, optimizer, train_dataloader, valid_dataloader, test_dataloader
    )
    model, optimizer, train_dataloader, valid_dataloader, test_dataloader = prepared


    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            #batch.to(accelerator.device)
            outputs = model(**batch)
            loss = outputs.loss
            loss = loss / gradient_accumulation_steps
            accelerator.backward(loss)
            if step % gradient_accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

The script utilizes only a single GPU, though there are 2 GPUs:

>>> torch.cuda.device_count()
2

Launching the script from the command line:

accelerate launch training.py

The print statement print(accelerator.device) returns the following (happy to add more debugging):

cuda

Any help is appreciated. Thank you!

Training time for multi-node is similar to training time for single node

@sgugger As per our discussion on #37, here are my launch method and results.
I am using the AML per-process launcher shown here (option 1):
https://azure.github.io/azureml-cheatsheets/ja/docs/cheatsheets/python/v1/distributed-training/
to launch a distributed training job with multiple processes per node.

I am using the traditional run_mlm.py (the version without the Trainer) from the transformers repository (here).

When I compare the tqdm logs from running with 1 node vs. 8 nodes, I observe the following:

8 Node log :

INFO:main:***** Running training *****
INFO:main: Num examples = 4842767
INFO:main: Num Epochs = 10
INFO:main: Instantaneous batch size per device = 64
INFO:main: Total train batch size (w. parallel, distributed & accumulation) = 2048
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 23650

0%| | 20/23650 [01:20<26:05:59, 3.98s/it]
0%| | 21/23650 [01:24<26:04:43, 3.97s/it]
0%| | 22/23650 [01:28<26:12:45, 3.99s/it]

1Node Log:
INFO:main:***** Running training *****
INFO:main: Num examples = 4842767
INFO:main: Num Epochs = 10
INFO:main: Instantaneous batch size per device = 64
INFO:main: Total train batch size (w. parallel, distributed & accumulation) = 256
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 189180

0%| | 7/189180 [00:04<27:49:11, 1.89it/s]
0%| | 8/189180 [00:04<27:27:56, 1.91it/s]

As you can see, the single node runs about 8 times more iterations per second than the 8-node setup, so the estimated total training time for one node is roughly the same as for 8 nodes.

Variables or operations to identify rank at the beginning of the program

I need to identify the main (first) process to do something (for example, logging) at the beginning of the program, but I can't find any variable that provides this.

Only after some module/dataloader/optimizer has been prepared by accelerator.prepare() can torch.distributed.get_rank() be used for this.

Are there any other variables or operations (like an empty prepare() call) that can give me a flag to distinguish these processes at the beginning of the program?

Any suggestion?
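For what it's worth, the Accelerator object itself exposes process information as soon as it is created, before anything is prepared; a minimal sketch (assuming a reasonably recent accelerate version):

from accelerate import Accelerator

accelerator = Accelerator()  # creating this already initializes the distributed state

# Available immediately, before accelerator.prepare() is called.
if accelerator.is_main_process:
    print(f"main process: rank {accelerator.process_index} of {accelerator.num_processes}")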

Training throws cudaErrorLaunchFailure: unspecified launch failure after several epochs

When I use accelerate with my model, it often throws this error after training for several epochs. My code is as follows:

    def train(self, train_step, compute_metric, eval_step, train_scratch=False):
        wandb.init(project=self.model._get_name(), resume=~train_scratch)
        self.logger.info("Start training model")
        wandb.watch(self.model, log="all")
        wandb.save("model.py")
        set_seed(self.config.seed)
        self.model.train()
        accelerator = Accelerator(fp16=True)
        start_epoch = 1
        data_loader = self.get_train_dataset_dataloader()
        optimizer = self.optimizer(self.model.parameters(), lr=self.config.lr)
        schedule = get_cosine_schedule_with_warmup(optimizer,
                                                   num_warmup_steps=1000,
                                                   num_training_steps=len(data_loader) * self.config.epochs)
        if train_scratch:
            model, optimizer, data_loader = accelerator.prepare(self.model, optimizer, data_loader)
            start_epoch = 1
            best_metric = 0.
        elif os.path.exists(self.output_dir) and os.listdir(self.output_dir):
            checkpoint = self.load_checkpoint()
            self.model.load_state_dict(checkpoint['net'])
            optimizer.load_state_dict(checkpoint["optimizer"])
            schedule.load_state_dict(checkpoint['scheduler'])
            start_epoch = checkpoint['epoch'] + 1 if checkpoint['epoch'] > 1 else 1
            best_metric = checkpoint["best_metric"]
            model, optimizer, data_loader = accelerator.prepare(self.model, optimizer, data_loader)

        last_step = len(data_loader) - 1
        train_loss, train_score, log_info = 0., {}, {}
        eval_loss, eval_score = 0., {}
        for epoch in range(start_epoch, self.config.epochs + 1):
            # do train
            model.train()
            for i, data in enumerate(tqdm(data_loader, desc=f"Epoch {epoch}/{self.config.epochs}"), start=1):
                torch.cuda.empty_cache()
                outputs = train_step(model=model, data=data)
                loss = outputs.loss / self.config.accumulate_step
                train_loss += loss.item()
                accelerator.backward(loss)
                if i % self.config.accumulate_step == 0 or i == last_step:
                    optimizer.step()
                    schedule.step()
                    optimizer.zero_grad()
                    wandb.log({"lr": schedule.get_last_lr()[-1]},
                              step=math.ceil(i / self.config.accumulate_step) + math.ceil(
                                  (epoch - 1) * len(data_loader) / self.config.accumulate_step))
                compute_metric(outputs.logits, data['labels'], self.metrics)
            train_loss = train_loss / len(data_loader)
            for metric in self.metrics:
                train_score.update(metric.compute())
            log_info.update({"train_acc": train_score["accuracy"], "train_loss": train_loss})
            self.logger.info("Epoch {}/{} train_loss={:.5f}\t train_accuracy={:.5f}".format(epoch, self.config.epochs,
                                                                                            loss,
                                                                                            train_score['accuracy']))
            # do eval
            if self.eval_dataset:
                model.eval()
                eval_loader = accelerator.prepare(self.get_val_dataset_dataloader())
                for i, data in enumerate(tqdm(eval_loader, desc=f"Eval Epoch {epoch}/{self.config.epochs}"), start=1):
                    outputs = eval_step(model, data)
                    eval_loss += outputs.loss.item()
                    compute_metric(outputs.logits, data['labels'], self.metrics)
                eval_loss = eval_loss / len(eval_loader)
                for metric in self.metrics:
                    eval_score.update(metric.compute())
                log_info.update({"eval_loss": eval_loss, "eval_acc": eval_score["accuracy"]})
                self.logger.info(
                    "Eval Epoch {}/{} eval_loss={:.5f}\t eval_accuracy={:.5f}".format(epoch, self.config.epochs,
                                                                                      eval_loss,
                                                                                      eval_score['accuracy']))

And the error is as follows:

Traceback (most recent call last):
  File "D:/code/GNN_LM1/train.py", line 43, in <module>
    trainer.train(train_step, msm_compute_metric, eval_step, train_scratch=True)
  File "D:\code\GNN_LM1\trainer.py", line 138, in train
    train_loss += loss.item()
  File "C:\Anaconda3\lib\site-packages\accelerate\accelerator.py", line 249, in backward
    self.scaler.scale(loss).backward()
  File "C:\Anaconda3\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: transform: failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure

PyTorch version: 1.8
I want to know what causes this problem and how to fix it.

Expected to have finished reduction in the prior iteration before starting a new one.

I have modified the nlp_example to finetune an EncoderDecoder on translation data like this:

accelerator = Accelerator(device_placement=False, fp16=args.fp16, cpu=args.cpu)
def _tokenize(batch):
    if accelerator.distributed_type == DistributedType.TPU:
        src = tokenizer(batch[0], padding="max_length", max_length=128, return_tensors="pt")
        tgt = tokenizer(batch[1], padding="max_length", max_length=128, return_tensors="pt")
    else:
        src = tokenizer(list(batch[0]), padding="longest", return_tensors="pt")
        tgt = tokenizer(list(batch[1]), padding="longest", return_tensors="pt")
    return src, tgt
...
for step, batch in train_bar:
    src, tgt = _tokenize(batch)
    src["input_ids"] = src["input_ids"].to(accelerator.device)
    tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
    loss = outputs.loss
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    if step % eval_steps == 0:
        model.eval()
        for step, batch in enumerate(dev_dataloader):
            src, tgt = _tokenize(batch)
            src["input_ids"] = src["input_ids"].to(accelerator.device)
            tgt["input_ids"] = tgt["input_ids"].to(accelerator.device)
            with torch.no_grad():
                predictions = model.generate(
                    src["input_ids"],
                    decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
                    num_beams=4,
                    repetition_penalty=1.0,
                    do_sample=False,
                    forced_bos_token_id=None,
                )
            pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)
            ref_str = tokenizer.batch_decode(tgt["input_ids"], skip_special_tokens=True)
            metric.add_batch(
                predictions=accelerator.gather(pred_str), references=accelerator.gather([[r] for r in ref_str]),
            )
        eval_metric = metric.compute()
...

I am getting the following error during training

  File "trainer.py", line 104, in training_function
    outputs = model(input_ids=src["input_ids"], decoder_input_ids=tgt["input_ids"], labels=tgt["input_ids"])
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

and the following during generation

  File "trainer.py", line 120, in training_function
    predictions = model.generate(
  File "/home/wjv316/anaconda3/envs/indic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'generate'

Both work fine if I change the configuration to use only one GPU via accelerate config.
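
A hedged sketch of what might address both symptoms, combining the kwargs handler shown in an earlier issue in this list (for the unused-parameters error) with calling generate on the unwrapped model (for the missing-attribute error); I have not verified it on this setup:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Let DDP tolerate parameters that do not contribute to the loss.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(device_placement=False, fp16=args.fp16, cpu=args.cpu,
                          kwargs_handlers=[ddp_kwargs])

...

# DistributedDataParallel has no generate(); call it on the underlying model instead.
unwrapped_model = accelerator.unwrap_model(model)
with torch.no_grad():
    predictions = unwrapped_model.generate(
        src["input_ids"],
        decoder_start_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
        num_beams=4,
    )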

Distributed on multi-CPU

Just wondering if there is a way to distribute over CPUs (single node or multiple nodes).
It would be a very useful feature for some sparse models.

Multi GPU Training Not Working

While using Accelerate, only 1 of the 2 GPUs present is being utilized. I am training following the general instructions in the repository. The architecture is an autoencoder.

dataloader = DataLoader(dataset, batch_size = 2048, shuffle=True, pin_memory=False, num_workers=20)
encoder = Encoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
decoder = Decoder(bottleneck_size = 2, embedding_size = 40, vocab = dataset.vocab).to(device)
model = AutoEncoder(encoder, decoder).to(device)
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

I am transferring the samples in the batch to the device using the code below:

    for x in batch:
        batch[x] = batch[x].to(device)

The device is being determined by using:

device = accelerator.device

Both devices are visible, which can be confirmed with torch.cuda.device_count(), which returns 2.

The devices are RTX 2080s with CUDA version 11.2. The driver version is 460.67.
The distro is Pop!_OS.

UnboundLocalError when running on Google Colab using TPU runtime

Steps to reproduce

  1. Open new Google Colab notebook and choose TPU runtime.

  2. Install accelerate

    !pip install accelerate
    
  3. Run accelerate config

    In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
    Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 2
    What is the name of the function in your script that should be launched in all parallel scripts? [main]: 
    Traceback (most recent call last):
      File "/usr/local/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
        args.func(args)
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/__init__.py", line 64, in config_command
        config = get_user_input()
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/__init__.py", line 37, in get_user_input
        config = get_cluster_input()
      File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/config/cluster.py", line 81, in get_cluster_input
        num_processes=num_processes,
    UnboundLocalError: local variable 'num_processes' referenced before assignment
    

Regarding the problem that the accelerate library does not work across multiple machines

I want to complete a task on 2 nodes with 4 GPUs.
I configured the config file as required (on both nodes):

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 0
What is the IP address of the machine that will host the main process? same IP
What is the port you will use to communicate with the main process? same port
How many processes in total will you use? [1]: 2
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes

But when I run accelerate launch train.py on these two nodes, they do not cooperate; each one completes the training task on its own.
I don't know how to fix this.

In addition, is this related to the absolute paths I pass via --data_dir, --model, and --output_dir?

`accelerate test` ignores --config_file

Hi,
Great package, it helped me a lot today! So far it is as simple as it seems 🎉

I noticed that accelerate test --config_file accelerate_config.yml uses the default config values instead of the values from accelerate_config.yml. To reproduce this, create an accelerate_config.yml file with contents different from your main config. For example, say the default config has num_processes=3, but you only want to use 2 GPUs, so you create a config like this:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2

After running accelerate test --config_file accelerate_config.yml you will see something like this:

Distributed environment: MULTI_GPU
Num processes: 3
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False

Distributed environment: MULTI_GPU
Num processes: 3
Process index: 2
Local process index: 2
Device: cuda:2
Use FP16 precision: False

**Initialization**
Testing, testing. 1, 2, 3.
Distributed environment: MULTI_GPU
Num processes: 3
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False

i.e. three GPUs are used instead of the specified 2.

This happens because accelerate-launch requires all keyword arguments to precede the training script path, but accelerate-test does this:

cmd = ["accelerate-launch"] + test_args

so --config_file ends up being recognized as a training-script argument instead of a launch-script argument. You can see this if you print out the args of the corresponding accelerate-launch call.

Sending a PR with a fix soon.

accelerator.prepare fails with IterableDataset

I am trying to use the accelerate library in my example script, but it fails when used with an IterableDataset. Here is the error message:

Traceback (most recent call last):
  File "run_pretrain.py", line 612, in <module>
    main()
  File "run_pretrain.py", line 466, in main
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 201, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 201, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 159, in _prepare_one
    return self.prepare_data_loader(obj)
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/accelerator.py", line 231, in prepare_data_loader
    return prepare_data_loader(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/data_loader.py", line 416, in prepare_data_loader
    return DataLoaderShard(
  File "/data/nfs_home/ddkalamk/bert/accelerate/src/accelerate/data_loader.py", line 280, in __init__
    super().__init__(dataset, **kwargs)
  File "/nfs_home/ddkalamk/anaconda3/envs/bert/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 194, in __init__
    raise ValueError(
ValueError: DataLoader with IterableDataset: expected unspecified batch_sampler option, but got batch_sampler=<torch.utils.data.sampler.BatchSampler object at 0x2ba0839164f0>
srun: error: pcl-skx35: task 0: Exited with exit code 1

I am using PyTorch v1.6.0, but it seems to have the same issue even with the latest PyTorch.
Is there anything I am missing?

Loading a checkpoint saved using accelerator.save in Multi-GPU setting

Hi

This might be a noob question but I couldn't figure out a way to load checkpoints that were saved using accelerator.save. If I use torch.load to load the model state_dict in a Multi-GPU setting, it loads it multiple times on the first GPU, which leads to OOM.

config = T5Config().from_pretrained('t5-small')
model = T5ForConditionalGeneration(config)
checkpoint = torch.load(checkpoint_location)
model.load_state_dict(checkpoint)

I am, however, able to load checkpoints using from_pretrained, and that works in a multi-GPU setting:

model = T5ForConditionalGeneration.from_pretrained('t5-small')

This does not solve my problem, since I need to load models saved using accelerator.save.

Any help would be appreciated!
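One possible sketch (not verified): load the checkpoint on the CPU in every process, so it is not materialized repeatedly on GPU 0, then hand the model to accelerate as usual:

# Sketch: map the saved state dict to CPU first, then let prepare() place the model.
config = T5Config.from_pretrained('t5-small')
model = T5ForConditionalGeneration(config)
state_dict = torch.load(checkpoint_location, map_location="cpu")
model.load_state_dict(state_dict)
model = accelerator.prepare(model)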

Config class(decorated by dataclass) can not receive kwargs from the config dict.

First, thanks for your excellent work. But it seems I have come across a very strange problem. After I finished "accelerate config", I launched my script with "accelerate launch my_script.py" and got the following error:

Traceback (most recent call last):
  File "/home/wuyongfa/anaconda3/envs/mmdet/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/launch.py", line 297, in launch_command
    defaults = load_config_from_file(args.config_file)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/config/config_args.py", line 61, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/wuyongfa/anaconda3/envs/mmdet/lib/python3.7/site-packages/accelerate/commands/config/config_args.py", line 100, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'machine_rank'

I assume this is because the BaseConfig class (decorated with @dataclass) cannot receive kwargs from the config dict. Could anyone help me figure out why?

accelerator.gather() for non-tensor inputs?

Can I use gather() to merge two non-tensor inputs? For example:
one process has a list of strings like [ ['a', 'aa'], ['b', 'bb'] ] and another has a list like [ ['c', 'cc'], ['d', 'dd'] ].
Is there any way to merge these two lists into [ ['a', 'aa'], ['b', 'bb'], ['c', 'cc'], ['d', 'dd'] ]?
I think this is necessary when evaluating the model; only gathering the logits and labels may not always be enough.
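
Not an answer about the accelerate API itself, but as a sketch of what plain torch.distributed can already do (all_gather_object requires PyTorch 1.8+, if I remember correctly), arbitrary picklable objects can be gathered once the process group has been initialized (e.g. by the Accelerator):

import torch.distributed as dist

# Sketch: gather Python objects (lists of string pairs) from all processes and merge them.
my_pairs = [['a', 'aa'], ['b', 'bb']]        # different contents on each process
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, my_pairs)   # PyTorch >= 1.8
merged = [pair for proc_pairs in gathered for pair in proc_pairs]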

Accelerator not recognizing TPU in Google Colab and Kaggle Kernels

I installed and imported accelerate in both Kaggle Kernels and Google Colab with TPU turned on but it doesn't seem to detect the TPU and instead detects CPU when running the following code:

$ pip install -q accelerate
import accelerate
acc = accelerate.Accelerator()
device = acc.device
print(device)

The above snippet just outputs cpu when run on both of the aforementioned platforms with the TPU enabled.

Is there something that I am doing wrong?

PyTorch version: 1.7.0
Python version: 3.7.9

learning rate decay

If I use learning rate decay like this:

optimizer= accelerator.prepare(optimizer)
current_lr = learning_rate * decay_factor
for group in optimizer.param_groups:
        group['lr'] = current_lr

Should I do this for the main process or all processes?
Thanks
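
As far as I understand, each process holds its own replica of the optimizer under DDP, so the update should run in every process (doing it only in the main process would leave the replicas out of sync); a minimal sketch:

# Sketch: apply the decayed learning rate in every process, not just the main one,
# so all optimizer replicas stay consistent.
optimizer = accelerator.prepare(optimizer)
current_lr = learning_rate * decay_factor
for group in optimizer.param_groups:
    group['lr'] = current_lr  # executed identically on every process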

Access Global Variables Inside of training_function()

I am trying to use Accelerate to do large-scale model inference. In particular, I am using T5 to transform strings into a different format on Google's Colab TPUs.

This has been working fine, as I can print my outputs and verify they are correct. However, when I try to store the model outputs I seem to be unable to do so. The global variables do not seem to be recognized once I run the notebook launcher and I can't return anything from the function. Any advice on how to do this?

Here is some pseudocode of what I would like to happen

outputs = []
def training_function():
    global outputs
    accelerator = Accelerator()
    model = T5ForConditionalGeneration.from_pretrained('t5-base')
    dataloader = ...  # initialize dataloader
    model, dataloader = accelerator.prepare(
        model, dataloader
    )
    for batch in dataloader:
        attention_mask, input_ids = batch['attention_mask'], batch['input_ids']
        output = model.generate(input_ids=input_ids, attention_mask=attention_mask)
        outputs.append(output)

notebook_launcher(training_function)
print(outputs)

Thanks!

Bug: Invalid arguments

Following my PR to the transformers repo, I tested the multi-GPU and multi-machine settings and got these two errors:

accelerate <command> [<args>] launch: error: argument --main_process_ip: invalid typing.Union[str, NoneType] value: {some_ip_address}

and

accelerate <command> [<args>] launch: error: argument --main_process_port: invalid typing.Union[int, NoneType] value: '{some_port_value}'

It seems to me these errors can be fixed by changing type=Optional[str] -> type=str for the --main_process_ip arg and type=Optional[int] -> type=int for the --main_process_port arg.

@sgugger

TPU num_processes indentation error

In file https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/config/cluster.py there is an indentation error, leading to num_processes being undefined when using TPU.

    if distributed_type == DistributedType.TPU:
        main_training_function = _ask_field(
            "What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
            default="main",
        )
    else:
        main_training_function = "main"

        num_processes = _ask_field(
            "How many processes in total will you use? [1]: ",
            lambda x: int(x),
            default=1,
            error_message="Please enter an integer.",
        )

        if distributed_type != DistributedType.TPU:
            fp16 = _ask_field(
                "Do you wish to use FP16 (mixed precision)? [yes/NO]: ",
                _convert_yes_no_to_bool,
                default=False,
                error_message="Please enter yes or no.",
            )
        else:
            fp16 = False
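
For comparison, a sketch of what the intended control flow presumably looks like (main_training_function asked only for TPU, num_processes asked in every case); the actual fix may differ:

    if distributed_type == DistributedType.TPU:
        main_training_function = _ask_field(
            "What is the name of the function in your script that should be launched in all parallel scripts? [main]: ",
            default="main",
        )
    else:
        main_training_function = "main"

    num_processes = _ask_field(
        "How many processes in total will you use? [1]: ",
        lambda x: int(x),
        default=1,
        error_message="Please enter an integer.",
    )

    if distributed_type != DistributedType.TPU:
        fp16 = _ask_field(
            "Do you wish to use FP16 (mixed precision)? [yes/NO]: ",
            _convert_yes_no_to_bool,
            default=False,
            error_message="Please enter yes or no.",
        )
    else:
        fp16 = False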

question about amp

Hello. I'm very excited to be using this library.

I have a question: can it be used with torch.amp?

I want to use both libraries for training!

thanks!
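
From a traceback in another issue in this list (accelerate/accelerator.py calling self.scaler.scale(loss).backward()), accelerate appears to manage native mixed precision itself when fp16 is enabled, so a separate torch.cuda.amp setup shouldn't be needed; a sketch:

# Sketch: request mixed precision through the Accelerator; it manages the GradScaler
# internally (as the traceback elsewhere in this list suggests).
accelerator = Accelerator(fp16=True)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
...
accelerator.backward(loss)  # loss scaling is applied here when fp16 is enabled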

Deepspeed

Hi, I was just wondering if there are any future plans to integrate DeepSpeed or equivalent functionality as a backend (like the Transformers library does)?

Error in running multi GPU model

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:761, internal error, NCCL version 2.7.8
ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption

[QUESTION] Why do we reinit `AcceleratorState` everytime we prepare an object again?

Hey @sgugger !

Thanks for the clean & concise code. I love it!

Could I please ask what's the idea behind initializing state=AcceleratorState() inside prepare_data_loader here again?

In which cases would the conditions if num_processes is None: or if process_index is None: be True? As I understand it, inside the Accelerator we have already set state = AcceleratorState(), which sets the variables num_processes, process_index, etc. based on the config.

So why do we need to initialize AcceleratorState again?

accelerator.gather() at training time

Can I use accelerator.gather() at training time? Would gradients be calculated properly? Basically, my use case is something like the toy snippet below. It seems there is some issue with gradient flow in this scheme, as my validation accuracy drops to 0.

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
for i, data in enumerate(train_loader):
    model.zero_grad()
    
    a, b = model(data)
    b_all = accelerator.gather(b)
    c = f(a, b_all)
    loss = criterion(a, b, c)
    accelerator.backward(loss)
    optimizer.step()

Address already in use

When I run two programs from the same Python file, I encounter the following issue. I think it is because of the address/port, but I do not know how to change it.

Traceback (most recent call last):
  File "run_clm_no_trainer.py", line 490, in <module>
    main()
  File "run_clm_no_trainer.py", line 207, in main
    accelerator = Accelerator()
  File "/home/.conda/envs/transformers-sgd/lib/python3.7/site-packages/accelerate/accelerator.py", line 79, in __init__
    self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
  File "/home/.conda/envs/transformers-sgd/lib/python3.7/site-packages/accelerate/state.py", line 125, in __init__
    torch.distributed.init_process_group(backend="nccl")
  File "/home/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/.local/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
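
A hedged guess at a workaround: the rendezvous port is configurable, so giving each concurrent run its own port should avoid the clash, either through the main_process_port field in the accelerate config file or (assuming the flag is available in your version) directly on the command line:

accelerate launch --main_process_port 29501 run_clm_no_trainer.py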

Outside of venv?

I'm interested in using this on a jupyterhub machine, which is a single node. Do you expect that to have any issues? I can only assume the recommendation to use a virtual environment is for package compatibility, which I'm comfortable with sorting through in the base jupyterhub environment.

Can't send the values of int to device

My training data looks like:

src_image, target_image, src_camera, target_camera, src_camera_idx, target_camera_idx

Where src_camera_idx, target_camera_idx are integers

When I try to apply accelerate I get the following error:
TypeError: Can't send the values of type <class 'int'> to device cuda:0, only of nested list/tuple/dicts of tensors or objects having a to method.

We don't need to send the integers to the device. Perhaps, instead of raising an error here, you could simply skip the items that cannot be moved to the device? Or at least give me the option to skip them if I know my data contains such objects.
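
As a user-side stopgap (a sketch; the names are just placeholders for the dataset's fields), the integer indices can be returned as 0-d tensors so that everything the DataLoader yields has a .to() method:

import torch

# Sketch: wrap the integer camera indices in 0-d long tensors inside the dataset's
# __getitem__, so device placement can move them like the other fields.
src_camera_idx = torch.tensor(src_camera_idx, dtype=torch.long)
target_camera_idx = torch.tensor(target_camera_idx, dtype=torch.long)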

Error in GAN Training Code

Code:

import torch
import torch.nn as nn
import torch.optim as optim 
import numpy as np
from accelerate import Accelerator


def bce_false(x):
    bce = nn.BCEWithLogitsLoss(reduction='none')
    target = torch.zeros(x.size()).cuda()
    return bce(x, target)


def bce_true(x):
    bce = nn.BCEWithLogitsLoss(reduction='none')
    target = torch.ones(x.size()).cuda()
    return bce(x, target)

accelerator = Accelerator()


class Discriminator(nn.Module):

    def __init__(self, in_dim=1, image_size=128, conv_dim=64, c_dim=512, repeat_num=6):
        super(Discriminator, self).__init__()                
            
        layers = []                    
        
        layers.append(
           nn.Sequential(
                nn.Conv2d(in_dim, conv_dim, kernel_size=4, stride=2, padding=1),                
                nn.BatchNorm2d(conv_dim, affine=True, track_running_stats=True),
                nn.LeakyReLU(inplace=True)) 
        )
        
        curr_dim = conv_dim
        for i in range(1, repeat_num):
            layer = nn.Sequential(
                nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1),                
                nn.BatchNorm2d(curr_dim*2, affine=True, track_running_stats=True),
                nn.LeakyReLU(inplace=True))
                                 
            layers.append(layer)                   
            
            curr_dim = curr_dim * 2
        self.down = nn.ModuleList(layers)

        kernel_size = int(image_size / np.power(2, repeat_num))
        self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=3, stride=1, padding=1, bias=False)        

    def forward(self, x):
        
        (b, t, c, h, w) = x.size()
        x = x.view(b * t, -1, h, w)        
        
        for layer in self.down:
            x = layer(x)                                         
        out_src = self.conv1(x)
        
        return out_src  


D = Discriminator(image_size=96).cuda()


lr = 1e-4
optim_d = optim.Adam(D.parameters(), lr = lr, weight_decay=1e-4)  
optim_d, D = accelerator.prepare(optim_d, D)


torch.autograd.set_detect_anomaly(True)
video = torch.zeros(12, 29, 1, 96, 96).cuda()
loss_d = 0.0
loss_d = bce_true(D(video.clone())).reshape(-1).mean()
loss_d = loss_d + bce_false(D(video.clone())).reshape(-1).mean()
optim_d.zero_grad()
accelerator.backward(loss_d)
optim_d.step()

Error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This code runs correctly in single-GPU PyTorch mode, so might this be caused by Accelerate?

Typo in examples README

Happened to find a typo in the examples README.
Lines 60 and 148 use the option fb16, which I believe should be fp16.

Thank you for this new library and launch tool!

Mismatch between `accelerate config` cli and `default_config.yaml`

The generated default_config.yaml does not match the answers given to accelerate config.

Here are my CLI answers and the resulting default_config.yaml.

CLI answers:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 1
What is the IP address of the machine that will host the main process? 10.29.150.50
What is the port you will use to communicate with the main process? 2333
How many processes in total will you use? [1]: 6
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes

default_config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 1
main_process_ip: 2333
main_process_port: null
main_training_function: main
num_machines: 2
num_processes: 6

Multi machine training not working

I am trying to run my training code on 2 machines, each with 2 GPUs. However, it seems the programs run separately and do not speed up training. Here are my config.yaml files:

machine 1:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 99999
main_training_function: main
num_machines: 2
num_processes: 4

machine 2:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 1
main_process_ip: 192.168.0.1
main_process_port: 99999
main_training_function: main
num_machines: 2
num_processes: 4

How to use on CPU?

Hey guys, how do I use the accelerator on CPU?

acc = Accelerator(cpu=True)
print(acc.device)

This outputs cuda.

Thank you!

Cheers,

Francesco

How to save models with Accelerator.save in DDP mode

Hi,

My config file is

{
  "compute_environment": "LOCAL_MACHINE",
  "distributed_type": "MULTI_GPU",
  "fp16": false,
  "machine_rank": 0,
  "main_process_ip": null,
  "main_process_port": null,
  "main_training_function": "main",
  "num_machines": 1,
  "num_processes": 2
}

When I use accelerator.save(unwrapped_model.state_dict(), path), the model is saved twice (because I use two GPUs).

In the PyTorch DDP example, they save the model only when the rank is 0, which avoids saving it multiple times. How can I do that with accelerate?

Thanks!
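
The pattern I believe is intended (a sketch, using attributes that also appear in other issues in this list) is to synchronize and then let only the main process write:

# Sketch: wait for all processes, then save the unwrapped model from the main process only.
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    accelerator.save(unwrapped_model.state_dict(), path)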

DDP how to evaluate with custom metrics

Hi there,

Is there a way to evaluate the entire eval dataset using custom metrics instead of datasets.Metric?
I'm using code similar to this.
I'm fine-tuning a T5 on multitask learning, so I can't use the metric directly, because different prefixes are associated with different evaluation metrics.

Please advise
Thanks for your help!
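
A rough sketch of one way this could work (compute_my_metric is a hypothetical placeholder for whatever prefix-specific scoring is needed): gather predictions and labels from all processes, then score them on the main process:

# Rough sketch: gather per-batch predictions and labels, then apply a custom metric.
all_preds, all_labels = [], []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(**batch).logits
    preds = logits.argmax(dim=-1)
    all_preds.append(accelerator.gather(preds))
    all_labels.append(accelerator.gather(batch["labels"]))
if accelerator.is_main_process:
    score = compute_my_metric(torch.cat(all_preds), torch.cat(all_labels))  # hypothetical helper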

Unable to send extra params to DDP

My model's forward function returns losses as well as some debug output, so I need to set find_unused_parameters=True in DDP, but there's no way to pass this during preparation.

Perhaps we could pass some kwargs for the items being prepared? I'm not sure how generic this problem is.
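
For reference, the kwargs-handler pattern used in an earlier issue in this list seems to cover exactly this case:

from accelerate import Accelerator, DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)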

Feature request: Add support for NamedTuples in dataloaders

In order to produce self-documenting code, our team has the habit of using NamedTuples instead of plain tuples as the return type of our datasets.

Since they are subclasses of tuple, every PyTorch mechanism we've encountered handles them as if they were tuples, so it all works smoothly.

When using Accelerate, I get an error in send_to_device at line 114: type(tensor)(send_to_device(t, device) for t in tensor) raises a TypeError because type(tensor) returns my NamedTuple, which cannot take the generator (send_to_device(t, device) for t in tensor) as an argument. We would need some way to transform the generator into a form that is accepted by the NamedTuple.
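
A sketch of one possible handling (not necessarily the library's actual fix): detect namedtuples via their _fields attribute and rebuild them with positional arguments:

import torch

def send_to_device(tensor, device):
    # Sketch only: recurse into containers, special-casing namedtuples, which must be
    # rebuilt from positional arguments rather than from a single generator argument.
    if isinstance(tensor, (list, tuple)):
        if hasattr(tensor, "_fields"):  # namedtuple
            return type(tensor)(*(send_to_device(t, device) for t in tensor))
        return type(tensor)(send_to_device(t, device) for t in tensor)
    if isinstance(tensor, dict):
        return type(tensor)({k: send_to_device(v, device) for k, v in tensor.items()})
    if isinstance(tensor, torch.Tensor):
        return tensor.to(device)
    return tensor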
