Comments (2)
Hi,
Thanks for the feedback, it's always interesting to compare the various possible ways to train the model.
The most likely cause for (2) is that MRPC is a small dataset and the model shows high variance in the results depending on, for example, the initialization of the weights (see the original BERT repo on that as well). The distributed and multi-gpu setups probably do not use the random generators in the exact same order, which leads to different initializations.
You can get an intuition for that by training with different seeds: you will easily see a 10% variation in the final accuracy.
If you can, a better way to compare the results would thus be to take something like 10 different seeds for each training condition and compare the mean and standard deviation of the results.
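If it helps, here is a minimal sketch of that loop, assuming the run_classifier.py example script (the flags shown are not the full set it requires, and the name of the evaluation output file is an assumption, so adapt both to your local script):

import json
import statistics
import subprocess

# Fine-tune on MRPC with several seeds and collect the final accuracy of each run.
seeds = [1, 7, 13, 42, 123, 256, 512, 1024, 2048, 4096]
accuracies = []
for seed in seeds:
    subprocess.run(
        ["python", "run_classifier.py",
         "--task_name", "MRPC",
         "--do_train", "--do_eval",
         "--seed", str(seed),
         "--output_dir", f"out_mrpc_seed{seed}"],
        check=True,
    )
    # Assumed result file name and key; read whatever your script actually writes.
    with open(f"out_mrpc_seed{seed}/eval_results.json") as f:
        accuracies.append(json.load(f)["acc"])

print(f"mean={statistics.mean(accuracies):.4f} stdev={statistics.stdev(accuracies):.4f}")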
Thanks for your feedback!
After some investigation, it looks like t_total is not set properly for distributed training in BertAdam: the t_total used by each distributed worker should be divided by the worker count. I have included the following fix in my PR #58:
t_total = num_train_steps
if args.local_rank != -1:
    # Each distributed worker only performs num_train_steps / world_size
    # optimizer steps, so scale t_total accordingly.
    t_total = t_total // torch.distributed.get_world_size()
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=args.learning_rate,
                     warmup=args.warmup_proportion,
                     t_total=t_total)
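For context on why the division matters: BertAdam scales the learning rate by how far training has progressed relative to t_total (warmup then linear decay), and with a DistributedSampler each worker only performs num_train_steps / world_size optimizer steps. A rough illustration with a simplified stand-in for that schedule (not the library's exact code):

# Simplified stand-in for a warmup-then-linear-decay schedule, for illustration only.
def warmup_linear(progress, warmup=0.1):
    if progress < warmup:
        return progress / warmup
    return max(0.0, 1.0 - progress)

num_train_steps = 10000
world_size = 4
steps_per_worker = num_train_steps // world_size  # each worker takes 2500 steps

# Without the fix, progress only reaches 0.25 at the end, so the lr never decays:
print(warmup_linear(steps_per_worker / num_train_steps))    # 0.75 of the peak lr at the last step
# With t_total divided by the world size, the schedule completes as intended:
print(warmup_linear(steps_per_worker / steps_per_worker))   # 0.0 at the last step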