Comments (21)

albertz commented on May 24, 2024

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right. (Edit: Correction, that should be 24h per sub-epoch (1/10th of the full epoch).)
In RETURNN, you use the LmDataset for it, right? You probably want to use epoch_split to split it up into sub-epochs, e.g. epoch_split=25 or so. Then each sub-epoch should take less than 1h. This also means that you store the model checkpoint more often and can do the learning rate scheduling more often.
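
A rough sketch of what such a dataset definition could look like in a RETURNN config (an illustrative example, not taken from the actual setup: the paths are placeholders, and depending on the RETURNN version the splitting option may be called partition_epoch instead of epoch_split):

train = {
    "class": "LmDataset",
    "corpus_file": "/path/to/librispeech-lm-norm.txt.gz",  # placeholder corpus path
    "orth_symbols_map_file": "/path/to/bpe.vocab",         # placeholder vocabulary/BPE map
    "seq_end_symbol": "</s>",
    "auto_replace_unknown_symbol": False,
    "epoch_split": 25,        # split one full pass over the corpus into 25 sub-epochs
    "seq_ordering": "random",
}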

I don't know whether @kazuki-irie has any experience with multi-GPU training of language models; I don't. Which horovod_reduce_type and horovod_param_sync_step do you use? That will probably impact convergence speed. Also the learning rate of course (it is probably different than for single-GPU training).
LmDataset might also not be optimal for multi-GPU training (I don't know). Maybe it is more efficient to use HDFDataset. See also the RETURNN documentation on multi-GPU training.

kazuki-irie commented on May 24, 2024

You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right.

That's right. I confirm that the training speed is in that range for the best (large) models using a single GPU (with random sequence ordering).
If you are working with the official LibriSpeech 200K word-level vocabulary or our 10K BPEs, you could also consider making use of our pre-trained models:
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers/librispeech

I don't know if @kazuki-irie has any experience about multi-GPU training of language models.

No. It has been on my TODO list for a while, but I have had no time for it so far, so I cannot help here. Sorry.

albertz commented on May 24, 2024

Closing now, as this is not really about a bug in the code. But feel free to ask further questions.

deep-speech commented on May 24, 2024

Yes, it's the Librispeech text corpus. I used the BPE-based Transformer LM config; the only changes I made were the Horovod-related flags reduce_type='param' and sync_step=50, plus some changes in LmDataset to distribute the text sequences between the GPUs. Similar changes have worked well for multi-GPU training of LSTM-based LM configs. I am trying to reproduce the results so that the setup can be used to train on a larger corpus with multiple GPUs.
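
For reference, a hedged sketch of how those flags look as RETURNN config settings (use_horovod and horovod_dataset_distribution are added here for completeness; "shard" appears again in Spotlight0xff's settings further down in this thread):

use_horovod = True
horovod_reduce_type = "param"           # synchronize model parameters rather than gradients
horovod_param_sync_step = 50            # synchronize parameters every 50 steps
horovod_dataset_distribution = "shard"  # how the data is distributed across workers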

albertz commented on May 24, 2024

Please share your experience and results if you are successful; that might be helpful.
Can you also share some details about what exactly you changed in LmDataset?

deep-speech commented on May 24, 2024

Sorry for the late reply & formatting. I have added the following changes to the _iter_text() method.

import horovod.tensorflow as hvd

hvd_rank = hvd.local_rank()  # rank within this node; hvd.rank() would be the global rank
hvd_size = hvd.size()
count = -1

for line in f:
    count += 1
    if count % hvd_size != hvd_rank:
        continue  # this line belongs to another worker, skip it
    # ... process the line as before ...

albertz commented on May 24, 2024

Ah, but that should not be needed. Actually that is probably wrong.
Check FeedDictDataProvider.get_next_batch. In case of Horovod, you have batch_slice = slice(hvd.rank(), None, hvd.size()) there.
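
To illustrate what that slice does (an illustrative snippet, not the actual RETURNN code): each worker keeps every hvd.size()-th sequence of the already-loaded batch, starting at its own rank, so additionally sharding inside the dataset would split the data a second time.

# illustrative example of slice-based sharding of one batch across workers
seqs = ["seq0", "seq1", "seq2", "seq3", "seq4", "seq5"]  # sequences of one batch
num_workers, rank = 3, 1          # stand-ins for hvd.size() and hvd.rank()
batch_slice = slice(rank, None, num_workers)
print(seqs[batch_slice])          # ['seq1', 'seq4'] -> this worker's share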

deep-speech commented on May 24, 2024

Thanks for clarifying, will try as you suggested. But without this change, all instances load all the training sequences.

albertz commented on May 24, 2024

Yes, that is unfortunately the case. But I do not know of a better solution (one which would work as-is with all existing datasets). Your solution is of course better, but it only works for LmDataset, and it is also wrong unless you remove that batch_slice logic.

deep-speech commented on May 24, 2024

With the default logic and a bigger corpus, memory becomes an issue. I will experiment with removing the batch_slice logic.

albertz commented on May 24, 2024

Yes, I know. Btw, that is why I recommend HDFDataset for multi-GPU training. It will not load the whole data into memory, so memory should not be a real issue, and it should also be fast. You can use the tool hdf_dump.py to convert your LmDataset (or any dataset) into an HDFDataset. See the documentation about multi-GPU training.
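
A hedged sketch of what that workflow could look like (the file names are placeholders, and the exact arguments of hdf_dump.py may differ; check tools/hdf_dump.py --help and the HDFDataset documentation for your RETURNN version):

# roughly: python3 tools/hdf_dump.py <config-or-dataset> /path/to/train_corpus.hdf
# then point the config at the resulting HDF file:
train = {
    "class": "HDFDataset",
    "files": ["/path/to/train_corpus.hdf"],  # placeholder output of hdf_dump.py
    "partition_epoch": 25,                   # analogous to epoch_split above
    "seq_ordering": "random",
}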

deep-speech commented on May 24, 2024

Sure, will try that. Thanks a lot for the clarifications.

deep-speech commented on May 24, 2024

I rechecked: I had the batch_slice logic commented out in my earlier experiments, which worked well for multi-GPU training of LSTM-based LMs.

deep-speech commented on May 24, 2024

Is there a script for computing the perplexity (PPL) on test data?

kazuki-irie commented on May 24, 2024

You can add the test data to the config and call RETURNN directly:

rnn.py train.config ++task eval ++train None ++load_epoch $EPOCH ++log eval.txt ++learning_rate_file dummy_for_eval.txt

bitterfly commented on May 24, 2024

Hello,
This may not be the right place, since this (again) is not a bug, but I'm going to take advantage of the

But feel free to ask further questions.

statement and ask a question regarding the training time of this particular experiment.

We've tried replicating the results from this paper, more precisely with the config re_transfo_96_d00.2048_512.head_8.sgd.lr1.cl1.small_batch.config. We used the Librispeech corpus and the dictionary provided in the experiment.

According to the beginning of this discussion, one epoch is supposed to take ~24h. Using the config file (and TensorFlow 2.3.1), running one sub-epoch (with epoch_split = 10) takes around 27h on our machine, which means that the entire epoch would take around 10 days. We are running the model on a single Tesla V100 GPU (CUDA 10.1). It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU. So I was wondering what the problem with our setup could be, or whether we have misunderstood something, since the training takes 10 times longer on a (supposedly) faster GPU.

kazuki-irie commented on May 24, 2024

When I read my old response now, I realize how confusing it was...
If I remember correctly, the ~24-hour training time was for 1 sub-epoch, not 1 full epoch, and there we had 10 sub-epochs = 1 epoch (just like in your setting).
This should explain the factor of 10.

albertz commented on May 24, 2024

Hi,

To add:

It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU

Which other issue? The experiments here were performed on many different kinds of hardware, by many different people (GTX 1080 is common, but also 980, 2080, and some of the professional cards as well). I think @kazuki-irie also often trained on a faster GPU, as far as I remember (not sure which one exactly).

bitterfly commented on May 24, 2024

@albertz, I meant this particular issue.
Actually, the whole thought process was inspired by this quote from the issue:

(although normally our training times are often 1-5 days or so; in only some of the rare extreme cases you get sth like 2 weeks; all of that always on a single GTX 1080 Ti GPU)

I might be grasping at straws here, because I couldn't find much information about the training times while trying to recreate the results from the experiment.

@kazuki-irie, thanks for the reply (and edit).
If I understand the naming convention correctly, this model has been trained for 30 sub-epochs, each taking about a day... so this model takes a month to train on a single GPU? I'm not sure that the above quote applies to this particular experiment (as it takes longer than 2 weeks), so I'm not sure whether you used multiple GPUs or the model really just takes that long to train.

Sorry for wasting your time.

kazuki-irie commented on May 24, 2024

this model takes a month to train on a single GPU?

That is correct. It's a big model.

So I'm not sure whether you used multiple GPUs

No, on one GPU (at that time).

Spotlight0xff commented on May 24, 2024

Hey, maybe I can chime in a bit.
I did train a large LM on multi-GPU (an LSTM LM though);
these are my RETURNN settings for that:

use_horovod = config.bool("use_horovod", False)  # allows enabling Horovod via the command line
horovod_dataset_distribution = "shard"           # distribute the data across workers by sharding
horovod_reduce_type = "param"                    # synchronize model parameters rather than gradients
horovod_param_sync_step = 100                    # synchronize every 100 steps

Haven't experimented with it a lot (one single run actually), but it seemed to work.
I would assume that similar settings would also work for the Transformer LM.

That was with 8 GPUs; one sub-epoch with ~90k steps took around 6h (on V100s).
