Comments (21)
You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right. (Edit: Correction, that should be 24h per sub-epoch (1/10th of the full epoch).)
In RETURNN, you use the LmDataset for it, right? You probably want to use epoch_split to split it up into sub-epochs, e.g. epoch_split=25 or so. Then one sub-epoch should take less than 1h. This also means that you store the model checkpoint more often and can do the learning rate scheduling more often.
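For illustration, a sketch of how that split might look in the dataset dict of a RETURNN config (not a complete config; the corpus path is a placeholder, and newer RETURNN versions call this option "partition_epoch" instead of "epoch_split"):

```python
# Sketch of the relevant dataset entry in a RETURNN config.
# The corpus path below is a placeholder, not a real file.
train = {
    "class": "LmDataset",
    "corpus_file": "/path/to/librispeech-lm-norm.txt.gz",  # placeholder
    "epoch_split": 25,  # 1 sub-epoch = 1/25 of the full corpus
}
```

With this, checkpoints and learning-rate updates happen after every 1/25 of the corpus instead of once per full pass.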
I don't know if @kazuki-irie has any experience with multi-GPU training of language models; I don't. What horovod_reduce_type and horovod_param_sync_step do you use? That will probably impact convergence speed. The learning rate matters too, of course (and the optimal value is probably different than for single-GPU training).
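To illustrate what horovod_reduce_type="param" with a sync step means, here is a toy sketch in plain Python (no Horovod): each worker trains independently for horovod_param_sync_step steps, then the parameters themselves are averaged across workers, instead of reducing gradients on every step.

```python
# Toy simulation of parameter averaging ("param" reduce type).
# Workers drift apart during local steps; a sync averages their parameters.
def average_params(worker_params):
    n = len(worker_params)
    return [sum(ws) / n for ws in zip(*worker_params)]

# Two hypothetical workers, each holding 3 parameters after N local steps:
worker_a = [1.0, 2.0, 3.0]
worker_b = [3.0, 4.0, 5.0]
synced = average_params([worker_a, worker_b])
print(synced)  # [2.0, 3.0, 4.0]
```

A larger sync step reduces communication cost but lets the workers diverge further between syncs, which is why it can affect convergence.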
LmDataset might also not be optimal for multi-GPU training (I don't know). Maybe it is more efficient to use HDFDataset. See also here.
from returnn-experiments.
> You probably mean the Librispeech text corpus, right? As far as I know, that is huge (@kazuki-irie can comment), so I guess ~24h per epoch sounds right.
That's right. I confirm that the training speed is in that range for the best (large) models using a single GPU (with random sequence ordering).
If you are working with the official LibriSpeech 200K word level vocabulary or our 10K BPEs, you could also consider making use of our pre-trained models:
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers/librispeech
> I don't know if @kazuki-irie has any experience about multi-GPU training of language models.
No. It has been on my TODO list for a while, but I have had no time for it so far, so I cannot help here. Sorry.
Closing now, as this is not really about a bug in the code. But feel free to ask further questions.
Yes, it's the Librispeech text corpus. I used the BPE-based Transformer LM config; the only changes I made were the Horovod-related flags reduce_type='param' and sync_step=50, plus some changes in LmDataset to distribute the text sequences between the GPUs. Similar changes have worked well for multi-GPU training of LSTM-based LM configs. I am trying to reproduce the results so that the setup can be used to train on a larger corpus with multiple GPUs.
Please share your experience and results if you are successful, that might be helpful.
Can you also share some details about what exactly you changed in LmDataset?
Sorry for the late reply & formatting. I added the following changes to the _iter_text() method:

import horovod.tensorflow as hvd

hvd_rank = hvd.rank()  # rank(), not local_rank(): local_rank() is only correct on a single node
hvd_size = hvd.size()
for count, line in enumerate(f):
    if count % hvd_size != hvd_rank:
        continue
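For illustration, the effect of that modulo sharding can be sketched without Horovod (worker count and corpus lines are made up): each rank keeps every size-th line, so the shards are disjoint and together cover the whole corpus.

```python
# Toy demo of modulo sharding: 4 hypothetical workers, 10 corpus lines.
lines = [f"seq{i}" for i in range(10)]
size = 4  # stand-in for hvd.size()

shards = {
    rank: [line for i, line in enumerate(lines) if i % size == rank]
    for rank in range(size)  # each rank is a stand-in for hvd.rank()
}
print(shards[0])  # ['seq0', 'seq4', 'seq8']
```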
Ah, but that should not be needed. Actually, it is probably wrong. Check FeedDictDataProvider.get_next_batch: in the Horovod case, you have batch_slice = slice(hvd.rank(), None, hvd.size()) there.
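A tiny sketch of what that batch_slice does (worker numbers and batch contents are hypothetical): every worker sees the same batch sequence and keeps a strided slice of it, so sharding the dataset again on top of this would skip data.

```python
# Stand-ins for hvd.rank() and hvd.size(): hypothetical worker 1 of 4.
rank, size = 1, 4

batches = list(range(12))  # the same batches, as built on every worker
batch_slice = slice(rank, None, size)
print(batches[batch_slice])  # [1, 5, 9]
```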
Thanks for clarifying, will try as you suggested. But without this change, all instances load all the training sequences.
Yes, that is unfortunately the case. But I do not know of a better solution (one which would work as-is with all existing datasets). Your solution is of course better, but it only works for LmDataset, and it is also wrong unless you remove that batch_slice logic.
With the default logic and a bigger corpus, memory becomes an issue. I will experiment with removing the batch_slice logic.
Yes, I know. Btw, that is why I recommend HDFDataset for multi-GPU training. It does not load the whole data into memory, so memory should not be a real issue, and it should also be fast. You can use the tool hdf_dump.py to convert your LmDataset (or any dataset) into an HDFDataset. See the documentation about multi-GPU training.
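To illustrate the memory point in plain Python (this is not RETURNN code; the file is a temporary stand-in for a corpus): reading eagerly holds every sequence in memory at once, while streaming, which is effectively what an HDF-backed dataset gives you, holds only one item at a time.

```python
import os
import tempfile

# Build a small stand-in corpus file.
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(path, "w") as f:
    for i in range(1000):
        f.write(f"sequence {i}\n")

# Eager: the whole corpus lives in memory at once (what LmDataset does).
eager = open(path).read().splitlines()

# Lazy: only the current line is held in memory.
def stream(p):
    with open(p) as f:
        for line in f:
            yield line.rstrip("\n")

print(len(eager), next(stream(path)))  # 1000 sequence 0
```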
Sure, will try that. Thanks a lot for the clarifications.
I rechecked: I had the batch_slice logic commented out in my earlier experiments, which worked well for multi-GPU training of LSTM-based LMs.
Is there a script for checking ppl on test data?
You can add the test data in the config and directly call RETURNN:
rnn.py train.config ++task eval ++train None ++load_epoch $EPOCH ++log eval.txt ++learning_rate_file dummy_for_eval.txt
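If the score in the eval log is the cross-entropy per target token in nats (as RETURNN's ce losses usually are), the perplexity is just its exponential. A minimal sketch, with a made-up score value:

```python
import math

def perplexity(ce_nats: float) -> float:
    """Convert a per-token cross-entropy in nats to perplexity."""
    return math.exp(ce_nats)

# Hypothetical eval score of 4.0 nats/token:
print(round(perplexity(4.0), 2))  # 54.6
```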
Hello,
This may not be the right place, since this (again) is not a bug, but I'm going to take advantage of the
> But feel free to ask further questions.
statement and ask a question regarding the training time of this particular experiment.
We've tried replicating the results from this paper, more precisely re_transfo_96_d00.2048_512.head_8.sgd.lr1.cl1.small_batch.config. We used the Librispeech corpus and the dictionary provided in the experiment.
According to the beginning of this discussion, one epoch is supposed to take ~24h. Using the config file (and TensorFlow 2.3.1), running one sub-epoch (with epoch_split = 10) takes around 27h on our machine, which means the entire epoch would take around 10 days. We are running the model on a single Tesla V100 GPU (CUDA 10.1). It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU, so I was wondering what the problem with our setup could be, or whether we have misunderstood something, since the training takes 10 times longer on a (supposedly) faster GPU.
When I read my old response now, I realize how confusing it was...
If I remember correctly, the ~24-hour training time was for 1 sub-epoch, not 1 epoch, and we had 10 sub-epochs = 1 epoch (just like in your setting).
This should explain the factor of 10.
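Spelled out, the arithmetic from the thread looks like this (numbers taken from the discussion above):

```python
# ~24h per sub-epoch, with the corpus split into 10 sub-epochs.
hours_per_subepoch = 24
epoch_split = 10

hours_per_epoch = hours_per_subepoch * epoch_split
print(hours_per_epoch, hours_per_epoch / 24)  # 240 10.0  (i.e. ~10 days/epoch)
```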
Hi,
To add:
> It is mentioned in another issue that the returnn-experiments are run on a GTX 1080 Ti GPU
Which other issue? The experiments here were performed on many different kinds of hardware, by many different people (GTX 1080 is common, but also 980, 2080, and some of the professional cards). I think @kazuki-irie also often trained on a faster GPU, as far as I remember (not sure which one exactly).
@albertz, I meant this particular issue.
Actually, the whole thought process was inspired by this quote from the issue:
> (although normally our training times are often 1-5 days or so; in only some of the rare extreme cases you get sth like 2 weeks; all of that always on a single GTX 1080 Ti GPU)
I might be grasping at straws here, because I couldn't find much information about the training times while trying to recreate the results from the experiment.
@kazuki-irie, thanks for the reply (and edit).
If I understand the naming convention correctly, this model has been trained for 30 sub-epochs, each taking about a day, so the model takes a month to train on a single GPU? I'm not sure the above quote applies to this particular experiment (as it takes longer than 2 weeks). So I'm not sure whether you used multiple GPUs, or whether the model really just takes that long to train.
Sorry for wasting your time.
> this model takes a month to train on a single GPU?
That is correct. It's a big model.
> So I'm not sure whether you used multiple GPUs
No, on one GPU (at that time).
Hey, maybe I can chime in a bit.
I did train a large LM on multi-GPU (an LSTM, though); these are my RETURNN settings for that:
use_horovod = config.bool("use_horovod", False)
horovod_dataset_distribution = "shard"
horovod_reduce_type = "param"
horovod_param_sync_step = 100
Haven't experimented with it a lot (one single run actually), but it seemed to work.
I would assume that similar settings would also work for the Transformer LM.
That was with 8 GPUs; one sub-epoch with ~90k steps took around 6h (on V100s).
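For reference, a rough back-of-the-envelope on the throughput implied by those numbers (~90k steps in ~6h):

```python
# Rough steps-per-second implied by ~90k steps in ~6h.
steps = 90_000
hours = 6

steps_per_sec = steps / (hours * 3600)
print(round(steps_per_sec, 2))  # 4.17
```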