
Comments (5)

guillaumekln commented on May 18, 2024

We could support shard training in the future to automatically handle this kind of scenario.

In the meantime, you have 2 options:

  1. Reduce your training data to a size that fits in memory. You would still get good results, as the model is reasonably good at generalizing to new data.
  2. You could split your training data into several files and preprocess them independently. For example, to split your training data into two parts (a sketch for creating the two aligned halves follows the list of data packages below):
th preprocess.lua -train_src src-train-1.txt -train_tgt tgt-train-1.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -save_data data-1
th preprocess.lua -train_src src-train-2.txt -train_tgt tgt-train-2.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -src_vocab data-1.src.dict -tgt_vocab data-1.tgt.dict -save_data data-2

These two commands produce the data packages:

  • data-1-train.t7
  • data-2-train.t7
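
If the two halves do not exist yet, a simple line-based split keeps the source and target files aligned. A minimal sketch, assuming the original corpus lives in src-train.txt and tgt-train.txt (hypothetical names):

# Split a parallel corpus into two aligned halves by line count.
n=$(wc -l < src-train.txt)
half=$(( (n + 1) / 2 ))
head -n "$half" src-train.txt > src-train-1.txt
tail -n +"$((half + 1))" src-train.txt > src-train-2.txt
head -n "$half" tgt-train.txt > tgt-train-1.txt
tail -n +"$((half + 1))" tgt-train.txt > tgt-train-2.txt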

Then, you start an initial training for one epoch on data-1-train.t7:

th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

And retrain your model for one epoch on the second data package:

th train.lua -data data-2-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -end_epoch 1

Finally, you alternate retraining on data-1-train.t7 and data-2-train.t7 for as long as required:

th train.lua -data data-1-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-2-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-1-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 3 -end_epoch 3
...
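
To avoid typing each command by hand, the alternation can be scripted. A minimal sketch, assuming two shards; since the X.XX part of each checkpoint name is a validation score that is not known in advance, the script globs for the most recent matching file:

# Start from the checkpoint left by the epoch-1 runs above.
ckpt=$(ls -t model_epoch1_*.t7 | head -n 1)
for epoch in 2 3 4; do
  for data in data-1-train.t7 data-2-train.t7; do
    th train.lua -data "$data" -train_from "$ckpt" -save_model model \
      -start_epoch "$epoch" -end_epoch "$epoch"
    # Pick up the checkpoint this run just saved.
    ckpt=$(ls -t model_epoch${epoch}_*.t7 | head -n 1)
  done
done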


wsnooker commented on May 18, 2024

Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to seeing shard training.


wsnooker commented on May 18, 2024

@guillaumekln, a problem occurred with the second option you provided. When processing the split training data, preprocess.lua produced two different source dictionaries, which led to conflicting word IDs during training. How can I ensure that the split data shares the same dictionary?


wsnooker commented on May 18, 2024

Oh, I found the parameter '-src_vocab', '', [[Path to an existing source vocabulary]] in preprocess.lua. Is this the parameter to set to the shared dictionary? When creating the dictionary with -src_vocab_size 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?


guillaumekln commented on May 18, 2024

Yes, you got it; I missed that. You indeed need to share the vocabularies by using -src_vocab and -tgt_vocab. You do not need to set -src_vocab_size for the second part: the vocabulary size will be inferred from the file. (The four extra entries are the special tokens the preprocessing adds on top of the 20,000 most frequent words.)
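
As a quick sanity check, the inferred size can be confirmed directly, assuming the dictionary file stores one entry per line:

wc -l data-1.src.dict   # expect 20004: the 20000 kept words plus the special tokens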

