
Comments (5)

guillaumekln commented on May 18, 2024

We could support shard training in the future to automatically handle this kind of scenario.

In the meantime, you have 2 options:

  1. Reduce your training data to a size that fits in memory. You would still get good results, as the model is reasonably good at generalizing to new data.
  2. You could split your training data into several files and preprocess them independently. For example, to split your training data into two parts (a sketch for creating the two aligned halves follows the list of data packages below):
th preprocess.lua -train_src src-train-1.txt -train_tgt tgt-train-1.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -save_data data-1
th preprocess.lua -train_src src-train-2.txt -train_tgt tgt-train-2.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -src_vocab data-1.src.dict -tgt_vocab data-1.tgt.dict -save_data data-2

These two commands produce the data packages:

  • data-1-train.t7
  • data-2-train.t7
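
If the two halves do not exist yet, a simple line-based split keeps the source and target files aligned. A minimal sketch, assuming the original corpus lives in src-train.txt and tgt-train.txt (hypothetical names):

# Split a parallel corpus into two aligned halves by line count.
n=$(wc -l < src-train.txt)
half=$(( (n + 1) / 2 ))
head -n "$half" src-train.txt > src-train-1.txt
tail -n +"$((half + 1))" src-train.txt > src-train-2.txt
head -n "$half" tgt-train.txt > tgt-train-1.txt
tail -n +"$((half + 1))" tgt-train.txt > tgt-train-2.txt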

Then, you start an initial training for one epoch on data-1-train.t7:

th train.lua -data data-1-train.t7 -save_model model -end_epoch 1

And retrain your model for one epoch on the second data package:

th train.lua -data data-2-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -end_epoch 1

Finally, you alternate retraining on data-1-train.t7 and data-2-train.t7 for as long as required:

th train.lua -data data-1-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-2-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-1-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 3 -end_epoch 3
...
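
To avoid typing each command by hand, the alternation can be scripted. A minimal sketch, assuming two shards; since the X.XX part of each checkpoint name is a validation score that is not known in advance, the script globs for the most recent matching file:

# Start from the checkpoint left by the epoch-1 runs above.
ckpt=$(ls -t model_epoch1_*.t7 | head -n 1)
for epoch in 2 3 4; do
  for data in data-1-train.t7 data-2-train.t7; do
    th train.lua -data "$data" -train_from "$ckpt" -save_model model \
      -start_epoch "$epoch" -end_epoch "$epoch"
    # Pick up the checkpoint this run just saved.
    ckpt=$(ls -t model_epoch${epoch}_*.t7 | head -n 1)
  done
done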


wsnooker commented on May 18, 2024

Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to seeing shard training.


wsnooker commented on May 18, 2024

@guillaumekln, a problem occurred with the second option you provided. When processing the split training data, preprocess.lua produced two different source dictionaries, which led to conflicting word IDs during training. How can I ensure that the split data shares the same dictionary?


wsnooker commented on May 18, 2024

Oh, I found the parameter '-src_vocab', '', [[Path to an existing source vocabulary]] in preprocess.lua. Is this the parameter to set to the shared dictionary? When creating the dictionary with -src_vocab_size 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?


guillaumekln commented on May 18, 2024

Yes, you got it; I missed that. You indeed need to share the vocabularies by using -src_vocab and -tgt_vocab. You do not need to set -src_vocab_size for the second part: the vocabulary size will be inferred from the file. (The four extra entries are the special tokens the preprocessing adds on top of the 20,000 most frequent words.)
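
As a quick sanity check, the inferred size can be confirmed directly, assuming the dictionary file stores one entry per line:

wc -l data-1.src.dict   # expect 20004: the 20000 kept words plus the special tokens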

