Comments (5)
We could support shard training in the future to automatically handle this kind of scenario.
In the meantime, you have two options:
- Reduce your training data to a size that fits in your memory. You would likely still get good results, as the model is reasonably good at generalizing to new data.
- You could split your training data into several files and preprocess them independently. For example, suppose you want to split your training data into two parts, as sketched in the steps below.
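First, split the corpus into two line-aligned halves. A minimal sketch with standard Unix tools, assuming the original corpus is in src-train.txt and tgt-train.txt (names are only illustrative, chosen to match the commands below):
# Split the parallel files in half by line count, keeping sentence pairs aligned.
n=$(wc -l < src-train.txt)
half=$(( (n + 1) / 2 ))
head -n "$half" src-train.txt > src-train-1.txt
tail -n +"$((half + 1))" src-train.txt > src-train-2.txt
head -n "$half" tgt-train.txt > tgt-train-1.txt
tail -n +"$((half + 1))" tgt-train.txt > tgt-train-2.txt
Then preprocess each part, reusing the vocabularies produced for the first part when processing the second: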
th preprocess.lua -train_src src-train-1.txt -train_tgt tgt-train-1.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -save_data data-1
th preprocess.lua -train_src src-train-2.txt -train_tgt tgt-train-2.txt -valid_src src-valid.txt -valid_tgt tgt-valid.txt -src_vocab data-1.src.dict -tgt_vocab data-1.tgt.dict -save_data data-2
which produce the data packages data-1-train.t7 and data-2-train.t7.
Then, you start an initial training for one epoch on data-1-train.t7:
th train.lua -data data-1-train.t7 -save_model model -end_epoch 1
And retrain your model for one epoch on the second data package:
th train.lua -data data-2-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -end_epoch 1
Finally, you alternate retraining on data-1-train.t7 and data-2-train.t7 for as long as required:
th train.lua -data data-1-train.t7 -train_from model_epoch1_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-2-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 2 -end_epoch 2
th train.lua -data data-1-train.t7 -train_from model_epoch2_X.XX.t7 -save_model model -start_epoch 3 -end_epoch 3
...
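If you want to script this alternation, here is a minimal sketch (it assumes the checkpoints follow the model_epochN_X.XX.t7 naming pattern shown above, so the newest .t7 file is always the one to resume from; the 10-epoch limit is an arbitrary example):
# Alternate the two data packages, resuming each run from the newest checkpoint.
for epoch in $(seq 2 10); do
  for data in data-1-train.t7 data-2-train.t7; do
    prev=$(ls -t model_epoch*.t7 | head -n 1)
    th train.lua -data "$data" -train_from "$prev" \
      -save_model model -start_epoch "$epoch" -end_epoch "$epoch"
  done
done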
from opennmt.
Great, the second option you suggest works for me. Thanks for your kind reply. Looking forward to seeing shard training.
from opennmt.
@guillaumekln, a problem occurred with the second option you provided. When preprocessing the split training data, preprocess.lua produced two different dict.src files, which led to conflicting word_ids in training. How can I make sure the split data parts share the same dict?
from opennmt.
Oh, I found the parameter '-src_vocab', '', [[Path to an existing source vocabulary]] in preprocess.lua. Is this the parameter to set to the shared dict? When creating the dict with -src_vocab_size = 20000 on the first data part, it actually contains 20004 words. What should -src_vocab_size be when preprocessing the second data part: 20000 or 20004?
from opennmt.
Yes, you got it; I missed that. You indeed need to share the vocabularies by using -src_vocab and -tgt_vocab. The vocabulary size is then inferred from the file, so -src_vocab_size does not need to be set for the second part.
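As a side note on the 20004 count: preprocess.lua adds four special tokens on top of the requested vocabulary size, which accounts for the extra entries. If I remember correctly, the *.dict files are plain text with one token and its index per line, the special tokens first, so the shared vocabulary is easy to inspect. A representative excerpt (everything after the first four lines depends on your data):
<blank> 1
<unk> 2
<s> 3
</s> 4
the 5
...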
from opennmt.