The following word2vec models are available:
Model | Data | Note |
---|---|---|
Space Vi | MT-Vi-Mono-VLSP2020, preprocessed with a space tokenizer | VnCoreNLP as pretokenizer |
Space En | CNN-DailyMail, preprocessed with a space tokenizer | Stanford CoreNLP for sentence segmentation |
BPE Vi | MT-Vi-Mono-VLSP2020, preprocessed with a BPE tokenizer | VnCoreNLP as pretokenizer |
BPE En | CNN-DailyMail, preprocessed with a BPE tokenizer | Stanford CoreNLP for sentence segmentation |
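
To illustrate what space tokenization means for the "Space" models, here is a minimal sketch, not the notebook's actual code. It assumes the Vietnamese text has already been word-segmented in the VnCoreNLP style, where the syllables of a multi-syllable word are joined with underscores. The function name `space_tokenize` and the sample sentence are illustrative only.

```python
def space_tokenize(line):
    """Split a pre-segmented line on whitespace.

    Underscore-joined words (e.g. "làm_việc") stay intact as single tokens,
    because only whitespace is treated as a separator.
    """
    return line.split()


# Illustrative sentence in VnCoreNLP-style segmented form (assumption).
sentence = "Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội"
tokens = space_tokenize(sentence)
print(tokens)
```

Because the pretokenizer has already decided word boundaries, splitting on whitespace is enough to get word-level tokens for word2vec training. The BPE models would instead split these words further into subword units.
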

Notebook: embedding/word2vec_pretrain_machine_trans_en_vi.ipynb
Model links: Google Drive. The word2vec models should be saved in /embedding/model_name
Extract MT-EV-VLSP2020.zip to ./
Check model/
Check the 2 main files or these Colab notebooks
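
The extraction step above can also be done from Python, for example inside a Colab cell. This is a minimal sketch using only the standard library. It assumes MT-EV-VLSP2020.zip has already been downloaded into the current directory, and the helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest="."):
    """Extract every member of zip_path into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    # Assumption: the archive sits in the current working directory.
    archive_path = Path("MT-EV-VLSP2020.zip")
    if archive_path.exists():
        for name in extract_archive(archive_path, "."):
            print(name)
    else:
        print(f"{archive_path} not found; download it first")
```

Printing the member names is a quick way to confirm what the archive actually contains before checking model/.
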