
simple-nmt's Introduction

Simple Neural Machine Translation (Simple-NMT)

This repo contains simple source code for advanced neural machine translation based on sequence-to-sequence models and the Transformer. Most open-source NMT toolkits have unnecessarily complicated structures because they provide more features than most people need. I believe this repo has the minimal set of features required to build an NMT system, so I hope it can be a good starting point for people who do not want unnecessarily many features.

This repo also accompanies my lecture and book; please refer to those for further information.

Features

This repo provides many features, and much of the code (e.g., the Transformer and beam search) was written from scratch.

Implemented Optimization Algorithms

Maximum Likelihood Estimation (MLE)

Minimum Risk Training (MRT)

Dual Supervised Learning (DSL)

Requirements

Evaluation

Results

First, the following table shows the evaluation results (BLEU) for each algorithm.

| model | enko | koen |
| --- | --- | --- |
| Sequence-to-Sequence | 32.53 | 29.67 |
| Sequence-to-Sequence (MRT) | 34.04 | 31.24 |
| Sequence-to-Sequence (DSL) | 33.47 | 31.00 |
| Transformer | 34.96 | 31.84 |
| Transformer (MRT) | - | - |
| Transformer (DSL) | 35.48 | 32.80 |

As you can see, the Transformer outperforms the sequence-to-sequence model on both the ENKO and KOEN tasks. Note that MRT could not be run on the Transformer due to a lack of GPU memory.

The following table shows results for different beam sizes with the sequence-to-sequence model. It shows that beam search improves the BLEU score without additional data or any change to the model.

| beam_size | enko | koen |
| --- | --- | --- |
| 1 | 31.65 | 28.93 |
| 5 | 32.53 | 29.67 |
| 10 | 32.48 | 29.37 |
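
The beam size is controlled at inference time via the --beam_size argument of translate.py (see the Usage section below), so all of these scores come from the same trained model. For example:

>> python translate.py --model_fn ./model.pth --gpu_id 0 --lang enko --beam_size 10 < test.txt > test.beam10.result.txt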

Setup

To evaluate this project, I used a public dataset from AI-HUB, which provides 1,600,000 sentence pairs. I randomly split the data into train/valid/test sets with the line counts shown below. Since the original test set of about 200,000 lines is too big for running many evaluations, I reduced it to 1,000 lines. (In other words, you can get a better model if you put the removed 199,000 lines back into the training set.)

| set | lang | #lines | #tokens | #characters |
| --- | --- | --- | --- | --- |
| train | en | 1,200,000 | 43,700,390 | 367,477,362 |
| train | ko | 1,200,000 | 39,066,127 | 344,881,403 |
| valid | en | 200,000 | 7,286,230 | 61,262,147 |
| valid | ko | 200,000 | 6,516,442 | 57,518,240 |
| valid-1000 | en | 1,000 | 36,307 | 305,369 |
| valid-1000 | ko | 1,000 | 32,282 | 285,911 |
| test-1000 | en | 1,000 | 35,686 | 298,993 |
| test-1000 | ko | 1,000 | 31,720 | 280,126 |
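
One way to produce such a split while keeping the English and Korean files line-aligned is with standard shell tools; a minimal sketch (file names are illustrative) is:

>> paste corpus.en corpus.ko | shuf > corpus.shuf.tsv       # shuffle once, keeping en/ko lines paired
>> head -n 1200000 corpus.shuf.tsv > train.tsv
>> tail -n +1200001 corpus.shuf.tsv | head -n 200000 > valid.tsv
>> tail -n +1400001 corpus.shuf.tsv > test.tsv
>> cut -f1 train.tsv > corpus.shuf.train.en                 # and likewise -f2 for Korean, and for valid/test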

Each dataset was tokenized with Mecab (Korean) / MosesTokenizer (English) and then segmented with BPE. After preprocessing, each language has the following vocabulary size:

| en | ko |
| --- | --- |
| 20,525 | 29,411 |
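
The tokenization and BPE steps are not described in detail here; a rough sketch of an equivalent pipeline (file names and the number of BPE merge operations are illustrative, and exact tool options may differ) is:

>> mecab -O wakati < corpus.shuf.train.ko > corpus.shuf.train.tok.ko               # Korean morpheme segmentation
>> perl tokenizer.perl -l en < corpus.shuf.train.en > corpus.shuf.train.tok.en     # Moses tokenizer script
>> subword-nmt learn-bpe -s 30000 < corpus.shuf.train.tok.en > bpe.en.codes        # learn BPE merges per language
>> subword-nmt apply-bpe -c bpe.en.codes < corpus.shuf.train.tok.en > corpus.shuf.train.tok.bpe.en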

The following hyper-parameters were used for each model in the evaluation. Note that both architectures have a small number of parameters, because I do not have a large corpus. You should increase the number of parameters if you have more data.

| parameter | seq2seq | transformer |
| --- | --- | --- |
| batch_size | 320 | 4096 |
| word_vec_size | 512 | - |
| hidden_size | 768 | 768 |
| n_layers | 4 | 4 |
| n_splits | - | 8 |
| n_epochs | 30 | 30 |

The table below lists the hyper-parameters used for each training algorithm.

| parameter | MLE | MRT | DSL |
| --- | --- | --- | --- |
| n_epochs | 30 | 30 + 40 | 30 + 10 |
| optimizer | Adam | SGD | Adam |
| lr | 1e-3 | 1e-2 | 1e-2 |
| max_grad_norm | 1e+8 | 5 | 1e+8 $\rightarrow$ 5 |

Please note that MRT uses a different optimization setup (plain SGD with a larger learning rate and much tighter gradient clipping), as shown above.
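
For reference, MRT-style training minimizes an expected risk over translations sampled from the model instead of the negative log-likelihood. A common formulation (a sketch only; the exact loss and baseline used in this repo may differ) is

$$\mathcal{L}_{\text{MRT}}(\theta) = \mathbb{E}_{(x, y)} \Big[ \mathbb{E}_{\hat{y} \sim p_\theta(\cdot \mid x)} \big[ \Delta(\hat{y}, y) \big] \Big],$$

where $\Delta(\hat{y}, y)$ is a sentence-level risk such as a negative GLEU/BLEU reward. In practice the inner expectation is estimated by sampling translations from the current model and subtracting a baseline reward to reduce variance. Because the gradient is estimated from samples, it is noisier than the MLE gradient, which is consistent with the much tighter gradient clipping (max_grad_norm = 5) used for MRT above.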

Usage

If you are trying to build a Korean/English machine translation system, I recommend using the corpora from AI-Hub.

Training

>> python train.py -h
usage: train.py [-h] --model_fn MODEL_FN --train TRAIN --valid VALID --lang
                LANG [--gpu_id GPU_ID] [--off_autocast]
                [--batch_size BATCH_SIZE] [--n_epochs N_EPOCHS]
                [--verbose VERBOSE] [--init_epoch INIT_EPOCH]
                [--max_length MAX_LENGTH] [--dropout DROPOUT]
                [--word_vec_size WORD_VEC_SIZE] [--hidden_size HIDDEN_SIZE]
                [--n_layers N_LAYERS] [--max_grad_norm MAX_GRAD_NORM]
                [--iteration_per_update ITERATION_PER_UPDATE] [--lr LR]
                [--lr_step LR_STEP] [--lr_gamma LR_GAMMA]
                [--lr_decay_start LR_DECAY_START] [--use_adam] [--use_radam]
                [--rl_lr RL_LR] [--rl_n_samples RL_N_SAMPLES]
                [--rl_n_epochs RL_N_EPOCHS] [--rl_n_gram RL_N_GRAM]
                [--rl_reward RL_REWARD] [--use_transformer]
                [--n_splits N_SPLITS]

optional arguments:
  -h, --help            show this help message and exit
  --model_fn MODEL_FN   Model file name to save. Additional information would
                        be annotated to the file name.
  --train TRAIN         Training set file name except the extention. (ex:
                        train.en --> train)
  --valid VALID         Validation set file name except the extention. (ex:
                        valid.en --> valid)
  --lang LANG           Set of extention represents language pair. (ex: en +
                        ko --> enko)
  --gpu_id GPU_ID       GPU ID to train. Currently, GPU parallel is not
                        supported. -1 for CPU. Default=-1
  --off_autocast        Turn-off Automatic Mixed Precision (AMP), which speed-
                        up training.
  --batch_size BATCH_SIZE
                        Mini batch size for gradient descent. Default=32
  --n_epochs N_EPOCHS   Number of epochs to train. Default=20
  --verbose VERBOSE     VERBOSE_SILENT, VERBOSE_EPOCH_WISE, VERBOSE_BATCH_WISE
                        = 0, 1, 2. Default=2
  --init_epoch INIT_EPOCH
                        Set initial epoch number, which can be useful in
                        continue training. Default=1
  --max_length MAX_LENGTH
                        Maximum length of the training sequence. Default=100
  --dropout DROPOUT     Dropout rate. Default=0.2
  --word_vec_size WORD_VEC_SIZE
                        Word embedding vector dimension. Default=512
  --hidden_size HIDDEN_SIZE
                        Hidden size of LSTM. Default=768
  --n_layers N_LAYERS   Number of layers in LSTM. Default=4
  --max_grad_norm MAX_GRAD_NORM
                        Threshold for gradient clipping. Default=5.0
  --iteration_per_update ITERATION_PER_UPDATE
                        Number of feed-forward iterations for one parameter
                        update. Default=1
  --lr LR               Initial learning rate. Default=1.0
  --lr_step LR_STEP     Number of epochs for each learning rate decay.
                        Default=1
  --lr_gamma LR_GAMMA   Learning rate decay rate. Default=0.5
  --lr_decay_start LR_DECAY_START
                        Learning rate decay start at. Default=10
  --use_adam            Use Adam as optimizer instead of SGD. Other lr
                        arguments should be changed.
  --use_radam           Use rectified Adam as optimizer. Other lr arguments
                        should be changed.
  --rl_lr RL_LR         Learning rate for reinforcement learning. Default=0.01
  --rl_n_samples RL_N_SAMPLES
                        Number of samples to get baseline. Default=1
  --rl_n_epochs RL_N_EPOCHS
                        Number of epochs for reinforcement learning.
                        Default=10
  --rl_n_gram RL_N_GRAM
                        Maximum number of tokens to calculate BLEU for
                        reinforcement learning. Default=6
  --rl_reward RL_REWARD
                        Method name to use as reward function for RL training.
                        Default=gleu
  --use_transformer     Set model architecture as Transformer.
  --n_splits N_SPLITS   Number of heads in multi-head attention in
                        Transformer. Default=8

example usage:

Seq2Seq

>> python train.py --train ./data/corpus.shuf.train.tok.bpe --valid ./data/corpus.shuf.valid.tok.bpe --lang enko \
--gpu_id 0 --batch_size 128 --n_epochs 30 --max_length 100 --dropout .2 \
--word_vec_size 512 --hidden_size 768 --n_layers 4 --max_grad_norm 1e+8 --iteration_per_update 2 \
--lr 1e-3 --lr_step 0 --use_adam --rl_n_epochs 0 \
--model_fn ./model.pth

To continue with RL training

>> python continue_train.py --load_fn ./model.pth --model_fn ./model.rl.pth \
--init_epoch 31 --iteration_per_update 1 --max_grad_norm 5

Transformer

>> python train.py --train ./data/corpus.shuf.train.tok.bpe --valid ./data/corpus.shuf.valid.tok.bpe --lang enko \
--gpu_id 0 --batch_size 128 --n_epochs 30 --max_length 100 --dropout .2 \
--hidden_size 768 --n_layers 4 --max_grad_norm 1e+8 --iteration_per_update 32 \
--lr 1e-3 --lr_step 0 --use_adam --use_transformer --rl_n_epochs 0 \
--model_fn ./model.pth
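
Note that --iteration_per_update accumulates gradients over several mini-batches before each parameter update, so the effective batch size of this command is $128 \times 32 = 4096$, which matches the Transformer batch size in the hyper-parameter table above.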

Dual Supervised Learning

LM Training:

>> python lm_train.py --train ./data/corpus.shuf.train.tok.bpe --valid ./data/corpus.shuf.valid.tok.bpe --lang enko \
--gpu_id 0 --batch_size 256 --n_epochs 20 --max_length 64 --dropout .2 \
--word_vec_size 512 --hidden_size 768 --n_layers 4 --max_grad_norm 1e+8 \
--model_fn ./lm.pth

DSL using pretrained LM:

>> python dual_train.py --train ./data/corpus.shuf.train.tok.bpe --valid ./data/corpus.shuf.valid.tok.bpe --lang enko \
--gpu_id 0 --batch_size 64 --n_epochs 40 --max_length 64 --dropout .2 \
--word_vec_size 512 --hidden_size 768 --n_layers 4 --max_grad_norm 1e+8 --iteration_per_update 4 \
--dsl_n_warmup_epochs 30 --dsl_lambda 1e-2 \
--lm_fn ./lm.pth \
--model_fn ./model.pth

Note that I recommend using a different max_grad_norm value (e.g., 5) after the warm-up training. You can use continue_dual_train.py to change the max_grad_norm argument; a sketch follows below.
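
For example, assuming continue_dual_train.py accepts the same loading arguments as continue_train.py (this is a guess based on the script name; check its --help output for the actual options), the post-warm-up run might look like:

>> python continue_dual_train.py --load_fn ./model.pth --model_fn ./model.dsl.pth \
--init_epoch 31 --max_grad_norm 5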

Inference

You can translate any sentence via standard input and output.

>> python translate.py -h
usage: translate.py [-h] --model_fn MODEL_FN [--gpu_id GPU_ID]
                    [--batch_size BATCH_SIZE] [--max_length MAX_LENGTH]
                    [--n_best N_BEST] [--beam_size BEAM_SIZE] [--lang LANG]
                    [--length_penalty LENGTH_PENALTY]

optional arguments:
  -h, --help            show this help message and exit
  --model_fn MODEL_FN   Model file name to use
  --gpu_id GPU_ID       GPU ID to use. -1 for CPU. Default=-1
  --batch_size BATCH_SIZE
                        Mini batch size for parallel inference. Default=128
  --max_length MAX_LENGTH
                        Maximum sequence length for inference. Default=255
  --n_best N_BEST       Number of best inference result per sample. Default=1
  --beam_size BEAM_SIZE
                        Beam size for beam search. Default=5
  --lang LANG           Source language and target language. Example: enko
  --length_penalty LENGTH_PENALTY
                        Length penalty parameter that higher value produce
                        shorter results. Default=1.2

example usage:

>> python translate.py --model_fn ./model.pth --gpu_id 0 --lang enko < test.txt > test.result.txt

You may also need to adjust the other arguments (e.g., beam_size, batch_size, length_penalty) to fit your model and hardware.
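
Because translate.py reads from standard input and writes to standard output, a single sentence can also be piped in directly. For sensible results the input should be tokenized and BPE-segmented with the same pipeline as the training data; it is shown raw here for brevity:

>> echo "I love to go to school." | python translate.py --model_fn ./model.pth --gpu_id -1 --lang enko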

Translation Examples

The table below shows outputs from both MLE and MRT models on the Korean-to-English translation task.

| INPUT | REF | MLE | MRT |
| --- | --- | --- | --- |
| 우리는 또한 그 지역의 생선 가공 공장에서 심한 악취를 내며 썩어가는 엄청난 양의 생선도 치웠습니다. | We cleared tons and tons of stinking, rotting fish carcasses from the local fish processing plant. | We also had a huge stink in the fish processing plant in the area, smelling havoc with a huge amount of fish. | We also cleared a huge amount of fish that rot and rot in the fish processing factory in the area. |
| 회사를 이전할 이상적인 장소이다. | It is an ideal place to relocate the company. | It's an ideal place to transfer the company. | It's an ideal place to transfer the company. |
| 나는 이것들이 내 삶을 바꾸게 하지 않겠어. | I won't let this thing alter my life. | I'm not gonna let these things change my life. | I won't let these things change my life. |
| 사람들이 슬퍼보인다. | Their faces appear tearful. | People seem to be sad. | People seem to be sad. |
| 아냐, 그런데 넌 그렇다고 생각해. | No, but I think you do. | No, but I think you do. | No, but you think it's. |
| 하지만, 나는 나중에 곧 잠들었다. | But I fell asleep shortly afterwards. | However, I fell asleep in a moment. | However, I fell asleep soon afterwards. |
| 하지만 1997년 아시아에 외환위기가 불어닥쳤다. | But Asia was hit hard by the 1997 foreign currency crisis. | In 1997, however, the financial crisis in Asia has become a reality for Asia. | But in 1997, the foreign currency crisis was swept in Asia. |
| 메이저 리그 공식 웹사이트에 따르면, 12월 22일, 추씨는 텍사스 레인져스와 7년 계약을 맺었다. | According to Major League Baseball's official website, on Dec. 22, Choo signed a seven year contract with the Texas Rangers. | According to the Major League official website on December 22, Choo signed a seven-year contract with Texas Rangers in Texas | According to the Major League official website on December 22, Choo made a seven-year contract with Texas Rangers. |
| 한 개인. | a private individual | a person of personal importance | a personal individual |
| 도로에 차가 꼬리를 물고 늘어서있다. | The traffic is bumper to bumper on the road. | The road is on the road with a tail. | The road is lined with tail on the road. |
| 내가 그렇게 늙지 않았다는 점을 지적해도 될까요. | Let me point out that I'm not that old. | You can point out that I'm not that old. | You can point out that I'm not that old. |
| 닐슨 시청률은 15분 단위 증감으로 시청률을 측정하므로, ABC, NBC, CBS 와 Fox 의 순위를 정하지 않았다. | Nielsen had no ratings for ABC, NBC, CBS and Fox because it measures their viewership in 15-minute increments. | The Nielsen ratings measured the viewer's ratings with increments for 15-minute increments, so they did not rank ABC, NBC, CBS and Fox. | Nielson ratings measured ratings with 15-minute increments, so they did not rank ABC, NBC, CBS and Fox. |
| 다시말해서, 학교는 교사 부족이다. | In other words, the school is a teacher short. | In other words, school is a teacher short of a teacher. | In other words, school is a lack of teacher. |
| 그 다음 몇 주 동안에 사태가 극적으로 전환되었다. | Events took a dramatic turn in the weeks that followed. | The situation has been dramatically changed for the next few weeks. | The situation was dramatically reversed for the next few weeks. |
| 젊은이들을 물리학에 대해 흥미를 붙일수 있게 할수 있는 가장 좋은 사람은 졸업생 물리학자이다. | The best possible person to excite young people about physics is a graduate physicist. | The best person to be able to make young people interested in physics is a self-thomac physicist. | The best person to make young people interested in physics is a graduate physicist. |
| 5월 20일, 인도는 팔로디 마을에서 충격적인 기온인 섭씨 51도를 달성하며, 가장 더운 날씨를 기록했습니다. | On May 20, India recorded its hottest day ever in the town of Phalodi with a staggering temperature of 51 degrees Celsius. | On May 20, India achieved its hottest temperatures, even 51 degrees Celsius, in the Palrody village, and recorded the hottest weather. | On May 20, India achieved 51 degrees Celsius, a devastating temperature in Paldydy town, and recorded the hottest weather. |
| 내말은, 가끔 바나는 그냥 바나나야. | I mean, sometimes a banana is just a banana. | I mean, sometimes a banana is just a banana. | I mean, sometimes a banana is just a banana. |


simple-nmt's People

Contributors

calee88, kh-kim, texify[bot]


simple-nmt's Issues

How to run translate.py

Hello. I came across this repo while looking into NMT and have been using it.

Model training has finished, and I ran translate.py with a command like the example in the README to try translation.

I also entered the text I want to translate, but after that it does not seem to do anything.

The command was: python translate.py --model models/koth.01.10.22-27452.85.10.33-30499.12.pth --gpu_id -1 --batch_size 8 --beam_size 8 --lang koth

Whether I enter a Korean word or a whole sentence, it does not seem to move on to the next step.

Could you please check whether I am doing something wrong?

Problem with the RL trainer

I'm confused about the calculation of the gradient in RL training (the function _get_gradient in rl_trainer.py).
First, line 77 multiplies each sample by "-reward", and line 79 multiplies by -1 again; as a result, we are minimizing the reward, but RL is supposed to maximize the total reward.
Second, this implementation of Minimum Risk Training is not the same as the original paper; the hyper-parameter $\alpha$ is not considered.

CUDA out of memory (continuing RL training on the Transformer)

WARNING!!! Argument "--load_fn" is not found in saved model.	Use current value: ckpts/transformer_model_test.05.4.78-119.61.4.75-115.20.pth
WARNING!!! You changed value for argument "--model_fn".	Use current value: ckpts/rl_seq2seq_model_test.pth
WARNING!!! You changed value for argument "--init_epoch".	Use current value: 6
WARNING!!! You changed value for argument "--max_grad_norm".	Use current value: 5.0
WARNING!!! You changed value for argument "--iteration_per_update".	Use current value: 1
WARNING!!! You changed value for argument "--rl_n_epochs".	Use current value: 5
{   'batch_size': 128,
    'dropout': 0.2,
    'gpu_id': 5,
    'hidden_size': 768,
    'init_epoch': 6,
    'iteration_per_update': 1,
    'lang': 'deen',
    'load_fn': 'ckpts/transformer_model_test.05.4.78-119.61.4.75-115.20.pth',
    'lr': 0.001,
    'lr_decay_start': 10,
    'lr_gamma': 0.5,
    'lr_step': 0,
    'max_grad_norm': 5.0,
    'max_length': 100,
    'model_fn': 'ckpts/rl_seq2seq_model_test.pth',
    'n_epochs': 5,
    'n_layers': 4,
    'n_splits': 8,
    'off_autocast': False,
    'rl_lr': 0.01,
    'rl_n_epochs': 5,
    'rl_n_gram': 6,
    'rl_n_samples': 1,
    'rl_reward': 'gleu',
    'train': 'data/corpus_test/corpus.train',
    'use_adam': True,
    'use_radam': False,
    'use_transformer': True,
    'valid': 'data/corpus_test/corpus.valid',
    'verbose': 2,
    'word_vec_size': 512}
Transformer(
  (emb_enc): Embedding(14210, 768)
  (emb_dec): Embedding(10250, 768)
  (emb_dropout): Dropout(p=0.2, inplace=False)
  (encoder): MySequential(
    (0): EncoderBlock(
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (1): EncoderBlock(
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (2): EncoderBlock(
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (3): EncoderBlock(
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
  )
  (decoder): MySequential(
    (0): DecoderBlock(
      (masked_attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (masked_attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (masked_attn_dropout): Dropout(p=0.2, inplace=False)
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (1): DecoderBlock(
      (masked_attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (masked_attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (masked_attn_dropout): Dropout(p=0.2, inplace=False)
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (2): DecoderBlock(
      (masked_attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (masked_attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (masked_attn_dropout): Dropout(p=0.2, inplace=False)
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
    (3): DecoderBlock(
      (masked_attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (masked_attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (masked_attn_dropout): Dropout(p=0.2, inplace=False)
      (attn): MultiHead(
        (Q_linear): Linear(in_features=768, out_features=768, bias=False)
        (K_linear): Linear(in_features=768, out_features=768, bias=False)
        (V_linear): Linear(in_features=768, out_features=768, bias=False)
        (linear): Linear(in_features=768, out_features=768, bias=False)
        (attn): Attention(
          (softmax): Softmax(dim=-1)
        )
      )
      (attn_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn_dropout): Dropout(p=0.2, inplace=False)
      (fc): Sequential(
        (0): Linear(in_features=768, out_features=3072, bias=True)
        (1): ReLU()
        (2): Linear(in_features=3072, out_features=768, bias=True)
      )
      (fc_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (fc_dropout): Dropout(p=0.2, inplace=False)
    )
  )
  (generator): Sequential(
    (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=768, out_features=10250, bias=True)
    (2): LogSoftmax(dim=-1)
  )
)
NLLLoss()
Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.98)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
Epoch [1/5]:   1%|     | 1/77 [00:00<?, ?it/s, actor=4.15, baseline=4.04, reward=0.104, |g_param|=28.5, |param|=4.34e+3]Current run is terminating due to exception: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 5; 31.75 GiB total capacity; 26.41 GiB already allocated; 3.44 MiB free; 30.69 GiB reserved in total by PyTorch)
Engine run is terminating due to exception: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 5; 31.75 GiB total capacity; 26.41 GiB already allocated; 3.44 MiB free; 30.69 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/root/project/simple-nmt/continue_train.py", line 55, in <module>
    continue_main(config, main)
  File "/root/project/simple-nmt/continue_train.py", line 48, in continue_main
    main(config, model_weight=model_weight, opt_weight=opt_weight)
  File "/root/project/simple-nmt/train.py", line 355, in main
    n_epochs=config.rl_n_epochs,
  File "/root/project/simple-nmt/simple_nmt/trainer.py", line 311, in train
    train_engine.run(train_loader, max_epochs=n_epochs)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 745, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
    self._handle_exception(e)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/root/project/simple-nmt/simple_nmt/rl_trainer.py", line 140, in train
    max_length=engine.config.max_length
  File "/root/project/simple-nmt/simple_nmt/models/transformer.py", line 419, in search
    h_t, _, _, _, _ = block(h_t, z, mask_dec, prev, None)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/project/simple-nmt/simple_nmt/models/transformer.py", line 214, in forward
    mask=mask))
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/project/simple-nmt/simple_nmt/models/transformer.py", line 55, in forward
    QWs = self.Q_linear(Q).split(self.hidden_size // self.n_splits, dim=-1)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/root/anaconda3/envs/rlnmt/lib/python3.6/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 5; 31.75 GiB total capacity; 26.41 GiB already allocated; 3.44 MiB free; 30.69 GiB reserved in total by PyTorch)

How can I fix this? Thank you.

Error while training model in Google Colab

Thanks for making your repo public.

I am using your repo to train a model for EN-to-DE translation.

I tried to train the translation model in Google Colab with the runtime type set to TPU, GPU, and CPU, but I get the memory-related errors below for every runtime type. Could you please let me know what could be done?

Runtime type: TPU
NLLLoss()
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
File "train.py", line 491, in
main(config)
File "train.py", line 420, in main
model.cuda(config.gpu_id)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 307, in cuda
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 203, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 225, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 307, in
return self._apply(lambda t: t.cuda(device))
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/init.py", line 153, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:47

Runtime type: GPU
NLLLoss()
SGD (
Parameter Group 0
dampening: 0
initial_lr: 1.0
lr: 1.0
momentum: 0
nesterov: False
weight_decay: 0
)
/pytorch/aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Current run is terminating due to exception: CUDA out of memory. Tried to allocate 2.58 GiB (GPU 0; 11.17 GiB total capacity; 8.93 GiB already allocated; 1.92 GiB free; 8.95 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7ff00ff05536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7ff01014ef1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7ff01014ff9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: THCStorage_resize + 0x96 (0x7ff0113dce76 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: THCTensor_resizeNd + 0x441 (0x7ff0113ee031 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: THNN_CudaClassNLLCriterion_updateGradInput + 0x6a (0x7ff011c8ed4a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1027578 (0x7ff011388578 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xfab71b (0x7ff01130c71b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2b493d8 (0x7ff04a9603d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::NllLossBackward::apply(std::vector<at::Tensor, std::allocatorat::Tensor >&&) + 0x276 (0x7ff04a6bbda6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2d89705 (0x7ff04aba0705 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptrtorch::autograd::GraphTask&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7ff04ab9da03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptrtorch::autograd::GraphTask const&, bool) + 0x3d2 (0x7ff04ab9e7e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7ff04ab96e59 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7ff0574de488 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #17: + 0xbd6df (0x7ff059a9b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: + 0x76db (0x7ff05ab7d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7ff05aeb6a3f in /lib/x86_64-linux-gnu/libc.so.6)
.
Engine run is terminating due to exception: CUDA out of memory. Tried to allocate 2.58 GiB (GPU 0; 11.17 GiB total capacity; 8.93 GiB already allocated; 1.92 GiB free; 8.95 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7ff00ff05536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7ff01014ef1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7ff01014ff9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: THCStorage_resize + 0x96 (0x7ff0113dce76 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: THCTensor_resizeNd + 0x441 (0x7ff0113ee031 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: THNN_CudaClassNLLCriterion_updateGradInput + 0x6a (0x7ff011c8ed4a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1027578 (0x7ff011388578 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xfab71b (0x7ff01130c71b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2b493d8 (0x7ff04a9603d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::NllLossBackward::apply(std::vector<at::Tensor, std::allocatorat::Tensor >&&) + 0x276 (0x7ff04a6bbda6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2d89705 (0x7ff04aba0705 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptrtorch::autograd::GraphTask&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7ff04ab9da03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptrtorch::autograd::GraphTask const&, bool) + 0x3d2 (0x7ff04ab9e7e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7ff04ab96e59 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7ff0574de488 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #17: + 0xbd6df (0x7ff059a9b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: + 0x76db (0x7ff05ab7d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7ff05aeb6a3f in /lib/x86_64-linux-gnu/libc.so.6)
.
Traceback (most recent call last):
File "train.py", line 491, in
main(config)
File "train.py", line 468, in main
lr_scheduler=lr_scheduler,
File "/content/drive/My Drive/Simple_NMT/simple_nmt/trainer.py", line 282, in train
train_engine.run(train_loader, max_epochs=n_epochs)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 658, in run
return self._internal_run()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 722, in _internal_run
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 437, in _handle_exception
raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 697, in _internal_run
time_taken = self._run_once_on_dataset()
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 788, in _run_once_on_dataset
self._handle_exception(e)
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 437, in _handle_exception
raise e
File "/usr/local/lib/python3.6/dist-packages/ignite/engine/engine.py", line 771, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/content/drive/My Drive/Simple_NMT/simple_nmt/trainer.py", line 61, in train
loss.div(y.size(0)).backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 2.58 GiB (GPU 0; 11.17 GiB total capacity; 8.93 GiB already allocated; 1.92 GiB free; 8.95 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7ff00ff05536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7ff01014ef1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7ff01014ff9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: THCStorage_resize + 0x96 (0x7ff0113dce76 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: THCTensor_resizeNd + 0x441 (0x7ff0113ee031 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: THNN_CudaClassNLLCriterion_updateGradInput + 0x6a (0x7ff011c8ed4a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1027578 (0x7ff011388578 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xfab71b (0x7ff01130c71b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2b493d8 (0x7ff04a9603d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x10c33d3 (0x7ff048eda3d3 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::NllLossBackward::apply(std::vector<at::Tensor, std::allocatorat::Tensor >&&) + 0x276 (0x7ff04a6bbda6 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2d89705 (0x7ff04aba0705 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptrtorch::autograd::GraphTask&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7ff04ab9da03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptrtorch::autograd::GraphTask const&, bool) + 0x3d2 (0x7ff04ab9e7e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int) + 0x39 (0x7ff04ab96e59 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7ff0574de488 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #17: + 0xbd6df (0x7ff059a9b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: + 0x76db (0x7ff05ab7d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7ff05aeb6a3f in /lib/x86_64-linux-gnu/libc.so.6)

Runtime type: CPU
Traceback (most recent call last):
File "./train.py", line 491, in
main(config)
File "./train.py", line 468, in main
lr_scheduler=lr_scheduler,
File "C:\01_Development\Master_Thesis\Simple_NMT\simple_nmt\trainer.py", line 282, in train
train_engine.run(train_loader, max_epochs=n_epochs)
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 359, in run
self._handle_exception(e)
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 324, in _handle_exception
raise e
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 346, in run
hours, mins, secs = self._run_once_on_dataset()
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 313, in _run_once_on_dataset
self._handle_exception(e)
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 324, in _handle_exception
raise e
File "C:\Program Files (x86)\Python37\lib\site-packages\ignite\engine\engine.py", line 305, in _run_once_on_dataset
self.state.output = self.process_function(self, batch)
File "C:\01_Development\Master_Thesis\Simple_NMT\simple_nmt\trainer.py", line 61, in train
loss.div(y.size(0)).backward()
File "C:\Program Files (x86)\Python37\lib\site-packages\torch\tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Program Files (x86)\Python37\lib\site-packages\torch\autograd_init
.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 6930739200 bytes. Buy new RAM!
(no backtrace available)

Could you please help me get rid of these memory issues?

Trying to run reinforcement learning for fine-tuning, like Minimum Risk Training (MRT), on the Transformer

I am trying to run reinforcement learning for fine-tuning, like Minimum Risk Training (MRT), on the Transformer.
I have prepared a fine-tuning dataset that has already gone through BPE using the BPE model obtained from the pre-training data.
To fine-tune, is the following command enough,

python continue_train.py --train [finetune_data.train.tok.bpe] --valid [finetune_data.valid.tok.bpe] --lang enko
--load_fn ./model.pth
--iteration_per_update 1 --max_grad_norm 5
--use_adam --use_transformer --rl_n_epochs 10
--model_fn ./model.rl.pth

i.e., taking the command from "To continue with RL training" and only adding --use_adam --use_transformer --rl_n_epochs 10?

Hello, I am a student studying deep learning.

Hello! I am a student taking the FastCampus course.

The thing is, my computer is not very good, so I could not train on the full dataset and the predictions do not come out well... Could you possibly share your final model files (pt, pkl, etc.)? I am curious whether the results actually look good at prediction time.

Thank you!

[Help Wanted] How to evaluate BLEU score on test sets

Hello, first of all thanks for making this repo public.

I'm trying to translate from DE to EN, and I've been able to train a model that does so. Now I want to evaluate the model's BLEU score. How can I compute BLEU on my test set (I have test.de and test.en files)? Do I have to translate the whole test.de first and then compare against the references in test.en? How did you score your koen models?

If I'm not mistaken, you seem to have scripts for that task (massive_test.py and multi-bleu.perl), but I don't really understand how to use them properly.

Thanks in advance!
