GPU training crashes (openkiwi, 36 comments, closed)

Bhavani01 commented on August 20, 2024
GPU training crashes

Comments (36)

Bhavani01 commented on August 20, 2024

Is there any update on this? I now have the latest versions of Kiwi and PyTorch, but the GPU training still fails. I also have an additional issue: training on the CPU is fine, but when I try to predict, it fails. It exits without giving an error. Pasting the log here. Any insights on what I could do differently? Thanks in advance.
2019-11-04 09:15:15.747 [kiwi.lib.predict setup:159] {'batch_size': 64,
'config': 'predict_estimator.yaml',
'debug': False,
'experiment_name': 'predict-predest',
'gpu_id': None,
'load_data': None,
'load_model': '/ec/dgt/local/exodus/home/bhaskbh/new_train/best_model.torch',
'load_vocab': '/ec/dgt/local/exodus/home/bhaskbh/new_train/vocab.torch',
'log_interval': 100,
'mlflow_always_log_artifacts': False,
'mlflow_tracking_uri': 'mlruns/',
'model': 'estimator',
'output_dir': '/ec/dgt/local/exodus/home/bhaskbh/test_data',
'quiet': False,
'run_uuid': None,
'save_config': None,
'save_data': None,
'seed': 42}
2019-11-04 09:15:15.747 [kiwi.lib.predict setup:160] Local output directory is: /ec/dgt/local/home/bhaskbh/test_data
2019-11-04 09:15:15.747 [kiwi.lib.predict run:100] Predict with the PredEst (Predictor-Estimator) model
2019-11-04 09:15:18.168 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/home/bhaskbh/new_train/best_model.torch

captainvera commented on August 20, 2024

Are you talking about this last comment? So the predict pipeline is working as expected? 🙂

Bhavani01 commented on August 20, 2024

Yes. I was outside the virtual env and the GPU was not visible to it.

captainvera commented on August 20, 2024

Hi @Bhavani01 !

This issue should have been fixed with #39. Can you specify which version of pytorch you're using so we can test appropriately?

Thanks!

Bhavani01 commented on August 20, 2024

captainvera commented on August 20, 2024

Hi!
I'm really sorry about the late response; I let this slip through the cracks. I'll give you an update later today!

On the second issue, it is hard to diagnose through a stopped log, would you mind sharing the command/config of how you're running the prediction pipeline?

Bhavani01 commented on August 20, 2024

experiment-name: predict-predest
output-dir: /ec/dgt/local/home/bhaskbh/test_data
seed: 42
#gpu-id: 0
model: estimator
sentence-level: True
binary-level: True
load-model: /ec/dgt/local/home/bhaskbh/new_train/best_model.torch
load-vocab: /ec/dgt/local/home/bhaskbh/new_train/vocab.torch
wmt18-format: False
test-source: /ec/dgt/local/home/bhaskbh/test_data/uuuu10k.en.txt
test-target: /ec/dgt/local/home/bhaskbh/test_data/uuuu10k.en_DE.txt
valid-batch-size: 64

captainvera commented on August 20, 2024

Hi,

I have made a pull request #44 that should solve the issue at hand.
On your second issue, I'll get back to you soon. I was able to reproduce it and am working on a fix.

Miguel

kepler commented on August 20, 2024

Hi @Bhavani01. Please let us know whether the current version of master solves the first issue.

Bhavani01 commented on August 20, 2024

I re-installed it but it still crashes. This is the only difference in the output log.
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'other' in call to th_iand

captainvera commented on August 20, 2024

In the exact same line as before?
I'm having trouble reproducing this issue now; I'm using the exact same config as yours with PyTorch 1.2.

As for your second issue, I can only reproduce this logging-but-no-output situation when running the predict pipeline with a Predictor and not a Predictor-Estimator. It should be noted that the Predictor is just a pre-training step and can't actually generate QE tags. You need to train the Estimator on top of the Predictor. Can you confirm you have a predictor-estimator?

@kepler maybe we should add an error message when trying to run the predict pipeline with a predictor. (the names are kind of confusing hehe). This would avoid these silent crashes and provide actionable feedback.
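
For illustration only, such a guard might look roughly like the sketch below. This is a hypothetical sketch, not OpenKiwi's actual API: the function name and the way the model type is detected are made up.

def check_model_can_predict(model):
    # A plain Predictor is only a pre-training step and cannot generate QE
    # tags, so the predict pipeline should refuse it with a clear message
    # instead of crashing silently.
    if type(model).__name__ == "Predictor":
        raise ValueError(
            "The predict pipeline needs a Predictor-Estimator model; "
            "a plain Predictor cannot generate QE tags. "
            "Train an Estimator on top of this Predictor first."
        )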

Bhavani01 commented on August 20, 2024
  1. Yes, the same line as before, with the addition of "in call to th_iand".
  2. I did train an estimator and the log showed a successful training. I also checked the location of the model, and I was pointing at the right model. I will test it again.

Bhavani01 commented on August 20, 2024

Could it be because of this? I assumed --load-pred-source was for when I wanted to predict the source, and similarly for --load-pred-target, but looking at the documentation again it says: --load-pred-target - If set, model architecture and vocabulary parameters are ignored. Load pretrained predictor tgt->src. I will let you know whether this training is successful in 3 days. But the GPU problem still persists. Thanks a lot for your help.

captainvera commented on August 20, 2024

Hmmm, I think you assumed the correct thing and our documentation is wrong. I'm going to confirm this, but an initial look seems to indicate that --load-pred-target is indeed used when predicting src -> tgt, which is what I assume you want to do.

On the other hand, I'm not able to reproduce your error with the GPU. I'm using Python 3.6.8, the latest version of OpenKiwi (installed from master; this is important, as we haven't updated the version on pip yet) and PyTorch 1.2.0. The only thing I've changed in the config you provided was adding a line with model: predictor, as that is required to run the training pipeline.

As for your second issue, an easy way to test whether the model is at fault is to download one of the pre-trained models available on our releases and run the same config pointing to it.

Bhavani01 commented on August 20, 2024

I did test the config with the pretrained models. So I guess there is something wrong with my model, even though the training completed successfully. I am retraining now and will test it again when it completes. For the GPU issue, I will download from master again this time instead of updating, and test. I would be really grateful if you could confirm the --load-pred-target behaviour. Thanks

Bhavani01 commented on August 20, 2024

I cloned master again instead of pulling changes, and the training on the GPU seems OK so far. Thanks. I will report on the other issue when I finish.

captainvera commented on August 20, 2024

Nice! Glad to hear your problem has been solved :)

I confirmed the issue about the --load-pred-target and my suspicion was correct. It is used to load src -> tgt predictors. Our documentation has a mistake, thanks for pointing it out!

Bhavani01 commented on August 20, 2024

The predictor training on the GPU was fine. However, it crashed for the estimator training.
Here is the log.

Command:

kiwi train --config /ec/dgt/local/exodus/home/bhaskbh/new_train/estimate.yaml

Logging:

2019-11-08 09:06:48.174 [root setup:380] This is run ID: 27917144000c41e4a505dcaff111c669
2019-11-08 09:06:48.174 [root setup:383] Inside experiment ID: 0 (EN-DE Train Estimator)
2019-11-08 09:06:48.174 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
2019-11-08 09:06:48.174 [root setup:389] Logging execution to MLflow at: None
2019-11-08 09:06:48.194 [root setup:395] Using GPU: 2
2019-11-08 09:06:48.194 [root setup:400] Artifacts location: None
2019-11-08 09:06:48.201 [kiwi.lib.train run:154] Training the PredEst (Predictor-Estimator) model
2019-11-08 09:07:05.865 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
2019-11-08 09:07:12.816 [kiwi.lib.train run:187] Estimator(
  (predictor_tgt): Predictor(
    (attention): Attention(
      (scorer): MLPScorer(
        (layers): ModuleList(
          (0): Sequential(
            (0): Linear(in_features=1600, out_features=800, bias=True)
            (1): Tanh()
          )
          (1): Sequential(
            (0): Linear(in_features=800, out_features=1, bias=True)
            (1): Tanh()
          )
        )
      )
    )
    (embedding_source): Embedding(45004, 200, padding_idx=1)
    (embedding_target): Embedding(45004, 200, padding_idx=1)
    (lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
    (forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
    (backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
    (W1): Embedding(45004, 200, padding_idx=1)
    (_loss): CrossEntropyLoss()
  )
  (mlp): Sequential(
    (0): Linear(in_features=1000, out_features=125, bias=True)
    (1): Tanh()
  )
  (lstm): LSTM(125, 125, batch_first=True, bidirectional=True)
  (embedding_out): Linear(in_features=250, out_features=2, bias=True)
  (sentence_pred): Sequential(
    (0): Linear(in_features=250, out_features=125, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=125, out_features=62, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=62, out_features=1, bias=True)
  )
  (binary_pred): Sequential(
    (0): Linear(in_features=250, out_features=125, bias=True)
    (1): Tanh()
    (2): Linear(in_features=125, out_features=62, bias=True)
    (3): Tanh()
    (4): Linear(in_features=62, out_features=2, bias=True)
  )
  (xents): ModuleDict(
    (tags): CrossEntropyLoss()
  )
  (mse_loss): MSELoss()
  (xent_binary): CrossEntropyLoss()
)
2019-11-08 09:07:12.816 [kiwi.lib.train run:188] 39845791 parameters
2019-11-08 09:07:12.817 [kiwi.trainers.trainer run:75] Epoch 1 of 10
Batches:   0%|                         | 1/5942 [00:02<3:29:00,  2.11s/ batches]/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
Floating point exception

My config is as follows:

model: estimator
output-dir: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
mlp-est: True
token-level: True
sentence-level: True
sentence-ll: False
binary-level: True
predict-target: true
target-bad-weight: 2.5
predict-source: false
source-bad-weight: 2.5
predict-gaps: false
target-bad-weight: 2.5
epochs: 10
checkpoint-validation-steps: 0
checkpoint-save: true
checkpoint-keep-only-best: 3
checkpoint-early-stop-patience: 0
log-interval: 100
learning-rate: 2e-3
train-batch-size: 64
valid-batch-size: 64
load-pred-target: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
wmt18-format: false
train-source: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.src
train-target: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.mt
train-pe: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.pe
train-target-tags: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.tags
train-sentence-scores: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.ter
split: 0.99
experiment-name: EN-DE Train Estimator
gpu-id: 2

captainvera commented on August 20, 2024

Hey, @Bhavani01 I'll take a look at this today and get back to you shortly!

I took the liberty of editing your comment to make it easier to read :)

captainvera commented on August 20, 2024

I'm having trouble reproducing your problem.

I trained a small predictor with the example training config we make available in the repo and then trained an estimator with your config above. The only things I changed were the data (using WMT19) and the wmt18-format flag (since you're not predicting gaps, and WMT19 has gaps that kiwi needs to know to filter out).

Am I correct to assume that you trained your predictor with the config you show in your first comment? I'd like your confirmation on that, but while I wait I'll try training a predictor with that config and an estimator on top, to see if I can find something, and get back to you.

Bhavani01 commented on August 20, 2024

For the predictor I used the same config as in my first comment. Other than saving the best model, is there any other message to indicate a successful training? I say the predictor training was successful because it ran for 6 epochs and saved the best model.

captainvera commented on August 20, 2024

Nope, that should be it. If you're getting reasonable results (Acc > 0.6), the model is improving, and it finishes training without any error, then yes, it was a successful training.

I've just finished training a predictor and an estimator with your configs (using WMT19 data on both, something done solely for testing purposes) and they both trained successfully.

With this, I can't really reproduce your problem...
Could you try training your models with WMT19 data for testing purposes? It is available here

Maybe it is somehow related to the data you're using being handled wrongly by kiwi? That floating point error is extremely weird.

Also, can you train with the CPU? The estimator should be pretty fast to train on the CPU; that can be an alternative for the time being while we find out what's going on here!

Bhavani01 commented on August 20, 2024

OK. I will try to train with the CPU and the WMT19 data.

Bhavani01 commented on August 20, 2024

This is my error with the CPU:

Batches:   0%|                        | 1/5942 [00:37<62:35:24, 37.93s/ batches]Traceback (most recent call last):
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/bin/kiwi", line 11, in <module>
    load_entry_point('openkiwi', 'console_scripts', 'kiwi')()
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/pipelines/train.py", line 142, in main
    train.train_from_options(options)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 76, in run
    self.train_epoch(train_iterator, valid_iterator)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 96, in train_epoch
    outputs = self.train_step(batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 141, in train_step
    loss_dict = self.model.loss(model_out, batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 507, in loss
    loss_bin = self.binary_loss(model_out, batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 497, in binary_loss
    loss = self.xent_binary(model_out[const.BINARY], labels.long())
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93
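
For context, this class of error is reproducible outside OpenKiwi with a few lines of PyTorch: cross-entropy targets must lie in [0, n_classes). The exact message depends on the PyTorch version and device (a device-side assertion on the GPU, as in the earlier log; an assertion or index error on the CPU), but the cause is the same.

import torch
import torch.nn.functional as F

logits = torch.randn(1, 2)           # two classes (e.g. OK/BAD)
bad_target = torch.tensor([2])       # out of range for n_classes = 2
F.cross_entropy(logits, bad_target)  # fails: targets must be in [0, 2)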

captainvera commented on August 20, 2024

Hmmm, this definitely seems to imply something weird with the data you're feeding into Kiwi. Would you mind sharing a subset of this data?

Since this fails on the first batch, a couple of lines of each file should be enough for me to check what's going on!

Bhavani01 commented on August 20, 2024

I suspected this has to do with the tags. Since I don't have human annotators, I use a simple script to generate the labels from the alignments; a superficial glance at them seemed fine. I trained with just TER and sentence level and got the same error. Since the predictor trained successfully with the same src, mt and pe files, TER is the only additional input. Could this be because my TER range is not 0-1 and some segments have a score greater than 1? I will try rounding all scores above 1 down to 1 and check. I will also try to get a subset of my training data that is public domain to share. Thanks.
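
One plausible reading of the traceback above is that the binary labels come from these scores cast to an integer class index (labels.long()), in which case a score above 1 becomes an out-of-range class for a 2-class loss. A minimal sketch of the clamping idea, assuming one TER score per line in the sentence-scores file (file names are illustrative):

def clamp_ter_scores(in_path, out_path):
    # Clamp each sentence-level TER score into [0, 1] before training.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            score = float(line.strip())
            fout.write(f"{min(max(score, 0.0), 1.0)}\n")

# Example (paths are illustrative):
# clamp_ter_scores("en_de.ter", "en_de.clamped.ter")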

captainvera commented on August 20, 2024

Hi @Bhavani01, has your issue been solved by regenerating/repairing the data?

Bhavani01 commented on August 20, 2024

I don't get the same error anymore after changing the TER scores. In the GPU training I now get "RuntimeError: CUDA out of memory." even though I have more than enough memory, and irrespective of the size of the data I am training on. I got a different error in the CPU training with the full dataset; now I am running with a subsection of the data, and it is still training. I am trying to use the --load-model option to train with smaller sets of data until I reach the one that is causing problems, as I did basic data cleaning and can't see any obvious problems.

captainvera commented on August 20, 2024

That's good news.

On the GPU error: that is not related to the computer's memory but to the GPU memory. As such, the size of the training data does not matter (that fills the RAM but not the GPU); what matters is the batch size and the number of tokens in each sentence.

My recommendations would be to decrease the batch_size (while adjusting the learning rate accordingly) and to use the options we provide to control the max token count of src and tgt sentences, respectively: --max-source-length: X & --max-target-length: X where X is usually something between 50 and 100.

It can happen that you have some unusually long sentences being loaded into the GPU and this exceeds the amount of memory available.
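
For illustration, those options could be added to the estimator config posted earlier along these lines (the values are examples only, and the halved learning rate is just one way to "adjust accordingly"):

train-batch-size: 32
valid-batch-size: 32
max-source-length: 70
max-target-length: 70
learning-rate: 1e-3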

As for the CPU training, I'd be very interested in the error you're getting, as that is not expected.

Finally, we are preparing some updates for Kiwi that should add some sanity checks for data. This should help us avoid errors like your previous one in the future, stay tuned!

Bhavani01 commented on August 20, 2024

I did try with batch size 32. I restrict the length of segments to 200 in my data cleaning. I will reduce it and test. Thanks.

Bhavani01 commented on August 20, 2024

Hi, for the NuQE and Quetch trainings, the target is the MT output, right? Not the reference or the post-edit?
BTW, the timestamp in the log does not match the system or local time (Central European in my case); it is one hour behind (London time). It doesn't affect training, just thought you should know.

captainvera commented on August 20, 2024

When training a QE model, the target should always be the MT output. This applies to all models in OpenKiwi.

We normally refer to things with the following nomenclature:
Source - Text in the source language
Target - MT produced from Source

Thanks for the heads-up! We'll see how to use system time. I'd say we probably set something up wrong but never noticed, since we are in London time :)

Bhavani01 commented on August 20, 2024

Hi,
Even if I specify the GPU id, the predict pipeline picks the CPU. Is there a way around this?

Thanks.
Bhavani

captainvera commented on August 20, 2024

Hey @Bhavani01, that shouldn't happen; let me have a look into what's going on.

Also, I'd appreciate it if you could open new issues instead of continuing the conversation on this one! That way we can containerise topics and use these issues to help with similar questions in the future.

Bhavani01 commented on August 20, 2024

Got it. BTW, the GPU issue was on my system; all my GPUs were blocked. Sorry about that.

captainvera commented on August 20, 2024

Ah! Glad to know it's solved! I'll close this issue for now. Feel free to open a new one in case you have any further questions.
