Thank you for your great work on RotatE. I'm really interested in your research.
- I ran the program as shown below, but it failed with "RuntimeError: CUDA out of memory". How would you debug this?
Reducing the batch size from 1024 to 256 lets the program run successfully, but I would rather not change the batch size.
dl-box@DL-Box:~/Downloads/RotatE$ CUDA_VISIBLE_DEVICES=0 python -u codes/run.py --do_train \
--cuda \
--do_valid \
--do_test \
--data_path data/FB15k \
--model RotatE \
-n 256 -b 1024 -d 1000 \
-g 24.0 -a 1.0 -adv \
-lr 0.0001 --max_steps 150000 \
-save models/RotatE_FB15k_0 --test_batch_size 16 -de
2021-11-07 17:21:05,436 INFO Model: RotatE
2021-11-07 17:21:05,437 INFO Data Path: data/FB15k
2021-11-07 17:21:05,437 INFO #entity: 14951
2021-11-07 17:21:05,437 INFO #relation: 1345
2021-11-07 17:21:05,892 INFO #train: 483142
2021-11-07 17:21:05,941 INFO #valid: 50000
2021-11-07 17:21:06,000 INFO #test: 59071
2021-11-07 17:21:06,202 INFO Model Parameter Configuration:
2021-11-07 17:21:06,202 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-07 17:21:06,202 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-07 17:21:06,202 INFO Parameter entity_embedding: torch.Size([14951, 2000]), require_grad = True
2021-11-07 17:21:06,202 INFO Parameter relation_embedding: torch.Size([1345, 1000]), require_grad = True
2021-11-07 17:21:12,102 INFO Ramdomly Initializing RotatE Model...
2021-11-07 17:21:12,102 INFO Start Training...
2021-11-07 17:21:12,102 INFO init_step = 0
2021-11-07 17:21:12,102 INFO batch_size = 1024
2021-11-07 17:21:12,102 INFO negative_adversarial_sampling = 1
2021-11-07 17:21:12,102 INFO hidden_dim = 1000
2021-11-07 17:21:12,102 INFO gamma = 24.000000
2021-11-07 17:21:12,102 INFO negative_adversarial_sampling = True
2021-11-07 17:21:12,102 INFO adversarial_temperature = 1.000000
2021-11-07 17:21:12,102 INFO learning_rate = 0
Traceback (most recent call last):
File "codes/run.py", line 361, in <module>
main(parse_args())
File "codes/run.py", line 305, in main
log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
File "/home/dl-box/Downloads/RotatE/codes/model.py", line 300, in train_step
loss.backward()
File "/home/dl-box/.local/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/dl-box/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 10.92 GiB total capacity; 6.11 GiB already allocated; 866.06 MiB free; 7.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
- I then ran "bash run.sh train ComplEx FB15k 0 0 1024 256 1000 500.0 1.0 0.001 150000 16 -de -dr -r 0.000002" (on the FB15k dataset) as shown below, and the program ran successfully on my Ubuntu server.
dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train ComplEx FB15k 0 0 1024 256 1000 500.0 1.0 0.001 150000 16 -de -dr -r 0.000002
1.10.0+cu102
Start Training......
2021-11-08 04:46:49,552 INFO Model: ComplEx
2021-11-08 04:46:49,552 INFO Data Path: data/FB15k
2021-11-08 04:46:49,552 INFO #entity: 14951
2021-11-08 04:46:49,552 INFO #relation: 1345
2021-11-08 04:46:50,009 INFO #train: 483142
2021-11-08 04:46:50,058 INFO #valid: 50000
2021-11-08 04:46:50,120 INFO #test: 59071
2021-11-08 04:46:50,336 INFO Model Parameter Configuration:
2021-11-08 04:46:50,336 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:50,336 INFO Parameter entity_embedding: torch.Size([14951, 2000]), require_grad = True
2021-11-08 04:46:50,336 INFO Parameter relation_embedding: torch.Size([1345, 2000]), require_grad = True
2021-11-08 04:46:56,318 INFO Ramdomly Initializing ComplEx Model...
2021-11-08 04:46:56,318 INFO Start Training...
2021-11-08 04:46:56,318 INFO init_step = 0
2021-11-08 04:46:56,318 INFO batch_size = 1024
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:56,318 INFO hidden_dim = 1000
2021-11-08 04:46:56,318 INFO gamma = 500.000000
2021-11-08 04:46:56,318 INFO negative_adversarial_sampling = True
2021-11-08 04:46:56,318 INFO adversarial_temperature = 1.000000
2021-11-08 04:46:56,318 INFO learning_rate = 0
2021-11-08 04:46:57,568 INFO Training average regularization at step 0: 2.061783
2021-11-08 04:46:57,568 INFO Training average positive_sample_loss at step 0: 0.959978
2021-11-08 04:46:57,568 INFO Training average negative_sample_loss at step 0: 2.498887
2021-11-08 04:46:57,569 INFO Training average loss at step 0: 3.791215
2021-11-08 04:46:57,569 INFO Evaluating on Valid Dataset...
2021-11-08 04:46:58,255 INFO Evaluating the model... (0/6250)
2021-11-08 04:47:40,912 INFO Evaluating the model... (1000/6250)
2021-11-08 04:48:24,477 INFO Evaluating the model... (2000/6250)
2021-11-08 04:49:08,110 INFO Evaluating the model... (3000/6250)
2021-11-08 04:49:51,826 INFO Evaluating the model... (4000/6250)
2021-11-08 04:50:35,372 INFO Evaluating the model... (5000/6250)
2021-11-08 04:51:18,479 INFO Evaluating the model... (6000/6250)
2021-11-08 04:51:29,527 INFO Valid MRR at step 0: 0.000718
2021-11-08 04:51:29,527 INFO Valid MR at step 0: 7412.979920
2021-11-08 04:51:29,527 INFO Valid HITS@1 at step 0: 0.000050
2021-11-08 04:51:29,527 INFO Valid HITS@3 at step 0: 0.000190
2021-11-08 04:51:29,527 INFO Valid HITS@10 at step 0: 0.000820
2021-11-08 04:51:44,653 INFO Training average regularization at step 100: 1.869630
2021-11-08 04:51:44,653 INFO Training average positive_sample_loss at step 100: 0.878554
2021-11-08 04:51:44,654 INFO Training average negative_sample_loss at step 100: 2.214018
2021-11-08 04:51:44,654 INFO Training average loss at step 100: 3.415917
2021-11-08 04:51:59,475 INFO Training average regularization at step 200: 1.649423
2021-11-08 04:51:59,475 INFO Training average positive_sample_loss at step 200: 0.795739
2021-11-08 04:51:59,475 INFO Training average negative_sample_loss at step 200: 1.878687
2021-11-08 04:51:59,475 INFO Training average loss at step 200: 2.986636
2021-11-08 04:52:14,330 INFO Training average regularization at step 300: 1.493370
2021-11-08 04:52:14,330 INFO Training average positive_sample_loss at step 300: 0.723991
2021-11-08 04:52:14,330 INFO Training average negative_sample_loss at step 300: 1.647611
2021-11-08 04:52:14,330 INFO Training average loss at step 300: 2.679172
2021-11-08 04:52:29,411 INFO Training average regularization at step 400: 1.364369
2021-11-08 04:52:29,411 INFO Training average positive_sample_loss at step 400: 0.668379
2021-11-08 04:52:29,411 INFO Training average negative_sample_loss at step 400: 1.480148
2021-11-08 04:52:29,411 INFO Training average loss at step 400: 2.438632
2021-11-08 04:52:44,290 INFO Training average regularization at step 500: 1.252640
2021-11-08 04:52:44,290 INFO Training average positive_sample_loss at step 500: 0.615634
2021-11-08 04:52:44,290 INFO Training average negative_sample_loss at step 500: 1.347466
2021-11-08 04:52:44,290 INFO Training average loss at step 500: 2.234190
2021-11-08 04:52:59,189 INFO Training average regularization at step 600: 1.153765
2021-11-08 04:52:59,189 INFO Training average positive_sample_loss at step 600: 0.570805
2021-11-08 04:52:59,189 INFO Training average negative_sample_loss at step 600: 1.245437
2021-11-08 04:52:59,189 INFO Training average loss at step 600: 2.061886
2021-11-08 04:53:14,166 INFO Training average regularization at step 700: 1.065076
2021-11-08 04:53:14,166 INFO Training average positive_sample_loss at step 700: 0.524925
2021-11-08 04:53:14,166 INFO Training average negative_sample_loss at step 700: 1.163066
2021-11-08 04:53:14,166 INFO Training average loss at step 700: 1.909072
2021-11-08 04:53:29,006 INFO Training average regularization at step 800: 0.984837
2021-11-08 04:53:29,006 INFO Training average positive_sample_loss at step 800: 0.489442
2021-11-08 04:53:29,006 INFO Training average negative_sample_loss at step 800: 1.097700
2021-11-08 04:53:29,006 INFO Training average loss at step 800: 1.778408
2021-11-08 04:53:43,852 INFO Training average regularization at step 900: 0.911781
2021-11-08 04:53:43,852 INFO Training average positive_sample_loss at step 900: 0.451165
2021-11-08 04:53:43,852 INFO Training average negative_sample_loss at step 900: 1.044625
2021-11-08 04:53:43,852 INFO Training average loss at step 900: 1.659676
2021-11-08 04:53:59,565 INFO Training average regularization at step 1000: 0.845027
2021-11-08 04:53:59,565 INFO Training average positive_sample_loss at step 1000: 0.363237
2021-11-08 04:53:59,565 INFO Training average negative_sample_loss at step 1000: 1.000880
2021-11-08 04:53:59,565 INFO Training average loss at step 1000: 1.527086
2021-11-08 04:54:14,571 INFO Training average regularization at step 1100: 0.783731
2021-11-08 04:54:14,571 INFO Training average positive_sample_loss at step 1100: 0.312674
2021-11-08 04:54:14,571 INFO Training average negative_sample_loss at step 1100: 0.966706
2021-11-08 04:54:14,571 INFO Training average loss at step 1100: 1.423422
2021-11-08 04:54:29,543 INFO Training average regularization at step 1200: 0.726847
2021-11-08 04:54:29,543 INFO Training average positive_sample_loss at step 1200: 0.310942
...................................................................................................
- However, before that, I ran "bash run.sh train RotatE wn18 0 0 512 1024 500 12.0 0.5 0.0001 80000 8 -de" (on the wn18 dataset) and again got "RuntimeError: CUDA out of memory". Could you please explain why the program sometimes fails with "RuntimeError: CUDA out of memory" but sometimes runs successfully with a different model or dataset? How would you debug this problem?
dl-box@DL-Box:~/Downloads/RotatE$ bash run.sh train RotatE wn18 0 0 512 1024 500 12.0 0.5 0.0001 80000 8 -de
1.10.0+cu102
Start Training......
2021-11-08 04:46:15,756 INFO Model: RotatE
2021-11-08 04:46:15,756 INFO Data Path: data/wn18
2021-11-08 04:46:15,757 INFO #entity: 40943
2021-11-08 04:46:15,757 INFO #relation: 18
2021-11-08 04:46:15,886 INFO #train: 141442
2021-11-08 04:46:15,890 INFO #valid: 5000
2021-11-08 04:46:15,894 INFO #test: 5000
2021-11-08 04:46:16,147 INFO Model Parameter Configuration:
2021-11-08 04:46:16,147 INFO Parameter gamma: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter embedding_range: torch.Size([1]), require_grad = False
2021-11-08 04:46:16,147 INFO Parameter entity_embedding: torch.Size([40943, 1000]), require_grad = True
2021-11-08 04:46:16,147 INFO Parameter relation_embedding: torch.Size([18, 500]), require_grad = True
2021-11-08 04:46:19,692 INFO Ramdomly Initializing RotatE Model...
2021-11-08 04:46:19,692 INFO Start Training...
2021-11-08 04:46:19,692 INFO init_step = 0
2021-11-08 04:46:19,692 INFO batch_size = 512
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = 1
2021-11-08 04:46:19,692 INFO hidden_dim = 500
2021-11-08 04:46:19,692 INFO gamma = 12.000000
2021-11-08 04:46:19,692 INFO negative_adversarial_sampling = True
2021-11-08 04:46:19,692 INFO adversarial_temperature = 0.500000
2021-11-08 04:46:19,692 INFO learning_rate = 0
Traceback (most recent call last):
File "codes/run.py", line 361, in <module>
main(parse_args())
File "codes/run.py", line 305, in main
log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
File "/home/dl-box/Downloads/RotatE/codes/model.py", line 267, in train_step
negative_score = model((positive_sample, negative_sample), mode=mode)
File "/home/dl-box/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dl-box/Downloads/RotatE/codes/model.py", line 159, in forward
score = model_func[self.model_name](head, relation, tail, mode)
File "/home/dl-box/Downloads/RotatE/codes/model.py", line 225, in RotatE
score = score.norm(dim = 0)
File "/home/dl-box/.local/lib/python3.6/site-packages/torch/_tensor.py", line 442, in norm
return torch.norm(self, p, dim, keepdim, dtype=dtype)
File "/home/dl-box/.local/lib/python3.6/site-packages/torch/functional.py", line 1442, in norm
return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
RuntimeError: CUDA out of memory. Tried to allocate 1000.00 MiB (GPU 0; 10.92 GiB total capacity; 7.00 GiB already allocated; 22.62 MiB free; 7.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
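As a rough check on the allocation sizes reported in the two errors above, here is a minimal sketch that estimates the size of a single intermediate score tensor. It assumes float32 values (4 bytes each) and a tensor of shape [batch_size, negative_sample_size, hidden_dim], with an extra leading dimension of 2 when real and imaginary parts are stacked (suggested by `score.norm(dim = 0)` in the traceback, but still an assumption). The helper `tensor_mib` is hypothetical and not part of the RotatE code.

```python
def tensor_mib(*shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**20

# wn18 run: batch_size=512, negative_sample_size=1024, hidden_dim=500
# (assumed shape of one intermediate tensor: [512, 1024, 500])
wn18_mib = tensor_mib(512, 1024, 500)

# FB15k run: batch_size=1024, negative_sample_size=256, hidden_dim=1000
# (assumed shape with stacked real/imaginary parts: [2, 1024, 256, 1000])
fb15k_mib = tensor_mib(2, 1024, 256, 1000)

print(f"wn18:  {wn18_mib:.2f} MiB")
print(f"FB15k: {fb15k_mib:.2f} MiB = {fb15k_mib / 1024:.2f} GiB")
```

Under these assumptions the estimates come out to 1000 MiB and about 1.95 GiB, which matches the "Tried to allocate" figures in the two error messages. If that reading is right, each such tensor scales with batch_size × negative_sample_size × hidden_dim, and several of them are alive at once during the forward and backward pass, so the wn18 run (512 × 1024 × 500) needs as much memory per tensor as the FB15k run (1024 × 256 × 1000) even though its batch size is smaller.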