The crosse from wencolani

OOM Problems

Hi, I am training CrossE on FB15K and running into out-of-memory (OOM) errors.
I am using a Tesla P100 GPU with 16 GB of memory.

The issue seems to be related to the batch size: it only occurs when I use a batch size greater than about 2500.

If I reduce the batch size to 2000, training runs. It still logs some OOM errors here and there, but since training does not stop, I assume TensorFlow handles the situation under the hood:

totalMemory: 15.90GiB freeMemory: 3.12GiB
2019-07-31 12:26:55.290114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-31 12:26:55.291419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 12:26:55.291432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-07-31 12:26:55.291437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-07-31 12:26:55.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7814 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:05:00.0, compute capability: 6.0)
initializing raw training data...
raw training data initialized.
2019-07-31 12:26:59.036369: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.63G (8194432512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.037461: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.87G (7374989312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.038501: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.18G (6637490176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.039536: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.56G (5973741056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.040558: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.01G (5376366592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.041597: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.51G (4838729728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.042615: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.06G (4354856448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.043631: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.65G (3919370752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.044693: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.29G (3527433472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 12:26:59.128179: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[100 sec](186000/483142) : 0.38 -- loss : 10.61627 rloss: 0.00032
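The failed requests in the log above shrink by a constant factor. The short Python sketch below is only a log-reading aid (not part of CrossE): it extracts the byte counts and computes the ratio between consecutive requests, and every ratio comes out at about 0.90. My reading, which I have not checked against this TensorFlow version's source, is that this is the allocator retrying with a roughly 10% smaller request after each failure while it sets up its memory pool, rather than CrossE asking for a single 7.63G tensor. The helper names are hypothetical.

```python
import re

# Hypothetical helper: matches lines like
# "failed to allocate 7.63G (8194432512 bytes) from device: ..."
ALLOC_RE = re.compile(r"failed to allocate \S+ \((\d+) bytes\)")


def failed_allocation_sizes(log_text):
    """Return the byte counts of every failed CUDA allocation, in log order."""
    return [int(m.group(1)) for m in ALLOC_RE.finditer(log_text)]


def successive_ratios(sizes):
    """Ratio of each failed request to the request just before it."""
    return [b / a for a, b in zip(sizes, sizes[1:])]


# Excerpt of the log above (timestamps and file names trimmed).
log_excerpt = """
failed to allocate 7.63G (8194432512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 6.87G (7374989312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 6.18G (6637490176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 5.56G (5973741056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 5.01G (5376366592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 4.51G (4838729728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 4.06G (4354856448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 3.65G (3919370752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
failed to allocate 3.29G (3527433472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
"""

sizes = failed_allocation_sizes(log_excerpt)
ratios = successive_ratios(sizes)
# Interpretation (assumption): a ratio near 0.9 suggests a back-off retry loop.
print(len(sizes), [round(r, 3) for r in ratios])
```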

The strange thing is that, according to nvidia-smi, CrossE seems to allocate almost all of the GPU memory (15278MiB of 16280MiB), yet the python3 process itself is listed as using only 2479MiB:

| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   36C    P0    41W / 300W |  15278MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2748      C   python3                                     2479MiB |
+-----------------------------------------------------------------------------+
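The two numbers come from different parts of the nvidia-smi output: the header row is device-wide, while the Processes table lists per-process usage. The startup log above also reports only 3.12GiB free out of 15.90GiB before training starts, so it may be worth checking how much memory is not attributed to any listed process. nvidia-smi can print both figures as CSV with `--query-gpu=memory.used,memory.total --format=csv,noheader,nounits` and `--query-compute-apps=pid,used_memory --format=csv,noheader,nounits`. The sketch below parses that kind of output, using the values from the table above as sample input; `parse_mib_csv` is a hypothetical helper.

```python
def parse_mib_csv(text):
    """Parse nvidia-smi `--format=csv,noheader,nounits` output into integer rows."""
    return [
        [int(field) for field in line.split(",")]  # int() tolerates surrounding spaces
        for line in text.strip().splitlines()
        if line.strip()
    ]


# Sample input: values copied from the nvidia-smi table above.
gpu_csv = "15278, 16280"  # memory.used, memory.total (MiB)
apps_csv = "2748, 2479"   # pid, used_memory (MiB), one line per process

device_used, device_total = parse_mib_csv(gpu_csv)[0]
process_used = sum(used for _pid, used in parse_mib_csv(apps_csv))
unattributed = device_used - process_used

print(
    f"device: {device_used}/{device_total} MiB, "
    f"listed processes: {process_used} MiB, "
    f"not attributed to a listed process: {unattributed} MiB"
)
```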

Unfortunately, I cannot simply use a batch size of 2000: I need to replicate your results, and I suspect a smaller batch size will lead to worse performance.
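If the concern is that a smaller batch changes the results, one generic option (not something the CrossE code is shown to do in this thread) is gradient accumulation: compute gradients on several smaller micro-batches and apply their average as a single update, so the effective batch size stays the same while peak memory drops. The toy sketch below uses a one-parameter squared-error model as a stand-in for the real loss (all names are hypothetical) and checks that the averaged micro-batch gradient matches the full-batch gradient.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y ~ w * x (toy stand-in loss)."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradients of equal-sized micro-batches.

    For a loss that is a mean over examples, this equals the full-batch
    gradient, provided len(batch) is a multiple of micro_batch_size.
    """
    if len(batch) % micro_batch_size:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    micro_batches = [
        batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)
    ]
    return sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)


data = [(float(x), 3.0 * x + 1.0) for x in range(12)]
full = grad_mse(0.5, data)                               # one big batch of 12
accum = accumulated_grad(0.5, data, micro_batch_size=4)  # three micro-batches of 4
print(full, accum)
```

The equivalence assumes a mean-over-examples loss and equal micro-batch sizes. A term computed once per step, such as the regularization loss that `rloss` presumably reports in the log, should be added once per accumulated update rather than once per micro-batch, and per-batch negative sampling, if used, would also behave differently.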

What environment did you use to train the model on FB15K (OS, TensorFlow version, CUDA driver, GPU)?
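To collect these details on either side, a small standard-library script like the one below prints the OS, Python version and installed package versions (it needs Python 3.8+ for `importlib.metadata`). The package list is an assumption: TensorFlow 1.x was often installed as `tensorflow-gpu`. The GPU model and driver version are already visible in the nvidia-smi header (Driver Version: 410.48 in the output above).

```python
import platform
from importlib import metadata


def environment_report(packages=("tensorflow", "tensorflow-gpu", "numpy")):
    """Collect OS, Python and package versions for a bug report.

    The default package names are an assumption; adjust them to your setup.
    """
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```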

Hello, I want to study your code more (request for CrossE paper code)

Hello, I'm Minho Lee.

First of all, I've read your 2019 CrossE paper. Thank you.
However, when I went through the code you uploaded to GitHub, I could not find the code that generates the explanations.
If you don't mind, could you upload the code for generating explanations?

Thank you.

Hardware Requirements

Hello, I'm studying your research and trying to extend your code for my Master's thesis.
I can't run the code locally because of RAM limits (16 GB). Could you share the minimum and recommended hardware requirements for running the project?
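A back-of-the-envelope estimate can help separate fixed memory costs from batch-dependent ones. The sketch below uses the standard FB15K statistics (14,951 entities and 1,345 relations; the 483,142 training triples match the progress counter in the OOM log). Everything else is an assumption: float32 values, an embedding dimension of 100 chosen only as an example, a parameter layout of entity embeddings, relation embeddings, relation-specific interaction embeddings and a bias vector (my reading of the CrossE paper), and a per-batch score matrix that scores each (h, r) pair against every entity.

```python
FB15K_ENTITIES = 14951
FB15K_RELATIONS = 1345
BYTES_PER_FLOAT32 = 4


def mib(num_values, bytes_per_value=BYTES_PER_FLOAT32):
    """Convert a count of values to MiB."""
    return num_values * bytes_per_value / 2**20


def estimate(dim, batch_size):
    """Rough memory estimate under the assumptions stated in the lead-in."""
    # Assumed layout: entity + relation + interaction embeddings, plus a bias of size dim.
    params = (FB15K_ENTITIES + 2 * FB15K_RELATIONS) * dim + dim
    # Assumed score matrix: each (h, r) in the batch scored against every entity.
    scores = batch_size * FB15K_ENTITIES
    return {"parameters_mib": mib(params), "score_matrix_mib": mib(scores)}


for batch_size in (2000, 2500):
    est = estimate(dim=100, batch_size=batch_size)
    print(batch_size, {k: round(v, 2) for k, v in est.items()})
```

Under these assumptions the embedding tables take only a few MiB and a 2500 × 14951 score matrix about 143 MiB, so the bulk of a 16 GB footprint would have to come from elsewhere; gradients, optimizer state and intermediate tensors are not counted here.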

Hi sir, could you help me with two questions?

What configuration are you using?
What is the approximate running time?
Thank you for your attention.
Looking forward to your reply.
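As one rough data point (not the author's configuration), the progress line in the OOM report above, recorded on a Tesla P100 with batch size 2000, can be extrapolated to a full pass over the training triples. The sketch below assumes progress is roughly linear in time and gives about 260 seconds per epoch; the total running time also depends on the number of epochs, which is not stated in this thread. `estimate_epoch_seconds` is a hypothetical helper.

```python
import re

# Matches the "[T sec](done/total)" prefix of the progress lines in the OOM log.
PROGRESS_RE = re.compile(r"\[(\d+) sec\]\((\d+)/(\d+)\)")


def estimate_epoch_seconds(progress_line):
    """Extrapolate one epoch's duration, assuming constant throughput."""
    match = PROGRESS_RE.search(progress_line)
    if match is None:
        raise ValueError("not a progress line")
    elapsed, done, total = (int(g) for g in match.groups())
    return elapsed * total / done


line = "[100 sec](186000/483142) : 0.38 -- loss : 10.61627 rloss: 0.00032"
print(f"~{estimate_epoch_seconds(line):.0f} s per epoch")
```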

How to use the trained model?

@wencolani I used the training file and trained the model on triples ( A , B --> C/D -> D/F ).

I have two questions about the trained model:

How to use the trained model?
How to visualize the triples as shown in the paper?

Waiting for your reply.
Thank you!
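As a rough illustration of using trained embeddings, the sketch below applies the scoring function described in the CrossE paper, as I understand it, f(h, r, t) = sigmoid(tanh(c_r ∘ h + c_r ∘ h ∘ r + b) · t) with ∘ the elementwise product, to rank candidate tails for a given (h, r). How the vectors are exported from the trained model is not covered in this thread; the function names, `entity_vectors` and the toy numbers are all hypothetical.

```python
import math


def crosse_score(h, r, c_r, b, t):
    """Score one triple: sigmoid(tanh(c_r*h + c_r*h*r + b) . t), elementwise products.

    Follows the published CrossE formula as I understand it (assumption).
    """
    q = [math.tanh(ci * hi + ci * hi * ri + bi) for hi, ri, ci, bi in zip(h, r, c_r, b)]
    logit = sum(qi * ti for qi, ti in zip(q, t))
    return 1.0 / (1.0 + math.exp(-logit))


def rank_tails(h, r, c_r, b, entity_vectors):
    """Return entity ids sorted from most to least plausible tail for (h, r)."""
    scores = {eid: crosse_score(h, r, c_r, b, vec) for eid, vec in entity_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional vectors (hypothetical, not from a trained model).
h, r = [0.5, -0.2], [1.0, 0.5]
c_r, b = [1.0, 1.0], [0.0, 0.0]
entity_vectors = {"e1": [1.0, -1.0], "e2": [-1.0, 1.0], "e3": [0.0, 0.0]}

print(rank_tails(h, r, c_r, b, entity_vectors))
```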

Question about generating explanations

Hi, could you help me with some questions about your code?
Your paper includes an experiment on generating explanations.
Is this experiment included in your code, and if so, where is it?

Thank you for your attention.
Wishing you happiness every day.
