
Comments (36)

amosella avatar amosella commented on August 23, 2024 1

Sure! I launched one experiment with 0 workers. Tomorrow I will come back with the results.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Hi, first of all I'm sorry about the delay, somehow I haven't received any notification from github.

  • Sydney Urban Objects: Indeed, I introduced a bug recently when removing the dependency on PCL. I have added the missing cast to pointcloud_dataset.py. Please pull the change and try it again; it should be working now. Thanks for reporting.
  • CUDA_ERROR_ILLEGAL_ADDRESS: That's very weird, and thank you so much for spending time on it. It shouldn't be a memory problem (12 GB is enough and also what I have), I would also rule out open3d (that's just used to read point clouds), and I guess it also shouldn't be a problem in pytorch itself... Could you perhaps try to run it like CUDA_LAUNCH_BLOCKING=1 python main.py ...? This should make it easier to pinpoint the problem, since by default the stack trace doesn't coincide with the actual crash (a minimal sketch of setting this from Python follows below). Thanks ahead!
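
The environment variable can also be set from inside Python; a minimal sketch (assumption: placed at the very top of main.py, before torch initializes CUDA), equivalent to prefixing the shell command with CUDA_LAUNCH_BLOCKING=1:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # force synchronous kernel launches

import torch  # imported only after the variable is set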

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Hi, sure, during this week I will try to reproduce the error. I will run several experiments with CUDA_LAUNCH_BLOCKING=1 and get back to you.

EDIT:

This is the log with CUDA_LAUNCH_BLOCKING=1:

File "main.py", line 315, in <module>
    main()
  File "main.py", line 217, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 148, in train
    outputs = model(inputs)
  File "/work/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/ecc/GraphConvModule.py", line 171, in forward
    return GraphConvFunction(self._in_channels, self._out_channels, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(input, weights)
  File "/work/code/ecc/ecc/GraphConvModule.py", line 67, in forward
    cuda_kernels.conv_aggregate_fw(output.narrow(0,startd,numd), products.view(-1,self._out_channels), self._degs_gpu.narrow(0,startd,numd))
  File "/work/code/ecc/ecc/cuda_kernels.py", line 122, in conv_aggregate_fw
    block=(CUDA_NUM_THREADS,1,1), grid=(GET_BLOCKS(w),n//blockDimY+1,1), stream=stream)
  File "cupy/cuda/function.pyx", line 147, in cupy.cuda.function.Function.__call__
  File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function._launch
  File "cupy/cuda/driver.pyx", line 195, in cupy.cuda.driver.launchKernel
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Thanks a lot! It's weird that it happens in the forward pass; there should still be enough memory available no matter what, especially since you've said you have the full 12 GB available. Just to check:

  • what is your version of pynvrtc and cupy?
  • what is your cuda version?
  • what command are you exactly running, the original one for ModelNet10?

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Hi,

  • CuPy version: 4.3, pynvrtc: 9.2
  • CUDA version: 8, cuDNN: 6
  • I am running the command provided in the documentation for ModelNet10, this one:

python main.py \
  --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 \
  --model_config 'i_1_2, c_16,b,r, c_32,b,r, m_2.5_7.5, c_32,b,r, c_32,b,r, m_7.5_22.5, c_64,b,r, m_1e10_1e10, f_64,b,r,d_0.2,f_10' \
  --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 \
  --nworkers 3 --edgecompaction 1 --odir results/modelnet10

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Thanks! I've upgraded to your version of cupy (by the way, it seems there is now a cleaner way to define custom kernels with https://docs-cupy.chainer.org/en/latest/reference/generated/cupy.RawKernel.html), so my setup should be the same as yours. I'm sorry, but I can't reproduce it; I haven't received the error during training. I have no idea, sorry :(. Perhaps using RawKernel and rewriting the pytorch functions in the modern style with the ctx parameter could help, but it's just a guess (a minimal RawKernel sketch is below)...
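
For reference, defining and launching a kernel via cupy.RawKernel looks roughly like the following. This is a minimal sketch with a toy elementwise-add kernel, not the actual ecc aggregation kernel, and it assumes a CuPy version that already ships RawKernel:

import cupy as cp

add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = x[i] + y[i];
}
''', 'my_add')

n = 1024
x = cp.arange(n, dtype=cp.float32)
y = cp.ones(n, dtype=cp.float32)
out = cp.empty(n, dtype=cp.float32)
add_kernel(((n + 255) // 256,), (256,), (x, y, out, cp.int32(n)))  # grid, block, kernel args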

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Thanks! The error is completely random; it does not always appear. Are you also using CUDA 8 and cuDNN 6? Can you tell me which NVIDIA driver you are using?

And my last question is just out of curiosity: why did you choose cupy instead of pytorch's mechanism for writing custom CUDA extensions? Is there a technical reason?

I suspect that the error is related to pytorch's GPU memory management. As you know, pytorch uses a caching memory allocator to speed up memory allocations; this allows fast deallocation without device synchronizations, but the unused memory held by the allocator still shows up as used to other applications. Maybe that conflicts with the code executed by cupy (a small probe for this idea is sketched below). What do you think? But I do not understand why it only happens on my setup and not on yours...
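
A speculative probe of that suspicion (a minimal sketch assuming PyTorch >= 0.4; it only tests the interaction, it is not a known fix): release the allocator's cached-but-unused blocks back to the driver right before the cupy kernels run.

import torch

torch.cuda.synchronize()   # make sure no pytorch kernels are still in flight
torch.cuda.empty_cache()   # frees cached blocks only; live tensors are untouched
# ... the cupy kernel launch (e.g. cuda_kernels.conv_aggregate_fw(...)) would follow here ...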

from ecc.

mys007 avatar mys007 commented on August 23, 2024

My driver is 384.130, cuda 8.0.61, cudnn 6021. But I'm unable to shuffle around with drivers and cuda, as I'm using a shared computer.

Thanks for the tip. The support for extensions in pytorch seems to have improved a lot; in particular, there is even a JIT solution now! The reason I went the cupy way about 1.5 years ago was that the pytorch way was more rudimentary and supported explicit compilation only. I think my current code could surely be rewritten to use just pytorch/JIT (a minimal sketch of the JIT loader is below). I might give it a quick shot in the next few days...
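
A minimal sketch of the JIT route via torch.utils.cpp_extension.load (assumption: the module name and source paths here are hypothetical, not the actual ecc layout; requires PyTorch >= 0.4):

from torch.utils.cpp_extension import load

cuda_kernels = load(
    name='ecc_cuda_kernels',                        # hypothetical module name
    sources=['ecc/kernels.cpp', 'ecc/kernels.cu'],  # hypothetical source files
    verbose=True)

# The returned module exposes whatever functions the C++ file binds,
# e.g. cuda_kernels.conv_aggregate_fw(...).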

Regarding the interactions between pytorch and cupy, who knows... but I would assume they both use some standard cuda allocation calls in the end so it should not happen that memory gets assigned twice. But removing the dependency on cupy would sort it out anyway.
Another explanation would be that a physical part of your GPU memory is somehow corrupted - but I assume you otherwise have no problems training other large networks, do you?

from ecc.

dhorka avatar dhorka commented on August 23, 2024

I tested with different GPUs in order to rule out a hardware error, and I had the same issue. Regarding the adaptation to a pytorch extension, if I can help in any way, tell me. Btw, another interesting thing about this adaptation is that it would allow multi-GPU training, and therefore a bigger batch_size :) Thanks for your time!

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Today I tried to adapt your code to use a pytorch extension. Here you can find the modified files from my first try: https://github.com/dhorka/ecc_cuda_extension. I am not using JIT; you need to compile the kernel using the provided setup.py (a minimal sketch of such a setup.py is below). At the moment the code fails at run time with a segmentation fault and I was not able to figure out what is going on, but maybe the skeleton can help you. I will check it again later. Thanks.
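
For reference, an ahead-of-time build script of this kind typically looks as follows. This is a minimal sketch; the module and file names are hypothetical and do not necessarily match the ecc_cuda_extension repository:

from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name='ecc_cuda_kernels',
    ext_modules=[
        CUDAExtension('ecc_cuda_kernels',
                      sources=['ecc_kernels.cpp', 'ecc_kernels_cuda.cu']),
    ],
    cmdclass={'build_ext': BuildExtension})

# Built with e.g.: python setup.py install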

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Thanks for your hard work on this, it was definitely a great starting point! I've fixed your code (the main problems were that not all types were supposed to be floats and that the grid-block parameters were not right), written it as JIT, added backward aggregation and ported the code to pytorch 0.4. It's in the branch https://github.com/mys007/ecc/tree/pytorch4_cuda_extensions . Could you perhaps try to run it on your machine with pytorch 0.4.1 and see if it works now? If CUDA_ERROR_ILLEGAL_ADDRESS appears only with the other kernels, we might be on the right track. One weird thing is that the training now takes all GPU memory (12 GB) instead of about 8 GB with pytorch 0.3 and cupy, but whatever :).

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Hi mys, thanks also for your work on this issue!! I just launched 2 training processes to be sure that the issue has disappeared :) On Monday I will tell you the results. Regarding GPU consumption: if you are looking at nvidia-smi, we cannot be sure that it is the real consumption, because pytorch uses a caching memory allocator to speed up memory allocations, and the unused memory managed by the allocator still shows up as used in nvidia-smi. To check which memory is really used, we need to use some of the methods provided by pytorch, like max_memory_allocated() (see the sketch below). If I do not hit the illegal address issue on Monday, I can check the memory used.
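
A minimal sketch of reading the allocator's own statistics (assuming PyTorch >= 0.4), which reflect real tensor usage better than the number nvidia-smi reports:

import torch

MB = 1024 ** 2
print('currently allocated: %.0f MB' % (torch.cuda.memory_allocated() / MB))
print('peak allocated:      %.0f MB' % (torch.cuda.max_memory_allocated() / MB))
print('cached by allocator: %.0f MB' % (torch.cuda.memory_cached() / MB))  # roughly what nvidia-smi shows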

from ecc.

mys007 avatar mys007 commented on August 23, 2024

just launched 2 training processes to be sure that the issue has disappeared

Thanks. But it may still crash in the other kernels (pooling), perhaps I should have ported all of them when I was at it... Can you please run the processes with CUDA_LAUNCH_BLOCKING=1 so that one can get the right stack trace?

nvidia-smi, we cannot be sure that it is the real consumption,

Indeed, but I thought this had been a feature of pytorch from the beginning; maybe they have changed the laziness of deallocations...

from ecc.

dhorka avatar dhorka commented on August 23, 2024

Thanks. But it may still crash in the other kernels (pooling), perhaps I should have ported all of them when I was at it... Can you please run the processes with CUDA_LAUNCH_BLOCKING=1 so that one can get the right stack trace?

Sure, I re-ran the experiment with CUDA_LAUNCH_BLOCKING=1; in parallel, I will also try to port the other kernels, using your ported kernels as an example.

Indeed, but I thought this had been a feature of pytorch from the beginning; maybe they have changed the laziness of deallocations...

I am not sure what is happening with the GPU memory: as far as I saw, when I launch the experiments (with cupy), at the beginning of training the GPU memory consumption is more or less 8 GB, but in later epochs I sometimes see 4 GB and other times 12 GB...

from ecc.

dhorka avatar dhorka commented on August 23, 2024

All kernels are ported; you can find them here: https://github.com/dhorka/ecc_cuda_extension. I was not able to test whether all the kernels work properly at runtime (at this moment I do not have any GPU available), but at least the compilation is working.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Wow, what a great effort! Let's wait for the result of your jobs and if it's good, I can merge & clean up everything.

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi Mys,
I'm Dhorka; this is my main account. (I was not able to post with this account because it was flagged several times; GitHub's automated security mechanisms were incorrectly triggered, but that now seems to be solved.) I have some results. First of all, the kernels that I ported yesterday are working; at least at this moment some experiments are running without errors. On the other hand, the experiments that I ran yesterday with only the convolution kernels ported to pytorch 0.4 crashed with the following error:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

It seems like the error is cupy related, right? At this moment I have two more experiments running with all the kernels ported. I will let you know about the results when the experiments finish.

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi,

It seems it is not related to cupy... Below you can see the error output of one of the experiments with all the kernels ported to pytorch 0.4.1:

Epoch 166/175 (results/modelnet10_all_cuda_kernels):
 48%|█████████████████████████████                                | 119/250 [02:00<02:33,  1.17s/it]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Damn, that's really frustrating. I guess it must be some bug in the kernels that manifests itself only under some rare condition of the input data. Could you perhaps run the training with --nworkers 0, which will be slow but should be deterministic? I will run the same (for now with half cupy, half extension).

from ecc.

amosella avatar amosella commented on August 23, 2024

Well, I got some results... to be honest, this is starting to get weird... I got the error in epoch 7, as you can see in the following trace:

Epoch 7/175 (results/modelnet10_all_cuda_kernels_w0):
 66%|████████████████████████████████████████                     | 164/250 [05:09<03:21,  2.35s/it]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21

This is the result of this command:

python main.py --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 --model_config 'i_1_2, c_16,b,r, c_32,b,r, m_2.5_7.5, c_32,b,r, c_32,b,r, m_7.5_22.5, c_64,b,r, m_1e10_1e10, f_64,b,r,d_0.2,f_10' --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 --nworkers 0 --edgecompaction 1 --odir results/modelnet10_all_cuda_kernels_w0

After that I tried to resume the experiment in order to check whether I could reproduce the error, but... after resuming, the training continued without problems... This is the command I used to resume the experiment:

python main.py --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 --resume results/modelnet10_all_cuda_kernels_w0/model.pth.tar --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 --nworkers 0 --edgecompaction 1 --odir results/modelnet10_all_cuda_kernels_w0_resume

Now I am thinking of running an experiment forcing the seed of the data loader and also setting cuDNN to deterministic mode.
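
A minimal sketch of forcing determinism this way (assuming standard pytorch/numpy APIs; the seed value is arbitrary):

import random
import numpy as np
import torch

seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False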

EDIT:
I hadn't realized that the DataLoader in your code is declared inside the epoch loop. Is the reason for that to reshuffle in each epoch, or something like that? Normally, in the code I have seen, this declaration sits outside the epoch loop, because as far as I know the per-epoch shuffle is done by the DataLoader without it needing to be re-initialized each epoch.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Thanks for your report. Is the crash reproducible on your side, meaning that if you rerun the training from scratch (the first command line above), will it break during epoch 7 again? In my case, I received no crash :(.

Resuming will not produce the same results as training straight through without resuming, because the states of the random generators are not saved/restored (too complicated). But data loading should basically start again and then crash in epoch 14; weird that it didn't.

Now I am thinking of running an experiment forcing the seed of the data loader and also setting cuDNN to deterministic mode.

If nworkers=0, the worker runs in the same thread as the main function, which is seeded in the seed() call. Any non-determinism in activations/weight updates should not matter because the weights don't affect the control flow. Besides, graph convolution is not deterministic anyway, due to the aggregation functions.

I hadn't realized that the DataLoader in your code is declared inside the epoch loop. Is the reason for that to reshuffle in each epoch?

I think it's because DataLoaders are not infinite (they raise StopIteration), so I have to restart them - isn't that the case? But anyway, with nworkers=0 it shouldn't matter...

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi Mys,

Thanks for your report. Is the crash reproducible on your side, meaning that if you rerun the training from scratch (the first command line above), will it break during epoch 7 again? In my case, I received no crash :(.

I launched two experiments and it always crashes at iteration 164 of epoch 7. As far as I can see, it is reproducible on my side.

Resuming will not produce the same results as training straight through without resuming, because the states of the random generators are not saved/restored (too complicated). But data loading should basically start again and then crash in epoch 14; weird that it didn't.

Yep, it is weird... I do not understand what is different after the resume...

If nworkers=0, the worker runs in the same thread as the main function, which is seeded in the seed() call. Any non-determinism in activations/weight updates should not matter because the weights don't affect the control flow. Besides, graph convolution is not deterministic anyway, due to the aggregation functions.

I understand. Thanks for the explanation!

I think it's because DataLoaders are not infinite (they raise StopIteration), so I have to restart them - isn't that the case? But anyway, with nworkers=0 it shouldn't matter...

As far as I know (and I also tested it), you do not need to restart them; a DataLoader handles the successive epochs itself (a minimal sketch is below).
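
A minimal sketch of reusing a single DataLoader across epochs (assumption: a toy dataset, not the ecc loader); each pass over it in a for loop is a fresh, reshuffled epoch:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

for epoch in range(3):
    for inputs, targets in loader:  # reshuffled every epoch, no re-construction needed
        pass                        # training step would go here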

To sum up, I think I can reproduce the error. Maybe I can debug on my side, following your instructions.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

I launched two experiments and it always crashes at iteration 164 of epoch 7.

Great news! Could you please pickle (inputs, targets, GIs, PIs) in ecc/main.py, line 136 (commit 8fbc901), and make it available for me to download? Then I can look at the particular case :). I think the easiest implementation is just to keep pickling; when it crashes, the offending batch will have been stored.

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi Mys,

Done! Here you can find the file.

The code used to pickle (inputs, targets, GIs, PIs), placed at line 136 of the main file, is this one:

torch.save({'inputs': inputs, 'targets': targets, 'GIs': GIs, 'PIs': PIs}, os.path.join(args.odir, 'inputs_targets_GIs_PIs.pth.tar'))

I ran the experiment with cuDNN in deterministic mode (I forgot to disable it), but it doesn't matter; the error is the same without deterministic mode.

EDIT:
Here you can find another file, generated with 3 workers (w=3). Also, I was trying to reproduce the error with the Sydney dataset, but with Sydney I do not get this error...

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Hi, thanks a lot... but when I load the batch on my computer (from either of your files) so that each training iteration runs on it, I get no crash (with pytorch 0.4.1). I'm sorry, but I just think that resolving this issue is beyond my powers :(.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Hi, I was wondering: if you're in a very experimental mood, could you try to run https://github.com/mys007/ecc/tree/crazy_fix with pytorch 0.3? There is just one extra line which touches dest. I remembered some other project where a similar hack helped to "resolve" a crashing kernel by probably reordering something...;)

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi Mys,

Sorry for not answering your last comment; I have a cold and was not able to check my e-mail. Yes, of course I will try it. I would also like to check whether I can reproduce the error on my setup with the files that I sent you, because I sent them to you but did not try to reproduce the error myself; I was thinking that maybe it is something related to the state of the RNG... But anyway, this weekend I will test your fix and also try to reproduce the error again. Thanks for your dedication!

from ecc.

amosella avatar amosella commented on August 23, 2024

Hi Mys,
I tried your fix and it doesn't work :(. On the other hand, you are right: with the files that I sent you, if I resume using them, I get no errors... It is weird... I also tried to save all the RNG states in order to reproduce the error, but... I get no errors when I resume... I do not know, but I think that after all the things we have done we can assume that the error is due to something in my setup, maybe the NVIDIA drivers or something like that.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Damn, but thanks a lot. Well, actually, one other user has contacted me by email with the same issue in the meantime (though on Sydney; CUDA 9.1, TITAN X Pascal). It's so difficult to debug. Perhaps I could rewrite the whole aggregation with sparse matrix operations, but I need to have a look at what the current support in pytorch is (a rough sketch of the idea is below).
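
A rough sketch of that idea (assuming PyTorch >= 1.0 and hypothetical tensor names): express the forward aggregation as a sparse [num_nodes x num_edges] averaging matrix multiplied with the per-edge filter outputs, instead of a custom kernel.

import torch

num_nodes, num_edges, channels = 4, 6, 8
dest = torch.tensor([0, 0, 1, 2, 2, 2])      # destination node of each edge (hypothetical)
products = torch.randn(num_edges, channels)  # per-edge filter outputs (hypothetical)

deg = torch.zeros(num_nodes).index_add_(0, dest, torch.ones(num_edges))
vals = 1.0 / deg[dest]                       # weight 1/deg(d) for every edge ending in node d
idx = torch.stack([dest, torch.arange(num_edges)])
A = torch.sparse_coo_tensor(idx, vals, (num_nodes, num_edges))

aggregated = torch.sparse.mm(A, products)    # [num_nodes x channels], mean over incoming edges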

from ecc.

amosella avatar amosella commented on August 23, 2024

If I can do anything else, do not hesitate to ask :) By the way, on my side I ran several experiments using the Sydney dataset without errors...

from ecc.

ShengyuH avatar ShengyuH commented on August 23, 2024

Hi all, I have the same issue, quite randomly. I'm now using SPG to benchmark on ScanNet; I think they just adopt the code from your master branch. I'm now trying your pytorch4_cuda_extensions branch. I use CUDA 10.2, driver version 430.26, a 1080 Ti, cupy-cuda100 6.3.0, and pytorch 1.2.0. I will get back to you later.

update:
File "train.py", line 132, in train
loss.backward()
File "/scratch/shengyu/anaconda/envs/venv_spg/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/scratch/shengyu/anaconda/envs/venv_spg/lib/python3.7/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 9.98 GiB (GPU 0; 10.91 GiB total capacity; 185.49 MiB already allocated; 9.98 GiB free; 6.51 MiB cached)

The error occurred again. Orz

from ecc.

mys007 avatar mys007 commented on August 23, 2024

@HenrryBryant Thanks for your report and thanks for the effort of trying out the experimental branch. I'm sorry that the problem has not been solved. Although the new error message is "CUDA out of memory" rather than "CUDA_ERROR_ILLEGAL_ADDRESS"?

from ecc.

ShengyuH avatar ShengyuH commented on August 23, 2024

@HenrryBryant Thanks for your report and thanks for the effort of trying out the experimental branch. I'm sorry that the problem has not been solved. Although the new error message is "CUDA out of memory" rather than "CUDA_ERROR_ILLEGAL_ADDRESS"?

Hi Martin, thank you for your quick reply. Actually, these two errors just randomly take turns occurring. I will try with CUDA_LAUNCH_BLOCKING; otherwise I will just have to turn to Pytorch_Geometric.

btw, I really like several works from you and Loic, they are really beautiful.

from ecc.

mys007 avatar mys007 commented on August 23, 2024

Well, what I meant is that "CUDA out of memory" might not be a bug but rather an actual out-of-memory condition. Is the GPU completely free before running the code? Otherwise, you can try decreasing the batch size, just to be sure. And yeah, I wish I could really do a rewrite in Pytorch_Geometric one day!

from ecc.

ShengyuH avatar ShengyuH commented on August 23, 2024

Hi, if you have the same problem dhorka mentioned and you are using a DataLoader, please consider replacing your DataLoader object with a plain for loop (though this makes it difficult to run on multiple GPUs); this has worked magically for me so far (a minimal sketch is below). Btw, you may also replace tensor with Tensor in lines 225 & 227 of GraphConvModule.py; I'm not sure whether this also contributes to finally fixing this random bug, but my supervisor told me it could also cause weird memory leak problems. I use CUDA 10.2, driver version 430.26, and PyTorch 1.2.0 installed via pip.
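
A minimal sketch of the plain-loop replacement (assumption: a generic map-style dataset with __len__/__getitem__ and a collate function; not the actual SPG/ecc loader):

import torch

def iterate_epoch(dataset, collate_fn, batch_size=16, shuffle=True):
    order = torch.randperm(len(dataset)) if shuffle else torch.arange(len(dataset))
    for start in range(0, len(dataset), batch_size):
        samples = [dataset[int(i)] for i in order[start:start + batch_size]]
        yield collate_fn(samples)  # same role as DataLoader's collate_fn, just single-threaded

# usage: for epoch in range(num_epochs): for batch in iterate_epoch(train_dataset, my_collate): ...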

from ecc.

mys007 avatar mys007 commented on August 23, 2024

@HenrryBryant Thanks for the investigations.

That's an interesting note about the DataLoader, but I believe this is just a random workaround that changes and spreads out the timing of the kernel runs (since there is no parallel loading, which also means the training must be very slow). I guess one could achieve the same effect by setting the number of workers to 0 or 1?

The tensor vs Tensor syntax was introduced in Pytorch 0.4, so it shouldn't matter for the original setting (with Pytorch 0.3). But frankly, I'm really amazed the code works with 1.2.0!

Nevertheless, I think the out of memory issue and the illegal address crash are two different things.

from ecc.
