
Comments (12)

syb7573330 commented on May 24, 2024

Hi,

Thanks for following our work. Yes, for this task, one GPU (an Nvidia Titan with 12 GB of memory) is not enough to hold the data at the predefined batch size, so we have to use two to train the model. But one thing you can try with a single GPU is to split the batch data into several smaller "mini_batches", and accumulate gradients across those mini_batches before updating the model's trainable variables. This will achieve the same result theoretically. Good luck!
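For readers wondering what that looks like in practice, here is a minimal, self-contained TensorFlow 1.x sketch of the gradient-accumulation idea. It is not the DGCNN training code; the toy model, the variable names, and the choice of 4 mini-batches are only illustrative.

import numpy as np
import tensorflow as tf

# Toy stand-in for the real model: a linear layer trained with mean-squared error.
x = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([3, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

opt = tf.train.AdamOptimizer(0.001)
tvars = tf.trainable_variables()

# Non-trainable buffers that accumulate gradients across mini-batches.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_ops = [a.assign(tf.zeros_like(a)) for a in accum]
grads = tf.gradients(loss, tvars)
accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]

n_mini = 4  # e.g. a batch of 24 split into 4 mini-batches of 6
apply_op = opt.apply_gradients([(a / n_mini, v) for a, v in zip(accum, tvars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(24, 3).astype(np.float32)
    batch_y = np.random.rand(24, 1).astype(np.float32)
    sess.run(zero_ops)                          # clear the accumulation buffers
    for i in range(n_mini):                     # forward/backward pass per mini-batch
        s = slice(i * 6, (i + 1) * 6)
        sess.run(accum_ops, {x: batch_x[s], y: batch_y[s]})
    sess.run(apply_op)                          # one update with the averaged gradients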


suan0365006 commented on May 24, 2024

Thank you for your reply. After changing the "Create a session" code as you suggested, I still hit the same problem:
InternalError (see above for traceback): Blas xGEMM launch failed : a.shape=[1,4096,3], b.shape=[1,3,4096], m=4096, n=4096, k=3
Do you have any other ideas? My CUDA and TensorFlow are fine, because I can run your classification code on ModelNet40 successfully.
When I run your code, the memory is:
totalMemory: 7.76GiB freeMemory: 7.13GiB

Hi,
Sorry, this problem doesn't seem easy to solve.
I suggest gathering more details about it and reporting the issue to the author.

In train.py I changed some parameters; I only have one 1050 Ti (4 GB) card.
My code (adapted from the DGCNN author's code):

parser = argparse.ArgumentParser()
parser.add_argument('--num_gpu', type=int, default=2, help='the number of GPUs to use [default: 2]')
parser.add_argument('--log_dir', default='log', help='Log dir [default: log]')
parser.add_argument('--num_point', type=int, default=4096, help='Point number [default: 4096]')
parser.add_argument('--max_epoch', type=int, default=200, help='Epoch to run [default: 50]')
parser.add_argument('--batch_size', type=int, default=2, help='Batch Size during training for each GPU [default: 24]')
parser.add_argument('--learning_rate', type=float, default=0.001, help='Initial learning rate [default: 0.001]')
parser.add_argument('--momentum', type=float, default=0.9, help='Initial learning rate [default: 0.9]')
parser.add_argument('--optimizer', default='adam', help='adam or momentum [default: adam]')
parser.add_argument('--decay_step', type=int, default=300000, help='Decay step for lr decay [default: 300000]')
parser.add_argument('--decay_rate', type=float, default=0.5, help='Decay rate for lr decay [default: 0.5]')
parser.add_argument('--test_area', type=int, default=6, help='Which area to use for test, option: 1-6 [default: 6]')
FLAGS = parser.parse_args()

It's worth mentioning the "number of GPUs to use" setting:
although I only have one 1050 Ti card, I had to keep it set to 2. It worked.

You can try this code, good luck.
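For reference, with the flags above a run on a single card would be launched with something like the following (the values are just the ones shown in the argparse block; adjust paths and sizes to your setup):

python train.py --num_gpu 2 --batch_size 2 --num_point 4096 --test_area 6 --log_dir log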


longmalongma commented on May 24, 2024

Thanks for following our work. Yes, for this task, one GPU (an Nvidia Titan with 12 GB of memory) is not enough to hold the data at the predefined batch size, so we have to use two to train the model. But one thing you can try with a single GPU is to split the batch data into several smaller "mini_batches", and accumulate gradients across those mini_batches before updating the model's trainable variables. This will achieve the same result theoretically. Good luck!

How do I split the batch data into several smaller "mini_batches"?


longmalongma commented on May 24, 2024

Thank you for your reply. Best wishes! Yongbin Sun [email protected] wrote on Friday, June 14, 2019 at 9:31 PM:

Hi, Thanks for following our work. Yes, for this task, one GPU (an Nvidia Titan with 12 GB of memory) is not enough to hold the data at the predefined batch size, so we have to use two to train the model. But one thing you can try with a single GPU is to split the batch data into several smaller "mini_batches", and accumulate gradients across those mini_batches before updating the model's trainable variables. This will achieve the same result theoretically. Good luck!

Hello, I am running into the same problem. How do I split the batch data into several smaller "mini_batches"?


suan0365006 commented on May 24, 2024


Hello,
Open sem_seg/train.py.
You will find batch_size at line 24; change it to the size you need.
Try it.
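For reference, that line is the argparse default for the --batch_size flag, so instead of editing the file you can also pass a smaller value on the command line. A sketch (the exact wording of the line may differ between versions):

parser.add_argument('--batch_size', type=int, default=24, help='Batch Size during training for each GPU [default: 24]')
# e.g. run with: python train.py --batch_size 4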


longmalongma commented on May 24, 2024


Thanks for your reply. I know that, but even when I set batch_size to 1 it does not work. My GPU is an RTX 2080 (8 GB), and I get this error:
InternalError (see above for traceback): Blas xGEMM launch failed : a.shape=[1,4096,3], b.shape=[1,3,4096], m=4096, n=4096, k=3
[[Node: MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ExpandDims_1, transpose)]]
[[Node: adj_conv2/bn/cond/add_1/_223 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_525_adj_conv2/bn/cond/add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Do you know the cause of this problem and how to solve it? Looking forward to your reply.


suan0365006 commented on May 24, 2024


Okay, it looks like something is wrong with the GPU (or with CUDA, TensorFlow, or Keras).
Maybe you can try changing the "Create a session" code to fix this problem.
My code is:
# Create a session
#config = tf.ConfigProto()
#config.gpu_options.allow_growth = True
#config.allow_soft_placement = True
#sess = tf.Session(config=config)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.allow_soft_placement = True
sess = tf.Session(config=config)
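For context, these are standard TensorFlow 1.x session options rather than anything DGCNN-specific: per_process_gpu_memory_fraction pre-allocates a fixed share of the GPU memory when the session is created, the commented-out allow_growth variant grows the allocation on demand instead, and allow_soft_placement lets operations without a GPU kernel fall back to the CPU.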

Hope this solves your problem.
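If the error comes back anyway, one way to narrow it down is to run the same-shaped batched matmul on its own, outside the DGCNN code, and check whether cuBLAS can launch it at all. A minimal TensorFlow 1.x sketch (the shapes are copied from the error message above):

import numpy as np
import tensorflow as tf

# Reproduce just the failing BatchMatMul: a.shape=[1,4096,3], b.shape=[1,3,4096]
a = tf.placeholder(tf.float32, [1, 4096, 3])
b = tf.placeholder(tf.float32, [1, 3, 4096])
c = tf.matmul(a, b)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # avoid grabbing all GPU memory up front
with tf.Session(config=config) as sess:
    out = sess.run(c, {a: np.random.rand(1, 4096, 3).astype(np.float32),
                       b: np.random.rand(1, 3, 4096).astype(np.float32)})
    print(out.shape)                          # (1, 4096, 4096) if the launch succeeds

If this small test also fails with "Blas xGEMM launch failed", the problem is in the CUDA/cuBLAS/driver setup (or the GPU memory is already claimed by another process) rather than in the batch size.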


longmalongma commented on May 24, 2024


Thank you for your reply. After I changed the "Create a session" code following your advice, I hit the same problem:
InternalError (see above for traceback): Blas xGEMM launch failed : a.shape=[1,4096,3], b.shape=[1,3,4096], m=4096, n=4096, k=3
Do you have any other ideas for solving this? My CUDA and TensorFlow are fine, because I can successfully run your classification code on ModelNet40.
When I run your code, the memory is:
totalMemory: 7.76GiB freeMemory: 7.13GiB


longmalongma commented on May 24, 2024


Thanks for your reply. I will try running the code again following your advice. Can you give me your contact information?


suan0365006 commented on May 24, 2024


Okay, e-mail: [email protected]


longmalongma commented on May 24, 2024


Hi,
I have been troubled by this problem for many days. Could you send me your whole train.py file, or the models you have trained?

