Training can be set to use multiple GPUs, but validation then raises an error. Has this been solved yet?

Question for the experts: how do I set up multi-GPU training for DBNet? (about WenmuZhou/PytorchOCR, issue #104)

novioleo commented on April 20, 2024 1

The difference in final results is small; the difference in speed is fairly large. (Sent by email, replying to Hao Luo's message of Wednesday, October 28, 2020, 9:21 PM.)

from pytorchocr.

novioleo commented on April 20, 2024 1

There is no absolute number; it depends on too many factors. (Sent by email, replying to Hao Luo's message of Thursday, October 29, 2020, 12:25 AM.)

PanFei748 commented on April 20, 2024 1

> Multi-GPU training splits a single batch across the cards. [...]

What I mean is that I train with multiple GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 tools/det_train.py --config './config/det_train_db_config.py'. Training on the training set works fine, but the evaluation on the validation set (which also runs inside the train.py code) raises an error.

WenmuZhou commented on April 20, 2024

Why would validation need multiple GPUs?

novioleo commented on April 20, 2024

Why would validation need multiple GPUs?

novioleo commented on April 20, 2024

It feels like I've flaked again. I'll post a Triton tutorial later; it supports multi-GPU inference.

PanFei748 commented on April 20, 2024

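> Why would validation need multiple GPUs?

Hi, sorry, I didn't express myself clearly. What I mean is that I train with multiple GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 tools/det_train.py --config './config/det_train_db_config.py'. Training on the training set works fine, but the evaluation on the validation set (which also runs inside train.py) raises an error.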

PanFei748 commented on April 20, 2024

> It feels like I've flaked again. I'll post a Triton tutorial later; it supports multi-GPU inference.

Looking forward to it :)

lhao0301 commented on April 20, 2024

For multi-GPU training, does batchnorm vs. synchronized batch norm (syn_batch_norm) make a big difference to the final results?
@novioleo
@WenmuZhou

lhao0301 commented on April 20, 2024

Roughly how much slower is it, from experience? (Just curious.)

lhao0301 commented on April 20, 2024

What prevents the evaluation that runs in the middle of multi-GPU training from using multiple GPUs as well?

novioleo commented on April 20, 2024

Multi-GPU training splits a single batch across the cards. In theory validation can also run on multiple GPUs, but there is little point. You could run a different batch on each card, and wrapping the model in data parallel would work too, but it isn't necessary: the cards have to wait for one another, which wastes time. So a single GPU is already good enough. (Sent by email, replying to Hao Luo's message of Thursday, October 29, 2020, 10:43 AM.)

lhao0301 commented on April 20, 2024

> Multi-GPU training splits a single batch across the cards. [...]

To put it another way: when I run training on multiple GPUs (CUDA_VISIBLE_DEVICES=0,1 python3 tools/det_train.py --config ./config/det_train_db_config.py), it fails. The failure comes from the evaluation on the validation set running on multiple GPUs. I'd like to know why multi-GPU evaluation on the validation set fails. In theory, evaluation on multiple GPUs also splits batch_data and then merges the outputs, so it shouldn't raise an error.

# fail
run cmd: CUDA_VISIBLE_DEVICES=0,1 python3 tools/det_train.py --config ./config/det_train_db_config.py
# success
run cmd: CUDA_VISIBLE_DEVICES=0 python3 tools/det_train.py --config ./config/det_train_db_config.py

novioleo commented on April 20, 2024

@lhao0301 Please post the error log so I can take a look.

lhao0301 commented on April 20, 2024

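> @lhao0301 Please post the error log so I can take a look.

It's roughly as below (I don't have a free machine to reproduce it right now; I'll paste the full error the next time one frees up).

When forward() is called inside the evaluate() function: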
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
return self.gather(outputs, self.output_device)
  File "/disk1/huijie/anaconda2/envs/ocr_lite/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/disk1/huijie/anaconda2/envs/ocr_lite/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/disk1/huijie/anaconda2/envs/ocr_lite/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/disk1/huijie/anaconda2/envs/ocr_lite/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: expected sequence object with len >= 0 or a single integer


PanFei748 commented on April 20, 2024

> To put it another way: when I run training on multiple GPUs [...], it fails. [...]

Or set it up so that training uses multiple GPUs and evaluate uses just one card.

lhao0301 commented on April 20, 2024

> Or set it up so that training uses multiple GPUs and evaluate uses just one card.

Is that even possible? Roughly how would I set that up (change the CUDA_VISIBLE_DEVICES environment variable while the process is running)?

lhao0301 commented on April 20, 2024

> Or set it up so that training uses multiple GPUs and evaluate uses just one card.

Check the model type first, something like this?

if isinstance(model, torch.nn.DataParallel):
    model.device_ids = [0]
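One way to evaluate on a single card while training stays multi-GPU, sketched under assumptions: nn.DataParallel keeps the wrapped network in its `module` attribute, so the evaluation step can call that network directly and skip the scatter/gather path in which the traceback in this thread occurs. The helper name `unwrap_for_eval` and the toy `nn.Linear` model are hypothetical and are not taken from PytorchOCR.

```python
import torch
from torch import nn


def unwrap_for_eval(model: nn.Module) -> nn.Module:
    """Return the wrapped network if `model` is an nn.DataParallel, else `model`."""
    return model.module if isinstance(model, nn.DataParallel) else model


# Toy stand-in for the detection network (hypothetical, not PytorchOCR code).
net = nn.Linear(4, 2)
model = nn.DataParallel(net)       # what the training loop would hold

eval_net = unwrap_for_eval(model)  # same weights, no scatter/gather
eval_net.eval()
device = next(eval_net.parameters()).device
with torch.no_grad():
    out = eval_net(torch.zeros(1, 4, device=device))

model.train()                      # switch back before the next training step
```

Because `eval_net` is the same object the wrapper trains, no weights need to be copied; only the call path differs.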

lhao0301 commented on April 20, 2024

> @lhao0301 Please post the error log so I can take a look.

Why does batch_size have to be 1 during validation?

PytorchOCR/config/det_train_db_config.py

Line 114 in 4a4f705

'batch_size': 1, # must be 1

novioleo commented on April 20, 2024

> It's roughly as below [...]
> TypeError: expected sequence object with len >= 0 or a single integer

Check your validation batch size.

lhao0301 commented on April 20, 2024

> Check your validation batch size.

I set the validation batch_size to 2 (two GPUs, so I changed it from 1 to 2).
(Before calling forward() inside evaluate, I checked the shape of data, and it looked normal.)

novioleo commented on April 20, 2024

> I set the validation batch_size to 2 (two GPUs, so I changed it from 1 to 2).

What error do you get after changing it to 2?

lhao0301 commented on April 20, 2024

> What error do you get after changing it to 2?

After changing it to 2, the error is the same traceback I posted earlier (TypeError: expected sequence object with len >= 0 or a single integer). I don't remember what the error was with batchsize=1; I think it was the same one. In theory, with two GPUs the batch size has to be 2 × the number of GPUs, doesn't it, otherwise the data can't be split? I can only reproduce it once the GPUs are free.
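On whether a small batch can be split at all: as far as I know, nn.DataParallel scatters its inputs along dimension 0 with torch.chunk-style splitting, so a batch smaller than the number of GPUs is not an error in itself; it simply yields fewer chunks than devices. The pure-Python sketch below mimics that rule. The function name `chunk_sizes` is hypothetical, and the sketch says nothing about what causes the TypeError in gather.

```python
import math
from typing import List


def chunk_sizes(batch_size: int, num_devices: int) -> List[int]:
    """Sizes of the per-device chunks under torch.chunk-style splitting."""
    if batch_size < 1 or num_devices < 1:
        raise ValueError("batch_size and num_devices must be positive")
    per_chunk = math.ceil(batch_size / num_devices)  # size of every chunk but the last
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(1, 2))  # [1]    -> only one of the two cards receives data
print(chunk_sizes(2, 2))  # [1, 1] -> one sample per card
print(chunk_sizes(5, 2))  # [3, 2] -> uneven split, last chunk is smaller
```

Under this rule a validation batch_size of 1 on two cards would run on a single replica, and a batch_size of 2 would give each card one sample.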

lhao0301 commented on April 20, 2024

> Why does batch_size have to be 1 during validation?
> 'batch_size': 1, # must be 1 (PytorchOCR/config/det_train_db_config.py, line 114 in 4a4f705)

Why is that? Is it because the images differ in size and validation uses short_resize, so batch_size has to equal 1? @novioleo

novioleo commented on April 20, 2024

> Why is that? Is it because the images differ in size and validation uses short_resize, so batch_size has to equal 1? @novioleo

Conclusion first: these two things are not related. It has to do with which resize is used.
As for the actual error, data_parallel is probably being used incorrectly, haha. We don't have a multi-GPU machine, so we never tested this. You can refer to the official data_parallel documentation and adjust the code accordingly. If I get access to a multi-GPU machine to test on, I can fix this bug.

lhao0301 commented on April 20, 2024

> As for the actual error, data_parallel is probably being used incorrectly [...]

My skills aren't up to it; I've been debugging for half a day and still haven't found the cause.

How do I set up multi-GPU training for DBNet? (pytorchocr issue, 27 comments, closed)
