Comments (10)
Could you provide details of the problem?
Thanks.
from crnn.pytorch.
batch_index = random_start + torch.range(0, self.batch_size - 1)
Traceback (most recent call last):
File "crnn_main.py", line 220, in <module>
cost = trainBatch(crnn, criterion, optimizer)
File "crnn_main.py", line 195, in trainBatch
preds = crnn(image)
File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/gruntdata/DL_dataset/steven.lzc/workspace/crnn.pytorch.armor/models/crnn.py", line 86, in forward
output = utils.data_parallel(self.rnn, conv, self.ngpu)
File "/gruntdata/DL_dataset/steven.lzc/workspace/crnn.pytorch.armor/models/utils.py", line 10, in data_parallel
output = nn.parallel.data_parallel(model, input, range(ngpu))
File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
outputs = parallel_apply(replicas, inputs, module_kwargs)
File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 46, in parallel_apply
raise output
RuntimeError: all tensors must be on devices[0]
I got this error when I used --ngpu 2.
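For anyone debugging the same message, it can help to check where the module's parameters and the input tensor actually live before the data-parallel call. The sketch below is a hypothetical helper (describe_devices is not part of crnn.pytorch), written against a recent PyTorch on Python 3 rather than the version in the traceback. It assumes the error means some tensor is not on the first GPU in the device list.

```python
import torch
import torch.nn as nn


def describe_devices(module, inp):
    """Report which devices hold the module's tensors and the input.

    Hypothetical debugging helper, not part of crnn.pytorch.
    Returns (sorted list of device names for parameters and buffers,
    device name of the input).
    """
    tensors = list(module.parameters()) + list(module.buffers())
    module_devices = sorted({str(t.device) for t in tensors})
    return module_devices, str(inp.device)


# Example with the first recurrent layer's shape from the model printout.
rnn = nn.LSTM(512, 256, bidirectional=True)
conv = torch.zeros(26, 4, 512)  # assumed layout: (sequence length, batch, features)
print(describe_devices(rnn, conv))
```

If the module devices and the input device differ, or neither is the first GPU passed to data_parallel, that mismatch is a likely place to look.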
from crnn.pytorch.
@Lzc6996 Have you solved this problem? Thanks.
from crnn.pytorch.
I also ran into this problem.
My OS is CentOS Linux release 7.2.1511 (Core), x86_64.
My server has four Tesla P40 GPU cards.
My command is:
python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_test_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 2 --keep_ratio --random_sample
The error message is:
Namespace(adadelta=False, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=2, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_test_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7224
CRNN (
(cnn): Sequential (
(conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu0): ReLU (inplace)
(pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu1): ReLU (inplace)
(pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(relu2): ReLU (inplace)
(conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu3): ReLU (inplace)
(pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu4): ReLU (inplace)
(conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu5): ReLU (inplace)
(pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
(batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu6): ReLU (inplace)
)
(rnn): Sequential (
(0): BidirectionalLSTM (
(rnn): LSTM(512, 256, bidirectional=True)
(embedding): Linear (512 -> 256)
)
(1): BidirectionalLSTM (
(rnn): LSTM(256, 256, bidirectional=True)
(embedding): Linear (512 -> 37)
)
)
)
Traceback (most recent call last):
File "crnn_main.py", line 197, in <module>
cost = trainBatch(crnn, criterion, optimizer)
File "crnn_main.py", line 180, in trainBatch
preds = crnn(image)
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/home/xuliang/CRNN_pytorch/crnn.pytorch/models/crnn.py", line 85, in forward
output = utils.data_parallel(self.rnn, conv, self.ngpu)
File "/home/xuliang/CRNN_pytorch/crnn.pytorch/models/utils.py", line 10, in data_parallel
output = nn.parallel.data_parallel(model, input, range(ngpu))
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
outputs = parallel_apply(replicas, inputs, module_kwargs)
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 46, in parallel_apply
raise output
RuntimeError: all tensors must be on devices[0]
How can I solve this problem?
from crnn.pytorch.
@XuLiangFRDC Have you tried the updated version?
from crnn.pytorch.
@meijieru I updated the code to the current version. Now a different problem appears.
The command is:
python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 4 --adadelta --keep_ratio --random_sample
The error message is:
Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=4, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7716
CRNN (
(cnn): Sequential (
(conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu0): ReLU (inplace)
(pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu1): ReLU (inplace)
(pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(relu2): ReLU (inplace)
(conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu3): ReLU (inplace)
(pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu4): ReLU (inplace)
(conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu5): ReLU (inplace)
(pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
(batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu6): ReLU (inplace)
)
(rnn): Sequential (
(0): BidirectionalLSTM (
(rnn): LSTM(512, 256, bidirectional=True)
(embedding): Linear (512 -> 256)
)
(1): BidirectionalLSTM (
(rnn): LSTM(256, 256, bidirectional=True)
(embedding): Linear (512 -> 37)
)
)
)
[0/25][500/112885] Loss: 16.009935
Start val
Traceback (most recent call last):
File "crnn_main.py", line 207, in <module>
val(crnn, test_dataset, criterion)
File "crnn_main.py", line 158, in val
sim_preds = converter.decode(preds.data, preds_size.data, raw=False)
File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/utils.py", line 51, in decode
t[index:index + l], torch.IntTensor([l]), raw=raw))
ValueError: result of slicing is an empty tensor
Thanks for your help.
from crnn.pytorch.
I have solved it. Changing
output = utils.data_parallel(self.rnn, conv, self.ngpu)
to
output = self.rnn(conv)
makes it work.
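The change above amounts to running the recurrent layers on a single device. A minimal sketch of a guarded variant follows, assuming a recent PyTorch on Python 3. The name run_rnn is hypothetical, and the fallback conditions (CUDA input, at least ngpu visible GPUs) and the choice of dim=1 are assumptions, not the repository's code.

```python
import torch
import torch.nn as nn


def run_rnn(rnn, conv, ngpu=1):
    """Apply `rnn` to `conv`, using data parallelism only when it looks safe.

    Hypothetical helper: falls back to a plain single-device call unless
    the input is a CUDA tensor and at least `ngpu` GPUs are visible.
    """
    use_parallel = (
        ngpu > 1
        and conv.is_cuda
        and torch.cuda.device_count() >= ngpu
    )
    if use_parallel:
        # Assumption: conv is laid out as (seq_len, batch, features),
        # so the batch is split along dim=1 rather than the default dim=0.
        return nn.parallel.data_parallel(
            rnn, conv, device_ids=list(range(ngpu)), dim=1
        )
    # Single-device path, equivalent to `output = self.rnn(conv)`.
    return rnn(conv)
```

On a CPU-only machine, or with ngpu=1, this reduces to the plain rnn(conv) call from the comment above.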
from crnn.pytorch.
@XuLiangFRDC Please open another issue, as this does not seem to be the same problem.
from crnn.pytorch.
@meijieru I have opened another issue. Please see:
#41
thanks!
from crnn.pytorch.
@Lzc6996 Could you please give more details about your solution? You wrote: "I have solve it. change output = utils.data_parallel(self.rnn, conv, self.ngpu) to output = self.rnn(conv) makes it work."
I modified the file (crnn.pytorch-master/models/crnn.py) as follows:
import utils
Replaced: output = self.rnn(conv)
with: output = utils.data_parallel(self.rnn, conv, self.ngpu)
I also added a new file, utils.py, to the directory crnn.pytorch-master/models.
The new file utils.py is defined as follows:
#!/usr/bin/python
# encoding: utf-8
import torch
import torch.nn as nn
import torch.nn.parallel


def data_parallel(model, input, ngpu):
    # Use data parallelism only for CUDA inputs when more than one GPU is requested.
    if isinstance(input.data, torch.cuda.FloatTensor) and ngpu > 1:
        output = nn.parallel.data_parallel(model, input, range(ngpu))
    else:
        output = model(input)
    return output
But it doesn't work. Why?
The command is:
python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --adadelta --keep_ratio --random_sample
The error message is:
Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=1, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 5088
CRNN (
(cnn): Sequential (
(conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu0): ReLU (inplace)
(pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu1): ReLU (inplace)
(pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(relu2): ReLU (inplace)
(conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu3): ReLU (inplace)
(pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu4): ReLU (inplace)
(conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(relu5): ReLU (inplace)
(pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
(conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
(batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(relu6): ReLU (inplace)
)
(rnn): Sequential (
(0): BidirectionalLSTM (
(rnn): LSTM(512, 256, bidirectional=True)
(embedding): Linear (512 -> 256)
)
(1): BidirectionalLSTM (
(rnn): LSTM(256, 256, bidirectional=True)
(embedding): Linear (512 -> 37)
)
)
)
Traceback (most recent call last):
File "crnn_main.py", line 197, in <module>
cost = trainBatch(crnn, criterion, optimizer)
File "crnn_main.py", line 180, in trainBatch
preds = crnn(image)
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/models/crnn.py", line 83, in forward
output = utils.data_parallel(self.rnn, conv, self.ngpu)
File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 238, in __getattr__
type(self).__name__, name))
AttributeError: 'CRNN' object has no attribute 'ngpu'
By the way, I changed the code so that the CRNN class includes the attribute 'ngpu', and this problem is solved.
But with multiple GPUs there is still a problem.
Please see:
#41
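The AttributeError above comes from forward() reading self.ngpu when the constructor never stored it. A minimal sketch of the fix described here, using a toy stand-in class (TinyCRNN and its layer sizes are hypothetical, not the repository's CRNN) and assuming a recent PyTorch on Python 3:

```python
import torch
import torch.nn as nn


class TinyCRNN(nn.Module):
    """Toy stand-in for the CRNN class; the layer sizes are arbitrary."""

    def __init__(self, n_in=8, n_out=3, ngpu=1):
        super().__init__()
        # Store ngpu on the instance so forward() can read self.ngpu.
        self.ngpu = ngpu
        self.rnn = nn.Linear(n_in, n_out)

    def forward(self, conv):
        # Only attempt data parallelism for CUDA inputs with ngpu > 1.
        if self.ngpu > 1 and conv.is_cuda:
            return nn.parallel.data_parallel(
                self.rnn, conv, device_ids=list(range(self.ngpu))
            )
        return self.rnn(conv)


model = TinyCRNN(ngpu=1)
output = model(torch.zeros(5, 2, 8))
```

This only addresses the missing attribute. It does not touch the multi-GPU problem tracked in #41.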
from crnn.pytorch.