Need help for backward training (nn, CLOSED)

torch commented on August 17, 2024
Need help for backward training

from nn.

Comments (15)

soumith commented on August 17, 2024

@russellfei can you provide a small snippet of your model, along with the input tensor sizes?

Your network's math does not seem to work out; maybe you are providing a gradOutput that is too big.

russellfei commented on August 17, 2024

Thx~ @soumith

----------------------------------------------------------------------
function train()

   -- epoch tracker
   epoch = epoch or 1

   -- local vars
   local time = sys.clock()
   local batchSize = opt.batchSize

   -- create augmented dataset
   if opt.augment == 'false' then
      ----> added by r.f.
      local totalSize = trainData:size()
      -- shuffle at each epoch
      trsize = 1680
      local shuffle = torch.randperm(trsize):type('torch.LongTensor')
      -- BDHW mode
      local inputs = torch.Tensor(totalSize,nBands,height,width)
      local targets = torch.Tensor(totalSize):zero()
      -- shuffle input
      inputs = trainData.data:index(1,shuffle)
      targets = trainData.labels:index(1,shuffle)
      --print('targets of train data')
      --print(targets)
      -- do one epoch
      print('==> doing epoch on training data:')
      print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')

      for t = 1,totalSize,opt.batchSize do
         -- disp progress
         xlua.progress(t, totalSize)

         -- create mini batch
         -------------------------------------------------------
         -- the key for use CUDA lies in the support in torch lib
         -- not in table, as a result, this code will surely fail.
         -- need to add flag 'bmode' for handling with cudaconvnet api
         -- TBD
         ------------------------------------------------------------
         local input = inputs[t]
         local target = targets[t]
         -- evaluate function for complete mini batch
         ---> get all output at first ---------------
         --> error: input is not a floatTensor ???
         -- essential data format
         if opt.type == 'double' then input = input:double() end
         if opt.type == 'cuda' then input = input:cuda() end
         -- optimize on current mini-batch
         ------------------------------------------------------------
         -- optim function
         -- create closure to evaluate f(X) and df/dX
         local feval = function(x)

            --print('--> data preparation')
            local batchSize = opt.batchSize
            -- get new parameters
            if x ~= parameters then
               parameters:copy(x)
            end
            -- reset gradients
            gradParameters:zero()

            -- f is the average of all criterions
            local f = 0

            --print('---> forward propagation')
            local outputs = model:forward(input)
            outputs = outputs:float()
            ---> transfer to floatTensor to calculate
            -- calculate gradient matrix
            local df_do = torch.Tensor(outputs:size())

            --print('---> gradients accumulation')
            for i = 1,batchSize do
               -- estimate f
               local err = criterion:forward(outputs, target)
               f = f + err
               --print('add err 1')
               -- estimate df/dW
               -- split to calculate df_do
               df_do = criterion:backward(outputs,target)
               --print('---> backprop')
               -- do backwards together
               if opt.type == 'cuda' then
                  model:backward( input,df_do:cuda() )
               else
                  model:backward( input,df_do )
               end
               -- update confusion
               confusion:add(outputs, target)
            end


            -- normalize gradients and f(X)
            gradParameters:div( batchSize )
            f = f/batchSize
            -- check for convergence at 1st epoch
            -- if error doesn't decrease to less than half
            -- that model might be diverged.
            --print('err: ' .. (f))
            -- return f and df/dX
            return f,gradParameters
         end
         --print '------>start to optim'
         if optimMethod == optim.asgd then
            _,_,average = optimMethod(feval, parameters, optimState)
         else
            optimMethod(feval, parameters, optimState)
         end
      end
   else
      -- augmented inputs and targets
      -- store entire augment dataset needs 155G RAM
      -- do immediate augment as alternatives
      if opt.augment == 'true' then
         local bangIdx = 2640
         trsize = 1680
         local totalSize = bangIdx * trsize
         local shuffle = torch.randperm(trsize):type('torch.LongTensor')
         -- BDHW mode
         local in_inputs = torch.Tensor(trsize,nBands,height,width)
         local in_targets = torch.Tensor(trsize):zero()
         -- shuffle input
         in_inputs = trainData.data:index(1,shuffle)
         in_targets = trainData.labels:index(1,shuffle)
         -- augmented one image
         local inputs = torch.Tensor(bangIdx,nBands,height,width)
         local targets = torch.Tensor(bangIdx):zero()

         -- do one epoch
         print('==> doing epoch on training data:')
         print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')

         for t = 1,totalSize,opt.batchSize do
            -- disp progress
            xlua.progress(t, totalSize)

            -- augment first image
            if  (t-1) % bangIdx == 0 then
               -- originImageIndex: j
               local j = torch.ceil(t/bangIdx)
               inputs,targets = dataBang(in_inputs[j],in_targets[j])
            end
            -- create mini batch
            --print('==> map index')
            -- related idx for inputs (1-based, wraps within 1..bangIdx so it is never 0)
            p_idx = (t - 1) % bangIdx + 1
            --print('p idx = '..p_idx..', t = '..t)
            local input = inputs[p_idx]
            local target = targets[p_idx]

            -- essential data format
            if opt.type == 'double' then input = input:double() end
            if opt.type == 'cuda' then input = input:cuda() end
            ------------------------------------------------------------
            -- optim function
            -- create closure to evaluate f(X) and df/dX
            local feval = function(x)

               --print('--> data preparation')
               local batchSize = opt.batchSize
               -- get new parameters
               if x ~= parameters then
                  parameters:copy(x)
               end
               -- reset gradients
               gradParameters:zero()

               -- f is the average of all criterions
               local f = 0

               --print('---> forward propagation')
               local outputs = model:forward(input)
               outputs = outputs:float()
               ---> transfer to floatTensor to calculate
               -- calculate gradient matrix
               local df_do = torch.Tensor(outputs:size())

               --print('---> gradients accumulation')
               for i = 1,batchSize do
                  -- estimate f
                  local err = criterion:forward(outputs, target)
                  f = f + err
                  --print('add err 1')
                  -- estimate df/dW
                  -- split to calculate df_do
                  df_do = criterion:backward(outputs,target)
                  --print('---> backprop')
                  -- do backwards together
                  if opt.type == 'cuda' then
                     model:backward( input,df_do:cuda() )
                  else
                     model:backward( input,df_do )
                  end
                  -- update confusion
                  confusion:add(outputs, target)
               end


               -- normalize gradients and f(X)
               gradParameters:div( batchSize )
               f = f/batchSize
               -- check for convergence at 1st epoch
               -- if error doesn't decrease to less than half
               -- that model might be diverged.
               --print('err: ' .. (f))
               -- return f and df/dX
               return f,gradParameters
            end


            -- optimize on current mini-batch
            --print ('==> start to optim')
            if optimMethod == optim.asgd then
               _,_,average = optimMethod(feval, parameters, optimState)
            else
               optimMethod(feval, parameters, optimState)
            end
         end
      else
         print 'error at data augment flag value'
      end
   end


   --------end of local optim function--------------------------------------
   -- time taken
   time = sys.clock() - time
   time = time / trainData:size()
   print("\n==> time to learn 1 sample = " .. (time*1000) .. 'ms')

   -- print confusion matrix
   print(confusion)
   sys.sleep(1)
   -- update logger/plot
   trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
   if opt.plot then
      trainLogger:style{['% mean class accuracy (train set)'] = '-'}
      trainLogger:plot()
   end

   -- save/log current net
   local filename = paths.concat(opt.save, 'model.net')
   os.execute('mkdir -p ' .. sys.dirname(filename))
   print('==> saving model to '..filename)
   torch.save(filename, model)

   -- next epoch
   confusion:zero()
   epoch = epoch + 1
end

In the snippet above there are two identical feval functions, and each step of train() processes only one image. opt.augment is a switch for creating many small images from the original input (3x256x256, cropped to 3x224x224, then resized to 3x112x112).

The failing line is model:backward( input, df_do:cuda() ) in the section where opt.augment == 'true'.

According to the source code of model:backward, it takes the input and adjusts the results using df_do.
The same line works fine in the other branch, so why does this one fail? T_T

soumith commented on August 17, 2024

OK, so if one feval works fine and the other fails, your dataBang function is probably not giving correctly sized inputs.
What you can do is print the input size right before the model:forward line, in both locations, with:
print(#input)
You will then see that in your second (augment=true) code path the inputs are shaped wrong by dataBang (at least, that is what I suspect).

russellfei commented on August 17, 2024

Morning~ @soumith
Well, I tried that before: the input size is 3x256x256 when augment == "false" and 3x112x112 when augment == "true". They are actually fed into different network architectures, which are listed below.

model = nn.Sequential()

if opt.model == 'convnet' then
   -- input dimensions
   if opt.augment == 'true' then
      nBands = 3
      width = 112
      height = 112
      --TODO: specify augmented cnn arch
      hidConv = {96,128,256,384,512,768,210}
      filtsize = {5,5,3,3,3,3}
      poolsize = {2,0,3,0,4,0}

      -- stage 1 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
      -- stage 2 : filter bank -> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
      -- stage 3: filter bank --> nonlinear  -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))

      -- stage 4: filter bank --> nonlinear --> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))

      -- stage 5: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
      model:add(nn.ReLU())
      model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))

      -- stage 6: filter bank --> nonlinear -> L2 pooling
      model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
      model:add(nn.ReLU())
      --model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))

      -- stage 6 : standard 2-layer neural network
      model:add(nn.Reshape(hidConv[6]))
      model:add(nn.Linear(hidConv[6], hidConv[7]))
      model:add(nn.Tanh())
      model:add(nn.Linear(hidConv[7], noutputs))
   else
      if opt.augment == 'false' then
         nBands = 3
         width = 256
         height = 256
         -- hidden units, filter sizes (for ConvNet only):
         hidConv = {128,256,384,512,768,768,210}
         filtsize = {5,7,5,5,3,3}
         poolsize = {2,2,2,2,2,3}

         -- stage 1 : filter bank -> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
         -- stage 2 : filter bank -> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
         -- stage 3: filter bank --> nonlinear  -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))

         -- stage 4: filter bank --> nonlinear --> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))

         -- stage 5: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
         model:add(nn.ReLU())
         model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))

         -- stage 6 : standard 2-layer neural network
         model:add(nn.Reshape(hidConv[6]))
         model:add(nn.Linear(hidConv[6], hidConv[7]))
         model:add(nn.Tanh())
         model:add(nn.Linear(hidConv[7], noutputs))
      end
   end
end

The forward process is identical, because only one image is passed forward and back at a time.
Besides, I've checked my network architecture parameters and they agree with my computation, i.e., the output is 1x1 for each feature map.
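
The 1x1 claim can be checked by hand. Below is a minimal sketch in plain Lua (no torch needed), assuming unpadded stride-1 convolutions and pooling windows whose stride equals their size, as in the model code above. The helper names outSize and traceSize are made up for this note and are not part of nn.

```lua
-- Output length along one axis for an unpadded window of size k and stride d.
local function outSize(n, k, d)
   return math.floor((n - k) / d) + 1
end

-- Trace the spatial size through the conv/pool stages.
-- filtsize[i] is the conv kernel; poolsize[i] == 0 means "no pooling here".
local function traceSize(n, filtsize, poolsize)
   for i = 1, #filtsize do
      n = outSize(n, filtsize[i], 1)               -- convolution, stride 1
      if poolsize[i] > 0 then
         n = outSize(n, poolsize[i], poolsize[i])  -- pooling, stride == window
      end
   end
   return n
end

-- Values copied from the two branches of the model definition above.
local augTrue  = traceSize(112, {5, 5, 3, 3, 3, 3}, {2, 0, 3, 0, 4, 0})
local augFalse = traceSize(256, {5, 7, 5, 5, 3, 3}, {2, 2, 2, 2, 2, 3})
print(augTrue, augFalse)   -- both come out as 1
```

So the spatial arithmetic is consistent in both branches. This sketch only covers width and height; it says nothing about the number of feature planes between stages.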

I'll check it again.

soumith commented on August 17, 2024

The network looks fine. However, what I am saying is: check that your dataBang function always gives out 112x112 crops. When doing random crops, you might hit a corner case somewhere.
Add this line right before the forward call:
print(#input)
Check that every sample has exactly the same input size, and that dataBang is not sometimes returning 112x111, for example.
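
Rather than reading millions of printed sizes, the same check can fail loudly on the first bad sample. A minimal sketch in plain Lua, with shapes passed as ordinary tables such as {3, 112, 112}; sameShape and checkShape are hypothetical helpers, and how the table is obtained from a tensor (for example via input:size():totable()) is an assumption about the installed torch version.

```lua
-- Compare two shapes given as plain Lua tables, e.g. {3, 112, 112}.
local function sameShape(a, b)
   if #a ~= #b then return false end
   for i = 1, #a do
      if a[i] ~= b[i] then return false end
   end
   return true
end

-- Hypothetical guard: raise an error that names the offending sample.
local function checkShape(shape, expected, idx)
   if not sameShape(shape, expected) then
      error(string.format('sample %d has shape %s, expected %s',
         idx, table.concat(shape, 'x'), table.concat(expected, 'x')))
   end
end

local expected = {3, 112, 112}
checkShape({3, 112, 112}, expected, 1)   -- passes silently

-- A 112x111 crop is caught; pcall is used here only to show the message.
local ok, msg = pcall(checkShape, {3, 112, 111}, expected, 2)
print(ok, msg)
```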

russellfei commented on August 17, 2024

@soumith
However, this error is raised when the first image is backpropagated,

and the input size for that first image is

input............................. 1/4435200 ..................................] ETA: 0ms | Step: 0ms                              
   3
 112
 112
[torch.LongStorage of size 3]

df_do   

 21
[torch.LongStorage of size 1]

using this snippet

            print()
            print('input')
            print(#input)
            --print('---> forward propagation')
            local outputs = model:forward(input)
            outputs = outputs:float()
            ---> transfer to floatTensor to calculate
            -- calculate gradient matrix 
            local df_do = torch.Tensor(outputs:size())
            print('df_do')
            print(#df_do)

Maybe df_do is not strictly defined?

But I also checked the input and df_do when augment == "false", and the sizes are consistent in both cases.

soumith commented on August 17, 2024

The size of df_do should be equal to noutputs, afaik.

soumith commented on August 17, 2024

Also, try replacing the LPPooling with MaxPooling and see if that works, just to be sure nothing funky is going on with LPPooling.

russellfei commented on August 17, 2024

Thanks @soumith
I'll check that, it's a really strange error.
Night ;-)

russellfei commented on August 17, 2024

Genius!!! @soumith
it works!
Thanks again!
I've been struggling with this error for almost 20 hours over 2 days.

Notes:
Using print(model), I've noticed that the nn.Power module is a part of SpatialLPPooling.
Maybe I should read more of the lower-level implementation.

soumith commented on August 17, 2024

what was the solution?

russellfei commented on August 17, 2024

Maybe there's something wrong with how I call `SpatialLPPooling`.
I'll figure that out. ;-)

russellfei commented on August 17, 2024

I changed SpatialMaxPooling back to SpatialLPPooling
and got an error message like this:

==> defining some tools 
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
  (1): nn.SpatialConvolutionMM
  (2): nn.ReLU
  (3): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (4): nn.SpatialConvolutionMM
  (5): nn.ReLU
  (6): nn.SpatialConvolutionMM
  (7): nn.ReLU
  (8): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (9): nn.SpatialConvolutionMM
  (10): nn.ReLU
  (11): nn.SpatialConvolutionMM
  (12): nn.ReLU
  (13): nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.Square
    (2): nn.SpatialSubSampling
    (3): nn.Sqrt
  }
  (14): nn.SpatialConvolutionMM
  (15): nn.ReLU
  (16): nn.Reshape
  (17): nn.Linear
  (18): nn.Tanh
  (19): nn.Linear
  (20): nn.LogSoftMax
}
==> configuring optimizer   
==> training!   
==> doing epoch on training data:   
==> online epoch # 1 [batchSize = 1]    
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Sequential.lua:37: size mismatch
stack traceback:
    [C]: in function 'updateOutput'
    /usr/local/share/lua/5.1/nn/Sequential.lua:37: in function 'forward'
    ucmcnn_aug_LP.lua:939: in function 'opfunc'
    /usr/local/share/lua/5.1/optim/sgd.lua:40: in function 'optimMethod'
    ucmcnn_aug_LP.lua:979: in function 'train'
    ucmcnn_aug_LP.lua:1166: in main chunk
    [C]: in function 'dofile'
    /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:109: in main chunk
    [C]: at 0x00404480

Well, there is a block of code in SpatialLPPooling like this:

   if pnorm == 2 then
      self:add(nn.Square())
   else
      self:add(nn.Power(pnorm))
   end
   self:add(nn.SpatialSubSampling(nInputPlane, kW, kH, dW, dH))
   if pnorm == 2 then
      self:add(nn.Sqrt())
   else
      self:add(nn.Power(1/pnorm))
   end

   self:get(2).bias:zero()
   self:get(2).weight:fill(1)
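
Read per pooling window, the block above raises each input to the power p, sums the window (a SpatialSubSampling with weight 1 and bias 0 reduces to a plain sum), and then takes the 1/p power. A minimal numeric sketch in plain Lua follows; lpPool is a hypothetical helper that treats one window as a flat list of numbers, not the nn implementation.

```lua
-- Lp-pool a single window: (sum over the window of x^p)^(1/p).
-- The plain sum stands in for SpatialSubSampling with weight 1 and bias 0.
local function lpPool(window, p)
   local sum = 0
   for _, x in ipairs(window) do
      sum = sum + x ^ p
   end
   return sum ^ (1 / p)
end

print(lpPool({3, 4}, 2))         -- the L2 norm of the window
print(lpPool({1, 1, 1, 1}, 2))   -- four ones in a 2x2 window
```

For p = 2 this is the Square -> SpatialSubSampling -> Sqrt chain that print(model) shows; for any other p it is the nn.Power path.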

I think there's some rule to follow when SpatialLPPooling is called.
BTW, the `nn.Power` error is due to the missing `pnorm` argument in the later `SpatialLPPooling` definitions.
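
To see what a missing pnorm does, here is a small sketch of how the positional arguments shift. lpPoolingArgs is a hypothetical stand-in that only mirrors the argument order (nInputPlane, pnorm, kW, kH, dW, dH) used in the block above; the fallback of dW and dH to kW and kH is an assumption about the constructor, not something quoted in this thread.

```lua
-- Hypothetical stand-in for the constructor's positional arguments.
local function lpPoolingArgs(nInputPlane, pnorm, kW, kH, dW, dH)
   dW = dW or kW   -- assumed default, see the note above
   dH = dH or kH
   return { nInputPlane = nInputPlane, pnorm = pnorm,
            kW = kW, kH = kH, dW = dW, dH = dH }
end

-- Stage 1 is called with pnorm = 2 followed by four pool sizes.
local withPnorm = lpPoolingArgs(96, 2, 2, 2, 2, 2)

-- Stage 3 (augment == 'true') passes only the four pool sizes, poolsize[3] = 3.
local withoutPnorm = lpPoolingArgs(256, 3, 3, 3, 3)

print(withPnorm.pnorm)      -- 2: the nn.Square / nn.Sqrt branch
print(withoutPnorm.pnorm)   -- 3: the first pool size lands in pnorm, so nn.Power is used
print(withoutPnorm.kW, withoutPnorm.dH)
```

Under that assumption the window geometry stays the same because all four pool sizes are equal, but the layer silently computes a different norm.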

In short, there's something here we have to dig into.

Too tired to continue, see you~

russellfei commented on August 17, 2024

The bug has been caught, a really little bug!

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))

should be

         -- stage 6: filter bank --> nonlinear -> L2 pooling
         model:add(nn.SpatialConvolutionMM(hidConv[5], hidConv[6], filtsize[6], filtsize[6]))
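
A short sketch of why this typo only bites in the augment == 'true' network: each convolution must take as many input planes as the previous one produced. firstChannelMismatch is a hypothetical helper in plain Lua, written for this note and not part of nn; the hidConv tables are copied from the model definition earlier in the thread.

```lua
-- Each stage is {nInputPlane, nOutputPlane}. Returns the index of the first
-- stage whose input planes differ from the previous stage's output planes
-- (plus the expected and declared counts), or nil if the chain is consistent.
local function firstChannelMismatch(nBands, stages)
   local prev = nBands
   for i, s in ipairs(stages) do
      if s[1] ~= prev then return i, prev, s[1] end
      prev = s[2]
   end
   return nil
end

-- Build the six conv stages; stage 6 uses hidConv[sixthIn] as its input planes.
local function convStages(hidConv, sixthIn)
   return {
      {3, hidConv[1]}, {hidConv[1], hidConv[2]}, {hidConv[2], hidConv[3]},
      {hidConv[3], hidConv[4]}, {hidConv[4], hidConv[5]},
      {hidConv[sixthIn], hidConv[6]},
   }
end

local hidTrue  = {96, 128, 256, 384, 512, 768, 210}    -- augment == 'true'
local hidFalse = {128, 256, 384, 512, 768, 768, 210}   -- augment == 'false'

local badStage, expectedIn, declaredIn =
   firstChannelMismatch(3, convStages(hidTrue, 6))      -- as originally written
local fixedStage = firstChannelMismatch(3, convStages(hidTrue, 5))
local falseStage = firstChannelMismatch(3, convStages(hidFalse, 6))

print(badStage, expectedIn, declaredIn)  -- stage 6 expects 512 planes, 768 declared
print(fixedStage, falseStage)            -- nil nil
```

In the augment == 'false' table hidConv[5] and hidConv[6] are both 768, so the same typo is harmless there, which matches the observation that only one branch failed.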

However, during these past hours I've noticed another weird thing, and I'll open a new issue for it.

Thanks~ @soumith

russellfei commented on August 17, 2024

According to the contribution guidelines of torch, please delete this issue, because it is a personal help request that should have been posted on the mailing list (the Google group, which I often have no access to). Thanks @soumith
