
Comments (14)

kozistr commented on May 23, 2024

Hello!

The loss is calculated only once in your code, but SAM needs two forward & backward passes. In the SAM optimizer part, the loss is actually computed in only one place (loss = m_backbone_loss).

To fix the code:

    with torch.cuda.device(CUDA_VISIBLE_DEVICES):
        inputs = data[0].cuda()
        labels = data[1].cuda()

    iters += 1
    optimizers.zero_grad()

    # I don't know which criterion you used, but if `criterion` is the
    # PyTorch built-in cross-entropy function, it is better to use
    # `reduction='mean'`. 1) and 2) below are equivalent:
    # 1) loss = criterion(scores, labels, reduction='mean')  # output is a scalar
    # 2) target_loss = criterion(scores, labels)  # with reduction='none'
    #    loss = torch.sum(target_loss) / target_loss.size(0)

    # ----------------- SAM Optimizer -------------------
    # first forward-backward pass
    # need to assign the result of `criterion` to `loss`
    loss = criterion(models(inputs)[0], labels, reduction='mean')
    loss.backward(retain_graph=True)
    optimizers.first_step(zero_grad=True)

    # second forward-backward pass
    # need to assign the result of `criterion` to `loss`
    loss = criterion(models(inputs)[0], labels, reduction='mean')
    loss.backward(retain_graph=True)
    optimizers.second_step(zero_grad=True)
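
For completeness, a minimal sketch of how the SAM optimizer itself is constructed in pytorch_optimizer (the hyperparameter values here are placeholders; the setup matches the main() shown later in this thread):

    import torch
    from pytorch_optimizer import SAM

    # SAM wraps a base optimizer; extra kwargs are forwarded to it
    base_optimizer = torch.optim.SGD
    optimizers = SAM(models.parameters(), base_optimizer, lr=0.1, momentum=0.9)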


kozistr commented on May 23, 2024

> IndexError: dimension specified as 0 but tensor has no dimensions

Could you upload the whole error message?

Also, you should comment out the code like below; the SAM optimizer part already does the forward & backward passes twice!

+) What is the shape of models(inputs)[0]? It should be (batch_size, num_classes) logits!

    #scores, _, features = models(inputs)   
    # scores = models(inputs)     
    # target_loss = criterion(scores, labels)
    # loss = torch.sum(target_loss) / target_loss.size(0)       
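
A quick way to check this inside the loop (a minimal sketch; models, inputs, and labels are the variables from your training code):

    # logits should be 2-D (batch_size, num_classes), labels 1-D (batch_size,)
    logits = models(inputs)[0]
    print(logits.shape, labels.shape)
    assert logits.dim() == 2 and logits.size(0) == labels.size(0)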


kozistr commented on May 23, 2024

> reduction

Oh, sorry for the confusing code.

I meant: set the reduction parameter to 'mean' in the definition, like in your code: criterion = nn.CrossEntropyLoss(reduction='mean').

Then removing the reduction keyword from the call should fix it: change loss = criterion(models(inputs)[0], labels, reduction='mean')
to loss = criterion(models(inputs)[0], labels).
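
In other words, the pattern should look like this (a minimal sketch; only the functional F.cross_entropy accepts reduction as a call-time argument):

    import torch.nn as nn

    # reduction is fixed once, in the constructor ('mean' is also the default) ...
    criterion = nn.CrossEntropyLoss(reduction='mean')

    # ... so the call takes only (input, target); passing reduction= here
    # raises TypeError: forward() got an unexpected keyword argument 'reduction'
    loss = criterion(models(inputs)[0], labels)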


manza-ari commented on May 23, 2024

Yeah! You are right. I understood you earlier and tried without (reduction='mean') in train_epoch(), but another error was raised:

File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/main.py", line 120, in
train(models, criterion, optimizers, schedulers, dataloaders, args.no_of_epochs, EPOCHL)
File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/train_test.py", line 64, in train
loss = train_epoch(models, criterion, optimizers, dataloaders)
File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/train_test.py", line 48, in train_epoch
loss = criterion(models(inputs)[0], labels)
File "/home/kanza/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/kanza/anaconda3/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1164, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/kanza/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: size mismatch (got input: [100], target: [10])


kozistr commented on May 23, 2024

I guess the RuntimeError: size mismatch (got input: [100], target: [10]) error means that the output of the model has 100 classes (the model's num_classes is 100), but the labels have only 10 classes.

So correcting num_classes should solve the issue.
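
For reference, a standalone sketch of the shape contract nn.CrossEntropyLoss expects (the sizes here are illustrative):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()

    logits = torch.randn(10, 100)          # (batch_size=10, num_classes=100)
    labels = torch.randint(0, 100, (10,))  # (batch_size=10,) with class ids in [0, 100)

    loss = criterion(logits, labels)       # OK: batch sizes match, ids < num_classes
    print(loss.item())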


kozistr commented on May 23, 2024

> Yeah, I changed the number of classes to 100 in ResNet but .... I don't know
>
> [Screenshot from 2022-08-20 19-38-06]

Then the label data you have has 10 classes in total (num_classes is 10), so change the model's num_classes to 10!

num_classes must equal the number of classes in the dataset labels : )

For example, the MNIST dataset has 10 classes (0 ~ 9), so the output dimension of the model (num_classes) should be 10.


kozistr commented on May 23, 2024

Oh, you are using the CIFAR100 dataset. Then, as you said, num_classes = 100 is correct. Your transformation code also looks fine to me.

How about checking the number of classes in data_train? Perhaps it doesn't actually load the CIFAR100 dataset but CIFAR10 instead (my rough guess); see the sketch below.

I guess there's an issue with the dataset part!
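
One quick check (a minimal sketch; data_train is the dataset built in load_dataset):

    # torchvision's CIFAR datasets expose class names and integer targets
    print(len(data_train.classes))                            # expect 100 for CIFAR100
    print(min(data_train.targets), max(data_train.targets))  # expect 0 and 99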


manza-ari commented on May 23, 2024

I think there was some bug; it was not working for any of the datasets, so I made the changes again in a new folder, and everything is working fine now. I cannot thank you enough for helping me today.
I have a question!
Let's go back to the original train_epoch() of the original repository, where the loss is calculated as follows:

    target_loss = criterion(scores, labels)
    loss = torch.sum(target_loss) / target_loss.size(0)

and now we are doing it like this; I hope we are not doing anything mathematically or logically wrong:

    for data in tqdm(dataloaders['train'], leave=False, total=len(dataloaders['train'])):
        with torch.cuda.device(CUDA_VISIBLE_DEVICES):
            inputs = data[0].cuda()
            labels = data[1].cuda()
        iters += 1
        optimizers.zero_grad()
        scores, _, features = models(inputs)
        #target_loss = criterion(scores, labels)
        #m_backbone_loss = torch.sum(target_loss) / target_loss.size(0)
        #loss = m_backbone_loss

        # ----------------- SAM Optimizer -------------------
        # first forward-backward pass
        loss = criterion(models(inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers.first_step(zero_grad=True)

        # second forward-backward pass
        loss = criterion(models(inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers.second_step(zero_grad=True)


kozistr commented on May 23, 2024

> and now we are doing it like this, I hope we are not doing anything mathematically or logically wrong. [...]

I'm glad you solved the issue :)

Regarding the loss value, there is no difference between

criterion(scores, labels) with reduction='mean' and target_loss = criterion(scores, labels) with reduction='none'; loss = torch.sum(target_loss) / target_loss.size(0).

Both yield a single scalar value, e.g. 0.1234, so the extra sum-and-divide does not change the loss; see the check below.

Regarding the SAM part, what matters is that the final loss is the one that gets backpropagated via loss.backward(retain_graph=True).

But in your original code, the final loss (target_loss = criterion(scores, labels); loss = torch.sum(target_loss) / target_loss.size(0)) was never passed to .backward(); only the freshly computed criterion(scores, labels) inside the SAM block was.
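
A standalone check of that equivalence (a minimal sketch with made-up shapes):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))

    mean_loss = F.cross_entropy(logits, labels, reduction='mean')

    per_sample = F.cross_entropy(logits, labels, reduction='none')  # shape (8,)
    manual_mean = torch.sum(per_sample) / per_sample.size(0)

    assert torch.allclose(mean_loss, manual_mean)  # same value either way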

I hope this helps!


manza-ari commented on May 23, 2024

Yeah, I am using "Cross Entropy". Okay, I edited it as per your recommendation, but now it says IndexError: dimension specified as 0 but tensor has no dimensions:
    def train_epoch(models, criterion, optimizers, dataloaders):
        models.train()

        global iters
        for data in tqdm(dataloaders['train'], leave=False, total=len(dataloaders['train'])):
            with torch.cuda.device(CUDA_VISIBLE_DEVICES):
                inputs = data[0].cuda()
                labels = data[1].cuda()
            iters += 1
            optimizers.zero_grad()
            #scores, _, features = models(inputs)
            scores = models(inputs)
            target_loss = criterion(scores, labels)
            loss = torch.sum(target_loss) / target_loss.size(0)

            # ----------------- SAM Optimizer -------------------

            # first forward-backward pass
            loss = criterion(models(inputs)[0], labels, reduction='mean')
            loss.backward(retain_graph=True)
            optimizers.first_step(zero_grad=True)

            # second forward-backward pass
            loss = criterion(models(inputs)[0], labels, reduction='mean')
            loss.backward(retain_graph=True)
            optimizers.second_step(zero_grad=True)


manza-ari commented on May 23, 2024

So in main() I made everything like this:

    # Loss, criterion and scheduler (re)initialization
    criterion = nn.CrossEntropyLoss(reduction='mean')
    base_optimizer = torch.optim.SGD
    optim_backbone = SAM(models.parameters(), base_optimizer, lr=LR, momentum=MOMENTUM, weight_decay=WDECAY)

    sched_backbone = lr_scheduler.MultiStepLR(optim_backbone, milestones=MILESTONES)
    optimizers = optim_backbone
    schedulers = sched_backbone

and train_epoch() is:

    def train_epoch(models, criterion, optimizers, dataloaders):
        models.train()

        global iters
        for data in tqdm(dataloaders['train'], leave=False, total=len(dataloaders['train'])):
            with torch.cuda.device(CUDA_VISIBLE_DEVICES):
                inputs = data[0].cuda()
                labels = data[1].cuda()
            iters += 1
            optimizers.zero_grad()
            #scores, _, features = models(inputs)
            #scores = models(inputs)
            #target_loss = criterion(scores, labels)
            #loss = torch.sum(target_loss) / target_loss.size(0)

            # ----------------- SAM Optimizer -------------------

            # first forward-backward pass
            loss = criterion(models(inputs)[0], labels, reduction='mean')
            loss.backward(retain_graph=True)
            optimizers.first_step(zero_grad=True)

            # second forward-backward pass
            loss = criterion(models(inputs)[0], labels, reduction='mean')
            loss.backward(retain_graph=True)
            optimizers.second_step(zero_grad=True)

and now the error is:

    Train a Model.
    /home/kanza/anaconda3/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
    Finished.
    Trial 1/1 || Cycle 1/10 || Label set size 5: Test acc 1.04
    Train a Model.
    Traceback (most recent call last):
      File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/main.py", line 120, in <module>
        train(models, criterion, optimizers, schedulers, dataloaders, args.no_of_epochs, EPOCHL)
      File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/train_test.py", line 64, in train
        loss = train_epoch(models, criterion, optimizers, dataloaders)
      File "/home/kanza/workspace/WithoutDictionary/RandomFixedSAM/train_test.py", line 48, in train_epoch
        loss = criterion(models(inputs)[0], labels, reduction='mean')
      File "/home/kanza/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
    TypeError: forward() got an unexpected keyword argument 'reduction'


manza-ari commented on May 23, 2024

Yeah, I changed the number of classes to 100 in ResNet but .... I don't know

[Screenshot from 2022-08-20 19-38-06]


manza-ari commented on May 23, 2024

I changed the number of classes in the backbone. I am using CIFAR100 and changed the number of classes:

    class ResNet(nn.Module):
        def __init__(self, block, num_blocks, num_classes=100):
            super(ResNet, self).__init__()
            self.in_planes = 64

            self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(64)
            self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
            self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
            self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
            self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
            self.linear = nn.Linear(512*block.expansion, num_classes)
            # self.linear2 = nn.Linear(1000, num_classes)

        def _make_layer(self, block, planes, num_blocks, stride):
            strides = [stride] + [1]*(num_blocks-1)
            layers = []
            for stride in strides:
                layers.append(block(self.in_planes, planes, stride))
                self.in_planes = planes * block.expansion
            return nn.Sequential(*layers)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out1 = self.layer1(out)
            out2 = self.layer2(out1)
            out3 = self.layer3(out2)
            out4 = self.layer4(out3)
            out = F.avg_pool2d(out4, 4)
            outf = out.view(out.size(0), -1)
            # outl = self.linear(outf)
            out = self.linear(outf)
            #return out, outf, [out1, out2, out3, out4]
            return out

    def ResNet18(num_classes=100):
        return ResNet(BasicBlock, [2,2,2,2], num_classes)
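
A quick sanity check of this model's output shape (a minimal sketch; assumes the imports and the BasicBlock definition from the same file). The indexing note at the end is only a guess at what may relate to the earlier size-mismatch error:

    model = ResNet18(num_classes=100)
    x = torch.randn(4, 3, 32, 32)  # fake batch of CIFAR-sized inputs
    out = model(x)
    print(out.shape)     # torch.Size([4, 100])

    # since forward() now returns a single tensor, model(x)[0] is the first
    # row of the logits (shape [100]), not the logits themselves
    print(out[0].shape)  # torch.Size([100])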

load_dataset() also sets the number of classes:

    def load_dataset(dataset):
        train_transform = T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomCrop(size=32, padding=4),  # ImageNet 224
            T.ToTensor(),
            T.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])  # T.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)) # CIFAR-100
        ])

        test_transform = T.Compose([
            T.ToTensor(),
            T.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])  # T.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)) # CIFAR-100
        ])

        if dataset == 'cifar100':
            data_train = CIFAR100('../cifar100', train=True, download=True, transform=train_transform)
            data_unlabeled = MyDataset(dataset, True, test_transform)
            data_test = CIFAR100('../cifar100', train=False, download=True, transform=test_transform)
            NO_CLASSES = 100
            adden = ADDENDUM
            no_train = NUM_TRAIN


kozistr commented on May 23, 2024

I'll close this issue (along with your previous one). If you have any questions, feel free to reopen it or create another issue :)

best regards
