Comments (7)
@Foristkirito Could you provide a small snippet to reproduce your bug?
from examples.
@wangg12 Of course. I am using code I modified myself, run with the command:
python main.py -a alexnet -j 6 --resume ./alexnet_cp --epochs 90 -b 256 ./data
I think the problem is that the loss grows too large and overflows.
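As a side note, one generic way to catch that kind of blow-up early is to verify that the loss is still finite on every iteration. This is only a sketch of the idea, not part of the example's main.py:

```python
import math
import torch

def check_loss(loss: torch.Tensor) -> float:
    """Return the scalar loss, raising early if it has overflowed.

    A loss that hits inf/nan usually means the learning rate is too
    high or gradients are exploding.
    """
    value = float(loss.item())
    if not math.isfinite(value):
        raise RuntimeError(
            f"Loss is {value}; consider lowering the learning rate "
            "or adding gradient clipping."
        )
    return value
```

Failing fast like this is cheaper than discovering the overflow 90 epochs later in the logs.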
I also ran with --pretrained; the loss is OK, but after 90 epochs the accuracy barely changed, as shown below:
* Prec@1 10.000 Prec@5 50.000
Epoch: [89][0/196] Time 1.455 (1.455) Data 1.072 (1.072) Loss 2.3118 (2.3118) Prec@1 9.375 (9.375) Prec@5 49.219 (49.219)
Epoch: [89][10/196] Time 0.406 (0.507) Data 0.001 (0.100) Loss 2.3118 (2.3118) Prec@1 10.547 (10.085) Prec@5 51.172 (50.604)
Epoch: [89][20/196] Time 0.408 (0.460) Data 0.001 (0.053) Loss 2.3118 (2.3118) Prec@1 8.594 (10.212) Prec@5 51.172 (50.930)
Epoch: [89][30/196] Time 0.401 (0.444) Data 0.001 (0.036) Loss 2.3119 (2.3118) Prec@1 8.594 (9.929) Prec@5 44.922 (50.441)
Epoch: [89][40/196] Time 0.004 (0.436) Data 0.001 (0.027) Loss 2.3118 (2.3118) Prec@1 10.156 (9.861) Prec@5 55.859 (50.210)
Epoch: [89][50/196] Time 0.410 (0.431) Data 0.001 (0.022) Loss 2.3119 (2.3118) Prec@1 10.547 (9.934) Prec@5 49.609 (50.444)
Epoch: [89][60/196] Time 0.415 (0.428) Data 0.001 (0.019) Loss 2.3117 (2.3118) Prec@1 11.719 (10.028) Prec@5 55.078 (50.506)
Epoch: [89][70/196] Time 0.407 (0.426) Data 0.001 (0.016) Loss 2.3119 (2.3118) Prec@1 8.594 (10.030) Prec@5 49.219 (50.539)
Epoch: [89][80/196] Time 0.393 (0.428) Data 0.001 (0.014) Loss 2.3118 (2.3118) Prec@1 6.641 (9.968) Prec@5 50.781 (50.236)
Epoch: [89][90/196] Time 0.392 (0.426) Data 0.001 (0.013) Loss 2.3119 (2.3118) Prec@1 8.984 (10.045) Prec@5 49.219 (50.206)
Epoch: [89][100/196] Time 0.591 (0.425) Data 0.001 (0.011) Loss 2.3118 (2.3118) Prec@1 10.156 (9.998) Prec@5 50.000 (50.085)
Epoch: [89][110/196] Time 0.399 (0.423) Data 0.001 (0.011) Loss 2.3118 (2.3118) Prec@1 13.672 (10.015) Prec@5 51.953 (50.070)
Epoch: [89][120/196] Time 0.395 (0.422) Data 0.001 (0.010) Loss 2.3119 (2.3118) Prec@1 8.203 (9.985) Prec@5 48.438 (49.913)
Epoch: [89][130/196] Time 0.389 (0.422) Data 0.001 (0.009) Loss 2.3118 (2.3118) Prec@1 10.938 (9.951) Prec@5 50.781 (49.860)
Epoch: [89][140/196] Time 0.404 (0.421) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 8.984 (9.912) Prec@5 50.000 (49.986)
Epoch: [89][150/196] Time 0.397 (0.421) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 7.422 (9.910) Prec@5 49.609 (50.000)
Epoch: [89][160/196] Time 0.408 (0.419) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 9.766 (9.899) Prec@5 47.266 (49.939)
Epoch: [89][170/196] Time 0.399 (0.419) Data 0.001 (0.007) Loss 2.3119 (2.3118) Prec@1 12.109 (9.875) Prec@5 48.438 (49.836)
Epoch: [89][180/196] Time 0.405 (0.419) Data 0.001 (0.007) Loss 2.3119 (2.3118) Prec@1 11.328 (9.874) Prec@5 48.828 (49.767)
Epoch: [89][190/196] Time 0.398 (0.419) Data 0.000 (0.006) Loss 2.3119 (2.3118) Prec@1 8.203 (9.835) Prec@5 49.219 (49.691)
Test: [0/40] Time 0.845 (0.845) Loss 2.3119 (2.3119) Prec@1 8.984 (8.984) Prec@5 46.875 (46.875)
Test: [10/40] Time 0.155 (0.264) Loss 2.3118 (2.3118) Prec@1 10.547 (10.298) Prec@5 54.688 (49.503)
Test: [20/40] Time 0.168 (0.214) Loss 2.3118 (2.3118) Prec@1 10.938 (10.305) Prec@5 53.125 (50.186)
Test: [30/40] Time 0.338 (0.203) Loss 2.3117 (2.3118) Prec@1 9.766 (10.131) Prec@5 57.812 (50.101)
* Prec@1 10.000 Prec@5 50.000
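For reference, a flat loss of about 2.312 together with Prec@1 near 10% and Prec@5 near 50% is exactly what a 10-class classifier produces when it predicts (near-)uniformly: the cross-entropy of a uniform prediction over 10 classes is ln(10) ≈ 2.3026, and chance-level top-1/top-5 accuracy is 10%/50%. A quick check of that arithmetic:

```python
import math
import torch
import torch.nn.functional as F

# All-zero logits give a perfectly uniform softmax over 10 classes.
logits = torch.zeros(1, 10)
target = torch.tensor([3])  # any class; the loss is the same for all
loss = F.cross_entropy(logits, target)
print(loss.item(), math.log(10))  # both ≈ 2.302585
```

So the model in the log above has effectively learned nothing, which points to a setup problem rather than a subtle training issue.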
@wangg12 I figured it out. I made a mistake. Problem solved, thank you.
@Foristkirito What is the problem?
@wangg12 The problem is the directories. You have to keep the project's directory structure and put the whole thing right in your home directory. I do not understand why, but it does not work if you put only the imagenet folder in your home directory. However, resnet always works fine. I will spend some time figuring out the real problem.
@Foristkirito Do you get better results with alexnet on CIFAR-10 now?
Also, IMO alexnet is not suitable for CIFAR-10: the architecture was designed for much larger images, not 32x32 CIFAR-10 images.
Besides, if you do not use pre-trained weights, you should be careful with the learning rate and the weight initialization (the random behavior can be fixed with torch.manual_seed(seed)).
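A minimal sketch of that advice follows; the seed value, init scheme, and tiny model here are illustrative choices, not what the examples repo uses:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # fix RNG so weight init (and shuffling) is reproducible

def init_weights(m: nn.Module) -> None:
    # One common from-scratch scheme: Kaiming init for convs feeding ReLU,
    # small-std normal init for the linear classifier head.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

# A toy CIFAR-10-sized model, just to show model.apply(...) re-initializing it.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 10),
)
model.apply(init_weights)
```

With --pretrained this matters less, but when training from scratch, a learning rate tuned for ImageNet can easily leave a model on CIFAR-10 stuck at chance level.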
@wangg12 The accuracy of alexnet is still low. That sounds like a good solution; I will try it.