Comments (7)
@Foristkirito Could you provide a small snippet to reproduce your bug?
from examples.
@wangg12 Of course. I am using code I modified myself, run with the command:
python main.py -a alexnet -j 6 --resume ./alexnet_cp --epochs 90 -b 256 ./data
I think the problem is that the loss grows too large and overflows.
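As a side note, one generic way to catch that kind of blow-up early is to verify that the loss is still finite on every iteration. This is only a sketch of the idea, not part of the example's main.py:

```python
import math
import torch

def check_loss(loss: torch.Tensor) -> float:
    """Return the scalar loss, raising early if it has overflowed.

    A loss that hits inf/nan usually means the learning rate is too
    high or gradients are exploding.
    """
    value = float(loss.item())
    if not math.isfinite(value):
        raise RuntimeError(
            f"Loss is {value}; consider lowering the learning rate "
            "or adding gradient clipping."
        )
    return value
```

Failing fast like this is cheaper than discovering the overflow 90 epochs later in the logs.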
I also ran with --pretrained; the loss is OK, but after 90 epochs the accuracy barely changed, as shown below:
* Prec@1 10.000 Prec@5 50.000
Epoch: [89][0/196] Time 1.455 (1.455) Data 1.072 (1.072) Loss 2.3118 (2.3118) Prec@1 9.375 (9.375) Prec@5 49.219 (49.219)
Epoch: [89][10/196] Time 0.406 (0.507) Data 0.001 (0.100) Loss 2.3118 (2.3118) Prec@1 10.547 (10.085) Prec@5 51.172 (50.604)
Epoch: [89][20/196] Time 0.408 (0.460) Data 0.001 (0.053) Loss 2.3118 (2.3118) Prec@1 8.594 (10.212) Prec@5 51.172 (50.930)
Epoch: [89][30/196] Time 0.401 (0.444) Data 0.001 (0.036) Loss 2.3119 (2.3118) Prec@1 8.594 (9.929) Prec@5 44.922 (50.441)
Epoch: [89][40/196] Time 0.004 (0.436) Data 0.001 (0.027) Loss 2.3118 (2.3118) Prec@1 10.156 (9.861) Prec@5 55.859 (50.210)
Epoch: [89][50/196] Time 0.410 (0.431) Data 0.001 (0.022) Loss 2.3119 (2.3118) Prec@1 10.547 (9.934) Prec@5 49.609 (50.444)
Epoch: [89][60/196] Time 0.415 (0.428) Data 0.001 (0.019) Loss 2.3117 (2.3118) Prec@1 11.719 (10.028) Prec@5 55.078 (50.506)
Epoch: [89][70/196] Time 0.407 (0.426) Data 0.001 (0.016) Loss 2.3119 (2.3118) Prec@1 8.594 (10.030) Prec@5 49.219 (50.539)
Epoch: [89][80/196] Time 0.393 (0.428) Data 0.001 (0.014) Loss 2.3118 (2.3118) Prec@1 6.641 (9.968) Prec@5 50.781 (50.236)
Epoch: [89][90/196] Time 0.392 (0.426) Data 0.001 (0.013) Loss 2.3119 (2.3118) Prec@1 8.984 (10.045) Prec@5 49.219 (50.206)
Epoch: [89][100/196] Time 0.591 (0.425) Data 0.001 (0.011) Loss 2.3118 (2.3118) Prec@1 10.156 (9.998) Prec@5 50.000 (50.085)
Epoch: [89][110/196] Time 0.399 (0.423) Data 0.001 (0.011) Loss 2.3118 (2.3118) Prec@1 13.672 (10.015) Prec@5 51.953 (50.070)
Epoch: [89][120/196] Time 0.395 (0.422) Data 0.001 (0.010) Loss 2.3119 (2.3118) Prec@1 8.203 (9.985) Prec@5 48.438 (49.913)
Epoch: [89][130/196] Time 0.389 (0.422) Data 0.001 (0.009) Loss 2.3118 (2.3118) Prec@1 10.938 (9.951) Prec@5 50.781 (49.860)
Epoch: [89][140/196] Time 0.404 (0.421) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 8.984 (9.912) Prec@5 50.000 (49.986)
Epoch: [89][150/196] Time 0.397 (0.421) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 7.422 (9.910) Prec@5 49.609 (50.000)
Epoch: [89][160/196] Time 0.408 (0.419) Data 0.001 (0.008) Loss 2.3119 (2.3118) Prec@1 9.766 (9.899) Prec@5 47.266 (49.939)
Epoch: [89][170/196] Time 0.399 (0.419) Data 0.001 (0.007) Loss 2.3119 (2.3118) Prec@1 12.109 (9.875) Prec@5 48.438 (49.836)
Epoch: [89][180/196] Time 0.405 (0.419) Data 0.001 (0.007) Loss 2.3119 (2.3118) Prec@1 11.328 (9.874) Prec@5 48.828 (49.767)
Epoch: [89][190/196] Time 0.398 (0.419) Data 0.000 (0.006) Loss 2.3119 (2.3118) Prec@1 8.203 (9.835) Prec@5 49.219 (49.691)
Test: [0/40] Time 0.845 (0.845) Loss 2.3119 (2.3119) Prec@1 8.984 (8.984) Prec@5 46.875 (46.875)
Test: [10/40] Time 0.155 (0.264) Loss 2.3118 (2.3118) Prec@1 10.547 (10.298) Prec@5 54.688 (49.503)
Test: [20/40] Time 0.168 (0.214) Loss 2.3118 (2.3118) Prec@1 10.938 (10.305) Prec@5 53.125 (50.186)
Test: [30/40] Time 0.338 (0.203) Loss 2.3117 (2.3118) Prec@1 9.766 (10.131) Prec@5 57.812 (50.101)
* Prec@1 10.000 Prec@5 50.000
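For reference, a flat loss of about 2.312 together with Prec@1 near 10% and Prec@5 near 50% is exactly what a 10-class classifier produces when it predicts (near-)uniformly: the cross-entropy of a uniform prediction over 10 classes is ln(10) ≈ 2.3026, and chance-level top-1/top-5 accuracy is 10%/50%. A quick check of that arithmetic:

```python
import math
import torch
import torch.nn.functional as F

# All-zero logits give a perfectly uniform softmax over 10 classes.
logits = torch.zeros(1, 10)
target = torch.tensor([3])  # any class; the loss is the same for all
loss = F.cross_entropy(logits, target)
print(loss.item(), math.log(10))  # both ≈ 2.302585
```

So the model in the log above has effectively learned nothing, which points to a setup problem rather than a subtle training issue.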
@wangg12 I figured it out. I made a mistake. Problem solved, thank you.
@Foristkirito What is the problem?
@wangg12 The problem is the directories. You have to keep the project's directory structure and put the whole thing right in your home directory. I do not understand why, but it does not work if you put only the imagenet folder in your home directory. However, resnet always works fine. I will spend some time figuring out the real problem.
@Foristkirito Do you get better results with alexnet on CIFAR-10 now?
Also, IMO alexnet is not suitable for CIFAR-10: the architecture was designed for much larger images, not 32x32 CIFAR-10 images.
Besides, if you do not use pre-trained weights, you should be careful with the learning rate and the weight initialization (the random behavior can be fixed with torch.manual_seed(seed)).
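A minimal sketch of that advice follows; the seed value, init scheme, and tiny model here are illustrative choices, not what the examples repo uses:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # fix RNG so weight init (and shuffling) is reproducible

def init_weights(m: nn.Module) -> None:
    # One common from-scratch scheme: Kaiming init for convs feeding ReLU,
    # small-std normal init for the linear classifier head.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

# A toy CIFAR-10-sized model, just to show model.apply(...) re-initializing it.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 10),
)
model.apply(init_weights)
```

With --pretrained this matters less, but when training from scratch, a learning rate tuned for ImageNet can easily leave a model on CIFAR-10 stuck at chance level.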
@wangg12 The accuracy of alexnet is still low. That sounds like a good solution; I will try it.