exp-deeplearning-tools/pytorchpipeline's Issues
Weights are not saved on best epoch
I will open issues here from now on; previous issues created on 9/10 Jan are in the comments of the different commit versions.
11 Jan issue: weights are not saved on the best epoch based on the monitored_metric. I have taken a look at the source code of results.py and cannot find the error. It appears that results are updated once new_result > old_result; however, I am not sure why the weights are not saved and loaded for the best epoch. To reproduce the issue, just drop me a message @jansky.
A log snippet follows:
Training on Fold 1 and using tf_efficientnet_b2_ns
2021-01-10 18-12-48
LR: 0.001
[RESULT]: Training Epoch: 1 | Avg Validation Summary Loss: 0.086217 | Validation Accuracy: 0.981698 | Time Elapsed: 00:03:51
[RESULT]: Validation Epoch: 1 | Avg Validation Summary Loss: 0.084663 | Validation Accuracy: 0.982342 | Validation ROC: 0.726715 | MultiClass ROC: {0: 0.27328498476140206, 1: 0.7267150152385979} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 1.0000e-03.
2021-01-10 18-16-58
LR: 0.001
[RESULT]: Training Epoch: 2 | Avg Validation Summary Loss: 0.083393 | Validation Accuracy: 0.982377 | Time Elapsed: 00:03:50
[RESULT]: Validation Epoch: 2 | Avg Validation Summary Loss: 0.074522 | Validation Accuracy: 0.982342 | Validation ROC: 0.853385 | MultiClass ROC: {0: 0.14661487775637413, 1: 0.8533851222436258} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 3.0000e-04.
2021-01-10 18-21-07
LR: 0.0003
[RESULT]: Training Epoch: 3 | Avg Validation Summary Loss: 0.075850 | Validation Accuracy: 0.982377 | Time Elapsed: 00:03:52
[RESULT]: Validation Epoch: 3 | Avg Validation Summary Loss: 0.072389 | Validation Accuracy: 0.982342 | Validation ROC: 0.853426 | MultiClass ROC: {0: 0.14657548456903197, 1: 0.8534245154309681} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 3.0000e-04.
2021-01-10 18-25-17
LR: 0.0003
[RESULT]: Training Epoch: 4 | Avg Validation Summary Loss: 0.075004 | Validation Accuracy: 0.982377 | Time Elapsed: 00:03:53
[RESULT]: Validation Epoch: 4 | Avg Validation Summary Loss: 0.072246 | Validation Accuracy: 0.982342 | Validation ROC: 0.865661 | MultiClass ROC: {0: 0.1343386474743058, 1: 0.8656613525256942} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 9.0000e-05.
2021-01-10 18-29-29
LR: 8.999999999999999e-05
[RESULT]: Training Epoch: 5 | Avg Validation Summary Loss: 0.070796 | Validation Accuracy: 0.982377 | Time Elapsed: 00:03:52
[RESULT]: Validation Epoch: 5 | Avg Validation Summary Loss: 0.071158 | Validation Accuracy: 0.982342 | Validation ROC: 0.873397 | MultiClass ROC: {0: 0.12660379513966855, 1: 0.8733962048603314} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 9.0000e-05.
2021-01-10 18-33-39
LR: 8.999999999999999e-05
[RESULT]: Training Epoch: 6 | Avg Validation Summary Loss: 0.067821 | Validation Accuracy: 0.982340 | Time Elapsed: 00:03:51
[RESULT]: Validation Epoch: 6 | Avg Validation Summary Loss: 0.069738 | Validation Accuracy: 0.982342 | Validation ROC: 0.877959 | MultiClass ROC: {0: 0.1220394378329545, 1: 0.8779605621670455} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 2.7000e-05.
2021-01-10 18-37-49
LR: 2.6999999999999996e-05
[RESULT]: Training Epoch: 7 | Avg Validation Summary Loss: 0.067316 | Validation Accuracy: 0.982377 | Time Elapsed: 00:03:49
[RESULT]: Validation Epoch: 7 | Avg Validation Summary Loss: 0.068827 | Validation Accuracy: 0.982342 | Validation ROC: 0.881831 | MultiClass ROC: {0: 0.11816905717658521, 1: 0.8818309428234148} | Time Elapsed: 00:00:17
Adjusting learning rate of group 0 to 2.7000e-05.
2021-01-10 18-41-57
LR: 2.6999999999999996e-05
[RESULT]: Training Epoch: 8 | Avg Validation Summary Loss: 0.064954 | Validation Accuracy: 0.982340 | Time Elapsed: 00:03:52
[RESULT]: Validation Epoch: 8 | Avg Validation Summary Loss: 0.069614 | Validation Accuracy: 0.982342 | Validation ROC: 0.879680 | MultiClass ROC: {0: 0.12031926865234593, 1: 0.879680731347654} | Time Elapsed: 00:00:18
Adjusting learning rate of group 0 to 8.1000e-06.
OOF Score for Fold 1: 0.879680731347654
The OOF score for each fold should simply be the highest monitored metric, which here occurs at epoch 7 (Validation ROC 0.881831); instead, the reported score of 0.879680 matches epoch 8, the last epoch. This is further confirmed to happen only in Folds 1 and 4, where coincidentally the last epoch is not the best one; for Folds 2, 3, and 5, the last epoch happens to also be the best epoch. I can only hypothesize that the weights are saved on the last epoch rather than the best one.
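For reference, below is a minimal sketch of the save-on-best logic I would expect around the new_result > old_result check; the class and argument names (BestCheckpointSaver, save_path, mode) are hypothetical illustrations and are not taken from results.py.

import torch


class BestCheckpointSaver:
    """Saves model weights whenever the monitored metric improves."""

    def __init__(self, save_path: str, mode: str = "max") -> None:
        self.save_path = save_path
        self.mode = mode
        self.best = float("-inf") if mode == "max" else float("inf")
        self.best_epoch = None

    def improved(self, new_result: float) -> bool:
        # Mirrors the new_result > old_result check described above,
        # generalised for metrics that should be minimised instead.
        if self.mode == "max":
            return new_result > self.best
        return new_result < self.best

    def update(self, model: torch.nn.Module, new_result: float, epoch: int) -> None:
        if self.improved(new_result):
            self.best = new_result
            self.best_epoch = epoch
            # The save must happen inside this branch. Calling torch.save
            # unconditionally after the training loop would instead keep
            # the last epoch's weights -- exactly the symptom observed in
            # Folds 1 and 4.
            torch.save(
                {
                    "model_state_dict": model.state_dict(),
                    "best_metric": self.best,
                    "epoch": epoch,
                },
                self.save_path,
            )

If results.py tracks the best metric correctly but only calls torch.save once after the final epoch, it would reproduce the behaviour above: the OOF score silently becomes the last epoch's metric whenever the last epoch is not the best.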
Consider putting df into config
TODO for hongnan: put df from Dataset into config's paths parameter.
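A minimal sketch of what that might look like; the field names (train_csv, save_weight_path) are illustrative assumptions, not the pipeline's actual config schema.

from dataclasses import dataclass, field


@dataclass
class Paths:
    train_csv: str = "./data/train.csv"   # where the Dataset's df is read from
    save_weight_path: str = "./weights"


@dataclass
class Config:
    paths: Paths = field(default_factory=Paths)


# The Dataset would then build its dataframe from the config rather than
# constructing it internally, e.g.:
#   df = pd.read_csv(config.paths.train_csv)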
Updated codes
@jansky 16 Jan update: the code was updated so that one can choose whether or not to use AMP (automatic mixed precision) in PyTorch; the main changes are detailed in today's update remarks.
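A minimal sketch of how such an optional-AMP training step typically looks, assuming a boolean use_amp flag (the flag name and function signature are assumptions, not the pipeline's actual API):

import torch


def train_one_epoch(model, loader, criterion, optimizer, device, use_amp: bool):
    """One training epoch that runs with or without AMP.

    GradScaler and autocast are no-ops when enabled=False, so a single
    code path serves both the mixed-precision and full-precision modes.
    """
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):
            logits = model(images)
            loss = criterion(logits, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()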