Why did I get a nan loss_bbox when I train and eval? (person_search, closed)

shuangli59 commented on June 5, 2024

Why did I get a nan loss_bbox when I train and eval?

from person_search.

Comments (12)

Cysu commented on June 5, 2024

Did you modify the code? For the first training iteration, it should be something like

I1113 15:51:24.800622 32170 solver.cpp:240] Iteration 0, loss = 6.22973
I1113 15:51:24.800657 32170 solver.cpp:255]     Train net output #0: det_accuracy = 0.078125
I1113 15:51:24.800668 32170 solver.cpp:255]     Train net output #1: det_loss = 0.706399 (* 1 = 0.706399 loss)
I1113 15:51:24.800671 32170 solver.cpp:255]     Train net output #2: id_accuracy = 0
I1113 15:51:24.800676 32170 solver.cpp:255]     Train net output #3: id_loss = 9.26615 (* 1 = 9.26615 loss)
I1113 15:51:24.800681 32170 solver.cpp:255]     Train net output #4: loss_bbox = 1.04062e-05 (* 1 = 1.04062e-05 loss)
I1113 15:51:24.800685 32170 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.188907 (* 1 = 0.188907 loss)
I1113 15:51:24.800689 32170 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693245 (* 1 = 0.693245 loss)
I1113 15:51:24.800700 32170 solver.cpp:640] Iteration 0, lr = 0.001

andongchen commented on June 5, 2024

I have not modified the code! Do I need to modify it?

Cysu commented on June 5, 2024

No. That won't be necessary. Directly running the training script should be fine. Could you please provide a full training log (by uploading to BaiduYun / GoogleDrive / Dropbox) for me to have further analysis?

Also could you please evaluate our trained model by following the instructions in the README, to see if it works properly?

andongchen commented on June 5, 2024

Yes, I can evaluate with your trained model, and there is no error.
The training log is here: https://drive.google.com/file/d/0Bz7UoqmY26NkeWphcnZYckNKUU0/view?usp=sharing

Cysu commented on June 5, 2024

That's quite weird. Could you please

1. Remove this line of randomness
2. Run the training script with a specified random seed

experiments/scripts/train.sh 0 --set EXP_DIR resnet50 RNG_SEED 1

On my machine, this will lead to the same loss as follows for iteration 0

I0412 10:00:41.251739 29112 solver.cpp:240] Iteration 0, loss = 11.4016
I0412 10:00:41.251796 29112 solver.cpp:255]     Train net output #0: det_accuracy = 0.804688
I0412 10:00:41.251809 29112 solver.cpp:255]     Train net output #1: det_loss = 0.681872 (* 1 = 0.681872 loss)
I0412 10:00:41.251818 29112 solver.cpp:255]     Train net output #2: id_accuracy = 0
I0412 10:00:41.251827 29112 solver.cpp:255]     Train net output #3: id_loss = 9.40343 (* 1 = 9.40343 loss)
I0412 10:00:41.251835 29112 solver.cpp:255]     Train net output #4: loss_bbox = 0.522466 (* 1 = 0.522466 loss)
I0412 10:00:41.251844 29112 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.123584 (* 1 = 0.123584 loss)
I0412 10:00:41.251876 29112 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693231 (* 1 = 0.693231 loss)
I0412 10:00:41.251895 29112 solver.cpp:640] Iteration 0, lr = 0.001
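
The reason for fixing the seed can be sketched in a few lines of Python: when every random draw comes from a generator seeded with the same value, two runs produce identical samples, so their iteration 0 losses should match. This is only an illustration using the standard library; `sample_minibatch_indices` is a hypothetical helper and not part of person_search.

```python
import random


def sample_minibatch_indices(num_items, batch_size, seed=None):
    """Draw batch_size distinct indices from range(num_items).

    Hypothetical helper for illustration. With a fixed seed the
    draw is repeatable; with seed=None it differs between runs.
    """
    rng = random.Random(seed)
    return rng.sample(range(num_items), batch_size)


# Two "runs" with the same seed pick exactly the same items.
run_a = sample_minibatch_indices(1000, 4, seed=1)
run_b = sample_minibatch_indices(1000, 4, seed=1)
print(run_a == run_b)
```

If two machines still disagree at iteration 0 with the same seed, the difference most likely comes from something other than random sampling, such as the data or the library versions.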

andongchen commented on June 5, 2024

Sorry, when I first ran the training script without any modification, there was an error:
experiments/scripts/train.sh 0 --set EXP_DIR resnet50

Normalizing targets
done
Traceback (most recent call last):
  File "tools/train_net.py", line 130, in <module>
    max_iters=args.max_iters)
  File "/home/cy/PycharmProjects/person_search-master/tools/../lib/fast_rcnn/train.py", line 121, in train_net
    pretrained_model=pretrained_model)
  File "/home/cy/PycharmProjects/person_search-master/tools/../lib/fast_rcnn/train.py", line 50, in __init__
    pb2.text_format.Merge(f.read(), self.solver_param)
AttributeError: 'module' object has no attribute 'text_format'

I then googled it and solved it by adding import google.protobuf.text_format in /lib/fast_rcnn/train.py,
and after that I got the nan loss error.
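
For readers wondering why that extra import helps: in Python, importing a package does not necessarily load its submodules, so `package.submodule` can raise AttributeError until the submodule itself is imported. The sketch below reproduces this with a throwaway package built in a temporary directory. The names `demo_pkg` and `submodule_visible_before_and_after` are hypothetical, and the code only mimics the situation; it does not touch protobuf or Caffe.

```python
import importlib
import os
import shutil
import sys
import tempfile


def submodule_visible_before_and_after(pkg_name="demo_pkg", sub_name="text_format"):
    """Return (visible_before, visible_after) for pkg.sub as an attribute.

    Builds a throwaway package (hypothetical, for illustration only)
    whose __init__.py is empty, so the submodule is not loaded until
    it is imported explicitly.
    """
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, pkg_name)
    os.makedirs(pkg_dir)
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    with open(os.path.join(pkg_dir, sub_name + ".py"), "w") as f:
        f.write("def Merge(text, message):\n    return message\n")
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        pkg = importlib.import_module(pkg_name)
        before = hasattr(pkg, sub_name)  # submodule not loaded yet
        importlib.import_module(pkg_name + "." + sub_name)
        after = hasattr(pkg, sub_name)   # now bound on the package
    finally:
        # Clean up so the helper can be called again.
        sys.path.remove(root)
        sys.modules.pop(pkg_name + "." + sub_name, None)
        sys.modules.pop(pkg_name, None)
        shutil.rmtree(root, ignore_errors=True)
    return before, after


print(submodule_visible_before_and_after())
```

Whether the original code works without the explicit import therefore depends on whether some other module happened to import the submodule first, which can vary between protobuf versions and environments.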

Now I have done steps 1 and 2 as you said, and I still got the nan loss:

I0412 11:58:24.734537 15281 solver.cpp:240] Iteration 0, loss = 45.384
I0412 11:58:24.734563 15281 solver.cpp:255]     Train net output #0: det_accuracy = 0.03125
I0412 11:58:24.734571 15281 solver.cpp:255]     Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0412 11:58:24.734575 15281 solver.cpp:255]     Train net output #2: id_accuracy = -nan
I0412 11:58:24.734578 15281 solver.cpp:255]     Train net output #3: id_loss = 0 (* 1 = 0 loss)
I0412 11:58:24.734582 15281 solver.cpp:255]     Train net output #4: loss_bbox = 0.0592934 (* 1 = 0.0592934 loss)
I0412 11:58:24.734586 15281 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.00123454 (* 1 = 0.00123454 loss)
I0412 11:58:24.734591 15281 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0412 11:58:24.734596 15281 solver.cpp:640] Iteration 0, lr = 0.001
I0412 11:58:48.375877 15281 solver.cpp:240] Iteration 20, loss = nan
I0412 11:58:48.376101 15281 solver.cpp:255]     Train net output #0: det_accuracy = 0.929688
I0412 11:58:48.376142 15281 solver.cpp:255]     Train net output #1: det_loss = 0.620077 (* 1 = 0.620077 loss)
I0412 11:58:48.376157 15281 solver.cpp:255]     Train net output #2: id_accuracy = -nan
I0412 11:58:48.376165 15281 solver.cpp:255]     Train net output #3: id_loss = 0 (* 1 = 0 loss)
I0412 11:58:48.376170 15281 solver.cpp:255]     Train net output #4: loss_bbox = nan (* 1 = nan loss)
I0412 11:58:48.376176 15281 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.185238 (* 1 = 0.185238 loss)
I0412 11:58:48.376183 15281 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.680902 (* 1 = 0.680902 loss)
I0412 11:58:48.376190 15281 solver.cpp:640] Iteration 20, lr = 0.001
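
When going through a long training log, it can help to find automatically the first iteration at which a loss output becomes nan. The following is a minimal sketch that assumes the log lines look like the ones quoted in this thread. The regular expressions and the function name `find_nan_losses` are assumptions made for illustration. Outputs whose name does not contain "loss" (such as id_accuracy) are skipped.

```python
import math
import re

# Patterns assumed from the log excerpts quoted in this thread.
ITER_RE = re.compile(r"Iteration (\d+), loss = (\S+)")
OUTPUT_RE = re.compile(r"Train net output #\d+: (\w+) = (\S+)")


def find_nan_losses(log_text):
    """Return a list of (iteration, output_name) for nan loss outputs."""
    hits = []
    current_iter = None
    for line in log_text.splitlines():
        m = ITER_RE.search(line)
        if m:
            current_iter = int(m.group(1))
            continue
        m = OUTPUT_RE.search(line)
        if m and "loss" in m.group(1):
            # float() accepts "nan" and "-nan" as well as numbers.
            if math.isnan(float(m.group(2))):
                hits.append((current_iter, m.group(1)))
    return hits


sample_log = """\
I0412 11:58:24.734537 15281 solver.cpp:240] Iteration 0, loss = 45.384
I0412 11:58:24.734575 15281 solver.cpp:255]     Train net output #2: id_accuracy = -nan
I0412 11:58:24.734582 15281 solver.cpp:255]     Train net output #4: loss_bbox = 0.0592934 (* 1 = 0.0592934 loss)
I0412 11:58:48.375877 15281 solver.cpp:240] Iteration 20, loss = nan
I0412 11:58:48.376170 15281 solver.cpp:255]     Train net output #4: loss_bbox = nan (* 1 = nan loss)
"""
print(find_nan_losses(sample_log))
```

Since Caffe prints these lines only every few iterations, the reported iteration is an upper bound on when the loss first diverged.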

andongchen commented on June 5, 2024

@Cysu First, thank you very much for your great work. This is not an issue, but I have a question: have you tried YOLO9000 for pedestrian detection? YOLO v2 is faster and more accurate than Faster R-CNN for object detection. In your current work, does the detection accuracy influence person_search's mAP?

Cysu commented on June 5, 2024

Thank you very much for the suggestion. I really appreciate recent advances in object detection, e.g., YOLO v2, FPN, etc., and would like to give it a try if I have some time in the future. But currently I may not have enough spare time for it, and YOLO v2 seems to be implemented only in darknet, which is not that popular, compared with caffe / tf / pytorch.

By the way, do you still suffer from nan loss? If not, how did you solve it?

andongchen commented on June 5, 2024

There is now a TensorFlow version of YOLO: https://github.com/thtrieu/darkflow
I still suffer from the nan loss. I think it is caused by my machine environment, but I am not sure.

Cysu commented on June 5, 2024

Thank you very much for the link. I will check it out.

The nan problem is quite weird. Sorry, but currently I have no idea why it happens.

duanLH commented on June 5, 2024

@andongchen @Cysu When training, I got "id_accuracy = -nan". Is that normal?

Cysu commented on June 5, 2024

@duanLH, id_accuracy = -nan is possible, because there are cases where the proposals do not contain any ground truth person, especially at the beginning of training.
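
As a rough illustration of how this can happen, suppose the accuracy is computed only over proposals that carry an identity label, and all other proposals are marked with an ignore label. If a batch contains no labeled proposal, the accuracy becomes 0 / 0, which is nan in floating point arithmetic. The sketch below assumes an ignore label of -1; the function name and that value are hypothetical and are not taken from the person_search code.

```python
import math


def id_accuracy(predictions, labels, ignore_label=-1):
    """Accuracy over samples whose label is not ignore_label.

    Hypothetical sketch: returns nan when every sample is ignored,
    mimicking a floating point 0 / 0.
    """
    correct = 0
    count = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_label:
            continue  # proposal has no ground truth identity
        count += 1
        correct += int(pred == label)
    # Python raises on 0 / 0, so return nan explicitly.
    return correct / count if count else float("nan")


# No proposal matches a ground truth person: accuracy is nan.
print(math.isnan(id_accuracy([3, 7], [-1, -1])))
# One of two labeled proposals is classified correctly.
print(id_accuracy([1, 2, 5], [1, 3, -1]))
```

Under the same assumption, a loss averaged over zero labeled samples can be reported as 0, which would be consistent with id_loss = 0 appearing next to id_accuracy = -nan in the log.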
