
GPU train (google/automl) · 28 comments · CLOSED

google commented on August 24, 2024
GPU train


Comments (28)

CraigWang1 commented on August 24, 2024

With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I keep running into this message, so I can't see my loss, epochs, or progress.

W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'

Anyone else have the same issue?


mingxingtan commented on August 24, 2024

This warning doesn't matter. Out of laziness, I was using the old way of adding summaries to tf.collections, which causes this warning. I have just submitted a simple change to avoid it.
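For context, here is a minimal, made-up Python sketch (not the actual TensorFlow code; `NamedTensor` and `serialize_collection` are invented for illustration) of why a collection holding bare `(tag, tensor)` tuples trips the serializer, which expects items with a `.name` attribute:

```python
class NamedTensor:
    """Stand-in for a graph object that carries a .name, as real summary ops do."""
    def __init__(self, name):
        self.name = name

def serialize_collection(items):
    # meta_graph-style serialization reads each item's .name;
    # an old-style ('tag', tensor) tuple has no such attribute.
    return [item.name for item in items]

print(serialize_collection([NamedTensor("loss_summary")]))  # works

try:
    serialize_collection([("loss", 0.5)])  # old-style collection entry
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'name'
```

This matches the exact message in the warning above, which is why it is safe to ignore: serialization of that one collection is skipped, but training itself is unaffected.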


CraigWang1 commented on August 24, 2024

Just wondering, have you been able to train for epochs or get the training process started?


silence0628 commented on August 24, 2024

Yes, training has started, and the following information appears:

I0324 11:48:15.718143 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.59419
INFO:tensorflow:examples/sec: 6.37677
I0324 11:48:15.718615 140160666445632 tpu_estimator.py:2160] examples/sec: 6.37677
INFO:tensorflow:global_step/sec: 1.60237
I0324 11:48:16.342203 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.60237
INFO:tensorflow:examples/sec: 6.40949
I0324 11:48:16.342702 140160666445632 tpu_estimator.py:2160] examples/sec: 6.40949
INFO:tensorflow:global_step/sec: 1.58549
I0324 11:48:16.972882 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.58549
INFO:tensorflow:examples/sec: 6.34196

But a few minutes later, the loss became NaN.
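As a sanity check on those numbers: examples/sec divided by global_step/sec gives the effective per-step batch size, which here works out to 4:

```python
# Numbers taken from the log above.
global_step_per_sec = 1.59419
examples_per_sec = 6.37677

batch_size = examples_per_sec / global_step_per_sec
print(round(batch_size, 2))  # 4.0, i.e. train_batch_size=4
```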


tabsun commented on August 24, 2024

@silence0628 My training is really slow on a 1080 Ti. What GPU are you using?

INFO:tensorflow:global_step/sec: 0.00882378
I0324 15:34:16.805580 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.00882378
INFO:tensorflow:examples/sec: 0.564722
I0324 15:34:16.805948 140436817930048 tpu_estimator.py:2308] examples/sec: 0.564722
INFO:tensorflow:global_step/sec: 0.0100131
I0324 15:35:56.674830 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.0100131
INFO:tensorflow:examples/sec: 0.640839
I0324 15:35:56.675580 140436817930048 tpu_estimator.py:2308] examples/sec: 0.640839
INFO:tensorflow:global_step/sec: 0.00949146


ancorasir commented on August 24, 2024

I hit the same problem as @silence0628 after 1200 steps on my own training dataset. The loss in the summary file looks fine and is decreasing before the crash. I'm running the code on my local GPU. I noticed that the learning rate is scheduled using the following parameters. Could the learning rate be too large?

```python
# optimization
h.momentum = 0.9
h.learning_rate = 0.08
h.lr_warmup_init = 0.008
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
h.clip_gradients_norm = 10.0
h.num_epochs = 300
```
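A sketch of how these hyperparameters typically combine into a schedule (linear warmup from `lr_warmup_init`, then stepwise 10x drops at the two drop epochs; the repo's actual implementation may differ in details):

```python
def learning_rate(epoch,
                  lr=0.08, warmup_init=0.008, warmup_epoch=1.0,
                  first_drop=200.0, second_drop=250.0):
    """Linear warmup from warmup_init to lr, then 10x drops at the two epochs."""
    if epoch < warmup_epoch:
        return warmup_init + (lr - warmup_init) * epoch / warmup_epoch
    if epoch < first_drop:
        return lr
    if epoch < second_drop:
        return lr * 0.1   # first 10x drop
    return lr * 0.01      # second 10x drop

print(learning_rate(0.0))    # warmup start: 0.008
print(learning_rate(100.0))  # full rate: 0.08
print(learning_rate(225.0))  # after first drop: 0.08 * 0.1
```

Note the peak rate of 0.08 is tuned for the large TPU batch size; on a single GPU with a much smaller batch, a proportionally smaller rate is a common first thing to try against NaN losses.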


Byronnar commented on August 24, 2024

Are you really training with your GPU rather than the CPU?
I also ran `python main.py --training_file_pattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords_/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False`, but it seems it will use the CPU instead of the GPU if we set use_tpu=False.


tabsun commented on August 24, 2024

@Byronnar I think so. But as the tf doc says: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."

And the training only takes up 147M of memory on each GPU. It's really strange.


Byronnar commented on August 24, 2024

> @Byronnar I think so. But as the tf doc says: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."
> And the training only takes up 147M of memory on each GPU. It's really strange.

Thanks for the reply. This is indeed strange. So it actually used only 147M; I thought the GPU wasn't being used at all. I'll keep looking into it.


mingxingtan commented on August 24, 2024

@CraigWang1 Oh sorry, I disabled the info logs in main.py. You can either add `--logtostderr` or remove the disabling line.


silence0628 commented on August 24, 2024

@tabsun My GPU is an RTX 2080 Ti; it should be faster. Have you finished your training?


tabsun commented on August 24, 2024

@silence0628 No, it's still training. After I switched to single-GPU training as described in the new README, the speed is normal.


bitwangdan commented on August 24, 2024

@silence0628 Hi, have you solved this problem? I get the same error.


silence0628 commented on August 24, 2024

@bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan mentioned, the loss is not shown, as follows:
W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
I don't know the reason.


bitwangdan commented on August 24, 2024

@ancorasir Hi, have you solved this problem? NaN loss during training


dx111 commented on August 24, 2024

> @ancorasir Hi, have you solved this problem? NaN loss during training

Hi, have you solved this problem? NaN loss during training.


CraigWang1 commented on August 24, 2024

@silence0628 After changing `logging.set_verbosity(logging.WARNING)` to `logging.set_verbosity(logging.INFO)` in main.py, I was able to see the steps. However, I still can't see the loss.


tabsun commented on August 24, 2024

Why not use TensorBoard to check the loss?
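For reference, pointing TensorBoard at the training `--model_dir` (the path below is the one from the command earlier in this thread; substitute your own) shows the loss curves even when console logging is suppressed:

```shell
tensorboard --logdir=/tmp/efficientnet/ --port=6006
```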


goldwater668 commented on August 24, 2024

My CUDA version is 10.1.243. I can run inference with tensorflow-gpu 1.15.0 + efficientdet-d0, but inference breaks under tensorflow-gpu 2.1.0. May I ask which CUDA and tensorflow-gpu versions this code supports for GPU?


ancorasir commented on August 24, 2024

@dx9527 @bitwangdan Not yet. I tried lowering the learning rate and was able to train efficientdet-d1 on my own dataset for 66000 iteration steps before the NaN loss appeared. I'm still working on finding a good learning rate. Any suggestions? @mingxingtan

```shell
python main.py --training_file_pattern=./tfrecords/*.tfrecord \
  --model_dir=./output \
  --hparams="use_bfloat16=false,num_classes=104,skip_crowd_during_training=False" \
  --use_tpu=False \
  --backbone_ckpt=./efficientnet-b1 \
  --train_batch_size=8 \
  --num_examples_per_epoch=82990 \
  --num_epochs=15
```

```python
# optimization
h.momentum = 0.9
h.learning_rate = 0.001
h.lr_warmup_init = 0.0001
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
```
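A back-of-the-envelope check on when the NaN hit, using the numbers from the command above: 66000 steps at batch size 8 is roughly 6.4 of the 15 epochs.

```python
# Numbers from the training command above.
steps_before_nan = 66000
train_batch_size = 8
num_examples_per_epoch = 82990

epochs_completed = steps_before_nan * train_batch_size / num_examples_per_epoch
print(round(epochs_completed, 2))  # 6.36 epochs into the 15-epoch run
```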

[Screenshot from 2020-03-27 09-47-34]


mad-fogs commented on August 24, 2024

> With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I've been running into this message, so I can't see my loss, epochs or progress.
> W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
> Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
> 'tuple' object has no attribute 'name'

I have got the same warning.


sunzhe09 commented on August 24, 2024

me too


Li505358678 commented on August 24, 2024

I want to know whether the GPU training code is the same as before, since I want to reproduce the results in the paper. Evaluating the provided model on COCO 2017 val matches the reported results, but the model I trained myself evaluates very poorly, and I don't know the reason.


shenxiaofei715 commented on August 24, 2024

I worked out how to train on your own data myself; please see:
https://github.com/shenxiaofei715/efficientdet.git


elv-xuwen commented on August 24, 2024

> @bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan mentioned, the loss is not shown:
> W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
> Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
> 'tuple' object has no attribute 'name'

Hi @silence0628, can you tell me how you solved the NaN loss issue? Thank you!


C-SJK commented on August 24, 2024

> @silence0628 No, it's still training. After I used a single GPU to train as in the new README, the speed is normal.

Can I see your config file? My training speed on a single Titan XP is only examples/sec: 4.65129 (with 11651M of GPU memory in use). How did you solve this problem?


C-SJK commented on August 24, 2024

@tabsun


tabsun commented on August 24, 2024

> My training speed on a single Titan XP is only examples/sec: 4.65129 (with 11651M of GPU memory in use). How did you solve this problem?

It's almost 5 months since I trained my model using this repo. After I used a SINGLE GPU to train, the speed was normal, as I remember. But the mAP was not satisfying.

