
GPU train (google/automl) · 28 comments · CLOSED

google commented on August 24, 2024
GPU train


Comments (28)

CraigWang1 commented on August 24, 2024

With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I keep running into this message, so I can't see my loss, epochs, or progress.

W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'

Anyone else have the same issue?


mingxingtan commented on August 24, 2024

This warning doesn't matter. Out of laziness, I was using the old way of adding summaries to tf.collections, which causes this warning. I have just submitted a simple change to avoid it.
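For context, here is a minimal, made-up Python sketch (not the actual TensorFlow code; `NamedTensor` and `serialize_collection` are invented for illustration) of why a collection holding bare `(tag, tensor)` tuples trips the serializer, which expects items with a `.name` attribute:

```python
class NamedTensor:
    """Stand-in for a graph object that carries a .name, as real summary ops do."""
    def __init__(self, name):
        self.name = name

def serialize_collection(items):
    # meta_graph-style serialization reads each item's .name;
    # an old-style ('tag', tensor) tuple has no such attribute.
    return [item.name for item in items]

print(serialize_collection([NamedTensor("loss_summary")]))  # works

try:
    serialize_collection([("loss", 0.5)])  # old-style collection entry
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'name'
```

This matches the exact message in the warning above, which is why it is safe to ignore: serialization of that one collection is skipped, but training itself is unaffected.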


CraigWang1 commented on August 24, 2024

Just wondering, have you been able to train for epochs or get the training process started?


silence0628 commented on August 24, 2024

Yes, training has started, and the following information appears:

I0324 11:48:15.718143 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.59419
INFO:tensorflow:examples/sec: 6.37677
I0324 11:48:15.718615 140160666445632 tpu_estimator.py:2160] examples/sec: 6.37677
INFO:tensorflow:global_step/sec: 1.60237
I0324 11:48:16.342203 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.60237
INFO:tensorflow:examples/sec: 6.40949
I0324 11:48:16.342702 140160666445632 tpu_estimator.py:2160] examples/sec: 6.40949
INFO:tensorflow:global_step/sec: 1.58549
I0324 11:48:16.972882 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.58549
INFO:tensorflow:examples/sec: 6.34196

But a few minutes later, the loss became NaN.
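As a sanity check on those numbers: examples/sec divided by global_step/sec gives the effective per-step batch size, which here works out to 4:

```python
# Numbers taken from the log above.
global_step_per_sec = 1.59419
examples_per_sec = 6.37677

batch_size = examples_per_sec / global_step_per_sec
print(round(batch_size, 2))  # 4.0, i.e. train_batch_size=4
```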


tabsun commented on August 24, 2024

@silence0628 My training is really slow on a 1080 Ti. What GPU are you using?

INFO:tensorflow:global_step/sec: 0.00882378
I0324 15:34:16.805580 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.00882378
INFO:tensorflow:examples/sec: 0.564722
I0324 15:34:16.805948 140436817930048 tpu_estimator.py:2308] examples/sec: 0.564722
INFO:tensorflow:global_step/sec: 0.0100131
I0324 15:35:56.674830 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.0100131
INFO:tensorflow:examples/sec: 0.640839
I0324 15:35:56.675580 140436817930048 tpu_estimator.py:2308] examples/sec: 0.640839
INFO:tensorflow:global_step/sec: 0.00949146


ancorasir commented on August 24, 2024

I hit the same problem as @silence0628 after 1200 steps on my own training dataset. The loss in the summary file looks fine and is decreasing before the crash. I'm running the code on my local GPU. I noticed that the learning rate is scheduled using the following parameters. Could the learning rate be too large?

```python
# optimization
h.momentum = 0.9
h.learning_rate = 0.08
h.lr_warmup_init = 0.008
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
h.clip_gradients_norm = 10.0
h.num_epochs = 300
```
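A sketch of how these hyperparameters typically combine into a schedule (linear warmup from `lr_warmup_init`, then stepwise 10x drops at the two drop epochs; the repo's actual implementation may differ in details):

```python
def learning_rate(epoch,
                  lr=0.08, warmup_init=0.008, warmup_epoch=1.0,
                  first_drop=200.0, second_drop=250.0):
    """Linear warmup from warmup_init to lr, then 10x drops at the two epochs."""
    if epoch < warmup_epoch:
        return warmup_init + (lr - warmup_init) * epoch / warmup_epoch
    if epoch < first_drop:
        return lr
    if epoch < second_drop:
        return lr * 0.1   # first 10x drop
    return lr * 0.01      # second 10x drop

print(learning_rate(0.0))    # warmup start: 0.008
print(learning_rate(100.0))  # full rate: 0.08
print(learning_rate(225.0))  # after first drop: 0.08 * 0.1
```

Note the peak rate of 0.08 is tuned for the large TPU batch size; on a single GPU with a much smaller batch, a proportionally smaller rate is a common first thing to try against NaN losses.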


Byronnar commented on August 24, 2024

Are you really training with your GPU rather than the CPU?
I also ran `python main.py --training_file_pattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords_/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False`, but it seems it will use the CPU instead of the GPU if we set use_tpu=False.


tabsun commented on August 24, 2024

@Byronnar I think so. But as the tf doc says: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."

And the training only takes up 147M of memory on each GPU. It's really strange.


Byronnar commented on August 24, 2024

> @Byronnar I think so. But as the tf doc says: "TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator."
> And the training only takes up 147M of memory on each GPU. It's really strange.

Thanks for the reply. This is indeed strange. So it actually used only 147M; I thought the GPU wasn't being used at all. I'll keep looking into it.


mingxingtan commented on August 24, 2024

@CraigWang1 Oh sorry, I disabled the info logs in main.py. You can either add `--logtostderr` or remove the disabling line.


silence0628 commented on August 24, 2024

@tabsun My GPU is an RTX 2080 Ti; it should be faster. Have you finished your training?


tabsun commented on August 24, 2024

@silence0628 No, it's still training. After I switched to single-GPU training as described in the new README, the speed is normal.


bitwangdan commented on August 24, 2024

@silence0628 Hi, have you solved this problem? I get the same error.


silence0628 commented on August 24, 2024

@bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan mentioned, the loss is not shown, as follows:
W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
I don't know the reason.


bitwangdan commented on August 24, 2024

@ancorasir Hi, have you solved this problem? NaN loss during training


dx111 commented on August 24, 2024

> @ancorasir Hi, have you solved this problem? NaN loss during training

Hi, have you solved this problem? NaN loss during training.


CraigWang1 commented on August 24, 2024

@silence0628 After changing `logging.set_verbosity(logging.WARNING)` to `logging.set_verbosity(logging.INFO)` in main.py, I was able to see the steps. However, I still can't see the loss.


tabsun commented on August 24, 2024

Why not use TensorBoard to check the loss?
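For reference, pointing TensorBoard at the training `--model_dir` (the path below is the one from the command earlier in this thread; substitute your own) shows the loss curves even when console logging is suppressed:

```shell
tensorboard --logdir=/tmp/efficientnet/ --port=6006
```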


goldwater668 commented on August 24, 2024

My CUDA version is 10.1.243. I can run inference with tensorflow-gpu 1.15.0 + efficientdet-d0, but inference breaks under tensorflow-gpu 2.1.0. May I ask which CUDA and tensorflow-gpu versions this code supports for GPU?


ancorasir commented on August 24, 2024

@dx9527 @bitwangdan Not yet. I tried lowering the learning rate and was able to train efficientdet-d1 on my own dataset for 66000 iteration steps before the NaN loss appeared. I'm still working on finding a good learning rate. Any suggestions? @mingxingtan

```shell
python main.py --training_file_pattern=./tfrecords/*.tfrecord \
  --model_dir=./output \
  --hparams="use_bfloat16=false,num_classes=104,skip_crowd_during_training=False" \
  --use_tpu=False \
  --backbone_ckpt=./efficientnet-b1 \
  --train_batch_size=8 \
  --num_examples_per_epoch=82990 \
  --num_epochs=15
```

```python
# optimization
h.momentum = 0.9
h.learning_rate = 0.001
h.lr_warmup_init = 0.0001
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
```
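A back-of-the-envelope check on when the NaN hit, using the numbers from the command above: 66000 steps at batch size 8 is roughly 6.4 of the 15 epochs.

```python
# Numbers from the training command above.
steps_before_nan = 66000
train_batch_size = 8
num_examples_per_epoch = 82990

epochs_completed = steps_before_nan * train_batch_size / num_examples_per_epoch
print(round(epochs_completed, 2))  # 6.36 epochs into the 15-epoch run
```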

[Screenshot from 2020-03-27 09-47-34]


mad-fogs commented on August 24, 2024

> With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I've been running into this message, so I can't see my loss, epochs or progress.
> W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
> Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
> 'tuple' object has no attribute 'name'

I have got the same warning.


sunzhe09 commented on August 24, 2024

me too


Li505358678 commented on August 24, 2024

I want to know whether the GPU training code is the same as before, since I want to reproduce the results in the paper. Evaluating the provided model on COCO 2017 val matches the reported results, but the model I trained myself evaluates very poorly, and I don't know the reason.


shenxiaofei715 commented on August 24, 2024

I worked out how to train on your own data myself; please see:
https://github.com/shenxiaofei715/efficientdet.git


elv-xuwen commented on August 24, 2024

> @bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan mentioned, the loss is not shown:
> W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
> Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
> 'tuple' object has no attribute 'name'

Hi @silence0628, can you tell me how you solved the NaN loss issue? Thank you!


C-SJK commented on August 24, 2024

> @silence0628 No, it's still training. After I used a single GPU to train as in the new README, the speed is normal.

Can I see your config file? My training speed on a single Titan XP is only examples/sec: 4.65129 (with 11651M of GPU memory in use). How did you solve this problem?


C-SJK commented on August 24, 2024

@tabsun


tabsun commented on August 24, 2024

> My training speed on a single Titan XP is only examples/sec: 4.65129 (with 11651M of GPU memory in use). How did you solve this problem?

It's almost 5 months since I trained my model using this repo. After I used a SINGLE GPU to train, the speed was normal, as I remember. But the mAP was not satisfying.

