Comments (28)
With the new GitHub updates I've been able to train on a GPU (the load on the GPU increases), but I keep running into this message, so I can't see my loss, epochs, or progress.
W0325 03:20:53.415277 140300720293760 meta_graph.py:436] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
Anyone else have the same issue?
from automl.
This warning doesn't matter. Out of laziness, I was using the old way of adding summaries to tf.collections, which causes this warning. I have just submitted a simple change to avoid it.
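The warning is easy to reproduce in miniature: the collection serializer expects each stored item to expose a `.name` attribute, and a plain tuple has none. A minimal Python illustration with hypothetical helper names (not the actual TensorFlow internals):

```python
# Hypothetical sketch of why serializing a collection that holds tuples
# triggers "'tuple' object has no attribute 'name'". NamedSummary and
# serialize_collection are illustrative stand-ins, not TensorFlow APIs.

class NamedSummary:
    def __init__(self, name):
        self.name = name

def serialize_collection(items):
    """Collect the .name of each item; record a failure message for items
    (like plain tuples) that have no .name attribute."""
    names, failures = [], []
    for item in items:
        try:
            names.append(item.name)
        except AttributeError as exc:
            failures.append(str(exc))
    return names, failures

# A summary object serializes fine; a raw (tag, value) tuple does not.
names, failures = serialize_collection([NamedSummary("loss"), ("loss", 0.5)])
```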
from automl.
Just wondering, have you been able to train for epochs or get the training process started?
from automl.
Yes, training has started, and the following information appears
I0324 11:48:15.718143 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.59419
INFO:tensorflow:examples/sec: 6.37677
I0324 11:48:15.718615 140160666445632 tpu_estimator.py:2160] examples/sec: 6.37677
INFO:tensorflow:global_step/sec: 1.60237
I0324 11:48:16.342203 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.60237
INFO:tensorflow:examples/sec: 6.40949
I0324 11:48:16.342702 140160666445632 tpu_estimator.py:2160] examples/sec: 6.40949
INFO:tensorflow:global_step/sec: 1.58549
I0324 11:48:16.972882 140160666445632 tpu_estimator.py:2159] global_step/sec: 1.58549
INFO:tensorflow:examples/sec: 6.34196
But a few minutes later, the loss became NaN.
from automl.
@silence0628 My training is really slow on a 1080 Ti. What GPU are you using?
INFO:tensorflow:global_step/sec: 0.00882378
I0324 15:34:16.805580 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.00882378
INFO:tensorflow:examples/sec: 0.564722
I0324 15:34:16.805948 140436817930048 tpu_estimator.py:2308] examples/sec: 0.564722
INFO:tensorflow:global_step/sec: 0.0100131
I0324 15:35:56.674830 140436817930048 tpu_estimator.py:2307] global_step/sec: 0.0100131
INFO:tensorflow:examples/sec: 0.640839
I0324 15:35:56.675580 140436817930048 tpu_estimator.py:2308] examples/sec: 0.640839
INFO:tensorflow:global_step/sec: 0.00949146
from automl.
I ran into the same problem as @silence0628 after 1200 steps on my own training dataset. The loss in the summary file looks alright and was decreasing before the crash. I'm running the code on my local GPU. I noticed that the learning rate is scheduled using the following parameters. Is it possible that the learning rate is too big?
optimization
h.momentum = 0.9
h.learning_rate = 0.08
h.lr_warmup_init = 0.008
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
h.clip_gradients_norm = 10.0
h.num_epochs = 300
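Read as a linear warmup followed by stepwise decay (the usual interpretation of these names), the schedule ramps from lr_warmup_init to learning_rate over the first epoch and then drops by 10x at each drop epoch. A rough Python sketch under that reading, not the repo's exact implementation:

```python
def learning_rate(epoch, base_lr=0.08, warmup_init=0.008, warmup_epochs=1.0,
                  first_drop=200.0, second_drop=250.0):
    """Approximate linear-warmup + stepwise-decay schedule matching the
    hparams above (an illustration, not the repo's actual code)."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_init up to base_lr during warmup.
        return warmup_init + (base_lr - warmup_init) * epoch / warmup_epochs
    if epoch < first_drop:
        return base_lr          # full learning rate: 0.08
    if epoch < second_drop:
        return base_lr * 0.1    # first drop: 0.008
    return base_lr * 0.01       # second drop: 0.0008
```

A base rate of 0.08 is tuned for large TPU batches, so on a single GPU with a much smaller batch it is plausibly too big, which would be consistent with a lower rate delaying the NaN.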
from automl.
Are you really training on your GPU rather than the CPU?
I also ran python main.py --training_file_pattern=/home/hhh/Data/YOLO/VOCdevkit/VOC2007/tfrecords_/voc_train* --model_dir=/tmp/efficientnet/ --hparams="use_bfloat16=false" --use_tpu=False
But it seems it will use the CPU instead of the GPU if we set use_tpu to False.
from automl.
@Byronnar I think so. But as the tf doc says:
TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.
And the training only takes up 147M of memory on each GPU. It's really strange.
from automl.
Thanks for the reply. This is indeed strange. It turns out it only used 147M; I thought the GPU wasn't being used at all. I'll keep looking into it.
from automl.
@CraigWang1 Oh sorry, I disabled the log info in main.py. You can either add "--logtostderr" or remove the disable line.
from automl.
@tabsun My GPU is an RTX 2080 Ti; it should be faster. Have you finished your training?
from automl.
@silence0628 No, it's still training. After I switched to training on a single GPU as in the new README, the speed is normal.
from automl.
@silence0628 Hi, have you solved this problem? Same error here.
from automl.
@bitwangdan It's OK now; it can be trained normally. But there's a problem like @CraigWang1's: with the new command the author @mingxingtan mentioned, the loss is not shown, as follows:
W0326 08:32:29.445561 139727162947392 meta_graph.py:449] Issue encountered when serializing edsummaries.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
I don't know the reason.
from automl.
@ancorasir Hi, have you solved this problem? NaN loss during training
from automl.
@silence0628 After changing
logging.set_verbosity(logging.WARNING)
to logging.set_verbosity(logging.INFO)
in main.py, I was able to see the steps. However, I can't see the loss.
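The effect of that change can be mimicked with the standard library's logging module: at WARNING verbosity, INFO records such as the per-step lines are filtered out entirely. A self-contained sketch, not the absl.logging code main.py actually uses:

```python
import io
import logging

def capture(level):
    """Log one INFO line and one WARNING line at the given verbosity and
    return what actually reached the handler."""
    stream = io.StringIO()
    logger = logging.getLogger(f"verbosity-demo-{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep output confined to our handler
    logger.addHandler(logging.StreamHandler(stream))
    logger.info("global_step/sec: 1.59419")               # dropped at WARNING
    logger.warning("Issue encountered when serializing")  # shown at both levels
    return stream.getvalue()
```

Calling capture(logging.WARNING) returns only the warning line, while capture(logging.INFO) includes the step line too.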
from automl.
Why not use TensorBoard to check the loss?
from automl.
My CUDA version is 10.1.243. I can run inference with tensorflow-gpu 1.15.0 + efficientdet-d0, but inference with tensorflow-gpu 2.1.0 has problems. May I ask which CUDA version and tensorflow-gpu version this code supports for GPU?
from automl.
@dx9527 @bitwangdan Not yet. I tried lowering the learning rate, and I was able to train efficientdet-1 on my own dataset for 66000 iteration steps before the NaN loss appeared. I'm still working on finding a good learning rate. Any suggestions? @mingxingtan
python main.py --training_file_pattern=./tfrecords/*.tfrecord
--model_dir=./output
--hparams="use_bfloat16=false,num_classes=104,skip_crowd_during_training = False"
--use_tpu=False
--backbone_ckpt=./efficientnet-b1
--train_batch_size=8
--num_examples_per_epoch=82990
--num_epochs=15
optimization
h.momentum = 0.9
h.learning_rate = 0.001
h.lr_warmup_init = 0.0001
h.lr_warmup_epoch = 1.0
h.first_lr_drop_epoch = 200.0
h.second_lr_drop_epoch = 250.0
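Since exploding gradients are a common cause of NaN loss, the clip_gradients_norm hparam (10.0 above) is also worth checking alongside the learning rate. A pure-Python sketch of global-norm clipping, assuming the standard definition:

```python
import math

def clip_by_global_norm(grads, clip_norm=10.0):
    """If the combined L2 norm of all gradients exceeds clip_norm, scale
    every gradient down so the global norm equals clip_norm; otherwise
    return them unchanged. (Illustrative sketch of the standard technique,
    not the repo's implementation.)"""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= clip_norm:
        return list(grads)
    scale = clip_norm / global_norm
    return [g * scale for g in grads]
```

Tightening clip_norm bounds the size of each update but does not remove the underlying instability, so it complements rather than replaces a lower learning rate.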
from automl.
I got the same warning.
from automl.
me too
from automl.
I want to know if the GPU training code is the same as before. I want to reproduce the results in the paper. The eval results on COCO 2017 val obtained from the provided model are consistent, but the evaluation results of the model I trained myself are very poor, and I don't know the reason.
from automl.
I did some research myself on how to train on your own data; please see:
https://github.com/shenxiaofei715/efficientdet.git
from automl.
Hi @silence0628 can you tell me how you solved the NaN loss issue? Thank you!
from automl.
@silence0628 No, It's still in training. After I used single GPU to train as the new README, the speed is normal.
Can I see your config file? My training speed is only examples/sec: 4.65129 on a single TiTan XP (with 11651M of GPU memory in use). How do you solve this problem?
from automl.
It was almost 5 months ago that I trained my model using this repo. As I remember, after I switched to a SINGLE GPU the speed was normal, but the mAP was not satisfying.
from automl.