hikapok / ssd.tensorflow

State-of-the-art Single Shot MultiBox Detector in Pure TensorFlow, QQ Group: 758790869

License: Apache License 2.0

Python 100.00%
single-shot-multibox-detector ssd tensorflow yolo faster-rcnn objectdetection ssd-tensorflow

ssd.tensorflow's Introduction

State-of-the-art Single Shot MultiBox Detector in TensorFlow

This repository contains a reimplementation of SSD: Single Shot MultiBox Detector in TensorFlow. If your goal is to reproduce the results in the original paper, please use the official code.

There are already several TensorFlow-based SSD reimplementations on GitHub; the main distinguishing features of this repo include:

  • state-of-the-art performance (77.8% mAP) when training from a pre-trained VGG-16 model (SSD300-VGG16).
  • the model is trained using the high-level TensorFlow API tf.estimator. Although TensorFlow provides many APIs, the Estimator API is highly recommended for building scalable, high-performance models.
  • all code is written with pure TensorFlow ops (no NumPy operations) to ensure performance and portability.
  • uses the SSD augmentation pipeline described in the original paper.
  • PyTorch-like model definition using the high-level tf.layers API for better readability ^-^.
  • a high degree of modularity to ease further development.
  • using replicate_model_fn makes it flexible to use one or more GPUs.

New Update (77.9% mAP): using absolute bbox coordinates instead of normalized coordinates; check it out here.

Usage

  • Download Pascal VOC Dataset and reorganize the directory as follows:

     VOCROOT/
     	   |->VOC2007/
     	   |    |->Annotations/
     	   |    |->ImageSets/
     	   |    |->...
     	   |->VOC2012/
     	   |    |->Annotations/
     	   |    |->ImageSets/
     	   |    |->...
     	   |->VOC2007TEST/
     	   |    |->Annotations/
     	   |    |->...
    

    VOCROOT is the path to your Pascal VOC dataset.

  • Run the following script to generate TFRecords.

     python dataset/convert_tfrecords.py --dataset_directory=VOCROOT --output_directory=./dataset/tfrecords
  • Download the pre-trained VGG-16 model (reduced-fc) from here and put the files into a sub-directory named 'model' (SaverDef.V2 is supported by default; the V1 version is also available for the sake of compatibility).

  • Run the following script to start training:

     python train_ssd.py
  • Run the following script for evaluation and get mAP:

     python eval_ssd.py
     python voc_eval.py

    Note: you first need to modify some directory paths in voc_eval.py.

  • Run the following script for visualization:

     python simple_ssd_demo.py

All the code was tested under TensorFlow 1.6, Python 3.5, and Ubuntu 16.04 with CUDA 8.0. If you want to run training yourself, one decent GPU is highly recommended. The whole training process for the VOC07+12 dataset took ~120k steps in total, and each step (32 samples per batch) took ~1s on my little workstation with a single GTX 1080 Ti GPU. If you need to run training without enough GPU memory, you can try half of the current batch size (e.g. 16), lower the learning rate, and run more steps, watching TensorBoard until convergence. BTW, the code here has also been tested under TensorFlow 1.4 with CUDA 8.0, but some modifications are needed to enable replicated model training. Take the following steps if needed:

  • copy all the code of this file to a local file named 'tf_replicate_model_fn.py'
  • add one more line here to import the module 'tf_replicate_model_fn'
  • change 'tf.contrib.estimator' in here and here to 'tf_replicate_model_fn'
  • now the training process should run perfectly
  • before you run 'eval_ssd.py', you should also remove this line for interface compatibility

This repo was created only recently; any contribution is welcome.

Results (VOC07 Metric)

This implementation (SSD300-VGG16) yields 77.8% mAP on the PASCAL VOC 2007 test dataset (the original performance described in the paper is 77.2% mAP). The details are as follows:

sofa | bird | pottedplant | bus  | diningtable | cow  | bottle | horse | aeroplane | motorbike
78.9 | 76.2 | 53.5        | 85.2 | 75.5        | 85.0 | 48.6   | 86.7  | 82.2      | 83.4

sheep | train | boat | bicycle | chair | cat  | tvmonitor | person | car  | dog
82.4  | 87.6  | 72.7 | 83.0    | 61.3  | 88.2 | 74.5      | 79.6   | 85.3 | 86.4
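
As a quick sanity check, the reported mAP is the unweighted mean of the 20 per-class APs listed above. The short sketch below recomputes it from the table; the name per_class_ap is only illustrative and is not part of the repository.

```python
# Per-class AP (%) copied from the results table above (VOC07 metric).
per_class_ap = {
    'sofa': 78.9, 'bird': 76.2, 'pottedplant': 53.5, 'bus': 85.2,
    'diningtable': 75.5, 'cow': 85.0, 'bottle': 48.6, 'horse': 86.7,
    'aeroplane': 82.2, 'motorbike': 83.4, 'sheep': 82.4, 'train': 87.6,
    'boat': 72.7, 'bicycle': 83.0, 'chair': 61.3, 'cat': 88.2,
    'tvmonitor': 74.5, 'person': 79.6, 'car': 85.3, 'dog': 86.4,
}

# mAP is the plain average over classes (no weighting by class frequency).
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print("mAP = {:.1f}%".format(mean_ap))  # mAP = 77.8%
```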

You can download the trained model (VOC07+12 Train) from GoogleDrive for further research.

For Chinese friends, you can also download both the trained model and pre-trained vgg16 weights from BaiduYun Drive, access code: tg64.

Here are the training logs and some detection results:

Too Busy TODO

  • Adapting for the COCO dataset
  • Updated version: SSD-512
  • Transfer to other backbone networks

Known Issues

  • Got 'TypeError: Expected binary or unicode string, got None' while training
    • Why: There may be some inconsistency between different TensorFlow versions.
    • How: If you get this error, try changing the default value of checkpoint_path to './model/vgg16.ckpt' in train_ssd.py. For more information, see issue6 and issue9.
  • NaN loss during training
    • Why: This is caused by the default learning rate, which is a little too high for some TensorFlow versions.

    • How: I don't know the details of the behavior differences between versions. There are two workarounds:

      • Adding warm-up: change some code here to the following snippet:
      tf.app.flags.DEFINE_string(
          'decay_boundaries', '2000, 80000, 100000',
          'Learning rate decay boundaries by global_step (comma-separated list).')
      tf.app.flags.DEFINE_string(
          'lr_decay_factors', '0.1, 1, 0.1, 0.01',
          'The values of learning_rate decay factor for each segment between boundaries (comma-separated list).')
      • Lower the learning rate and run more steps until convergence.
  • Why does this re-implementation perform better than the reported performance?
    • I don't know
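
To make the warm-up workaround for the NaN loss concrete, here is a minimal sketch of how the two flag strings could translate into a per-step learning rate. It assumes the factors multiply a base learning rate and that a step equal to a boundary still belongs to the earlier segment (the convention used by tf.train.piecewise_constant). The helper names and the base rate of 1e-3 are assumptions for illustration, not values taken from train_ssd.py.

```python
import bisect


def parse_float_list(text):
    """Parse a comma-separated flag string such as '2000, 80000, 100000'."""
    return [float(value) for value in text.split(',')]


def learning_rate_at(step, base_lr, boundaries, factors):
    """Return base_lr scaled by the factor of the segment containing `step`.

    Hypothetical helper: segment i covers boundaries[i-1] < step <= boundaries[i].
    """
    if len(factors) != len(boundaries) + 1:
        raise ValueError('need exactly one more factor than boundaries')
    segment = bisect.bisect_left(boundaries, step)
    return base_lr * factors[segment]


boundaries = parse_float_list('2000, 80000, 100000')
factors = parse_float_list('0.1, 1, 0.1, 0.01')
base_lr = 1e-3  # assumed base learning rate, for illustration only

for step in (0, 2000, 2001, 80001, 100001):
    print(step, learning_rate_at(step, base_lr, boundaries, factors))
```

With these flags the first 2000 steps run at one tenth of the base rate (the warm-up), after which the schedule follows the usual step decay.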

Citation

Use this BibTeX entry to cite this repository:

@misc{kapok_ssd_2018,
  title={Single Shot MultiBox Detector in TensorFlow},
  author={Changan Wang},
  year={2018},
  publisher={Github},
  journal={GitHub repository},
  howpublished={\url{https://github.com/HiKapok/SSD.TensorFlow}},
}

Discussion

You are welcome to join the QQ Group (758790869) for more discussion.

Apache License, Version 2.0

ssd.tensorflow's Issues

Some Questions about training

1. What does acc (post_forward/cls_accuracy_1) mean? Is it possible to show the AP/mAP during training?
2. I tried half the batch size on a GTX 1060, lowering the learning rate and running more steps as you suggested. Is it also necessary to change end_learning_rate/decay_boundaries/lr_decay_factors?

Some Debugging Fixes and Suggestions

In convert_tfrecords.py, line 216:

anna_file = os.path.join(directory, cur_record[0], 'Annotations', cur_record[1].replace('jpg', 'xml'))

Suggested change:
anna_file = os.path.join(directory, cur_record[0], 'Annotations', cur_record[1].replace('.jpg', '.xml'))

Because the file name may contain the substring 'jpg' elsewhere, replace('jpg', 'xml') may replace more than just the extension.
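
A small sketch of the difference, using a made-up file name. str.replace substitutes every occurrence, so an image whose stem contains 'jpg' is mangled; os.path.splitext only touches the final extension. The helper annotation_name is hypothetical and not part of convert_tfrecords.py.

```python
import os


def annotation_name(image_name):
    """Swap only the trailing extension for '.xml' (hypothetical helper)."""
    stem, _extension = os.path.splitext(image_name)
    return stem + '.xml'


name = 'jpg_sample_001.jpg'  # made-up file name for illustration

print(name.replace('jpg', 'xml'))    # xml_sample_001.xml (stem is corrupted)
print(name.replace('.jpg', '.xml'))  # jpg_sample_001.xml
print(annotation_name(name))         # jpg_sample_001.xml
```

The suggested '.jpg' to '.xml' replacement fixes the common case; splitext additionally covers names in which '.jpg' itself appears before the final extension.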

File not found

After running eval_ssd, I ran voc_eval and got: FileNotFoundError: [Errno 2] No such file or directory: '/media/streamax/disk1/DataSet/test-VOC-2012/VOCdevkit/VOC2012/Annotations/2008_000001.xml'

But there is no 2008_0000001.xml!

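How can I get the data of the tensor?

Hello author, I would like to print the value of final_mask during training, but final_mask is a tensor and I cannot convert it into an array to print it. Could you help me with this? I have also been researching SSD recently; could we connect on WeChat to discuss?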

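I do not understand sample_patch in your code

Hello author, I don't fully understand the random-crop part of the data preprocessing in your code. When an image is randomly cropped, is min_iou drawn from ratio_list with equal probability each time? Does that mean that, in the extreme case where I change ratio_list to [0.1, 0.1, 0.1, 0.1, 0.1, 1.], crops with IoU > 0.1 would be the most common?
Also, does ratio_list [-0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 1.] include -0.1 so that background-only crops can be sampled?
Looking forward to your answer~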
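
The first question can be checked empirically under its own premise. The sketch below assumes, as the question does, that min_iou is drawn uniformly at random from ratio_list; whether sample_patch in the repository actually samples this way is not confirmed here, and sample_min_iou is a hypothetical stand-in.

```python
import random
from collections import Counter


def sample_min_iou(ratio_list, rng):
    """Draw one IoU threshold uniformly from the list (assumed behaviour)."""
    return rng.choice(ratio_list)


rng = random.Random(0)  # fixed seed so the run is repeatable
ratio_list = [0.1, 0.1, 0.1, 0.1, 0.1, 1.0]
num_draws = 6000

counts = Counter(sample_min_iou(ratio_list, rng) for _ in range(num_draws))
for value, count in sorted(counts.items()):
    print(value, count / num_draws)
```

Under that assumption, repeating an entry simply raises its sampling probability in proportion to how often it appears, so 0.1 is selected about five times as often as 1.0.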

Why is your number of anchors 6792?

I found the reason: your anchor_ratios don't include 1. If 1 is added, the number of anchors becomes 8732, which matches the original SSD source.
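
The arithmetic behind both numbers can be reproduced with the standard SSD300 layout from the paper: six feature maps of sizes 38, 19, 10, 5, 3 and 1 with 4, 6, 6, 6, 4 and 4 anchors per cell. Treating the repository's configuration as identical to this layout is an assumption; the sketch only shows that dropping one anchor per cell on every layer yields exactly 6792.

```python
# Standard SSD300 layout from the original paper (assumed to match this repo).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
anchors_per_cell = [4, 6, 6, 6, 4, 4]


def total_anchors(sizes, per_cell):
    """Sum size * size * anchors-per-cell over all feature layers."""
    return sum(size * size * count for size, count in zip(sizes, per_cell))


print(total_anchors(feature_map_sizes, anchors_per_cell))  # 8732

# Removing one anchor per cell (e.g. the ratio-1 anchor) on every layer:
one_fewer = [count - 1 for count in anchors_per_cell]
print(total_anchors(feature_map_sizes, one_fewer))  # 6792
```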

Training on a new dataset with 77 classes, results not good

Hi, your work is nice, but I ran into some problems training on my own data, which has 77 kinds of fruits and vegetables. I loaded the pre-trained VGG16 model and trained for about 86000 steps; the following pictures show the results. Could you give some suggestions?
3_result
1_result

I also wonder how you solved the problem of the loss not converging (balancap version). Looking forward to your reply, thank you.

VGG16 Pre-trained weights

Hi, I am curious how you made the VGG pre-trained weights, as they seem different from the default weights provided in TensorFlow (tf.slim). Did you apply subsampling for FC6 and FC7? Or did you convert the weights from the Caffe model? Thanks in advance.

How can I solve this problem?

File "/mnt/lustre/wuguangbin/Data_t1/cornell_ssd_code/Hikapok_SSD/SSD/train_ssd.py", line 463, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 783, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 537, in exit
self._close_internal(exception_type)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 574, in _close_internal
self._sess.close()
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 820, in close
self._sess.close()
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 941, in close
ignore_live_threads=True)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "/mnt/lustre/share/anaconda2_bigvideo/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 12
[[Node: dataset_data_provider/parallel_read/ReaderReadV2 = ReaderReadV2[device="/job:localhost/replica:0/task:0/device:CPU:0"](dataset_data_provider/parallel_read/TFRecordReaderV2, dataset_data$
rovider/parallel_read/filenames)]]

Couldn't train on my own dataset because the model diverged with loss = NaN

I made my own dataset and checked carefully to ensure that every training image has at least one bbox. I also set scaffold = None, but it still doesn't work. The ce loss is high, so the total loss is high, and the model diverged with loss = NaN. How can I solve this problem? I have been working on it for several days.
2018-07-29

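How to find the output_node_names?

Hello @HiKapok, I would like to ask you a question:
After training with your SSD code, I now need to convert the ckpt to pb format, which requires output_node_names. However, the outputs of the trained SSD model appear to be
all_labels = tf.concat(labels_list, axis=0)
all_scores = tf.concat(scores_list, axis=0)
all_bboxes = tf.concat(bboxes_list, axis=0)
that is, multiple outputs, and I cannot see the final output nodes of the network graph. Could you tell me what the output_node_names of the current network are? Many thanks.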

Why is the voc2012test mAP so low (4%)?

Hi, HiKapok:
Your code is very nice, good job. I used voc2012test instead of voc2007test, but its mAP is very low, and I don't understand the reason.

AP for pottedplant = 0.0000
AP for motorbike = 0.0000
AP for bottle = 0.0000
AP for cow = 0.0000
AP for tvmonitor = 0.0000
AP for person = 0.8002
AP for chair = 0.0008
AP for bus = 0.0000
AP for train = 0.0000
AP for sofa = 0.0000
AP for sheep = 0.0000
AP for boat = 0.0000
AP for horse = 0.0000
AP for diningtable = 0.0078
AP for cat = 0.0000
AP for bicycle = 0.0000
AP for aeroplane = 0.0000
AP for car = 0.0000
AP for dog = 0.0000
AP for bird = 0.0000
Mean AP = 0.0404

An Error Occurred While Training

I followed everything as you wrote, but an error occurred:
TypeError: Expected binary or unicode string, got None
Can you help me?

An error occurred while training

I used the pre-trained VGG16 model you provided, but an error occurred:
INFO:tensorflow:Fine-tuning from None. Ignoring missing vars: True.

Traceback (most recent call last):

File "train_ssd.py", line 464, in

  tf.app.run()

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run

  _sys.exit(main(argv))

File "train_ssd.py", line 460, in main

  hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 352, in train

  loss = self._train_model(input_fn, hooks, saving_listeners)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model

  features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn

  model_fn_results = self._model_fn(features=features, **kwargs)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py", line 220, in single_device_model_fn

  local_ps_devices=ps_devices)[0]  # One device, so one spec is out.

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py", line 558, in _get_loss_towers

  **optional_params)

File "train_ssd.py", line 403, in ssd_model_fn
scaffold=tf.train.Scaffold(init_fn=get_init_fn()))
File "train_ssd.py", line 158, in get_init_fn

  name_remap={'/kernel': '/weights', '/bias': '/biases'})

File "/home/h/Desktop/SSD.TensorFlow-master/utility/scaffolds.py", line 66, in get_init_fn_for_scaffold

  reader = tf.train.NewCheckpointReader(checkpoint_path)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 254, in NewCheckpointReader

  return CheckpointReader(compat.as_bytes(filepattern), status)

File "/home/h/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 67, in as_bytes

  (bytes_or_text,))

TypeError: Expected binary or unicode string, got None

Could you help me?
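
The last frames of the traceback show compat.as_bytes receiving None, and the log line "Fine-tuning from None" suggests that no checkpoint path was resolved before the checkpoint reader was created. A minimal sketch of a guard that fails early with a clearer message is shown below. resolve_checkpoint_path is a hypothetical helper, not a function from the repository, and its default simply mirrors the './model/vgg16.ckpt' workaround from the Known Issues section.

```python
def resolve_checkpoint_path(checkpoint_path, default_path='./model/vgg16.ckpt'):
    """Return a non-empty checkpoint prefix, falling back to default_path.

    Hypothetical helper: call it before handing the path to the checkpoint
    reader so that a missing value is caught with a readable error.
    """
    path = default_path if checkpoint_path is None else checkpoint_path
    if not isinstance(path, str) or not path:
        raise ValueError(
            'checkpoint_path must be a non-empty string, got {!r}'.format(path))
    return path


print(resolve_checkpoint_path(None))
print(resolve_checkpoint_path('./model/vgg16_reducedfc.ckpt'))
```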

Some questions about training

Thanks for your work; it was easy to reproduce. There are still some questions confusing me.

  1. The pre-trained vgg16_reduced_fc is different from the original VGG16, including different weights in conv1/conv1_1 and the removal of the fc8 layer. Why?

  2. The learning rate is a piecewise function. How did you design this function, and how should the initial learning rate be set?

  3. I trained the net on a PC with two 1080 Ti GPUs, and both GPUs were used. After 120000 steps, the mAP is 0.78, and the log printed on the screen looks like this:

INFO:tensorflow: global_step/sec: 2.22489

INFO: tensorflow: lr=0.00001, ce= 1.487573, loc=0.481032, loss=7.578554, l2=5.609949, acc= 0.835833

INFO: tensorflow: loss=10.1866, step=119990(4.494 sec).

There are two values named "loss"; what is the difference between them?

I suppose the second "loss" is the total loss, as in the file loss.JPG saved in the log directory. Why does the total loss in your loss.JPG converge to about 7~8, while mine converges to 9~10?

And why does training one batch take longer for me than for you (yours takes 1s)?

Please help me.

Portability for custom dataset

Hi! I have a dataset with only one object class. I attempted to use balancap's SSD-Tensorflow but ran into a problem I could not solve, so I have to look for a substitute. Is it easy to adapt your code to my dataset? What should I pay attention to? Thanks in advance!

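A question about changing the number of anchors

Hello, I would like to change the number of anchors, for example by adding more aspect ratios. Which parts of the code should I modify? Thank you, author!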

How do you create the anchors?

Hi, author! I have a question about how your code creates anchors. I read your code and found that the function get_layer_anchors returns

tf.expand_dims(y_on_image, axis=-1), tf.expand_dims(x_on_image, axis=-1),
tf.constant(list_h_on_image, dtype=tf.float32),
tf.constant(list_w_on_image, dtype=tf.float32),
num_anchors_along_depth,
num_anchors_along_spatial

but the shape of list_h_on_image in the first feature layer is 4, while the shape of y_on_image is 1444 (38*38). How are they combined with each other? I cannot find the code.
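
One common way to combine such shapes is broadcasting: an elementwise operation between a [N, 1] tensor of centers and a length-A vector of sizes produces an [N, A] result, one box per (center, size) pair. Whether the repository combines them exactly this way is an assumption; the pure-Python sketch below only spells out what that broadcast would compute, with a hypothetical helper and toy numbers.

```python
def combine_centers_with_sizes(centers_y, centers_x, heights, widths):
    """Pair every (cy, cx) center with every (h, w) anchor size.

    Returns a nested list indexed [center][size] of
    (ymin, xmin, ymax, xmax) tuples, i.e. the [N, A] broadcast result.
    """
    boxes = []
    for cy, cx in zip(centers_y, centers_x):
        row = []
        for h, w in zip(heights, widths):
            row.append((cy - h / 2.0, cx - w / 2.0, cy + h / 2.0, cx + w / 2.0))
        boxes.append(row)
    return boxes


# Toy 2x2 feature map (4 centers) with 2 anchor sizes; numbers are made up.
centers_y = [0.25, 0.25, 0.75, 0.75]
centers_x = [0.25, 0.75, 0.25, 0.75]
heights = [0.2, 0.4]
widths = [0.2, 0.1]

boxes = combine_centers_with_sizes(centers_y, centers_x, heights, widths)
print(len(boxes), len(boxes[0]))  # 4 2
```

For the first SSD300 feature layer the same pairing would give 38*38 = 1444 centers times 4 sizes, i.e. 5776 anchors.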

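About channel first and channel last

Hello, I would like to use the pre-trained model to train on my own data. Was the pre-trained model you provide trained in channel-first mode?

If so, does that mean I cannot use this model to train in channel-last mode?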

Eval error

After I downloaded your files and the checkpoint, I ran eval_ssd.py and it gave the following error:
C:\Users\DELL\PycharmProjects\unstructured_data\venv\Scripts\python.exe D:/git_code/SSD.TensorFlow/eval_ssd.py
INFO:tensorflow:Using config: {'_model_dir': './logs/', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002B7C3A76F98>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Starting a predict cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From D:\git_code\SSD.TensorFlow\net\ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2018-06-28 21:01:10.495852: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
INFO:tensorflow:Restoring parameters from ./model\vgg16_reducedfc.ckpt
2018-06-28 21:01:10.574406: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint
Traceback (most recent call last):
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
return fn(*args)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key global_step not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:/git_code/SSD.TensorFlow/eval_ssd.py", line 456, in
tf.app.run()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "D:/git_code/SSD.TensorFlow/eval_ssd.py", line 427, in main
det_results = list(pred_results)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 507, in predict
hooks=all_hooks) as mon_sess:
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 816, in init
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 539, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1002, in init
_WrappedSession.init(self, self._create_session())
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 467, in create_session
init_fn=self._scaffold.init_fn)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\session_manager.py", line 279, in prepare_session
config=config)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\session_manager.py", line 191, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1802, in restore
{self.saver_def.filename_tensor_name: save_path})
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
run_metadata_ptr)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
run_metadata)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key global_step not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
File "D:/git_code/SSD.TensorFlow/eval_ssd.py", line 456, in
tf.app.run()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "D:/git_code/SSD.TensorFlow/eval_ssd.py", line 427, in main
det_results = list(pred_results)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 507, in predict
hooks=all_hooks) as mon_sess:
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 816, in init
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 539, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1002, in init
_WrappedSession.init(self, self._create_session())
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 458, in create_session
self._scaffold.finalize()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 910, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1338, in init
self.build()
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1347, in build
self._build(self._filename, build_save=True, build_restore=True)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1384, in _build
build_save=build_save, build_restore=build_restore)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 829, in _build_internal
restore_sequentially, reshape)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 525, in _AddShardedRestoreOps
name="restore_shard"))
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 472, in _AddRestoreOps
restore_sequentially)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\training\saver.py", line 886, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
op_def=op_def)
File "C:\Users\DELL\PycharmProjects\unstructured_data\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key global_step not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Process finished with exit code 1

A problem encountered while training on my own data

I have modified convert_tfrecords.py to produce my own *.tfrecords. My XML file is structured like this:
image
When I start training, I run into a problem; the console output looks like this:
ssh://[email protected]:22/home/jiawen/anaconda3/envs/mnist_test/bin/python3.5 -u /home/jiawen/gitcode/ssd.tensorflow/train_ssd.py
2018-07-04 15:41:59.453717: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-04 15:41:59.962936: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-04 15:41:59.964121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-07-04 15:41:59.964159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 15:42:00.289100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 15:42:00.289164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 15:42:00.289179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 15:42:00.289601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 15135 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
INFO:tensorflow:Replicating the model_fn across ['/device:GPU:0']. Variables are going to be placed on ['/device:GPU:0']. Consolidation device is going to be /device:GPU:0.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_num_worker_replicas': 1, '_evaluation_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f931a6ca630>, '_keep_checkpoint_every_n_hours': 10000, '_task_id': 0, '_log_step_count_steps': 10, '_global_id_in_cluster': 0, '_save_summary_steps': 500, '_tf_random_seed': 20180503, '_model_dir': './logs/', '_save_checkpoints_secs': 7200, '_service': None, '_save_checkpoints_steps': None, '_train_distribute': None, '_master': '', '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_is_chief': True, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
allow_soft_placement: true
}
Starting a training cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/jiawen/gitcode/ssd.tensorflow/net/ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Fine-tuning from ./model/vgg16_reducedfc.ckpt. Ignoring missing vars: True.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-07-04 15:42:05.769670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-04 15:42:05.769732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-04 15:42:05.769749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-04 15:42:05.769763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-04 15:42:05.769994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15135 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from ./model/vgg16_reducedfc.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2018-07-04 15:42:13.305340: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at gather_nd_op.cc:50 : Invalid argument: flat indices[30, :] = [30, -1] does not index into param (shape: [32,6792]).
Traceback (most recent call last):
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: flat indices[30, :] = [30, -1] does not index into param (shape: [32,6792]).
[[Node: post_forward/GatherNd = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](post_forward/TopKV2, post_forward/stack)]]
[[Node: post_forward/boolean_mask/Squeeze/_691 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_653_post_forward/boolean_mask/Squeeze", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 464, in
tf.app.run()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 460, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1059, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
raise six.reraise(*original_exc_info)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
run_metadata=run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
return self._sess.run(*args, **kwargs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: flat indices[30, :] = [30, -1] does not index into param (shape: [32,6792]).
[[Node: post_forward/GatherNd = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](post_forward/TopKV2, post_forward/stack)]]
[[Node: post_forward/boolean_mask/Squeeze/_691 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_653_post_forward/boolean_mask/Squeeze", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'post_forward/GatherNd', defined at:
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 464, in
tf.app.run()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 460, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 856, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 831, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/jiawen/gitcode/ssd.tensorflow/tf_replicate_model_fn.py", line 201, in single_device_model_fn
local_ps_devices=ps_devices)[0] # One device, so one spec is out.
File "/home/jiawen/gitcode/ssd.tensorflow/tf_replicate_model_fn.py", line 534, in _get_loss_towers
**optional_params)
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 321, in ssd_model_fn
score_at_k = tf.gather_nd(topk_prob_for_bg, tf.stack([tf.range(tf.shape(features)[0]), batch_n_neg_select - 1], axis=-1))
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2975, in gather_nd
"GatherNd", params=params, indices=indices, name=name)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): flat indices[30, :] = [30, -1] does not index into param (shape: [32,6792]).
[[Node: post_forward/GatherNd = GatherNd[Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](post_forward/TopKV2, post_forward/stack)]]
[[Node: post_forward/boolean_mask/Squeeze/_691 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_653_post_forward/boolean_mask/Squeeze", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Process finished with exit code 1
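
A note on the error above: the failing index `[30, -1]` is built from `batch_n_neg_select - 1` (see the `tf.gather_nd` line in the traceback), so the selection count for sample 30 must have been 0. The sketch below reproduces that arithmetic in plain Python. The helper names and the sample counts are hypothetical; it does not explain why the count reached zero for that image.

```python
def neg_index_pairs(batch_n_neg_select):
    # Plain-Python analogue of
    # tf.stack([tf.range(batch), batch_n_neg_select - 1], axis=-1)
    return [[i, n - 1] for i, n in enumerate(batch_n_neg_select)]


def invalid_rows(pairs, row_length):
    # Rows whose column index falls outside [0, row_length), which is what
    # GatherNd rejects with "does not index into param".
    return [i for i, j in pairs if not 0 <= j < row_length]


counts = [12, 0, 7]  # hypothetical per-image negative counts
pairs = neg_index_pairs(counts)
bad = invalid_rows(pairs, 6792)  # 6792 is the row length from the log
print(pairs, bad)
```

Clamping the count to at least 1 before subtracting would avoid the invalid index, but whether that suits the loss in this repo is an assumption for the author to confirm.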

training time and hardware configuration

Thank you for your code. Could you share the training time and your machine's hardware configuration? I am also training an SSD300 net on my own dataset, but the training process is very slow, so I would like to know how long 60k steps took you on the Pascal VOC dataset. That would be a useful reference for me. Thank you!

update pipeline using MirroredStrategy

Hi Changan,
Do you have any experience with tf.contrib.distribute.MirroredStrategy? As TF is going to deprecate replicate_model_fn, I was trying to update your code to use MirroredStrategy, but I always get "ValueError: destinations must be one of a DistributedValues object, a device string, a list of device strings or None".

Please take a look when you have time.
Thanks,
Yu

the number of anchors is different from the paper

Hi, I found that the numbers of anchors in your code are [3, 5, 5, 5, 3, 3], but the numbers of anchors proposed in the paper are [4, 6, 6, 6, 4, 4]. Yet your code gets a better result than the paper. It is amazing!
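
To put numbers on the difference, here is a small sketch that counts the total anchors for both configurations. It assumes the SSD300 feature-map shapes quoted in another issue on this page; `total_anchors` is a hypothetical helper, not a function from the repo.

```python
# Feature-map shapes for a 300x300 input, as quoted elsewhere on this page.
LAYER_SHAPES = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]


def total_anchors(anchors_per_location, layer_shapes=LAYER_SHAPES):
    # Each feature-map cell carries a fixed number of anchors for its layer.
    return sum(h * w * n for (h, w), n in zip(layer_shapes, anchors_per_location))


repo_total = total_anchors([3, 5, 5, 5, 3, 3])
paper_total = total_anchors([4, 6, 6, 6, 4, 4])
print(repo_total, paper_total)
```

The first total, 6792, agrees with the `[32, 6792]` tensor shape in the traceback posted in another issue on this page; the second, 8732, is the SSD300 box count given in the paper.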

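The program hangs without reporting any error

Hi @HiKapok:

When I use your SSD implementation, it runs fine on the Pascal VOC data; the program runs perfectly. But when I run it on my own dataset, the program hangs and reports no error. How can I locate this problem in the program? PS: the data was successfully converted from VOC format to TF format, without errors.

About the dataset: some of the images have no object bbox coordinates, only the basic size information (length, width, height).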

anchor_allowed_borders, what is it for?

I do not understand what 'anchor_allowed_borders' is for in the function 'encode_all_labels':

inside_mask = tf.logical_and(tf.logical_and(anchors_ymin > -anchor_allowed_borders * 1.,
                                                        anchors_xmin > -anchor_allowed_borders * 1.),
                                        tf.logical_and(anchors_ymax < (1. + anchor_allowed_borders * 1.),
                                                        anchors_xmax < (1. + anchor_allowed_borders * 1.)))

From reading your code, every value in 'anchor_allowed_borders' is 1.0, so will the mask always be true? (anchors_ymin, anchors_xmin, anchors_ymax and anchors_xmax are always between 0.0 and 1.0.)
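
A plain-Python sketch of the same comparison may help here. It assumes normalized `(ymin, xmin, ymax, xmax)` anchors; the example boxes are hypothetical, and whether the repo's anchors ever leave the `[0, 1]` range is not confirmed by the snippet quoted above.

```python
def inside_mask(anchors, allowed_border):
    """Per-anchor analogue of the tf.logical_and expression quoted above.

    anchors: list of (ymin, xmin, ymax, xmax) in normalized coordinates.
    """
    return [
        ymin > -allowed_border
        and xmin > -allowed_border
        and ymax < 1.0 + allowed_border
        and xmax < 1.0 + allowed_border
        for ymin, xmin, ymax, xmax in anchors
    ]


anchors = [
    (0.2, 0.2, 0.6, 0.6),   # fully inside the image
    (-0.1, 0.4, 0.5, 1.1),  # hypothetical anchor spilling slightly over the edge
    (-1.5, 0.0, 0.5, 2.5),  # hypothetical anchor far outside the image
]
loose = inside_mask(anchors, 1.0)
strict = inside_mask(anchors, 0.0)
print(loose, strict)
```

With a border of 1.0 only boxes reaching below -1 or above 2 are masked out, so for anchors that stay close to the image the mask is indeed all true. A border of 0.0 would drop every anchor that crosses the image edge.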

A problem occurred when training the model

I have downloaded VGG16, but there is still a problem. I'm using TensorFlow 1.8.

Traceback (most recent call last):
File "/root/Downloads/SSD.TensorFlow-master/train_ssd.py", line 464, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/Downloads/SSD.TensorFlow-master/train_ssd.py", line 460, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 856, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 831, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py", line 221, in single_device_model_fn
local_ps_devices=ps_devices)[0] # One device, so one spec is out.
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py", line 559, in _get_loss_towers
**optional_params)
File "/root/Downloads/SSD.TensorFlow-master/train_ssd.py", line 403, in ssd_model_fn
scaffold=tf.train.Scaffold(init_fn=get_init_fn()))
File "/root/Downloads/SSD.TensorFlow-master/train_ssd.py", line 158, in get_init_fn
name_remap={'/kernel': '/weights', '/bias': '/biases'})
File "/root/Downloads/SSD.TensorFlow-master/utility/scaffolds.py", line 66, in get_init_fn_for_scaffold
reader = tf.train.NewCheckpointReader(checkpoint_path)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 290, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 68, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got None

NaN loss

Hi,
I have changed the model definition for a face detection task. As mentioned in the README, I have changed the required places to adapt it to TF 1.4.
But the NaN loss issue is still not solved. Even with a reduced learning rate of 1e-4 it gives NaN.
Have you found any solution to this?

I do not understand the match strategy of your code

between_mask = tf.logical_and(tf.less(match_values, high_thres), tf.greater_equal(match_values, low_thres))
negative_mask = less_mask if ignore_between else between_mask
ignore_mask = between_mask if ignore_between else less_mask

What does between_mask mean? Can you explain it more clearly?
Thank you very much!
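
For readers with the same question, here is a plain-Python reading of the three quoted lines. `less_mask` is referenced but not defined in the snippet, so its definition here (overlap below `low_thres`) is an assumption, and the thresholds and overlap values are hypothetical.

```python
def split_masks(match_values, low_thres, high_thres, ignore_between=True):
    # Assumption: less_mask marks anchors whose best overlap is below low_thres.
    less_mask = [v < low_thres for v in match_values]
    # between_mask marks anchors with low_thres <= overlap < high_thres.
    between_mask = [low_thres <= v < high_thres for v in match_values]
    # ignore_between decides which group is treated as negatives and which
    # group is ignored.
    negative_mask = less_mask if ignore_between else between_mask
    ignore_mask = between_mask if ignore_between else less_mask
    return negative_mask, ignore_mask


overlaps = [0.1, 0.45, 0.7]  # hypothetical best overlaps for three anchors
neg_a, ign_a = split_masks(overlaps, 0.4, 0.5, ignore_between=True)
neg_b, ign_b = split_masks(overlaps, 0.4, 0.5, ignore_between=False)
print(neg_a, ign_a, neg_b, ign_b)
```

Under this reading, `between_mask` selects the band of anchors that overlap a ground-truth box too much to be clear negatives but not enough to be positives. With `ignore_between=True` that band is ignored and only low-overlap anchors count as negatives; with `False` the roles are swapped.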

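Pre-trained model

Sorry to bother you, but may I ask: can I train without the pre-trained parameters? I changed the front part of the model, so the pre-trained parameters no longer match, and I would like to train it myself.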

error

AttributeError: module 'tensorflow.python.layers.layers' has no attribute 'Layer'

Could it be that my TF version is too old?

prediction_hooks argument for tf.estimator.EstimatorSpec

When trying to run eval_ssd.py I get the following error:
File "eval_ssd.py", line 373, in ssd_model_fn
loss=None, train_op=None)
TypeError: __new__() got an unexpected keyword argument 'prediction_hooks'

I'm using TF 1.4.1. Is this another TF compatibility issue?
Is there an easy patch I can use as you did for tf_replicate_model_fn?

Thanks in advance!

What do we need to run on COCO

Hi friend,
I am interested in testing your code on the COCO dataset. I have downloaded the train/val files and am ready to go. Do you have any idea where I should start?

About voc_eval.py

I just finished training on my own dataset (10 classes) and successfully ran eval_ssd.py. But an error occurs when I run voc_eval.py. It seems like there is something wrong with the VOC07 11-point method.
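
For reference, the VOC07 11-point metric averages, over the recall thresholds 0.0, 0.1, ..., 1.0, the maximum precision reached at or beyond each threshold. The sketch below is a generic pure-Python version with a hypothetical function name, not the code in voc_eval.py, and it does not diagnose the error reported here.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated average precision (VOC07 style)."""
    ap = 0.0
    for i in range(11):
        threshold = i / 10.0
        # Precisions of all operating points whose recall reaches the threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        # A threshold that is never reached contributes zero precision.
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap


perfect = voc07_ap([0.2, 0.4, 0.6, 0.8, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0])
half = voc07_ap([0.5], [1.0])
empty = voc07_ap([], [])
print(perfect, half, empty)
```

The guard for an empty candidate list matters when a class has no detections or no ground truth, since taking the maximum of an empty sequence raises an error.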

Does tf.stop_gradient in the loss function affect the final result?

Hi, I learned about your SSD implementation from the "Issues" under balancap's SSD implementation, and I am training with your implementation right now as I type these questions :D. Good work on your code!

I found that you use tf.stop_gradient() when you define your mask in the loss. Does it affect training if it is not used?

The implementation by balancap is not stable during training, while your implementation converges easily. Have you found out the reason?

About the mAP

I trained on the VOC dataset, but only got 64% mAP. Is there any trick you used to reach the reported mAP (77%) without the pretrained model?

Pre-trained model

Does your program have to be given a pre-trained model in order to train?

still cannot train on my own data

I solved the problem mentioned yesterday: num_classes is now set to 3, and VOC_LABELS is changed to MY_LABELS.

But I get another bug:

ssh://[email protected]:22/home/jiawen/anaconda3/envs/mnist_test/bin/python3.5 -u /home/jiawen/gitcode/ssd.tensorflow/train_ssd.py
2018-07-05 15:28:46.450990: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-05 15:28:46.940678: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-05 15:28:46.941827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-07-05 15:28:46.941862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-05 15:28:47.291170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-05 15:28:47.291233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-05 15:28:47.291253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-05 15:28:47.291677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 15135 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
INFO:tensorflow:Replicating the model_fn across ['/device:GPU:0']. Variables are going to be placed on ['/device:GPU:0']. Consolidation device is going to be /device:GPU:0.
INFO:tensorflow:Using config: {'_evaluation_master': '', '_model_dir': './logs/', '_num_worker_replicas': 1, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
allow_soft_placement: true
, '_task_id': 0, '_master': '', '_task_type': 'worker', '_global_id_in_cluster': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f791f7da630>, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_keep_checkpoint_max': 5, '_save_checkpoints_secs': 7200, '_log_step_count_steps': 10, '_num_ps_replicas': 0, '_tf_random_seed': 20180503, '_service': None, '_train_distribute': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None}
Starting a training cycle.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /home/jiawen/gitcode/ssd.tensorflow/net/ssd_net.py:114: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Ignoring --checkpoint_path because a checkpoint already exists in ./logs/.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-07-05 15:28:52.828754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-05 15:28:52.828818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-05 15:28:52.828834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-05 15:28:52.828849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-05 15:28:52.829054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15135 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1
Traceback (most recent call last):
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [9] rhs shape= [63]
[[Node: save/Assign_116 = Assign[T=DT_FLOAT, _class=["loc:@ssd300/multibox_head/cls_5/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ssd300/multibox_head/cls_5/bias/Momentum, save/RestoreV2_1/_1)]]
[[Node: save/RestoreV2_1/_254 = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_262_save/RestoreV2_1", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 464, in
tf.app.run()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 460, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1056, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 405, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 816, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 539, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in init
_WrappedSession.init(self, self._create_session())
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
init_fn=self._scaffold.init_fn)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
config=config)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 207, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [9] rhs shape= [63]
[[Node: save/Assign_116 = Assign[T=DT_FLOAT, _class=["loc:@ssd300/multibox_head/cls_5/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ssd300/multibox_head/cls_5/bias/Momentum, save/RestoreV2_1/_1)]]
[[Node: save/RestoreV2_1/_254 = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_262_save/RestoreV2_1", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'save/Assign_116', defined at:
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 464, in
tf.app.run()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/home/jiawen/gitcode/ssd.tensorflow/train_ssd.py", line 460, in main
hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
saving_listeners)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1056, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 405, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 816, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 539, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in init
_WrappedSession.init(self, self._create_session())
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
self._scaffold.finalize()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 214, in finalize
self._saver.build()
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1347, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
build_save=build_save, build_restore=build_restore)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
restore_sequentially, reshape)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
name="restore_shard"))
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 494, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 185, in restore
self.op.get_shape().is_fully_defined())
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 283, in assign
validate_shape=validate_shape)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 60, in assign
use_locking=use_locking, name=name)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/jiawen/anaconda3/envs/mnist_test/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [9] rhs shape= [63]
[[Node: save/Assign_116 = Assign[T=DT_FLOAT, _class=["loc:@ssd300/multibox_head/cls_5/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ssd300/multibox_head/cls_5/bias/Momentum, save/RestoreV2_1/_1)]]
[[Node: save/RestoreV2_1/_254 = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_262_save/RestoreV2_1", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Process finished with exit code 1
Can you list all the lines that should be changed if I want to train on my own data?
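
On the shape mismatch in the log above (`lhs shape= [9] rhs shape= [63]` for `cls_5/bias`): both numbers fit "anchors per location times number of classes" for a layer with 3 anchors, as the sketch below shows. It assumes 3 anchors on the last layer (from the [3, 5, 5, 5, 3, 3] configuration quoted in another issue on this page) and 21 classes for VOC (20 plus background); the helper name is hypothetical.

```python
ANCHORS_PER_LOCATION = 3  # assumed for the last prediction layer (cls_5)


def cls_bias_size(num_classes, anchors_per_location=ANCHORS_PER_LOCATION):
    # The classification head predicts one score per class for every anchor.
    return anchors_per_location * num_classes


voc_size = cls_bias_size(21)  # assumed VOC setting: 20 classes + background
own_size = cls_bias_size(3)   # num_classes = 3, as set in this issue
print(voc_size, own_size)
```

Since the log also says `Ignoring --checkpoint_path because a checkpoint already exists in ./logs/` and then restores `./logs/model.ckpt-1`, one plausible reading is that `./logs/` still holds a checkpoint written by an earlier 21-class run. If so, training into an empty model directory may avoid the mismatch, but that is a guess from the log rather than a confirmed fix.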

Should "layer_steps" change with the size of input?

Hi, I have a question about evaluating SSD300 with images of a different size.

The layer_steps=[8, 16, 32, 64, 100, 300]
and the layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
when using the input size 300.
When I evaluate a new image with size 512, the layers_shapes will be
[(64, 64), (32,32), (16, 16), (8, 8), (6, 6), (4, 4)]. Should the "layer_steps" be [8, 16, 32, 64, 85.3, 128]?
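
The proposed values are simply the input size divided by each feature-map size, which this sketch reproduces. `steps_for` is a hypothetical helper; whether these are the right values to pass to the repo is for the author to confirm.

```python
def steps_for(input_size, layer_shapes):
    # Step = how many input pixels one feature-map cell covers along an axis.
    return [round(input_size / h, 1) for h, _ in layer_shapes]


shapes_300 = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
shapes_512 = [(64, 64), (32, 32), (16, 16), (8, 8), (6, 6), (4, 4)]
steps_300 = steps_for(300, shapes_300)
steps_512 = steps_for(512, shapes_512)
print(steps_300)
print(steps_512)
```

For the 300 input the exact ratios of the first four layers are 7.9, 15.8, 30.0 and 60.0, so the quoted layer_steps of 8, 16, 32 and 64 are nominal strides rather than exact ratios; only the last two (100 and 300) match exactly.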
