prouast / deep-intake-detection
A deep neural network implementation for detection of food and drink intake gestures from video.
License: MIT License
Hey, I'm trying to use this codebase to run some baseline evaluations on a different eating dataset. I've got the input pipeline and the model working (see repo), but I'm having a couple of issues training on multiple GPUs.
Currently, running `oreba_main.py` with `use_distribution=True` only uses the CPU for training, resulting in very slow performance. It gives the following log:
2020-07-20 21:44:48.576363: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
W0720 21:44:48.580603 140048971515712 cross_device_ops.py:1209] There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Initializing RunConfig with distribution strategies.
I0720 21:44:48.581110 140048971515712 run_config.py:566] Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
I0720 21:44:48.581351 140048971515712 estimator_training.py:167] Not using Distribute Coordinator.
I also tried manually editing `run_loop.main()` to use:
strategy = tf.contrib.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"])
but it throws:
Traceback (most recent call last):
File "oreba_main.py", line 586, in <module>
main=run_oreba
File "/usr/lib/python3/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/lib/python3/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "oreba_main.py", line 581, in run_oreba
warmstart_settings=oreba_warmstart_settings())
File "/home/andyxu/Rouast_Paper_Code/deep-intake-detection/run_loop.py", line 120, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1159, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1222, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1302, in _actual_train_model_distributed
self.config))
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1810, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 662, in _call_for_each_replica
fn, args, kwargs)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 196, in _call_for_each_replica
coord.join(threads)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/andyxu/.local/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 880, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "oreba_main.py", line 472, in oreba_model_fn
model=model)
File "/home/andyxu/Rouast_Paper_Code/deep-intake-detection/run_loop.py", line 319, in model_fn
labels=final_labels, predictions=final_logits, k=1, class_id=i)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/metrics_impl.py", line 3619, in precision_at_k
name=scope)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/metrics_impl.py", line 3485, in precision_at_top_k
weights=weights)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/metrics_impl.py", line 2375, in _streaming_sparse_true_positive_at_k
var = metric_variable([], dtypes.float64, name=scope)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/metrics_impl.py", line 85, in metric_variable
name=name)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
shape=shape)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 65, in getter
return captured_getter(captured_previous, **kwargs)
File "/home/andyxu/.local/lib/python3.6/site-packages/tensorflow_core/python/distribute/shared_variable_creator.py", line 92, in reuse_variable
format(name, device_id))
RuntimeError: Tried to create variable replica_1/precision_at_1_class0/true_positive_at_1_class0/ with mismatching name on device 1
Any idea what the issue could be? For reference, I'm running TensorFlow 1.15.2 with Python 3.6.
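Since the first log warns about non-GPU devices in the strategy, one quick diagnostic is to list the devices TensorFlow can see before building the `MirroredStrategy`. The following is only a minimal sketch, assuming TensorFlow 1.x is installed; the helper names `short_gpu_names` and `visible_device_names` are hypothetical and not part of this repository, and `device_lib.list_local_devices()` comes from TensorFlow's `tensorflow.python.client` module.

```python
import re

# Full device names reported by TensorFlow look like '/device:GPU:0'.
# XLA devices ('/device:XLA_GPU:0') are deliberately not matched.
_GPU_NAME = re.compile(r"/device:GPU:(\d+)$")


def short_gpu_names(device_names):
    """Convert full GPU device names to the short '/gpu:N' form.

    Non-GPU devices (CPU, XLA_CPU, XLA_GPU) are skipped.
    """
    result = []
    for name in device_names:
        match = _GPU_NAME.search(name)
        if match:
            result.append("/gpu:" + match.group(1))
    return result


def visible_device_names():
    """Ask TensorFlow which local devices it can see (assumes TF 1.x)."""
    from tensorflow.python.client import device_lib
    return [d.name for d in device_lib.list_local_devices()]


if __name__ == "__main__":
    try:
        names = visible_device_names()
    except ImportError:
        # TensorFlow is not installed in this environment.
        names = []
    print("All devices:", names)
    print("GPU devices:", short_gpu_names(names))
```

If the GPU list comes back empty, the strategy has nothing but the CPU to place replicas on, which would be consistent with the slow training described above. If it is non-empty, the resulting list could be passed as the `devices` argument of the strategy instead of hard-coding four GPU names.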