Hello! Thank you for your work.
I see you have provided a detailed tutorial for training.
However, even with all the preprocessing steps finished, I keep failing to run code/train.py from the tutorial. From tracing the code, I suspect something goes wrong while computing the gradients. I also found that if I remove the optimizer op (Trainer.train_op in your repo) from the inputs of sess.run(inputs) in line 2045, the training loop starts running without errors.
To be clear, the inputs in the original repo are:
inputs = [self.loss, self.train_op, self.wd_loss]
When I run train.py, I get the error shown below:
multiview data stats:
min 1, max 4
{1: 748, 2: 474, 3: 275, 4: 11121}
loaded 47005 data points for train
loaded 7839 data points for val
batch_size:12, epoch:30, 3918 step every epoch, total step:117540, eval/save every 3000 steps
0%| | 0/117540 [00:00<?, ?it/s]
0%| | 0/117540 [00:11<?, ?it/s]
Traceback (most recent call last):
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Resource __per_step_3/gradients/AddN_6/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
  [[{{node gradients/AddN_6/tmp_var}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "code/train.py", line 335, in <module>
    main(arguments)
  File "code/train.py", line 308, in main
    trainer.step(sess, batch)
  File "/home/ziyan/simaug/Multiverse/SimAug/code/pred_models.py", line 2073, in step
    outputs = sess.run(inputs, feed_dict=feed_dict)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ziyan/anaconda3/envs/tf1.15/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Resource __per_step_3/gradients/AddN_6/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
  [[{{node gradients/AddN_6/tmp_var}}]]
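The failing node name is what makes me suspect the gradient computation: it sits under the gradients/ name scope, which is the default scope TensorFlow 1.x uses for ops created by tf.gradients. Here is a minimal sketch for pulling that name out of the error text. `failing_node` is a hypothetical helper (not part of the repo), and it assumes the `[[{{node ...}}]]` format seen in the traceback.

```python
import re


def failing_node(message):
    """Return the node name inside '[[{{node ...}}]]', or None if absent."""
    match = re.search(r"\{\{node ([^}]+)\}\}", message)
    return match.group(1) if match else None


# Error text copied from the traceback in this issue.
error_text = (
    "AlreadyExistsError: Resource __per_step_3/gradients/AddN_6/tmp_var/"
    "N10tensorflow19TemporaryVariableOp6TmpVarE\n"
    "[[{{node gradients/AddN_6/tmp_var}}]]"
)

node = failing_node(error_text)
print(node)                           # gradients/AddN_6/tmp_var
print(node.startswith("gradients/"))  # True
```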
Then, if I remove self.train_op from inputs, train.py runs without errors:
inputs = [self.loss, self.wd_loss]
Screen logs:
multiview data stats:
min 1, max 4
{1: 748, 2: 474, 3: 275, 4: 11121}
loaded 47005 data points for train
loaded 7839 data points for val
batch_size:12, epoch:30, 3918 step every epoch, total step:117540, eval/save every 3000 steps
0%| | 0/117540 [00:00<?, ?it/s]
0%| | 1/117540 [00:20<671:10:59, 20.56s/it]
0%| | 2/117540 [00:38<615:44:18, 18.86s/it]
0%| | 3/117540 [00:56<601:51:16, 18.43s/it]
0%| | 4/117540 [01:14<594:36:26, 18.21s/it]
0%| | 5/117540 [01:31<588:21:06, 18.02s/it]
0%| | 6/117540 [01:49<587:02:14, 17.98s/it]
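As a sanity check that data loading matches the tutorial, the step counts in the log can be reproduced from the dataset size. This is only a sketch: it assumes the steps per epoch are computed as ceil(num_train / batch_size), which I have not confirmed in the repo. The 18 s/it figure is rounded from the tqdm output and was measured without train_op, so it presumably excludes the gradient update.

```python
import math

# Values copied from the log above.
num_train = 47005
batch_size = 12
epochs = 30

# Assumption: the last partial batch counts as a full step.
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 3918 117540

# Rough wall-clock estimate at about 18 s per step (rounded from tqdm).
sec_per_step = 18.0
estimated_hours = total_steps * sec_per_step / 3600
print("estimated hours:", round(estimated_hours))
```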
Could you check that code/train.py runs correctly on your side, and share the packages installed in your environment (e.g. the output of pip list)?
The conda environment I used to run your code includes:
- python3 (3.7)
- tensorflow 1.15
Both are mentioned in the README.
All preprocessing steps have been completed, so I don't have any clue how to solve the problem.
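To make the comparison easier, a short script like the one below could be run in both environments instead of pasting a full pip list. It is a minimal sketch using only the standard library (with a pkg_resources fallback for Python 3.7). `package_version` and `environment_report` are hypothetical helper names, and the package list is just my guess at what is relevant.

```python
import platform


def package_version(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.7 (ships with setuptools)
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("tensorflow", "tensorflow-gpu", "numpy")):
    """Build one line per item: the Python version, then each package version."""
    lines = ["python " + platform.python_version()]
    for name in packages:
        lines.append("{} {}".format(name, package_version(name) or "not installed"))
    return lines


print("\n".join(environment_report()))
```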
Your help would be much appreciated, I'm close to making this thing work! Thanks for your time.