Hi, first of all, thank you so much for sharing such amazing work & code.
I really loved the idea and the results of this paper, and am trying to apply some ideas on top of this.
However, I have run into some problems. I trained the model on the dmc_vision dmc_walker_walk task on GPUs with 16 GB and 24 GB of VRAM, but received an out-of-memory error. Reducing the batch size to 1 did not fix the problem.
Also, when I ran this on GPUs with less VRAM (8 GB or 12 GB), the training process got stuck after 8008 steps, about 3-5 minutes after training starts. The paper says training can be done in one day on a V100 GPU, which has 32 GB of VRAM, so I was wondering whether I need a GPU with more VRAM to train this model. That seems likely, since running dmc_proprio works without any problem; I suspect the CNN encoder used for vision tasks is what causes it. Is there a way to run training on a GPU with less VRAM?
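For reference, one thing that might be worth trying (I have not confirmed it fixes this particular OOM) is asking TensorFlow to allocate GPU memory on demand instead of grabbing nearly all VRAM up front. This uses only standard TensorFlow APIs, nothing specific to this repo:

```python
import os

# Must be set before TensorFlow initializes the GPU runtime.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

# Programmatic equivalent of the env var; call before any op runs.
# On a machine without GPUs this loop simply does nothing.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```

This at least makes the allocator less greedy, though if the model itself needs more memory than the card has, it will still OOM.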
Assuming that lack of VRAM is the problem, I also tried using multiple GPUs with the "multi_gpu" and "multi_worker" configurations in tfagent.py, but now I am getting a new error:
```
metrics.update(self.model_opt(model_tape, model_loss, modules))
  File "/vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfutils.py", line 246, in __call__  *
    self._opt.apply_gradients(
  File "/vol/bitbucket/xmbhrl/lib/python3.10/site-packages/keras/optimizer_v2/optimizer_v2.py", line 671, in apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can
often happen if the function `fn` passed to `strategy.run()` contains a nested
`@tf.function`, and the nested `@tf.function` contains a synchronization point, such as
aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a
control flow statement which contains a synchronization point in the body. Such
behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control
flow statements that may potentially cross a synchronization boundary, for example,
wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a
`tf.function` or move the control flow out of `fn`. If you are subclassing a
`tf.keras.Model`, please avoid decorating overridden methods `test_step` and
`train_step` in `tf.function`.
```
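From what I understand, the error message itself describes the fix: there should be a single outer tf.function wrapping strategy.run, with no nested @tf.function around the gradient application. Here is a minimal, self-contained sketch of that pattern, using a toy Dense model rather than the repo's actual tfutils optimizer, so I am not sure how directly it maps onto the Director code:

```python
import tensorflow as tf

# Sketch of the structure the RuntimeError recommends: ONE outer
# tf.function wraps strategy.run, and the per-replica step function
# is plain Python (no nested @tf.function around apply_gradients).
strategy = tf.distribute.MirroredStrategy()  # falls back to CPU if no GPUs

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.build((None, 4))  # create variables outside the traced function
    opt = tf.keras.optimizers.SGD(0.1)

@tf.function  # the only tf.function; nothing inside step() is decorated
def distributed_train_step(x, y):
    def step(x, y):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(x) - y))
        grads = tape.gradient(loss, model.trainable_variables)
        # The synchronization point (gradient aggregation) happens here;
        # it is allowed because we are inside the single outer tf.function.
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica_loss = strategy.run(step, args=(x, y))
    return strategy.reduce(
        tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
loss = distributed_train_step(x, y)
```

If the repo's train step is itself decorated with @tf.function and then called from inside strategy.run, that nesting would match the error description, but I have not traced through the code far enough to confirm that.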
There is a good chance I am using the wrong TensorFlow version, so this may simply be a dependency issue on my end.
I checked the Dockerfile and saw that it uses TensorFlow 2.8 or 2.9, but with 2.9, JIT compilation failed for me.
It would be great to hear whether anyone else is facing similar issues or knows a solution to this problem. Thank you so much.