The container launched successfully, and I extracted the images and the faceset. Then came the training phase; here is the output:
$ bash scripts/6_train_Quick96.sh
Running trainer.
[new] No saved models found. Enter a name of a new model : quick96
quick96
Model first run.
Choose one or several GPU idxs (separated by comma).
[CPU] : CPU
[0] : Tesla T4
[0] Which GPU indexes to choose? :
0
Initializing models: 100%|###################################################################################################################################################| 5/5 [00:01<00:00, 3.31it/s]
Loading samples: 100%|################################################################################################################################################| 1222/1222 [00:02<00:00, 436.58it/s]
Loading samples: 100%|################################################################################################################################################| 1217/1217 [00:02<00:00, 523.04it/s]
============ Model Summary =============
== ==
== Model name: quick96_Quick96 ==
== ==
== Current iteration: 0 ==
== ==
==---------- Model Options -----------==
== ==
== batch_size: 4 ==
== ==
==------------ Running On ------------==
== ==
== Device index: 0 ==
== Name: Tesla T4 ==
== VRAM: 0.02GB ==
== ==
========================================
Starting. Press "Enter" to stop training and save model.
Trying to do the first iteration. If an error occurs, reduce the model parameters.
Error: 2 root error(s) found.
(0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Original stack trace for 'decoder_dst/res0/conv1/weight/read':
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
debug=debug)
File "/deepfacelab/models/ModelBase.py", line 193, in __init__
self.on_initialize()
File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
self.build()
File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
self._build_sub(v[name],name)
File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
layer.build()
File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
self._build_sub(v[name],name)
File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
layer.build_weights()
File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
ret = gen_array_ops.identity(input, name=name)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
"Identity", input=input, name=name)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
op_def=op_def)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
self._traceback = tf_stack.extract_stack()
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[{{node decoder_dst/res0/conv1/weight/read}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/deepfacelab/mainscripts/Trainer.py", line 129, in trainerThread
iter, iter_time = model.train_one_iter()
File "/usr/local/deepfacelab/models/ModelBase.py", line 474, in train_one_iter
losses = self.onTrainOneIter()
File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 276, in onTrainOneIter
warped_dst, target_dst, target_dstm)
File "/usr/local/deepfacelab/models/Model_Quick96/Model.py", line 178, in src_dst_train
self.target_dstm:target_dstm,
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
run_metadata_ptr)
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
run_metadata)
File "/usr/local/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;0000000000000001;/job:localhost/replica:0/task:0/device:GPU:0;edge_1360_decoder_dst/res0/conv1/weight/read;0:0
[[node decoder_dst/res0/conv1/weight/read (defined at /deepfacelab/core/leras/layers/Conv2D.py:61) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[gradients/Reshape_18_grad/Reshape/_579]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Original stack trace for 'decoder_dst/res0/conv1/weight/read':
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/anaconda3/envs/deepfacelab/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/deepfacelab/mainscripts/Trainer.py", line 58, in trainerThread
debug=debug)
File "/deepfacelab/models/ModelBase.py", line 193, in __init__
self.on_initialize()
File "/deepfacelab/models/Model_Quick96/Model.py", line 73, in on_initialize
self.src_dst_trainable_weights = self.encoder.get_weights() + self.inter.get_weights() + self.decoder_src.get_weights() + self.decoder_dst.get_weights()
File "/deepfacelab/core/leras/models/ModelBase.py", line 77, in get_weights
self.build()
File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
self._build_sub(v[name],name)
File "/deepfacelab/core/leras/models/ModelBase.py", line 35, in _build_sub
layer.build()
File "/deepfacelab/core/leras/models/ModelBase.py", line 65, in build
self._build_sub(v[name],name)
File "/deepfacelab/core/leras/models/ModelBase.py", line 33, in _build_sub
layer.build_weights()
File "/deepfacelab/core/leras/layers/Conv2D.py", line 61, in build_weights
self.weight = tf.get_variable("weight", (self.kernel_size,self.kernel_size,self.in_ch,self.out_ch), dtype=self.dtype, initializer=kernel_initializer, trainable=self.trainable )
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1593, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1336, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 591, in get_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 543, in _true_getter
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 961, in _get_single_variable
aggregation=aggregation)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 260, in __call__
return cls._variable_v1_call(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 221, in _variable_v1_call
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 199, in <lambda>
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variable_scope.py", line 2634, in default_variable_creator
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1668, in __init__
shape=shape)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/variables.py", line 1861, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 287, in identity
ret = gen_array_ops.identity(input, name=name)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3943, in identity
"Identity", input=input, name=name)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
op_def=op_def)
File "/anaconda3/envs/deepfacelab/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
self._traceback = tf_stack.extract_stack()
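One detail in the model summary stands out: it reports `VRAM: 0.02GB` for the Tesla T4, far less than the card physically has. Before reducing model parameters, it may be worth checking how much GPU memory is actually free inside the container. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH in the container; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of DeepFaceLab.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse rows of 'name, total, used, free' (MiB values, no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        name, total, used, free = [field.strip() for field in line.split(",")]
        gpus.append(
            {
                "name": name,
                "total_mib": int(total),
                "used_mib": int(used),
                "free_mib": int(free),
            }
        )
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory figures (assumes it is installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.used,memory.free",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpu_memory():
            print(
                f"{gpu['name']}: {gpu['free_mib']} MiB free "
                f"of {gpu['total_mib']} MiB ({gpu['used_mib']} MiB used)"
            )
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        # nvidia-smi missing, failing, or printing a value that is not a number
        print(f"Could not query GPU memory: {exc}")
```

If the free figure is close to zero while nothing is training, another process may be holding the GPU memory, which could explain the out-of-memory error on the very first iteration.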