rm /tmp/agent -Rf; python3 /thomas_rl/src/agents/vtrace/thomas/main.py --run_mode=learner --job-dir='gs://xxxxxx//XXXXXX_20200921085830/' --logtostderr --pdb_post_mortem --num_envs=16 --env_batch_size=4 --tpu_name=boost-7j6bk
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@484f4c16e62e:/thomas_rl/docker# rm /tmp/agent -Rf; python3 /thomas_rl/src/agents/vtrace/thomas/main.py --run_mode=learner --job-dir='gs://xxxxxx//XXXXXX_20200921085830/' --logtostderr --pdb_post_mortem --num_envs=16 --env_batch_size=4 --tpu_name=boost-7j6bk
2020-09-21 09:58:32.157014: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
$$$$$$$$\ $$\ $$\ $$$$$$\ $$\ $$\ $$$$$$\ $$$$$$\
\__$$ __|$$ | $$ |$$ __$$\ $$$\ $$$ |$$ __$$\ $$ __$$\
$$ | $$ | $$ |$$ / $$ |$$$$\ $$$$ |$$ / $$ |$$ / \__|
$$ | $$$$$$$$ |$$ | $$ |$$\$$\$$ $$ |$$$$$$$$ |\$$$$$$\
$$ | $$ __$$ |$$ | $$ |$$ \$$$ $$ |$$ __$$ | \____$$\
$$ | $$ | $$ |$$ | $$ |$$ |\$ /$$ |$$ | $$ |$$\ $$ |
$$ | $$ | $$ | $$$$$$ |$$ | \_/ $$ |$$ | $$ |\$$$$$$ |
\__| \__| \__| \______/ \__| \__|\__| \__| \______/ version b97dd88
python /thomas_rl/src/agents/vtrace/thomas/main.py --run_mode=learner --job-dir=gs://xxxxxx//XXXXXX_20200921085830/ --logtostderr --pdb_post_mortem --num_envs=16 --env_batch_size=4 --tpu_name=boost-7j6bk
I0921 09:58:35.211318 140059370288960 learner.py:193] Starting learner loop
inference-batch-size: 16
I0921 09:58:35.273476 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
I0921 09:58:35.358564 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
2020-09-21 09:58:35.413014: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-21 09:58:35.413081: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: UNKNOWN ERROR (-1)
2020-09-21 09:58:35.413104: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (484f4c16e62e): /proc/driver/nvidia/version does not exist
I0921 09:58:35.442526 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
I0921 09:58:35.536518 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
2020-09-21 09:58:35.582941: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-21 09:58:35.592080: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2299995000 Hz
2020-09-21 09:58:35.594242: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a57340 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-21 09:58:35.594291: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-09-21 09:58:35.613283: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-09-21 09:58:35.613358: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30102}
2020-09-21 09:58:35.629874: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-09-21 09:58:35.629946: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30102}
2020-09-21 09:58:35.630474: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:30102
I0921 09:58:35.631092 140059370288960 remote.py:218] Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
1600678715 INFO: detected a TPU: [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')]
INFO:tensorflow:Initializing the TPU system: boost-7j6bk
I0921 09:58:35.632414 140059370288960 tpu_strategy_util.py:73] Initializing the TPU system: boost-7j6bk
INFO:tensorflow:Clearing out eager caches
I0921 09:58:45.013357 140059370288960 tpu_strategy_util.py:108] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I0921 09:58:45.015173 140059370288960 tpu_strategy_util.py:131] Finished initializing TPU system.
W0921 09:58:45.015923 140059370288960 tpu_strategy.py:320] `tf.distribute.experimental.TPUStrategy` is deprecated, please use the non experimental symbol `tf.distribute.TPUStrategy` instead.
I0921 09:58:45.048566 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
I0921 09:58:45.124522 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
INFO:tensorflow:Found TPU system:
I0921 09:58:45.178662 140059370288960 tpu_system_metadata.py:159] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I0921 09:58:45.178842 140059370288960 tpu_system_metadata.py:160] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I0921 09:58:45.178928 140059370288960 tpu_system_metadata.py:161] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I0921 09:58:45.178997 140059370288960 tpu_system_metadata.py:163] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0921 09:58:45.179058 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0921 09:58:45.179831 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I0921 09:58:45.179914 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I0921 09:58:45.179981 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I0921 09:58:45.180045 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I0921 09:58:45.180108 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I0921 09:58:45.180172 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I0921 09:58:45.180235 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I0921 09:58:45.180303 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I0921 09:58:45.180362 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I0921 09:58:45.180426 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I0921 09:58:45.180494 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I0921 09:58:45.180557 140059370288960 tpu_system_metadata.py:165] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
W0921 09:58:45.181079 140059370288960 tpu_strategy.py:320] `tf.distribute.experimental.TPUStrategy` is deprecated, please use the non experimental symbol `tf.distribute.TPUStrategy` instead.
I0921 09:58:45.218042 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
I0921 09:58:45.292918 140059370288960 transport.py:157] Attempting refresh to obtain initial access_token
1600678725 INFO: Creating environment: thomas-base-v1 -- id: 0
1600678725 INFO: FC layers size : 512, lstm cell size: (256, 256)
2020-09-21 09:58:50.191816: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
2020-09-21 09:58:53.162754: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: 'GrpcServerResourceHandleOp' is neither a type of a primitive operation nor a name of a function registered in binary running on n-b0fdb3cc-w-0. Make sure the operation or function is registered in the binary running in this process.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/thomas_rl/src/agents/vtrace/thomas/main.py", line 43, in main
    create_optimizer)
  File "/thomas_rl/src/agents/vtrace/learner.py", line 436, in learner_loop
    create_host(i, host, inference_devices)
  File "/thomas_rl/src/agents/vtrace/learner.py", line 328, in create_host
    (action_specs, env_output_specs, agent_output_specs))
  File "/thomas_rl/src/common/utils.py", line 145, in __init__
    timestep_specs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py", line 635, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/nest.py", line 635, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/thomas_rl/src/common/utils.py", line 139, in create_unroll_variable
    [num_envs, self._full_length] + spec.shape.dims, dtype=spec.dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 2747, in wrapped
    tensor = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 2806, in zeros
    output = fill(shape, constant(zero, dtype=dtype), name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 239, in fill
    result = gen_array_ops.fill(dims, value, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 3402, in fill
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: 'GrpcServerResourceHandleOp' is neither a type of a primitive operation nor a name of a function registered in binary running on n-b0fdb3cc-w-0. Make sure the operation or function is registered in the binary running in this process. [Op:Fill]
*** Entering post-mortem debugging ***
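Note on the failure: the `NotFoundError` suggests that `GrpcServerResourceHandleOp` -- a custom op presumably loaded in the client process via something like `tf.load_op_library` -- is not registered in the binary running on the TPU worker host (`n-b0fdb3cc-w-0`), so the graph placed there cannot be executed. A hedged way to check which ops a given process's TensorFlow binary registers (this uses the TF-internal `op_def_registry` module, so it may change between versions):

```python
# Sketch: query the op registry of *this* process's TensorFlow binary.
# An op missing from the remote worker's registry produces the same
# NotFoundError seen in the traceback above.
from tensorflow.python.framework import op_def_registry


def op_is_registered(op_name):
    """Return True if this process's TF binary registers `op_name`."""
    return op_def_registry.get(op_name) is not None


print(op_is_registered("Fill"))  # built-in op, registered: True
# Custom op from the log; False here unless its shared library was loaded.
print(op_is_registered("GrpcServerResourceHandleOp"))
```

Running this on the client and comparing against the worker binary would confirm whether the custom op library is missing from the worker side.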