
kengz / slm-lab

1.2K stars, 49 watchers, 261 forks, 4.18 MB

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".

Home Page: https://slm-lab.gitbook.io/slm-lab/

License: MIT License

Languages: Python 97.80%, Shell 1.73%, Dockerfile 0.48%
Topics: pytorch, reinforcement-learning, deep-reinforcement-learning, benchmark, policy-gradient, dqn, ppo, sac, a2c, a3c

slm-lab's Introduction

SLM Lab

Modular Deep Reinforcement Learning framework in PyTorch.

Documentation:
https://slm-lab.gitbook.io/slm-lab/

NOTE: the book branch has been updated with issue fixes. For the original code from the book Foundations of Deep Reinforcement Learning, check out git tag v4.1.1 (i.e. git checkout v4.1.1).

Demo GIFs: PPO on Atari (BeamRider, Breakout, KungFuMaster, MsPacman, Pong, Qbert, Seaquest, SpaceInvaders) and SAC on continuous control (Ant, HalfCheetah, Hopper, Humanoid, InvertedDoublePendulum, InvertedPendulum, Reacher, Walker).

slm-lab's People

Contributors

allan-avatar1, amjadmajid, angel-ayala, ben-e, dd-iuonac, dvstter, ingambe, kengz, lgraesser, mjschock, mwcvitkovic, rafapi, rahim16, sebimarkgraf, sgillen, snyk-bot


slm-lab's Issues

Evolution Strategies and Genetic Algorithms Policy in DRL

Thank you so much for the nice work.
I want to implement one of the gradient-free algorithms in this project and compare the results with the algorithms already in this project, such as Actor-Critic, DQN, and REINFORCE.

I have code that works in PyTorch: https://towardsdatascience.com/reinforcement-learning-without-gradients-evolving-agents-using-genetic-algorithms-8685817d84f

Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning
https://arxiv.org/pdf/1712.06567.pdf

Evolution Strategies as a Scalable Alternative to Reinforcement Learning https://arxiv.org/pdf/1703.03864.pdf

How can I do the implementation?
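For reference, here is a minimal sketch of the parameter-perturbation update from the "Evolution Strategies as a Scalable Alternative" paper, written against a generic PyTorch policy. This is not SLM Lab code; the policy model, the eval_return_fn rollout function, and the hyperparameters are illustrative placeholders.

import torch

def flat_params(model):
    return torch.cat([p.data.view(-1) for p in model.parameters()])

def set_flat_params(model, flat):
    i = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[i:i + n].view_as(p))
        i += n

def es_step(model, eval_return_fn, pop_size=50, sigma=0.1, lr=0.01):
    # One OpenAI-ES style update: perturb the parameters with Gaussian noise,
    # evaluate each perturbed policy for one episode, then move the parameters
    # along the return-weighted average of the noise.
    theta = flat_params(model)
    noise = torch.randn(pop_size, theta.numel())
    returns = torch.zeros(pop_size)
    for k in range(pop_size):
        set_flat_params(model, theta + sigma * noise[k])
        returns[k] = eval_return_fn(model)  # placeholder: episode return of the perturbed policy
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize returns for stability
    grad_estimate = noise.t() @ adv / (pop_size * sigma)
    set_flat_params(model, theta + lr * grad_estimate)
    return returns.mean().item()

With a CartPole rollout plugged in as eval_return_fn, the mean return from this loop can be compared against the total_reward curves that SLM Lab logs for REINFORCE, DQN, or Actor-Critic trials.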

Need help understanding the REINFORCE algorithm implementation in the Foundations of Deep RL book

Hi Wah Loon / Laura,
I have started reading your book on deep RL and have enjoyed it so far.
Apologies for asking my question here, but I couldn't think of a better place to post it. Please let me know if there is a discussion forum for the book where I can ask questions going forward.
This question is about the first standalone torch implementation of the REINFORCE algorithm given in the book.
What I need help understanding is: as the criterion decreases, the reward should increase.
However, when I run the code, I observe that the reward increases along with the criterion.

The criterion (loss) is defined as
loss = - log_probs * rets  # gradient term; negative for maximizing
Because of the negative sign, isn't a lower criterion (loss) better? But the results seem contradictory.

Episode 0, loss: 240.3678741455078, total_reward: 27.0, solved: False
Episode 1, loss: 134.7480926513672, total_reward: 20.0, solved: False
Episode 2, loss: 47.81584930419922, total_reward: 12.0, solved: False
Episode 3, loss: 38.16853713989258, total_reward: 11.0, solved: False
Episode 4, loss: 130.42645263671875, total_reward: 20.0, solved: False
Episode 5, loss: 48.20455551147461, total_reward: 13.0, solved: False

...

Episode 295, loss: 6347.0849609375, total_reward: 200.0, solved: True !!!!!!
Episode 296, loss: 316.5134582519531, total_reward: 37.0, solved: False
Episode 297, loss: 6321.185546875, total_reward: 200.0, solved: True !!!!!!
Episode 298, loss: 6334.77197265625, total_reward: 200.0, solved: True !!!!!!
Episode 299, loss: 6197.91259765625, total_reward: 200.0, solved: True !!!!!!
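The pattern in these numbers is expected rather than contradictory: the loss is a sum over the episode of -log_prob * return terms, so its magnitude scales with the episode length and the size of the returns. A solved 200-step episode contributes many more, much larger terms than a 20-step one, even if the per-step log-probabilities are no worse. Below is a minimal sketch of that computation, following the book's standalone REINFORCE example; the discounted-return loop and names are reconstructed from memory, so treat the details as approximate.

import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # REINFORCE objective: sum_t of -log pi(a_t | s_t) * G_t,
    # where G_t is the discounted return from step t onward.
    T = len(rewards)
    rets = torch.zeros(T)
    future_ret = 0.0
    for t in reversed(range(T)):
        future_ret = rewards[t] + gamma * future_ret
        rets[t] = future_ret
    return -(log_probs * rets).sum()  # negative so a minimizer performs gradient ascent

# Toy illustration: identical per-step log-probs, only the episode length differs.
lp_short = torch.full((20,), -0.7)
lp_long = torch.full((200,), -0.7)
print(reinforce_loss(lp_short, [1.0] * 20).item())   # small loss, total reward 20
print(reinforce_loss(lp_long, [1.0] * 200).item())   # much larger loss, total reward 200

What matters for training is the direction of the gradient of this quantity, not its raw value, which is why total_reward rather than the loss is the metric to watch.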

Why do I get "terminating"?

Hi!

I get "terminating" when I train in search mode and connect to the env via gRPC. The log looks like this:
"(pid=2023) terminating"
There are no other logs about this "terminating", and my process is also killed at the same time.
Why do I get that?
@kengz
@lgraesser

Empty Multi trial Graph in 'search' mode

Describe the bug
Hello,
After following the quick-start guide and running the SARSA example on CartPole, everything works, from the terminal output to the session graphs, except that the multi-trial graph in 'search' mode is empty and no error stands out in the log. I tried re-installing, upgrading, and downgrading plotly-orca, but the multi-trial graph is still empty.

Sorry, the log is too long to post in full.

Command entered:
python run_lab.py slm_lab/spec/benchmark/sarsa/sarsa_cartpole.json sarsa_epsilon_greedy_cartpole search

To Reproduce

  1. OS and environment: Ubuntu 18.04 LTS
  2. SLM Lab git SHA (run git rev-parse HEAD to get it): dda02d0
  3. spec file used: slm_lab/spec/benchmark/sarsa/sarsa_cartpole.json
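For context, the Ray trials in the log below are a sweep over agent.0.net.optim_spec.lr (the values can be read off the ray_trainable names). In SLM Lab such a sweep is declared in the spec file's "search" section; the sketch below shows that section as a Python dict, using the __grid_search key-suffix convention as I understand it from the docs. Treat the exact syntax as an assumption and compare against the shipped sarsa_cartpole.json.

# Hypothetical sketch of the "search" section driving the logged trials;
# key names and the __grid_search suffix are assumed, not verified here.
search_section = {
    "agent": [{
        "net": {
            "optim_spec": {
                "lr__grid_search": [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]
            }
        }
    }]
}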

Additional context
(screenshots of the empty multi-trial graph attached)

Error logs

[2021-05-04 16:51:39,536 PID:3860 INFO run_lab.py get_spec_and_run] Running lab spec_file:assignment_2/code_3_6.json spec_name:sarsa_epsilon_greedy_cartpole in mode:search
[2021-05-04 16:51:39,546 PID:3860 INFO search.py run_ray_search] Running ray search for spec sarsa_epsilon_greedy_cartpole
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/2 GPUs
Memory usage on this node: 3.8/33.6 GB

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 4/16 CPUs, 0/2 GPUs
Memory usage on this node: 3.8/33.6 GB
Result logdir: /home/iwan/ray_results/sarsa_epsilon_greedy_cartpole
Number of trials: 7 ({'RUNNING': 1, 'PENDING': 6})
PENDING trials:
 - ray_trainable_1_agent.0.net.optim_spec.lr=0.001,trial_index=1:	PENDING
 - ray_trainable_2_agent.0.net.optim_spec.lr=0.001,trial_index=2:	PENDING
 - ray_trainable_3_agent.0.net.optim_spec.lr=0.005,trial_index=3:	PENDING
 - ray_trainable_4_agent.0.net.optim_spec.lr=0.01,trial_index=4:	PENDING
 - ray_trainable_5_agent.0.net.optim_spec.lr=0.05,trial_index=5:	PENDING
 - ray_trainable_6_agent.0.net.optim_spec.lr=0.1,trial_index=6:	PENDING
RUNNING trials:
 - ray_trainable_0_agent.0.net.optim_spec.lr=0.0005,trial_index=0:	RUNNING

(pid=3914) [2021-05-04 16:51:41,303 PID:3914 INFO logger.py info] Running sessions
(pid=3913) [2021-05-04 16:51:41,303 PID:3913 INFO logger.py info] Running sessions
(pid=3926) [2021-05-04 16:51:41,303 PID:3926 INFO logger.py info] Running sessions
(pid=3925) [2021-05-04 16:51:41,286 PID:3925 INFO logger.py info] Running sessions
(pid=3914) [2021-05-04 16:51:41,344 PID:4083 INFO openai.py __init__] OpenAIEnv:
(pid=3914) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=3914) - eval_frequency = 2000
(pid=3914) - log_frequency = 10000
(pid=3914) - frame_op = None
(pid=3914) - frame_op_len = None
(pid=3914) - image_downsize = (84, 84)
(pid=3914) - normalize_state = False
(pid=3914) - reward_scale = None
(pid=3914) - num_envs = 1
(pid=3914) - name = CartPole-v0
(pid=3914) - max_t = 200
(pid=3914) - max_frame = 100000
(pid=3914) - to_render = False
(pid=3914) - is_venv = False
(pid=3914) - clock_speed = 1
(pid=3914) - clock = <slm_lab.env.base.Clock object at 0x7fe01ea91ba8>
(pid=3914) - done = False
(pid=3914) - total_reward = nan
(pid=3914) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=3914) - observation_space = Box(4,)
(pid=3914) - action_space = Discrete(2)
(pid=3914) - observable_dim = {'state': 4}
(pid=3914) - action_dim = 2
(pid=3914) - is_discrete = True
(pid=3914) [2021-05-04 16:51:41,351 PID:4079 INFO openai.py __init__] OpenAIEnv:
(pid=3914) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=3914) - eval_frequency = 2000
(pid=3914) - log_frequency = 10000
(pid=3914) - frame_op = None
(pid=3914) - frame_op_len = None
(pid=3914) - image_downsize = (84, 84)
(pid=3914) - normalize_state = False
(pid=3914) - reward_scale = None
(pid=3914) - num_envs = 1
(pid=3914) - name = CartPole-v0
(pid=3914) - max_t = 200
(pid=3914) - max_frame = 100000
(pid=3914) - to_render = False
(pid=3914) - is_venv = False
(pid=3914) - clock_speed = 1
(pid=3914) - clock = <slm_lab.env.base.Clock object at 0x7fe284ef5cf8>
(pid=3914) - done = False
(pid=3914) - total_reward = nan
(pid=3914) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=3914) - observation_space = Box(4,)
(pid=3914) - action_space = Discrete(2)
(pid=3914) - observable_dim = {'state': 4}
(pid=3914) - action_dim = 2
(pid=3914) - is_discrete = True
(pid=3927) [2021-05-04 16:53:24,046 PID:6133 INFO logger.py info] Session:
(pid=3927) - spec = {'cuda_offset': 0,
(pid=3927)  'distributed': False,
(pid=3927)  'eval_frequency': 2000,
(pid=3927)  'experiment': 0,
(pid=3927)  'experiment_ts': '2021_05_04_165139',
(pid=3927)  'git_sha': 'dda02d00031553aeda4c49c5baa7d0706c53996b',
(pid=3927)  'graph_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/graph/sarsa_epsilon_greedy_cartpole_t6_s0',
(pid=3927)  'info_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/info/sarsa_epsilon_greedy_cartpole_t6_s0',
(pid=3927)  'log_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/log/sarsa_epsilon_greedy_cartpole_t6_s0',
(pid=3927)  'max_session': 4,
(pid=3927)  'max_trial': 1,
(pid=3927)  'model_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/model/sarsa_epsilon_greedy_cartpole_t6_s0',
(pid=3927)  'prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/sarsa_epsilon_greedy_cartpole_t6_s0',
(pid=3927)  'random_seed': 1620714804,
(pid=3927)  'resume': False,
(pid=3927)  'rigorous_eval': 0,
(pid=3927)  'session': 0,
(pid=3927)  'trial': 6}
(pid=3927) - index = 0
(pid=3927) - agent = <slm_lab.agent.Agent object at 0x7f9cecb11160>
(pid=3927) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7f9cecba8668>
(pid=3927) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7f9cecba8668>
(pid=3927) [2021-05-04 16:53:24,046 PID:6133 INFO logger.py info] Running RL loop for trial 6 session 0
(pid=3927) [2021-05-04 16:53:24,053 PID:6136 INFO base.py end_init_nets] Initialized algorithm models for lab_mode: search
(pid=3927) [2021-05-04 16:53:24,054 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.1  explore_var: 1  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:24,057 PID:6136 INFO base.py __init__] SARSA:
(pid=3927) - agent = <slm_lab.agent.Agent object at 0x7f9cecb134a8>
(pid=3927) - action_pdtype = Argmax
(pid=3927) - action_policy = <function epsilon_greedy at 0x7f9cf98f8400>
(pid=3927) - explore_var_spec = {'end_step': 10000,
(pid=3927)  'end_val': 0.05,
(pid=3927)  'name': 'linear_decay',
(pid=3927)  'start_step': 0,
(pid=3927)  'start_val': 1.0}
(pid=3927) - gamma = 0.99
(pid=3927) - training_frequency = 5
(pid=3927) - to_train = 0
(pid=3927) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7f9f52f49438>
(pid=3927) - net = MLPNet(
(pid=3927)   (model): Sequential(
(pid=3927)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=3927)     (1): SELU()
(pid=3927)   )
(pid=3927)   (model_tail): Sequential(
(pid=3927)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=3927)   )
(pid=3927)   (loss_fn): MSELoss()
(pid=3927) )
(pid=3927) - net_names = ['net']
(pid=3927) - optim = RMSprop (
(pid=3927) Parameter Group 0
(pid=3927)     alpha: 0.99
(pid=3927)     centered: False
(pid=3927)     eps: 1e-08
(pid=3927)     lr: 0.1
(pid=3927)     momentum: 0
(pid=3927)     weight_decay: 0
(pid=3927) )
(pid=3927) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7f9cecb2ca90>
(pid=3927) - global_net = None
(pid=3927) [2021-05-04 16:53:24,059 PID:6136 INFO __init__.py __init__] Agent:
(pid=3927) - spec = {'cuda_offset': 0,
(pid=3927)  'distributed': False,
(pid=3927)  'eval_frequency': 2000,
(pid=3927)  'experiment': 0,
(pid=3927)  'experiment_ts': '2021_05_04_165139',
(pid=3927)  'git_sha': 'dda02d00031553aeda4c49c5baa7d0706c53996b',
(pid=3927)  'graph_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/graph/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'info_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/info/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'log_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/log/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'max_session': 4,
(pid=3927)  'max_trial': 1,
(pid=3927)  'model_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/model/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'random_seed': 1620717804,
(pid=3927)  'resume': False,
(pid=3927)  'rigorous_eval': 0,
(pid=3927)  'session': 3,
(pid=3927)  'trial': 6}
(pid=3927) - agent_spec = {'algorithm': {'action_pdtype': 'Argmax',
(pid=3927)                'action_policy': 'epsilon_greedy',
(pid=3927)                'explore_var_spec': {'end_step': 10000,
(pid=3927)                                     'end_val': 0.05,
(pid=3927)                                     'name': 'linear_decay',
(pid=3927)                                     'start_step': 0,
(pid=3927)                                     'start_val': 1.0},
(pid=3927)                'gamma': 0.99,
(pid=3927)                'name': 'SARSA',
(pid=3927)                'training_frequency': 5},
(pid=3927)  'memory': {'name': 'OnPolicyBatchReplay'},
(pid=3927)  'name': 'SARSA',
(pid=3927)  'net': {'clip_grad_val': 0.5,
(pid=3927)          'hid_layers': [64],
(pid=3927)          'hid_layers_activation': 'selu',
(pid=3927)          'loss_spec': {'name': 'MSELoss'},
(pid=3927)          'lr_scheduler_spec': None,
(pid=3927)          'optim_spec': {'lr': 0.1, 'name': 'RMSprop'},
(pid=3927)          'type': 'MLPNet'}}
(pid=3927) - name = SARSA
(pid=3927) - body = body: {
(pid=3927)   "agent": "<slm_lab.agent.Agent object at 0x7f9cecb134a8>",
(pid=3927)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7f9cecba8978>",
(pid=3927)   "a": 0,
(pid=3927)   "e": 0,
(pid=3927)   "b": 0,
(pid=3927)   "aeb": "(0, 0, 0)",
(pid=3927)   "explore_var": 1.0,
(pid=3927)   "entropy_coef": NaN,
(pid=3927)   "loss": NaN,
(pid=3927)   "mean_entropy": NaN,
(pid=3927)   "mean_grad_norm": NaN,
(pid=3927)   "best_total_reward_ma": -Infinity,
(pid=3927)   "total_reward_ma": NaN,
(pid=3927)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=3927)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=3927)   "observation_space": "Box(4,)",
(pid=3927)   "action_space": "Discrete(2)",
(pid=3927)   "observable_dim": {
(pid=3927)     "state": 4
(pid=3927)   },
(pid=3927)   "state_dim": 4,
(pid=3927)   "action_dim": 2,
(pid=3927)   "is_discrete": true,
(pid=3927)   "action_type": "discrete",
(pid=3927)   "action_pdtype": "Argmax",
(pid=3927)   "ActionPD": "<class 'slm_lab.lib.distribution.Argmax'>",
(pid=3927)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyBatchReplay object at 0x7f9cecb13630>"
(pid=3927) }
(pid=3927) - algorithm = <slm_lab.agent.algorithm.sarsa.SARSA object at 0x7f9cecb2c7b8>
(pid=3927) [2021-05-04 16:53:24,060 PID:6136 INFO logger.py info] Session:
(pid=3927) - spec = {'cuda_offset': 0,
(pid=3927)  'distributed': False,
(pid=3927)  'eval_frequency': 2000,
(pid=3927)  'experiment': 0,
(pid=3927)  'experiment_ts': '2021_05_04_165139',
(pid=3927)  'git_sha': 'dda02d00031553aeda4c49c5baa7d0706c53996b',
(pid=3927)  'graph_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/graph/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'info_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/info/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'log_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/log/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'max_session': 4,
(pid=3927)  'max_trial': 1,
(pid=3927)  'model_prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/model/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'prepath': 'data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139/sarsa_epsilon_greedy_cartpole_t6_s3',
(pid=3927)  'random_seed': 1620717804,
(pid=3927)  'resume': False,
(pid=3927)  'rigorous_eval': 0,
(pid=3927)  'session': 3,
(pid=3927)  'trial': 6}
(pid=3927) - index = 3
(pid=3927) - agent = <slm_lab.agent.Agent object at 0x7f9cecb134a8>
(pid=3927) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7f9cecba8978>
(pid=3927) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7f9cecba8978>
(pid=3927) [2021-05-04 16:53:24,060 PID:6136 INFO logger.py info] Running RL loop for trial 6 session 3
(pid=3927) [2021-05-04 16:53:24,067 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.1  explore_var: 1  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:28,061 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 240  t: 52  wall_t: 4  opt_step: 12000  frame: 10000  fps: 2500  total_reward: 97  total_reward_ma: 97  loss: 17.4592  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:28,120 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 427  t: 7  wall_t: 4  opt_step: 12000  frame: 10000  fps: 2500  total_reward: 67  total_reward_ma: 67  loss: 0.383154  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:28,620 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 292  t: 6  wall_t: 4  opt_step: 12000  frame: 10000  fps: 2500  total_reward: 10  total_reward_ma: 10  loss: 117.837  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:28,679 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 277  t: 84  wall_t: 4  opt_step: 12000  frame: 10000  fps: 2500  total_reward: 20  total_reward_ma: 20  loss: 61.2264  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:30,219 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 224  t: 14  wall_t: 6  opt_step: 12000  frame: 10000  fps: 1666.67  total_reward: 200  total_reward_ma: 200  loss: 4.94699  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:30,304 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 253  t: 11  wall_t: 6  opt_step: 12000  frame: 10000  fps: 1666.67  total_reward: 11  total_reward_ma: 11  loss: 37.0297  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:30,310 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 253  t: 11  wall_t: 6  opt_step: 12000  frame: 10000  fps: 1666.67  total_reward: 11  total_reward_ma: 11  loss: 37.0297  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:30,457 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 226  t: 170  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 140  total_reward_ma: 140  loss: 0.199613  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:30,476 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 330  t: 160  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 185  total_reward_ma: 185  loss: 0.56581  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:30,566 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 431  t: 9  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 14  total_reward_ma: 14  loss: 0.321296  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:30,591 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 394  t: 13  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 27  total_reward_ma: 27  loss: 1.83942  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:31,291 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 367  t: 10  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 10  total_reward_ma: 10  loss: 37.3157  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:31,297 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 367  t: 10  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 10  total_reward_ma: 10  loss: 37.3157  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:31,610 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 298  t: 6  wall_t: 7  opt_step: 12000  frame: 10000  fps: 1428.57  total_reward: 19  total_reward_ma: 19  loss: 2.98665  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:33,588 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 317  t: 39  wall_t: 10  opt_step: 24000  frame: 20000  fps: 2000  total_reward: 13  total_reward_ma: 55  loss: 5.29746  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:33,597 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 55  strength: 33.14  max_strength: 75.14  final_strength: -8.86  sample_efficiency: 0.000106684  training_efficiency: 8.89031e-05  stability: -0.117913
(pid=3919) [2021-05-04 16:53:33,729 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 741  t: 43  wall_t: 10  opt_step: 24000  frame: 20000  fps: 2000  total_reward: 72  total_reward_ma: 69.5  loss: 0.859354  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:33,739 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 69.5  strength: 47.64  max_strength: 50.14  final_strength: 50.14  sample_efficiency: 7.36881e-05  training_efficiency: 6.14067e-05  stability: 1
(pid=3927) [2021-05-04 16:53:34,169 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 1045  t: 3  wall_t: 10  opt_step: 24000  frame: 20000  fps: 2000  total_reward: 15  total_reward_ma: 12.5  loss: 10.5568  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:34,178 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 12.5  strength: -9.36  max_strength: -6.86  final_strength: -6.86  sample_efficiency: 8.16774e-05  training_efficiency: 6.80645e-05  stability: 1
(pid=3927) [2021-05-04 16:53:34,261 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 1044  t: 28  wall_t: 10  opt_step: 24000  frame: 20000  fps: 2000  total_reward: 26  total_reward_ma: 23  loss: 10.2876  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:34,271 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 23  strength: 1.14  max_strength: 4.14  final_strength: 4.14  sample_efficiency: 9.21049e-06  training_efficiency: 7.67541e-06  stability: 1
(pid=3919) [2021-05-04 16:53:37,573 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 835  t: 4  wall_t: 14  opt_step: 24000  frame: 20000  fps: 1428.57  total_reward: 13  total_reward_ma: 99  loss: 17498.1  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:37,586 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 99  strength: 77.14  max_strength: 163.14  final_strength: -8.86  sample_efficiency: 0.000102871  training_efficiency: 8.57262e-05  stability: -0.0543091
(pid=3917) [2021-05-04 16:53:38,632 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 298  t: 81  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 66  total_reward_ma: 133  loss: 12.1518  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:38,645 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 133  strength: 111.14  max_strength: 178.14  final_strength: 44.14  sample_efficiency: 9.00711e-05  training_efficiency: 7.50592e-05  stability: 0.247783
(pid=3917) [2021-05-04 16:53:38,772 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 688  t: 15  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 8  total_reward_ma: 9.5  loss: 1.01667  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:38,785 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 9.5  strength: -12.36  max_strength: -10.86  final_strength: -13.86  sample_efficiency: 7.1966e-05  training_efficiency: 5.99717e-05  stability: 0.723757
(pid=3917) [2021-05-04 16:53:39,092 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 296  t: 80  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 200  total_reward_ma: 170  loss: 0.940201  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:39,106 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 170  strength: 148.14  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 6.99372e-05  training_efficiency: 5.8281e-05  stability: 1
(pid=3919) [2021-05-04 16:53:39,214 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 875  t: 49  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 11  total_reward_ma: 12.5  loss: 0.807788  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:39,228 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 12.5  strength: -9.36  max_strength: -7.86  final_strength: -10.86  sample_efficiency: 7.09936e-05  training_efficiency: 5.91613e-05  stability: 0.618321
(pid=3919) [2021-05-04 16:53:39,239 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 804  t: 18  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 30  total_reward_ma: 28.5  loss: 3.89151  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:39,254 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 28.5  strength: 6.64  max_strength: 8.14  final_strength: 8.14  sample_efficiency: 6.93524e-05  training_efficiency: 5.77937e-05  stability: 1
(pid=3919) [2021-05-04 16:53:39,423 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 857  t: 139  wall_t: 16  opt_step: 36000  frame: 30000  fps: 1875  total_reward: 200  total_reward_ma: 113  loss: 0.0415456  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:39,431 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 113  strength: 91.14  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 4.73959e-05  training_efficiency: 3.94966e-05  stability: 1
(pid=3927) [2021-05-04 16:53:39,782 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 1586  t: 7  wall_t: 15  opt_step: 36000  frame: 30000  fps: 2000  total_reward: 42  total_reward_ma: 22.3333  loss: 0.867192  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:39,790 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 22.3333  strength: 0.473333  max_strength: 20.14  final_strength: 20.14  sample_efficiency: -0.000603992  training_efficiency: -0.000503326  stability: 1
(pid=3927) [2021-05-04 16:53:39,961 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 1271  t: 25  wall_t: 15  opt_step: 24000  frame: 20000  fps: 1333.33  total_reward: 11  total_reward_ma: 10.5  loss: 7.17299  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:39,968 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 1738  t: 4  wall_t: 15  opt_step: 36000  frame: 30000  fps: 2000  total_reward: 46  total_reward_ma: 30.6667  loss: 1.031  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:39,975 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 10.5  strength: -11.36  max_strength: -10.86  final_strength: -10.86  sample_efficiency: 7.61004e-05  training_efficiency: 6.3417e-05  stability: 1
(pid=3927) [2021-05-04 16:53:39,979 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 30.6667  strength: 8.80667  max_strength: 24.14  final_strength: 24.14  sample_efficiency: 3.12516e-05  training_efficiency: 2.6043e-05  stability: 1
(pid=3927) [2021-05-04 16:53:40,414 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 1003  t: 1  wall_t: 16  opt_step: 24000  frame: 20000  fps: 1250  total_reward: 11  total_reward_ma: 15  loss: 107.965  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:40,428 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 15  strength: -6.86  max_strength: -2.86  final_strength: -10.86  sample_efficiency: 6.04227e-05  training_efficiency: 5.03523e-05  stability: -1.7972
(pid=3917) [2021-05-04 16:53:41,866 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 434  t: 8  wall_t: 18  opt_step: 36000  frame: 30000  fps: 1666.67  total_reward: 41  total_reward_ma: 50.3333  loss: 6.48499  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:41,877 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 50.3333  strength: 28.4733  max_strength: 75.14  final_strength: 19.14  sample_efficiency: 9.02482e-05  training_efficiency: 7.52068e-05  stability: -0.267351
(pid=3919) [2021-05-04 16:53:43,375 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 1306  t: 7  wall_t: 20  opt_step: 36000  frame: 30000  fps: 1500  total_reward: 9  total_reward_ma: 69  loss: 5.27055  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:43,390 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 69  strength: 47.14  max_strength: 163.14  final_strength: -12.86  sample_efficiency: 0.000109195  training_efficiency: 9.09957e-05  stability: -0.140783
(pid=3919) [2021-05-04 16:53:45,081 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 912  t: 195  wall_t: 21  opt_step: 48000  frame: 40000  fps: 1904.76  total_reward: 200  total_reward_ma: 134.75  loss: 0.207668  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:45,090 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 134.75  strength: 112.89  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.85608e-05  training_efficiency: 3.2134e-05  stability: 1
(pid=3927) [2021-05-04 16:53:45,479 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 2422  t: 7  wall_t: 21  opt_step: 48000  frame: 40000  fps: 1904.76  total_reward: 13  total_reward_ma: 26.25  loss: 6.85835  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:45,481 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 2237  t: 10  wall_t: 21  opt_step: 48000  frame: 40000  fps: 1904.76  total_reward: 12  total_reward_ma: 19.75  loss: 12.048  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:45,486 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 26.25  strength: 4.39  max_strength: 24.14  final_strength: -8.86  sample_efficiency: 3.44058e-05  training_efficiency: 2.86715e-05  stability: -0.249054
(pid=3927) [2021-05-04 16:53:45,488 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 19.75  strength: -2.11  max_strength: 20.14  final_strength: -9.86  sample_efficiency: 0.000130825  training_efficiency: 0.000109021  stability: -20.1268
(pid=3917) [2021-05-04 16:53:47,089 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 518  t: 86  wall_t: 23  opt_step: 36000  frame: 30000  fps: 1304.35  total_reward: 153  total_reward_ma: 139.667  loss: 3.09447  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:47,100 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 139.667  strength: 117.807  max_strength: 178.14  final_strength: 131.14  sample_efficiency: 6.9018e-05  training_efficiency: 5.7515e-05  stability: 0.397157
(pid=3917) [2021-05-04 16:53:47,256 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1000  t: 53  wall_t: 23  opt_step: 36000  frame: 30000  fps: 1304.35  total_reward: 200  total_reward_ma: 73  loss: 1.48546  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:47,268 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 73  strength: 51.14  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.71086e-05  training_efficiency: 2.25905e-05  stability: 0.878641
(pid=3917) [2021-05-04 16:53:47,770 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 527  t: 1  wall_t: 24  opt_step: 36000  frame: 30000  fps: 1250  total_reward: 9  total_reward_ma: 116.333  loss: 130.66  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:47,781 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 116.333  strength: 94.4733  max_strength: 178.14  final_strength: -12.86  sample_efficiency: 7.15981e-05  training_efficiency: 5.96651e-05  stability: 0.35534
(pid=3919) [2021-05-04 16:53:47,874 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 1155  t: 18  wall_t: 24  opt_step: 36000  frame: 30000  fps: 1250  total_reward: 17  total_reward_ma: 14  loss: 39.2763  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:47,886 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 14  strength: -7.86  max_strength: -4.86  final_strength: -4.86  sample_efficiency: 6.32316e-05  training_efficiency: 5.2693e-05  stability: 0.839744
(pid=3919) [2021-05-04 16:53:47,887 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 1215  t: 4  wall_t: 24  opt_step: 36000  frame: 30000  fps: 1250  total_reward: 15  total_reward_ma: 24  loss: 156121  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:47,898 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 24  strength: 2.14  max_strength: 8.14  final_strength: -6.86  sample_efficiency: 0.00010784  training_efficiency: 8.98667e-05  stability: -0.129518
(pid=3927) [2021-05-04 16:53:48,633 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 1921  t: 50  wall_t: 24  opt_step: 36000  frame: 30000  fps: 1250  total_reward: 55  total_reward_ma: 25.3333  loss: 20.6762  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:48,652 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 25.3333  strength: 3.47333  max_strength: 33.14  final_strength: 33.14  sample_efficiency: -5.99169e-05  training_efficiency: -4.99307e-05  stability: 1
(pid=3927) [2021-05-04 16:53:49,077 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 1670  t: 38  wall_t: 25  opt_step: 36000  frame: 30000  fps: 1200  total_reward: 83  total_reward_ma: 37.6667  loss: 3.53205  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:49,089 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 37.6667  strength: 15.8067  max_strength: 61.14  final_strength: 61.14  sample_efficiency: 2.54956e-05  training_efficiency: 2.12463e-05  stability: 0.41691
(pid=3919) [2021-05-04 16:53:49,120 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 1857  t: 2  wall_t: 25  opt_step: 48000  frame: 40000  fps: 1600  total_reward: 18  total_reward_ma: 56.25  loss: 675317  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:49,131 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 56.25  strength: 34.39  max_strength: 163.14  final_strength: -3.86  sample_efficiency: 0.000111557  training_efficiency: 9.29645e-05  stability: -0.24452
(pid=3917) [2021-05-04 16:53:50,538 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 529  t: 91  wall_t: 27  opt_step: 48000  frame: 40000  fps: 1481.48  total_reward: 12  total_reward_ma: 40.75  loss: 14.1391  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:50,549 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 40.75  strength: 18.89  max_strength: 75.14  final_strength: -9.86  sample_efficiency: 9.87626e-05  training_efficiency: 8.23021e-05  stability: -0.322875
(pid=3919) [2021-05-04 16:53:50,698 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 969  t: 196  wall_t: 27  opt_step: 60000  frame: 50000  fps: 1851.85  total_reward: 200  total_reward_ma: 147.8  loss: 0.55013  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:50,705 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 147.8  strength: 125.94  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.331e-05  training_efficiency: 2.77583e-05  stability: 1
(pid=3927) [2021-05-04 16:53:50,984 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 3093  t: 61  wall_t: 26  opt_step: 60000  frame: 50000  fps: 1923.08  total_reward: 25  total_reward_ma: 26  loss: 0.702367  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:50,992 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 26  strength: 4.14  max_strength: 24.14  final_strength: 3.14  sample_efficiency: 3.22206e-05  training_efficiency: 2.68505e-05  stability: -0.879271
(pid=3927) [2021-05-04 16:53:51,150 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 2826  t: 3  wall_t: 27  opt_step: 60000  frame: 50000  fps: 1851.85  total_reward: 8  total_reward_ma: 17.4  loss: 18.9916  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:51,160 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 17.4  strength: -4.46  max_strength: 20.14  final_strength: -13.86  sample_efficiency: 6.19447e-05  training_efficiency: 5.16206e-05  stability: -3.02843
(pid=3919) [2021-05-04 16:53:54,782 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 2369  t: 35  wall_t: 31  opt_step: 60000  frame: 50000  fps: 1612.9  total_reward: 45  total_reward_ma: 54  loss: 29.9116  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:54,789 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 54  strength: 32.14  max_strength: 163.14  final_strength: 23.14  sample_efficiency: 9.83736e-05  training_efficiency: 8.1978e-05  stability: -0.279442
(pid=3917) [2021-05-04 16:53:55,547 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 880  t: 2  wall_t: 32  opt_step: 48000  frame: 40000  fps: 1250  total_reward: 8  total_reward_ma: 106.75  loss: 168.905  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:55,558 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 106.75  strength: 84.89  max_strength: 178.14  final_strength: -13.86  sample_efficiency: 7.08147e-05  training_efficiency: 5.90122e-05  stability: 0.210571
(pid=3917) [2021-05-04 16:53:55,738 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1136  t: 85  wall_t: 32  opt_step: 48000  frame: 40000  fps: 1250  total_reward: 17  total_reward_ma: 59  loss: 0.272864  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:55,749 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 59  strength: 37.14  max_strength: 178.14  final_strength: -4.86  sample_efficiency: 2.71776e-05  training_efficiency: 2.2648e-05  stability: -0.212358
(pid=3919) [2021-05-04 16:53:56,302 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 1021  t: 2  wall_t: 32  opt_step: 72000  frame: 60000  fps: 1875  total_reward: 200  total_reward_ma: 156.5  loss: 881.388  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:56,311 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 156.5  strength: 134.64  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.96399e-05  training_efficiency: 2.46999e-05  stability: 1
(pid=3917) [2021-05-04 16:53:56,431 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 738  t: 10  wall_t: 32  opt_step: 48000  frame: 40000  fps: 1250  total_reward: 200  total_reward_ma: 137.25  loss: 2.31759  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:56,442 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 137.25  strength: 115.39  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 5.36135e-05  training_efficiency: 4.46779e-05  stability: 0.326088
(pid=3919) [2021-05-04 16:53:56,500 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 1744  t: 16  wall_t: 33  opt_step: 48000  frame: 40000  fps: 1212.12  total_reward: 43  total_reward_ma: 21.25  loss: 2.74655  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:56,523 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 3562  t: 33  wall_t: 32  opt_step: 72000  frame: 60000  fps: 1875  total_reward: 9  total_reward_ma: 23.1667  loss: 4.21795  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:56,530 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 23.1667  strength: 1.30667  max_strength: 24.14  final_strength: -12.86  sample_efficiency: 5.77339e-05  training_efficiency: 4.81115e-05  stability: -1.36715
(pid=3919) [2021-05-04 16:53:56,512 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 21.25  strength: -0.610001  max_strength: 21.14  final_strength: 21.14  sample_efficiency: 0.000394467  training_efficiency: 0.000328722  stability: 0.872774
(pid=3919) [2021-05-04 16:53:56,526 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 1627  t: 24  wall_t: 33  opt_step: 48000  frame: 40000  fps: 1212.12  total_reward: 21  total_reward_ma: 23.25  loss: 15.2223  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:53:56,537 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 23.25  strength: 1.39  max_strength: 8.14  final_strength: -0.860001  sample_efficiency: 0.000120654  training_efficiency: 0.000100545  stability: -1.33645
(pid=3927) [2021-05-04 16:53:56,761 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 3418  t: 4  wall_t: 32  opt_step: 72000  frame: 60000  fps: 1875  total_reward: 14  total_reward_ma: 16.8333  loss: 6.47685  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:56,769 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 16.8333  strength: -5.02667  max_strength: 20.14  final_strength: -7.86  sample_efficiency: 5.01448e-05  training_efficiency: 4.17873e-05  stability: -0.524663
(pid=3927) [2021-05-04 16:53:57,299 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 2655  t: 32  wall_t: 33  opt_step: 48000  frame: 40000  fps: 1212.12  total_reward: 32  total_reward_ma: 27  loss: 37.1835  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:57,311 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 27  strength: 5.14  max_strength: 33.14  final_strength: 10.14  sample_efficiency: -1.80367e-05  training_efficiency: -1.50305e-05  stability: -1.20729
(pid=3927) [2021-05-04 16:53:57,315 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 2655  t: 32  wall_t: 33  opt_step: 48000  frame: 40000  fps: 1212.12  total_reward: 32  total_reward_ma: 27  loss: 37.1835  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:57,327 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 27  strength: 5.14  max_strength: 33.14  final_strength: 10.14  sample_efficiency: -1.80367e-05  training_efficiency: -1.50305e-05  stability: -1.20729
(pid=3927) [2021-05-04 16:53:57,745 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 2373  t: 7  wall_t: 33  opt_step: 48000  frame: 40000  fps: 1212.12  total_reward: 10  total_reward_ma: 30.75  loss: 3.7728  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:53:57,756 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 30.75  strength: 8.89  max_strength: 61.14  final_strength: -11.86  sample_efficiency: 2.56609e-05  training_efficiency: 2.1384e-05  stability: -0.70814
(pid=3917) [2021-05-04 16:53:59,246 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 1254  t: 2  wall_t: 35  opt_step: 60000  frame: 50000  fps: 1428.57  total_reward: 13  total_reward_ma: 35.2  loss: 4.70696  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:53:59,257 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 35.2  strength: 13.34  max_strength: 75.14  final_strength: -8.86  sample_efficiency: 0.000109225  training_efficiency: 9.10207e-05  stability: -0.4955
(pid=3919) [2021-05-04 16:54:00,486 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 2878  t: 5  wall_t: 37  opt_step: 72000  frame: 60000  fps: 1621.62  total_reward: 14  total_reward_ma: 47.3333  loss: 43.1433  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:00,493 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 47.3333  strength: 25.4733  max_strength: 163.14  final_strength: -7.86  sample_efficiency: 0.000102575  training_efficiency: 8.54795e-05  stability: -0.288115
(pid=3919) [2021-05-04 16:54:01,921 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 1078  t: 137  wall_t: 38  opt_step: 84000  frame: 70000  fps: 1842.11  total_reward: 200  total_reward_ma: 162.714  loss: 0.908814  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:01,929 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 162.714  strength: 140.854  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.68658e-05  training_efficiency: 2.23882e-05  stability: 1
(pid=3927) [2021-05-04 16:54:02,007 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 4233  t: 1  wall_t: 37  opt_step: 84000  frame: 70000  fps: 1891.89  total_reward: 9  total_reward_ma: 21.1429  loss: 50.4187  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:02,015 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 21.1429  strength: -0.717143  max_strength: 24.14  final_strength: -12.86  sample_efficiency: -5.35695e-05  training_efficiency: -4.46412e-05  stability: -5.25
(pid=3927) [2021-05-04 16:54:02,479 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 4103  t: 4  wall_t: 38  opt_step: 84000  frame: 70000  fps: 1842.11  total_reward: 10  total_reward_ma: 15.8571  loss: 125.019  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:02,490 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 15.8571  strength: -6.00286  max_strength: 20.14  final_strength: -11.86  sample_efficiency: 4.00237e-05  training_efficiency: 3.33531e-05  stability: -0.259947
(pid=3917) [2021-05-04 16:54:03,996 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 1575  t: 167  wall_t: 40  opt_step: 60000  frame: 50000  fps: 1250  total_reward: 200  total_reward_ma: 125.4  loss: 0.857421  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:04,014 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 125.4  strength: 103.54  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 5.33294e-05  training_efficiency: 4.44412e-05  stability: 0.178348
(pid=3917) [2021-05-04 16:54:04,194 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1310  t: 129  wall_t: 40  opt_step: 60000  frame: 50000  fps: 1250  total_reward: 126  total_reward_ma: 72.4  loss: 0.318435  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:04,205 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 72.4  strength: 50.54  max_strength: 178.14  final_strength: 104.14  sample_efficiency: 2.42196e-05  training_efficiency: 2.0183e-05  stability: -0.252019
(pid=3917) [2021-05-04 16:54:05,068 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 868  t: 121  wall_t: 41  opt_step: 60000  frame: 50000  fps: 1219.51  total_reward: 200  total_reward_ma: 149.8  loss: 0.521539  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:05,079 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 149.8  strength: 127.94  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 4.4253e-05  training_efficiency: 3.68775e-05  stability: 0.586186
(pid=3919) [2021-05-04 16:54:05,170 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 2063  t: 1  wall_t: 41  opt_step: 60000  frame: 50000  fps: 1219.51  total_reward: 13  total_reward_ma: 21.2  loss: 545933  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:05,177 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 2198  t: 35  wall_t: 41  opt_step: 60000  frame: 50000  fps: 1219.51  total_reward: 54  total_reward_ma: 27.8  loss: 8.51101  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:05,180 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 21.2  strength: -0.660001  max_strength: 8.14  final_strength: -8.86  sample_efficiency: -0.000149586  training_efficiency: -0.000124655  stability: -3.13669
(pid=3919) [2021-05-04 16:54:05,189 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 27.8  strength: 5.94  max_strength: 32.14  final_strength: 32.14  sample_efficiency: -1.07643e-05  training_efficiency: -8.97026e-06  stability: -0.229507
(pid=3927) [2021-05-04 16:54:05,936 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 3325  t: 13  wall_t: 41  opt_step: 60000  frame: 50000  fps: 1219.51  total_reward: 15  total_reward_ma: 24.6  loss: 3.1295  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:05,947 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 24.6  strength: 2.74  max_strength: 33.14  final_strength: -6.86  sample_efficiency: -3.70828e-05  training_efficiency: -3.09023e-05  stability: -0.945526
(pid=3919) [2021-05-04 16:54:06,145 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 3383  t: 9  wall_t: 42  opt_step: 84000  frame: 70000  fps: 1666.67  total_reward: 9  total_reward_ma: 41.8571  loss: 79.8052  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:06,152 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 41.8571  strength: 19.9971  max_strength: 163.14  final_strength: -12.86  sample_efficiency: 0.000110687  training_efficiency: 9.22389e-05  stability: -0.387071
(pid=3927) [2021-05-04 16:54:06,385 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 2985  t: 1  wall_t: 42  opt_step: 60000  frame: 50000  fps: 1190.48  total_reward: 20  total_reward_ma: 28.6  loss: 20.2023  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:06,397 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 28.6  strength: 6.74  max_strength: 61.14  final_strength: -1.86  sample_efficiency: 2.59733e-05  training_efficiency: 2.16444e-05  stability: -1.27784
(pid=3927) [2021-05-04 16:54:07,490 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 4790  t: 4  wall_t: 43  opt_step: 96000  frame: 80000  fps: 1860.47  total_reward: 11  total_reward_ma: 19.875  loss: 62.0882  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:07,497 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 19.875  strength: -1.985  max_strength: 24.14  final_strength: -10.86  sample_efficiency: -8.38595e-06  training_efficiency: -6.98829e-06  stability: -8.76095
(pid=3919) [2021-05-04 16:54:07,597 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 1135  t: 47  wall_t: 44  opt_step: 96000  frame: 80000  fps: 1818.18  total_reward: 200  total_reward_ma: 167.375  loss: 0.189801  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:07,605 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 167.375  strength: 145.515  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.46675e-05  training_efficiency: 2.05562e-05  stability: 1
(pid=3917) [2021-05-04 16:54:07,900 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 1549  t: 133  wall_t: 44  opt_step: 72000  frame: 60000  fps: 1363.64  total_reward: 116  total_reward_ma: 48.6667  loss: 9.72111  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:07,911 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 48.6667  strength: 26.8067  max_strength: 94.14  final_strength: 94.14  sample_efficiency: 5.50504e-05  training_efficiency: 4.58753e-05  stability: -0.694153
(pid=3927) [2021-05-04 16:54:08,173 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 4589  t: 12  wall_t: 44  opt_step: 96000  frame: 80000  fps: 1818.18  total_reward: 28  total_reward_ma: 17.375  loss: 6.95463  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:08,181 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 17.375  strength: -4.485  max_strength: 20.14  final_strength: 6.14  sample_efficiency: 4.47337e-05  training_efficiency: 3.72781e-05  stability: 0.0956688
(pid=3919) [2021-05-04 16:54:11,811 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 3910  t: 10  wall_t: 48  opt_step: 96000  frame: 80000  fps: 1666.67  total_reward: 12  total_reward_ma: 38.125  loss: 263.887  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:11,819 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 38.125  strength: 16.265  max_strength: 163.14  final_strength: -9.86  sample_efficiency: 0.000118127  training_efficiency: 9.84391e-05  stability: -0.514502
(pid=3917) [2021-05-04 16:54:12,416 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 1677  t: 1  wall_t: 49  opt_step: 72000  frame: 60000  fps: 1224.49  total_reward: 200  total_reward_ma: 137.833  loss: 492.941  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:12,427 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 137.833  strength: 115.973  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 4.39435e-05  training_efficiency: 3.66196e-05  stability: 0.461078
(pid=3917) [2021-05-04 16:54:12,634 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1513  t: 104  wall_t: 49  opt_step: 72000  frame: 60000  fps: 1224.49  total_reward: 56  total_reward_ma: 69.6667  loss: 0.24508  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:12,645 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 69.6667  strength: 47.8067  max_strength: 178.14  final_strength: 34.14  sample_efficiency: 2.33207e-05  training_efficiency: 1.94339e-05  stability: -0.013059
(pid=3927) [2021-05-04 16:54:13,019 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 5531  t: 3  wall_t: 48  opt_step: 108000  frame: 90000  fps: 1875  total_reward: 12  total_reward_ma: 19  loss: 30.5229  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:13,026 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 19  strength: -2.86  max_strength: 24.14  final_strength: -9.86  sample_efficiency: -9.1738e-07  training_efficiency: -7.6448e-07  stability: -2.08564
(pid=3919) [2021-05-04 16:54:13,195 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 1192  t: 23  wall_t: 49  opt_step: 108000  frame: 90000  fps: 1836.73  total_reward: 18  total_reward_ma: 150.778  loss: 0.144138  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:13,203 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 150.778  strength: 128.918  max_strength: 178.14  final_strength: -3.86  sample_efficiency: 2.47126e-05  training_efficiency: 2.05938e-05  stability: 0.843659
(pid=3917) [2021-05-04 16:54:13,720 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 1408  t: 35  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 77  total_reward_ma: 137.667  loss: 0.846078  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:13,731 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 137.667  strength: 115.807  max_strength: 178.14  final_strength: 55.14  sample_efficiency: 4.20638e-05  training_efficiency: 3.50532e-05  stability: 0.509145
(pid=3919) [2021-05-04 16:54:13,790 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 2481  t: 10  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 15  total_reward_ma: 20.1667  loss: 177.188  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:13,801 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 20.1667  strength: -1.69333  max_strength: 8.14  final_strength: -6.86  sample_efficiency: -3.73327e-05  training_efficiency: -3.11105e-05  stability: -5.96969
(pid=3927) [2021-05-04 16:54:13,827 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 5260  t: 1  wall_t: 49  opt_step: 108000  frame: 90000  fps: 1836.73  total_reward: 10  total_reward_ma: 16.5556  loss: 35.1399  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:13,839 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 16.5556  strength: -5.30445  max_strength: 20.14  final_strength: -11.86  sample_efficiency: 3.63809e-05  training_efficiency: 3.03174e-05  stability: -0.560758
(pid=3919) [2021-05-04 16:54:13,812 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 2683  t: 9  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 11  total_reward_ma: 25  loss: 47.0795  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:13,824 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 32.14  final_strength: -10.86  sample_efficiency: -2.65764e-05  training_efficiency: -2.2147e-05  stability: -0.548822
(pid=3927) [2021-05-04 16:54:14,580 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 3970  t: 12  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 11  total_reward_ma: 22.3333  loss: 0.836874  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:14,598 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 22.3333  strength: 0.473333  max_strength: 33.14  final_strength: -10.86  sample_efficiency: -0.000242618  training_efficiency: -0.000202181  stability: -2.21168
(pid=3927) [2021-05-04 16:54:15,019 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 3702  t: 9  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 9  total_reward_ma: 25.3333  loss: 51.1418  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:15,030 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 25.3333  strength: 3.47333  max_strength: 61.14  final_strength: -12.86  sample_efficiency: 3.17162e-05  training_efficiency: 2.64302e-05  stability: -1.72997
(pid=3927) [2021-05-04 16:54:15,035 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 3702  t: 9  wall_t: 50  opt_step: 72000  frame: 60000  fps: 1200  total_reward: 9  total_reward_ma: 25.3333  loss: 51.1418  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:15,046 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 25.3333  strength: 3.47333  max_strength: 61.14  final_strength: -12.86  sample_efficiency: 3.17162e-05  training_efficiency: 2.64302e-05  stability: -1.72997
(pid=3917) [2021-05-04 16:54:16,546 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 1748  t: 120  wall_t: 53  opt_step: 84000  frame: 70000  fps: 1320.75  total_reward: 200  total_reward_ma: 70.2857  loss: 0.312924  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:16,557 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 70.2857  strength: 48.4257  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.36278e-05  training_efficiency: 2.80232e-05  stability: 0.297438
(pid=3919) [2021-05-04 16:54:17,508 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 4451  t: 10  wall_t: 54  opt_step: 108000  frame: 90000  fps: 1666.67  total_reward: 10  total_reward_ma: 35  loss: 1.27574e+07  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:17,516 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 35  strength: 13.14  max_strength: 163.14  final_strength: -11.86  sample_efficiency: 0.000128859  training_efficiency: 0.000107383  stability: -0.644636
(pid=3919) [2021-05-04 16:54:17,519 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 4451  t: 10  wall_t: 54  opt_step: 108000  frame: 90000  fps: 1666.67  total_reward: 10  total_reward_ma: 35  loss: 1.27574e+07  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:17,526 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 35  strength: 13.14  max_strength: 163.14  final_strength: -11.86  sample_efficiency: 0.000128859  training_efficiency: 0.000107383  stability: -0.644636
(pid=3927) [2021-05-04 16:54:18,549 PID:6135 INFO __init__.py log_summary] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df] epi: 6295  t: 7  wall_t: 54  opt_step: 120000  frame: 100000  fps: 1851.85  total_reward: 12  total_reward_ma: 18.3  loss: 2.96285  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:18,557 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [train_df metrics] final_return_ma: 18.3  strength: -3.56  max_strength: 24.14  final_strength: -9.86  sample_efficiency: 2.10637e-06  training_efficiency: 1.75531e-06  stability: -0.903651
(pid=3919) [2021-05-04 16:54:18,903 PID:6064 INFO __init__.py log_summary] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df] epi: 1250  t: 164  wall_t: 55  opt_step: 120000  frame: 100000  fps: 1818.18  total_reward: 200  total_reward_ma: 155.7  loss: 0.0576436  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:18,911 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [train_df metrics] final_return_ma: 155.7  strength: 133.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.27543e-05  training_efficiency: 1.8962e-05  stability: 0.843139
(pid=3927) [2021-05-04 16:54:19,570 PID:6133 INFO __init__.py log_summary] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df] epi: 6169  t: 6  wall_t: 55  opt_step: 120000  frame: 100000  fps: 1818.18  total_reward: 10  total_reward_ma: 15.9  loss: 2.19336  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:19,581 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [train_df metrics] final_return_ma: 15.9  strength: -5.96  max_strength: 20.14  final_strength: -11.86  sample_efficiency: 3.11313e-05  training_efficiency: 2.59427e-05  stability: -0.17302
(pid=3917) [2021-05-04 16:54:20,922 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 2227  t: 67  wall_t: 57  opt_step: 84000  frame: 70000  fps: 1228.07  total_reward: 18  total_reward_ma: 120.714  loss: 1.49438  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:20,933 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 120.714  strength: 98.8543  max_strength: 178.14  final_strength: -3.86  sample_efficiency: 4.41089e-05  training_efficiency: 3.67574e-05  stability: 0.337491
(pid=3917) [2021-05-04 16:54:21,157 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1647  t: 57  wall_t: 57  opt_step: 84000  frame: 70000  fps: 1228.07  total_reward: 11  total_reward_ma: 61.2857  loss: 2.20304  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:21,169 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 61.2857  strength: 39.4257  max_strength: 178.14  final_strength: -10.86  sample_efficiency: 2.36762e-05  training_efficiency: 1.97302e-05  stability: -0.0493655
(pid=3917) [2021-05-04 16:54:21,435 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 1969  t: 7  wall_t: 58  opt_step: 84000  frame: 70000  fps: 1206.9  total_reward: 123  total_reward_ma: 135.571  loss: 0.309378  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:21,446 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 135.571  strength: 113.711  max_strength: 178.14  final_strength: 101.14  sample_efficiency: 3.85343e-05  training_efficiency: 3.21119e-05  stability: 0.548097
(pid=3919) [2021-05-04 16:54:22,569 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 2912  t: 3  wall_t: 59  opt_step: 84000  frame: 70000  fps: 1186.44  total_reward: 45  total_reward_ma: 23.7143  loss: 1.50828e+06  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:22,581 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 23.7143  strength: 1.85429  max_strength: 23.14  final_strength: 23.14  sample_efficiency: 5.46896e-05  training_efficiency: 4.55747e-05  stability: -1.26378
(pid=3919) [2021-05-04 16:54:22,601 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 3138  t: 24  wall_t: 59  opt_step: 84000  frame: 70000  fps: 1186.44  total_reward: 36  total_reward_ma: 26.5714  loss: 58.8094  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:22,614 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 26.5714  strength: 4.71143  max_strength: 32.14  final_strength: 14.14  sample_efficiency: -9.05701e-06  training_efficiency: -7.54751e-06  stability: -1.44161
(pid=3927) [2021-05-04 16:54:22,668 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 4474  t: 28  wall_t: 58  opt_step: 84000  frame: 70000  fps: 1206.9  total_reward: 53  total_reward_ma: 26.7143  loss: 6.999  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:22,678 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 26.7143  strength: 4.85429  max_strength: 33.14  final_strength: 31.14  sample_efficiency: -7.18589e-06  training_efficiency: -5.98823e-06  stability: -14.493
(pid=3927) [2021-05-04 16:54:23,305 PID:6135 INFO __init__.py log_metrics] Trial 6 session 2 sarsa_epsilon_greedy_cartpole_t6_s2 [eval_df metrics] final_return_ma: 18.3  strength: -3.56  max_strength: 24.14  final_strength: -9.86  sample_efficiency: 2.10637e-06  training_efficiency: 1.75531e-06  stability: -0.903651
(pid=3927) [2021-05-04 16:54:23,306 PID:6135 INFO logger.py info] Session 2 done
(pid=3919) [2021-05-04 16:54:23,630 PID:6064 INFO __init__.py log_metrics] Trial 4 session 0 sarsa_epsilon_greedy_cartpole_t4_s0 [eval_df metrics] final_return_ma: 155.7  strength: 133.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.27543e-05  training_efficiency: 1.8962e-05  stability: 0.843139
(pid=3919) [2021-05-04 16:54:23,631 PID:6064 INFO logger.py info] Session 0 done
(pid=3927) [2021-05-04 16:54:23,807 PID:6133 INFO __init__.py log_metrics] Trial 6 session 0 sarsa_epsilon_greedy_cartpole_t6_s0 [eval_df metrics] final_return_ma: 15.9  strength: -5.96  max_strength: 20.14  final_strength: -11.86  sample_efficiency: 3.11313e-05  training_efficiency: 2.59427e-05  stability: -0.17302
(pid=3927) [2021-05-04 16:54:23,808 PID:6133 INFO logger.py info] Session 0 done
(pid=3927) [2021-05-04 16:54:23,912 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 4569  t: 8  wall_t: 59  opt_step: 84000  frame: 70000  fps: 1186.44  total_reward: 9  total_reward_ma: 23  loss: 4.62452  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:23,923 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 23  strength: 1.14  max_strength: 61.14  final_strength: -12.86  sample_efficiency: 5.98061e-05  training_efficiency: 4.98384e-05  stability: -3.41459
(pid=3919) [2021-05-04 16:54:24,088 PID:6065 INFO __init__.py log_summary] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df] epi: 5011  t: 5  wall_t: 60  opt_step: 120000  frame: 100000  fps: 1666.67  total_reward: 18  total_reward_ma: 33.3  loss: 22.0154  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:24,096 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [train_df metrics] final_return_ma: 33.3  strength: 11.44  max_strength: 163.14  final_strength: -3.86  sample_efficiency: 0.00013287  training_efficiency: 0.000110725  stability: -0.809572
(pid=3917) [2021-05-04 16:54:25,029 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 2578  t: 2  wall_t: 61  opt_step: 96000  frame: 80000  fps: 1311.48  total_reward: 24  total_reward_ma: 64.5  loss: 4.87856  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:25,037 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 64.5  strength: 42.64  max_strength: 178.14  final_strength: 2.14  sample_efficiency: 3.34953e-05  training_efficiency: 2.79127e-05  stability: 0.147442
(pid=3917) [2021-05-04 16:54:27,416 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 2301  t: 176  wall_t: 64  opt_step: 96000  frame: 80000  fps: 1250  total_reward: 200  total_reward_ma: 130.625  loss: 0.751387  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:27,423 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 130.625  strength: 108.765  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.76376e-05  training_efficiency: 3.13647e-05  stability: 0.333796
(pid=3917) [2021-05-04 16:54:27,477 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 2475  t: 22  wall_t: 64  opt_step: 96000  frame: 80000  fps: 1250  total_reward: 11  total_reward_ma: 120  loss: 2.86133  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:27,489 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 120  strength: 98.14  max_strength: 178.14  final_strength: -10.86  sample_efficiency: 3.88944e-05  training_efficiency: 3.2412e-05  stability: 0.464811
(pid=3917) [2021-05-04 16:54:28,199 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1753  t: 107  wall_t: 64  opt_step: 96000  frame: 80000  fps: 1250  total_reward: 29  total_reward_ma: 57.25  loss: 0.341931  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:28,210 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 57.25  strength: 35.39  max_strength: 178.14  final_strength: 7.14  sample_efficiency: 2.33943e-05  training_efficiency: 1.94953e-05  stability: -0.0906588
(pid=3919) [2021-05-04 16:54:28,835 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 3406  t: 3  wall_t: 65  opt_step: 96000  frame: 80000  fps: 1230.77  total_reward: 16  total_reward_ma: 22.75  loss: 4523.71  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:28,845 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 3578  t: 14  wall_t: 65  opt_step: 96000  frame: 80000  fps: 1230.77  total_reward: 14  total_reward_ma: 25  loss: 85.0482  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:28,846 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 22.75  strength: 0.889999  max_strength: 23.14  final_strength: -5.86  sample_efficiency: 8.94132e-05  training_efficiency: 7.4511e-05  stability: -3.00616
(pid=3919) [2021-05-04 16:54:28,856 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 32.14  final_strength: -7.86  sample_efficiency: -1.58022e-05  training_efficiency: -1.31685e-05  stability: -1.06186
(pid=3919) [2021-05-04 16:54:29,118 PID:6065 INFO __init__.py log_metrics] Trial 4 session 1 sarsa_epsilon_greedy_cartpole_t4_s1 [eval_df metrics] final_return_ma: 33.3  strength: 11.44  max_strength: 163.14  final_strength: -3.86  sample_efficiency: 0.00013287  training_efficiency: 0.000110725  stability: -0.809572
(pid=3919) [2021-05-04 16:54:29,120 PID:6065 INFO logger.py info] Session 1 done
(pid=3927) [2021-05-04 16:54:29,237 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 5171  t: 14  wall_t: 65  opt_step: 96000  frame: 80000  fps: 1230.77  total_reward: 10  total_reward_ma: 24.625  loss: 9.63054  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:29,244 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 24.625  strength: 2.765  max_strength: 33.14  final_strength: -11.86  sample_efficiency: -1.77408e-05  training_efficiency: -1.4784e-05  stability: -1.56033
(pid=3927) [2021-05-04 16:54:29,685 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 5342  t: 9  wall_t: 65  opt_step: 96000  frame: 80000  fps: 1230.77  total_reward: 9  total_reward_ma: 21.25  loss: 15.0818  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:29,694 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 21.25  strength: -0.610001  max_strength: 61.14  final_strength: -12.86  sample_efficiency: -6.48569e-05  training_efficiency: -5.40475e-05  stability: -10.5288
(pid=3927) [2021-05-04 16:54:29,697 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 5342  t: 9  wall_t: 65  opt_step: 96000  frame: 80000  fps: 1230.77  total_reward: 9  total_reward_ma: 21.25  loss: 15.0818  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:29,704 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 21.25  strength: -0.610001  max_strength: 61.14  final_strength: -12.86  sample_efficiency: -6.48569e-05  training_efficiency: -5.40475e-05  stability: -10.5288
(pid=3917) [2021-05-04 16:54:30,729 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 2682  t: 111  wall_t: 67  opt_step: 108000  frame: 90000  fps: 1343.28  total_reward: 200  total_reward_ma: 79.5556  loss: 0.681929  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:30,736 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 79.5556  strength: 57.6956  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.5816e-05  training_efficiency: 2.15134e-05  stability: 0.152791
(pid=3917) [2021-05-04 16:54:32,767 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 2403  t: 1  wall_t: 69  opt_step: 108000  frame: 90000  fps: 1304.35  total_reward: 137  total_reward_ma: 131.333  loss: 172.692  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:32,774 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 131.333  strength: 109.473  max_strength: 178.14  final_strength: 115.14  sample_efficiency: 3.45377e-05  training_efficiency: 2.87814e-05  stability: 0.397784
(pid=3917) [2021-05-04 16:54:32,914 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 2645  t: 6  wall_t: 69  opt_step: 108000  frame: 90000  fps: 1304.35  total_reward: 66  total_reward_ma: 114  loss: 229.09  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:32,924 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 114  strength: 92.14  max_strength: 178.14  final_strength: 44.14  sample_efficiency: 3.74155e-05  training_efficiency: 3.11796e-05  stability: 0.457408
(pid=3917) [2021-05-04 16:54:33,652 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1901  t: 179  wall_t: 70  opt_step: 108000  frame: 90000  fps: 1285.71  total_reward: 200  total_reward_ma: 73.1111  loss: 0.833828  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:33,659 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 73.1111  strength: 51.2511  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 1.86505e-05  training_efficiency: 1.55421e-05  stability: -0.0631535
(pid=3919) [2021-05-04 16:54:34,298 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 4001  t: 2  wall_t: 70  opt_step: 108000  frame: 90000  fps: 1285.71  total_reward: 35  total_reward_ma: 26.1111  loss: 2.78408e+06  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:34,305 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 26.1111  strength: 4.25111  max_strength: 32.14  final_strength: 13.14  sample_efficiency: -6.55907e-06  training_efficiency: -5.46589e-06  stability: -1.70701
(pid=3919) [2021-05-04 16:54:34,342 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 4119  t: 7  wall_t: 70  opt_step: 108000  frame: 90000  fps: 1285.71  total_reward: 14  total_reward_ma: 21.7778  loss: 2.48157  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:34,349 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 21.7778  strength: -0.0822228  max_strength: 23.14  final_strength: -7.86  sample_efficiency: -0.000742276  training_efficiency: -0.000618563  stability: -6.58427
(pid=3927) [2021-05-04 16:54:34,524 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 5895  t: 5  wall_t: 70  opt_step: 108000  frame: 90000  fps: 1285.71  total_reward: 9  total_reward_ma: 22.8889  loss: 69.1189  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:34,531 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 22.8889  strength: 1.02889  max_strength: 33.14  final_strength: -12.86  sample_efficiency: -5.78095e-05  training_efficiency: -4.81745e-05  stability: -2.9783
(pid=3927) [2021-05-04 16:54:35,013 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 6177  t: 7  wall_t: 70  opt_step: 108000  frame: 90000  fps: 1285.71  total_reward: 39  total_reward_ma: 23.2222  loss: 13.1122  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:35,023 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 23.2222  strength: 1.36222  max_strength: 61.14  final_strength: 17.14  sample_efficiency: 4.13497e-05  training_efficiency: 3.44581e-05  stability: -17.8524
(pid=3917) [2021-05-04 16:54:36,339 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 3083  t: 10  wall_t: 72  opt_step: 120000  frame: 100000  fps: 1388.89  total_reward: 10  total_reward_ma: 72.6  loss: 8.52186  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:36,350 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 72.6  strength: 50.74  max_strength: 178.14  final_strength: -11.86  sample_efficiency: 2.61857e-05  training_efficiency: 2.18214e-05  stability: 0.0775334
(pid=3917) [2021-05-04 16:54:36,354 PID:6088 INFO __init__.py log_summary] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df] epi: 3083  t: 10  wall_t: 72  opt_step: 120000  frame: 100000  fps: 1388.89  total_reward: 10  total_reward_ma: 72.6  loss: 8.52186  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:36,364 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [train_df metrics] final_return_ma: 72.6  strength: 50.74  max_strength: 178.14  final_strength: -11.86  sample_efficiency: 2.61857e-05  training_efficiency: 2.18214e-05  stability: 0.0775334
(pid=3917) [2021-05-04 16:54:38,126 PID:6084 INFO __init__.py log_summary] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df] epi: 2614  t: 60  wall_t: 74  opt_step: 120000  frame: 100000  fps: 1351.35  total_reward: 56  total_reward_ma: 123.8  loss: 1.64251  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:38,133 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [train_df metrics] final_return_ma: 123.8  strength: 101.94  max_strength: 178.14  final_strength: 34.14  sample_efficiency: 3.37159e-05  training_efficiency: 2.80966e-05  stability: 0.385949
(pid=3917) [2021-05-04 16:54:38,334 PID:6089 INFO __init__.py log_summary] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df] epi: 2815  t: 7  wall_t: 74  opt_step: 120000  frame: 100000  fps: 1351.35  total_reward: 200  total_reward_ma: 122.6  loss: 0.631643  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:38,342 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [train_df metrics] final_return_ma: 122.6  strength: 100.74  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.25676e-05  training_efficiency: 2.71397e-05  stability: 0.486289
(pid=3917) [2021-05-04 16:54:39,308 PID:6083 INFO __init__.py log_summary] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df] epi: 1971  t: 50  wall_t: 75  opt_step: 120000  frame: 100000  fps: 1333.33  total_reward: 15  total_reward_ma: 67.3  loss: 1.24413  lr: 0.05  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3917) [2021-05-04 16:54:39,321 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [train_df metrics] final_return_ma: 67.3  strength: 45.44  max_strength: 178.14  final_strength: -6.86  sample_efficiency: 1.87811e-05  training_efficiency: 1.56509e-05  stability: -0.0536357
(pid=3919) [2021-05-04 16:54:40,071 PID:6066 INFO __init__.py log_summary] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df] epi: 4388  t: 32  wall_t: 76  opt_step: 120000  frame: 100000  fps: 1315.79  total_reward: 15  total_reward_ma: 25  loss: 107.492  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:40,085 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [train_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 32.14  final_strength: -6.86  sample_efficiency: -1.01768e-05  training_efficiency: -8.48063e-06  stability: -1.30005
(pid=3919) [2021-05-04 16:54:40,115 PID:6067 INFO __init__.py log_summary] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df] epi: 4644  t: 30  wall_t: 76  opt_step: 120000  frame: 100000  fps: 1315.79  total_reward: 39  total_reward_ma: 23.5  loss: 40.428  lr: 0.01  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3919) [2021-05-04 16:54:40,123 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [train_df metrics] final_return_ma: 23.5  strength: 1.64  max_strength: 23.14  final_strength: 17.14  sample_efficiency: 4.39444e-05  training_efficiency: 3.66203e-05  stability: -71.9724
(pid=3927) [2021-05-04 16:54:40,163 PID:6136 INFO __init__.py log_summary] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df] epi: 6583  t: 69  wall_t: 76  opt_step: 120000  frame: 100000  fps: 1315.79  total_reward: 44  total_reward_ma: 25  loss: 0.995696  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:40,172 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [train_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 33.14  final_strength: 22.14  sample_efficiency: -9.99731e-06  training_efficiency: -8.33108e-06  stability: -8.50325
(pid=3917) [2021-05-04 16:54:40,353 PID:6088 INFO __init__.py log_metrics] Trial 5 session 2 sarsa_epsilon_greedy_cartpole_t5_s2 [eval_df metrics] final_return_ma: 72.6  strength: 50.74  max_strength: 178.14  final_strength: -11.86  sample_efficiency: 2.61857e-05  training_efficiency: 2.18214e-05  stability: 0.0775334
(pid=3917) [2021-05-04 16:54:40,355 PID:6088 INFO logger.py info] Session 2 done
(pid=3927) [2021-05-04 16:54:40,770 PID:6134 INFO __init__.py log_summary] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df] epi: 6705  t: 7  wall_t: 76  opt_step: 120000  frame: 100000  fps: 1315.79  total_reward: 19  total_reward_ma: 22.8  loss: 0.850267  lr: 0.1  explore_var: 0.05  entropy_coef: nan  entropy: nan  grad_norm: nan
(pid=3927) [2021-05-04 16:54:40,782 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [train_df metrics] final_return_ma: 22.8  strength: 0.939999  max_strength: 61.14  final_strength: -2.86  sample_efficiency: 5.0888e-05  training_efficiency: 4.24066e-05  stability: -8.1354
(pid=3917) [2021-05-04 16:54:42,296 PID:6084 INFO __init__.py log_metrics] Trial 5 session 1 sarsa_epsilon_greedy_cartpole_t5_s1 [eval_df metrics] final_return_ma: 123.8  strength: 101.94  max_strength: 178.14  final_strength: 34.14  sample_efficiency: 3.37159e-05  training_efficiency: 2.80966e-05  stability: 0.385949
(pid=3917) [2021-05-04 16:54:42,297 PID:6084 INFO logger.py info] Session 1 done
(pid=3917) [2021-05-04 16:54:42,509 PID:6089 INFO __init__.py log_metrics] Trial 5 session 3 sarsa_epsilon_greedy_cartpole_t5_s3 [eval_df metrics] final_return_ma: 122.6  strength: 100.74  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 3.25676e-05  training_efficiency: 2.71397e-05  stability: 0.486289
(pid=3917) [2021-05-04 16:54:42,510 PID:6089 INFO logger.py info] Session 3 done
(pid=3917) [2021-05-04 16:54:43,518 PID:6083 INFO __init__.py log_metrics] Trial 5 session 0 sarsa_epsilon_greedy_cartpole_t5_s0 [eval_df metrics] final_return_ma: 67.3  strength: 45.44  max_strength: 178.14  final_strength: -6.86  sample_efficiency: 1.87811e-05  training_efficiency: 1.56509e-05  stability: -0.0536357
(pid=3917) [2021-05-04 16:54:43,520 PID:6083 INFO logger.py info] Session 0 done
(pid=3919) [2021-05-04 16:54:44,048 PID:6066 INFO __init__.py log_metrics] Trial 4 session 2 sarsa_epsilon_greedy_cartpole_t4_s2 [eval_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 32.14  final_strength: -6.86  sample_efficiency: -1.01768e-05  training_efficiency: -8.48063e-06  stability: -1.30005
(pid=3919) [2021-05-04 16:54:44,048 PID:6067 INFO __init__.py log_metrics] Trial 4 session 3 sarsa_epsilon_greedy_cartpole_t4_s3 [eval_df metrics] final_return_ma: 23.5  strength: 1.64  max_strength: 23.14  final_strength: 17.14  sample_efficiency: 4.39444e-05  training_efficiency: 3.66203e-05  stability: -71.9724
(pid=3919) [2021-05-04 16:54:44,049 PID:6066 INFO logger.py info] Session 2 done
(pid=3927) [2021-05-04 16:54:44,077 PID:6136 INFO __init__.py log_metrics] Trial 6 session 3 sarsa_epsilon_greedy_cartpole_t6_s3 [eval_df metrics] final_return_ma: 25  strength: 3.14  max_strength: 33.14  final_strength: 22.14  sample_efficiency: -9.99731e-06  training_efficiency: -8.33108e-06  stability: -8.50325
(pid=3927) [2021-05-04 16:54:44,077 PID:6136 INFO logger.py info] Session 3 done
(pid=3919) [2021-05-04 16:54:44,050 PID:6067 INFO logger.py info] Session 3 done
(pid=3927) [2021-05-04 16:54:44,672 PID:6134 INFO __init__.py log_metrics] Trial 6 session 1 sarsa_epsilon_greedy_cartpole_t6_s1 [eval_df metrics] final_return_ma: 22.8  strength: 0.939999  max_strength: 61.14  final_strength: -2.86  sample_efficiency: 5.0888e-05  training_efficiency: 4.24066e-05  stability: -8.1354
(pid=3927) [2021-05-04 16:54:44,673 PID:6134 INFO logger.py info] Session 1 done
Result for ray_trainable_5_agent.0.net.optim_spec.lr=0.05,trial_index=5:
  date: 2021-05-04_16-54-46
  done: false
  experiment_id: 8ffcfbad795a4beea99cf2e3c64c9215
  hostname: furanzu
  iterations_since_restore: 1
  node_ip: 220.67.127.75
  pid: 3917
  time_since_restore: 83.9136130809784
  time_this_iter_s: 83.9136130809784
  time_total_s: 83.9136130809784
  timestamp: 1620114886
  timesteps_since_restore: 0
  training_iteration: 1
  trial_data:
    '5':
      agent.0.net.optim_spec.lr: 0.05
      consistency: -1.2436954030885268
      final_return_ma: 96.57500076293945
      final_strength: 48.38999938964844
      max_strength: 178.13999938964844
      sample_efficiency: 2.7812577627628343e-05
      stability: 0.2240338921546936
      strength: 74.71500301361084
      training_efficiency: 2.3177149614639347e-05
  
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 12/16 CPUs, 0/2 GPUs
Memory usage on this node: 5.2/33.6 GB
Result logdir: /home/iwan/ray_results/sarsa_epsilon_greedy_cartpole
Number of trials: 7 ({'TERMINATED': 4, 'RUNNING': 3})
RUNNING trials:
 - ray_trainable_4_agent.0.net.optim_spec.lr=0.01,trial_index=4:	RUNNING
 - ray_trainable_5_agent.0.net.optim_spec.lr=0.05,trial_index=5:	RUNNING, [4 CPUs, 0 GPUs], [pid=3917], 83 s, 1 iter
 - ray_trainable_6_agent.0.net.optim_spec.lr=0.1,trial_index=6:	RUNNING
TERMINATED trials:
 - ray_trainable_0_agent.0.net.optim_spec.lr=0.0005,trial_index=0:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3926], 101 s, 1 iter
 - ray_trainable_1_agent.0.net.optim_spec.lr=0.001,trial_index=1:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3913], 101 s, 1 iter
 - ray_trainable_2_agent.0.net.optim_spec.lr=0.001,trial_index=2:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3925], 101 s, 1 iter
 - ray_trainable_3_agent.0.net.optim_spec.lr=0.005,trial_index=3:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3914], 102 s, 1 iter

(pid=3917) [2021-05-04 16:54:46,943 PID:3917 INFO logger.py info] Trial 5 done
Result for ray_trainable_4_agent.0.net.optim_spec.lr=0.01,trial_index=4:
  date: 2021-05-04_16-54-47
  done: false
  experiment_id: 15bdd3fe8a5348959d51f1b0ce8697eb
  hostname: furanzu
  iterations_since_restore: 1
  node_ip: 220.67.127.75
  pid: 3919
  time_since_restore: 84.52337956428528
  time_this_iter_s: 84.52337956428528
  time_total_s: 84.52337956428528
  timestamp: 1620114887
  timesteps_since_restore: 0
  training_iteration: 1
  trial_data:
    '4':
      agent.0.net.optim_spec.lr: 0.01
      consistency: -2.955836353229779
      final_return_ma: 59.374999046325684
      final_strength: 46.13999938964844
      max_strength: 99.13999938964844
      sample_efficiency: 4.734792105409724e-05
      stability: -18.30973031371832
      strength: 37.5149986743927
      training_efficiency: 3.945659932469425e-05
  
(pid=3919) [2021-05-04 16:54:47,500 PID:3919 INFO logger.py info] Trial 4 done
Result for ray_trainable_6_agent.0.net.optim_spec.lr=0.1,trial_index=6:
  date: 2021-05-04_16-54-48
  done: false
  experiment_id: 4da3531cb0aa45c1be14d4b3c595e945
  hostname: furanzu
  iterations_since_restore: 1
  node_ip: 220.67.127.75
  pid: 3927
  time_since_restore: 84.44849443435669
  time_this_iter_s: 84.44849443435669
  time_total_s: 84.44849443435669
  timestamp: 1620114888
  timesteps_since_restore: 0
  training_iteration: 1
  trial_data:
    '6':
      agent.0.net.optim_spec.lr: 0.1
      consistency: 17.4483187290332
      final_return_ma: 20.499999523162842
      final_strength: -0.6100006103515625
      max_strength: 34.63999938964844
      sample_efficiency: 1.8532071067056677e-05
      stability: -4.428830206394196
      strength: -1.3600005954504013
      training_efficiency: 1.544339670545014e-05
  
(pid=3927) [2021-05-04 16:54:48,028 PID:3927 INFO logger.py info] Trial 6 done
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/2 GPUs
Memory usage on this node: 4.0/33.6 GB
Result logdir: /home/iwan/ray_results/sarsa_epsilon_greedy_cartpole
Number of trials: 7 ({'TERMINATED': 7})
TERMINATED trials:
 - ray_trainable_0_agent.0.net.optim_spec.lr=0.0005,trial_index=0:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3926], 101 s, 1 iter
 - ray_trainable_1_agent.0.net.optim_spec.lr=0.001,trial_index=1:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3913], 101 s, 1 iter
 - ray_trainable_2_agent.0.net.optim_spec.lr=0.001,trial_index=2:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3925], 101 s, 1 iter
 - ray_trainable_3_agent.0.net.optim_spec.lr=0.005,trial_index=3:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3914], 102 s, 1 iter
 - ray_trainable_4_agent.0.net.optim_spec.lr=0.01,trial_index=4:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3919], 84 s, 1 iter
 - ray_trainable_5_agent.0.net.optim_spec.lr=0.05,trial_index=5:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3917], 83 s, 1 iter
 - ray_trainable_6_agent.0.net.optim_spec.lr=0.1,trial_index=6:	TERMINATED, [4 CPUs, 0 GPUs], [pid=3927], 84 s, 1 iter

[2021-05-04 16:54:52,408 PID:3860 INFO analysis.py analyze_experiment] All experiment data zipped to data/sarsa_epsilon_greedy_cartpole_2021_05_04_165139.zip
[2021-05-04 16:54:52,408 PID:3860 INFO logger.py info] Experiment done

Potential Memory Leak

Hello,

I am currently using SLM Lab as the learning component for my custom Unity environments. I am using a modified UnityEnv wrapper, and I run my experiments with a modified version of the starter code here.

When running both PPO and SAC, I noticed that the kernel kills the job after a while due to running out of memory (RAM/swap).

Given the custom nature of this bug, I don't expect you to reproduce it; rather, I'm asking whether you have ever faced a similar problem on your end.

Some more detail:

  1. Initially, I assumed it was due to the size of the replay buffer, but even after the replay buffer was capped at a small size (1000) and filled up, the problem persisted.
  2. The memory usage grows at roughly 1 MB/s, which is relatively high.
  3. I managed to trace it to the "train step" in SAC. I can't confirm that the memory is allocated there, but when the training steps are skipped, there is no problem (see the memory-logging sketch below).
  4. I tested with the default Unity envs to ensure I didn't cause the problem with my custom env; this doesn't seem to be the cause.
  5. We will be testing with the provided Cartpole env to see if the problem persists.

Any guidance or tips would be appreciated! And once again thank you for the great library!
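A minimal memory-logging sketch for narrowing this down, assuming psutil is installed and that you can call it from wherever the SAC train step runs (the function name and call sites below are hypothetical, not part of SLM Lab):

import gc
import os

import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag):
    # Print resident memory in MB plus the number of GC-tracked Python objects,
    # so growth per train step becomes visible in the logs.
    rss_mb = _proc.memory_info().rss / 1e6
    print(f'[mem] {tag}: rss={rss_mb:.1f} MB, gc objects={len(gc.get_objects())}')

# Hypothetical usage around the suspected SAC train step:
# log_rss('before train')
# algorithm.train()  # whatever triggers the SAC update in your setup
# log_rss('after train')

If the resident size only grows while training steps run, a common culprit in PyTorch code is holding references to tensors that are still attached to the autograd graph (for example accumulating loss instead of loss.item()), but without seeing the modified code that is only a guess.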

Run without roboschool

Is it possible to run without roboschool? On the server I'm using, installing roboschool fails, but the code seems to depend on it even though I'm not trying to run a roboschool environment.
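The hard dependency appears to come from the module-level import of roboschool in slm_lab/spec/random_baseline.py (see the tracebacks in the issues below). A minimal sketch of making that import optional, assuming nothing else in the non-roboschool code path needs the package:

try:
    import roboschool  # noqa: F401  # registers the Roboschool* gym envs
    ROBOSCHOOL_AVAILABLE = True
except ImportError:
    ROBOSCHOOL_AVAILABLE = False

# Any code that enumerates Roboschool environments would then be guarded:
# if ROBOSCHOOL_AVAILABLE:
#     ...generate the Roboschool random baselines...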

SLM Lab As a Python Pip Module

Describe the bug
I cannot create a Session in an external project using the code from the gitbook.

To Reproduce

  1. OS and environment: MacOS
  2. SLM Lab git SHA (run git rev-parse HEAD to get it):
  3. spec file used: slm_lab/spec/benchmark/ppo/ppo_cartpole.json

Additional context
I replace "from slm_lab.env import OpenAIEnv" to "from slm_lab.env import make_env" line 5
"from slm_lab.experiment.monitor import Body" to "from slm_lab.agent import Body" line 7
"self.env = OpenAIEnv(self.spec)" to "self.env = make_env(self.spec)" line 17
Because previous components can not be found.

Error logs

Traceback (most recent call last):
  File "pipex.py", line 56, in <module>
    sess = Session(spec)
  File "pipex.py", line 18, in __init__
    body = Body(self.env, self.spec['agent'])
  File "/Users/shiwanyin/SLM-Lab/slm_lab/agent/__init__.py", line 113, in __init__
    self.init_tb()
  File "/Users/shiwanyin/SLM-Lab/slm_lab/agent/__init__.py", line 132, in init_tb
    log_prepath = self.spec['meta']['log_prepath']
TypeError: list indices must be integers or slices, not str
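Reading the traceback: Body stores its second argument as self.spec and then looks up self.spec['meta']['log_prepath'], so it appears to expect the full spec dict, while pipex.py line 18 passes it spec['agent'], which is a list, hence the TypeError. A minimal sketch of the adjusted constructor, assuming the rest of the gitbook Session code stays as in the report above:

from slm_lab.agent import Body
from slm_lab.env import make_env

class Session:
    def __init__(self, spec):
        self.spec = spec
        self.env = make_env(self.spec)
        # Pass the whole spec dict, not spec['agent'], so Body can read
        # spec['meta'] internally.
        body = Body(self.env, self.spec)
        ...

Even with that change, the runtime fields in spec['meta'] (such as log_prepath) are normally filled in by the lab's own entry point before a Session is built, so the spec may still need to be prepared the way run_lab.py does (via slm_lab.spec.spec_util) rather than loaded straight from the JSON file.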

book branch: unable to run initial example

After purchasing the book, I've installed the code on both CentOS 7 and Ubuntu 18.04. Running the initial example gives a missing-library error.

./bin/setup
conda activate lab
python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

Gives the error below:

    from roboschool  import cpp_household   as cpp_household
ImportError: libpcre16.so.3: cannot open shared object file: No such file or directory

You can easily reproduce this by running import roboschool in Python inside the conda environment set up for the lab.
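A quick way to confirm that the failure is just the missing shared library rather than anything in SLM Lab itself (a diagnostic sketch, not part of the repo):

import ctypes.util

for name in ('pcre16', 'pcre'):
    print(name, '->', ctypes.util.find_library(name))

# If 'pcre16' prints None, roboschool's cpp_household extension cannot load.
# On Debian/Ubuntu the library is usually provided by the libpcre16-3 system
# package (an assumption worth double-checking with your package manager).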

Error when running "demo"-file

Describe the bug
After installation according to instructions on

https://slm-lab.gitbook.io/slm-lab/setup/installation

I get the following error when running

$python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

in the lab environment:

ImportError: libpcre16.so.3: cannot open shared object file: No such file or directory

To Reproduce

  1. OS and environment: Ubuntu 20.04
  2. SLM Lab git SHA (run git rev-parse HEAD to get it): dda02d0
  3. spec file used: ???

Additional context
???

Error logs

Traceback (most recent call last):
  File "run_lab.py", line 5, in <module>
    from slm_lab.experiment.control import Session, Trial, Experiment
  File "/home/SML Lab/SLM-Lab/slm_lab/experiment/control.py", line 7, in <module>
    from slm_lab.experiment import analysis, search
  File "/home/SML Lab/SLM-Lab/slm_lab/experiment/analysis.py", line 2, in <module>
    from slm_lab.spec import random_baseline
  File "/home//SML Lab/SLM-Lab/slm_lab/spec/random_baseline.py", line 7, in <module>
    import roboschool
  File "/home//anaconda3/envs/lab/lib/python3.7/site-packages/roboschool/__init__.py", line 112, in <module>
    from roboschool.gym_pendulums import RoboschoolInvertedPendulum
  File "/home/anaconda3/envs/lab/lib/python3.7/site-packages/roboschool/gym_pendulums.py", line 1, in <module>
    from roboschool.scene_abstract import SingleRobotEmptyScene
  File "/home/anaconda3/envs/lab/lib/python3.7/site-packages/roboschool/scene_abstract.py", line 12, in <module>
    from roboschool  import cpp_household   as cpp_household
ImportError: libpcre16.so.3: cannot open shared object file: No such file or directory

Fail to save graphs

I follow the book "Foundations of Deep Reinforcement Learning" to conduct the experiments of reinformace algorithm.Although the algorithm can be conducted successfully, its graphs fail to be saved successfully, with an error from orca "service unavaialble".

  1. OS and environment: Ubuntu 16.04
  2. spec file used: reinforce_cartpole.json

Additional context
Add any other context about the problem here.

Error logs
Failed to generate graph. Run retro-analysis to generate graphs later.
The image request was rejected by the orca conversion utility
with the following error:
503:

<title>503 Service Unavailable</title>

Service Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
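
The 503 comes from the local orca image-export server that plotly launches for figure export, not from SLM Lab itself, so the run data is intact and only the plotting step failed. One possible remedy, assuming the conda-packaged orca is compatible with the plotly version pinned by the lab environment and that the machine may be headless, is:

conda activate lab
conda install -c plotly plotly-orca psutil requests

# on a headless server, persist the use_xvfb setting so orca runs under a virtual display
python -c "import plotly.io as pio; pio.orca.config.use_xvfb = True; pio.orca.config.save()"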

All 'search' examples end with error

Describe the bug
I'm enjoying the book a lot. It's the best book on the subject, and I've read Sutton & Barto, but I'm an empiricist rather than an academic. Anyway, I can run all the examples in the book in 'dev' and 'train' modes, but not in 'search' mode: they all end with an error. I don't see anybody else complaining about this, so it must be a rookie mistake on my part. I hope you can help so I can continue enjoying the book to its fullest.

To Reproduce

  1. OS and environment: Ubuntu 18.04
  2. SLM Lab git SHA (run git rev-parse HEAD to get it): What?
  3. spec file used: benchmark/reinforce/reinforce_cartpole.json

Additional context
I'm showing the error logs for Code 2.15 on page 50, but I get similar error logs for all the other code examples run in 'search' mode.
There are 32 files in the 'data' folder and no plots.
All the folders in the 'data' folder are empty except for 'log', which has a file containing this:

[2020-01-30 11:03:56,907 PID:3351 INFO search.py run_ray_search] Running ray search for spec reinforce_cartpole

NVIDIA driver version: 440.33.01
CUDA version: 10.2
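
Not sure if it is the cause, but the crash below begins with a CUDA initialization error inside a forked ray worker, so a quick way to test whether the GPU is implicated (a diagnostic sketch, not a fix) is to hide the GPU from PyTorch and rerun the search on CPU only:

conda activate lab
CUDA_VISIBLE_DEVICES="" python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_baseline_cartpole search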

Error logs

python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_baseline_cartpole search
[2020-01-30 11:38:57,177 PID:4355 INFO run_lab.py read_spec_and_run] Running lab spec_file:slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json spec_name:reinforce_baseline_cartpole in mode:search
[2020-01-30 11:38:57,183 PID:4355 INFO search.py run_ray_search] Running ray search for spec reinforce_baseline_cartpole
2020-01-30 11:38:57,183	WARNING worker.py:1341 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2020-01-30 11:38:57,183	INFO node.py:497 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2020-01-30_11-38-57_183527_4355/logs.
2020-01-30 11:38:57,288	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:59003 to respond...
2020-01-30 11:38:57,409	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:55931 to respond...
2020-01-30 11:38:57,414	INFO services.py:806 -- Starting Redis shard with 3.35 GB max memory.
2020-01-30 11:38:57,435	INFO node.py:511 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2020-01-30_11-38-57_183527_4355/logs.
2020-01-30 11:38:57,435	INFO services.py:1441 -- Starting the Plasma object store with 5.02 GB memory using /dev/shm.
2020-01-30 11:38:57,543	INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2020-01-30 11:38:57,543	INFO tune.py:223 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 2.1/16.7 GB

2020-01-30 11:38:57,572	WARNING logger.py:130 -- Couldn't import TensorFlow - disabling TensorBoard logging.
2020-01-30 11:38:57,573	WARNING logger.py:224 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 4/8 CPUs, 0/1 GPUs
Memory usage on this node: 2.2/16.7 GB
Result logdir: /home/joe/ray_results/reinforce_baseline_cartpole
Number of trials: 2 ({'RUNNING': 1, 'PENDING': 1})
PENDING trials:
 - ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1:	PENDING
RUNNING trials:
 - ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0:	RUNNING

2020-01-30 11:38:57,596	WARNING logger.py:130 -- Couldn't import TensorFlow - disabling TensorBoard logging.
2020-01-30 11:38:57,607	WARNING logger.py:224 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
(pid=4389) [2020-01-30 11:38:58,297 PID:4389 INFO logger.py info] Running sessions
(pid=4388) [2020-01-30 11:38:58,292 PID:4388 INFO logger.py info] Running sessions
(pid=4388) terminate called after throwing an instance of 'c10::Error'
(pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4388) 
(pid=4388) Fatal Python error: Aborted
(pid=4388) 
(pid=4388) Stack (most recent call first):
(pid=4389) [2020-01-30 11:38:58,326 PID:4456 INFO openai.py __init__] OpenAIEnv:
(pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4389) - eval_frequency = 2000
(pid=4389) - log_frequency = 10000
(pid=4389) - frame_op = None
(pid=4389) - frame_op_len = None
(pid=4389) - image_downsize = (84, 84)
(pid=4389) - normalize_state = False
(pid=4389) - reward_scale = None
(pid=4389) - num_envs = 1
(pid=4389) - name = CartPole-v0
(pid=4389) - max_t = 200
(pid=4389) - max_frame = 100000
(pid=4389) - to_render = False
(pid=4389) - is_venv = False
(pid=4389) - clock_speed = 1
(pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
(pid=4389) - done = False
(pid=4389) - total_reward = nan
(pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4389) - observation_space = Box(4,)
(pid=4389) - action_space = Discrete(2)
(pid=4389) - observable_dim = {'state': 4}
(pid=4389) - action_dim = 2
(pid=4389) - is_discrete = True
(pid=4389) [2020-01-30 11:38:58,327 PID:4453 INFO openai.py __init__] OpenAIEnv:
(pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4389) - eval_frequency = 2000
(pid=4389) - log_frequency = 10000
(pid=4389) - frame_op = None
(pid=4389) - frame_op_len = None
(pid=4389) - image_downsize = (84, 84)
(pid=4389) - normalize_state = False
(pid=4389) - reward_scale = None
(pid=4389) - num_envs = 1
(pid=4389) - name = CartPole-v0
(pid=4389) - max_t = 200
(pid=4389) - max_frame = 100000
(pid=4389) - to_render = False
(pid=4389) - is_venv = False
(pid=4389) - clock_speed = 1
(pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
(pid=4389) - done = False
(pid=4389) - total_reward = nan
(pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4389) - observation_space = Box(4,)
(pid=4389) - action_space = Discrete(2)
(pid=4389) - observable_dim = {'state': 4}
(pid=4389) - action_dim = 2
(pid=4389) - is_discrete = True
(pid=4389) [2020-01-30 11:38:58,328 PID:4450 INFO openai.py __init__] OpenAIEnv:
(pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4389) - eval_frequency = 2000
(pid=4389) - log_frequency = 10000
(pid=4389) - frame_op = None
(pid=4389) - frame_op_len = None
(pid=4389) - image_downsize = (84, 84)
(pid=4389) - normalize_state = False
(pid=4389) - reward_scale = None
(pid=4389) - num_envs = 1
(pid=4389) - name = CartPole-v0
(pid=4389) - max_t = 200
(pid=4389) - max_frame = 100000
(pid=4389) - to_render = False
(pid=4389) - is_venv = False
(pid=4389) - clock_speed = 1
(pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
(pid=4389) - done = False
(pid=4389) - total_reward = nan
(pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4389) - observation_space = Box(4,)
(pid=4389) - action_space = Discrete(2)
(pid=4389) - observable_dim = {'state': 4}
(pid=4389) - action_dim = 2
(pid=4389) - is_discrete = True
(pid=4389) [2020-01-30 11:38:58,335 PID:4458 INFO openai.py __init__] OpenAIEnv:
(pid=4389) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4389) - eval_frequency = 2000
(pid=4389) - log_frequency = 10000
(pid=4389) - frame_op = None
(pid=4389) - frame_op_len = None
(pid=4389) - image_downsize = (84, 84)
(pid=4389) - normalize_state = False
(pid=4389) - reward_scale = None
(pid=4389) - num_envs = 1
(pid=4389) - name = CartPole-v0
(pid=4389) - max_t = 200
(pid=4389) - max_frame = 100000
(pid=4389) - to_render = False
(pid=4389) - is_venv = False
(pid=4389) - clock_speed = 1
(pid=4389) - clock = <slm_lab.env.base.Clock object at 0x7fcc1a023d30>
(pid=4389) - done = False
(pid=4389) - total_reward = nan
(pid=4389) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4389) - observation_space = Box(4,)
(pid=4389) - action_space = Discrete(2)
(pid=4389) - observable_dim = {'state': 4}
(pid=4389) - action_dim = 2
(pid=4389) - is_discrete = True
(pid=4388) [2020-01-30 11:38:58,313 PID:4440 INFO openai.py __init__] OpenAIEnv:
(pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4388) - eval_frequency = 2000
(pid=4388) - log_frequency = 10000
(pid=4388) - frame_op = None
(pid=4388) - frame_op_len = None
(pid=4388) - image_downsize = (84, 84)
(pid=4388) - normalize_state = False
(pid=4388) - reward_scale = None
(pid=4388) - num_envs = 1
(pid=4388) - name = CartPole-v0
(pid=4388) - max_t = 200
(pid=4388) - max_frame = 100000
(pid=4388) - to_render = False
(pid=4388) - is_venv = False
(pid=4388) - clock_speed = 1
(pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
(pid=4388) - done = False
(pid=4388) - total_reward = nan
(pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4388) - observation_space = Box(4,)
(pid=4388) - action_space = Discrete(2)
(pid=4388) - observable_dim = {'state': 4}
(pid=4388) - action_dim = 2
(pid=4388) - is_discrete = True
(pid=4388) [2020-01-30 11:38:58,318 PID:4445 INFO openai.py __init__] OpenAIEnv:
(pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4388) - eval_frequency = 2000
(pid=4388) - log_frequency = 10000
(pid=4388) - frame_op = None
(pid=4388) - frame_op_len = None
(pid=4388) - image_downsize = (84, 84)
(pid=4388) - normalize_state = False
(pid=4388) - reward_scale = None
(pid=4388) - num_envs = 1
(pid=4388) - name = CartPole-v0
(pid=4388) - max_t = 200
(pid=4388) - max_frame = 100000
(pid=4388) - to_render = False
(pid=4388) - is_venv = False
(pid=4388) - clock_speed = 1
(pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
(pid=4388) - done = False
(pid=4388) - total_reward = nan
(pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4388) - observation_space = Box(4,)
(pid=4388) - action_space = Discrete(2)
(pid=4388) - observable_dim = {'state': 4}
(pid=4388) - action_dim = 2
(pid=4388) - is_discrete = True
(pid=4388) [2020-01-30 11:38:58,319 PID:4449 INFO openai.py __init__] OpenAIEnv:
(pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4388) - eval_frequency = 2000
(pid=4388) - log_frequency = 10000
(pid=4388) - frame_op = None
(pid=4388) - frame_op_len = None
(pid=4388) - image_downsize = (84, 84)
(pid=4388) - normalize_state = False
(pid=4388) - reward_scale = None
(pid=4388) - num_envs = 1
(pid=4388) - name = CartPole-v0
(pid=4388) - max_t = 200
(pid=4388) - max_frame = 100000
(pid=4388) - to_render = False
(pid=4388) - is_venv = False
(pid=4388) - clock_speed = 1
(pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
(pid=4388) - done = False
(pid=4388) - total_reward = nan
(pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4388) - observation_space = Box(4,)
(pid=4388) - action_space = Discrete(2)
(pid=4388) - observable_dim = {'state': 4}
(pid=4388) - action_dim = 2
(pid=4388) - is_discrete = True
(pid=4388) [2020-01-30 11:38:58,323 PID:4452 INFO openai.py __init__] OpenAIEnv:
(pid=4388) - env_spec = {'max_frame': 100000, 'max_t': None, 'name': 'CartPole-v0'}
(pid=4388) - eval_frequency = 2000
(pid=4388) - log_frequency = 10000
(pid=4388) - frame_op = None
(pid=4388) - frame_op_len = None
(pid=4388) - image_downsize = (84, 84)
(pid=4388) - normalize_state = False
(pid=4388) - reward_scale = None
(pid=4388) - num_envs = 1
(pid=4388) - name = CartPole-v0
(pid=4388) - max_t = 200
(pid=4388) - max_frame = 100000
(pid=4388) - to_render = False
(pid=4388) - is_venv = False
(pid=4388) - clock_speed = 1
(pid=4388) - clock = <slm_lab.env.base.Clock object at 0x7fce28f7fcf8>
(pid=4388) - done = False
(pid=4388) - total_reward = nan
(pid=4388) - u_env = <TrackReward<TimeLimit<CartPoleEnv<CartPole-v0>>>>
(pid=4388) - observation_space = Box(4,)
(pid=4388) - action_space = Discrete(2)
(pid=4388) - observable_dim = {'state': 4}
(pid=4388) - action_dim = 2
(pid=4388) - is_discrete = True
(pid=4389) [2020-01-30 11:38:58,339 PID:4453 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4389) [2020-01-30 11:38:58,340 PID:4450 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4389) [2020-01-30 11:38:58,343 PID:4456 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4389) [2020-01-30 11:38:58,345 PID:4450 INFO base.py __init__][2020-01-30 11:38:58,345 PID:4453 INFO base.py __init__] Reinforce:
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddcc0>
(pid=4389) - algorithm_spec = {'action_pdtype': 'default',
(pid=4389)  'action_policy': 'default',
(pid=4389)  'center_return': False,
(pid=4389)  'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                        'end_val': 0.001,
(pid=4389)                        'name': 'linear_decay',
(pid=4389)                        'start_step': 0,
(pid=4389)                        'start_val': 0.01},
(pid=4389)  'explore_var_spec': None,
(pid=4389)  'gamma': 0.99,
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'training_frequency': 1}
(pid=4389) - name = Reinforce
(pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4389) - net_spec = {'clip_grad_val': None,
(pid=4389)  'hid_layers': [64],
(pid=4389)  'hid_layers_activation': 'selu',
(pid=4389)  'loss_spec': {'name': 'MSELoss'},
(pid=4389)  'lr_scheduler_spec': None,
(pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)  'type': 'MLPNet'}
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddcc0>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bcb710>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bddd68>"
(pid=4389) }
(pid=4389) - action_pdtype = default
(pid=4389) - action_policy = <function default at 0x7fcc21560620>
(pid=4389) - center_return = False
(pid=4389) - explore_var_spec = None
(pid=4389) - entropy_coef_spec = {'end_step': 20000,
(pid=4389)  'end_val': 0.001,
(pid=4389)  'name': 'linear_decay',
(pid=4389)  'start_step': 0,
(pid=4389)  'start_val': 0.01}
(pid=4389) - policy_loss_coef = 1.0
(pid=4389) - gamma = 0.99
(pid=4389) - training_frequency = 1
(pid=4389) - to_train = 0
(pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bddd30>
(pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdda20>
(pid=4389) - net = MLPNet(
(pid=4389)   (model): Sequential(
(pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4389)     (1): SELU()
(pid=4389)   )
(pid=4389)   (model_tail): Sequential(
(pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4389)   )
(pid=4389)   (loss_fn): MSELoss()
(pid=4389) )
(pid=4389) - net_names = ['net']
(pid=4389) - optim = Adam (
(pid=4389) Parameter Group 0
(pid=4389)     amsgrad: False
(pid=4389)     betas: (0.9, 0.999)
(pid=4389)     eps: 1e-08
(pid=4389)     lr: 0.002
(pid=4389)     weight_decay: 0
(pid=4389) )
(pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10ba20b8>
(pid=4389) - global_net = None
(pid=4389)  Reinforce:
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdddd8>
(pid=4389) - algorithm_spec = {'action_pdtype': 'default',
(pid=4389)  'action_policy': 'default',
(pid=4389)  'center_return': False,
(pid=4389)  'entropy_coef_spec': {'end_step': 20000,
(pid=4388) [2020-01-30 11:38:58,330 PID:4445 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4388) [2020-01-30 11:38:58,330 PID:4449 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4388) [2020-01-30 11:38:58,335 PID:4452 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4388) [2020-01-30 11:38:58,335 PID:4449 INFO base.py __init__] Reinforce:
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e097f60>
(pid=4388) - algorithm_spec = {'action_pdtype': 'default',
(pid=4388)  'action_policy': 'default',
(pid=4388)  'center_return': True,
(pid=4388)  'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                        'end_val': 0.001,
(pid=4388)                        'name': 'linear_decay',
(pid=4388)                        'start_step': 0,
(pid=4388)                        'start_val': 0.01},
(pid=4388)  'explore_var_spec': None,
(pid=4388)  'gamma': 0.99,
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'training_frequency': 1}
(pid=4388) - name = Reinforce
(pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4388) - net_spec = {'clip_grad_val': None,
(pid=4388)  'hid_layers': [64],
(pid=4388)  'hid_layers_activation': 'selu',
(pid=4388)  'loss_spec': {'name': 'MSELoss'},
(pid=4388)  'lr_scheduler_spec': None,
(pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)  'type': 'MLPNet'}
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e097f60>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e097fd0>"
(pid=4388) }
(pid=4388) - action_pdtype = default
(pid=4388) - action_policy = <function default at 0x7fce304ad620>
(pid=4388) - center_return = True
(pid=4388) - explore_var_spec = None
(pid=4388) - entropy_coef_spec = {'end_step': 20000,
(pid=4388)  'end_val': 0.001,
(pid=4388)  'name': 'linear_decay',
(pid=4388)  'start_step': 0,
(pid=4388)  'start_val': 0.01}
(pid=4388) - policy_loss_coef = 1.0
(pid=4388) - gamma = 0.99
(pid=4388) - training_frequency = 1
(pid=4388) - to_train = 0
(pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e097c88>
(pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e083940>
(pid=4388) - net = MLPNet(
(pid=4388)   (model): Sequential(
(pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4388)     (1): SELU()
(pid=4388)   )
(pid=4388)   (model_tail): Sequential(
(pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4388)   )
(pid=4388)   (loss_fn): MSELoss()
(pid=4388) )
(pid=4388) - net_names = ['net']
(pid=4388) - optim = Adam (
(pid=4388) Parameter Group 0
(pid=4388)     amsgrad: False
(pid=4388)     betas: (0.9, 0.999)
(pid=4388)     eps: 1e-08
(pid=4388)     lr: 0.002
(pid=4388)     weight_decay: 0
(pid=4388) )
(pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e0562e8>
(pid=4388) - global_net = None
(pid=4388) [2020-01-30 11:38:58,335 PID:4445 INFO base.py __init__] Reinforce:
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e098da0>
(pid=4388) - algorithm_spec = {'action_pdtype': 'default',
(pid=4388)  'action_policy': 'default',
(pid=4388)  'center_return': True,
(pid=4388)  'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                        'end_val': 0.001,
(pid=4389)                        'name': 'linear_decay',
(pid=4389)                        'start_step': 0,
(pid=4389)                        'start_val': 0.01},
(pid=4389)  'explore_var_spec': None,
(pid=4389)  'gamma': 0.99,
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'training_frequency': 1}
(pid=4389) - name = Reinforce
(pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4389) - net_spec = {'clip_grad_val': None,
(pid=4389)  'hid_layers': [64],
(pid=4389)  'hid_layers_activation': 'selu',
(pid=4389)  'loss_spec': {'name': 'MSELoss'},
(pid=4389)  'lr_scheduler_spec': None,
(pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)  'type': 'MLPNet'}
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdddd8>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc5828>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdde80>"
(pid=4389) }
(pid=4389) - action_pdtype = default
(pid=4389) - action_policy = <function default at 0x7fcc21560620>
(pid=4389) - center_return = False
(pid=4389) - explore_var_spec = None
(pid=4389) - entropy_coef_spec = {'end_step': 20000,
(pid=4389)  'end_val': 0.001,
(pid=4389)  'name': 'linear_decay',
(pid=4389)  'start_step': 0,
(pid=4389)  'start_val': 0.01}
(pid=4389) - policy_loss_coef = 1.0
(pid=4389) - gamma = 0.99
(pid=4389) - training_frequency = 1
(pid=4389) - to_train = 0
(pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdde48>
(pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bddb38>
(pid=4389) - net = MLPNet(
(pid=4389)   (model): Sequential(
(pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4389)     (1): SELU()
(pid=4389)   )
(pid=4389)   (model_tail): Sequential(
(pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4389)   )
(pid=4389)   (loss_fn): MSELoss()
(pid=4389) )
(pid=4389) - net_names = ['net']
(pid=4389) - optim = Adam (
(pid=4389) Parameter Group 0
(pid=4389)     amsgrad: False
(pid=4389)     betas: (0.9, 0.999)
(pid=4389)     eps: 1e-08
(pid=4389)     lr: 0.002
(pid=4389)     weight_decay: 0
(pid=4389) )
(pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10ba11d0>
(pid=4389) - global_net = None
(pid=4389) [2020-01-30 11:38:58,347 PID:4453 INFO __init__.py __init__][2020-01-30 11:38:58,347 PID:4450 INFO __init__.py __init__] Agent:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4389)                'action_policy': 'default',
(pid=4389)                'center_return': False,
(pid=4389)                'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                                      'end_val': 0.001,
(pid=4389)                                      'name': 'linear_decay',
(pid=4389)                                      'start_step': 0,
(pid=4389)                                      'start_val': 0.01},
(pid=4389)                'explore_var_spec': None,
(pid=4389)                'gamma': 0.99,
(pid=4389)                'name': 'Reinforce',
(pid=4389)                'training_frequency': 1},
(pid=4389)  'memory': {'name': 'OnPolicyReplay'},
(pid=4388)                        'end_val': 0.001,
(pid=4388)                        'name': 'linear_decay',
(pid=4388)                        'start_step': 0,
(pid=4388)                        'start_val': 0.01},
(pid=4388)  'explore_var_spec': None,
(pid=4388)  'gamma': 0.99,
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'training_frequency': 1}
(pid=4388) - name = Reinforce
(pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4388) - net_spec = {'clip_grad_val': None,
(pid=4388)  'hid_layers': [64],
(pid=4388)  'hid_layers_activation': 'selu',
(pid=4388)  'loss_spec': {'name': 'MSELoss'},
(pid=4388)  'lr_scheduler_spec': None,
(pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)  'type': 'MLPNet'}
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e098da0>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e098e48>"
(pid=4388) }
(pid=4388) - action_pdtype = default
(pid=4388) - action_policy = <function default at 0x7fce304ad620>
(pid=4388) - center_return = True
(pid=4388) - explore_var_spec = None
(pid=4388) - entropy_coef_spec = {'end_step': 20000,
(pid=4388)  'end_val': 0.001,
(pid=4388)  'name': 'linear_decay',
(pid=4388)  'start_step': 0,
(pid=4388)  'start_val': 0.01}
(pid=4388) - policy_loss_coef = 1.0
(pid=4388) - gamma = 0.99
(pid=4388) - training_frequency = 1
(pid=4388) - to_train = 0
(pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e098e10>
(pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e098f28>
(pid=4388) - net = MLPNet(
(pid=4388)   (model): Sequential(
(pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4388)     (1): SELU()
(pid=4388)   )
(pid=4388)   (model_tail): Sequential(
(pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4388)   )
(pid=4388)   (loss_fn): MSELoss()
(pid=4388) )
(pid=4388) - net_names = ['net']
(pid=4388) - optim = Adam (
(pid=4388) Parameter Group 0
(pid=4388)     amsgrad: False
(pid=4388)     betas: (0.9, 0.999)
(pid=4388)     eps: 1e-08
(pid=4388)     lr: 0.002
(pid=4388)     weight_decay: 0
(pid=4388) )
(pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e05b1d0>
(pid=4388) - global_net = None
(pid=4388) [2020-01-30 11:38:58,336 PID:4449 INFO __init__.py __init__] Agent:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4388)                'action_policy': 'default',
(pid=4388)                'center_return': True,
(pid=4388)                'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                                      'end_val': 0.001,
(pid=4388)                                      'name': 'linear_decay',
(pid=4388)                                      'start_step': 0,
(pid=4388)                                      'start_val': 0.01},
(pid=4388)                'explore_var_spec': None,
(pid=4388)                'gamma': 0.99,
(pid=4388)                'name': 'Reinforce',
(pid=4388)                'training_frequency': 1},
(pid=4388)  'memory': {'name': 'OnPolicyReplay'},
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'net': {'clip_grad_val': None,
(pid=4389)          'hid_layers': [64],
(pid=4389)          'hid_layers_activation': 'selu',
(pid=4389)          'loss_spec': {'name': 'MSELoss'},
(pid=4389)          'lr_scheduler_spec': None,
(pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)          'type': 'MLPNet'}}
(pid=4389) - name = Reinforce
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdddd8>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc5828>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdde80>"
(pid=4389) }
(pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bdde10>
(pid=4389)  Agent:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4389)                'action_policy': 'default',
(pid=4389)                'center_return': False,
(pid=4389)                'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                                      'end_val': 0.001,
(pid=4389)                                      'name': 'linear_decay',
(pid=4389)                                      'start_step': 0,
(pid=4389)                                      'start_val': 0.01},
(pid=4389)                'explore_var_spec': None,
(pid=4389)                'gamma': 0.99,
(pid=4389)                'name': 'Reinforce',
(pid=4389)                'training_frequency': 1},
(pid=4389)  'memory': {'name': 'OnPolicyReplay'},
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'net': {'clip_grad_val': None,
(pid=4389)          'hid_layers': [64],
(pid=4389)          'hid_layers_activation': 'selu',
(pid=4389)          'loss_spec': {'name': 'MSELoss'},
(pid=4389)          'lr_scheduler_spec': None,
(pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)          'type': 'MLPNet'}}
(pid=4389) - name = Reinforce
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddcc0>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bcb710>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bddd68>"
(pid=4389) }
(pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bddcf8>
(pid=4389) [2020-01-30 11:38:58,347 PID:4458 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search[2020-01-30 11:38:58,347 PID:4450 INFO logger.py info][2020-01-30 11:38:58,347 PID:4453 INFO logger.py info]
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'net': {'clip_grad_val': None,
(pid=4388)          'hid_layers': [64],
(pid=4388)          'hid_layers_activation': 'selu',
(pid=4388)          'loss_spec': {'name': 'MSELoss'},
(pid=4388)          'lr_scheduler_spec': None,
(pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)          'type': 'MLPNet'}}
(pid=4388) - name = Reinforce
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e097f60>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e097fd0>"
(pid=4388) }
(pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e097f98>
(pid=4388) [2020-01-30 11:38:58,337 PID:4449 INFO logger.py info] Session:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - index = 2
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e097f60>
(pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>
(pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044eb8>
(pid=4388) [2020-01-30 11:38:58,337 PID:4449 INFO logger.py info] Running RL loop for trial 0 session 2
(pid=4388) [2020-01-30 11:38:58,337 PID:4445 INFO __init__.py __init__] Agent:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4388)                'action_policy': 'default',
(pid=4388)                'center_return': True,
(pid=4388)                'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                                      'end_val': 0.001,
(pid=4388)                                      'name': 'linear_decay',
(pid=4388)                                      'start_step': 0,
(pid=4388)                                      'start_val': 0.01},
(pid=4388)                'explore_var_spec': None,
(pid=4388)                'gamma': 0.99,
(pid=4388)                'name': 'Reinforce',
(pid=4388)                'training_frequency': 1},
(pid=4388)  'memory': {'name': 'OnPolicyReplay'},
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'net': {'clip_grad_val': None,
(pid=4388)          'hid_layers': [64],
(pid=4388)          'hid_layers_activation': 'selu',
(pid=4388)          'loss_spec': {'name': 'MSELoss'},
(pid=4388)          'lr_scheduler_spec': None,
(pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)          'type': 'MLPNet'}}
(pid=4388) - name = Reinforce
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e098da0>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4389)  Session:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - index = 0
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddcc0>
(pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0>
(pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56cc0> Session:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - index = 1
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdddd8>
(pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>
(pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56da0>
(pid=4389) 
(pid=4389) [2020-01-30 11:38:58,347 PID:4450 INFO logger.py info] Running RL loop for trial 1 session 0[2020-01-30 11:38:58,347 PID:4453 INFO logger.py info]
(pid=4389)  Running RL loop for trial 1 session 1
(pid=4389) [2020-01-30 11:38:58,348 PID:4456 INFO base.py __init__] Reinforce:
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdcf28>
(pid=4389) - algorithm_spec = {'action_pdtype': 'default',
(pid=4389)  'action_policy': 'default',
(pid=4389)  'center_return': False,
(pid=4389)  'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                        'end_val': 0.001,
(pid=4389)                        'name': 'linear_decay',
(pid=4389)                        'start_step': 0,
(pid=4389)                        'start_val': 0.01},
(pid=4389)  'explore_var_spec': None,
(pid=4389)  'gamma': 0.99,
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'training_frequency': 1}
(pid=4389) - name = Reinforce
(pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4389) - net_spec = {'clip_grad_val': None,
(pid=4389)  'hid_layers': [64],
(pid=4389)  'hid_layers_activation': 'selu',
(pid=4389)  'loss_spec': {'name': 'MSELoss'},
(pid=4389)  'lr_scheduler_spec': None,
(pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)  'type': 'MLPNet'}
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdcf28>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc7940>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdcfd0>"
(pid=4389) }
(pid=4389) - action_pdtype = default
(pid=4389) - action_policy = <function default at 0x7fcc21560620>
(pid=4389) - center_return = False
(pid=4389) - explore_var_spec = None
(pid=4389) - entropy_coef_spec = {'end_step': 20000,
(pid=4389)  'end_val': 0.001,
(pid=4389)  'name': 'linear_decay',
(pid=4389)  'start_step': 0,
(pid=4389)  'start_val': 0.01}
(pid=4389) - policy_loss_coef = 1.0
(pid=4389) - gamma = 0.99
(pid=4389) - training_frequency = 1
(pid=4389) - to_train = 0
(pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdcf98>
(pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10bdcc50>
(pid=4389) - net = MLPNet(
(pid=4389)   (model): Sequential(
(pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4389)     (1): SELU()
(pid=4389)   )
(pid=4389)   (model_tail): Sequential(
(pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4389)   )
(pid=4389)   (loss_fn): MSELoss()
(pid=4389) )
(pid=4389) - net_names = ['net']
(pid=4389) - optim = Adam (
(pid=4389) Parameter Group 0
(pid=4389)     amsgrad: False
(pid=4389)     betas: (0.9, 0.999)
(pid=4389)     eps: 1e-08
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e098e48>"
(pid=4388) }
(pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e098dd8>
(pid=4388) [2020-01-30 11:38:58,338 PID:4445 INFO logger.py info] Session:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - index = 1
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e098da0>
(pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>
(pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044da0>
(pid=4388) [2020-01-30 11:38:58,338 PID:4445 INFO logger.py info] Running RL loop for trial 0 session 1
(pid=4388) [2020-01-30 11:38:58,340 PID:4449 INFO __init__.py log_summary] Trial 0 session 2 reinforce_baseline_cartpole_t0_s2 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4388) [2020-01-30 11:38:58,340 PID:4452 INFO base.py __init__] Reinforce:
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e082a58>
(pid=4388) - algorithm_spec = {'action_pdtype': 'default',
(pid=4388)  'action_policy': 'default',
(pid=4388)  'center_return': True,
(pid=4388)  'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                        'end_val': 0.001,
(pid=4388)                        'name': 'linear_decay',
(pid=4388)                        'start_step': 0,
(pid=4388)                        'start_val': 0.01},
(pid=4388)  'explore_var_spec': None,
(pid=4388)  'gamma': 0.99,
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'training_frequency': 1}
(pid=4388) - name = Reinforce
(pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4388) - net_spec = {'clip_grad_val': None,
(pid=4388)  'hid_layers': [64],
(pid=4388)  'hid_layers_activation': 'selu',
(pid=4388)  'loss_spec': {'name': 'MSELoss'},
(pid=4388)  'lr_scheduler_spec': None,
(pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)  'type': 'MLPNet'}
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e082a58>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e0540b8>"
(pid=4388) }
(pid=4388) - action_pdtype = default
(pid=4388) - action_policy = <function default at 0x7fce304ad620>
(pid=4388) - center_return = True
(pid=4388) - explore_var_spec = None
(pid=4388) - entropy_coef_spec = {'end_step': 20000,
(pid=4388)  'end_val': 0.001,
(pid=4388)  'name': 'linear_decay',
(pid=4388)  'start_step': 0,
(pid=4388)  'start_val': 0.01}
(pid=4388) - policy_loss_coef = 1.0
(pid=4388) - gamma = 0.99
(pid=4388) - training_frequency = 1
(pid=4388) - to_train = 0
(pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e054080>
(pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e054160>
(pid=4388) - net = MLPNet(
(pid=4388)   (model): Sequential(
(pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4388)     (1): SELU()
(pid=4388)   )
(pid=4388)   (model_tail): Sequential(
(pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4388)   )
(pid=4388)   (loss_fn): MSELoss()
(pid=4388) )
(pid=4388) - net_names = ['net']
(pid=4388) - optim = Adam (
(pid=4388) Parameter Group 0
(pid=4388)     amsgrad: False
(pid=4388)     betas: (0.9, 0.999)
(pid=4388)     eps: 1e-08
(pid=4389)     lr: 0.002
(pid=4389)     weight_decay: 0
(pid=4389) )
(pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10b9a2e8>
(pid=4389) - global_net = None
(pid=4389) [2020-01-30 11:38:58,350 PID:4456 INFO __init__.py __init__] Agent:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4389)                'action_policy': 'default',
(pid=4389)                'center_return': False,
(pid=4389)                'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                                      'end_val': 0.001,
(pid=4389)                                      'name': 'linear_decay',
(pid=4389)                                      'start_step': 0,
(pid=4389)                                      'start_val': 0.01},
(pid=4389)                'explore_var_spec': None,
(pid=4389)                'gamma': 0.99,
(pid=4389)                'name': 'Reinforce',
(pid=4389)                'training_frequency': 1},
(pid=4389)  'memory': {'name': 'OnPolicyReplay'},
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'net': {'clip_grad_val': None,
(pid=4389)          'hid_layers': [64],
(pid=4389)          'hid_layers_activation': 'selu',
(pid=4389)          'loss_spec': {'name': 'MSELoss'},
(pid=4389)          'lr_scheduler_spec': None,
(pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)          'type': 'MLPNet'}}
(pid=4389) - name = Reinforce
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bdcf28>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc7940>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10bdcfd0>"
(pid=4389) }
(pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10bdcf60>
(pid=4389) [2020-01-30 11:38:58,351 PID:4456 INFO logger.py info] Session:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - index = 2
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bdcf28>
(pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>
(pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56eb8>
(pid=4389) [2020-01-30 11:38:58,351 PID:4456 INFO logger.py info] Running RL loop for trial 1 session 2
(pid=4389) [2020-01-30 11:38:58,351 PID:4450 INFO __init__.py log_summary] Trial 1 session 0 reinforce_baseline_cartpole_t1_s0 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4389) [2020-01-30 11:38:58,351 PID:4453 INFO __init__.py log_summary] Trial 1 session 1 reinforce_baseline_cartpole_t1_s1 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4389) [2020-01-30 11:38:58,352 PID:4458 INFO base.py __init__] Reinforce:
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddd68>
(pid=4389) - algorithm_spec = {'action_pdtype': 'default',
(pid=4389)  'action_policy': 'default',
(pid=4389)  'center_return': False,
(pid=4389)  'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                        'end_val': 0.001,
(pid=4389)                        'name': 'linear_decay',
(pid=4389)                        'start_step': 0,
(pid=4389)                        'start_val': 0.01},
(pid=4389)  'explore_var_spec': None,
(pid=4389)  'gamma': 0.99,
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'training_frequency': 1}
(pid=4389) - name = Reinforce
(pid=4389) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4389) - net_spec = {'clip_grad_val': None,
(pid=4389)  'hid_layers': [64],
(pid=4389)  'hid_layers_activation': 'selu',
(pid=4389)  'loss_spec': {'name': 'MSELoss'},
(pid=4389)  'lr_scheduler_spec': None,
(pid=4389)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)  'type': 'MLPNet'}
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddd68>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4388)     lr: 0.002
(pid=4388)     weight_decay: 0
(pid=4388) )
(pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e054400>
(pid=4388) - global_net = None
(pid=4388) [2020-01-30 11:38:58,342 PID:4445 INFO __init__.py log_summary] Trial 0 session 1 reinforce_baseline_cartpole_t0_s1 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO __init__.py __init__] Agent:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4388)                'action_policy': 'default',
(pid=4388)                'center_return': True,
(pid=4388)                'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                                      'end_val': 0.001,
(pid=4388)                                      'name': 'linear_decay',
(pid=4388)                                      'start_step': 0,
(pid=4388)                                      'start_val': 0.01},
(pid=4388)                'explore_var_spec': None,
(pid=4388)                'gamma': 0.99,
(pid=4388)                'name': 'Reinforce',
(pid=4388)                'training_frequency': 1},
(pid=4388)  'memory': {'name': 'OnPolicyReplay'},
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'net': {'clip_grad_val': None,
(pid=4388)          'hid_layers': [64],
(pid=4388)          'hid_layers_activation': 'selu',
(pid=4388)          'loss_spec': {'name': 'MSELoss'},
(pid=4388)          'lr_scheduler_spec': None,
(pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)          'type': 'MLPNet'}}
(pid=4388) - name = Reinforce
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e082a58>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e0540b8>"
(pid=4388) }
(pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e054048>
(pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO logger.py info] Session:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - index = 3
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e082a58>
(pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>
(pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044fd0>
(pid=4388) [2020-01-30 11:38:58,342 PID:4452 INFO logger.py info] Running RL loop for trial 0 session 3
(pid=4388) [2020-01-30 11:38:58,343 PID:4440 INFO base.py post_init_nets] Initialized algorithm models for lab_mode: search
(pid=4388) [2020-01-30 11:38:58,346 PID:4452 INFO __init__.py log_summary] Trial 0 session 3 reinforce_baseline_cartpole_t0_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4388) [2020-01-30 11:38:58,348 PID:4440 INFO base.py __init__] Reinforce:
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e09ac88>
(pid=4388) - algorithm_spec = {'action_pdtype': 'default',
(pid=4388)  'action_policy': 'default',
(pid=4388)  'center_return': True,
(pid=4388)  'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                        'end_val': 0.001,
(pid=4388)                        'name': 'linear_decay',
(pid=4388)                        'start_step': 0,
(pid=4388)                        'start_val': 0.01},
(pid=4388)  'explore_var_spec': None,
(pid=4388)  'gamma': 0.99,
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'training_frequency': 1}
(pid=4388) - name = Reinforce
(pid=4388) - memory_spec = {'name': 'OnPolicyReplay'}
(pid=4388) - net_spec = {'clip_grad_val': None,
(pid=4388)  'hid_layers': [64],
(pid=4388)  'hid_layers_activation': 'selu',
(pid=4388)  'loss_spec': {'name': 'MSELoss'},
(pid=4388)  'lr_scheduler_spec': None,
(pid=4388)  'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)  'type': 'MLPNet'}
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e09ac88>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388) terminate called after throwing an instance of 'c10::Error'
(pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4388) 
(pid=4388) Fatal Python error: Aborted
(pid=4388) 
(pid=4388) Stack (most recent call first):
(pid=4388) terminate called after throwing an instance of 'c10::Error'
(pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4388) 
(pid=4388) Fatal Python error: Aborted
(pid=4388) 
(pid=4388) Stack (most recent call first):
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc6a58>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10b9a0b8>"
(pid=4389) }
(pid=4389) - action_pdtype = default
(pid=4389) - action_policy = <function default at 0x7fcc21560620>
(pid=4389) - center_return = False
(pid=4389) - explore_var_spec = None
(pid=4389) - entropy_coef_spec = {'end_step': 20000,
(pid=4389)  'end_val': 0.001,
(pid=4389)  'name': 'linear_decay',
(pid=4389)  'start_step': 0,
(pid=4389)  'start_val': 0.01}
(pid=4389) - policy_loss_coef = 1.0
(pid=4389) - gamma = 0.99
(pid=4389) - training_frequency = 1
(pid=4389) - to_train = 0
(pid=4389) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10b9a080>
(pid=4389) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fcc10b9a160>
(pid=4389) - net = MLPNet(
(pid=4389)   (model): Sequential(
(pid=4389)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4389)     (1): SELU()
(pid=4389)   )
(pid=4389)   (model_tail): Sequential(
(pid=4389)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4389)   )
(pid=4389)   (loss_fn): MSELoss()
(pid=4389) )
(pid=4389) - net_names = ['net']
(pid=4389) - optim = Adam (
(pid=4389) Parameter Group 0
(pid=4389)     amsgrad: False
(pid=4389)     betas: (0.9, 0.999)
(pid=4389)     eps: 1e-08
(pid=4389)     lr: 0.002
(pid=4389)     weight_decay: 0
(pid=4389) )
(pid=4389) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fcc10b9a400>
(pid=4389) - global_net = None
(pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO __init__.py __init__] Agent:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4389)                'action_policy': 'default',
(pid=4389)                'center_return': False,
(pid=4389)                'entropy_coef_spec': {'end_step': 20000,
(pid=4389)                                      'end_val': 0.001,
(pid=4389)                                      'name': 'linear_decay',
(pid=4389)                                      'start_step': 0,
(pid=4389)                                      'start_val': 0.01},
(pid=4389)                'explore_var_spec': None,
(pid=4389)                'gamma': 0.99,
(pid=4389)                'name': 'Reinforce',
(pid=4389)                'training_frequency': 1},
(pid=4389)  'memory': {'name': 'OnPolicyReplay'},
(pid=4389)  'name': 'Reinforce',
(pid=4389)  'net': {'clip_grad_val': None,
(pid=4389)          'hid_layers': [64],
(pid=4389)          'hid_layers_activation': 'selu',
(pid=4389)          'loss_spec': {'name': 'MSELoss'},
(pid=4389)          'lr_scheduler_spec': None,
(pid=4389)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4389)          'type': 'MLPNet'}}
(pid=4389) - name = Reinforce
(pid=4389) - body = body: {
(pid=4389)   "agent": "<slm_lab.agent.Agent object at 0x7fcc10bddd68>",
(pid=4389)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>",
(pid=4389)   "a": 0,
(pid=4389)   "e": 0,
(pid=4389)   "b": 0,
(pid=4389)   "aeb": "(0, 0, 0)",
(pid=4389)   "explore_var": NaN,
(pid=4389)   "entropy_coef": 0.01,
(pid=4389)   "loss": NaN,
(pid=4389)   "mean_entropy": NaN,
(pid=4389)   "mean_grad_norm": NaN,
(pid=4389)   "best_total_reward_ma": -Infinity,
(pid=4389)   "total_reward_ma": NaN,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e09ad30>"
(pid=4388) }
(pid=4388) - action_pdtype = default
(pid=4388) - action_policy = <function default at 0x7fce304ad620>
(pid=4388) - center_return = True
(pid=4388) - explore_var_spec = None
(pid=4388) - entropy_coef_spec = {'end_step': 20000,
(pid=4388)  'end_val': 0.001,
(pid=4388)  'name': 'linear_decay',
(pid=4388)  'start_step': 0,
(pid=4388)  'start_val': 0.01}
(pid=4388) - policy_loss_coef = 1.0
(pid=4388) - gamma = 0.99
(pid=4388) - training_frequency = 1
(pid=4388) - to_train = 0
(pid=4388) - explore_var_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e09acf8>
(pid=4388) - entropy_coef_scheduler = <slm_lab.agent.algorithm.policy_util.VarScheduler object at 0x7fce0e09ae10>
(pid=4388) - net = MLPNet(
(pid=4388)   (model): Sequential(
(pid=4388)     (0): Linear(in_features=4, out_features=64, bias=True)
(pid=4388)     (1): SELU()
(pid=4388)   )
(pid=4388)   (model_tail): Sequential(
(pid=4388)     (0): Linear(in_features=64, out_features=2, bias=True)
(pid=4388)   )
(pid=4388)   (loss_fn): MSELoss()
(pid=4388) )
(pid=4388) - net_names = ['net']
(pid=4388) - optim = Adam (
(pid=4388) Parameter Group 0
(pid=4388)     amsgrad: False
(pid=4388)     betas: (0.9, 0.999)
(pid=4388)     eps: 1e-08
(pid=4388)     lr: 0.002
(pid=4388)     weight_decay: 0
(pid=4388) )
(pid=4388) - lr_scheduler = <slm_lab.agent.net.net_util.NoOpLRScheduler object at 0x7fce0e05c0b8>
(pid=4388) - global_net = None
(pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO __init__.py __init__] Agent:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - agent_spec = {'algorithm': {'action_pdtype': 'default',
(pid=4388)                'action_policy': 'default',
(pid=4388)                'center_return': True,
(pid=4388)                'entropy_coef_spec': {'end_step': 20000,
(pid=4388)                                      'end_val': 0.001,
(pid=4388)                                      'name': 'linear_decay',
(pid=4388)                                      'start_step': 0,
(pid=4388)                                      'start_val': 0.01},
(pid=4388)                'explore_var_spec': None,
(pid=4388)                'gamma': 0.99,
(pid=4388)                'name': 'Reinforce',
(pid=4388)                'training_frequency': 1},
(pid=4388)  'memory': {'name': 'OnPolicyReplay'},
(pid=4388)  'name': 'Reinforce',
(pid=4388)  'net': {'clip_grad_val': None,
(pid=4388)          'hid_layers': [64],
(pid=4388)          'hid_layers_activation': 'selu',
(pid=4388)          'loss_spec': {'name': 'MSELoss'},
(pid=4388)          'lr_scheduler_spec': None,
(pid=4388)          'optim_spec': {'lr': 0.002, 'name': 'Adam'},
(pid=4388)          'type': 'MLPNet'}}
(pid=4388) - name = Reinforce
(pid=4388) - body = body: {
(pid=4388)   "agent": "<slm_lab.agent.Agent object at 0x7fce0e09ac88>",
(pid=4388)   "env": "<slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>",
(pid=4388)   "a": 0,
(pid=4388)   "e": 0,
(pid=4388)   "b": 0,
(pid=4388)   "aeb": "(0, 0, 0)",
(pid=4388)   "explore_var": NaN,
(pid=4388)   "entropy_coef": 0.01,
(pid=4388)   "loss": NaN,
(pid=4388)   "mean_entropy": NaN,
(pid=4388)   "mean_grad_norm": NaN,
(pid=4388)   "best_total_reward_ma": -Infinity,
(pid=4389)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4389)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fcc10bc6a58>",
(pid=4389)   "tb_actions": [],
(pid=4389)   "tb_tracker": {},
(pid=4389)   "observation_space": "Box(4,)",
(pid=4389)   "action_space": "Discrete(2)",
(pid=4389)   "observable_dim": {
(pid=4389)     "state": 4
(pid=4389)   },
(pid=4389)   "state_dim": 4,
(pid=4389)   "action_dim": 2,
(pid=4389)   "is_discrete": true,
(pid=4389)   "action_type": "discrete",
(pid=4389)   "action_pdtype": "Categorical",
(pid=4389)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4389)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fcc10b9a0b8>"
(pid=4389) }
(pid=4389) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fcc10b9a048>
(pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO logger.py info] Session:
(pid=4389) - spec = reinforce_baseline_cartpole
(pid=4389) - index = 3
(pid=4389) - agent = <slm_lab.agent.Agent object at 0x7fcc10bddd68>
(pid=4389) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>
(pid=4389) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fcc10c56fd0>
(pid=4389) [2020-01-30 11:38:58,354 PID:4458 INFO logger.py info] Running RL loop for trial 1 session 3
(pid=4389) [2020-01-30 11:38:58,355 PID:4456 INFO __init__.py log_summary] Trial 1 session 2 reinforce_baseline_cartpole_t1_s2 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4389) [2020-01-30 11:38:58,358 PID:4458 INFO __init__.py log_summary] Trial 1 session 3 reinforce_baseline_cartpole_t1_s3 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4388)   "total_reward_ma": NaN,
(pid=4388)   "train_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "eval_df": "Empty DataFrame\nColumns: [epi, t, wall_t, opt_step, frame, fps, total_reward, total_reward_ma, loss, lr, explore_var, entropy_coef, entropy, grad_norm]\nIndex: []",
(pid=4388)   "tb_writer": "<torch.utils.tensorboard.writer.SummaryWriter object at 0x7fce2b00a780>",
(pid=4388)   "tb_actions": [],
(pid=4388)   "tb_tracker": {},
(pid=4388)   "observation_space": "Box(4,)",
(pid=4388)   "action_space": "Discrete(2)",
(pid=4388)   "observable_dim": {
(pid=4388)     "state": 4
(pid=4388)   },
(pid=4388)   "state_dim": 4,
(pid=4388)   "action_dim": 2,
(pid=4388)   "is_discrete": true,
(pid=4388)   "action_type": "discrete",
(pid=4388)   "action_pdtype": "Categorical",
(pid=4388)   "ActionPD": "<class 'torch.distributions.categorical.Categorical'>",
(pid=4388)   "memory": "<slm_lab.agent.memory.onpolicy.OnPolicyReplay object at 0x7fce0e09ad30>"
(pid=4388) }
(pid=4388) - algorithm = <slm_lab.agent.algorithm.reinforce.Reinforce object at 0x7fce0e09acc0>
(pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO logger.py info] Session:
(pid=4388) - spec = reinforce_baseline_cartpole
(pid=4388) - index = 0
(pid=4388) - agent = <slm_lab.agent.Agent object at 0x7fce0e09ac88>
(pid=4388) - env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>
(pid=4388) - eval_env = <slm_lab.env.openai.OpenAIEnv object at 0x7fce28044cc0>
(pid=4388) [2020-01-30 11:38:58,350 PID:4440 INFO logger.py info] Running RL loop for trial 0 session 0
(pid=4388) [2020-01-30 11:38:58,354 PID:4440 INFO __init__.py log_summary] Trial 0 session 0 reinforce_baseline_cartpole_t0_s0 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.002  explore_var: nan  entropy_coef: 0.01  entropy: nan  grad_norm: nan
(pid=4388) terminate called after throwing an instance of 'c10::Error'
(pid=4388)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4388) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf770dedc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4388) frame #1: <unknown function> + 0xca67 (0x7fcf6f2daa67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4388) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcf6f9fbb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4388) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcfa636128a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4388) frame #4: <unknown function> + 0xc8421 (0x7fcfbb3bd421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4388) frame #5: <unknown function> + 0x76db (0x7fcfc0c466db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4388) frame #6: clone + 0x3f (0x7fcfc096f88f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4388) 
(pid=4388) Fatal Python error: Aborted
(pid=4388) 
(pid=4388) Stack (most recent call first):
(pid=4389) terminate called after throwing an instance of 'c10::Error'
(pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4389) 
(pid=4389) Fatal Python error: Aborted
(pid=4389) 
(pid=4389) Stack (most recent call first):
(pid=4389) terminate called after throwing an instance of 'c10::Error'
(pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4389) 
(pid=4389) Fatal Python error: Aborted
(pid=4389) 
(pid=4389) Stack (most recent call first):
(pid=4389) terminate called after throwing an instance of 'c10::Error'
(pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4389) 
(pid=4389) Fatal Python error: Aborted
(pid=4389) 
(pid=4389) Stack (most recent call first):
(pid=4389) terminate called after throwing an instance of 'c10::Error'
(pid=4389)   what():  CUDA error: initialization error (getDevice at /opt/conda/conda-bld/pytorch_1556653114079/work/c10/cuda/impl/CUDAGuardImpl.h:35)
(pid=4389) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcd68190dc5 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10.so)
(pid=4389) frame #1: <unknown function> + 0xca67 (0x7fcd6038ca67 in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
(pid=4389) frame #2: torch::autograd::Engine::thread_init(int) + 0x3ee (0x7fcd60aadb1e in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
(pid=4389) frame #3: torch::autograd::python::PythonEngine::thread_init(int) + 0x2a (0x7fcd9741328a in /home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
(pid=4389) frame #4: <unknown function> + 0xc8421 (0x7fcdac471421 in /home/joe/anaconda3/envs/lab/bin/../lib/libstdc++.so.6)
(pid=4389) frame #5: <unknown function> + 0x76db (0x7fcdb1cfa6db in /lib/x86_64-linux-gnu/libpthread.so.0)
(pid=4389) frame #6: clone + 0x3f (0x7fcdb1a2388f in /lib/x86_64-linux-gnu/libc.so.6)
(pid=4389) 
(pid=4389) Fatal Python error: Aborted
(pid=4389) 
(pid=4389) Stack (most recent call first):
(pid=4388) 2020-01-30 11:38:58,550	ERROR function_runner.py:96 -- Runner Thread raised error.
(pid=4388) Traceback (most recent call last):
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
(pid=4388)     self._entrypoint()
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
(pid=4388)     return self._trainable_func(config, self._status_reporter)
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
(pid=4388)     output = train_func(config, reporter)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
(pid=4388)     metrics = Trial(spec).run()
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
(pid=4388)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
(pid=4388)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
(pid=4388)     frames = session_metrics_list[0]['local']['frames']
(pid=4388) IndexError: list index out of range
(pid=4388) Exception in thread Thread-1:
(pid=4388) Traceback (most recent call last):
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
(pid=4388)     self._entrypoint()
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
(pid=4388)     return self._trainable_func(config, self._status_reporter)
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
(pid=4388)     output = train_func(config, reporter)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
(pid=4388)     metrics = Trial(spec).run()
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
(pid=4388)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
(pid=4388)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
(pid=4388)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
(pid=4388)     frames = session_metrics_list[0]['local']['frames']
(pid=4388) IndexError: list index out of range
(pid=4388) 
(pid=4388) During handling of the above exception, another exception occurred:
(pid=4388) 
(pid=4388) Traceback (most recent call last):
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/threading.py", line 917, in _bootstrap_inner
(pid=4388)     self.run()
(pid=4388)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 102, in run
(pid=4388)     err_tb = err_tb.format_exc()
(pid=4388) AttributeError: 'traceback' object has no attribute 'format_exc'
(pid=4388) 
(pid=4389) 2020-01-30 11:38:58,570	ERROR function_runner.py:96 -- Runner Thread raised error.
(pid=4389) Traceback (most recent call last):
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
(pid=4389)     self._entrypoint()
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
(pid=4389)     return self._trainable_func(config, self._status_reporter)
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
(pid=4389)     output = train_func(config, reporter)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
(pid=4389)     metrics = Trial(spec).run()
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
(pid=4389)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
(pid=4389)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
(pid=4389)     frames = session_metrics_list[0]['local']['frames']
(pid=4389) IndexError: list index out of range
(pid=4389) Exception in thread Thread-1:
(pid=4389) Traceback (most recent call last):
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 90, in run
(pid=4389)     self._entrypoint()
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 141, in entrypoint
(pid=4389)     return self._trainable_func(config, self._status_reporter)
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 249, in _trainable_func
(pid=4389)     output = train_func(config, reporter)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 90, in ray_trainable
(pid=4389)     metrics = Trial(spec).run()
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 181, in run
(pid=4389)     metrics = analysis.analyze_trial(self.spec, session_metrics_list)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 265, in analyze_trial
(pid=4389)     trial_metrics = calc_trial_metrics(session_metrics_list, info_prepath)
(pid=4389)   File "/home/joe/SLM-Lab/slm_lab/experiment/analysis.py", line 187, in calc_trial_metrics
(pid=4389)     frames = session_metrics_list[0]['local']['frames']
(pid=4389) IndexError: list index out of range
(pid=4389) 
(pid=4389) During handling of the above exception, another exception occurred:
(pid=4389) 
(pid=4389) Traceback (most recent call last):
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/threading.py", line 917, in _bootstrap_inner
(pid=4389)     self.run()
(pid=4389)   File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 102, in run
(pid=4389)     err_tb = err_tb.format_exc()
(pid=4389) AttributeError: 'traceback' object has no attribute 'format_exc'
(pid=4389) 
2020-01-30 11:38:59,690	ERROR trial_runner.py:497 -- Error processing event.
Traceback (most recent call last):
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 446, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 316, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=4388, host=Gauss)
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 203, in _train
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.

2020-01-30 11:38:59,694	INFO ray_trial_executor.py:180 -- Destroying actor for trial ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2020-01-30 11:38:59,705	ERROR trial_runner.py:497 -- Error processing event.
Traceback (most recent call last):
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 446, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 316, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=4389, host=Gauss)
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/function_runner.py", line 203, in _train
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.

2020-01-30 11:38:59,707	INFO ray_trial_executor.py:180 -- Destroying actor for trial ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 2.5/16.7 GB
Result logdir: /home/joe/ray_results/reinforce_baseline_cartpole
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0:	ERROR, 1 failures: /home/joe/ray_results/reinforce_baseline_cartpole/ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0_2020-01-30_11-38-57n2qc80ke/error_2020-01-30_11-38-59.txt
 - ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1:	ERROR, 1 failures: /home/joe/ray_results/reinforce_baseline_cartpole/ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1_2020-01-30_11-38-57unqmlqvg/error_2020-01-30_11-38-59.txt

Traceback (most recent call last):
  File "run_lab.py", line 80, in <module>
    main()
  File "run_lab.py", line 72, in main
    read_spec_and_run(*args)
  File "run_lab.py", line 56, in read_spec_and_run
    run_spec(spec, lab_mode)
  File "run_lab.py", line 35, in run_spec
    Experiment(spec).run()
  File "/home/joe/SLM-Lab/slm_lab/experiment/control.py", line 203, in run
    trial_data_dict = search.run_ray_search(self.spec)
  File "/home/joe/SLM-Lab/slm_lab/experiment/search.py", line 124, in run_ray_search
    server_port=util.get_port(),
  File "/home/joe/anaconda3/envs/lab/lib/python3.7/site-packages/ray/tune/tune.py", line 265, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [ray_trainable_0_agent.0.algorithm.center_return=True,trial_index=0, ray_trainable_1_agent.0.algorithm.center_return=False,trial_index=1])

Explore var spec configuration

I see a few spec JSONs with the configuration below for explore_var_spec:

"explore_var_spec": { "name": "linear_decay", "start_val": 3.0, "end_val": 1.0, "start_step": 0, "end_step": 2000, },

I know that the following one is possible:

"explore_var_spec": { "name": "linear_decay", "start_val": 1.0, "end_val": 0.1, "start_step": 0, "end_step": 2000, },

Is a "start_val" > 1 possible, or have I missed something?

How to modify spec.json in enjoy/eval mode?

I have trained a model saved in the /data directory. I use 'enjoy@dqn_cartpole_t0_s0' to check the model's performance. But how can I modify the spec.json configuration to reduce the 'max_frame' or 'eval_frequency' items in enjoy/eval mode? Modifying 'dqn_cartpole_spec.json' and 'dqn_cartpole_t0_s0_spec.json' in /data does not work.

And can I get graphs in enjoy or eval mode similar to those produced in train mode?

Thank you very much!

Are there any benchmarks for SIL?

Hi @kengz, SLM-Lab is a great piece of work. I am trying to use SIL (self-imitation learning) for my project. Have you tested this algorithm in SLM-Lab, especially with PPO? I tried to implement SIL with PPO, but I found it did not work; in other words, SIL did not help to improve PPO.

VectorEnvWrapper base missing spec

Describe the bug
When writing a new vector env wrapper by extending the base class VecEnvWrapper, it appears that calling super is not enough to make it work. It seems that the OpenAIEnv constructor expects a wrapper to carry a spec attribute, which is not present in the VecEnvWrapper base class.

To Reproduce

  1. OS and environment: Ubuntu 18.04, conda, etc as per docs
  2. SLM Lab git SHA (run git rev-parse HEAD to get it): 41e6918
  3. spec file used: not relevant, self-evident from code review

Additional context

The below is required to implement a working wrapper... removing the self.spec = venv.spec causes it to fail, due to the reference in the OpenAIEnv class constructor.

class VectorLogWrapper(VecEnvWrapper):
    def __init__(self, venv, arg_one, arg_two):
        self.spec = venv.spec
        super().__init__(venv, venv.observation_space, venv.action_space)
        self.arg_one = arg_one
        self.arg_two = arg_two

    def step_wait(self):
        obs, rews, news, infos = self.venv.step_wait()
        logger.info(f'step_called arg_one: {self.arg_one}')
        return obs, rews, news, infos

    def reset(self):
        s = self.venv.reset()
        logger.info('reset_called')
        return s

The reference to spec is used in OpenAIEnv on the below line.

self.max_t = self.max_t or self.u_env.spec.max_episode_steps

Context below...

class OpenAIEnv(BaseEnv):
    '''
    Wrapper for OpenAI Gym env to work with the Lab.

    e.g. env_spec
    "env": [{
        "name": "PongNoFrameskip-v4",
        "frame_op": "concat",
        "frame_op_len": 4,
        "normalize_state": false,
        "reward_scale": "sign",
        "num_envs": 8,
        "max_t": null,
        "max_frame": 1e7
    }],
    '''

    def __init__(self, spec):
        super().__init__(spec)
        try_register_env(spec)  # register if it's a custom gym env
        seed = ps.get(spec, 'meta.random_seed')
        episode_life = not util.in_eval_lab_modes()
        if self.is_venv:  # make vector environment
            self.u_env = make_gym_venv(spec=spec['env'][0], name=self.name, num_envs=self.num_envs, seed=seed, frame_op=self.frame_op, frame_op_len=self.frame_op_len, image_downsize=self.image_downsize, reward_scale=self.reward_scale, normalize_state=self.normalize_state, episode_life=episode_life)
        else:
            self.u_env = make_gym_env(spec=spec['env'][0], name=self.name, seed=seed, frame_op=self.frame_op, frame_op_len=self.frame_op_len, image_downsize=self.image_downsize, reward_scale=self.reward_scale, normalize_state=self.normalize_state, episode_life=episode_life)
        if self.name.startswith('Unity'):
            # Unity is always initialized as singleton gym env, but the Unity runtime can be vec_env
            self.num_envs = self.u_env.num_envs
            # update variables dependent on num_envs
            self._infer_venv_attr()
            self._set_clock()
        self._set_attr_from_u_env(self.u_env)
        self.max_t = self.max_t or self.u_env.spec.max_episode_steps
        assert self.max_t is not None
        logger.info(util.self_desc(self))

Perhaps this should be initialized in the base class? Not sure of the correct fix as I'm not yet familiar with the overall design.
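For illustration, a minimal sketch (an assumption based on the constructor call shown above, not the actual SLM-Lab/baselines source) of what initializing spec in the base class could look like:

# Hypothetical base-class fix: carry the wrapped env's spec on the wrapper itself,
# so OpenAIEnv can read max_episode_steps from any wrapper without each subclass
# having to set self.spec manually.
class VecEnvWrapper:
    def __init__(self, venv, observation_space=None, action_space=None):
        self.venv = venv
        self.spec = getattr(venv, 'spec', None)  # expose the underlying env spec
        self.observation_space = observation_space or venv.observation_space
        self.action_space = action_space or venv.action_space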

Better Documentation of Logging/Analysis

First off thank you for this library!

I wanted to ask for your help in understanding the analysis and logging of the training.

During training a lot of information is dumped:

Trial 0 session 3 reinforce_cartpole_t0_s3 [eval_df metrics] final_return_ma: 167.6  strength: 145.74  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 2.22874e-05  training_efficiency: 0.000511296  stability: 0.935119

Among these, several are new to me in this specific RL context. What are strength and stability, and how are sample_efficiency and training_efficiency calculated?

In the documentation, the explanations of the metrics are rather brief, describing them as "self-explanatory", and in the codebase itself the comments didn't seem very accessible to me.

Would it be possible for you to give an overview of these terms, or possibly point me to a resource that explains them? I couldn't find them in your book.

Additionally, if you can point me to the part where the log formatting is done, I can spend some time getting it formatted a bit better. The single-line dump is currently hard to read.

Thank you very much!

Real recurrent policy support

Are you requesting a feature or an implementation?

To handle partially observable MDP tasks, recurrent policies are currently quite popular. We need to add an LSTM layer after the original conv (or MLP) policy and store the hidden states for training. But in SLM-Lab, the RecurrentNet class has limited abilities: it is more like a concatenation of a series of input states, and the hidden states of the RNN are not stored, which seriously weakens the recurrent policy; a minimal sketch of what I mean is included below.
For example, I used it with the default parameters to solve the cartpole task, and it failed.
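For reference, a minimal sketch (plain PyTorch, an illustrative assumption rather than SLM-Lab code) of a recurrent policy that carries its own hidden state across timesteps and exposes it so rollouts can store it for training:

# Hypothetical recurrent policy for a discrete action space.
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state, hx, cx):
        x = torch.relu(self.encoder(state))
        hx, cx = self.lstm(x, (hx, cx))  # new hidden state, to be stored with the transition
        return self.head(hx), hx, cx

# During rollout, (hx, cx) is kept alongside each transition so training can be
# initialized from the same recurrent state:
policy = RecurrentPolicy(state_dim=4, action_dim=2)
hx = cx = torch.zeros(1, 64)
state = torch.zeros(1, 4)
logits, hx, cx = policy(state, hx, cx)
action = torch.distributions.Categorical(logits=logits).sample()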

python run_lab.py slm_lab/spec/experimental/ppo/ppo_cartpole.json ppo_rnn_separate_cartpole  train

Even after I changed the max_frame parameter of the env from 500 to 50000, the RecurrentNet still did not work.

[2019-07-14 21:11:38,098 PID:18904 INFO logger.py info] Session 1 done
[2019-07-14 21:11:38,287 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [train_df metrics] final_return_ma: 58.26  strength: 35.4753  max_strength: 178.14  final_strength: 37.14  sample_efficiency: 9.07107e-05  training_efficiency: 6.71198e-06  stability: 0.846315
[2019-07-14 21:11:38,468 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 647  t: 126  wall_t: 655  opt_step: 997120  frame: 49859  fps: 76.1206  total_reward: 126  total_reward_ma: 88.02  loss: 0.610099  lr: 1.44304e-37  explore_var: nan  entropy_coef: 0.001  entropy: 0.0258675  grad_norm: nan
[2019-07-14 21:11:38,835 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 648  t: 54  wall_t: 656  opt_step: 997760  frame: 49913  fps: 76.0869  total_reward: 54  total_reward_ma: 88.02  loss: 0.554544  lr: 1.44304e-37  explore_var: nan  entropy_coef: 0.001  entropy: 0.217777  grad_norm: nan
[2019-07-14 21:11:38,835 PID:18906 INFO __init__.py log_metrics] Trial 0 session 3 ppo_rnn_separate_cartpole_t0_s3 [eval_df metrics] final_return_ma: 79.4461  strength: 57.5861  max_strength: 159.64  final_strength: 54.39  sample_efficiency: 9.59096e-05  training_efficiency: 4.81586e-06  stability: 0.899133
[2019-07-14 21:11:38,836 PID:18906 INFO logger.py info] Session 3 done
[2019-07-14 21:11:39,296 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299  strength: 39.439  max_strength: 178.14  final_strength: 32.64  sample_efficiency: 0.000120629  training_efficiency: 6.06361e-06  stability: 0.84144
[2019-07-14 21:11:39,794 PID:18905 INFO logger.py info] Running eval ckpt
[2019-07-14 21:11:39,939 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df] epi: 649  t: 0  wall_t: 657  opt_step: 999680  frame: 50000  fps: 76.1035  total_reward: 84.25  total_reward_ma: 78.0294  loss: 2.42707  lr: 1.44304e-37  explore_var: nan  entropy_coef: 0.001  entropy: 0.135592  grad_norm: nan
[2019-07-14 21:11:40,234 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299  strength: 39.439  max_strength: 178.14  final_strength: 32.64  sample_efficiency: 0.000120629  training_efficiency: 6.06361e-06  stability: 0.84144
[2019-07-14 21:11:40,236 PID:18903 INFO logger.py info] Session 0 done
[2019-07-14 21:11:41,480 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df metrics] final_return_ma: 88.02  strength: 55.0476  max_strength: 178.14  final_strength: 32.14  sample_efficiency: 8.00063e-05  training_efficiency: 4.46721e-06  stability: 0.708828
[2019-07-14 21:11:42,347 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294  strength: 56.1694  max_strength: 84.39  final_strength: 62.39  sample_efficiency: 8.97979e-05  training_efficiency: 4.50698e-06  stability: 0.860915
[2019-07-14 21:11:43,242 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294  strength: 56.1694  max_strength: 84.39  final_strength: 62.39  sample_efficiency: 8.97979e-05  training_efficiency: 4.50698e-06  stability: 0.860915
[2019-07-14 21:11:43,243 PID:18905 INFO logger.py info] Session 2 done
[2019-07-14 21:11:49,818 PID:18839 INFO analysis.py analyze_trial] All trial data zipped to data/ppo_rnn_separate_cartpole_2019_07_14_210040.zip
[2019-07-14 21:11:49,818 PID:18839 INFO logger.py info] Trial 0 done

If you have any suggested solutions

I'm afraid of causing more bugs, so I'm sorry that I'm not able to add this new feature myself. But I can point to two reference implementations:
OpenAI baselines
pytorch-a2c-ppo-acktr-gail

With this feature, I believe SLM-Lab would be the top PyTorch RL library.

Thanks in advance!

Resume a training

Are you requesting a feature or an implementation?
I would like to know if it is possible to load a previously trained experiment and continue the training (i.e. load the neural network and start a new training run with the previously trained neural network as the initial network).
This would be useful in case a previous experiment didn't reach a plateau within the previously assigned number of steps, or in order to reuse a previously trained neural network for a similar task.

If you have any suggested solutions
Add a "resume_training" mode in order to continue the training
Or add the possibility to load a neural net model in a training spec
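A minimal sketch of the second suggestion in plain PyTorch (paths and architecture here are hypothetical; this is not an existing SLM-Lab feature):

import torch
import torch.nn as nn

# Hypothetical architecture; resuming requires the same architecture as the saved model.
net = nn.Sequential(nn.Linear(4, 64), nn.SELU(), nn.Linear(64, 2))

# Suppose a previous run saved its weights like this (path is illustrative):
torch.save(net.state_dict(), '/tmp/model_net.pt')

# A later run could restore them before training, instead of starting from random weights:
resumed_net = nn.Sequential(nn.Linear(4, 64), nn.SELU(), nn.Linear(64, 2))
resumed_net.load_state_dict(torch.load('/tmp/model_net.pt'))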

Running inside Google Colab

I'm attempting to get SLM-Lab working inside a Google Colab instance. I can't get the getting started example working because of a torch error.

To Reproduce
Run this notebook

Error logs

Traceback (most recent call last):
  File "run_lab.py", line 77, in <module>
    mp.set_start_method('spawn')  # for distributed pytorch to work
  File "/usr/lib/python3.6/multiprocessing/context.py", line 242, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

I think the solution is to add some exception handling around the mp.set_start_method('spawn') call, but I'm not certain about that.
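A minimal sketch of that guard (an assumption on my part, not a confirmed fix):

import multiprocessing as mp

try:
    mp.set_start_method('spawn')  # for distributed pytorch to work
except RuntimeError:
    pass  # the start method was already set by the hosting environment (e.g. Colab)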

Thanks!

Undefined names

Undefined names have the potential to raise NameError at runtime.

flake8 testing of https://github.com/kengz/SLM-Lab on Python 3.6.3

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./slm_lab/agent/algorithm/base.py:73:16: F821 undefined name 'action'
        return action
               ^
./slm_lab/agent/algorithm/base.py:99:16: F821 undefined name 'batch'
        return batch
               ^
./slm_lab/agent/algorithm/policy_util.py:43:13: F821 undefined name 'new_prob'
            new_prob[torch.argmax(probs, dim=0)] = 1.0
            ^
./slm_lab/env/__init__.py:97:49: F821 undefined name 'nvec'
        setattr(gym_space, 'low', np.zeros_like(nvec))
                                                ^
./slm_lab/experiment/search.py:131:9: F821 undefined name 'config'
        config['trial_index'] = self.experiment.info_space.tick('trial')['trial']
        ^
./slm_lab/experiment/search.py:133:16: F821 undefined name 'config'
        return config
               ^
./slm_lab/experiment/search.py:146:16: F821 undefined name 'trial_data_dict'
        return trial_data_dict
               ^
./test/agent/net/test_nn.py:83:25: F821 undefined name 'net_util'
        before_params = net_util.copy_trainable_params(net)
                        ^
./test/agent/net/test_nn.py:88:24: F821 undefined name 'net_util'
        after_params = net_util.copy_trainable_params(net)
                       ^
./test/agent/net/test_nn.py:114:25: F821 undefined name 'net_util'
        before_params = net_util.copy_fixed_params(net)
                        ^
./test/agent/net/test_nn.py:118:24: F821 undefined name 'net_util'
        after_params = net_util.copy_fixed_params(net)
                       ^
11    F821 undefined name 'action'
11

[question] How to log TensorBoard while training

Hello. I want to write TensorBoard logs while in train mode, but for now the /log folder is always empty. How can I turn this on?

[UPDATE] My bad - I changed the output path, but the logs were still in the repo's data/ folder.

Atari memory preprocessing crops out part of game space

Problem with image preprocessing for Atari games where part of the game space at the top is being cropped out.

State output from OpenAI for certain games (e.g. breakout) has a banner of unused pixels on the bottom. When cropping to 84x84, part of the top state space is missing, and extra unused pixels on the bottom are preserved. This is especially problematic for pong, where the paddle itself could be occluded.

See raw game image (breakout): [image]

See preprocessed game state (breakout): [image]

GPU Usage

I'm running the dqn_BeamRider-v4 trial in the attached spec file in train mode. (It's basically identical to the dqn_boltzmann_breakout trial in the dqn.json spec file, but with the BeamRider env.)

I'm running 4 sessions at once on a 4-GPU machine. For the first ~5 episodes in each session all 4 GPUs are used correctly, one per session (GPU 0 has a couple of extra processes on it, which I'm assuming is normal). But gradually, by around ~100 episodes, all the GPU processes disappear, and nothing is running on the GPUs anymore. The training process never crashes or finishes during this - its CPU usage goes to 0 and it just sits there.

Any ideas what's going on?

openai_baseline_reproduction_dqn copy.txt

Training blocks env sampling

Hi~

I am focusing on sampling efficiency. If I train a DQN agent, the NN will update every training_frequency timesteps. But when the NN updates, it blocks the env sampling. How can I solve this problem?
@kengz

Why does dppo not support GPU

Describe the bug

Thanks for your excellent library. I think it is the best one in PyTorch up to now, and I think PPO should be the default algorithm to try. So I'm wondering why DPPO does not support GPU. I thought the distributed version, combined with GPU support, would be the best PPO implementation. Could you tell me the performance difference between DPPO and PPO (with GPU support)? I want to be sure which one I should use.

Thanks in advance!

To Reproduce
Run dppo_pong.json in gpu mode

Error logs

[2019-07-13 15:07:44,444 PID:8795 INFO run_lab_script.py read_spec_and_run] Running lab spec_file:slm_lab/spec/benchmark/dppo/dppo_pong.json spec_name:dppo_pong in mode:train
Traceback (most recent call last):
  File "/home/noone/Documents/New_torch/SLM-Lab/run_lab_script.py", line 87, in <module>
    main()
  File "/home/noone/Documents/New_torch/SLM-Lab/run_lab_script.py", line 73, in main
    read_spec_and_run(*args)
  File "/home/noone/Documents/New_torch/SLM-Lab/run_lab_script.py", line 51, in read_spec_and_run
    spec = spec_util.get(spec_file, spec_name)
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 160, in get
    check(spec)
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 98, in check
    raise e
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 95, in check
    check_compatibility(spec)
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 80, in check_compatibility
    assert ps.get(spec, 'agent.0.net.gpu') == False, f'Distributed mode "synced" works with CPU only. Set gpu: false.'
AssertionError: Distributed mode "synced" works with CPU only. Set gpu: false.
[2019-07-13 15:07:44,450 PID:8795 ERROR spec_util.py check] spec dppo_pong fails spec check
Traceback (most recent call last):
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 95, in check
    check_compatibility(spec)
  File "/home/noone/Documents/New_torch/SLM-Lab/slm_lab/spec/spec_util.py", line 80, in check_compatibility
    assert ps.get(spec, 'agent.0.net.gpu') == False, f'Distributed mode "synced" works with CPU only. Set gpu: false.'
AssertionError: Distributed mode "synced" works with CPU only. Set gpu: false.

every trial startup NUM_EVAL envs

If I have an empty _random_baseline.json, every trial will start up NUM_EVAL envs at the same time to compute the random baseline. The problem is that my CPU memory cannot handle so many envs, so it crashes.
i.e. NUM_EVAL = 20
max_trial = 4
it will start up 80 envs during eval!
Am I right?
How can I resolve this?
@kengz
@lgraesser

use 'shared' with GPU and mupltiprocessing

Hi~

After issue #421, I changed your code to do async sampling and training.
Now I have a subprocess (P1) created by the main process (P2).
P2 runs the env and samples data, then passes the data to P1 through a multiprocessing.Queue(); P1 puts the data into the replay buffer and trains. Since I am using "shared" mode, the global nets are optimized by the training process and are also used for sampling.
This async sampling and training works correctly on CPU; I have tested it.
But I still want to increase the training speed, so I want to move training to the GPU.
First I got a CUDA initialization error, so I refactored my code to use the 'spawn' start method.
After that I get this error:

`Traceback (most recent call last):
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/me/cyrl/SLM-Lab/slm_lab/experiment/control.py", line 26, in mp_run_session
    session = Session(spec, global_nets)
  File "/home/me/cyrl/SLM-Lab/slm_lab/experiment/control.py", line 46, in __init__
    self.agent, self.env = make_agent_env(self.spec, global_nets)
  File "/home/me/cyrl/SLM-Lab/slm_lab/experiment/control.py", line 20, in make_agent_env
    agent = Agent(spec, body=body, global_nets=global_nets)
  File "/home/me/cyrl/SLM-Lab/slm_lab/agent/__init__.py", line 71, in __init__
    p.start()
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/me/miniconda3/envs/lab/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 231, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: Attempted to send CUDA tensor received from another process; this is not currently supported. Consider cloning before sending.`

@kengz Could you please give me any help with this?
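
The usual way around this PyTorch limitation is to keep the tensors that travel through the multiprocessing.Queue on the CPU and move them to the GPU only inside the process that owns the model; a CUDA tensor that was itself received from another process cannot be re-shared without cloning it first, as the error message says. A minimal sketch with the spawn start method (names illustrative, not the lab's code):

import torch
import torch.multiprocessing as mp

def trainer(q):
    '''Training process: receive CPU tensors, move them to the GPU locally.'''
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    while True:
        item = q.get()
        if item is None:  # sentinel to stop
            break
        batch = item.to(device)  # the GPU transfer happens only in this process
        # ... replay buffer insert / optimizer step would go here ...

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    q = ctx.Queue()
    p = ctx.Process(target=trainer, args=(q,))
    p.start()
    for _ in range(10):
        sample = torch.randn(8, 4)     # the sampler keeps its data on the CPU
        q.put(sample.detach().cpu())   # never put CUDA tensors on the queue
    q.put(None)
    p.join()

If a CUDA tensor really did arrive from another process and has to be forwarded, cloning it (tensor.clone()) before q.put() avoids this particular error.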

Resolve Package Not Found during setup

Hi,
I ran the following command bin/setup and got the following error:

--- Installing brew ---
Brew is already installed
--- Installing brew system dependencies ---
--- Installing Atom and Hydrogen for interactive computing ---
Atom is already installed
Hydrogen is already installed
--- Installing NodeJS Lab interface ---
NodeJS is already installed
--- Installing npm modules for Lab interface ---
Npm modules are already installed
--- Installing Python for Lab backend ---
Python3 is already installed
--- Installing Conda ---
Conda is already installed
--- Installing Conda environment ---
conda env lab is already installed
--- Updating Conda environment ---
Fetching package metadata .............
Solving package specifications: 

ResolvePackageNotFound: 
  - pytorch 0.3.0*
  - mkl >=2018
  - torchvision 0.2.0*
  - pytorch >=0.3
  - mkl >=2018

At the moment I work around the issue by adding anaconda as a channel in environment.yml.
Not sure if this is the most elegant approach.

ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command

Describe the bug
After successfully installing SLM-Lab and proceeding to the "Quick Start" portion, which involves running DQN on the CartPole environment, everything works well (i.e. final_return_ma increases).

Command entered: python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

After several log summary and metric lines, an OpenGL error occurs:

[101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command

and then the process seems to end without showing any graphs.

To Reproduce

  1. OS and environment: Ubuntu 20.04 LTS

  2. SLM Lab git SHA (run git rev-parse HEAD to get it):dda02d00031553aeda4c49c5baa7d0706c53996b

  3. spec file used: slm_lab/spec/demo.json

Error logs

[2020-10-15 19:13:09,800 PID:100781 INFO __init__.py log_summary] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df] epi: 123  t: 120  wall_t: 153  opt_step: 398720  frame: 10000  fps: 65.3595  total_reward: 200  total_reward_ma: 142.7  loss: 5.46846  lr: 0.00774841  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: 0.230459
[2020-10-15 19:13:09,821 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
[100946:1015/191310.923076:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
[2020-10-15 19:13:12,794 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [eval_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
[2020-10-15 19:13:12,798 PID:100781 INFO logger.py info] Session 0 done
[101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
[2020-10-15 19:13:15,443 PID:100781 INFO logger.py info] Trial 0 done


how can i reasonably use my memory

Hi~
I see in your source code (search.py):
'''
num_cpus = min(util.NUM_CPUS, meta_spec['max_session'])
'''
If my spec hyperparameters are like this:
"num_envs": 20
"max_trial": 10
"max_session": 3

this will run 6 trials at the same time, and 4 trials are pending, right?
It seems CPU resources are fine, but there are 60 processes; my CPU memory has its limits, so an out-of-memory error and crash are possible.

Are there measures to prevent this, or should I watch the memory usage myself and manually control the number of processes in my spec?

Is it reasonable to only consider "max_session" but not "num_envs"?

@kengz

Problem Parsing Input Spec for Enjoy Mode Demo

I'm going through the demo on Ubuntu; things have worked so far, and I've reached the part about enjoy mode.
https://slm-lab.gitbook.io/slm-lab/using-slm-lab/resume-and-enjoy-reinforce-cartpole

This command is run, after having done a successful training run
python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json reinforce_cartpole enjoy@data/reinforce_cartpole_2020_11_30_185405/reinforce_cartpole_t0_s0_spec.json

But it crashes with an error about parsing the input files

[2020-11-30 21:04:00,306 PID:4659 INFO run_lab.py read_spec_and_run] Running lab spec_file:slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json spec_name:reinforce_cartpole in mode:enjoy@data/reinforce_cartpole_2020_11_30_185405/reinforce_cartpole_t0_s0_spec.json
Traceback (most recent call last):
  File "run_lab.py", line 80, in <module>
    main()
  File "run_lab.py", line 72, in main
    read_spec_and_run(*args)
  File "run_lab.py", line 53, in read_spec_and_run
    spec = spec_util.get_eval_spec(spec_file, prename)
  File "/home/philip/Documents/MachineLearning/SLM-Lab/slm_lab/spec/spec_util.py", line 166, in get_eval_spec
    predir, _, _, _, _, _ = util.prepath_split(spec_file)
  File "/home/philip/Documents/MachineLearning/SLM-Lab/slm_lab/lib/util.py", line 339, in prepath_split
    experiment_ts = RE_FILE_TS.findall(prefolder)[0]
IndexError: list index out of range

I could get the demo to run by modifying a few files. It may be a hack, though, because I don't know all the different inputs you may be expecting. Here's the change in run_lab.py; the eval-mode case has been modified:

def read_spec_and_run(spec_file, spec_name, lab_mode):
    '''Read a spec and run it in lab mode'''
    logger.info(f'Running lab spec_file:{spec_file} spec_name:{spec_name} in mode:{lab_mode}')
    if lab_mode in TRAIN_MODES:
        spec = spec_util.get(spec_file, spec_name)
    else:  # eval mode
        lab_mode, base_path = lab_mode.split('@')
        # ex prename = reinforce_cartpole_t0_s0_spec.json
        _, _, prename, _, _, _ = util.prepath_split(base_path)
        # remove the _spec.json from the name
        prename = re.search('([\w\a\d\_]+)_spec\.json', prename).group(1)
        spec = spec_util.get_eval_spec(base_path, prename)

Arch Install

Hi, I'm having trouble with the installation because of the Linux distro. Can you indicate the packages required for a correct installation, so I can run the "yarn install" command?

It looks like a great framework and I'd like to test it. Thanks and regards.

Crashing with seccomp-bpf failure in syscall 2030

Describe the bug
I'm going through the guides on the website, and running dqn_cartpole in both dev and train mode results in slow runs and high resource usage; it ends with several copies of this error message:

../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0230

To Reproduce

  1. OS and environment: Ubuntu 20.04.1
  2. SLM Lab git SHA (run git rev-parse HEAD to get it): faca82c
  3. spec file used: slm_lab/spec/demo.json

Additional context
Running on an AMD TR 3990X; all 128 CPU cores are running above 90% during the run (only checked for train, not dev). These are the training metrics logged during one of the runs:

[2021-07-24 10:34:11,503 PID:27435 INFO logger.py info] Running RL loop for trial 0 session 1
[2021-07-24 10:34:11,506 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 0  t: 0  wall_t: 0  opt_step: 0  frame: 0  fps: 0  total_reward: nan  total_reward_ma: nan  loss: nan  lr: 0.02  explore_var: 1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:36:42,739 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 28  t: 4  wall_t: 151  opt_step: 18720  frame: 500  fps: 3.31126  total_reward: 9  total_reward_ma: 9  loss: 0.0168248  lr: 0.02  explore_var: 0.55  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:39:24,086 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 54  t: 1  wall_t: 312  opt_step: 38720  frame: 1000  fps: 3.20513  total_reward: 10  total_reward_ma: 9.5  loss: 0.0740117  lr: 0.02  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:39:24,095 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 9.5  strength: -12.36  max_strength: -11.86  final_strength: -11.86  sample_efficiency: 0.00152023  training_efficiency: 4.01807e-05  stability: 1
[2021-07-24 10:42:06,057 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 74  t: 72  wall_t: 474  opt_step: 58720  frame: 1500  fps: 3.16456  total_reward: 21  total_reward_ma: 13.3333  loss: 0.299671  lr: 0.018  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:42:06,069 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 13.3333  strength: -8.52667  max_strength: -0.860001  final_strength: -0.860001  sample_efficiency: 0.00149153  training_efficiency: 3.94024e-05  stability: 1
[2021-07-24 10:44:47,405 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 84  t: 68  wall_t: 635  opt_step: 78720  frame: 2000  fps: 3.14961  total_reward: 69  total_reward_ma: 27.25  loss: 0.231153  lr: 0.018  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:44:47,413 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 27.25  strength: 5.39  max_strength: 47.14  final_strength: 47.14  sample_efficiency: -0.000676407  training_efficiency: -1.89741e-05  stability: 1
[2021-07-24 10:47:30,196 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 90  t: 35  wall_t: 798  opt_step: 98720  frame: 2500  fps: 3.13283  total_reward: 138  total_reward_ma: 49.4  loss: 0.102941  lr: 0.0162  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:47:30,216 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 49.4  strength: 27.54  max_strength: 116.14  final_strength: 116.14  sample_efficiency: 0.000231464  training_efficiency: 5.57282e-06  stability: 1
[2021-07-24 10:50:12,254 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 95  t: 69  wall_t: 960  opt_step: 118720  frame: 3000  fps: 3.125  total_reward: 175  total_reward_ma: 70.3333  loss: 0.563105  lr: 0.0162  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:50:12,266 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 70.3333  strength: 48.4733  max_strength: 153.14  final_strength: 153.14  sample_efficiency: 0.000285103  training_efficiency: 7.07366e-06  stability: 1
[2021-07-24 10:52:52,672 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 98  t: 21  wall_t: 1121  opt_step: 138720  frame: 3500  fps: 3.12221  total_reward: 178  total_reward_ma: 85.7143  loss: 0.468518  lr: 0.01458  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:52:52,680 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 85.7143  strength: 63.8543  max_strength: 156.14  final_strength: 156.14  sample_efficiency: 0.000285316  training_efficiency: 7.12085e-06  stability: 1
[2021-07-24 10:55:34,404 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 100  t: 121  wall_t: 1282  opt_step: 158720  frame: 4000  fps: 3.12012  total_reward: 200  total_reward_ma: 100  loss: 0.259016  lr: 0.01458  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:55:34,417 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 100  strength: 78.14  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.000275252  training_efficiency: 6.88705e-06  stability: 1
[2021-07-24 10:58:17,041 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 103  t: 188  wall_t: 1445  opt_step: 178720  frame: 4500  fps: 3.11419  total_reward: 154  total_reward_ma: 106  loss: 0.235752  lr: 0.013122  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 10:58:17,050 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 106  strength: 84.14  max_strength: 178.14  final_strength: 132.14  sample_efficiency: 0.000265999  training_efficiency: 6.66165e-06  stability: 0.926414
[2021-07-24 11:00:59,460 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 106  t: 141  wall_t: 1607  opt_step: 198720  frame: 5000  fps: 3.11139  total_reward: 178  total_reward_ma: 113.2  loss: 0.162558  lr: 0.013122  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:00:59,467 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 113.2  strength: 91.34  max_strength: 178.14  final_strength: 156.14  sample_efficiency: 0.000254717  training_efficiency: 6.38311e-06  stability: 0.939255
[2021-07-24 11:03:42,026 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 109  t: 117  wall_t: 1770  opt_step: 218720  frame: 5500  fps: 3.10734  total_reward: 179  total_reward_ma: 119.182  loss: 0.0619462  lr: 0.0118098  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:03:42,037 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 119.182  strength: 97.3218  max_strength: 178.14  final_strength: 157.14  sample_efficiency: 0.000244016  training_efficiency: 6.11727e-06  stability: 0.949639
[2021-07-24 11:06:24,799 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 112  t: 103  wall_t: 1933  opt_step: 238720  frame: 6000  fps: 3.10398  total_reward: 155  total_reward_ma: 122.167  loss: 2.09935  lr: 0.0118098  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:06:24,808 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 122.167  strength: 100.307  max_strength: 178.14  final_strength: 133.14  sample_efficiency: 0.000235461  training_efficiency: 5.90398e-06  stability: 0.934612
[2021-07-24 11:09:07,338 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 115  t: 133  wall_t: 2095  opt_step: 258720  frame: 6500  fps: 3.10263  total_reward: 169  total_reward_ma: 125.769  loss: 1.67609  lr: 0.0106288  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:09:07,346 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 125.769  strength: 103.909  max_strength: 178.14  final_strength: 147.14  sample_efficiency: 0.000226571  training_efficiency: 5.6819e-06  stability: 0.941845
[2021-07-24 11:11:50,149 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 118  t: 146  wall_t: 2258  opt_step: 278720  frame: 7000  fps: 3.10009  total_reward: 153  total_reward_ma: 127.714  loss: 2.25914  lr: 0.0106288  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:11:50,158 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 127.714  strength: 105.854  max_strength: 178.14  final_strength: 131.14  sample_efficiency: 0.000219163  training_efficiency: 5.4966e-06  stability: 0.936335
[2021-07-24 11:14:32,865 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 121  t: 91  wall_t: 2421  opt_step: 298720  frame: 7500  fps: 3.09789  total_reward: 177  total_reward_ma: 131  loss: 1.16432  lr: 0.00956594  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:14:32,873 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 131  strength: 109.14  max_strength: 178.14  final_strength: 155.14  sample_efficiency: 0.000211029  training_efficiency: 5.29295e-06  stability: 0.941969
[2021-07-24 11:17:15,534 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 124  t: 69  wall_t: 2584  opt_step: 318720  frame: 8000  fps: 3.09598  total_reward: 200  total_reward_ma: 135.312  loss: 3.20369  lr: 0.00956594  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:17:15,546 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 135.312  strength: 113.452  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.000202587  training_efficiency: 5.08143e-06  stability: 0.947468
[2021-07-24 11:19:58,220 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 127  t: 106  wall_t: 2746  opt_step: 338720  frame: 8500  fps: 3.09541  total_reward: 152  total_reward_ma: 136.294  loss: 1.04083  lr: 0.00860934  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:19:58,234 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 136.294  strength: 114.434  max_strength: 178.14  final_strength: 130.14  sample_efficiency: 0.000196904  training_efficiency: 4.939e-06  stability: 0.926181
[2021-07-24 11:22:41,174 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 130  t: 35  wall_t: 2909  opt_step: 358720  frame: 9000  fps: 3.09385  total_reward: 183  total_reward_ma: 138.889  loss: 2.89101  lr: 0.00860934  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:22:41,183 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 138.889  strength: 117.029  max_strength: 178.14  final_strength: 161.14  sample_efficiency: 0.000190341  training_efficiency: 4.77443e-06  stability: 0.931119
[2021-07-24 11:25:24,265 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 133  t: 1  wall_t: 3072  opt_step: 378720  frame: 9500  fps: 3.09245  total_reward: 180  total_reward_ma: 141.053  loss: 5.7599  lr: 0.00774841  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:25:24,274 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 141.053  strength: 119.193  max_strength: 178.14  final_strength: 158.14  sample_efficiency: 0.000184401  training_efficiency: 4.62542e-06  stability: 0.934964
[2021-07-24 11:27:56,455 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 135  t: 156  wall_t: 3224  opt_step: 398720  frame: 10000  fps: 3.10174  total_reward: 174  total_reward_ma: 142.7  loss: 0.364511  lr: 0.00774841  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: nan
[2021-07-24 11:27:56,466 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 152.14  sample_efficiency: 0.000179087  training_efficiency: 4.49212e-06  stability: 0.936856
[2021-07-24 11:27:59,648 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [eval_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 152.14  sample_efficiency: 0.000179087  training_efficiency: 4.49212e-06  stability: 0.936856
[2021-07-24 11:27:59,649 PID:27435 INFO logger.py info] Session 1 done

This is nearly one hour of running on 128 cores (apparently all are used) and then ultimately failing to achieve the pass score of 195. Could the slowness be explained by using all the CPUs and spending a lot of time on syncing?

Error logs

../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0230

missing module cv2

/SLM-Lab/slm_lab/lib/util.py", line 5, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'

To Reproduce

  1. OS used: Ubuntu 18 LTS
  2. SLM-Lab git: git cloned
  3. demo.json not working

Additional context
I had to add cmake and libgcc manually.

Error logs
(base) l*@l*-HP-Pavilion-dv7-PC:~/SLM-Lab$ python3 run_lab.py slm_lab/spec/demo.json dqn_cartpole dev
Traceback (most recent call last):
  File "run_lab.py", line 10, in <module>
    from slm_lab.experiment import analysis, retro_analysis
  File "/home/l*/SLM-Lab/slm_lab/experiment/analysis.py", line 5, in <module>
    from slm_lab.agent import AGENT_DATA_NAMES
  File "/home/l*/SLM-Lab/slm_lab/agent/__init__.py", line 21, in <module>
    from slm_lab.agent import algorithm, memory
  File "/home/l*/SLM-Lab/slm_lab/agent/algorithm/__init__.py", line 8, in <module>
    from .actor_critic import *
  File "/home/l*/SLM-Lab/slm_lab/agent/algorithm/actor_critic.py", line 1, in <module>
    from slm_lab.agent import net
  File "/home/l*/SLM-Lab/slm_lab/agent/net/__init__.py", line 6, in <module>
    from slm_lab.agent.net.conv import *
  File "/home/l*/SLM-Lab/slm_lab/agent/net/conv.py", line 1, in <module>
    from slm_lab.agent.net import net_util
  File "/home/l*/SLM-Lab/slm_lab/agent/net/net_util.py", line 3, in <module>
    from slm_lab.lib import logger, util
  File "/home/l*/SLM-Lab/slm_lab/lib/logger.py", line 1, in <module>
    from slm_lab.lib import util
  File "/home/l*/SLM-Lab/slm_lab/lib/util.py", line 5, in <module>
    import cv2
ModuleNotFoundError: No module named 'cv2'

Breakout DDQN+PER benchmark surprisingly low?

Describe the bug
The Breakout DDQN+PER benchmark is surprisingly low, with a maximum score under 150 and a final score under 80. The original paper shows final performance between 320 and 400 for this environment (although it was evaluated on a single seed).

To Reproduce

  1. Open PER paper: https://arxiv.org/abs/1511.05952, figure 7. See also table 6: Baseline DQN is 1149%, DDQN-PER are 1298% (rank-based) and 1407% (proportional)
  2. Observe Atari Breakout benchmark results for DDQN+PER: https://github.com/kengz/SLM-Lab/blob/master/BENCHMARK.md#atari-benchmark
  3. Performance is expected to be similar, but SLM-Lab performance is much lower.

Additional context
Note that Breakout really shouldn't be problematic to solve. I am a bit worried to see the shape of the training graph: https://user-images.githubusercontent.com/8209263/62100441-9ba13900-b246-11e9-9373-95c6063915ab.png - I am not yet familiar with this codebase but I would suspect a bug.

question about train and development mode

What on earth is the difference between train mode and dev mode?
According to your API, dev mode is just train mode with shorter episodes.
But based on what I have learned, dev mode doesn't involve updating the model parameters.
So I am quite confused about this and hope someone can help me with it.

normalize_state broke a few things

On a fresh install, with the config/experiment.json file reading
{ "a3c.json": { "a3c_conv_shared_breakout": "dev" } }
running python run_lab.py crashes with

...
[2018-09-05 07:23:19,207 INFO logger.py info] Initialized session 0
[2018-09-05 07:23:19,208 INFO logger.py info] Initialized DistSession 0
Process w0:
Traceback (most recent call last):
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 149, in run
    return self.session.run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 76, in run
    self.run_episode()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 59, in run_episode
    action = self.agent.act(state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/__init__.py", line 66, in act
    action = self.algorithm.act(state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/algorithm/reinforce.py", line 108, in act
    if self.normalize_state:
AttributeError: 'ActorCritic' object has no attribute 'normalize_state'
[2018-09-05 07:23:20,034 INFO analysis.py analyze_trial] Analyzing trial
Traceback (most recent call last):
  File "run_lab.py", line 73, in <module>
    main()
  File "run_lab.py", line 69, in main
    run_by_mode(spec_file, spec_name, lab_mode)
  File "run_lab.py", line 60, in run_by_mode
    Trial(spec, info_space).run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 238, in run
    self.data = analysis.analyze_trial(self)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/analysis.py", line 483, in analyze_trial
    trial_fitness_df = calc_trial_fitness_df(trial)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/analysis.py", line 273, in calc_trial_fitness_df
    all_session_fitness_df = pd.concat(list(trial.session_data_dict.values()))
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

Presumably this is because the spec is missing the normalize_state key. However, if you add the line "normalize_state": true to the a3c_conv_shared_breakout algorithm spec, running python run_lab.py crashes with

...
[2018-09-05 07:25:38,543 INFO logger.py info] Initialized session 0
[2018-09-05 07:25:38,543 INFO logger.py info] Initialized DistSession 0
Process w0:
Traceback (most recent call last):
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 149, in run
    return self.session.run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 76, in run
    self.run_episode()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 59, in run_episode
    action = self.agent.act(state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/__init__.py", line 66, in act
    action = self.algorithm.act(state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/algorithm/reinforce.py", line 109, in act
    state = policy_util.update_online_stats_and_normalize_state(body, state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/algorithm/policy_util.py", line 481, in update_online_stats_and_normalize_state
    update_online_stats(body, state)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/agent/algorithm/policy_util.py", line 416, in update_online_stats
    assert state.size == body.state_dim
AssertionError
[2018-09-05 07:25:39,357 INFO analysis.py analyze_trial] Analyzing trial
Traceback (most recent call last):
  File "run_lab.py", line 73, in <module>
    main()
  File "run_lab.py", line 69, in main
    run_by_mode(spec_file, spec_name, lab_mode)
  File "run_lab.py", line 60, in run_by_mode
    Trial(spec, info_space).run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 238, in run
    self.data = analysis.analyze_trial(self)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/analysis.py", line 483, in analyze_trial
    trial_fitness_df = calc_trial_fitness_df(trial)
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/analysis.py", line 273, in calc_trial_fitness_df
    all_session_fitness_df = pd.concat(list(trial.session_data_dict.values()))
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

Running with "normalize_state": false doesn't crash, though it hangs without doing anything. This may be a separate issue.

Why is kengz's name appearing side by side with the Indian YouTube scammer Siraj Raval in the SLM Lab paper?

So I stumbled upon this paper titled SLM Lab https://drive.google.com/file/d/0BwUv84lNDk72Q1gzaXgwR2U3U2NWVlZSOFk4amZIRmV1QXI0/view

Haven't read it.

But I saw kengz's name right next to Siraj Raval, the notorious Indian YouTube scammer whom I really dislike.

Why is kengz's name appearing side by side with the Indian YouTube scammer? Is this a bug?

The link to the paper is found in https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A/about

how can i run my own environment with "search" mode

When I run my own environment with "search" mode, it reports an error that my environment has not been registered (my environment is named war3). I think this is because search mode starts up ray and skips importing war3, but I don't know where I should do `import war3`. If I do it in run_lab.py I still get the error, but if I do it in openai.py the error is resolved.
In slm_lab/env/openai.py it looks like this:

from slm_lab.env.base import BaseEnv
from slm_lab.env.wrapper import make_gym_env
from slm_lab.env.vec_env import make_gym_venv
from slm_lab.env.registration import try_register_env
from slm_lab.lib import logger, util
from slm_lab.lib.decorator import lab_api
import gym
import numpy as np
import pydash as ps
import roboschool
import war3

logger = logger.get_logger(__name__)
Can I import it in my own Python code or in run_lab.py, rather than in your source code?
@kengz
@lgraesser
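
Importing in openai.py likely works because that module is imported inside every ray worker process, while run_lab.py only runs in the driver, so the workers never see the registration. One way to keep the change out of SLM-Lab's source is to do the gym registration at import time of your own package, so that any process which imports it gets the env. A minimal sketch with plain gym, where the War3 names only stand in for your environment:

import gym
from gym.envs.registration import register

class War3Env(gym.Env):  # stand-in for your real environment class
    observation_space = gym.spaces.Discrete(2)
    action_space = gym.spaces.Discrete(2)

    def reset(self):
        return 0

    def step(self, action):
        return 0, 0.0, True, {}

# put the register() call in your package's __init__ (e.g. war3/__init__.py)
# so that any process importing the package can then gym.make the env
register(id='War3-v0', entry_point=War3Env)

env = gym.make('War3-v0')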

how can i effectively use my CPU resources?

Hi~

If I have 20 CPUs, 10 envs, and 5 sessions, it will start up 50 processes on 20 CPUs. I don't think that is reasonable, am I right? How can I fully and reasonably use my CPU resources?
@kengz

AttributeError: 'list' object has no attribute 'split'

I'm getting an error when running the following:

python run_lab.py slm_lab/spec/benchmark/reinforce/reinforce_cartpole.json

The error reads:

Traceback (most recent call last):
  File "run_lab.py", line 81, in <module>
    main()
  File "run_lab.py", line 70, in main
    read_spec_and_run(spec_file, spec_name, lab_mode)
  File "run_lab.py", line 52, in read_spec_and_run
    lab_mode, prename = lab_mode.split('@')
AttributeError: 'list' object has no attribute 'split'

Here is what I have in my lab environment:

# Name                    Version                   Build  Channel
atari-py                  0.2.6                    pypi_0    pypi
atomicwrites              1.3.0                      py_0    conda-forge
attrs                     19.3.0                     py_0    conda-forge
autopep8                  1.4.4                      py_0    conda-forge
box2d-py                  2.3.8                    pypi_0    pypi
bzip2                     1.0.8                h0b31af3_2    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
certifi                   2019.11.28               py37_0    conda-forge
cffi                      1.13.2           py37h33e799b_0    conda-forge
chardet                   3.0.4                    pypi_0    pypi
click                     7.0                      pypi_0    pypi
cloudpickle               0.5.2                    pypi_0    pypi
colorama                  0.4.3                    pypi_0    pypi
colorlog                  4.0.2                 py37_1000    conda-forge
colorlover                0.3.0                    pypi_0    pypi
coverage                  4.5.3            py37h1de35cc_0    conda-forge
decorator                 4.4.1                    pypi_0    pypi
et_xmlfile                1.0.1                   py_1001    conda-forge
filelock                  3.0.12                   pypi_0    pypi
flaky                     3.5.3                      py_0    conda-forge
flatbuffers               1.11                     pypi_0    pypi
freetype                  2.10.0               h24853df_1    conda-forge
funcsigs                  1.0.2                    pypi_0    pypi
future                    0.18.2                   pypi_0    pypi
gym                       0.12.1                   pypi_0    pypi
idna                      2.8                      pypi_0    pypi
importlib_metadata        1.4.0                    py37_0    conda-forge
intel-openmp              2019.4                      233  
ipython-genutils          0.2.0                    pypi_0    pypi
jdcal                     1.4.1                      py_0    conda-forge
jpeg                      9c                h1de35cc_1001    conda-forge
jsonschema                3.2.0                    pypi_0    pypi
jupyter-core              4.6.1                    pypi_0    pypi
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libcxx                    9.0.1                         1    conda-forge
libffi                    3.2.1             h6de7cb9_1006    conda-forge
libgcc                    4.8.5               hdbeacc1_10    conda-forge
libgfortran               3.0.1                         0    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                hd44dcd8_1    conda-forge
libpng                    1.6.37               h2573ce8_0    conda-forge
libprotobuf               3.11.2               hd174df1_0    conda-forge
libtiff                   4.1.0                ha78913b_3    conda-forge
lz4-c                     1.8.3             h6de7cb9_1001    conda-forge
markdown                  3.1.1                      py_0    conda-forge
mkl                       2019.4                      233  
more-itertools            8.1.0                      py_0    conda-forge
nbformat                  5.0.3                    pypi_0    pypi
ncurses                   6.1               h0a44026_1002    conda-forge
ninja                     1.9.0                ha1b3eb9_1    conda-forge
numpy                     1.16.3           py37hdf140aa_0    conda-forge
olefile                   0.46                       py_0    conda-forge
opencv-python             4.1.0.25                 pypi_0    pypi
openpyxl                  2.6.1                      py_0    conda-forge
openssl                   1.1.1d               h0b31af3_0    conda-forge
pandas                    0.24.2           py37h4f17bb1_1    conda-forge
pillow                    6.2.0            py37hb6f49c9_0    conda-forge
pip                       19.1.1                   py37_0    conda-forge
plotly                    3.9.0                    pypi_0    pypi
plotly-orca               1.2.1                         1    plotly
pluggy                    0.13.0                   py37_0    conda-forge
protobuf                  3.11.2           py37h4a8c4bd_0    conda-forge
psutil                    5.6.2            py37h01d97ff_0    conda-forge
py                        1.8.1                      py_0    conda-forge
pycodestyle               2.5.0                      py_0    conda-forge
pycparser                 2.19                     py37_1    conda-forge
pydash                    4.2.1                      py_0    conda-forge
pyglet                    1.4.9                    pypi_0    pypi
pyopengl                  3.1.0                    pypi_0    pypi
pyrsistent                0.15.7                   pypi_0    pypi
pytest                    4.5.0                    py37_0    conda-forge
pytest-cov                2.7.1                      py_0    conda-forge
pytest-timeout            1.3.3                      py_0    conda-forge
python                    3.7.3                h5c2c468_2    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
pytorch                   1.1.0                   py3.7_0    pytorch
pytz                      2019.3                     py_0    conda-forge
pyyaml                    5.1.2            py37h0b31af3_1    conda-forge
ray                       0.7.0                    pypi_0    pypi
readline                  8.0                  hcfe32e1_0    conda-forge
redis                     2.10.6                   pypi_0    pypi
regex                     2019.05.25       py37h01d97ff_0    conda-forge
requests                  2.22.0                   pypi_0    pypi
retrying                  1.3.3                    pypi_0    pypi
roboschool                1.0.46                   pypi_0    pypi
scipy                     1.3.0            py37hab3da7d_1    conda-forge
setuptools                45.0.0                   py37_1    conda-forge
six                       1.13.0                   py37_0    conda-forge
sqlite                    3.30.1               h93121df_0    conda-forge
tensorboard               1.14.0                   py37_0    conda-forge
tk                        8.6.10               hbbe82c9_0    conda-forge
traitlets                 4.3.3                    pypi_0    pypi
typing                    3.7.4.1                  pypi_0    pypi
ujson                     1.35            py37h0b31af3_1001    conda-forge
urllib3                   1.25.7                   pypi_0    pypi
wcwidth                   0.1.8                      py_0    conda-forge
werkzeug                  0.16.0                     py_0    conda-forge
wheel                     0.33.6                   py37_0    conda-forge
xlrd                      1.2.0                      py_0    conda-forge
xvfbwrapper               0.2.9                    pypi_0    pypi
xz                        5.2.4             h1de35cc_1001    conda-forge
yaml                      0.2.2                h0b31af3_1    conda-forge
zipp                      0.6.0                      py_0    conda-forge
zlib                      1.2.11            h0b31af3_1006    conda-forge
zstd                      1.4.4                he7fca8b_1    conda-forge

I can see where the error is, but I'm not sure how to fix it. The code assumes the lab_mode can be split by @, but lab_mode is a list with the full values:

[{'name': 'Reinforce', 'algorithm': {'name': 'Reinforce', 'action_pdtype': 'default', 'action_policy': 'default', 'center_return': True, 'explore_var_spec': None, 'gamma': 0.99, 'entropy_coef_spec': {'name': 'linear_decay', 'start_val': 0.01, 'end_val': 0.001, 'start_step': 0, 'end_step': 20000}, 'training_frequency': 1}, 'memory': {'name': 'OnPolicyReplay'}, 'net': {'type': 'MLPNet', 'hid_layers': [64], 'hid_layers_activation': 'selu', 'clip_grad_val': None, 'loss_spec': {'name': 'MSELoss'}, 'optim_spec': {'name': 'Adam', 'lr': 0.002}, 'lr_scheduler_spec': None}}]

Thanks!

docker gotchas

Hi. I tried running this through Docker, and ran into a few gotchas following the gitbook instructions:

  • the files in bin somehow gave me permission errors, despite being root; pasting their contents manually worked around this.
  • the setup script used sudo a lot, but the Docker container did not recognize it; removing those calls helped (fwiw, installing sudo helped as well).
  • source activate lab errored stating source was not recognized. I then tried:
# conda config --add channels anaconda
# conda activate lab
# conda env update
(lab) # python3 --version
Python 3.6.4
(lab) # yarn start
$ python3 run_lab.py
Traceback (most recent call last):
  File "run_lab.py", line 6, in <module>
    from slm_lab.experiment.control import Session, Trial, Experiment
  File "/opt/SLM-Lab/slm_lab/__init__.py", line 12
    with open(os.path.join(ROOT_DIR, 'config', f'{config_name}.json')) as f:
                                                                   ^
SyntaxError: invalid syntax
error Command failed with exit code 1

Trying this line in this python3 seemed not to yield syntax errors though, so f-strings do seem supported. Weird.

I haven't fully gotten this to work, but hopefully some of this may be useful for the tutorial.
I tried looking for the gitbook source in case I could add to the installation instructions based on this, but couldn't find it.

About target entropy in SAC

Hi~ keng
I have some questions about SAC-discrete.
I found this implementation: https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch, which does not use Gumbel-softmax; its target entropy is set to a positive value, -np.log(1.0/action_space.size()) * 0.98, and log_alpha grows above 1.0 over the update steps. But the continuous SAC in that repo uses a negative value, -np.prod(action_space.size()).
But in your code, you use Gumbel-softmax and set the target entropy for both discrete and continuous actions to a negative value, -np.prod(action_space.size()), so log_alpha decreases over the update steps.
I really want to know how I should set the target entropy. Why is the target entropy in @p-christ's code different from yours?
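
For concreteness, a small numpy sketch of the two conventions being compared; it only restates the formulas above and does not settle which one is right:

import numpy as np

# discrete action space with n actions (e.g. n = 6)
n_actions = 6
# p-christ's SAC-Discrete: a positive target, 98% of the entropy of a uniform policy
target_entropy_discrete = -np.log(1.0 / n_actions) * 0.98   # = 0.98 * log(n) > 0

# continuous SAC convention (used for both cases here, per this issue):
# minus the action dimensionality
action_shape = (3,)  # e.g. a 3-dimensional continuous action
target_entropy_continuous = -np.prod(action_shape)           # = -3.0 < 0

print(target_entropy_discrete, target_entropy_continuous)

Since a discrete policy's entropy is never negative, a negative target leaves the entropy permanently above the target, so the temperature update keeps pushing log_alpha down, which matches the behaviour described above; the positive 0.98*log(n) target instead pulls the policy toward near-uniform entropy.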

https://stackoverflow.com/questions/56226133/soft-actor-critic-with-discrete-action-space

@kengz

Running Demo Dependency Issue

I'm enjoying the book very much, thank you. I've got a VirtualBox VM with Ubuntu 20.04.1 LTS to try out SLM Lab, and was able to get the code and run ./bin/setup. The next step is to run your demo, which produces the following error. Do you have an idea about how to resolve this? I couldn't figure it out. I see that there is a Linux regular-expression library that is probably missing, but I'm not sure of the best way to fix it. I've attached the shell text that shows the setup's output as well.

https://www.pcre.org/original/doc/html/pcre16.html

Error logs

(base) philip@philip-VirtualBox:~/Documents/MachineLearning/SLM-Lab$ conda activate lab
(lab) philip@philip-VirtualBox:~/Documents/MachineLearning/SLM-Lab$ python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev
Traceback (most recent call last):
  File "run_lab.py", line 5, in <module>
    from slm_lab.experiment.control import Session, Trial, Experiment
  File "/home/philip/Documents/MachineLearning/SLM-Lab/slm_lab/experiment/control.py", line 7, in <module>
    from slm_lab.experiment import analysis, search
  File "/home/philip/Documents/MachineLearning/SLM-Lab/slm_lab/experiment/analysis.py", line 2, in <module>
    from slm_lab.spec import random_baseline
  File "/home/philip/Documents/MachineLearning/SLM-Lab/slm_lab/spec/random_baseline.py", line 7, in <module>
    import roboschool
  File "/home/philip/miniconda3/envs/lab/lib/python3.7/site-packages/roboschool/__init__.py", line 112, in <module>
    from roboschool.gym_pendulums import RoboschoolInvertedPendulum
  File "/home/philip/miniconda3/envs/lab/lib/python3.7/site-packages/roboschool/gym_pendulums.py", line 1, in <module>
    from roboschool.scene_abstract import SingleRobotEmptyScene
  File "/home/philip/miniconda3/envs/lab/lib/python3.7/site-packages/roboschool/scene_abstract.py", line 12, in <module>
    from roboschool  import cpp_household   as cpp_household
ImportError: libpcre16.so.3: cannot open shared object file: No such file or directory

slm_libprec_issue.txt

Thanks
Philip

Video recording

Hello 👋

I searched the repo a bit but I’m fairly new to it. I’m running on a headless server and I’m trying to understand if SLM-Lab has the capability for video recording of an episode. I saw that it installs the ffmpeg module in ubuntu_setup.sh, but don’t see if/where ffmpeg is being used. Does it have this capability? Can you point me to the code? If not, I can try to add it if you’d be interested.
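
I am not sure whether SLM-Lab wires this up itself, but with the gym version listed in the environment dump above (0.12.x) an episode can be recorded at the env level with the Monitor wrapper, which is where ffmpeg typically comes in. A minimal sketch:

import gym
from gym.wrappers import Monitor

# wraps the env so episode videos (mp4 via ffmpeg) and stats are written
# to the given directory; 'videos/' is just an example path
env = Monitor(gym.make('CartPole-v0'), 'videos/', force=True)

state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())
env.close()

On a headless server the env still needs a display to render, e.g. via xvfbwrapper (already in the dependency list above) or xvfb-run.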

Error at end the execution

Hi,
I get stuck at the end of the trial: when it finishes, it can't create the corresponding graphs, and I get the following traceback. What could it be?

Traceback (most recent call last):
  File "run_lab.py", line 63, in <module>
    main()
  File "run_lab.py", line 59, in main
    run_by_mode(spec_file, spec_name, lab_mode)
  File "run_lab.py", line 38, in run_by_mode
    Trial(spec).run()
  File "/home/kelo/librerias/SLM-Lab/slm_lab/experiment/control.py", line 122, in run
    session_datas = util.parallelize_fn(self.init_session_and_run, info_spaces, num_cpus)
  File "/home/kelo/librerias/SLM-Lab/slm_lab/lib/util.py", line 533, in parallelize_fn
    results = pool.map(fn, args)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: Invalid property specified for object of type plotly.graph_objs.Layout: 'yaxis2'

Something funny in `util.parallelize_fn`

After a fresh install on MacOS, running the demo works fine in "dev" mode, but in "train" mode there's an immediate crash with:

Traceback (most recent call last):
  File "run_lab.py", line 75, in <module>
    main()
  File "run_lab.py", line 70, in main
    run_by_mode(spec_file, spec_name, lab_mode)
  File "run_lab.py", line 48, in run_by_mode
    Trial(spec, info_space).run()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 240, in run
    session_datas = self.run_sessions()
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/experiment/control.py", line 190, in run_sessions
    session_datas = util.parallelize_fn(self.init_session_and_run, info_spaces, ps.get(self.spec['meta'], 'resources.num_cpus', util.NUM_CPUS))
  File "/Users/mwcvitkovic/Projects/kengz-SLM-Lab/slm_lab/lib/util.py", line 397, in parallelize_fn
    pool = mp.Pool(num_cpus, initializer=pool_init, maxtasksperchild=1)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
    self._repopulate_pool()
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/popen_fork.py", line 26, in __init__
    self._launch(process_obj)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/mwcvitkovic/miniconda3/envs/lab/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'parallelize_fn.<locals>.pool_init'
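
The failing pickle is the standard 'spawn' start-method limitation: only module-level functions can be pickled by reference, so a pool_init defined inside parallelize_fn cannot be shipped to the worker processes. A minimal reproduction-and-fix sketch, not the lab's actual code:

import multiprocessing as mp

def pool_init():
    # module-level initializer: picklable by reference, so spawn workers accept it
    pass

def square(x):
    return x * x

def run():
    ctx = mp.get_context('spawn')
    # defining pool_init *inside* this function would reproduce
    # "Can't pickle local object ... pool_init" under spawn
    with ctx.Pool(2, initializer=pool_init, maxtasksperchild=1) as pool:
        print(pool.map(square, range(4)))

if __name__ == '__main__':
    run()

Moving the initializer (and anything else handed to the pool) to module level, or using the fork start method where it is available, are the usual fixes.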
