openai / mlsh
Code for the paper "Meta-Learning Shared Hierarchies"
Home Page: https://arxiv.org/abs/1710.09767
I used the following command:
python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 50 --continue_iter 00615 --replay True AntAgent
One thing I noticed is that you need to write a leading "0" in front of the iteration number.
Another thing I noticed is that I needed to copy the files from "savedir" into the "AntAgent" folder to make it work.
I guess the checkpoint algorithm uses the wrong directory for storing checkpoints.
The first output also shows "It is Iteration 0 so i'm changing [...]".
However, I wanted to continue the learning process, not start from the beginning.
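As a temporary workaround, the checkpoint files can be mirrored from "savedir" into the agent folder before resuming. This is only a sketch under my assumptions about the file layout: `copy_checkpoints` is a hypothetical helper, and the actual checkpoint filenames in the repo may differ.

```python
import pathlib
import shutil

def copy_checkpoints(savedir, agent_dir, iteration="00615"):
    """Copy checkpoint files matching `iteration` from savedir into agent_dir.

    The iteration string is zero-padded, matching the zero-padded
    value passed to --continue_iter.
    """
    src = pathlib.Path(savedir)
    dst = pathlib.Path(agent_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.glob(f"*{iteration}*"):
        shutil.copy2(f, dst / f.name)
        copied.append(f.name)
    return copied
```

Running this once before the `--replay True` command should leave the files where the restore code looks for them, without having to copy them by hand.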
Hi! It would be great if you can give a more detailed readme for a Windows system. Thanks very much.
Hi,
I'm running the code with mpirun inside a Docker container. It worked at first, but recently I started getting the error message:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 135
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
As far as I'm aware, I didn't change anything. Does anyone know where this might be coming from?
Thanks a lot!
Best, Max
See:
https://github.com/openai/mlsh/blob/master/mlsh_code/rollouts.py#L83
Are you sure the "and" condition there is correct?
It means the environment does not get reset even though you have already reached the goal.
(It will take another x steps until macrolen is reached, even though it is clear it won't get any more reward.)
It also means the environment will never reset at all if you are at a stage where the goal can no longer be reached.
An "or" would make more sense in my opinion.
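To illustrate the difference, here is a hypothetical `should_reset` helper (not the actual rollouts.py code) contrasting the two conditions, assuming `done` is the environment's episode-end flag and `t` the step counter:

```python
def should_reset(done, t, macrolen):
    """Compare the two candidate reset conditions."""
    # With `and`, a finished episode keeps stepping until t happens to
    # land on a macro-step boundary:
    and_version = done and (t % macrolen == 0)
    # With `or`, the reset fires as soon as either condition holds:
    or_version = done or (t % macrolen == 0)
    return and_version, or_version

# Goal reached at t=3 with macrolen=10: the `and` version does not
# reset yet, the `or` version does.
print(should_reset(True, 3, 10))
```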
When running the FourRooms environment, there is a bug in policy_network.py that makes the code crash. The observation returned by the FourRooms environment is an integer, whereas for MovementBandits it is a vector containing the current location and the two goal locations. On line 49 of policy_network.py, ob[None] crashes if ob is an integer.
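One possible fix, sketched here with a hypothetical `to_obs_vector` helper, is to normalize every observation to a 1-D NumPy array before the batch axis is added:

```python
import numpy as np

def to_obs_vector(ob):
    """Normalize an observation to a 1-D array.

    FourRooms returns a plain int, MovementBandits a float vector;
    after this conversion, ob[None] (adding a batch axis) works for both.
    """
    ob = np.asarray(ob)
    if ob.ndim == 0:
        ob = ob[np.newaxis]
    return ob

# int observation (FourRooms-style) -> batch shape (1, 1)
# vector observation (MovementBandits-style) -> batch shape (1, 3)
print(to_obs_vector(7)[None].shape, to_obs_vector([0.1, 0.2, 0.3])[None].shape)
```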
Hi, great work on the paper and code! I am working on a project that builds on top of MLSH. We implemented our own GPU-optimized version of the algorithm based on your MPI-based code. We observe, both in our setting and with your code for MovementBandits, that both sub-policies end up learning the same strategy of moving to just one of the bandits (the same one consistently) in a single run.
Here are the parameters from my run
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits
Additionally, I tried optimizer step sizes of both 3e-4 (as mentioned in the paper) and 3e-5 (the default argument in the code), and changed the seed value of 1401 that is also embedded in your main file.
I modified master.py to log some additional information, such as the current iteration number and the real goal chosen by randomizeCorrect (our fork). Here is a snippet from one of the runs:
Mini ep 10, goal 1, iteration 30: global: 18.60333333333333, local: 42.5
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 2.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 2.725
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 43.075
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 1.05
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 3.125
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 44.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.275
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 43.65
Mini ep 7, goal 0, iteration 33: global: 18.60333333333333, local: 5.575
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.975
Mini ep 8, goal 0, iteration 32: global: 18.60333333333333, local: 4.0
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 0.475
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 41.4
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 0.975
We observe similar behavior with our own implementation (confirmed by visualizing the sub-policies with render: both cause the agent to move to the same disc throughout the entire training run).
We are attempting to reproduce the results from the paper (Figure 4, page 6), where the agent learns to get rewards around 40 after a few gradient updates. Please let us know whether we are running the right hyperparameter configuration, and what seeds to use with the original codebase to observe such behavior; this would greatly help our research! Thanks.
Could you please point out the library for rl_algs?
Hi, I read the paper, and in the experiment section, apart from the first simple examples where it is trivial to determine the number of sub-policies, I didn't see any detail from Section 6.4 (ant robot, etc.) onward about how this number is set.
In my opinion this number should be quite important for the algorithm to perform well; for example, you won't get good results by setting num-policy=3 in FourRooms.
Could you please explain how this number should be chosen? Thank you.
Just want to check that we are running the code correctly. If I want to run the code with 120 parallel instances as the paper suggests, do I just use the mpiexec command as in the following, or is there another way of doing this?
mpiexec -n 120 python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Hi, I have a similar error with rl_algs:
mlsh/mlsh_code$ python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Traceback (most recent call last):
File "main.py", line 20, in <module>
from rl_algs.common import set_global_seeds, tf_util as U
ModuleNotFoundError: No module named 'rl_algs'
Originally posted by @ViktorM in #1 (comment)
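rl_algs was an internal OpenAI package; a commonly suggested workaround is to alias an open-source package that exposes a compatible API (e.g. baselines) under that name before importing main.py. Below is a minimal sketch of the aliasing mechanism only; `alias_package` is a hypothetical helper, and the demonstration uses a stdlib package since baselines may not be installed (whether baselines actually matches the rl_algs API is an assumption).

```python
import importlib
import sys

def alias_package(real_name, alias):
    """Make `import alias` (and its submodules) resolve to package `real_name`."""
    mod = importlib.import_module(real_name)
    sys.modules[alias] = mod
    return mod

# For mlsh, one would call this before importing main.py (assuming the
# baselines package provides the same API that rl_algs did):
#   alias_package("baselines", "rl_algs")
#
# Demonstration of the mechanism with a stdlib package:
alias_package("json", "rl_algs_demo")
import rl_algs_demo  # now an alias for json
print(rl_algs_demo.dumps({"a": 1}))
```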
It seems the code for the FourRooms env was carried over from the option-critic implementation, and thus it also has a bug that the option-critic implementation had. See: jeanharb/option_critic#11
Also, it seems that the goal is the same every episode? https://github.com/openai/mlsh/blob/master/test_envs/test_envs/envs/fourrooms.py#L60
@kvfrans It would be great if you could clarify this.
Hello, I have tried to train the MLSH policies in the MovementBandits environment,
but the outputs of the master policy seem to be random even after training.
The command I tried is here:
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits
I suspect the master policy has to observe the correct goal in order to select sub-policies, but the current implementation provides nothing about the correct goal.
Do you have any updates about MovementBandits?
Line 123 in 58f527a
Hi, shouldn't the logic for determining terminal states for sub-policies consider the case where the master action changes? If the action changes, shouldn't we designate the current state as terminal? It seems that the current implementation can bootstrap from a different sub-policy network when such a case arises.
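The suggestion could be sketched as a hypothetical `subpolicy_done` flag (not the repo's actual code) that marks a state terminal for the active sub-policy when either the episode ends or the master policy switches sub-policies:

```python
def subpolicy_done(env_done, prev_macro_action, macro_action):
    """Terminal flag for the active sub-policy.

    A state is terminal when the episode ends OR the master policy
    switches sub-policies, so value bootstrapping never crosses a
    sub-policy boundary.
    """
    return env_done or (prev_macro_action is not None
                        and macro_action != prev_macro_action)

# Master switched from sub-policy 0 to 1 mid-episode -> terminal.
print(subpolicy_done(False, 0, 1))
```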
For the Fourrooms-v0 game, can the env.render() function be applied? How do I render the game?