openai / mlsh
Code for the paper "Meta-Learning Shared Hierarchies"
Home Page: https://arxiv.org/abs/1710.09767
I used the following command:
python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 50 --continue_iter 00615 --replay True AntAgent
One thing I noticed is that you need to write a leading "0" in front of the iteration number.
Another thing I noticed is that I needed to copy the files from "savedir" into the "AntAgent" folder to make it work.
I guess the checkpoint algorithm uses the wrong directory for storing checkpoints.
The first output also shows "It is Iteration 0 so i'm changing [...]".
However, I wanted to continue the learning process, not start from the beginning.
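As a temporary workaround, the checkpoint files can be mirrored from "savedir" into the agent folder before resuming. This is only a sketch under my assumptions about the file layout: `copy_checkpoints` is a hypothetical helper, and the actual checkpoint filenames in the repo may differ.

```python
import pathlib
import shutil

def copy_checkpoints(savedir, agent_dir, iteration="00615"):
    """Copy checkpoint files matching `iteration` from savedir into agent_dir.

    The iteration string is zero-padded, matching the zero-padded
    value passed to --continue_iter.
    """
    src = pathlib.Path(savedir)
    dst = pathlib.Path(agent_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.glob(f"*{iteration}*"):
        shutil.copy2(f, dst / f.name)
        copied.append(f.name)
    return copied
```

Running this once before the `--replay True` command should leave the files where the restore code looks for them, without having to copy them by hand.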
Hi! It would be great if you can give a more detailed readme for a Windows system. Thanks very much.
Hi,
I'm running the code with mpirun inside a Docker container. It worked at first, but recently I started getting the error message:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 135
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)
As far as I'm aware, I didn't change anything. Does anyone know where this might be coming from?
Thanks a lot!
Best, Max
See:
https://github.com/openai/mlsh/blob/master/mlsh_code/rollouts.py#L83
Are you sure the "and" condition there is correct?
It means the environment does not get reset even though you have already reached the goal.
(It will take another x steps until macrolen is reached, even though it is clear it won't get any more reward.)
It also means the environment will never reset at all if you are at a stage where the goal can no longer be reached.
An "or" would make more sense in my opinion.
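To illustrate the difference, here is a hypothetical `should_reset` helper (not the actual rollouts.py code) contrasting the two conditions, assuming `done` is the environment's episode-end flag and `t` the step counter:

```python
def should_reset(done, t, macrolen):
    """Compare the two candidate reset conditions."""
    # With `and`, a finished episode keeps stepping until t happens to
    # land on a macro-step boundary:
    and_version = done and (t % macrolen == 0)
    # With `or`, the reset fires as soon as either condition holds:
    or_version = done or (t % macrolen == 0)
    return and_version, or_version

# Goal reached at t=3 with macrolen=10: the `and` version does not
# reset yet, the `or` version does.
print(should_reset(True, 3, 10))
```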
When running the FourRooms environment, there is a bug in policy_network.py that makes the code crash. The observation returned by the FourRooms environment is an integer, whereas for MovementBandits it is a vector containing the current location and the two goal locations. On line 49 of policy_network.py, ob[None] crashes if ob is an integer.
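One possible fix, sketched here with a hypothetical `to_obs_vector` helper, is to normalize every observation to a 1-D NumPy array before the batch axis is added:

```python
import numpy as np

def to_obs_vector(ob):
    """Normalize an observation to a 1-D array.

    FourRooms returns a plain int, MovementBandits a float vector;
    after this conversion, ob[None] (adding a batch axis) works for both.
    """
    ob = np.asarray(ob)
    if ob.ndim == 0:
        ob = ob[np.newaxis]
    return ob

# int observation (FourRooms-style) -> batch shape (1, 1)
# vector observation (MovementBandits-style) -> batch shape (1, 3)
print(to_obs_vector(7)[None].shape, to_obs_vector([0.1, 0.2, 0.3])[None].shape)
```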
Hi, great work on the paper and code! I am working on a project that builds on top of MLSH. We implemented our own GPU-optimized version of the algorithm based on your MPI-based code. We observe, both in our setting and with your code for MovementBandits, that both sub-policies end up learning the same strategy of moving to just one of the bandits (the same one consistently) in a single run.
Here are the parameters from my run
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits
Additionally, I tried optimizer step sizes of both 3e-4 (as mentioned in the paper) and 3e-5 (the default argument in the code), and changed the seed value of 1401 that is also embedded in your main file.
I modified master.py to log some additional information, such as the current iteration number and the real goal chosen by randomizeCorrect (our fork). Here is a snippet from one of the runs:
Mini ep 10, goal 1, iteration 30: global: 18.60333333333333, local: 42.5
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 2.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 2.725
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 43.075
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 1.05
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 3.125
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 44.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.275
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 43.65
Mini ep 7, goal 0, iteration 33: global: 18.60333333333333, local: 5.575
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.975
Mini ep 8, goal 0, iteration 32: global: 18.60333333333333, local: 4.0
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 0.475
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 41.4
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 0.975
We observe similar behavior with our own implementation (confirmed by visualizing the sub-policies with render: both cause the agent to move to the same disc throughout the entire training run).
We are attempting to reproduce the results from the paper (Figure 4, page 6), where the agent learns to get rewards around 40 after a few gradient updates. Please let us know whether we are running the right hyperparameter configuration, and what seeds to use with the original codebase to observe such behavior; this would greatly help our research! Thanks.
Could you please point out the library for rl_algs?
Hi, I read the paper, and in the experiment section, apart from the first simple examples where it is trivial to determine the number of sub-policies, I didn't see any detail from Section 6.4 (ant robot, etc.) onward about how this number is set.
In my opinion this number should be quite important for the algorithm to perform well; for example, you won't get good results by setting num-policy=3 in FourRooms.
Could you please explain how this number should be chosen? Thank you.
Just want to check that we are running the code correctly. If I want to run the code with 120 parallel instances as the paper suggests, do I just use the mpiexec command as in the following, or is there another way of doing this?
mpiexec -n 120 python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Hi, I have a similar error with rl_algs:
mlsh/mlsh_code$ python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Traceback (most recent call last):
File "main.py", line 20, in <module>
from rl_algs.common import set_global_seeds, tf_util as U
ModuleNotFoundError: No module named 'rl_algs'
Originally posted by @ViktorM in #1 (comment)
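rl_algs was an internal OpenAI package; a commonly suggested workaround is to alias an open-source package that exposes a compatible API (e.g. baselines) under that name before importing main.py. Below is a minimal sketch of the aliasing mechanism only; `alias_package` is a hypothetical helper, and the demonstration uses a stdlib package since baselines may not be installed (whether baselines actually matches the rl_algs API is an assumption).

```python
import importlib
import sys

def alias_package(real_name, alias):
    """Make `import alias` (and its submodules) resolve to package `real_name`."""
    mod = importlib.import_module(real_name)
    sys.modules[alias] = mod
    return mod

# For mlsh, one would call this before importing main.py (assuming the
# baselines package provides the same API that rl_algs did):
#   alias_package("baselines", "rl_algs")
#
# Demonstration of the mechanism with a stdlib package:
alias_package("json", "rl_algs_demo")
import rl_algs_demo  # now an alias for json
print(rl_algs_demo.dumps({"a": 1}))
```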
It seems the code for the FourRooms env was carried over from the option-critic implementation, and thus it also has a bug that the option-critic implementation had. See: jeanharb/option_critic#11
Also, it seems that the goal is the same every episode? https://github.com/openai/mlsh/blob/master/test_envs/test_envs/envs/fourrooms.py#L60
@kvfrans It would be great if you could clarify this.
Hello, I have tried to train the MLSH policies in the MovementBandits environment,
but the outputs of the master policy seem to be random even after training.
The command I tried is here:
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits
I suspect the master policy has to observe the correct goal in order to select sub-policies, but the current implementation provides nothing about the correct goal.
Do you have any updates about MovementBandits?
Line 123 in 58f527a
Hi, shouldn't the logic for determining terminal states for sub-policies consider the case where the master action changes? If the action changes, shouldn't we designate the current state as terminal? It seems that the current implementation can bootstrap from a different sub-policy network when such a case arises.
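The suggestion could be sketched as a hypothetical `subpolicy_done` flag (not the repo's actual code) that marks a state terminal for the active sub-policy when either the episode ends or the master policy switches sub-policies:

```python
def subpolicy_done(env_done, prev_macro_action, macro_action):
    """Terminal flag for the active sub-policy.

    A state is terminal when the episode ends OR the master policy
    switches sub-policies, so value bootstrapping never crosses a
    sub-policy boundary.
    """
    return env_done or (prev_macro_action is not None
                        and macro_action != prev_macro_action)

# Master switched from sub-policy 0 to 1 mid-episode -> terminal.
print(subpolicy_done(False, 0, 1))
```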
For the Fourrooms-v0 game, can the env.render() function be applied? How do I render the game?