
pytorch-maml-rl's Introduction

Reinforcement Learning with Model-Agnostic Meta-Learning (MAML)

[GIF: trained policy on HalfCheetahDir]

Implementation of Model-Agnostic Meta-Learning (MAML) applied to Reinforcement Learning problems in PyTorch. This repository includes the environments introduced in Duan et al., 2016 and Finn et al., 2017: multi-armed bandits, tabular MDPs, continuous control with MuJoCo, and a 2D navigation task.

Getting started

To avoid any conflict with your existing Python setup, and to keep this project self-contained, it is suggested to work in a virtual environment with virtualenv. To install virtualenv:

pip install --upgrade virtualenv

Create a virtual environment, activate it and install the requirements in requirements.txt.

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Requirements

  • Python 3.5 or above
  • PyTorch 1.3
  • Gym 0.15

Usage

Training

You can use the train.py script to run reinforcement learning experiments with MAML. Note that by default the logs computed in train.py (e.g. the returns during meta-training) are not saved. For example, to run the script on HalfCheetah-Vel:

python train.py --config configs/maml/halfcheetah-vel.yaml --output-folder maml-halfcheetah-vel --seed 1 --num-workers 8
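If you want to persist these returns, one simple option (a sketch, not part of the repository; it assumes the logs dictionary built inside the training loop contains train_returns and valid_returns arrays, as in the code quoted in the issues below) is to dump them to disk at every iteration:

import os
import numpy as np

# Hypothetical helper to call inside the training loop of train.py, after
# logs.update(...); output_folder and batch come from the surrounding script.
def save_logs(logs, output_folder, batch):
    np.savez(os.path.join(output_folder, 'logs-{0}.npz'.format(batch)),
             train_returns=logs['train_returns'],
             valid_returns=logs['valid_returns'])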

Testing

Once you have meta-trained the policy, you can test it on the same environment using test.py:

python test.py --config maml-halfcheetah-vel/config.json --policy maml-halfcheetah-vel/policy.th --output maml-halfcheetah-vel/results.npz --meta-batch-size 20 --num-batches 10  --num-workers 8
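test.py writes its results to the file given by --output. A minimal way to inspect them (a sketch; it assumes the archive contains train_returns and valid_returns arrays, as users report in the issues below):

import numpy as np

# Load the results produced by the test command above.
data = np.load('maml-halfcheetah-vel/results.npz')

# Average return before adaptation (train_returns) and after the
# gradient update(s) on each task (valid_returns).
print('before update:', data['train_returns'].mean())
print('after update: ', data['valid_returns'].mean())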

References

This project is, for the most part, a reproduction of the original implementation cbfinn/maml_rl in PyTorch. These experiments are based on the paper:

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. International Conference on Machine Learning (ICML), 2017 [ArXiv]

If you want to cite this paper:

@article{finn17maml,
  author    = {Chelsea Finn and Pieter Abbeel and Sergey Levine},
  title     = {{Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks}},
  journal   = {International Conference on Machine Learning (ICML)},
  year      = {2017},
  url       = {http://arxiv.org/abs/1703.03400}
}

If you want to cite this implementation:

@misc{deleu2018mamlrl,
  author = {Tristan Deleu},
  title  = {{Model-Agnostic Meta-Learning for Reinforcement Learning in PyTorch}},
  note   = {Available at: https://github.com/tristandeleu/pytorch-maml-rl},
  year   = {2018}
}


pytorch-maml-rl's Issues

Are benchmarks available?

Thank you for the repo!

I was wondering whether the repo obtains the same performance as reported in the papers on the DRL benchmarks.

Interpretation of before and after update

I am confused about the before and after update rewards on tensorboard.

# Tensorboard
writer.add_scalar('total_rewards/before_update',
                  total_rewards([ep.rewards for ep, _ in episodes]), batch)
writer.add_scalar('total_rewards/after_update',
                  total_rewards([ep.rewards for _, ep in episodes]), batch)

I mean, I wanted to understand how to train the model on a new environment in 2 or 3 gradient steps and then check the reward. Is this what the after_update rewards refer to?

Fails to converge on bandit tasks

Using k=5, n=100, MAML fails to learn: average training and validation returns consistently hover around 50 throughout all 500 outer-loop steps. Are there any possible discrepancies between this repo's code/config and the paper's experiments?

For reference, the following command

python train.py --config configs/maml/bandit/bandit-k5-n100.yaml --output-folder maml-bandit-k5-n100 --seed 1 --num-workers 10

produces the following average training/validation returns for the first and last 5 iterations, respectively:

0 49.1 51.600002
1 45.5 47.75
2 49.449997 50.350002
3 49.65 52.2
4 50.4 52.7
...
495 46.6 50.0
496 50.150005 50.200005
497 53.100002 55.15
498 49.0 50.450005
499 44.5 47.600002

Questions about the Ant environment

Hi, this is a fantastic resource, thanks for sharing it! I ran AntDir-v1 and AntVel-v1 two or three times each, but they don't seem to learn at all. I could successfully run HalfCheetahDir-v1 and HalfCheetahVel-v1. Do you have any idea why? I used the default settings in your running script.

Question about first_order argument

Hi,

I am wondering what the effects of first_order=False are, and when we should use it.

From what I understand of the current implementation, first_order only affects the sample method, and thus is not reflected in the MAML training loop (MetaLearner.surrogate_loss, MetaLearner.step). I am wondering if this would make any difference in training.
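For context, the usual distinction (a generic sketch, not this repository's exact code; the model, loss, and step size are placeholders) is whether the inner-loop update keeps its graph for backpropagation:

import torch
import torch.nn as nn

policy = nn.Linear(4, 2)                      # placeholder "policy"
inner_loss = policy(torch.randn(8, 4)).pow(2).mean()

first_order = False                           # True drops the second-order terms

params = list(policy.parameters())
grads = torch.autograd.grad(inner_loss, params,
                            create_graph=not first_order)
adapted_params = [p - 0.1 * g for p, g in zip(params, grads)]
# With first_order=False, a loss computed from adapted_params can be
# backpropagated through the inner update (second-order MAML); with
# first_order=True, the gradients are treated as constants.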

Would like to confirm with you on this.

Thanks

Memory is always increasing?

Thanks a lot for your implementation of this project. But I have run into a problem that I can't solve by myself: when I run your code, the memory usage keeps increasing and never decreases until memory is exhausted. Can you tell me the reason?

Problem with registration importing the basic modified environment

Hello,

Thank you for this code base. I'm facing an issue when running the main experiment. Do you have any suggestions for getting it to work? I've tested the OpenAI Gym and MuJoCo frameworks independently and they both seem to work fine. I'm not sure why something wrong is getting passed to one of the internal checks in registration.py.

Thank you!

$ python main.py --env-name HalfCheetahDir-v1 --num-workers 8 --fast-lr 0.1 --max-kl 0.01 --fast-batch-size 20 --meta-batch-size 40 --num-layers 2 --hidden-size 100 --num-batches 1000 --gamma 0.99 --tau 1.0 --cg-damping 1e-5 --ls-max-steps 15 --output-folder maml-halfcheetah-dir --device cuda

Output:

Traceback (most recent call last):
  File "main.py", line 141, in <module>
    main(args)
  File "main.py", line 34, in main
    num_workers=args.num_workers)
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/code/pytorch-maml-rl/maml_rl/sampler.py", line 21, in __init__
    queue=self.queue)
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/code/pytorch-maml-rl/maml_rl/envs/subproc_vec_env.py", line 67, in __init__
    for (remote, env_fn) in zip(self.work_remotes, env_factory)]
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/code/pytorch-maml-rl/maml_rl/envs/subproc_vec_env.py", line 67, in <listcomp>
    for (remote, env_fn) in zip(self.work_remotes, env_factory)]
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/code/pytorch-maml-rl/maml_rl/envs/subproc_vec_env.py", line 15, in __init__
    self.env = env_fn()
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/code/pytorch-maml-rl/maml_rl/sampler.py", line 10, in _make_env
    return gym.make(env_name)
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/libraries/gym/gym/envs/registration.py", line 183, in make
    return registry.make(id, **kwargs)
  File "/home/fishy2/anaconda3/envs/comp767_maml_project/libraries/gym/gym/envs/registration.py", line 132, in make
    if (env.spec.timestep_limit is not None) and not spec.tags.get('vnc'):
AttributeError: 'NoneType' object has no attribute 'timestep_limit'

pytorch 1.3 and python 3.8

pip couldn't find a PyTorch 1.3 build for my Python 3.8 virtualenv, but requirements.txt installed just fine when I used Python 3.6.9. Maybe one fix is to update the README to limit the compatible Python versions to 3.5-3.6 (maybe 3.7, but I didn't test that).

Custom environment and baseline.fit(episodes) error

Hi -- I have a custom gym environment that outputs observations that (should) never contain all zeros, but sometimes when I print out episodes.observations the first several rows contain reasonable observation vectors and the last several rows contain zero vectors. I am guessing the mask attribute is related to this?

The issue is that having a bunch of zero rows seems to make the matrix inversion in baseline.fit difficult and it returns an error. I'm wondering if you have any advice on where the zero vectors might be coming from and what to do to make the fit function work in their presence (maybe ignoring those rows?). Thanks!

Can this code run on Windows 10?

I see that you have updated your code recently. Your work is amazing and I really appreciate it. However, I find that your code runs on Linux but not on Windows 10.

Questions about the MultiTaskSampler

Hello,

Thanks for sharing the code.

I'm trying to put train.py and test.py into one file, which is important for my further work.

However, when I use sampler = MultiTaskSampler(...) a second time, it stops at

async def _wait(train_futures, valid_futures):
    # Gather the train and valid episodes
    train_episodes = await asyncio.gather(*[asyncio.gather(*futures)
                                            for futures in train_futures])
    valid_episodes = await asyncio.gather(*valid_futures)
    return (train_episodes, valid_episodes)

in multi_task_sampler.py.

It's easy to reproduce this problem if you copy the code from

pytorch-maml-rl/train.py

Lines 47 to 83 in 243214b

sampler = MultiTaskSampler(config['env-name'],
                           env_kwargs=config.get('env-kwargs', {}),
                           batch_size=config['fast-batch-size'],
                           policy=policy,
                           baseline=baseline,
                           env=env,
                           seed=args.seed,
                           num_workers=args.num_workers)
metalearner = MAMLTRPO(policy,
                       fast_lr=config['fast-lr'],
                       first_order=config['first-order'],
                       device=args.device)
num_iterations = 0
for batch in trange(config['num-batches']):
    tasks = sampler.sample_tasks(num_tasks=config['meta-batch-size'])
    futures = sampler.sample_async(tasks,
                                   num_steps=config['num-steps'],
                                   fast_lr=config['fast-lr'],
                                   gamma=config['gamma'],
                                   gae_lambda=config['gae-lambda'],
                                   device=args.device)
    logs = metalearner.step(*futures,
                            max_kl=config['max-kl'],
                            cg_iters=config['cg-iters'],
                            cg_damping=config['cg-damping'],
                            ls_max_steps=config['ls-max-steps'],
                            ls_backtrack_ratio=config['ls-backtrack-ratio'])
    train_episodes, valid_episodes = sampler.sample_wait(futures)
    num_iterations += sum(sum(episode.lengths) for episode in train_episodes[0])
    num_iterations += sum(sum(episode.lengths) for episode in valid_episodes)
    logs.update(tasks=tasks,
                num_iterations=num_iterations,
                train_returns=get_returns(train_episodes[0]),
                valid_returns=get_returns(valid_episodes))
to the end of the main function, where it is used a second time.

I think the problem may be a deadlock involving the multiprocessing module.
However, I've searched for and tried many solutions from Stack Overflow (such as gc and threading), and I couldn't fix the problem.

What could be the problem?

Thank you so much if you can provide any help.

Cuda Support Issue

Hi,
Thank you so much for the codes.

I see that CUDA may not be supported in many cases.
I'm guessing this is due to multiprocessing, but I am not sure about that.
So can you (or whoever can) briefly explain what's causing it, and whether it can be fixed?

Thanks in advance :)

Could you kindly implement the pytorch-maml part?

Hi, I found it very difficult to implement MAML. My current implementation has some bugs: https://github.com/dragen1860/MAML-Pytorch. I'm not sure how to compute the second derivative in MAML. I have also looked at other PyTorch implementations, and they have even more severe bugs.

So would you kindly publish a correct version of pytorch-maml, or review my implementation above and help me find any bugs?
Thanks so much.

train_returns and valid_returns seem to be equal

Hi,

Thank you for your great work. After training and testing with test.py, I got a results.npz file. However, the means of train_returns and valid_returns are very similar; the difference is always below 1. I've tried different fast-lr values (0.1, 0.01, 0.001, even 10), increased the number of batches to 1000, and also varied num-steps in test.py. Nevertheless, there is no significant difference between train_returns and valid_returns. I tried the halfcheetah-vel and halfcheetah-dir environments and the problem occurred in both. Do you have any suggestions?

Question : hessian_vector_product in MetaLearner needed for TRPO, or MAML?

Hello!

Great implementation, thank you for putting this out there!

I am using it as a design reference for a supervised-learning MAML implementation, and I have a quick question about the outer-loop gradient:

From my understanding, the hessian_vector_product calculation is only needed as part of the TRPO implementation, in order to do conjugate gradients and line search. Is that right?

What I mean is that if I want to do supervised learning, I can just use the autograd.grad(create_graph=True) trick to create gradients that I can backpropagate through, and then in the outer loop just use the standard PyTorch Adam optimizer, right?

(I know this is getting slightly off topic for this repo, apologies, but this should also extend seamlessly to multiple inner updates, right? I'd just need to set create_graph=True for all of them.)
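For reference, here is a minimal supervised MAML sketch along the lines described above (generic code, not taken from this repository; the model, data, and learning rates are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(4, 1)
meta_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.1

x_support, y_support = torch.randn(8, 4), torch.randn(8, 1)
x_query, y_query = torch.randn(8, 4), torch.randn(8, 1)

# Inner loop: keep the graph so the outer loss sees the second-order terms.
params = list(model.parameters())                 # [weight, bias]
inner_loss = F.mse_loss(model(x_support), y_support)
grads = torch.autograd.grad(inner_loss, params, create_graph=True)
adapted = [p - inner_lr * g for p, g in zip(params, grads)]

# Outer loop: evaluate the adapted parameters on the query set and let Adam
# update the original (meta) parameters by backpropagating through the inner step.
outer_loss = F.mse_loss(F.linear(x_query, adapted[0], adapted[1]), y_query)
meta_optimizer.zero_grad()
outer_loss.backward()
meta_optimizer.step()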

I will appreciate any input on these a lot!
Again, great repo!
Thanks!!

KL divergence with old policy in trpo training

Hi,

I noticed that during TRPO training, we need to compute a Hessian-vector product to get the step direction.

The code for the Hessian-vector product requires computing the KL divergence between the new and old policy:

def hessian_vector_product(self, episodes, damping=1e-2):
    def _product(vector):
        kl = self.kl_divergence(episodes)
        ...... # use kl to do some computations
    return _product

Below is the implementation of self.kl_divergence()

def kl_divergence(self, episodes, old_pis=None):
    kls = []
    if old_pis is None:
        old_pis = [None] * len(episodes)

    for (train_episodes, valid_episodes), old_pi in zip(episodes, old_pis):
        params = self.adapt(train_episodes)
        pi = self.policy(valid_episodes.observations, params=params)

        ## If old_pis is not provided, use pi as old_pi
        if old_pi is None:
            old_pi = detach_distribution(pi)
        ......
        kl = weighted_mean(kl_divergence(pi, old_pi), dim=0, weights=mask)
        kls.append(kl)

    return torch.mean(torch.stack(kls, dim=0))

Since old_pis is not provided inside hessian_vector_product, this means that the KL divergence is computed between the new policy and itself.

I am wondering if this would lead to self.kl_divergence(episodes) == 0 consistently throughout training?

Would appreciate your insight. Thanks!
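For context, a small self-contained check (generic PyTorch, not this repository's code) shows that the KL divergence between a distribution and a detached copy of itself is zero in value and gradient, but its Hessian-vector product is not:

import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical 1-D Gaussian "policy" parameters (illustration only).
mu = torch.tensor([0.5], requires_grad=True)
log_std = torch.tensor([0.1], requires_grad=True)

pi = Normal(mu, log_std.exp())
old_pi = Normal(mu.detach(), log_std.exp().detach())   # detached copy

kl = kl_divergence(pi, old_pi).mean()
print(kl.item())                                       # 0.0

grads = torch.autograd.grad(kl, (mu, log_std), create_graph=True)
print([g.item() for g in grads])                       # [0.0, 0.0]

# Hessian-vector product of the KL with an all-ones vector: non-zero,
# because the curvature does not vanish at old_pi.
grad_dot_v = sum(g.sum() for g in grads)
hvp = torch.autograd.grad(grad_dot_v, (mu, log_std))
print([h.item() for h in hvp])                         # non-zero entries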

AttributeError: 'Box' object has no attribute 'n'

Dear author:
When I try to run some experiments, such as "Hopper-v2", the following error occurs:

Traceback (most recent call last):
  File "main.py", line 151, in <module>
    main(args)
  File "main.py", line 52, in main
    sampler.envs.action_space.n,
AttributeError: 'Box' object has no attribute 'n'

Why?

Loading Pre/Partially-Trained and Visualization

Hello!

Appreciate this repo, I'm learning a lot from it. I trained the HalfCheetah and want to start exploring other environments, but I can't figure out how to restore my trained model or visualize the cheetah. I can see the rewards in TensorBoard perfectly, but I want to render the saved policy network to see the trained cheetah in action. I noticed there is no argument for something such as a load path.

Is there a way to do this? How have you been rendering the environment (e.g. the Cheetah GIF in the README)? Thanks!

Restoring model

Currently there is no way to restore a saved model and only do testing (fast adaptation), correct?

AttributeError: 'dict' object has no attribute 'iteritems'

i@d:~/rl/pytorch-maml-rl$ python main.py --env-name HalfCheetahDir-v1 --num-workers 8 --fast-lr 0.1 --max-kl 0.01 --fast-batch-size 20 --meta-batch-size 40 --num-layers 2 --hidden-size 100 --num-batches 1000 --gamma 0.99 --tau 1.0 --cg-damping 1e-5 --ls-max-steps 15 --output-folder maml-halfcheetah-dir --device cuda

Traceback (most recent call last):
  File "main.py", line 138, in <module>
    main(args)
  File "main.py", line 29, in main
    config = {k: v for (k, v) in vars(args).iteritems() if k != 'device'}
AttributeError: 'dict' object has no attribute 'iteritems'

I followed your README but this error occurs. Any tips? Thanks.

question about /maml_rl/policies/categorical_mlp.py

def forward(self, input, params=None):
    if params is None:
        params = OrderedDict(self.named_parameters())
    output = input
    for i in range(1, self.num_layers):
        output = F.linear(output,
            weight=params['layer{0}.weight'.format(i)],
            bias=params['layer{0}.bias'.format(i)])
        output = self.nonlinearity(output)
    logits = F.linear(output,
        weight=params['layer{0}.weight'.format(self.num_layers)],
        bias=params['layer{0}.bias'.format(self.num_layers)])
    return Categorical(logits=logits)

At the end of categorical_mlp.py, the forward function returns the above result.

But shouldn't it be return Categorical(logits), since logits means the probabilities, right?
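For reference, a small check of the torch.distributions.Categorical API (generic code, unrelated to this repository): passing logits= applies a softmax internally, whereas a positional argument is interpreted as (unnormalized) probabilities.

import torch
from torch.distributions import Categorical

logits = torch.tensor([1.0, 2.0, 0.5])

d_logits = Categorical(logits=logits)                       # softmax applied internally
d_probs = Categorical(probs=torch.softmax(logits, dim=-1))  # equivalent distribution
print(torch.allclose(d_logits.probs, d_probs.probs))        # True

# A positional argument is treated as probs, which gives a different distribution.
print(Categorical(logits).probs)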

Pre-trained networks

I couldn't find a pre-trained policy hosted here, so I trained my own using the suggested command in the README (saving each iteration). You can find a zip with all of these policies here -- feel free to copy and host it, maybe even as a release here? This might be helpful for others who want to try the repo; not an issue per se. Thanks for posting and maintaining this nice repo.

k-shot testing script

Hi! Thanks for your awesome work!

I am wondering if you have implemented a MAML test script like in the original paper, where we can test pretrained MAML agents and plot the k-shot rewards?

I am planning on using your repo for a project, and this function would be highly useful. Thanks!

"terminate called after throwing an instance of 'c10::Error'"

Hello,

Thanks for sharing the code. When I tried python train.py --config configs/maml/halfcheetah-vel.yaml --output-folder maml-halfcheetah-vel --seed 1 --num-workers 8,

it gave me this error:

"terminate called after throwing an instance of 'c10::Error'"

I checked that all the requirements are satisfied. What could be the problem?

Thanks

question about test

Hi,
Apologies if the question is a little dumb, but I can't figure out what's going on in test.py. Is there any learning phase in it? If not, how can I test the gradient update, and if so, where does the model learn?

Questions about multi-gradient steps

Hi, thank you for providing a great implementation! I've learned a lot from this repo, which is easy to understand and fast. My question is that for the 2D navigation task, I trained with num_steps=5 and tested it, but the results are quite different from those in the original paper.
I edited the test.py code like the following:

# test.py
...
for batch in trange(args.num_batches):
        tasks = sampler.sample_tasks(num_tasks=args.meta_batch_size)
        train_episodes, valid_episodes = sampler.sample(tasks,
                                                        num_steps=config['num-steps'], # num_steps=5
                                                        fast_lr=config['fast-lr'],
                                                        gamma=config['gamma'],
                                                        gae_lambda=config['gae-lambda'],
                                                        device=args.device)

        logs['tasks'].extend(tasks)
        grad0_returns.append(get_returns(train_episodes[0]))
        grad1_returns.append(get_returns(train_episodes[1]))
        grad2_returns.append(get_returns(train_episodes[2]))
        grad3_returns.append(get_returns(train_episodes[3]))
        grad4_returns.append(get_returns(train_episodes[4]))
...
logs['grad0_returns'] = np.concatenate(grad0_returns, axis=0)
logs['grad1_returns'] = np.concatenate(grad1_returns, axis=0)
...

after seeing #26 (comment).

To see the results, I did something like this.

...
data = np.load('path-to-results')
grad0_returns = data['grad0_returns']
grad1_returns = data['grad1_returns']
...
val0 = grad0_returns.mean()
val1 = grad1_returns.mean()
...

However, as the figure below shows, the resulting values are far from what we want.

[Figure 1: results from this run]

[Figure 2: the corresponding figure from the paper]

I also tested with just 1 gradient step, which gives about -10, similar to the original paper. Only additional gradient steps are the problem.

And one more thing: the paper says that for evaluation they used a fast learning rate of 0.1 for the first gradient step, then halved it to 0.05 for all subsequent steps. But I can't find that in this implementation. Isn't this critical? Since I didn't check this, I'm now struggling to modify the code to follow the original paper.

Thank you very much in advance!

Can inner update apply Advanced optimizer such as Adam/RMSprop?

Hi, I found your code very readable and elegant, thanks.
I have a question about implementing MAML.
When doing the inner update:

        for (name, param), grad in zip(self.named_parameters(), grads):
            updated_params[name] = param - step_size * grad

or

fast_weights = list(map(lambda p: p[1] - self.train_lr * p[0], zip(grad, fast_weights)))

While in the outer update, we can still use simple SGD:

        for (name, param), grad in zip(self.named_parameters(), grads):
            updated_params[name] = param - step_size * grad

Or

 meta_op.backward()
 adam_optim.step()

My question is: can the inner update use Adam/RMSprop, and why or why not? Will it corrupt the computation graph?

Cannot read your env in Jupyter

Hi, very nice code.

I downloaded your code and wanted to create the environment in Jupyter.
I created the config manually. When I try to make the gym env, it fails.
It seems that your MuJoCo envs are not connected with gym/envs/mujoco, and thus I cannot load your MuJoCo env with gym.make.

My code and the errors are listed as follows:

import argparse
import multiprocessing as mp
import yaml
import json
import os
import gym
import torch

parser = argparse.ArgumentParser(description='Reinforcement learning with '
                                             'Model-Agnostic Meta-Learning (MAML) - Train')
# parser.add_argument('--config', type=str, required=True,help='path to the configuration file.')

# Miscellaneous
misc = parser.add_argument_group('Miscellaneous')
misc.add_argument('--output-folder', type=str,
                  help='name of the output folder')
misc.add_argument('--seed', type=int, default=None,
                  help='random seed')
misc.add_argument('--num-workers', type=int, default=mp.cpu_count() - 1,
                  help='number of workers for trajectories sampling (default: '
                       '{0})'.format(mp.cpu_count() - 1))
misc.add_argument('--use-cuda', action='store_true',
                  help='use cuda (default: false, use cpu). WARNING: Full upport for cuda '
                       'is not guaranteed. Using CPU is encouraged.')

# args = parser.parse_args()
# args.device = ('cuda' if (torch.cuda.is_available()
#                           and args.use_cuda) else 'cpu')

args = parser.parse_args(args=[])

args.config = "configs/maml/halfcheetah-vel.yaml"
with open(args.config, 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

args.output_folder = "maml-halfcheetah-vel"
if args.output_folder is not None:
    if not os.path.exists(args.output_folder):
        os.makedirs(args.output_folder)
    policy_filename = os.path.join(args.output_folder, 'policy.th')
    config_filename = os.path.join(args.output_folder, 'config.json')

print(config['env-name'], config['env-kwargs'])
env = gym.make(config['env-name'], **config.get('env-kwargs', {}))

[Screenshot of the resulting error]

BR,
Charles

log_ratio problem

Hello,
Thank you for sharing this wonderful implementation. In pytorch-maml-rl/maml_rl/metalearner.py, line 142, when we calculate the log ratio (which is essentially importance sampling), why do we put old_pi in the denominator of the ratio? The advantages are sampled from pi, not old_pi, so isn't the importance weight implementation wrong?

Question about the Ant env

Hi, thanks for your code, it has given me a lot of help.
But I have a question about the AntVel env: when I try to train MAML in this environment, I can't get a good result, while it works well in the HalfCheetahVel environment. Have you encountered this problem?
And could you tell me how to deal with it? Thanks!

How can I adapt MAML to my own environment?

Hi, thanks for your excellent work!
I want to know how I can adapt MAML to my own environment. Are there any methods that I must add to my env class, such as sample_tasks or reset_tasks?

TypeError: list indices must be integers or slices, not str

I run:
python train.py --config configs/maml/bandit/bandit-k5-n10.yaml --output-folder bandit/bandit-k5-n10/ --seed 1 --num-workers 8
but it fails:
Process SamplerWorker-1:
Process SamplerWorker-2:
Process SamplerWorker-3:
Traceback (most recent call last):
Traceback (most recent call last):
File "/gpfs/share/home/.conda/envs/spinningup/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 333, in run
self.sample(index, **kwargs)
File "/gpfs/share/home/.conda/envs/spinningup/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 263, in sample
device=device)
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 333, in run
self.sample(index, **kwargs)
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 298, in create_episodes
for item in self.sample_trajectories(params=params):
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 263, in sample
device=device)
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 318, in sample_trajectories
batch_ids = infos['batch_ids']
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 298, in create_episodes
for item in self.sample_trajectories(params=params):
File "/gpfs/share/home/MetaRLSAS/maml_rl/samplers/multi_task_sampler.py", line 318, in sample_trajectories
batch_ids = infos['batch_ids']
TypeError: list indices must be integers or slices, not str
TypeError: list indices must be integers or slices, not str

What's wrong with this?

How do gym.make go to envs/mujoco/ant.py or half_cheetah.py

Dear author:
I have a small question while trying to understand your code.
I saw that you make the env by calling:

def make_env(env_name):
	"""
	return a function
	:param env_name:
	:return:
	"""
	def _make_env():
		return gym.make(env_name)

	return _make_env

However, you don't call ant.py or half_cheetah.py explicitly, so I wonder: how does gym know that you have implemented the classes HalfCheetahDirEnv, HalfCheetahVelEnv, etc.?
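For context, gym.make only resolves environment IDs that have been registered beforehand. Custom environments are typically registered at import time with a call of roughly this shape (a generic sketch; the entry point and keyword arguments are illustrative, not the repository's actual registration code):

from gym.envs.registration import register

# Hypothetical registration; in practice a call like this usually lives in a
# package __init__.py, so that importing the package is enough for gym.make
# to find the custom environment classes.
register(
    id='HalfCheetahDir-v1',
    entry_point='maml_rl.envs.mujoco.half_cheetah:HalfCheetahDirEnv',
    max_episode_steps=100,
)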

AttributeError: Can't pickle local object 'make_env.<locals>._make_env'

Ubuntu 18.04.5 LTS
I met this problem after adding mp.set_start_method('spawn'), following #40 (comment):

Traceback (most recent call last):
  File "train.py", line 122, in <module>
    main(args)
  File "train.py", line 54, in main
    num_workers=args.num_workers)
  File "/home/dchen/pytorch-maml-rl/maml_rl/samplers/multi_task_sampler.py", line 107, in __init__
    worker.start()
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/dchen/anaconda3/envs/torch/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'make_env.<locals>._make_env'

Questions about the output files

This question may be a bit silly, but I cannot figure out what the output files mean and how to reproduce the figures from your paper. The results file consists of three arrays: tasks, train_returns, and valid_returns. To get the average returns after the update, should I calculate the mean of valid_returns? What about the returns before the update? Are they computed by averaging train_returns? Thank you so much if you can provide any help.

HalfCheetahDir-v1

gym.error.UnregisteredEnv: No registered env with id: HalfCheetahDir-v1

GPU memory leak

Hi,
Unlike your implementation, which uses the weights/biases from nn.Linear layers, I write the tensors and the forward pass in a fully functional style:

class Net:

	def __init__(self, n_class, device):

		# according to Pytorch w/b format, w = [out_dim, in_dim]
		# b = [out_dim]
		self.vars = [
			# [28*28, 512]
			torch.ones(512, 28 * 28, requires_grad=True, device=device),
			torch.zeros(512, requires_grad=True, device=device),
			# [512, 256]
			torch.ones(256, 512, requires_grad=True, device=device),
			torch.zeros(256, requires_grad=True, device=device),
			# [256, n]
			torch.ones(n_class, 256, requires_grad=True, device=device),
			torch.zeros(n_class, requires_grad=True, device=device)
		]



 	def forward(self, x, vars):
		"""

		:param x: [b, 1, 28, 28]
		:param vars:
		:return:
		"""

		vars_idx = 0

		# [b, 1/2, 28, 28]
		x = x.view(x.size(0), -1)

		# [b, 28*28] => [b, 512]
		x = F.linear(x, vars[vars_idx], vars[vars_idx + 1])
		# x = self.bn1(x)
		x = F.leaky_relu(x, 0.2)
		vars_idx += 2

		# [b, 512] => [b, 256]
		x = F.linear(x, vars[vars_idx], vars[vars_idx + 1])
		# x = self.bn2(x)
		x = F.leaky_relu(x, 0.2)
		vars_idx += 2


		# [b, 256] => [b, n_class]
		x = F.linear(x, vars[vars_idx], vars[vars_idx + 1])
		# x = self.bn3(x)
		# here follow by CrossEntroyLoss
		# x = F.leaky_relu(x, 0.2) 
		vars_idx += 2

		return x

However, when I implemented MAML in this form, I found that my program occupies more and more GPU memory, which looks like a memory leak.

I don't know why. Could you share any experience or guesses about the cause?
