
dena / handyrl

282 stars · 13 watchers · 41 forks · 609 KB

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

License: MIT License

Python 100.00%
reinforcement-learning pytorch games policy-gradient deep-learning machine-learning distributed-training

handyrl's People

Contributors

ikki407, imokuri, kenoss, taniguti, yukohno, yuricat


handyrl's Issues

Replay Buffer

Many thanks for the great code. I have a question.

How do I insert episodes into the replay buffer? For example, I want to feed experience generated by other agents (random or rule-based agents) into the replay buffer, especially at the beginning of training.

BrokenPipeError: [Errno 32] Broken pipe

When training reached epoch 189, training was interrupted on a server.
It seems to run fine on my own computer with the same config.

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 175, in _sender
    conn.send(next(self.send_generator))
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
    data, cnt = conn.recv()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex2/hx_workspare/HandyRL/handyrl/connection.py", line 190, in _receiver
    data, cnt = conn.recv()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/home/alex2/workspace/miniconda3/envs/torch/lib/python3.7/multiprocessing/connection.py", line 620, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

config.yaml:

train_args:
    turn_based_training: False
    observation: True
    gamma: 0.8
    forward_steps: 32
    compress_steps: 4
    entropy_regularization: 2.0e-3
    entropy_regularization_decay: 0.3
    update_episodes: 300
    batch_size: 400
    minimum_episodes: 10000
    maximum_episodes: 250000
    num_batchers: 7
    eval_rate: 0.1
    worker:
        num_parallel: 6
    lambda: 0.7
    policy_target: 'UPGO' # 'UPGO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 0
    restart_epoch: 0

worker_args:
    server_address: ''
    num_parallel: 6
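
For context, a BrokenPipeError from a multiprocessing connection means the process on the other end has already exited (for example after a crash or an out-of-memory kill), so the sender is writing into a dead pipe. A minimal sketch of that mechanism using only the standard library (not HandyRL code):

import multiprocessing as mp

def worker(conn):
    # Simulate a worker process that dies without ever reading.
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=worker, args=(child_end,))
    p.start()
    p.join()
    child_end.close()  # drop the parent's reference to the child end too
    try:
        parent_end.send("payload")
        parent_end.send("payload")  # sending after the peer is gone fails
    except (BrokenPipeError, ConnectionResetError) as exc:
        print("peer process has exited:", exc)

In the traceback above, the same pattern appears inside handyrl/connection.py: once one of the child processes dies, the sender and receiver threads typically fail with BrokenPipeError, EOFError, or ConnectionRefusedError.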

OSError when running on Windows

Hey guys, thanks for the amazing work.

I have been trying to run your library on Windows, but I keep encountering the following issue:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\handyRL\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\ProgramData\Anaconda3\envs\handyRL\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Projects\Kaggle\HandyRL\handyrl\connection.py", line 254, in _recv_thread
    conn_list, _, _ = select.select(self.conns, [], [], 0.3)
OSError: [WinError 10022] An invalid argument was supplied
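
For background, select.select() on Windows only accepts sockets, while multiprocessing Connection objects there are backed by named pipes, which is why the call fails with WinError 10022. A hedged sketch of a cross-platform way to poll Connection objects (an illustration, not a patch to HandyRL) uses multiprocessing.connection.wait():

from multiprocessing.connection import wait

def poll_connections(conns, timeout=0.3):
    # conns: a list of multiprocessing Connection objects, analogous to self.conns above.
    # wait() returns the connections with data ready (or an empty list on timeout),
    # and accepts pipe-backed connections on Windows, unlike select.select().
    ready = wait(conns, timeout)
    return [conn.recv() for conn in ready]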

TypeError: list indices must be integers or slices, not str

When I run HandyRL in Colab or on Kaggle, I get the following error.
Is there a problem with loading Lux AI's kaggle-environments package?

  File "/usr/local/lib/python3.7/dist-packages/kaggle_environments/envs/lux_ai_2021/lux_ai_2021.py", line 106, in interpreter
    player1.observation.globalCityIDCount = match_obs_meta["globalCityIDCount"]
TypeError: list indices must be integers or slices, not str

Make it a library

cf. #182 (comment)

At a glance, this framework looks worthwhile, and it would be convenient if it were available on PyPI.

The required work is:

  • Make handyrl executable
  • Support user-provided environments and config.yaml.
  • CI (publish to PyPI)
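
For illustration, a minimal packaging sketch of what that could look like (the file layout, metadata, and entry point below are assumptions, not an agreed design):

# setup.py -- illustrative only
from setuptools import setup, find_packages

setup(
    name="handyrl",
    version="0.0.1",  # placeholder version
    description="Handy and simple framework for distributed reinforcement learning",
    packages=find_packages(include=["handyrl", "handyrl.*"]),
    install_requires=["torch", "pyyaml", "numpy"],  # assumed core dependencies
    entry_points={
        # Assumes the logic in main.py is moved into the package as handyrl/main.py
        # and exposes a main() function, enabling a `handyrl --train` style command.
        "console_scripts": ["handyrl=handyrl.main:main"],
    },
)

User-provided environments and config.yaml could then be passed by path or module name at runtime rather than bundled with the package.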

CPU memory keeps increasing, and when it is full an error occurs

Hi, when I train this code with python main.py --train on my local computer, CPU memory keeps increasing, and when it is full an error is reported. I think this problem is caused by running out of memory.

Traceback (most recent call last):
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 190, in _receiver
    data, cnt = conn.recv()
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/nzh_projects/kaggle-environments-master/kaggle_agent/HandyRL/handyrl/connection.py", line 175, in _sender
    conn.send(next(self.send_generator))
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/home/user/.conda/envs/spinningup_nzh/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Because my machine only has 32 GB of memory, I set maximum_episodes to 200000. When the number of samples reached 200,000 the memory usage was already around 95%, and the process kept consuming more CPU memory as sampling continued. My understanding is that once the number of samples exceeds maximum_episodes, the replay buffer should delete the oldest samples, so I don't know why memory keeps increasing.

Could you give me some ideas on how to solve this problem (without restarting)? Thanks very much!
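
For reference, the eviction behaviour described above (old episodes being dropped once maximum_episodes is reached) is what a bounded buffer provides; a minimal sketch of the idea, not HandyRL's actual buffer code:

from collections import deque

maximum_episodes = 200000  # value used in the post above

# A deque with maxlen silently evicts the oldest episode once the cap is reached,
# so the number of stored episodes stays bounded.
episodes = deque(maxlen=maximum_episodes)

def add_episode(episode):
    episodes.append(episode)

Note that even with such a cap, total memory can still grow if individual episodes get larger over time (longer games, bigger observations), because only the episode count is bounded, not the byte size.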

[REQUESTING OVERVIEW OF DISTRIBUTED HANDYRL]

Hello HandyRL Team!

First off, thanks for making such a useful repository for RL! I love it!

I am trying to understand how the distributed architecture of HandyRL works, but due to the lack of documentation it has been difficult to figure out how it is implemented.

I'll give an example (following the Large Scale Training document in the repo):
I have 3 VMs running on GCP (one as the server/learner and two as workers). In the config.yaml file I entered the learner's external IP (the document says an external IP is also valid) in the worker_args parameter for both workers, as per the instructions in the document, and tried to run it. However, nothing seems to happen; in the following output the server appears to keep sleeping and does nothing.

OUTPUT:

xyz@vm1:~/HandyRL$ python3 main.py --train-server
{'env_args': {'env': 'HungryGeese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 32, 'compress_steps': 4, 'entropy_regularization': 0.002, 'entropy_regularization_decay': 0.3, 'update_episodes': 500, 'batch_size': 400, 'minimum_episodes': 1000, 'maximum_episodes': 200000, 'epochs': -1, 'num_batchers': 7, 'eval_rate': 0.1, 'worker': {'num_parallel': 32}, 'lambda': 0.7, 'max_self_play_epoch': 1000, 'policy_target': 'TD', 'value_target': 'TD', 'eval': {'opponent': ['modelbase'], 'weights_path': 'None'}, 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '<EXTERNAL_IP_OF_SERVER_GOES_HERE_FOR_WORKERS>', 'num_parallel': 32}}
Loading environment football failed: No module named 'gfootball'
started batcher 0
started batcher 1
started batcher 2
started batcher 3
started batcher 4
started batcher 5
waiting training
started entry server 9999
started batcher 6
started worker server 9998
started server

I was hoping you could provide some guidance as to how I can proceed. In any case, a documentation or brief but complete background on the distributed architecture would also be appreciated to debug the problem on my own.

Thank you!
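
For reference, the worker-side settings that matter here are the ones already printed in the output above; an illustrative excerpt of a worker's config.yaml (203.0.113.10 is a placeholder for the learner's external IP):

worker_args:
    server_address: '203.0.113.10'
    num_parallel: 32

The workers must be able to reach the learner on the ports shown in the server log (9999 for the entry server and 9998 for the worker server), so those ports need to be open in the learner VM's firewall rules.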

Going further into deep MARL with halite and HandyRL

Hello there!
I've done a bit of work and research on the Kaggle Halite IV environment.

To my knowledge, nobody has ever tackled it with pure self-play deep RL, which makes the task quite interesting. It's understandable why: the number of agents varies during a game, it includes heterogeneous agents (type SHIP and type SHIPYARD), it requires cooperation (between ships and shipyards of the same team), and it involves competition against agents from the other team. These characteristics are close to real-world problems that could be tackled with deep RL.

Characteristics like these do seem to have been tackled successfully by DeepMind and OpenAI (with AlphaStar and OpenAI Five). But their models output only one action per step for all agents, not one action per agent. They manage to succeed because they can take a huge number of actions in a short time, which effectively simulates taking one action per agent. They used action embeddings and a single neural net (so, in the nomenclature of the scientific literature, a centralized neural network with centralized execution; see https://arxiv.org/pdf/1803.11485.pdf for an explanation).

For the Halite IV environment such a way of handling actions is not possible, since a game lasts at most 400 steps and actions must be taken simultaneously. It would mean taking only 400 actions in total, when at each step you could have taken as many actions as the number of agents you control.

So, what can be done? Fortunately, the paper https://arxiv.org/pdf/2005.13625.pdf gives a mathematical proof of a way to handle all this. Their answer: use a single neural net for all your agents; if you have heterogeneous agents, make the agent type understandable to the net; and use action masking to handle the different action spaces of the heterogeneous agents within that single net. (A short sketch of the masking idea follows.)
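
As a concrete illustration of that masking idea (a generic PyTorch sketch, not tied to any particular library):

import torch

def masked_logits(logits, action_mask):
    # logits: (batch, num_actions) raw policy outputs
    # action_mask: (batch, num_actions), 1 for legal actions and 0 for illegal ones
    # Illegal actions get a very large negative logit, so softmax gives them ~0 probability.
    return logits + torch.clamp(torch.log(action_mask), min=-1e9)

# Example: a SHIPYARD-type agent whose only legal actions are the last two slots.
logits = torch.randn(1, 8)
mask = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1]], dtype=torch.float32)
probs = torch.softmax(masked_logits(logits, mask), dim=-1)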

Which brings me to this post. I think it may be possible to use HandyRL with very few tweaks to reach a state-of-the-art agent for Kaggle Halite IV, and I'd like to do it. Why HandyRL? Because it is the most workable distributed policy-gradient library I know of so far. What would the tweaks be? Just supporting a varying number of agents within the same game (agents can create other agents, and agents can die during a game).

I've already done the preprocessing and the net:

import numpy as np
import copy
from kaggle_environments import make
from ray.rllib.env.multi_agent_env import MultiAgentEnv

SHIP_ACTIONS = ["CONVERT", "NORTH", "SOUTH", "EAST", "WEST", None, "SPAWN", None]
SHIPYARD_ACTIONS = ["SPAWN", None]

# For Fully Independent Learning
class HaliteEnv(MultiAgentEnv):
    def __init__(self, env_config):
        self.env = make("halite", debug=True)
        self.previous_alive_agent_ids = set() # it's a set
        self.all_agent_ids = [set(),set()]
        self.previous_ships_halite = {}
    
    def observation(self, obs, agent_rllib_id):
        agent_type, team, agent_id = agent_rllib_id.split("_")
        team = int(team)
        obs = copy.deepcopy(obs)
        
        action_mask = np.zeros(8)
        final_obs_proc = None
        if agent_type == "SHIP":
            action_mask[:6] = 1
            final_obs_proc = np.zeros((11,21,21))
        elif agent_type == "SHIPYARD":
            final_obs_proc = np.zeros((11,21,21))
            action_mask[6:] = 1

        # Halite channel
        halites = obs[0].observation.halite    
        for key, halite in enumerate(halites):
            x = key % 21
            y = key // 21
            final_obs_proc[0,x,y] = halite/500 # normalization !

        players = obs[0].observation.players
        own_team = players[team]
        opponent_team = players[1-team]
        
        x_ref = 10
        y_ref = 10
        
        if(agent_type == "SHIP"):
            own_ships = own_team[2]
            agent = own_ships.pop(agent_id, None)
            
            final_obs_proc[1] = 1
            
            if agent is not None:
                # Own ship channel
                x_ref = agent[0] % 21
                y_ref = agent[0] // 21

                # Own ship halite channel
                final_obs_proc[2] = agent[1] / 500 if agent[1] >= 0 else 0 # We have to normalize it, but how ?

            final_obs_proc[3] = obs[team].reward / 50000 if obs[team].reward >= 0 else 0 # We have to normalize it, but how ?
            final_obs_proc[4] = obs[1-team].reward / 50000 if obs[1-team].reward >= 0 else 0 # We have to normalize it, but how ?
            # Own other ships channel
            # Own ships halite channel
            for own_other_ship_key in own_ships:
                x = own_ships[own_other_ship_key][0] % 21
                y = own_ships[own_other_ship_key][0] // 21

                final_obs_proc[5,x,y] = 1 # We have to normalize it, but how ?
                final_obs_proc[6,x,y] = own_ships[own_other_ship_key][1] / 500 # We have to normalize it, but how ?

            # Own shipyard channel
            own_shipyards = own_team[1]
            for own_shipyard_key in own_shipyards:
                x = own_shipyards[own_shipyard_key] % 21
                y = own_shipyards[own_shipyard_key] // 21

                final_obs_proc[7,x,y] = 1

            # Opponent ships channel
            # Opponent ships halite channel
            opponent_ships = opponent_team[2]
            for opponent_ship_key in opponent_ships:
                x = opponent_ships[opponent_ship_key][0] % 21
                y = opponent_ships[opponent_ship_key][0] // 21

                final_obs_proc[8,x,y] = 1 # We have to normalize it, but how ?
                final_obs_proc[9,x,y] = opponent_ships[opponent_ship_key][1] / 500 # We have to normalize it, but how ?

            # Opponent shipyard channel
            opponent_shipyards = opponent_team[1]
            for opponent_shipyard_key in opponent_shipyards:
                x = opponent_shipyards[opponent_shipyard_key] % 21
                y = opponent_shipyards[opponent_shipyard_key] // 21

                final_obs_proc[10,x,y] = 1
            # Final number of channels: 11

        elif(agent_type == "SHIPYARD"):
            own_shipyards = own_team[1]
            agent = own_shipyards.pop(agent_id, None)
            final_obs_proc[1] = 0.5
            if agent is not None:
                # Own shipyard channel
                x_ref = agent % 21
                y_ref = agent // 21
            
            final_obs_proc[3] = obs[team].reward / 50000 if obs[team].reward >= 0 else 0 # We have to normalize it, but how ?
            final_obs_proc[4] = obs[1-team].reward / 50000 if obs[1-team].reward >= 0 else 0 # We have to normalize it, but how ?
            
            # Own ships channel
            # Own ships halite channel
            own_ships = own_team[2]
            for own_other_ship_key in own_ships:
                x = own_ships[own_other_ship_key][0] % 21
                y = own_ships[own_other_ship_key][0] // 21

                final_obs_proc[5,x,y] = 1 # We have to normalize it, but how ?
                final_obs_proc[6,x,y] = own_ships[own_other_ship_key][1] / 500 # We have to normalize it, but how ?

            # Own shipyard channel
            for own_shipyard_key in own_shipyards:
                x = own_shipyards[own_shipyard_key] % 21
                y = own_shipyards[own_shipyard_key] // 21

                final_obs_proc[7,x,y] = 1

            # Opponent ships channel
            # Opponent ships halite channel
            opponent_ships = opponent_team[2]
            for opponent_ship_key in opponent_ships:
                x = opponent_ships[opponent_ship_key][0] % 21
                y = opponent_ships[opponent_ship_key][0] // 21

                final_obs_proc[8,x,y] = 1 # We have to normalize it, but how ?
                final_obs_proc[9,x,y] = opponent_ships[opponent_ship_key][1] / 500 # We have to normalize it, but how ?

            # Opponent shipyard channel
            opponent_shipyards = opponent_team[1]
            for opponent_shipyard_key in opponent_shipyards:
                x = opponent_shipyards[opponent_shipyard_key] % 21
                y = opponent_shipyards[opponent_shipyard_key] // 21

                final_obs_proc[10,x,y] = 1
        
        final_obs_proc = np.roll(final_obs_proc, 10-x_ref,2)
        final_obs_proc = np.roll(final_obs_proc, 10-y_ref,1)
        
        return {"obs":final_obs_proc, "action_mask":action_mask}
    
    def reset(self):
        obs = self.env.reset(2)
        players = obs[0].observation.players
        return_obs = {}
        
        ship1_name = list(players[0][2].keys())[0]
        ship2_name = list(players[1][2].keys())[0]
        agent1_name = "SHIP_0_"+ship1_name
        agent2_name = "SHIP_1_"+ship2_name
        
        # Must add halite number of both teams (normalized!)
        return_obs[agent1_name] = self.observation(obs, agent1_name)
        return_obs[agent2_name] = self.observation(obs, agent2_name)
        
        self.previous_alive_agent_ids = set((agent1_name, agent2_name)) # it's a set
        self.all_agent_ids = [set((agent1_name,)),set((agent2_name,))]
        self.previous_ships_halite = {
            agent1_name:0,
            agent2_name:0
        }
        return return_obs
    
    def rllib_action_dict_to_halite(self, action_dict):
        final_action = [{},{}]
        
        for rllib_agent_id in action_dict:
            action = action_dict[rllib_agent_id]
            agent_type, team, agent_id = rllib_agent_id.split("_")
            team = int(team)
            
            converted_action = None
            """if agent_type == "SHIP":
                converted_action = SHIP_ACTIONS[action]
            elif agent_type == "SHIPYARD":
                converted_action = SHIPYARD_ACTIONS[action]"""
            
            converted_action = SHIP_ACTIONS[action]
            
            if converted_action is None:
                continue
                
            final_action[team][agent_id] = converted_action
        
        return final_action
    
    def extract_alive_agents_for_rllib(self,obs):
        # build set of alive agents
        agent_ids_list = [[],[]]
        players = obs[0].observation.players
        for team, player in enumerate(players):
            shipyards = player[1]
            ships = player[2]
            for shipyard in shipyards:
                agent_ids_list[team].append("SHIPYARD_"+str(team)+"_"+shipyard)
            
            for ship in ships:
                agent_ids_list[team].append("SHIP_"+str(team)+"_"+ship)
        
        team1_agent_ids = set(agent_ids_list[0])
        team2_agent_ids = set(agent_ids_list[1])
        
        self.all_agent_ids[0] = set(list(self.all_agent_ids[0])+list(team1_agent_ids)) # for the final reward
        self.all_agent_ids[1] = set(list(self.all_agent_ids[1])+list(team2_agent_ids)) # for the final reward
        
        alive_agent_ids = set(list(team1_agent_ids)+list(team2_agent_ids))
        
        return alive_agent_ids
    
    def outcome(self, obs, dones):
        # return terminal outcomes
        # two-player game: winner gets +1.0, loser gets -1.0
        team1_reward = 1 if obs[0].reward > obs[1].reward else -1
        team2_reward = 1 if obs[0].reward < obs[1].reward else -1
        
        outcomes = {}
        
        for agent_rllib_id in dones:
            agent_type, team, agent_id = agent_rllib_id.split("_")
            team = int(team)
            
            if team == 0:
                outcomes[agent_rllib_id] = team1_reward
            elif team == 1:
                outcomes[agent_rllib_id] = team2_reward
                
        return outcomes
    
    def get_rewards(self,obs,dones):
        if not self.env.done:
            return {agent_id: 0 for agent_id in dones}
        else:
            return self.outcome(obs, dones)
    
    def step(self, action_dict):
        actions = self.rllib_action_dict_to_halite(action_dict)
        obs = self.env.step(actions)
        
        # We have to terminate status of dead ships and dead shipyards
        # Thus we need previous list of ids : self.previous_ids
        alive_agent_ids = self.extract_alive_agents_for_rllib(obs)
        # We can build the done space now !
        # Here we handle dead or still alive agents
        dones = {agent_id: False if agent_id in alive_agent_ids else True for agent_id in self.previous_alive_agent_ids}
        self.previous_alive_agent_ids = alive_agent_ids
        # Here we handle new agents
        for alive_agent_id in alive_agent_ids:
            if alive_agent_id not in dones:
                dones[alive_agent_id] = False
        
        # Now we have to build the returned observation space
        rllib_obs = {agent_id:self.observation(obs, agent_id) for agent_id in dones}
        
        # And finally the reward space !
        rewards = self.get_rewards(obs, dones)
        
        dones["__all__"] = self.env.done
        
        return rllib_obs, rewards, dones, {}

import torch
import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

class TorusConv2d(nn.Module):
    def __init__(self, input_dim, output_dim, kernel_size, bn):
        super(TorusConv2d, self).__init__()
        self.edge_size = (kernel_size[0] // 2, kernel_size[1] // 2)
        self.conv = nn.Conv2d(input_dim, output_dim, kernel_size=kernel_size)
        self.bn = nn.BatchNorm2d(output_dim) if bn else None

    def forward(self, x):
        h = torch.cat([x[:,:,:,-self.edge_size[1]:], x, x[:,:,:,:self.edge_size[1]]], dim=3)
        h = torch.cat([h[:,:,-self.edge_size[0]:], h, h[:,:,:self.edge_size[0]]], dim=2)
        h = self.conv(h)
        h = self.bn(h) if self.bn is not None else h
        return h

from ray.rllib.utils.torch_ops import FLOAT_MIN, FLOAT_MAX

class HaliteShipNet(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        nn.Module.__init__(self)
        super(HaliteShipNet, self).__init__(obs_space, action_space, None,
                                                 model_config, name)
        
        layers, filters = 12, 32
        self.relu = nn.ReLU()
        self.conv0 = TorusConv2d(11, filters, (3, 3), True)
        self.blocks = nn.ModuleList([TorusConv2d(filters, filters, (3, 3), True) for _ in range(layers)])
        
        self.head_p = nn.Sequential(
            nn.Conv2d(in_channels=filters,out_channels=2,kernel_size=1),
            nn.BatchNorm2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2*21*21, 8, bias=False)
        )
        self.head_v = nn.Sequential(
            nn.Conv2d(in_channels=filters,out_channels=1,kernel_size=1),
            nn.BatchNorm2d(1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(21*21, 256),
            nn.ReLU(),
            nn.Linear(256, 1, bias=False)
        )
        
        self._head_features = None
        self._avg_features = None
        
    def forward(self, input_dict, state, seq_lens):
        x = input_dict['obs']['obs'].to(torch.float32)
        action_mask = input_dict['obs']['action_mask']
        h = self.relu(self.conv0(x))
        for block in self.blocks:
            h = self.relu(h + block(h))
        self._head_features = (h * x[:,:1])
        self._avg_features = h
        
        inf_mask = torch.clamp(torch.log(action_mask), FLOAT_MIN, FLOAT_MAX)
                
        return self.head_p(self._head_features) + inf_mask, []
    
    def value_function(self):
        assert self._head_features is not None and self._avg_features is not None, "must call forward() first"
        
        return torch.tanh(self.head_v(self._head_features)).squeeze(1)
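
For completeness, a hedged sketch of how a custom model like this is usually wired into RLlib (assuming an older Ray release, consistent with the ray.rllib.utils.torch_ops import above; the registered name and the config fragment are illustrative):

from ray.rllib.models import ModelCatalog

# Register the model under a name RLlib can look up from the config.
ModelCatalog.register_custom_model("halite_ship_net", HaliteShipNet)

# Config fragment referencing the registered model; env registration,
# observation/action spaces, and any multi-agent mapping are assumed elsewhere.
config = {
    "framework": "torch",
    "model": {"custom_model": "halite_ship_net"},
}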

I've just adapted your hungry geese example to everything I said above. This works fine with RLlib A2C, but it does not exploit distribution well (an A2C problem plus library issues). I'd like to run it on HandyRL because I think I could exploit distribution much better there. I haven't achieved convergence with A2C and RLlib yet: in 12 hours it only played 20,000 games, and I think it would need far more to converge. That's why I'd like to try it with HandyRL.
My adaptation of the hungry geese preprocessing might not be the best, and the same goes for the net.
Thus my two questions:

  1. Could you give me insights about how to adapt HandyRL to this?
  2. Could you give me insights about the preprocessing and the network?

And one more question: do you think convergence could be achieved with all of this?

P.S.: my email address if you want to reach out in private: [email protected]
P.P.S.: to handle the two heterogeneous agent types (SHIP and SHIPYARD) I've just added a channel filled with 1 for SHIP and 0.5 for SHIPYARD.
P.P.P.S.: Halite IV rules: https://www.kaggle.com/c/halite-iv-playground-edition/overview/halite-rules

How is multi-agent handled?

Hello there, I was wondering how you handle multi-agent learning. Let's take the supported Kaggle hungry-geese environment as an example. There is a function in the environment class:

rule_based_action()

If it exists, does it mean that the trained policy controls one player and the other three are rule-based, using this function?

HungryGeese environment, too many open files in Ubuntu 18.04

Could you please advise on the Too many open files error I get when running training?
The error messages that relate to the code:

File "/home/HandyRL/handyrl/worker.py", line 84, in open_worker
worker.run()
File "/home/HandyRL/handyrl/worker.py", line 64, in run
model_pool = self._gather_models(model_ids)
File "/home/HandyRL/handyrl/worker.py", line 50, in _gather_models
model_pool[model_id] = send_recv(self.conn, ('model', model_id))
 File "/home/HandyRL/handyrl/connection.py", line 17, in send_recv
 rdata = conn.recv()

config.yaml:

env_args:
    env: 'HungryGeese'
    source: 'handyrl.envs.kaggle.hungry_geese'

train_args:
    turn_based_training: True
    observation: False
    gamma: 0.8
    forward_steps: 16
    compress_steps: 4
    entropy_regularization: 1.0e-1
    entropy_regularization_decay: 0.1
    update_episodes: 200
    batch_size: 128
    minimum_episodes: 400
    maximum_episodes: 100000
    num_batchers: 2
    eval_rate: 0.1
    worker:
        num_parallel: 4
    lambda: 0.7
    policy_target: 'TD' # 'UPGO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 42
    restart_epoch: 0


worker_args:
    server_address: ''
    num_parallel: 4

One GPU (GeForce RTX 2080 Ti).
CUDA Version: 11.0
Anaconda environment with Python 3.7.9
Ubuntu 18.04
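
A generic mitigation sketch (not HandyRL-specific) is to check and, where the hard limit allows, raise the per-process open-file limit before starting training; on Linux this can be done from Python with the standard resource module:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit:", soft, hard)

# Raising the soft limit up to the existing hard limit needs no special privileges.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

Whether this is enough depends on how many connections and shared tensors the run actually creates.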

Multiple Ctrl+C is needed to finish every process and thread in the early phase

This has been bothering me a little for a long time: most of the time, when I try to stop training with Ctrl+C early in the run, it doesn't stop at once.

It seems that it's stopping at the following Queue.get():

^CTraceback (most recent call last):
  File "/Users/ohto/myrepo/HandyRL/handyrl/train.py", line 644, in run
    self.server()
  File "/Users/ohto/myrepo/HandyRL/handyrl/train.py", line 568, in server
    conn, (req, data) = self.worker.recv()
  File "/Users/ohto/myrepo/HandyRL/handyrl/connection.py", line 221, in recv
    return self.input_queue.get()
  File "/usr/local/Cellar/[email protected]/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/queue.py", line 171, in get
    self.not_empty.wait()
  File "/usr/local/Cellar/[email protected]/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt

However, setting a timeout did not change this.
develop...YuriCat:experiment/ctrl_c_queue_timeout

How to make model consider immediate reward ?

I want to use the immediate reward from the environment to train my RL model. As described in the documentation, I implemented the "reward" function in the "Environment" class.

However, when I checked the loss-calculation flow in train.py, losses['v'] seems to consider only the value output by the model and the outcome from the environment. I also found that losses['r'] takes the rewards from the environment into account.

Does this mean that my model also needs to output a "return" value?

Training stops at epoch 690

Thank you for the nice library!

I was training with this, but training stopped at epoch 690.
I don't know why this happened.

When that happened, I killed the job.
Then, in order to continue training, I changed train_args: restart_epoch in config.yaml as below.

train_args:
    restart_epoch: 690

Is this the right method?

Thanks

Large scale training

I use HandyRL very intensively and would like to take advantage of the server/client functionality to optimize my costs on Vast.ai. My machines are not on the same local network, and I am asking whether it is still possible for them to connect to each other. On the Vast.ai side it is possible to open ports (9999 and 9998) on the VMs, and you have access to the public IP of each machine. I tried, but my server does not receive client requests, and I don't know whether it is a problem with HandyRL or with my machines.
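
A quick way to check whether this is a network-reachability problem rather than a HandyRL problem (a generic sketch; 203.0.113.10 is a placeholder for the server's public IP):

import socket

SERVER_IP = "203.0.113.10"  # placeholder for the learner's public IP

for port in (9999, 9998):  # the entry-server and worker-server ports
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(3)
        try:
            s.connect((SERVER_IP, port))
            print(f"port {port}: reachable")
        except OSError as exc:
            print(f"port {port}: not reachable ({exc})")

If these connections fail from the worker machines, the issue lies in port forwarding or firewalling rather than in HandyRL.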

Customize learning rate

Hello there! I'd like to customize the learning rate in HandyRL, but I don't see where to do it. In imitation learning my model converges well with a learning rate of 0.000025, so I'd like to set a similar LR in HandyRL.

Examples of config.yaml for supported games

Would you mind sharing sample settings for all games supported by your excellent framework?

There is already a config.yaml for TicTacToe. It would be great to have similar files for the other games (config.geister.yaml, config.geese.yaml, or a commented section in the existing config). I am not sure who your target audience is, but to a newcomer not all parameters are obvious.
Any comments in the files would be much appreciated as well.

num_parallel affecting learning results

Hi, I've tried training on a 32-core machine, so naturally I set num_parallel to 32. However, the model does not seem to learn at all. Oddly, when I set num_parallel to 6, the model learns.
The rest of the config is exactly the same as the PubHRL config for Hungry Geese.

[Question] [Documentation]

Could you provide a detailed example for simple RL problems like the ubiquitous cart-pole?
This would also help improve the documentation.

Best regards
Alessandro

Unexpected results inference time

Hi, I came from Kaggle. First of all, amazing work!
I was trying this for the Kaggle competition Hungry Geese; it trains fine with these params:

{'env_args': {'env': 'HungryGeese', 'source': 'handyrl.envs.kaggle.hungry_geese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 16, 'compress_steps': 4, 'entropy_regularization': 0.1, 'entropy_regularization_decay': 0.1, 'update_episodes': 200, 'batch_size': 128, 'minimum_episodes': 400, 'maximum_episodes': 100000, 'num_batchers': 2, 'eval_rate': 0.1, 'worker': {'num_parallel': 6}, 'lambda': 0.7, 'policy_target': 'TD', 'value_target': 'TD', 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '', 'num_parallel': 8}}

Partway through training I picked a model from one epoch, say 17.pth.
Now, when trying it in the Kaggle environment (playing against other agents):

current_score = evaluate(
                "hungry_geese", 
                [
                    agents[ind_1], # HandyRL
                    agents[ind_2], 
                    "simple_toward.py", 
                    "simple_toward.py",
                ],
                num_episodes=100,
            )

The output of current_score is:

[[601, None, 2704, 2604], [1503, None, 801, 1501], [501, None, 502, 502], [301, None, 2606, 2703], [901, None, 901, 903], [401, None, 1405, 1403], [601, None, 1403, 1502], [2804, None, 401, 2804], [501, None, 501, 601], [501, None, 402, 401], [702, None, 702, 702], [602, None, 502, 703], [501, None, 502, 601], [401, None, 801, 802], [501, None, 703, 701], [1002, None, 901, 501], [403, None, 401, 402], [402, None, 301, 301], [1602, None, 1602, 1202], [1301, None, 1303, 1301], [602, None, 602, 601], [1701, None, 1702, 1704], [301, None, 301, 402], [501, None, 2805, 2903], [301, None, 302, 401], [301, None, 401, 301], [0, None, 0, 201], [1702, None, 0, 1805], [1602, None, 301, 1704], [3207, None, 3205, 1801], [402, None, 301, 301], [1503, None, 501, 1502], [1202, None, 1202, 1304], [501, None, 601, 501], [902, None, 902, 901], [401, None, 401, 502], [901, None, 2304, 2202], [401, None, 301, 301], [402, None, 701, 802], [901, None, 301, 1003], [301, None, 202, 0], [702, None, 601, 601], [601, None, 1803, 1902], [301, None, 301, 402], [703, None, 301, 702], [2603, None, 602, 2704], [1301, None, 1202, 1201], [1101, None, 401, 1204], [602, None, 602, 301], [601, None, 1103, 1203], [401, None, 502, 301], [801, None, 802, 501], [601, None, 1904, 2002], [301, None, 201, 403], [403, None, 301, 301], [902, None, 901, 902], [501, None, 401, 401], [1902, None, 601, 1905], [502, None, 401, 301], [602, None, 601, 602], [2804, None, 2703, 202], [601, None, 902, 902], [3404, None, 3302, 1301], [301, None, 901, 902], [501, None, 2504, 2603], [2302, None, 2403, 801], [302, None, 502, 401], [401, None, 1803, 1704], [1304, None, 1302, 401], [201, None, 301, 202], [501, None, 501, 602], [201, None, 201, 301], [702, None, 602, 601], [401, None, 1502, 1503], [3004, None, 201, 2902], [803, None, 2102, 2103], [201, None, 302, 201], [201, None, 201, 301], [2003, None, 2203, 2103], [501, None, 501, 603], [503, None, 2002, 1903], [501, None, 502, 501], [1402, None, 401, 1403], [1302, None, 1403, 601], [2204, None, 501, 2203], [301, None, 401, 301], [1403, None, 1403, 1401], [301, None, 402, 301], [701, None, 1602, 1502], [2103, None, 1001, 2105], [401, None, 602, 501], [602, None, 501, 702], [402, None, 301, 301], [402, None, 402, 401], [601, None, 702, 602], [401, None, 401, 402], [701, None, 802, 802], [201, None, 202, 201], [702, None, 501, 601], [501, None, 501, 602]]

The second element of every entry is surprisingly None, which is quite strange.
What do you think? What went wrong?

Thanks

Kaggle GPUs

Hi, thanks for the great work.
How can I make the training process faster using Kaggle GPUs in HandyRL?

How do I know when I've reached an optimum while training

For example, when training tic-tac-toe, is the optimum reached when the win rate == 0.50? So far my win rate is always above 0.50. I haven't used the evaluate function yet because I feel like the win_rate printed after every epoch is already an evaluation.

How to avoid `OSError: [Errno 24] Too many open files`

Hi, I tried your code on a Linux server and found that OSError: [Errno 24] Too many open files may occur if there are too many workers.

If someone else runs into this error, add the following code somewhere in your code to avoid it:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
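
For background, PyTorch's default 'file_descriptor' sharing strategy keeps a file descriptor open for every tensor storage shared between processes, so runs with many workers and many shared tensors can exhaust the per-process descriptor limit; the 'file_system' strategy instead identifies shared-memory regions by file name, which avoids holding those descriptors open (at the cost of relying on the shared-memory files being cleaned up).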
