starry-sky6688 / marl-algorithms

Implementations of IQL, QMIX, VDN, COMA, QTRAN, MAVEN, CommNet, DyMA-CL, and G2ANet on SMAC, the decentralised micromanagement scenario of StarCraft II

multi-agent-reinforcement-learning deep-reinforcement-learning reinforcement-learning

marl-algorithms's Introduction

StarCraft

PyTorch implementations of multi-agent reinforcement learning algorithms, including IQL, QMIX, VDN, COMA, QTRAN (both QTRAN-base and QTRAN-alt), MAVEN, CommNet, DyMA-CL, and G2ANet, which are among the state-of-the-art MARL algorithms. Because CommNet and G2ANet need an external training algorithm, we provide Central-V and REINFORCE to train them; you can also combine them with COMA. We trained these algorithms on SMAC, the decentralised micromanagement scenario of StarCraft II.

Corresponding Papers

Requirements

Use pip install -r requirements.txt to install the requirements.

Acknowledgement

TODO List

  • Add CUDA option
  • DyMA-CL
  • G2ANet
  • MAVEN
  • VBC
  • Other SOTA MARL algorithms
  • Update results on other maps

Quick Start

$ python main.py --map=3m --alg=qmix

Run main.py directly and the algorithm will start training on map 3m. Note that CommNet and G2ANet need an external training algorithm, so their names are written as reinforce+commnet or central_v+g2anet. All the algorithms we provide are listed in ./common/arguments.py.

If you just want to use this project for a demonstration, set --evaluate=True --load_model=True, as in the example below.
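For example, to train CommNet with the REINFORCE trainer, or to run a saved QMIX model as a demonstration:

$ python main.py --map=3m --alg=reinforce+commnet
$ python main.py --map=3m --alg=qmix --evaluate=True --load_model=True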

DyMA-CL runs independently of the other algorithms because it requires different environment settings, so we put it in another project. For more details, please read the DyMA-CL documentation.

Result

We train each algorithm independently 8 times and report the mean of the 8 runs, evaluating for 20 episodes every 100 training steps. All of the results are saved in ./result. Results on other maps are still training; we will update them later.

1. Mean Win Rate of 8 Independent Runs on 3m --difficulty=7 (Very Hard)

2. Mean Win Rate of 8 Independent Runs on 8m --difficulty=7 (Very Hard)

3. Mean Win Rate of 8 Independent Runs on 2s3z --difficulty=7 (Very Hard)

Replay

If you want to watch a replay, make sure replay_dir is an absolute path; it can be set in ./common/arguments.py. The replay of each evaluation will then be saved, and you can find it under that path.
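A minimal sketch, assuming replay_dir is exposed as an ordinary command-line argument in ./common/arguments.py (otherwise edit its default value in that file directly); the path below is only a placeholder:

$ python main.py --map=3m --alg=qmix --evaluate=True --load_model=True --replay_dir=/absolute/path/to/replays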

marl-algorithms's People

Contributors

harnvo, starry-sky6688


marl-algorithms's Issues

A question

This line in the runner:
mini_batch = self.buffer.sample(min(self.buffer.current_size, self.args.batch_size))
At the very beginning, when the replay buffer contains only one episode, doesn't that mean only that single episode can be sampled for training?
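For reference, a minimal sketch of what an episode replay buffer with that kind of sample() call looks like; the class below is illustrative, not the project's exact code:

import numpy as np

class EpisodeBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.current_size = 0            # number of episodes stored so far
        self._ptr = 0                    # next write position
        self.episodes = [None] * capacity

    def store(self, episode):
        self.episodes[self._ptr] = episode
        self._ptr = (self._ptr + 1) % self.capacity
        self.current_size = min(self.current_size + 1, self.capacity)

    def sample(self, batch_size):
        # Early in training current_size may be 1, so the caller's
        # min(current_size, batch_size) makes the mini-batch that small.
        idxs = np.random.randint(0, self.current_size, size=batch_size)
        return [self.episodes[i] for i in idxs]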

Replays cannot be saved in evaluate mode

First of all, thanks for generously sharing this excellent code!!!

Hi, I ran the whole project with the command line below, just to see the demo directly:
python main.py --map=3m --alg=qmix --evaluate=True --load_model=True
I also configured the replay folder in arguments.py. The run finished with normal output, but no replay file was generated.

I then looked through this project's closed issues and found you saying that the project does not support replays. Following the SMAC documentation, which says calling save_replay() is enough, I noticed that your rollout.py already calls that function. I still cannot figure out how to obtain the replay file, so I am asking here:
how should I modify the code to obtain the replay file in evaluate mode?

I am a beginner in cooperative multi-agent DRL; apologies for the trouble.

Available actions in the replay buffer

Hi,

I'd like to ask what the available-action field in the replay buffer is for. A typical buffer does not seem to have this entry, and I am not very familiar with the StarCraft environment. Any guidance would be appreciated.
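For context, a minimal sketch of why available actions are usually stored with each transition (illustrative tensors, not this repo's code): the mask recorded at collection time is reused at training time so unavailable actions can never win the max/argmax.

import torch

q_values = torch.randn(2, 5, 9)                   # (batch, n_agents, n_actions), dummy values
avail_actions = torch.randint(0, 2, (2, 5, 9))    # 1 = executable at that step, 0 = unavailable

masked_q = q_values.clone()
masked_q[avail_actions == 0] = -1e10              # forbidden actions get a huge negative value
greedy_actions = masked_q.argmax(dim=-1)          # e.g. used for the target max in QMIX/VDN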

About the action error term in VDN

Hi. In DQN, only the value of the chosen action is backpropagated; the others are not needed, so the Q-values of the other actions are set to 0 and only the error of the executed action is propagated back as the update signal. But in the VDN implementation, the error becomes a single value of shape (episode, step, agent) rather than a vector over actions of shape (episode, step, agent, action). It seems this cannot update the Q-value of the action the agent actually took at that time. I cannot understand why it is implemented this way and hope you can help. Thanks!
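For reference, a minimal sketch of how value-based implementations usually keep the error tied to the executed action even though the stored error has no action dimension (illustrative, not the project's exact code): torch.gather selects the executed action's Q-value, so gradients only flow into that entry.

import torch

q_evals = torch.randn(2, 10, 3, 9, requires_grad=True)      # (episode, step, agent, action)
u = torch.randint(0, 9, (2, 10, 3, 1))                       # executed actions

q_taken = torch.gather(q_evals, dim=3, index=u).squeeze(3)   # (episode, step, agent)
q_total = q_taken.sum(dim=2)                                 # VDN: sum the chosen-action Q-values over agents
# A TD error built from q_total is one number per (episode, step),
# but its gradient reaches only the entries of q_evals picked out by gather.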

Batch size

Hi,

What is a good way to change the batch size for better GPU utilization? Is it the n_episodes parameter? I am using reinforce+g2anet with the 3m map. Thanks

About multi-agent environments

In StarCraft the multi-agent environment is already provided. If an environment is not provided, for example HFO (Half Field Offense soccer), how should it be handled?

Question about the training reward

Hi, a quick question!
In your experiments, the reward during training is the default one (i.e., given based on both sides' health, deaths, and so on), right?
Have you tried training with a sparse-reward environment (i.e., a reward given only at the end of an episode based on win/loss)?
Sorry to bother you~

Environment error after training for a while

Hi, have you encountered this problem after training for a while? How did you solve it? I could not find an answer online; could you please take a look at whether you have seen it before?
[screenshots attached]

Unable to parse websocket frame, CloseHandler: 127.0.0.1:45512 disconnected

Hi, after training finishes the following appears:
DataHandler: unable to parse websocket frame.
CloseHandler: 127.0.0.1:52886 disconnected
ResponseThread: No connection, dropping the response.
How can this be solved?

@clb-Lenovo-Rescuer-15ISK:~/桌面/11/StarCraft-master$ python main.py --evaluate_epoch=100 --map=3m --alg=qmix
Successfully load the model: ./model/qmix/3m/rnn_net_params.pkl and ./model/qmix/3m/qmix_net_params.pkl
Version: B75689 (SC2.4.10)
Build: Aug 12 2019 17:16:57
Command Line: '"/home/clb/StarCraftII/Versions/Base75689/SC2_x64" -listen 127.0.0.1 -port 24779 -dataDir /home/clb/StarCraftII/ -tempDir /tmp/sc-oj7qxk50/ -eglpath libEGL.so'
Starting up...
Startup Phase 1 complete
Startup Phase 2 complete
Attempting to initialize EGL from file libEGL.so ...
Successfully loaded EGL library!
Successfully initialized display on device idx: 0, EGL version: 1.5

Running CGLSimpleDevice::HALInit...
Calling glGetString: 0x7fe81ee5cfe0
Version: 4.6.0 NVIDIA 440.82
Vendor: NVIDIA Corporation
Renderer: GeForce GTX 960M/PCIe/SSE2
OpenGL initialized!
Listening on: 127.0.0.1:24779
Startup Phase 3 complete. Ready for commands.
ConnectHandler: Request from 127.0.0.1:52886 accepted
ReadyHandler: 127.0.0.1:52886 ready
Requesting to join a single player game
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface disabled
Configure: score interface disabled
Configure: render interface disabled
Launching next game.
Next launch phase started: 2
Next launch phase started: 3
Next launch phase started: 4
Next launch phase started: 5
Next launch phase started: 6
Next launch phase started: 7
Next launch phase started: 8
Game has started.
Using default stable ids, none found at: /home/clb/StarCraftII/stableid.json
Successfully loaded stable ids: GameData\stableid.json
Sending ResponseJoinGame
Epoch 0: defeat
Epoch 1: win
Epoch 2: win
Epoch 3: win
Epoch 4: win
Epoch 5: win
Epoch 6: win
Epoch 7: win
Epoch 8: win
Epoch 9: defeat
Epoch 10: win
Epoch 11: win
Epoch 12: win
Epoch 13: win
Epoch 14: win
Epoch 15: win
Epoch 16: win
Epoch 17: defeat
Epoch 18: win
Epoch 19: win
Epoch 20: win
Epoch 21: win
Epoch 22: defeat
Epoch 23: win
Epoch 24: win
Epoch 25: win
Epoch 26: win
Epoch 27: win
Epoch 28: win
Epoch 29: win
Epoch 30: win
Epoch 31: win
Epoch 32: win
Epoch 33: win
Epoch 34: win
Epoch 35: win
Epoch 36: win
Epoch 37: win
Epoch 38: win
Epoch 39: win
Epoch 40: win
Epoch 41: win
Epoch 42: win
Epoch 43: win
Epoch 44: win
Epoch 45: win
Epoch 46: win
Epoch 47: win
Epoch 48: win
Epoch 49: win
Epoch 50: win
Epoch 51: win
Epoch 52: win
Epoch 53: defeat
Epoch 54: win
Epoch 55: win
Epoch 56: win
Epoch 57: win
Epoch 58: win
Epoch 59: win
Epoch 60: win
Epoch 61: win
Epoch 62: defeat
Epoch 63: win
Epoch 64: win
Epoch 65: win
Epoch 66: win
Epoch 67: win
Epoch 68: win
Epoch 69: defeat
Epoch 70: win
Epoch 71: win
Epoch 72: win
Epoch 73: defeat
Epoch 74: win
Epoch 75: win
Epoch 76: win
Epoch 77: win
Epoch 78: win
Epoch 79: win
Epoch 80: win
Epoch 81: win
Epoch 82: win
Epoch 83: win
Epoch 84: win
Epoch 85: win
Epoch 86: win
Epoch 87: win
Epoch 88: win
Epoch 89: win
Epoch 90: win
Epoch 91: win
Epoch 92: win
Epoch 93: win
Epoch 94: win
Epoch 95: win
Epoch 96: win
Epoch 97: win
Epoch 98: win
Epoch 99: win
RequestQuit command received.
Closing Application...
DataHandler: unable to parse websocket frame.
CloseHandler: 127.0.0.1:52886 disconnected
ResponseThread: No connection, dropping the response.
The win rate of qmix is 0.92

Error when training reinforce+commnet on Windows 10, similar to issue #44

Sorry to bother you. torch is the latest version, 1.8, and the error occurs when training with: python main.py --map=5m_vs_6m --alg=reinforce+commnet
In #44 you mentioned upgrading the version and modifying the generate_episode() function; I tried both without success, and googling the error did not help either, so I am asking whether you have a solution.
Training on Colab succeeds, however (Ubuntu 18 and PyTorch 1.8), so I suspect the problem may be Windows-specific.

Traceback (most recent call last):
  File "main.py", line 34, in <module>
    runner.run(i)
  File "D:\StarCraft_pymarl\runner.py", line 60, in run
    self.agents.train(episode_batch, train_steps, self.rolloutWorker.epsilon)
  File "D:\StarCraft_pymarl\agent\agent.py", line 211, in train
    self.policy.learn(batch, max_episode_len, train_step, epsilon)
  File "D:\StarCraft_pymarl\policy\reinforce.py", line 64, in learn
    batch[key] = torch.tensor(batch[key], dtype=torch.long)
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Question about training status

When I change the difficulty of the map '8m', why do the reward and win rate of the training process become zero?

train_hyper-parameter:
map 8m
map_difficulty 4
epoch 20000
replay_buffer 4000
learning_rate 0.0005
epoch: 0 avg_reward: 0.020390625 avg_winrate: 0.0 test_avg_reward 4.3765625 test_avg_winrate 0.0
epoch: 100 avg_reward: 1.8874609375 avg_winrate: 0.0 test_avg_reward 17.509375 test_avg_winrate 0.7
epoch: 200 avg_reward: 1.9575390625 avg_winrate: 0.0 test_avg_reward 15.9609375 test_avg_winrate 0.55
epoch: 300 avg_reward: 2.0929296875 avg_winrate: 0.0 test_avg_reward 17.3375 test_avg_winrate 0.7
epoch: 400 avg_reward: 2.187265625 avg_winrate: 0.0 test_avg_reward 16.646875 test_avg_winrate 0.6
epoch: 500 avg_reward: 2.3000390625 avg_winrate: 0.0 test_avg_reward 15.15625 test_avg_winrate 0.45
epoch: 600 avg_reward: 2.377578125 avg_winrate: 0.0 test_avg_reward 14.1015625 test_avg_winrate 0.3
epoch: 700 avg_reward: 2.59359375 avg_winrate: 0.0 test_avg_reward 16.3703125 test_avg_winrate 0.55
epoch: 800 avg_reward: 2.675390625 avg_winrate: 0.0 test_avg_reward 11.3515625 test_avg_winrate 0.05
epoch: 900 avg_reward: 2.4835546875 avg_winrate: 0.0 test_avg_reward 6.2828125 test_avg_winrate 0.0
epoch: 1000 avg_reward: 2.2533203125 avg_winrate: 0.0 test_avg_reward 4.5921875 test_avg_winrate 0.0
epoch: 1100 avg_reward: 2.1725390625 avg_winrate: 0.0 test_avg_reward 5.3140625 test_avg_winrate 0.0
epoch: 1200 avg_reward: 2.209765625 avg_winrate: 0.0 test_avg_reward 6.115625 test_avg_winrate 0.0
epoch: 1300 avg_reward: 2.2476953125 avg_winrate: 0.0 test_avg_reward 3.9359375 test_avg_winrate 0.0
epoch: 1400 avg_reward: 2.22125 avg_winrate: 0.0 test_avg_reward 3.5078125 test_avg_winrate 0.0
epoch: 1500 avg_reward: 2.1265234375 avg_winrate: 0.0 test_avg_reward 3.2515625 test_avg_winrate 0.0
epoch: 1600 avg_reward: 2.09265625 avg_winrate: 0.0 test_avg_reward 3.365625 test_avg_winrate 0.0
epoch: 1700 avg_reward: 2.0511328125 avg_winrate: 0.0 test_avg_reward 3.478125 test_avg_winrate 0.0
epoch: 1800 avg_reward: 2.0827734375 avg_winrate: 0.0 test_avg_reward 3.1734375 test_avg_winrate 0.0
epoch: 1900 avg_reward: 2.1476171875 avg_winrate: 0.0 test_avg_reward 2.9421875 test_avg_winrate 0.0
epoch: 2000 avg_reward: 2.2252734375 avg_winrate: 0.0 test_avg_reward 3.19375 test_avg_winrate 0.0
epoch: 2100 avg_reward: 2.254609375 avg_winrate: 0.0 test_avg_reward 2.66875 test_avg_winrate 0.0
epoch: 2200 avg_reward: 2.1873828125 avg_winrate: 0.0 test_avg_reward 2.2515625 test_avg_winrate 0.0
epoch: 2300 avg_reward: 2.186484375 avg_winrate: 0.0 test_avg_reward 2.3671875 test_avg_winrate 0.0
epoch: 2400 avg_reward: 2.25515625 avg_winrate: 0.0 test_avg_reward 2.2703125 test_avg_winrate 0.0
epoch: 2500 avg_reward: 2.1551953125 avg_winrate: 0.0 test_avg_reward 1.6890625 test_avg_winrate 0.0
epoch: 2600 avg_reward: 2.086796875 avg_winrate: 0.0 test_avg_reward 1.9234375 test_avg_winrate 0.0
epoch: 2700 avg_reward: 2.277734375 avg_winrate: 0.0 test_avg_reward 4.878125 test_avg_winrate 0.0
epoch: 2800 avg_reward: 2.7228515625 avg_winrate: 0.0 test_avg_reward 4.096875 test_avg_winrate 0.0
epoch: 2900 avg_reward: 2.968671875 avg_winrate: 0.0 test_avg_reward 3.5453125 test_avg_winrate 0.0
epoch: 3000 avg_reward: 3.369453125 avg_winrate: 0.0 test_avg_reward 4.946875 test_avg_winrate 0.0
epoch: 3100 avg_reward: 3.9022265625 avg_winrate: 0.0 test_avg_reward 5.396875 test_avg_winrate 0.0
epoch: 3200 avg_reward: 4.7571875 avg_winrate: 0.0 test_avg_reward 6.01875 test_avg_winrate 0.0

Hello! A question, please!

Hello! If I want to use a multi-agent environment of my own, what should I do? Thanks! (Forgive me, I am a newbie...)
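For orientation, a minimal sketch of the interface a custom environment would have to mimic, assuming the SMAC-style API this project calls (method names follow smac's StarCraft2Env; treat the exact set as an assumption):

class MyMultiAgentEnv:
    """Skeleton of a SMAC-compatible environment (illustrative only)."""

    def get_env_info(self):
        return {"n_agents": 2, "n_actions": 5, "obs_shape": 10,
                "state_shape": 20, "episode_limit": 60}

    def reset(self):
        pass                              # start a new episode

    def get_obs(self):
        return []                         # list of per-agent observation vectors

    def get_state(self):
        return []                         # global state vector

    def get_avail_agent_actions(self, agent_id):
        return [1] * 5                    # binary mask over this agent's actions

    def step(self, actions):
        reward, terminated, info = 0.0, False, {}
        return reward, terminated, info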

About the DyMA network

Hi, your code is very clear; as a reinforcement learning beginner I can understand it easily from your comments. There is one thing I don't understand, though, and I'd like to ask about it. I am studying DyMA-CL, and I see your code uses VDN-Net inside DyMA. Can it be replaced with QMIX-Net? I tried, but when transferring the 8m model to 3m the QMIX-Net parameters do not match. Does the transfer only work with VDN here? I am not sure whether I have stated my question clearly; looking forward to your reply, thanks.

Hello!

Hi! I see you have commented the whole codebase... impressive. Is this code also adapted from the open-source SMAC code on GitHub, or did you write it yourself from scratch? (Really impressive!!!) Thanks!

Question about the target Q computation in the COMA td_error

Hi, it's me again with another question, haha.

I noticed the computation of q_next_target in COMA and don't quite understand why it is computed this way:
q_evals = torch.gather(q_evals, dim=3, index=u).squeeze(3)
q_next_target = torch.gather(q_next_target, dim=3, index=u_next).squeeze(3)

As I understand it, there are generally two ways to compute the target: one selects the action with the eval net's argmax and evaluates it with the target net (DDQN-style), the other takes the argmax of the target net directly (DQN-style).

I have two questions here:
(1) Why use the executed u_next to select the action instead of the DQN or DDQN way?
(2) The last step of u_next is padded with 0 (since there is no action at the final terminate step). Looking at the SMAC source code, 0 is roughly no-op, but no-op is not available for living agents. Could this cause a problem?
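For reference, a minimal sketch of the two target styles the question contrasts (illustrative tensors, not the project's code):

import torch

q_next_eval = torch.randn(2, 10, 3, 9)     # eval net at s', shape (episode, step, agent, action)
q_next_target = torch.randn(2, 10, 3, 9)   # target net at s'

# DQN-style target: take the max over the target net directly.
target_dqn = q_next_target.max(dim=3)[0]

# Double-DQN-style target: select the action with the eval net, evaluate it with the target net.
a_star = q_next_eval.argmax(dim=3, keepdim=True)
target_ddqn = torch.gather(q_next_target, dim=3, index=a_star).squeeze(3)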

About the reward

For the reward, only one value is stored per step. Is that because the agents cooperate and share the reward?
If the environment has four agents that cooperate in two pairs, how should that kind of reward be handled?
Also, can QMIX only be used in cooperative settings?

Looking forward to your reply.
Thanks.

Feature vector setup of the environment

Hello! I have some questions about the feature vectors returned by get_obs() and get_state() in the SMAC environment.
As the official documentation says, the return values of these two functions contain information such as the relative coordinates of allies and enemies, normalised.

But I am not entirely sure what this means. Is it saying the following:

Suppose the current environment is a 2v1 scenario, i.e., two cooperating agents (the ones we train) vs. one enemy to destroy.
Now consider what get_obs() returns in SMAC.
Calling get_obs() returns a list of length 2 recording the two allies' observation feature vectors, as follows (simplified here to 7 dimensions):
[[own x, own y, ally relative x, ally relative y, enemy relative x, enemy relative y, enemy health], [ally 2's observation (same structure)]]

Suppose my current coordinates are (5, 5), the ally's relative coordinates are (2, 1), the enemy's relative coordinates are (1, -2), and the enemy's health is 5.
Assuming the maximum distance component is 10 and the enemy's maximum health is 10,
the observation feature vector would then be [0.5, 0.5, 0.2, 0.1, 0.1, -0.2, 0.5].

Also, if no ally or enemy is within the current observation range, is a special value used, for example 0 here?

The above is my understanding of the environment's feature vectors; I am not sure whether it is correct.
I would appreciate a correction if not.
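A minimal sketch of the normalisation described in the question (purely illustrative; the real SMAC feature layout contains more fields such as visibility, distance, and unit type):

max_dist, max_health = 10.0, 10.0

own_xy       = (5.0, 5.0)
ally_rel_xy  = (2.0, 1.0)
enemy_rel_xy = (1.0, -2.0)
enemy_health = 5.0

obs = [own_xy[0] / max_dist, own_xy[1] / max_dist,
       ally_rel_xy[0] / max_dist, ally_rel_xy[1] / max_dist,
       enemy_rel_xy[0] / max_dist, enemy_rel_xy[1] / max_dist,
       enemy_health / max_health]
print(obs)  # [0.5, 0.5, 0.2, 0.1, 0.1, -0.2, 0.5]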

How do I turn off the game's render? Also, why doesn't the loss decrease during training?

A very strange loss curve:
[loss curve screenshot]

I only tuned some hyperparameters and the printed information; the source code was not changed.
The training output is as follows:
train_hyper-parameter:
map 8m
map_difficulty 5
train_epoch 20000
replay_buffer 10000
learning_rate 0.0005
episode_limit 120
epoch: 0
epsilon 0.99977
train_reward 0.016875 train_win_rate 0.0 test_reward 0.24375 test_win_rate 0.0
epoch: 100
epsilon 0.9767700000000048
train_reward 1.881875 train_win_rate 0.0 test_reward 3.1859375 test_win_rate 0.0
epoch: 200
epsilon 0.9537700000000096
train_reward 1.929375 train_win_rate 0.0 test_reward 15.26875 test_win_rate 0.4
epoch: 300
epsilon 0.9307700000000143
train_reward 2.0589583333333334 train_win_rate 0.0 test_reward 17.346875 test_win_rate 0.7
epoch: 400
epsilon 0.9077700000000191
train_reward 2.1264583333333333 train_win_rate 0.0 test_reward 16.0015625 test_win_rate 0.55
epoch: 500
epsilon 0.8847700000000238
train_reward 2.1064583333333333 train_win_rate 0.0 test_reward 15.1859375 test_win_rate 0.45
epoch: 600
epsilon 0.8617700000000286
train_reward 2.32125 train_win_rate 0.0 test_reward 17.4375 test_win_rate 0.7
epoch: 700
epsilon 0.8387700000000333
train_reward 2.511875 train_win_rate 0.0 test_reward 17.16875 test_win_rate 0.65
epoch: 800
epsilon 0.8157700000000381
train_reward 2.5832291666666665 train_win_rate 0.0 test_reward 17.8171875 test_win_rate 0.75
epoch: 900
epsilon 0.7927700000000428
train_reward 2.6990625 train_win_rate 0.0 test_reward 16.6140625 test_win_rate 0.6
epoch: 1000
epsilon 0.7697700000000476
train_reward 2.8104166666666663 train_win_rate 0.0 test_reward 15.653125 test_win_rate 0.5
epoch: 1100
epsilon 0.7467700000000523
train_reward 3.001666666666667 train_win_rate 0.0 test_reward 16.203125 test_win_rate 0.55
epoch: 1200
epsilon 0.7237700000000571
train_reward 3.0987500000000012 train_win_rate 0.0 test_reward 14.6296875 test_win_rate 0.4
epoch: 1300
epsilon 0.7007700000000618
train_reward 3.3103125 train_win_rate 0.0 test_reward 14.5515625 test_win_rate 0.35
epoch: 1400
epsilon 0.6777700000000666
train_reward 3.6044791666666662 train_win_rate 0.0 test_reward 15.046875 test_win_rate 0.4
epoch: 1500
epsilon 0.6547700000000714
train_reward 3.811875000000002 train_win_rate 0.0 test_reward 15.121875 test_win_rate 0.45
epoch: 1600
epsilon 0.6317700000000761
train_reward 4.193333333333333 train_win_rate 0.0 test_reward 15.540625 test_win_rate 0.45
epoch: 1700
epsilon 0.6087700000000809
train_reward 4.337812499999999 train_win_rate 0.0 test_reward 11.4578125 test_win_rate 0.1
epoch: 1800
epsilon 0.5857700000000856
train_reward 4.354895833333331 train_win_rate 0.0 test_reward 15.36875 test_win_rate 0.45
epoch: 1900
epsilon 0.5627700000000904
train_reward 4.8644791666666665 train_win_rate 0.0 test_reward 10.675 test_win_rate 0.05
epoch: 2000
epsilon 0.5397700000000951
train_reward 5.213854166666667 train_win_rate 0.0 test_reward 12.1 test_win_rate 0.1
epoch: 2100
epsilon 0.5167700000000999
train_reward 5.163541666666667 train_win_rate 0.0 test_reward 11.121875 test_win_rate 0.1
epoch: 2200
epsilon 0.49377000000010307
train_reward 5.481875 train_win_rate 0.0 test_reward 7.5296875 test_win_rate 0.0
epoch: 2300
epsilon 0.4707700000001023
train_reward 5.200729166666669 train_win_rate 0.0 test_reward 5.159375 test_win_rate 0.0
epoch: 2400
epsilon 0.4477700000001015
train_reward 5.0978125000000025 train_win_rate 0.0 test_reward 1.9328125 test_win_rate 0.0
epoch: 2500
epsilon 0.4247700000001007
train_reward 4.331041666666667 train_win_rate 0.0 test_reward 3.621875 test_win_rate 0.0
epoch: 2600
epsilon 0.4017700000000999
train_reward 4.076562499999999 train_win_rate 0.0 test_reward 4.4828125 test_win_rate 0.0
epoch: 2700
epsilon 0.3787700000000991
train_reward 4.628541666666667 train_win_rate 0.0 test_reward 1.9828125 test_win_rate 0.0
epoch: 2800
epsilon 0.3557700000000983
train_reward 4.869062500000002 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 2900
epsilon 0.3327700000000975
train_reward 4.633854166666668 train_win_rate 0.0 test_reward 0.4359375 test_win_rate 0.0
epoch: 3000
epsilon 0.3097700000000967
train_reward 5.482499999999999 train_win_rate 0.0 test_reward 0.7578125 test_win_rate 0.0
epoch: 3100
epsilon 0.2867700000000959
train_reward 5.1883333333333335 train_win_rate 0.0 test_reward 2.64375 test_win_rate 0.0
epoch: 3200
epsilon 0.2637700000000951
train_reward 4.500833333333335 train_win_rate 0.0 test_reward 4.7109375 test_win_rate 0.0
epoch: 3300
epsilon 0.2407700000000943
train_reward 4.912499999999998 train_win_rate 0.0 test_reward 3.6890625 test_win_rate 0.0
epoch: 3400
epsilon 0.2177700000000935
train_reward 4.882812500000001 train_win_rate 0.0 test_reward 2.675 test_win_rate 0.0
epoch: 3500
epsilon 0.1947700000000927
train_reward 4.233020833333335 train_win_rate 0.0 test_reward 3.2828125 test_win_rate 0.0
epoch: 3600
epsilon 0.1717700000000919
train_reward 4.476875 train_win_rate 0.0 test_reward 3.865625 test_win_rate 0.0
epoch: 3700
epsilon 0.1487700000000911
train_reward 3.603749999999999 train_win_rate 0.0 test_reward 1.9125 test_win_rate 0.0
epoch: 3800
epsilon 0.1257700000000903
train_reward 3.9501041666666676 train_win_rate 0.0 test_reward 4.121875 test_win_rate 0.0
epoch: 3900
epsilon 0.10277000000009086
train_reward 3.233645833333332 train_win_rate 0.0 test_reward 2.5046875 test_win_rate 0.0
epoch: 4000
epsilon 0.07977000000009145
train_reward 3.533333333333332 train_win_rate 0.0 test_reward 4.21875 test_win_rate 0.0
epoch: 4100
epsilon 0.056770000000091865
train_reward 3.0776041666666676 train_win_rate 0.0 test_reward 3.1796875 test_win_rate 0.0
epoch: 4200
epsilon 0.049870000000091834
train_reward 3.398541666666667 train_win_rate 0.0 test_reward 3.64375 test_win_rate 0.0
epoch: 4300
epsilon 0.049870000000091834
train_reward 3.745208333333332 train_win_rate 0.0 test_reward 4.046875 test_win_rate 0.0
epoch: 4400
epsilon 0.049870000000091834
train_reward 3.8334375000000005 train_win_rate 0.0 test_reward 4.7265625 test_win_rate 0.0
epoch: 4500
epsilon 0.049870000000091834
train_reward 3.708854166666664 train_win_rate 0.0 test_reward 3.5 test_win_rate 0.0
epoch: 4600
epsilon 0.049870000000091834
train_reward 4.413958333333334 train_win_rate 0.0 test_reward 3.3640625 test_win_rate 0.0
epoch: 4700
epsilon 0.049870000000091834
train_reward 4.633541666666665 train_win_rate 0.0 test_reward 5.0734375 test_win_rate 0.0
epoch: 4800
epsilon 0.049870000000091834
train_reward 4.795520833333332 train_win_rate 0.0 test_reward 6.2375 test_win_rate 0.0
epoch: 4900
epsilon 0.049870000000091834
train_reward 4.778229166666666 train_win_rate 0.0 test_reward 4.3984375 test_win_rate 0.0
epoch: 5000
epsilon 0.049870000000091834
train_reward 4.482291666666666 train_win_rate 0.0 test_reward 2.98125 test_win_rate 0.0
epoch: 5100
epsilon 0.049870000000091834
train_reward 4.067083333333334 train_win_rate 0.0 test_reward 4.384375 test_win_rate 0.0
epoch: 5200
epsilon 0.049870000000091834
train_reward 4.1935416666666665 train_win_rate 0.0 test_reward 5.6421875 test_win_rate 0.0
epoch: 5300
epsilon 0.049870000000091834
train_reward 4.421666666666665 train_win_rate 0.0 test_reward 3.953125 test_win_rate 0.0
epoch: 5400
epsilon 0.049870000000091834
train_reward 2.894895833333333 train_win_rate 0.0 test_reward 3.2671875 test_win_rate 0.0
epoch: 5500
epsilon 0.049870000000091834
train_reward 3.086979166666668 train_win_rate 0.0 test_reward 3.1640625 test_win_rate 0.0
epoch: 5600
epsilon 0.049870000000091834
train_reward 2.7130208333333328 train_win_rate 0.0 test_reward 3.546875 test_win_rate 0.0
epoch: 5700
epsilon 0.049870000000091834
train_reward 3.0173958333333335 train_win_rate 0.0 test_reward 3.171875 test_win_rate 0.0
epoch: 5800
epsilon 0.049870000000091834
train_reward 2.3047916666666675 train_win_rate 0.0 test_reward 1.828125 test_win_rate 0.0
epoch: 5900
epsilon 0.049870000000091834
train_reward 2.1951041666666673 train_win_rate 0.0 test_reward 0.621875 test_win_rate 0.0
epoch: 6000
epsilon 0.049870000000091834
train_reward 1.8297916666666658 train_win_rate 0.0 test_reward 1.003125 test_win_rate 0.0
epoch: 6100
epsilon 0.049870000000091834
train_reward 2.3591666666666655 train_win_rate 0.0 test_reward 0.896875 test_win_rate 0.0
epoch: 6200
epsilon 0.049870000000091834
train_reward 1.8259375000000002 train_win_rate 0.0 test_reward 0.434375 test_win_rate 0.0
epoch: 6300
epsilon 0.049870000000091834
train_reward 2.144791666666667 train_win_rate 0.0 test_reward 0.6296875 test_win_rate 0.0
epoch: 6400
epsilon 0.049870000000091834
train_reward 2.068541666666667 train_win_rate 0.0 test_reward 2.128125 test_win_rate 0.0
epoch: 6500
epsilon 0.049870000000091834
train_reward 2.009895833333333 train_win_rate 0.0 test_reward 0.4515625 test_win_rate 0.0
epoch: 6600
epsilon 0.049870000000091834
train_reward 2.530208333333334 train_win_rate 0.0 test_reward 0.7734375 test_win_rate 0.0
epoch: 6700
epsilon 0.049870000000091834
train_reward 2.4498958333333336 train_win_rate 0.0 test_reward 1.409375 test_win_rate 0.0
epoch: 6800
epsilon 0.049870000000091834
train_reward 1.9322916666666663 train_win_rate 0.0 test_reward 0.13125 test_win_rate 0.0
epoch: 6900
epsilon 0.049870000000091834
train_reward 2.537708333333333 train_win_rate 0.0 test_reward 0.121875 test_win_rate 0.0
epoch: 7000
epsilon 0.049870000000091834
train_reward 1.9564583333333332 train_win_rate 0.0 test_reward 0.8203125 test_win_rate 0.0
epoch: 7100
epsilon 0.049870000000091834
train_reward 1.6682291666666667 train_win_rate 0.0 test_reward 0.615625 test_win_rate 0.0
epoch: 7200
epsilon 0.049870000000091834
train_reward 1.7818750000000003 train_win_rate 0.0 test_reward 0.925 test_win_rate 0.0
epoch: 7300
epsilon 0.049870000000091834
train_reward 2.0258333333333325 train_win_rate 0.0 test_reward 0.2453125 test_win_rate 0.0
epoch: 7400
epsilon 0.049870000000091834
train_reward 1.8468749999999998 train_win_rate 0.0 test_reward 0.065625 test_win_rate 0.0
epoch: 7500
epsilon 0.049870000000091834
train_reward 1.9370833333333344 train_win_rate 0.0 test_reward 1.21875 test_win_rate 0.0
epoch: 7600
epsilon 0.049870000000091834
train_reward 2.1445833333333333 train_win_rate 0.0 test_reward 0.284375 test_win_rate 0.0
epoch: 7700
epsilon 0.049870000000091834
train_reward 1.8172916666666667 train_win_rate 0.0 test_reward 0.196875 test_win_rate 0.0
epoch: 7800
epsilon 0.049870000000091834
train_reward 2.1602083333333333 train_win_rate 0.0 test_reward 1.259375 test_win_rate 0.0
epoch: 7900
epsilon 0.049870000000091834
train_reward 2.1704166666666653 train_win_rate 0.0 test_reward 1.0078125 test_win_rate 0.0
epoch: 8000
epsilon 0.049870000000091834
train_reward 2.354895833333333 train_win_rate 0.0 test_reward 1.0421875 test_win_rate 0.0
epoch: 8100
epsilon 0.049870000000091834
train_reward 2.7079166666666667 train_win_rate 0.0 test_reward 0.7921875 test_win_rate 0.0
epoch: 8200
epsilon 0.049870000000091834
train_reward 2.513958333333332 train_win_rate 0.0 test_reward 1.534375 test_win_rate 0.0
epoch: 8300
epsilon 0.049870000000091834
train_reward 2.433958333333333 train_win_rate 0.0 test_reward 2.1265625 test_win_rate 0.0
epoch: 8400
epsilon 0.049870000000091834
train_reward 2.9 train_win_rate 0.0 test_reward 2.740625 test_win_rate 0.0
epoch: 8500
epsilon 0.049870000000091834
train_reward 2.9971874999999994 train_win_rate 0.0 test_reward 2.403125 test_win_rate 0.0
epoch: 8600
epsilon 0.049870000000091834
train_reward 2.9128125000000007 train_win_rate 0.0 test_reward 1.9890625 test_win_rate 0.0
epoch: 8700
epsilon 0.049870000000091834
train_reward 3.0215625000000013 train_win_rate 0.0 test_reward 1.559375 test_win_rate 0.0
epoch: 8800
epsilon 0.049870000000091834
train_reward 2.7771874999999993 train_win_rate 0.0 test_reward 1.0578125 test_win_rate 0.0
epoch: 8900
epsilon 0.049870000000091834
train_reward 2.1780208333333335 train_win_rate 0.0 test_reward 0.840625 test_win_rate 0.0
epoch: 9000
epsilon 0.049870000000091834
train_reward 2.3998958333333316 train_win_rate 0.0 test_reward 0.99375 test_win_rate 0.0
epoch: 9100
epsilon 0.049870000000091834
train_reward 1.9713541666666665 train_win_rate 0.0 test_reward 0.6484375 test_win_rate 0.0
epoch: 9200
epsilon 0.049870000000091834
train_reward 1.9113541666666665 train_win_rate 0.0 test_reward 0.9140625 test_win_rate 0.0
epoch: 9300
epsilon 0.049870000000091834
train_reward 1.7506249999999997 train_win_rate 0.0 test_reward 0.8390625 test_win_rate 0.0
epoch: 9400
epsilon 0.049870000000091834
train_reward 1.5670833333333325 train_win_rate 0.0 test_reward 1.4109375 test_win_rate 0.0
epoch: 9500
epsilon 0.049870000000091834
train_reward 1.620104166666666 train_win_rate 0.0 test_reward 0.8265625 test_win_rate 0.0
epoch: 9600
epsilon 0.049870000000091834
train_reward 1.4828125000000005 train_win_rate 0.0 test_reward 0.440625 test_win_rate 0.0
epoch: 9700
epsilon 0.049870000000091834
train_reward 1.2328124999999999 train_win_rate 0.0 test_reward 0.825 test_win_rate 0.0
epoch: 9800
epsilon 0.049870000000091834
train_reward 1.2220833333333336 train_win_rate 0.0 test_reward 1.2109375 test_win_rate 0.0
epoch: 9900
epsilon 0.049870000000091834
train_reward 1.1340624999999998 train_win_rate 0.0 test_reward 0.69375 test_win_rate 0.0
epoch: 10000
epsilon 0.049870000000091834
train_reward 1.0516666666666665 train_win_rate 0.0 test_reward 0.2921875 test_win_rate 0.0
epoch: 10100
epsilon 0.049870000000091834
train_reward 0.8470833333333333 train_win_rate 0.0 test_reward 0.809375 test_win_rate 0.0
epoch: 10200
epsilon 0.049870000000091834
train_reward 0.8629166666666667 train_win_rate 0.0 test_reward 0.3375 test_win_rate 0.0
epoch: 10300
epsilon 0.049870000000091834
train_reward 0.6390625000000001 train_win_rate 0.0 test_reward 0.8484375 test_win_rate 0.0
epoch: 10400
epsilon 0.049870000000091834
train_reward 0.8579166666666665 train_win_rate 0.0 test_reward 0.3296875 test_win_rate 0.0
epoch: 10500
epsilon 0.049870000000091834
train_reward 0.7028124999999998 train_win_rate 0.0 test_reward 0.103125 test_win_rate 0.0
epoch: 10600
epsilon 0.049870000000091834
train_reward 0.8051041666666666 train_win_rate 0.0 test_reward 0.2359375 test_win_rate 0.0
epoch: 10700
epsilon 0.049870000000091834
train_reward 0.804375 train_win_rate 0.0 test_reward 0.196875 test_win_rate 0.0
epoch: 10800
epsilon 0.049870000000091834
train_reward 0.6659374999999998 train_win_rate 0.0 test_reward 0.215625 test_win_rate 0.0
epoch: 10900
epsilon 0.049870000000091834
train_reward 0.8440624999999998 train_win_rate 0.0 test_reward 0.9828125 test_win_rate 0.0
epoch: 11000
epsilon 0.049870000000091834
train_reward 0.7466666666666665 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 11100
epsilon 0.049870000000091834
train_reward 0.7547916666666665 train_win_rate 0.0 test_reward 0.1328125 test_win_rate 0.0
epoch: 11200
epsilon 0.049870000000091834
train_reward 0.9181250000000002 train_win_rate 0.0 test_reward 0.3234375 test_win_rate 0.0
epoch: 11300
epsilon 0.049870000000091834
train_reward 0.8406250000000002 train_win_rate 0.0 test_reward 0.1890625 test_win_rate 0.0
epoch: 11400
epsilon 0.049870000000091834
train_reward 0.7920833333333333 train_win_rate 0.0 test_reward 0.253125 test_win_rate 0.0
epoch: 11500
epsilon 0.049870000000091834
train_reward 0.6032291666666666 train_win_rate 0.0 test_reward 0.01875 test_win_rate 0.0
epoch: 11600
epsilon 0.049870000000091834
train_reward 0.6145833333333333 train_win_rate 0.0 test_reward 0.225 test_win_rate 0.0
epoch: 11700
epsilon 0.049870000000091834
train_reward 0.5562499999999999 train_win_rate 0.0 test_reward 0.103125 test_win_rate 0.0
epoch: 11800
epsilon 0.049870000000091834
train_reward 0.5293749999999998 train_win_rate 0.0 test_reward 0.065625 test_win_rate 0.0
epoch: 11900
epsilon 0.049870000000091834
train_reward 0.4654166666666667 train_win_rate 0.0 test_reward 0.0375 test_win_rate 0.0
epoch: 12000
epsilon 0.049870000000091834
train_reward 0.5352083333333334 train_win_rate 0.0 test_reward 0.05625 test_win_rate 0.0
epoch: 12100
epsilon 0.049870000000091834
train_reward 0.5377083333333335 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 12200
epsilon 0.049870000000091834
train_reward 0.636875 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 12300
epsilon 0.049870000000091834
train_reward 0.45531249999999995 train_win_rate 0.0 test_reward 0.075 test_win_rate 0.0
epoch: 12400
epsilon 0.049870000000091834
train_reward 0.5946874999999998 train_win_rate 0.0 test_reward 0.046875 test_win_rate 0.0
epoch: 12500
epsilon 0.049870000000091834
train_reward 0.5055208333333333 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 12600
epsilon 0.049870000000091834
train_reward 0.41854166666666665 train_win_rate 0.0 test_reward 0.028125 test_win_rate 0.0
epoch: 12700
epsilon 0.049870000000091834
train_reward 0.4434375 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 12800
epsilon 0.049870000000091834
train_reward 0.5711458333333332 train_win_rate 0.0 test_reward 0.028125 test_win_rate 0.0
epoch: 12900
epsilon 0.049870000000091834
train_reward 0.5663541666666665 train_win_rate 0.0 test_reward 0.046875 test_win_rate 0.0
epoch: 13000
epsilon 0.049870000000091834
train_reward 0.5740624999999998 train_win_rate 0.0 test_reward 0.028125 test_win_rate 0.0
epoch: 13100
epsilon 0.049870000000091834
train_reward 0.45760416666666665 train_win_rate 0.0 test_reward 0.225 test_win_rate 0.0
epoch: 13200
epsilon 0.049870000000091834
train_reward 0.518125 train_win_rate 0.0 test_reward 0.1984375 test_win_rate 0.0
epoch: 13300
epsilon 0.049870000000091834
train_reward 0.49447916666666664 train_win_rate 0.0 test_reward 0.2828125 test_win_rate 0.0
epoch: 13400
epsilon 0.049870000000091834
train_reward 0.6486458333333335 train_win_rate 0.0 test_reward 0.0375 test_win_rate 0.0
epoch: 13500
epsilon 0.049870000000091834
train_reward 0.6007291666666668 train_win_rate 0.0 test_reward 0.0 test_win_rate 0.0
epoch: 13600
epsilon 0.049870000000091834
train_reward 0.45947916666666666 train_win_rate 0.0 test_reward 0.13125 test_win_rate 0.0
epoch: 13700
epsilon 0.049870000000091834
train_reward 0.42625 train_win_rate 0.0 test_reward 0.140625 test_win_rate 0.0

How to run multi-agent QMIX with multiple processes

A question, please: in the multi-agent case, is the final goal to obtain multiple networks with the same structure? How is that trained? This is not distributed data parallelism where the end result is a single network.
My data has to be provided by two separate processes, and there is also an upper-level network whose loss should flow back into both lower networks. Since they live in different processes, I don't know how to pass gradients across processes. Essentially it is QMIX with agent networks underneath: the network structure is the same but they reside in different processes. Thanks.

The algorithms do not perform well in hard scenarios?

I've already tested some hard scenarios, such as 5m_vs_6m and 3s_vs_5z, but the results don't seem to be as good as pymarl's.
I have also found some inadequacies, such as 'args.target_update_cycle = 200' and 'train_step % self.args.target_update_cycle == 0' in VDN. This needs to be changed to 'args.target_update_cycle = 20000' or 'epoch % self.args.target_update_cycle == 0'; see the sketch below.
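For clarity, a minimal, self-contained sketch of the change suggested above (eval_net/target_net are placeholder names, not the project's identifiers):

import torch.nn as nn

class Args:
    target_update_cycle = 20000      # the issue suggests 20000 instead of 200

args = Args()
eval_net, target_net = nn.Linear(4, 4), nn.Linear(4, 4)

for train_step in range(1, 100001):
    # ... one gradient update of eval_net would happen here ...
    if train_step % args.target_update_cycle == 0:
        target_net.load_state_dict(eval_net.state_dict())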

Parameter freezing

Hi, I'd like to ask whether this project provides a way to keep the saved parameters and continue training from them. I find that training often collapses during a run. Any advice would be greatly appreciated, thanks!

Question about the n_episodes setting

First, kudos for this work, it's great! The structure is much clearer and more readable than pymarl's.

In this version of the code, n_episodes is forced to 1 for all algorithms. In earlier versions this value could be 5 or 8, and all the collected episodes were concatenated. So the concatenation step is effectively gone now, even though the code is still there (with the value at 1 the concatenation loop never runs).

I'd like to ask what the main reason for this is. Is it a performance issue, or is it because it easily produces bugs? PyTorch is tricky when handling variable-length RNN batches. I ran into this in a project last year; with a batch of 1 the problem disappears. I later used an approach similar to https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e, involving padding, pack_padded_sequence, packing and unpacking. It is laborious to handle, and easy to get wrong if you are not careful.

Manually padding all episodes to equal length with zeros, as done here, is another approach. I wonder how it would work out to handle the variable-length sequence batch with PyTorch's built-in packing functions instead; a sketch of that idea follows below.
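A minimal sketch of the pack_padded_sequence idea mentioned above (generic PyTorch, not this repository's code):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# Three episodes of different lengths, zero-padded to the maximum length.
lengths = [5, 3, 2]
padded = torch.zeros(3, 5, 8)
for i, length in enumerate(lengths):
    padded[i, :length] = torch.randn(length, 8)

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = rnn(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)   # back to (3, 5, 16); padded steps stay zero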

Also, there is a small issue in this version's main.py: lines 10 and 11 of main.py probably need to be commented out; they look like test code.

About reuse_network

Hi, I see this option in the argument settings,
but if it is set to False, there does not seem to be a corresponding implementation in the code?
Is reuse_network = False currently supported?

Question about n_return in the REINFORCE algorithm

Hi, another question.

There is an interesting issue here when computing the n-step return in reinforce.py:
def _get_returns(self, r, mask, terminated, max_episode_len):
    r = r.squeeze(-1)
    mask = mask.squeeze(-1)
    terminated = terminated.squeeze(-1)
    terminated = 1 - terminated
    n_return = torch.zeros_like(r)
    n_return[:, -1] = r[:, -1] * terminated[:, -1] * mask[:, -1]
    for transition_idx in range(max_episode_len - 2, -1, -1):
        n_return[:, transition_idx] = (r[:, transition_idx] + self.args.gamma * n_return[:, transition_idx + 1] * terminated[:, transition_idx + 1]) * mask[:, transition_idx]
    return n_return.unsqueeze(-1).expand(-1, -1, self.n_agents)

Isn't there a problem with this line:
n_return[:, -1] = r[:, -1] * terminated[:, -1] * mask[:, -1]
The inversion of terminated flips one step earlier than the padding, and the last 1 (the terminate step) does have a reward.
For example, terminated = 00001, so 1 - terminated = 11110,
and the code above turns the reward of the terminate step into 0.

So later, terminated[:, transition_idx + 1] * mask[:, transition_idx] is off by one step.

I also have another question: everything after the terminate step is padding anyway. From a design point of view, why add a separate padding field? Could it be implemented using terminated alone, treating every step after the terminate step as padding? Would that cause any problem?
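A minimal, purely illustrative sketch of the off-by-one described above, together with the correction the report seems to imply (not a confirmed fix from the author):

import torch

r = torch.tensor([[1., 1., 1., 1., 10.]])           # rewards; the terminate step carries +10
terminated = torch.tensor([[0., 0., 0., 0., 1.]])   # 1 marks the terminate step
mask = torch.tensor([[1., 1., 1., 1., 1.]])         # 1 for real steps, 0 for padding

not_done = 1 - terminated
# Current code: the terminal reward is multiplied by not_done[:, -1] == 0 and vanishes.
last_return_current = r[:, -1] * not_done[:, -1] * mask[:, -1]   # tensor([0.])
# Suggested correction: only mask out padding, keep the terminal reward.
last_return_fixed = r[:, -1] * mask[:, -1]                       # tensor([10.])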

System requirements for training QMIX on the corridor map

Hi,
I am wondering about the system requirements for training the QMIX model. I trained it on my machine with 32 GB of memory and it ran out of memory, so I would like to know how much memory is needed, or what else I can do. Thanks a lot for your code anyway.

A question

Hi, what is your view on using an RNN as the network? Is it a characteristic of value-based MARL? I have looked at MADDPG and MAAC, and they only use MLPs.
Hoping for your answer~ Thanks.

Errors in the g2anet and commnet algorithms

Hi, the reinforce+g2anet, reinforce+commnet, central_v+g2anet, and central_v+commnet algorithms raise errors while running; screenshots are below. Could you help resolve this? Thanks.
[screenshot: g2anet_error]

[screenshots: commnet_error_1, commnet_error_2, g2anet_error_1, g2anet_error_2]

Also, on Linux should runner.py add plt.switch_backend('agg')? Without it, an error occurs:
[screenshot: plt_error]
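A minimal sketch of the suggested workaround for headless Linux machines: switch matplotlib to the non-interactive Agg backend before any plotting (either form below should work):

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# or, after importing pyplot:
# plt.switch_backend('agg')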

Training stopped due to an unexpected RequestQuit after several epochs

Hello, thanks for open-sourcing this project. I found that the training stage always exits after several epochs. I tried to fix it via the issues posted on pysc2 but failed to solve it. I would appreciate it if you could help. BTW, can you share the versions of StarCraft and the Python packages for this repo?

Run 0, train epoch 2285, total epoch 20000
RequestQuit command received.
Closing Application...
unable to parse websocket frame.
Version: B60604 (SC2.4.01)
Build: May 1 2018 19:24:12
Command Line: '"/home/ldp/StarCraftII/Versions/Base60321/SC2_x64" -listen 127.0.0.1 -port 21450 -dataDir /home/ldp/StarCraftII/ -tempDir /tmp/sc-80505een/ -eglpath libEGL.so'
Starting up...
Startup Phase 1 complete
Startup Phase 2 complete
Attempting to initialize EGL from file libEGL.so ...
Successfully loaded EGL library!
Successfully initialized display on device idx: 0, EGL version: 1.5

Running CGLSimpleDevice::HALInit...
Calling glGetString: 0x7f80e8765880
Version: 4.6.0 NVIDIA 418.56
Vendor: NVIDIA Corporation
Renderer: GeForce GTX 1080 Ti/PCIe/SSE2
OpenGL initialized!
Disabling compressed textures
Listening on: 127.0.0.1:21450
Startup Phase 3 complete. Ready for commands.
Requesting to join a single player game
Configuring interface options
Configure: raw interface enabled
Configure: feature layer interface disabled
Configure: score interface disabled
Configure: render interface disabled
Entering load game phase.
Launching next game.
Next launch phase started: 2
Next launch phase started: 3
Next launch phase started: 4
Next launch phase started: 5
Next launch phase started: 6
Next launch phase started: 7
Next launch phase started: 8
Game has started.
Sending ResponseJoinGame
Traceback (most recent call last):
  File "main.py", line 34, in <module>
    runner.run(i)
  File "/home/ldp/Research/MARL/StarCraft/runner.py", line 46, in run
    episode, _, _ = self.rolloutWorker.generate_episode(episode_idx)
  File "/home/ldp/Research/MARL/StarCraft/common/rollout.py", line 77, in generate_episode
    win_tag = True if terminated and info['battle_won'] else False
KeyError: 'battle_won'
RequestQuit command received.
Closing Application...
unable to parse websocket frame.

Question about agent.train

Hi, in the code you use an RNN as the function-approximation network.
For the off-policy algorithms, sampling is performed during training,
and the sampling unit is a whole episode.
During training, because of the hidden-state input, the episode is traversed from the start through <o, a, r, o'>,
so every transition in the sampled episode is used as training data.

Is the above understanding correct?
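For reference, a minimal sketch of that training pattern, assuming a GRUCell-based agent network (illustrative, not the repository's exact code): sampled episodes are replayed step by step so the hidden state can be rebuilt from the first transition onward.

import torch
import torch.nn as nn

n_episodes, n_steps, n_agents, obs_dim, n_actions, hidden_dim = 8, 40, 3, 10, 6, 32
rnn_cell = nn.GRUCell(obs_dim, hidden_dim)
q_head = nn.Linear(hidden_dim, n_actions)

episode_obs = torch.randn(n_episodes, n_steps, n_agents, obs_dim)
hidden = torch.zeros(n_episodes * n_agents, hidden_dim)     # reset at the start of every episode

q_per_step = []
for t in range(n_steps):                                    # traverse each episode from step 0
    obs_t = episode_obs[:, t].reshape(-1, obs_dim)          # (episodes * agents, obs)
    hidden = rnn_cell(obs_t, hidden)
    q_per_step.append(q_head(hidden).view(n_episodes, n_agents, n_actions))
q_values = torch.stack(q_per_step, dim=1)                   # (episodes, steps, agents, actions)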

How to execute IQL?

Thanks for sharing the code~ I'm interested in IQL; how do I execute it? Can you give me some advice?
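A likely invocation, assuming the algorithm flag follows the same naming scheme as the README's qmix example (unconfirmed for this repo):

$ python main.py --map=3m --alg=iql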

Question about using CUDA

Hi, I am using a 2080 Ti with cuda set to True in the arguments, and torch.cuda.is_available() returns True as well, but each run (one of the eight) of each algorithm on each map still takes seven to eight hours. Was that also the speed you saw originally? (I do not know whether it is my setup or whether everyone experiences this.)

rollout.py

May I ask, is RolloutWorker used to generate the sample data for training?

A question

The QMIX paper says their update rule uses the Double-DQN style, but I see your QMIX uses the Nature DQN update. I changed it, but the performance got worse. What could be the reason?
