Based on the MADDPG algorithm, this repository contains two main improvements:
- maddpg_IU-master: the code implementation of updating the agents' strategies by iterative update (IU).
- maddpg_IUUR-master: the code implementation of updating the agents' strategies by iterative update and unified representation (IUUR).
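The core idea of iterative update is that only one agent's policy is trained at a time while the others are frozen, so the environment looks stationary to the learner. A minimal sketch of such a schedule (the function name and the hyperparameter `k` are illustrative; the actual training loop lives in maddpg_IU-master):

```python
def agent_to_update(episode, n_agents, k):
    """Return the index of the single agent whose policy is trained
    during `episode`; the remaining agents' policies stay frozen,
    which keeps the environment stationary from the learner's view."""
    return (episode // k) % n_agents

# Example: 3 agents, switching the learning agent every k = 1000 episodes.
schedule = [agent_to_update(ep, n_agents=3, k=1000)
            for ep in (0, 999, 1000, 2500, 3000)]
# schedule == [0, 0, 1, 2, 0]
```

Each agent thus gets a window of `k` episodes in which the other policies act as a fixed part of the environment, before the role rotates to the next agent.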
The experimental environment is multiagent-particle-envs-master, a multi-agent environment installed in the same way as the MADDPG environment.
There are two main kinds of environments: fully cooperative and mixed cooperative-competitive:
- Fully cooperative environments (Spread): agents perceive the environment from their own perspectives and cooperate with each other to reach different destinations.
- Mixed cooperative-competitive environments (Predator-Prey): agents are divided into predators and prey. The predators need to cooperate with each other to catch the prey.
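Both scenario types follow the usual particle-env interaction pattern: per-agent observation lists in, per-agent action lists out, with a shared reward in the cooperative case. The sketch below uses a hypothetical stub class in place of the real environment (which is created via the package's `make_env` helper) just to show the shape of the loop:

```python
import random

class DummySpreadEnv:
    """Hypothetical stand-in for a fully cooperative Spread scenario;
    the real environment comes from multiagent-particle-envs-master.
    Each of the n agents has its own observation, action, and reward."""
    def __init__(self, n_agents=3):
        self.n = n_agents
    def reset(self):
        return [[random.random()] * 4 for _ in range(self.n)]  # one obs per agent
    def step(self, actions):
        obs = [[random.random()] * 4 for _ in range(self.n)]
        # Fully cooperative: every agent receives the same shared reward.
        shared = -sum(abs(a[0]) for a in actions)
        return obs, [shared] * self.n, [False] * self.n, {}

env = DummySpreadEnv(n_agents=3)
obs = env.reset()
actions = [[0.0, 1.0] for _ in obs]          # one action vector per agent
obs, rewards, dones, _ = env.step(actions)
```

In the mixed cooperative-competitive scenarios the reward list is no longer shared: predators and prey receive opposing rewards.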
To compare the influence of the number of agents on the algorithms, we designed control groups by increasing the number of agents.
Installation method and dependency package versions are the same as for MADDPG:
- To install the environment, `cd` into the root directory (multiagent-particle-envs-master) and type `pip install -e .`
- Known dependencies: Python (3.5.4), OpenAI gym (0.10.5), tensorflow (1.8.0), numpy (1.14.5)
- To run the code, `cd` into the relevant root directory and run the corresponding Python file.
For example, to run `train_Cooperative_game.py` under the scenario named spread_3:

```
python train_Cooperative_game.py --scenario=spread_3
```
We set up a simple environment with three agents (Spread_3) and a complex environment with ten agents (Spread_10).
*(Figures: the Spread_3 and Spread_10 environments)*
We run five random seeds for each environment and compare the performance among MADDPG, IU and IUUR.
*(Figures: learning-curve comparison in Spread_3 and Spread_10)*
As can be seen from the figures, IUUR converges quickly; after 20,000 episodes it surpasses MADDPG and maintains a steady rise.
We set up three-chase-one as the simple scenario (Predator_3-Prey_1) and six-chase-two as the complex scenario (Predator_6-Prey_2).
*(Figures: the Predator_3-Prey_1 and Predator_6-Prey_2 environments)*
We run five random seeds for each environment and compare the performance among MADDPG, IU and IUUR.
- Performance comparison in Predator_3-Prey_1:
- The prey is MADDPG while the predators are replaced by IU and IUUR:
- The predator is MADDPG while the preys are replaced by IU and IUUR:
IUUR significantly outperforms MADDPG, while IU's performance is slightly worse than that of MADDPG, which is contrary to our expectation.
- Performance comparison in Predator_6-Prey_2:
- The prey is MADDPG while the predators are replaced by IU and IUUR:
- The predator is MADDPG while the preys are replaced by IU and IUUR:
IU significantly outperforms MADDPG, while IUUR's performance is worse than that of MADDPG. The reason is that as the number of agents increases, the nonstationarity that arises in multi-agent reinforcement learning becomes more serious.
- This paper presents iterative update and unified representation. Iterative update is used to stabilize the environment, and unified representation takes advantage of tensor computation to save memory and speed up interaction with the environment.
- Though our experiments are based on MADDPG, this method is also applicable to most multi-agent algorithms such as IQL, VDN, and QMIX.
- Due to limited computing resources, we only expanded the number of agents to a certain extent; the method could be further verified in more complex environments.
- We only control the learning frequency of the iterative-update hyperparameter K empirically; tuning it systematically is a direction for future research.
- How to realize the iterative update method within the unified-representation network will be further explored in future work. (The value-fixing method based on the Bellman equation can only guarantee a small L_2 norm of its gradients.)
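The unified-representation idea mentioned above can be sketched concretely: instead of maintaining one policy network per agent, a single shared network processes all agents' observations in one batched tensor operation. The following is an illustrative NumPy stand-in (the shapes and the single shared weight matrix are assumptions for illustration, not the exact architecture in maddpg_IUUR-master):

```python
import numpy as np

np.random.seed(0)
n_agents, obs_dim, act_dim = 10, 18, 5

# One weight matrix shared by all agents (unified representation),
# instead of n_agents separate policy networks.
W = np.random.randn(obs_dim, act_dim)
b = np.zeros(act_dim)

# All agents' observations stacked into a single tensor ...
obs_batch = np.random.randn(n_agents, obs_dim)
# ... so one matmul replaces n_agents separate forward passes.
actions = np.tanh(obs_batch @ W + b)
```

Because the per-agent forward passes collapse into one matrix multiplication, memory for duplicate parameters is saved and the interaction with the environment is vectorized, which is where the reported speedup comes from.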