In this repo, I use several reinforcement learning algorithms to solve OpenAI Gym environments. For every project, I have also added a video showing how the agent behaves once training is complete.
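This is a pretty straightforward method: I simply instantiate a weight matrix, multiply it with the state to choose an action, collect the rewards over an episode, and then adjust the weights to maximise that reward. It is meant for a discrete action space (in CartPole the observations are continuous and the actions are discrete).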
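A minimal sketch of the idea, using only the standard library so it runs without Gym installed. The update rule (random-perturbation hill climbing), the toy task in `run_episode`, and all function names are assumptions made for illustration; they are not taken from the project code, and the toy task is not CartPole.

```python
import random

def act(weights, state):
    """Linear policy: score each action as state . weights[:, a], pick the best."""
    n_actions = len(weights[0])
    scores = [sum(s * weights[i][a] for i, s in enumerate(state))
              for a in range(n_actions)]
    return scores.index(max(scores))

def run_episode(weights, starts=(-0.8, -0.3, 0.4, 0.9), steps=20):
    """Toy stand-in for an environment (NOT CartPole): keep x close to 0."""
    total = 0.0
    for x in starts:
        for _ in range(steps):
            action = act(weights, [x, 1.0])    # state = [position, bias]
            x += 0.1 if action == 1 else -0.1  # action 1 moves right, 0 moves left
            total -= abs(x)                    # reward is higher near 0
    return total

def hill_climb(iterations=200, noise=0.5, seed=0):
    """Perturb the weights with noise and keep a candidate only if it does no worse."""
    rng = random.Random(seed)
    best_w = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(2)]
    best_r = run_episode(best_w)
    first_r = best_r
    for _ in range(iterations):
        cand = [[w + rng.gauss(0, noise) for w in row] for row in best_w]
        r = run_episode(cand)
        if r >= best_r:
            best_w, best_r = cand, r
    return best_w, first_r, best_r

weights, first_return, best_return = hill_climb()
print(first_return, best_return)
```

Because a candidate is only accepted when its return is at least as good, the best return never decreases. With a real Gym environment, `run_episode` would be replaced by a rollout of the environment.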
Projects solved:
- Cartpole - For info on the environment - https://github.com/openai/gym/wiki/CartPole-v0
Here I use the cross-entropy method (CEM) to solve the environment. In this method, a small amount of noise is added to the weights of the neural network rather than to the actions taken by the agent. It is a gradient-free (black-box) optimisation method. The network uses a tanh activation function in its final layer, so it can be used for a continuous action space.
- Instantiate a weight matrix. In every episode, a small amount of noise is added to the weight matrix and the rewards are evaluated. It is similar to a genetic (evolutionary) method:
- For every episode, you take a set of weights (by adding fresh noise to the weight matrix each time) and calculate the rewards obtained using those weights.
- Then you sort those rewards and keep only the top 10 (or whatever the elite count is), together with the weights that produced them.
- Finally, you take the mean of those elite weights and calculate the reward obtained with that mean weight.
- Repeat steps 2-4 for the chosen number of episodes, each time starting from the mean weight and adding noise to it, until you arrive at a good set of weights.
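The steps above can be sketched roughly as follows. This is a simplified illustration that uses only the standard library. The two-parameter tanh policy, the toy task in `evaluate`, and the hyperparameters (`pop_size`, `n_elite`, `sigma`) are assumptions chosen for the example; the toy task is not MountainCarContinuous.

```python
import math
import random

def policy(weights, x):
    """One-layer policy with a tanh output, so the action lies in [-1, 1]."""
    return math.tanh(weights[0] * x + weights[1])

def evaluate(weights, starts=(-1.0, -0.4, 0.5, 1.0), steps=15):
    """Toy stand-in task (NOT MountainCarContinuous): drive x toward 0."""
    total = 0.0
    for x in starts:
        for _ in range(steps):
            x += 0.2 * policy(weights, x)
            total -= abs(x)
    return total

def cem(iterations=30, pop_size=50, n_elite=10, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: instantiate the weights.
    mean_w = [rng.gauss(0, sigma) for _ in range(2)]
    history = [evaluate(mean_w)]
    for _ in range(iterations):
        # Step 2: add noise to the current mean weights to get a population.
        population = [[w + rng.gauss(0, sigma) for w in mean_w]
                      for _ in range(pop_size)]
        # Step 3: rank the candidates by reward and keep the elites.
        ranked = sorted(population, key=evaluate, reverse=True)
        elites = ranked[:n_elite]
        # Step 4: the new weights are the element-wise mean of the elites.
        mean_w = [sum(ws) / n_elite for ws in zip(*elites)]
        history.append(evaluate(mean_w))
    return mean_w, history

best_weights, rewards = cem()
print(rewards[0], rewards[-1])
```

Here the noise scale `sigma` is kept fixed for simplicity; some CEM variants also re-estimate it from the elites.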
Projects solved:
- MountainCar_Continous - For info on the environment - https://github.com/openai/gym/wiki/MountainCarContinuous-v0
DDPG (Deep Deterministic Policy Gradient) is quite a complex RL algorithm that can be used for continuous state and action spaces. It is a kind of actor-critic method (at least that's what the authors call it), but it is somewhat similar to supervised learning. The actor takes the actions and is evaluated using the Q-values generated by the critic. So here the actor's actions play the role of the output variable, and the Q-values from the critic can be thought of as the labels.
- For every action taken by the actor model, you get the state, action, reward, new state, and done (whether the task is finished). So you can create a tuple of { S, A, R, S', D } and store it in the replay buffer.
- Take a sample from the replay buffer and train the networks on the stored experiences.
- I have tried to explain how DDPG works, with both the actor and the critic models, in the diagram below.
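The replay-buffer steps can be sketched as follows, using only the standard library. The class and function names are hypothetical, not the project's own. `critic_targets` shows the standard DDPG "label" for the critic, y = r + gamma * (1 - done) * Q'(s', mu'(s')), with plain Python functions standing in for the target actor and target critic networks; how the project actually builds its networks is not shown here.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size store of (S, A, R, S', D) tuples."""

    def __init__(self, capacity=10000, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest experiences drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample without replacement.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

def critic_targets(batch, target_actor, target_critic, gamma=0.99):
    """Targets the critic is trained toward; no bootstrapping on terminal steps."""
    return [e.reward + gamma * (0.0 if e.done else 1.0)
            * target_critic(e.next_state, target_actor(e.next_state))
            for e in batch]

# Fill the buffer with made-up transitions, then sample a training batch.
buffer = ReplayBuffer(capacity=100)
for step in range(10):
    s = [step * 0.1]
    buffer.add(s, -0.5, -s[0] ** 2, [s[0] - 0.05], done=(step == 9))

batch = buffer.sample(4)
targets = critic_targets(
    batch,
    target_actor=lambda s: -0.5 * s[0],                # stand-in for mu'(s)
    target_critic=lambda s, a: -(s[0] ** 2) - a ** 2)  # stand-in for Q'(s, a)
print(len(batch), targets)
```

Sampling at random from the buffer, instead of training on transitions in the order they occurred, breaks up the correlation between consecutive steps.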
Projects solved:
- Pendulum - For info on the environment - https://github.com/openai/gym/wiki/Pendulum-v0