
tic-tac-toe's Introduction

Tic Tac Toe

A tale about trying to train a machine to play Tic Tac Toe through Reinforcement Learning

To run the Jupyter notebooks in Binder, click the Binder badge.

The goal of this series is to implement and test a couple of different approaches to training a computer to play Tic Tac Toe. We will create:

  • A player that plays completely randomly,
  • Two players that implement simple forms of the Min-Max algorithm,
  • Several players that we will train through Reinforcement Learning:
    • a Tabular Q-Learning player (the update rule it relies on is sketched right after this list).
    • a Simple Neural Network Q-Learning player.
    • a Deep Neural Network Q-Learning player.
    • a Policy Gradient Descent-based player.
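
As a preview of the tabular approach, here is a minimal sketch of the classic Q-learning update the Tabular player is built around. It is illustrative only: the names and hyper-parameter values below are placeholders, not this repository's actual API.

import random
from collections import defaultdict

LEARNING_RATE = 0.9     # alpha: how strongly new information overrides the old value
REWARD_DISCOUNT = 0.95  # gamma: how much future reward is worth right now

# Q-table: maps a (hashable) board state to one value per cell (0-8).
q_table = defaultdict(lambda: [0.5] * 9)

def update_q(state, action, reward, next_state, done):
    # One Q-learning update for a single (state, action) pair.
    old_value = q_table[state][action]
    next_max = 0.0 if done else max(q_table[next_state])
    target = reward + REWARD_DISCOUNT * next_max
    q_table[state][action] = old_value + LEARNING_RATE * (target - old_value)

def choose_move(state, legal_moves, epsilon=0.1):
    # Epsilon-greedy move selection over the legal cells.
    if random.random() < epsilon:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda a: q_table[state][a])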

I assume you are familiar with:

  • The rules and basic strategy of playing Tic Tac Toe.
  • Basic Python 3 programming and use of a Python IDE or Jupyter Notebooks.
  • At least rudimentary knowledge of TensorFlow and Neural Networks would be helpful, but you might be able to do without it (give it a try, and if it's too overwhelming, do some of the beginner tutorials and then try again).

tic-tac-toe's People

Contributors

fcarsten


tic-tac-toe's Issues

Board.py -> WIN_CHECK_DIRS

Hi,

WIN_CHECK_DIRS is confusing to me. Could you provide some explanation of what is going on with this constant, please?

Thanks
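
For context, a constant like WIN_CHECK_DIRS typically encodes, for each starting cell, the directions in which a line of three has to be checked. The following is only an illustrative sketch of that idea for a 3x3 board, not the actual definition in Board.py:

# Illustrative sketch only -- not the actual definition in Board.py.
# For each starting cell (row, col) we list the step directions (drow, dcol)
# of every winning line that begins at that cell. Together these eight
# entries cover all eight winning lines of a 3x3 board.
WIN_CHECK_DIRS_SKETCH = {
    (0, 0): [(0, 1), (1, 0), (1, 1)],   # top row, left column, main diagonal
    (0, 1): [(1, 0)],                   # middle column
    (0, 2): [(1, 0), (1, -1)],          # right column, anti-diagonal
    (1, 0): [(0, 1)],                   # middle row
    (2, 0): [(0, 1)],                   # bottom row
}

def is_winning_line(board, start, direction, player):
    # True if `player` occupies the three cells reached by starting at
    # `start` and stepping twice in `direction`.
    (row, col), (drow, dcol) = start, direction
    return all(board[row + i * drow][col + i * dcol] == player for i in range(3))

def has_won(board, player):
    return any(is_winning_line(board, start, d, player)
               for start, dirs in WIN_CHECK_DIRS_SKETCH.items()
               for d in dirs)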

calculate_targets SimpleNNQPlayer.py

Hi,

Me again! I have a query about the calculate_targets method in the SimpleNNQPlayer class.

It seems to me that we are only really teaching the model about final moves.

In TQPlayer, when we loop through the moves at the end of the game, we reverse the game moves and apply the formula to the final move first, setting its new Q value. Then we set the next_max to be used on the next iteration of the loop, i.e. for the previous move. Here it makes sense that we use the reward discount to reduce the penalty/reward in steps, and so on back to the first move.
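
In rough pseudocode, the backward pass described above looks something like this (placeholder names, not TQPlayer's exact code):

def backfill_q_values(q_table, move_log, final_reward,
                      learning_rate=0.9, reward_discount=0.95):
    # move_log: list of (state, action) pairs in the order they were played.
    # Walk the game backwards: the final move is updated from the game result,
    # and each earlier move is updated from the (discounted) best value of the
    # state that followed it.
    target = final_reward
    for state, action in reversed(move_log):
        old_value = q_table[state][action]
        q_table[state][action] = old_value + learning_rate * (target - old_value)
        # The best value now stored for this state becomes the (discounted)
        # target for the move that preceded it.
        target = reward_discount * max(q_table[state])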

However, when we call calculate_targets for the NNQ player, the update is:

target[self.action_log[i]] = self.reward_discount * self.next_max_log[i]

self.next_max_log[i] is just some arbitrary value given to us by a naive neural network, and only on the final iteration of the loop, where self.next_max_log[i] is the actual reward/punishment, does this calculation seem to make sense. The next_max values aren't related to the result of the game.
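
For reference, the kind of target construction being discussed amounts to something like the following sketch (attribute names taken from the line quoted above; this is a simplification, not the full file):

import numpy as np

def calculate_targets_sketch(values_log, action_log, next_max_log, reward_discount=0.95):
    # values_log:   the Q-value vector the network produced for each visited state
    # action_log:   the index of the move actually played in each of those states
    # next_max_log: the max Q-value of the following state; on the final move this
    #               entry holds the actual game reward/punishment, which early in
    #               training is the only number directly tied to the game result
    targets = []
    for i, q_values in enumerate(values_log):
        target = np.copy(q_values)                     # leave unplayed actions untouched
        target[action_log[i]] = reward_discount * next_max_log[i]
        targets.append(target)
    return targets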

It feels to me like we should be doing something here more akin to what is going on in the TabQ player reward mechanism?

Thanks for reading
-Paul

DeepExpDoubleDuelQPlayer -> Unstable Learning

Hey,

First of all I would like to thank you on a very detailed tutorial on deep reinforcement learning.

Okay, so here is what happened: I followed your guidelines and coded a network very similar to yours in PyTorch. During training it did well for a while, but all of a sudden it seemed to forget what it had learnt, leading to 100% losses, then it learned again, then collapsed to 100% losses again, and the cycle continues.

Figure_1 (trained against a MinMax agent)

My configs:

  • Network architecture same as yours
  • I didn't use one-hot encoding
  • Updated my target network every 5 games (I tried changing it to 4, 5, 6; it didn't make a difference)
  • lr = 0.001, PreTrainingGames = 3000 (also tried 500 and 1000), Epsilon = 1, Discount = 0.99, BatchSize = 66

Have you encountered such a pattern during your training process? If so, what precautions have you taken to avoid this kind of 'catastrophic forgetting'?

Also, I couldn't understand what 'tau' was in updating your target network. Aren't you supposed to update based on the number of episodes (/games)? In yours it was a float factor. Could you provide some insight into this? Thanks a lot.
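
For context, a float 'tau' usually points at a soft (Polyak-averaged) target-network update rather than a hard copy every N games. A minimal PyTorch-style sketch of the two variants (illustrative only, not the tutorial's actual code):

import torch
import torch.nn as nn

def soft_update(online_net: nn.Module, target_net: nn.Module, tau: float) -> None:
    # Soft (Polyak) update: target <- tau * online + (1 - tau) * target.
    # With a small tau the target network drifts slowly towards the online
    # network on every training step instead of being replaced wholesale.
    with torch.no_grad():
        for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
            target_p.data.copy_(tau * online_p.data + (1.0 - tau) * target_p.data)

def hard_update(online_net: nn.Module, target_net: nn.Module) -> None:
    # Hard update: copy all weights at once, e.g. every N games/episodes.
    target_net.load_state_dict(online_net.state_dict())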

regards,
Nikhil

Can I use this on Colab

I installed the package with "!pip install tic_tac_toe", but tic_tac_toe.Board is not found.
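
One thing worth checking: the tic_tac_toe package on PyPI is not necessarily this project, so pip-installing it won't give you this repository's Board class. A rough sketch of how one could run the code on Colab instead, assuming the repository URL below (inferred from the contributor listed above) and the in-repo package layout are correct:

# Run in a Colab cell. Assumption: the project lives at the URL below and
# exposes the tic_tac_toe package (with Board.py) at the repository root.
!git clone https://github.com/fcarsten/tic-tac-toe.git
%cd tic-tac-toe

from tic_tac_toe.Board import Board   # the import the question above expects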
