
tic-tac-toe's Introduction

Tic Tac Toe

A tale about trying to train a machine to play Tic Tac Toe through Reinforcement Learning

To run the Jupyter notebooks in Binder, click the Binder badge.

The goal of this series is to implement and test a couple of different approaches to training a computer to play Tic Tac Toe. We will create:

  • A player that plays completely randomly,
  • Two players that implement simple forms of the Min-Max algorithm,
  • Several players that we will train through Reinforcement Learning:
    • a Tabular Q-Learning player (the update rule it relies on is sketched right after this list).
    • a Simple Neural Network Q-Learning player.
    • a Deep Neural Network Q-Learning player.
    • a Policy Gradient Descent-based player.
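
As a preview of the tabular approach, here is a minimal sketch of the classic Q-learning update the Tabular player is built around. It is illustrative only: the names and hyper-parameter values below are placeholders, not this repository's actual API.

import random
from collections import defaultdict

LEARNING_RATE = 0.9     # alpha: how strongly new information overrides the old value
REWARD_DISCOUNT = 0.95  # gamma: how much future reward is worth right now

# Q-table: maps a (hashable) board state to one value per cell (0-8).
q_table = defaultdict(lambda: [0.5] * 9)

def update_q(state, action, reward, next_state, done):
    # One Q-learning update for a single (state, action) pair.
    old_value = q_table[state][action]
    next_max = 0.0 if done else max(q_table[next_state])
    target = reward + REWARD_DISCOUNT * next_max
    q_table[state][action] = old_value + LEARNING_RATE * (target - old_value)

def choose_move(state, legal_moves, epsilon=0.1):
    # Epsilon-greedy move selection over the legal cells.
    if random.random() < epsilon:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda a: q_table[state][a])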

I assume you are familiar with:

  • The rules and basic strategy of playing Tic Tac Toe.
  • Basic Python 3 programming and use of a Python IDE or Jupyter Notebooks.
  • At least rudimentary knowledge of TensorFlow and Neural Networks would be helpful, but you might be able to do without it (give it a try, and if it's too overwhelming, do some of the beginner tutorials and then try again).

tic-tac-toe's People

Contributors

fcarsten


tic-tac-toe's Issues

Board.py -> WIN_CHECK_DIRS

Hi,

WIN_CHECK_DIRS is confusing to me. Could you provide some explanation of what is going on with this constant, please?

Thanks
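
For context, a constant like WIN_CHECK_DIRS typically encodes, for each starting cell, the directions in which a line of three has to be checked. The following is only an illustrative sketch of that idea for a 3x3 board, not the actual definition in Board.py:

# Illustrative sketch only -- not the actual definition in Board.py.
# For each starting cell (row, col) we list the step directions (drow, dcol)
# of every winning line that begins at that cell. Together these eight
# entries cover all eight winning lines of a 3x3 board.
WIN_CHECK_DIRS_SKETCH = {
    (0, 0): [(0, 1), (1, 0), (1, 1)],   # top row, left column, main diagonal
    (0, 1): [(1, 0)],                   # middle column
    (0, 2): [(1, 0), (1, -1)],          # right column, anti-diagonal
    (1, 0): [(0, 1)],                   # middle row
    (2, 0): [(0, 1)],                   # bottom row
}

def is_winning_line(board, start, direction, player):
    # True if `player` occupies the three cells reached by starting at
    # `start` and stepping twice in `direction`.
    (row, col), (drow, dcol) = start, direction
    return all(board[row + i * drow][col + i * dcol] == player for i in range(3))

def has_won(board, player):
    return any(is_winning_line(board, start, d, player)
               for start, dirs in WIN_CHECK_DIRS_SKETCH.items()
               for d in dirs)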

calculate_targets SimpleNNQPlayer.py

Hi,

Me again! I have a query about the calculate_targets method in the SimpleNNQPlayer class.

It seems to me that we are only really teaching the model about final moves.

In TQPlayer, when we loop through the moves at the end of the game, we reverse the game moves and apply the formula to the final move first, setting its new Q value. Then we set the next_max to be used on the next iteration of the loop, i.e. for the previous move. Here it makes sense that we use the reward discount to reduce the penalty/reward in steps, and so on back to the first move.
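
In rough pseudocode, the backward pass described above looks something like this (placeholder names, not TQPlayer's exact code):

def backfill_q_values(q_table, move_log, final_reward,
                      learning_rate=0.9, reward_discount=0.95):
    # move_log: list of (state, action) pairs in the order they were played.
    # Walk the game backwards: the final move is updated from the game result,
    # and each earlier move is updated from the (discounted) best value of the
    # state that followed it.
    target = final_reward
    for state, action in reversed(move_log):
        old_value = q_table[state][action]
        q_table[state][action] = old_value + learning_rate * (target - old_value)
        # The best value now stored for this state becomes the (discounted)
        # target for the move that preceded it.
        target = reward_discount * max(q_table[state])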

However, when we call calculate_targets for the NNQ player, the update is:

target[self.action_log[i]] = self.reward_discount * self.next_max_log[i]

self.next_max_log[i] is just some arbitrary value given to us by a naive neural network, and only on the final iteration of the loop, where self.next_max_log[i] is the actual reward/punishment, does this calculation seem to make sense. The next_max values aren't related to the result of the game.
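
For reference, the kind of target construction being discussed amounts to something like the following sketch (attribute names taken from the line quoted above; this is a simplification, not the full file):

import numpy as np

def calculate_targets_sketch(values_log, action_log, next_max_log, reward_discount=0.95):
    # values_log:   the Q-value vector the network produced for each visited state
    # action_log:   the index of the move actually played in each of those states
    # next_max_log: the max Q-value of the following state; on the final move this
    #               entry holds the actual game reward/punishment, which early in
    #               training is the only number directly tied to the game result
    targets = []
    for i, q_values in enumerate(values_log):
        target = np.copy(q_values)                     # leave unplayed actions untouched
        target[action_log[i]] = reward_discount * next_max_log[i]
        targets.append(target)
    return targets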

It feels to me like we should be doing something here more akin to what is going on in the TabQ player reward mechanism?

Thanks for reading
-Paul

DeepExpDoubleDuelQPlayer -> Unstable Learning

Hey,

First of all I would like to thank you on a very detailed tutorial on deep reinforcement learning.

Okay, so here is what happened: I followed your guidelines and coded a network very similar to yours in PyTorch. During training it did well for a while, but all of a sudden it seemed to forget what it had learnt, leading to 100% losses, then it learned again, then collapsed to 100% losses again, and the cycle continues.

Figure_1 (trained against a MinMax agent)

My configs:

  • Network architecture same as yours
  • I didn't use one-hot encoding
  • Updated my target network every 5 games (I tried changing it to 4, 5, 6; it didn't make a difference)
  • lr = 0.001, PreTrainingGames = 3000 (also tried 500 and 1000), Epsilon = 1, Discount = 0.99, BatchSize = 66

Have you encountered such a pattern during your training process? If so, what precautions have you taken to avoid this kind of 'catastrophic forgetting'?

Also, I couldn't understand what 'tau' was in updating your target network. Aren't you supposed to update based on the number of episodes (/games)? In yours it was a float factor. Could you provide some insight into this? Thanks a lot.
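
For context, a float 'tau' usually points at a soft (Polyak-averaged) target-network update rather than a hard copy every N games. A minimal PyTorch-style sketch of the two variants (illustrative only, not the tutorial's actual code):

import torch
import torch.nn as nn

def soft_update(online_net: nn.Module, target_net: nn.Module, tau: float) -> None:
    # Soft (Polyak) update: target <- tau * online + (1 - tau) * target.
    # With a small tau the target network drifts slowly towards the online
    # network on every training step instead of being replaced wholesale.
    with torch.no_grad():
        for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
            target_p.data.copy_(tau * online_p.data + (1.0 - tau) * target_p.data)

def hard_update(online_net: nn.Module, target_net: nn.Module) -> None:
    # Hard update: copy all weights at once, e.g. every N games/episodes.
    target_net.load_state_dict(online_net.state_dict())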

regards,
Nikhil

Can I use this on Colab

I installed the package with "!pip install tic_tac_toe", but tic_tac_toe.Board is not found.
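
One thing worth checking: the tic_tac_toe package on PyPI is not necessarily this project, so pip-installing it won't give you this repository's Board class. A rough sketch of how one could run the code on Colab instead, assuming the repository URL below (inferred from the contributor listed above) and the in-repo package layout are correct:

# Run in a Colab cell. Assumption: the project lives at the URL below and
# exposes the tic_tac_toe package (with Board.py) at the repository root.
!git clone https://github.com/fcarsten/tic-tac-toe.git
%cd tic-tac-toe

from tic_tac_toe.Board import Board   # the import the question above expects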
