CodeNames AI

This is a project to study the performance of some standard natural-language and machine-learning algorithms for replacing human players in the CodeNames game.

Requirements

Code is currently developed and tested using Python 2.7. The following external packages are required, and all are available via pip:

  • pywikibot: interact with the Wikimedia API.
  • nltk: the Natural Language Toolkit.
  • gensim: learn word-vector embeddings from a large text corpus.

After nltk is installed, you will need to download the following data files:

  • Models / punkt (13.0 MB)
  • Corpora / wordnet (10.3 MB)

Use the following Python commands to open the NLTK download dialog:

import nltk
nltk.download()
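
The same data can also be fetched non-interactively; the package identifiers below should correspond to the dialog entries listed above:

import nltk
nltk.download('punkt')    # Models / punkt: sentence tokenizer
nltk.download('wordnet')  # Corpora / wordnet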

Knowledge Base

The AI learns from a large set of sentences called the "corpus". Build the corpus by first selecting articles from the English Wikipedia related to each word with:

./create_corpus_index.py

This reads the list of game words from words.txt and writes an index file corpus/<Word>.index for each one, containing Wikipedia page titles (UTF-8 encoded). The script aims for the same number of articles (nominally 10K) for each word, but some words might have fewer if not enough articles can be found. I had to restart this process a few times due to uncaught pywikibot.data.api.APIError exceptions (TODO: implement automatic recovery from these). The whole index-creation step takes about 6 hours.
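
Until automatic recovery is implemented, one possible workaround is to wrap the API calls in a retry loop. This is only a sketch, not code from this repository; build_index_for is a hypothetical stand-in for the per-word indexing logic:

import time
from pywikibot.data.api import APIError

def with_retries(action, max_tries=5, delay=60):
    # Retry an action that might raise a transient APIError.
    for attempt in range(max_tries):
        try:
            return action()
        except APIError as e:
            print('APIError: {0}; retrying in {1}s...'.format(e, delay))
            time.sleep(delay)
    raise RuntimeError('Giving up after {0} tries.'.format(max_tries))

# Usage (hypothetical helper):
# index = with_retries(lambda: build_index_for(word))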

The next step is to download the content of all articles using:

./fetch_corpus_text.py

This step runs in parallel (20 processes by default) since it is I/O-bound. If it fails for some reason, it can simply be restarted and will automatically skip any words that have already been fetched. The goal of the fetching step is to download ~5M characters of plain (Unicode) text for each code word. This does not require downloading all of the indexed articles, so articles are processed in a random (but reproducible) order until at least 5M characters have been accumulated.
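
A minimal sketch of this restartable, parallel fetch pattern; fetch_one_word is a hypothetical stand-in for the per-word download logic:

import os
import multiprocessing

def fetch_one_word(word):
    # Skip words whose text has already been fetched, so the
    # whole step can be safely restarted after a failure.
    out_name = os.path.join('corpus', word + '.txt.gz')
    if os.path.exists(out_name):
        return word, 'skipped'
    # ...shuffle the indexed titles with a fixed seed, then download
    # plain text until at least 5M characters are accumulated...
    return word, 'fetched'

if __name__ == '__main__':
    words = [line.strip() for line in open('words.txt')]
    # The work is I/O-bound, so many worker processes help.
    pool = multiprocessing.Pool(processes=20)
    for word, status in pool.imap_unordered(fetch_one_word, words):
        print('{0}: {1}'.format(word, status))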

The final step is to preprocess the downloaded text into a format suitable for feeding directly into word2vec training:

  • Split into sentences, one per line.
  • Remove markup headings (e.g., "== References ==").
  • Split into word tokens.
  • Combine compound words from the word list ("ice cream" -> "ice_cream").
  • Remove punctuation.
  • Convert to lower case.

This step is performed using:

./preprocess_corpus.py

which converts each corpus/Word.txt.gz into a corresponding corpus/Word.pre.gz. Processing runs in a single process since it is relatively fast (~90 minutes) and this simplifies collecting the summary statistics written to the freqs.dat output file. A sketch of the individual transformations follows.
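
For reference, here is a minimal sketch of the per-sentence transformations using nltk's punkt tokenizer. The COMPOUNDS mapping is an illustrative stand-in for compounds derived from words.txt, and joining them on the raw sentence (rather than on token pairs) is a simplification:

import re
import string
from nltk.tokenize import sent_tokenize, word_tokenize

COMPOUNDS = {'ice cream': 'ice_cream'}  # derived from words.txt in practice
HEADING = re.compile(r'^=+ .* =+$')     # markup headings, e.g. "== References =="

def preprocess(text):
    # Yield one cleaned, tokenized sentence per output line.
    for sentence in sent_tokenize(text):
        if HEADING.match(sentence.strip()):
            continue  # drop markup headings.
        for raw, joined in COMPOUNDS.items():
            sentence = sentence.replace(raw, joined)  # combine compound words.
        tokens = [t.lower() for t in word_tokenize(sentence)
                  if t not in string.punctuation]
        yield ' '.join(tokens)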

Machine Learning

Training iteratively finds an embedding of all words into a 300-dimensional vector space that captures important aspects of their semantic relationships. Run it with:

./learn.py

In practice, this takes a while (~30 hours total), so the work is split into five consecutive passes, each using 20 worker processes (start each pass only after the previous one has finished):

nohup ./learn.py --workers 20 --npass 1 > learn1.log &
nohup ./learn.py --workers 20 --npass 2 > learn2.log &
nohup ./learn.py --workers 20 --npass 3 > learn3.log &
nohup ./learn.py --workers 20 --npass 4 > learn4.log &
nohup ./learn.py --workers 20 --npass 5 > learn5.log &

Each pass starts with a random shuffle of the corpus followed by 5 epochs of training, with a learning rate that decreases linearly from 0.0251 to 0.0001 over the 5 passes. The output from each pass consists of four files (with N = 1-5):

  • word2vec.dat.N
  • word2vec.dat.N.syn0.npy
  • word2vec.dat.N.syn1.npy
  • word2vec.dat.N.syn1neg.npy

The first file contains the vocabulary words and corresponding embedded vectors. The last three files contain the neural network weights.
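
For orientation, here is a minimal sketch of the training core using the gensim API of that era. The parameter values are illustrative rather than the exact settings used by learn.py, the input file name is hypothetical, and the per-pass learning-rate schedule across five runs is omitted:

from gensim.models.word2vec import Word2Vec, LineSentence

# Stream preprocessed sentences, one per line.
sentences = LineSentence('corpus/shuffled.pre')
model = Word2Vec(sentences, size=300, workers=20, iter=5,
                 alpha=0.025, min_alpha=0.0001,
                 hs=1, negative=5)  # the .syn1 and .syn1neg outputs suggest both
                                    # hierarchical softmax and negative sampling
model.save('word2vec.dat.1')  # large arrays are saved as separate .npy files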

After training, run the evaluation suite to assess embedding quality:

./evaluate.py -i word2vec.dat.4 --top-singles 10 --top-pairs 10 --save-plots
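
Independently of the evaluation script, a trained model can be spot-checked directly with gensim; the query word below is just an example:

from gensim.models.word2vec import Word2Vec

model = Word2Vec.load('word2vec.dat.4')
# Nearest neighbors in the embedding are natural clue candidates.
for word, similarity in model.most_similar('ice_cream', topn=5):
    print('{0:20s} {1:.3f}'.format(word, similarity))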

Play

Finally, try playing the game with an AI spymaster for both teams using:

./play.py --config CHCH --seed 123
