dkirkby / codenames
AI for CodeNames
License: MIT License
Contents would include:
corpus_name: corpus
word_list: words.dat
embedding: word2vec.dat
Any others?
This would replace the default random assignment of cards and has two purposes:
The pre-processed corpus is in unicode after #8 (with utf8 encoding by default), which means that clues are now given in unicode. The new engine.say() function now handles encoding of all game output, including clues, but we now have a mixture of byte and unicode literals in engine.py that would be much cleaner in python3. This issue is to perform and test the migration.
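The encoding split can be sketched like this; `say` here is a stand-in written from the description above, not the actual function in engine.py:

```python
# Sketch of a say() helper per the description above: under python2 a
# unicode clue must be encoded (utf8 by default) before writing; under
# python3 every str is already unicode and the interpreter handles the
# encoding of stdout.
import sys

def say(message, encoding='utf8'):
    if sys.version_info[0] < 3 and isinstance(message, unicode):  # noqa: F821
        message = message.encode(encoding)
    sys.stdout.write(message + '\n')
    return message  # returned only to make the helper testable
```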
The AI is getting a bit slow, especially with the larger corpus. This issue is to run a profile and see if there are any obvious bottlenecks that would be easy to improve.
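The stdlib profiler is an easy first step for this; the function profiled below is a stand-in, not the real clue-search entry point.

```python
# Profile a slow call with cProfile and print the top entries by
# cumulative time; clue_search_stand_in is a placeholder for whatever
# the profile shows to be the real hot spot.
import cProfile
import io
import pstats

def clue_search_stand_in():
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
clue_search_stand_in()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()  # top 10 entries by cumulative time
```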
The corpus is currently scraped from a list of wikipedia articles that is automatically derived from the word list, but the resulting coverage (the number of times each word appears) is quite uneven. This issue is to split the work now done by build_corpus.py into two tasks, with the results kept in the corpus/ directory. The motivation for this split is to allow the first step to be improved without needing to run the second step (which takes most of the time). An added benefit is that the second step could be easily parallelized since it is limited by network IO, not CPU.
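Since the second step is network-bound, a thread pool would be enough to parallelize it; `fetch_article` below is a hypothetical stand-in for the real wikipedia call.

```python
# Sketch of parallelizing the network-bound fetch step with a thread
# pool; threads help here only because the work is network IO, not CPU.
# fetch_article is a made-up placeholder for the real fetch code.
from concurrent.futures import ThreadPoolExecutor

def fetch_article(title):
    # the real code would request the article text here
    return 'text of %s' % title

titles = ['NIGHT', 'BOW', 'SATURN']
with ThreadPoolExecutor(max_workers=8) as pool:
    texts = list(pool.map(fetch_article, titles))  # preserves input order
```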
The goal of the improvements to the first step is to:
The random board is currently created treating all 400 words as independent. However, the physical game has 200 double-sided cards, which means there are 200 pairs of words that can never appear together. Is it important to reproduce this?
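If it is worth reproducing, the change is small: sample 25 physical cards and then one visible side per card. The pairing below is made up for illustration; the real pairing would come from the printed deck.

```python
# Sketch of dealing a board from 200 double-sided cards instead of 400
# independent words, so the two words on one card can never appear
# together. The card pairings here are invented placeholders.
import random

def deal_board(cards, board_size=25, rng=random):
    chosen = rng.sample(cards, board_size)          # 25 distinct cards
    return [rng.choice(sides) for sides in chosen]  # one side of each

cards = [('WORD_%d_A' % i, 'WORD_%d_B' % i) for i in range(200)]
board = deal_board(cards)
```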
This would primarily be to allow beta testers to interact with the AI, so we can work out the bugs and tune parameters. The simplest starting point would be to host a single game at a time on a service like twitter. Turns could be timed so that multiple people can guess each word, then the majority choice goes to the AI.
When I run ./fetch_corpus_text.py, I always get the following two error messages and the run is interrupted; running it again does not solve the problem:
Unexpected Error:: No JSON object could be decoded
Unexpected Error:: HTTPConnectionPool(host='en.wikipedia.org', port=80): Max retries exceeded with url...
If I continue with the next steps, I get the error
RuntimeError: you must first build vocabulary before training the model
during training, which I think is caused by the above problem. How can I solve this? Any help would be greatly appreciated.
When some words were not guessed in a previous round, the clue number provided with subsequent clues should generally be increased to allow old clues to be used to guess extra words. The simplest implementation would be to keep a count of un-guessed words N and increase the count for new clues by (N-1). Do we need anything more sophisticated?
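The simplest implementation above amounts to a one-line adjustment; the helper name is made up for illustration.

```python
# Literal sketch of the rule described above: with N words still
# un-guessed from earlier clues, bump the number attached to the new
# clue by N - 1. adjusted_clue_number is a hypothetical name.
def adjusted_clue_number(base_number, unguessed):
    return base_number + max(unguessed - 1, 0)
```

For example, a new 2-word clue given while 3 earlier targets remain un-guessed would be announced as 2 + (3 - 1) = 4.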
The current implementation of model.get_clue(...) is too simple and ad hoc. Visible words are now divided into four groups:
A clue is only accepted if it is a better match for all of the clue words than any of the other team's or neutral words. It must also be better than the assassin word by some margin.
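That acceptance rule can be written down directly; the similarity function and the margin value below are placeholders, not the repo's actual code.

```python
# Sketch of the acceptance rule just described: the clue's weakest match
# among its target words must beat its best match among the other
# team's and neutral words, and beat the assassin by a margin.
def acceptable(similarity, clue, targets, others, assassin, margin=0.1):
    worst_target = min(similarity(clue, w) for w in targets)
    best_other = max(similarity(clue, w) for w in others)
    return (worst_target > best_other and
            worst_target > similarity(clue, assassin) + margin)
```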
Possible improvements:
This can happen especially if the first guess was wrong in the previous round, because the words left to guess are then the same.
I propose we keep a list of previous clues and remove them from future choices.
NIGHT BOW SATURN HAM <<<<<<<<<<<<
SCUBA_DIVER <<<<<<<<<<<< RABBIT TURKEY CRICKET
ICE_CREAM MICROSCOPE MEXICO BILL KID
STREAM LAP HOLE BOTTLE ANTARCTICA
LINE QUEEN COMIC ICE BACK
>>> your clue is: england 3
>>> enter your guess #1: queen
>>> Sorry!
(...snip the other team's round...)
<NIGHT <BOW -saturn >HAM <<<<<<<<<<<<
<SCUBA_DIVER <<<<<<<<<<<< >RABBIT -turkey >CRICKET
>ICE_CREAM >MICROSCOPE >MEXICO <BILL #KID
<STREAM -lap -hole -bottle -antarctica
>LINE <<<<<<<<<<<< >>>>>>>>>>>> <ICE >BACK
Thinking...
0.819 HAM + CRICKET + MEXICO = england
0.697 CRICKET = first-class
0.672 MEXICO = mexican
0.661 HAM + CRICKET + LINE = paddington
...
NIGHT BOW SATURN HAM <<<<<<<<<<<<
SCUBA_DIVER <<<<<<<<<<<< RABBIT TURKEY CRICKET
ICE_CREAM MICROSCOPE MEXICO BILL KID
STREAM LAP HOLE BOTTLE ANTARCTICA
LINE <<<<<<<<<<<< >>>>>>>>>>>> ICE BACK
>>> your clue is: england 3
>>> enter your guess #1:
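The proposed fix amounts to a small filter over the ranked candidates; the function name and the candidate format below are assumptions.

```python
# Sketch of the proposal: remember every clue already given and skip it
# when choosing from future ranked candidates (best score first).
def pick_clue(ranked_candidates, previous_clues):
    for clue, score in ranked_candidates:
        if clue not in previous_clues:
            previous_clues.add(clue)
            return clue, score
    return None  # every candidate has already been used
```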
$ ./evaluate.py -i word2vec.dat.1 --top-singles 30 --top-pairs 30 --save-plots 1
Traceback (most recent call last):
File "./evaluate.py", line 88, in <module>
main()
File "./evaluate.py", line 39, in main
clues = embedding.get_clues((word), (word))
AttributeError: 'WordEmbedding' object has no attribute 'get_clues'
Modifying the call from get_clues() to get_clue() doesn't work either because the number of arguments doesn't match.
I just realized that I am using the same learning rates for each pass, i.e., the default linear decrease from 0.025 to 0.0001 over 10 iterations. This means that there is no benefit from additional passes with the current implementation!
The advice here is that there is no benefit beyond 20-30 total iterations, so aim for 25 in 5 passes, n=1-5, with:
alpha = 0.025 - (n-1) * 0.005 + 0.0001
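Taken literally, that formula gives each pass a fixed starting rate; a quick check of the numbers (the helper name is made up):

```python
# Starting learning rate for pass n (1-based) under the schedule above:
# alpha = 0.025 - (n-1) * 0.005 + 0.0001, for 5 passes of 5 iterations.
def pass_alpha(n):
    return 0.025 - (n - 1) * 0.005 + 0.0001

schedule = [pass_alpha(n) for n in range(1, 6)]
```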
When running fetch_corpus_text.py, each process starts fetching the first few articles it has been assigned, but after a few dozen it stalls and never ends. The article where it happens changes each time.
Things I've tried without success:
- Calling fetch() for each clue word serially: it always stalls somewhere in the fetching of the articles for Africa (the first clue alphabetically).
- Adding time.sleep(0.5) after each article is fetched, to be nicer to the system: the same problem happens.
Most of the time, it happens in the request sent by content = page.content (line 70), but I've seen it blocked once in the request sent by page = wikipedia.page(...) (line 68).
Debugging is not practical since the article for which it will happen doesn't seem deterministic.
I've added logging like this:
from httplib import HTTPConnection
HTTPConnection.debuglevel = 1
logging.basicConfig() # you need to initialize logging, otherwise you will not see anything from requests
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
This produces a lot of logs, most of which I skip, but here is the point where it stalls:
DEBUG:requests.packages.urllib3.connectionpool:"GET /w/api.php?inprop=url&redirects=&format=json&ppprop=disambiguation&prop=info%7Cpageprops&titles=1930+FIFA+World+Cup&action=query HTTP/1.1" 200 284
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): en.wikipedia.org
send: 'GET /w/api.php?format=json&rvprop=ids&prop=extracts%7Crevisions&titles=1930+FIFA+World+Cup&action=query&explaintext= HTTP/1.1\r\nHost: en.wikipedia.org\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: wikipedia (https://github.com/goldsmith/Wikipedia/)\r\n\r\n'
That last line (starting with "send:") is the end of the logs.
Strangely, I didn't encounter this problem with the old version of build_corpus.py, which used the same library in a very similar way.