A naive deep-learning implementation of the Quora question similarity problem.
The approach takes word2vec embeddings from spaCy and uses them to train a siamese network that predicts whether two questions are similar. The network learns a joint embedding by minimizing the distance between two similar questions while maximizing the distance between two unrelated questions.
The project includes a basic trainer file (python/example_trainer.py) as a proof of concept of what the model training workflow looks like.
To test whether two questions (strings) are similar, there is also a ready-to-use script (python/question_similarity.py). It uses a more robust model trained on an 80/20 split of the entire 400k dataset, with 25 iterations over the training data.
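The objective described above can be illustrated with a contrastive loss, a common choice for siamese networks. This is only a sketch: the source does not state which loss the project uses, and the function names, the Euclidean distance and the `margin` value are assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """Distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, is_similar, margin=1.0):
    """Loss for one pair of question embeddings (hypothetical helper).

    Similar pairs are penalized by their squared distance, so training
    pulls them together. Dissimilar pairs are penalized only while they
    are closer than `margin`, so training pushes them apart.
    """
    d = euclidean_distance(u, v)
    if is_similar:
        return d ** 2
    return max(margin - d, 0.0) ** 2
```

With this form, a dissimilar pair that is already further apart than `margin` contributes no loss, which keeps the network from pushing unrelated questions apart indefinitely.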
/quora-similar-questions:]$ python python/question_similarity.py "What are some examples of products that can be make from crude oil?" "What are some of the products made from crude oil?"
Yes
PS: wrap each question in quotes so that it is parsed as a single argument. The tfidf model learned during the full training run is also included, both so that the results can be reproduced and because the prediction stage uses it.
Also provided is a webserver script (python/webserver.py) that runs a local webserver to process curl requests from the command line.
In one terminal, start the server on localhost, port 8001:
[parvoberoi:~/Desktop/quora-similar-questions:]$ ./python/webserver.py 8001
In another terminal:
[parvoberoi:~:]$ curl -X POST 'http://localhost:8001/?sentence1=This%20is%20sentence%201.&sentence2=This%20is%20sentence%202.'
YES
[parvoberoi:~:]$ curl -X POST 'http://localhost:8001/?sentence1=Should%20I%20buy%20tiago?&sentence2=What%20keeps%20childern%20active%20and%20far%20from%20phone%20and%20video%20games?'
NO
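The same requests can be sent from Python instead of curl. The sketch below uses only the standard library and assumes the server behaves exactly as in the transcripts above: it accepts a POST with `sentence1` and `sentence2` as query parameters and returns the answer as the response body. The function names are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_url(sentence1, sentence2, host="localhost", port=8001):
    """Build the request URL, percent-encoding both sentences."""
    query = urlencode(
        {"sentence1": sentence1, "sentence2": sentence2},
        quote_via=quote,  # encode spaces as %20, as in the curl examples
    )
    return "http://{}:{}/?{}".format(host, port, query)


def ask_similarity(sentence1, sentence2, host="localhost", port=8001):
    """POST the two sentences to a running server and return its reply."""
    request = Request(build_url(sentence1, sentence2, host, port),
                      data=b"", method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8").strip()
```

Building the query with `urlencode` also escapes characters such as `?` inside a question, which have to be encoded by hand when typing the curl command.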
The simple siamese network with very basic tfidf pre-processing achieved an accuracy of 79.8% on a random 80/20 split of the Quora data.
- preprocess the data to get rid of punctuation marks
- try converting common words (What, Which, Where, How, etc.) to lowercase, as the tfidf model currently assigns different weights to uppercase and lowercase instances
- investigate getting rid of stop words, as well as expanding apostrophe words (contractions) to their normal forms
- investigate augmenting the training data by randomly pairing two unrelated questions and labelling them as not similar
- run a parameter sweep over the various NN architectural params to find an optimal configuration
- experiment with generic word2vec embeddings from Wikipedia instead of ones built from the question dataset
- build a standalone executable of the python predictor with all packages included, to make predictions easier to run
- You will need keras 2.1.6, numpy 1.13.3, tensorflow 1.9.0 and pandas 0.18.0
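The text preprocessing ideas listed above (removing punctuation, lowercasing, expanding contractions, dropping stop words) could look roughly like this. It is a sketch of a possible approach, not code from the project: `normalize_question` is a hypothetical helper, and the contraction and stop-word tables are tiny illustrative samples.

```python
import string

# Illustrative samples only; a real run would need fuller tables.
CONTRACTIONS = {"what's": "what is", "don't": "do not",
                "i'm": "i am", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "some"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_question(text, drop_stop_words=False):
    """Lowercase, expand known contractions and strip punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)   # "oil?" becomes "oil"
        word = CONTRACTIONS.get(word, word)    # "what's" becomes "what is"
        word = word.translate(_PUNCT_TABLE)    # drop punctuation left inside
        tokens.extend(word.split())            # an expansion may add words
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)
```

Applying the same function to questions at both training and prediction time would keep the tfidf weights consistent. Note that only ASCII punctuation is handled, so typographic apostrophes would pass through unchanged.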