Deep learning and doc2vec exploration of the Enron email dataset: https://www.cs.cmu.edu/~enron/
Expects the directory ./maildir to hold the unzipped dataset from the above link.
TensorBoard logs and the neural network model, weights, loss, and accuracy files from training are stored under ./logs.
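As a quick check that the dataset was unpacked where the scripts expect it, the sketch below counts sent messages per user under ./maildir. It assumes the CMU archive's usual maildir/<user>/<folder>/<message> layout and that sent mail lives in folders named sent, sent_items, or _sent_mail. The function name count_sent_mail and the folder set are illustrative and are not taken from this repository's scripts.

```python
import os
from collections import Counter

# Assumed names of sent-mail folders in the CMU archive (hypothetical set;
# adjust to match what the training scripts actually read).
SENT_FOLDERS = {"sent", "sent_items", "_sent_mail"}


def count_sent_mail(root="./maildir"):
    """Return a Counter mapping each user directory to its number of sent messages."""
    if not os.path.isdir(root):
        raise FileNotFoundError(f"expected unzipped dataset at {root}")
    counts = Counter()
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for folder in SENT_FOLDERS:
            folder_dir = os.path.join(user_dir, folder)
            if not os.path.isdir(folder_dir):
                continue
            # Count every file under the folder, including nested subfolders.
            for _dirpath, _dirnames, filenames in os.walk(folder_dir):
                counts[user] += len(filenames)
    return counts
```

Calling count_sent_mail().most_common(5) lists the five users with the most sent mail. Users with no sent-mail folder do not appear in the result at all.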
To run:
- Download the dataset: https://www.cs.cmu.edu/~enron/
- python train_doc2vec.py (with should_create_data = True on first run)
- python inference_doc2vec.py (as a sanity check that doc2vec operates correctly)
- python train_nn.py (with should_aggregate_data = True on first run)
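The first-run flags suggest that the scripts cache preprocessed data so that later runs can skip that step. How the scripts actually do this is not documented here, so the following is only a sketch of the general pattern. The helper load_or_create, the pickle format, and the cache path are all hypothetical.

```python
import os
import pickle


def load_or_create(path, build_fn, should_create_data):
    """Rebuild and cache data when the flag is set or no cache exists; otherwise load the cache."""
    if should_create_data or not os.path.exists(path):
        data = build_fn()  # expensive step, e.g. parsing every email
        with open(path, "wb") as f:
            pickle.dump(data, f)
        return data
    with open(path, "rb") as f:
        return pickle.load(f)
```

With this shape, the flag only needs to be True on the first run or after the raw data changes.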
Future Work:
1. We currently attempt to use only sent mail from users. Add more emails besides these.
2. Due to (1), we ignore two users. Add these two users in.
3. Since the number of emails varies by user, try weighting each class (user) by their email usage.
4. Try Bayes
5. Try SVM
6. Try decision trees
7. Cross-validate. Similar to link (4) below.
8. Try using the weights matrix formed by doc2vec as the weights of the embedding layer in the neural network classifier. Similar in theory to link (2) below.
9. Visualize weights.
10. Play with hyperparameters of the doc2vec model. Goes along with (7) above.
11. Play with hyperparameters of the neural network classifier. Goes along with (7) above.
12. Try Kaggle's version of the dataset (may be cleansed/more uniform)
13. Try approaches covered under (1), (3)-(6) of the Auxiliary resources below
14. Try any missing approaches from (1) and (2) of the Primary resources below
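The class-weighting idea in the list above could start from inverse-frequency weights, where a user with fewer emails gets a proportionally larger weight. This is one common scheme, not something the repository currently implements, and the function name inverse_frequency_weights is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to total / (num_classes * class_count).

    A class with an average number of samples gets weight 1.0;
    rarer classes get weights above 1.0, more common ones below.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}
```

If the classifier is a Keras model (the resources below suggest so, but that is an assumption), a dictionary like this keyed by integer class index can be passed to fit through its class_weight argument.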
Primary Resources:
1. http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
2. https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
3. https://stackoverflow.com/questions/48842866/gensim-models-doc2vec-has-no-attribute-labeledsentence
4. https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/
5. https://stackoverflow.com/questions/46197493/using-gensim-doc2vec-with-keras-conv1d-valueerror
Auxiliary Resources:
1. https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html
2. https://www.kaggle.com/zichen/explore-enron/data
3. https://en.wikipedia.org/wiki/Word2vec#cite_note-doc2vec_java-11
4. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
5. https://medium.com/@williamkoehrsen/machine-learning-with-python-on-the-enron-dataset-8d71015be26d
6. https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1