$ wget http://nlp.stanford.edu/data/glove.6B.zip
$ unzip -q glove.6B.zip -d glove.6B
Download the Simple English Wikipedia corpus (corpus.tgz, which unpacks to corpus.txt) from https://github.com/LGDoor/Dump-of-Simple-English-Wiki/blob/master/corpus.tgz
$ cd datasets
$ tree
.
├── articles_dict
├── categories
├── cats_dict
├── corpus.txt
└── glove.6B
    ├── glove.6B.100d.txt
    ├── glove.6B.200d.txt
    ├── glove.6B.300d.txt
    └── glove.6B.50d.txt
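
A minimal sketch of loading one of the pre-trained GloVe files above into a dict (the one-token-plus-floats-per-line parsing matches the standard glove.6B layout; the function name and the 100d choice are ours):

    import numpy as np

    def load_glove(path):
        # glove.6B format: one token followed by its vector components per line
        embeddings = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return embeddings

    glove = load_glove("datasets/glove.6B/glove.6B.100d.txt")
    print(glove["history"].shape)  # (100,)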
An article belongs to many categories: articleId -> Set(cat1Id, cat2Id, ...)
The following approaches to flattening this mapping have been tried, but a better one is needed.
articleId -> cat1Id
articleId -> cat2Id
articleId -> ...
The issue is that accuracy is low, because the categories get mixed together.
articleId -> cat1Id
The issue is that we lose many connections.
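
A minimal sketch of both flattenings, with a hypothetical stand-in for the articleId -> Set(catIds) mapping:

    # Hypothetical stand-in for articleId -> Set(cat1Id, cat2Id, ...)
    article_cats = {1: {10, 20}, 2: {20}, 3: {10, 30}}

    # Approach 1: duplicate each article, one row per category (categories get mixed)
    pairs = [(aid, cid) for aid, cats in article_cats.items() for cid in sorted(cats)]
    # [(1, 10), (1, 20), (2, 20), (3, 10), (3, 30)]

    # Approach 2: keep a single category per article (many connections are lost)
    single = {aid: min(cats) for aid, cats in article_cats.items()}
    # {1: 10, 2: 20, 3: 10}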
- each word is mapped onto one of its closest neighbours with equal probability 0.5
- create a vocabulary dict of the 10,000 most frequent words (the size is arbitrary)
- initialize embeddings: create a 10,000 x 100 matrix (vocabulary size x embedding size) with random values from U(-1, 1)
- in each batch (size: 128) we take the embeddings of the words used in that batch
- softmax weights are initialized with mean 0 and sd 0.1; bias weights with 0
- the mean sampled softmax loss for the batch is calculated (sampled softmax: https://arxiv.org/pdf/1412.2007.pdf)
- the loss is optimized using the Adagrad optimizer (learning rate 1)
- after optimization the embeddings are normalized by dividing by their L2 norm (see the sketch below)
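
A minimal TensorFlow 1.x sketch of the training steps above, using the hyperparameters from the list; train_inputs/train_labels and num_sampled = 64 are assumptions, not the project's actual code:

    import tensorflow as tf  # TF 1.x API

    vocab_size, emb_size, batch_size, num_sampled = 10000, 100, 128, 64

    # (the 0.5-probability neighbour mapping is a preprocessing step, omitted here)
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

    # embeddings: vocab_size x emb_size matrix initialized from U(-1, 1)
    embeddings = tf.Variable(tf.random_uniform([vocab_size, emb_size], -1.0, 1.0))
    # take only the embeddings of the words used in this batch
    batch_embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # softmax weights: mean 0, sd 0.1; biases: 0
    softmax_w = tf.Variable(tf.truncated_normal([vocab_size, emb_size], stddev=0.1))
    softmax_b = tf.Variable(tf.zeros([vocab_size]))

    # mean sampled softmax loss over the batch (https://arxiv.org/pdf/1412.2007.pdf)
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        weights=softmax_w, biases=softmax_b,
        labels=train_labels, inputs=batch_embed,
        num_sampled=num_sampled, num_classes=vocab_size))

    # optimize with Adagrad, learning rate 1
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

    # after optimization: normalize embeddings by their L2 norm
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm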
Ideas for evaluating the embeddings:
- a few category groups, e.g. mathematicians, philosophers, animals, history
- mathematicians should be close to philosophers, and further from animals
- conversion: article (vector + category) -> distance between categories
- either feed these vectors into a NN
- or sum them and compute the distance between the vectors (see the sketch below)
- does adding new data lower quality?
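
A minimal sketch of the sum-and-compare option, assuming a hypothetical doc_vectors (articleId -> vector) alongside the article_cats mapping from above:

    import numpy as np

    def category_centroid(cat_id, doc_vectors, article_cats):
        # sum (here: average) the vectors of all articles tagged with this category
        vecs = [doc_vectors[aid] for aid, cats in article_cats.items() if cat_id in cats]
        return np.mean(vecs, axis=0)

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # e.g. mathematicians should be closer to philosophers than to animals:
    # cosine_distance(centroid(Mathematicians), centroid(Philosophers))
    #   < cosine_distance(centroid(Mathematicians), centroid(Animals))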
The following categories form the test set:

History 6602
Ancient_history 29636
War 38468

Animals 5861
Pets 14654
Domesticated_animals 33670

Mathematics 5195
Mathematicians 19894
Logic 41358

Philosophy 6536
Philosophers 5375
Ethics 25540
Expected results:
History -> Philosophers, Mathematicians
Philosophy -> Mathematicians