State-of-the-Art Language Modeling and Text Classification in Hindi Language
- We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared to 40.68 for English (lower is better)
- To the best of our knowledge, as of March 2, 2018
- Models and Word Embeddings
- Hindi Wikipedia, with about 21k unique tokens at minfreq = 50 (minimum token frequency)
- Processed Data
- Language modeling based on the Wikipedia dump
- Extract embeddings: convert to gensim's word2vec format and release here
- Figure out a word embedding evaluation task for Hindi
- Benchmark text classification with FastText
- Fine-tuning model for text classification
- Add a leaderboard and allow submission, similar to SQuAD
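
The embedding export step listed above can be sketched as follows. This is a minimal illustration, not the project's actual export script: the function name `write_word2vec_text`, the toy vocabulary, and the toy vectors are all hypothetical. It writes the plain-text word2vec layout, which is a header line with the vocabulary size and vector dimension, followed by one line per token with its space-separated values.

```python
import io


def write_word2vec_text(vocab, vectors, out):
    """Write tokens and their vectors to `out` in word2vec text format.

    Hypothetical helper for illustration. `vocab` is a list of tokens,
    `vectors` is a list of equal-length float lists, `out` is a text stream.
    """
    if len(vocab) != len(vectors):
        raise ValueError("vocab and vectors must have the same length")
    dim = len(vectors[0]) if vectors else 0
    # Header line: "<vocabulary size> <vector dimension>"
    out.write(f"{len(vocab)} {dim}\n")
    for token, vec in zip(vocab, vectors):
        if len(vec) != dim:
            raise ValueError(f"vector for {token!r} has the wrong dimension")
        if any(ch.isspace() for ch in token):
            # The text format is space-delimited, so such tokens cannot be stored.
            raise ValueError(f"token {token!r} contains whitespace")
        values = " ".join(f"{x:.6f}" for x in vec)
        out.write(f"{token} {values}\n")


# Toy data (assumed for illustration; real embeddings come from the trained model).
vocab = ["भारत", "हिंदी"]
vectors = [[0.1, -0.2, 0.3], [0.0, 0.5, -1.0]]

buf = io.StringIO()
write_word2vec_text(vocab, vectors, buf)
w2v_text = buf.getvalue()
print(w2v_text)
```

When writing to disk instead of an in-memory buffer, open the file with `encoding="utf-8"` so that Devanagari tokens are preserved. A file in this layout should be loadable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`.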
Special thanks to Jeremy, Rachel, and other contributors to fastai. This work reproduces for Hindi what they did for English. Thanks to @cstorm125 for thai2vec, which inspired this work.