This is a multi-class random forest based language detector that works by using words, bigrams and trigrams as features. Support is provided for 7 languages using the Latin Script (English, German, Portuguese, Spanish, French, Dutch, Italian). To optimize training and testing performance by reducing number of features,a corpus of 300,000 sentences for each language is leveraged from Leipzig Corpus and the 50 most frequent words, bigrams, and trigrams are shortlisted as features. The dataframe creation is slightly complicated, but it is highly vectorized to speed up performance. All train and test datapoints are then represented in the reduced feature-space. A model trained on 5,000 sentences from each language takes less than 2 minutes to train, and performs at 98% accuracy. To replicate the environment, please place the following data files sourced from this Google Drive directory, and assign that to 'dirname'
- deu_mixed-typical_2011_300K-sentences.txt
- eng_news_2005_300K-sentences.txt
- fra_mixed_2009_300K-sentences.txt
- ita_mixed-typical_2017_300K-sentences.txt
- nld_mixed_2012_300K-sentences.txt
- por_newscrawl_2011_300K-sentences.txt
- spa_news_2006_300K-sentences.txt
The longest task is that of finding most common features for every language (~ 1 minute per language). The training dataframe creation then takes ~ 2 minutes, and creating the random forest model takes ~ 1 minute. The following performance metrics are calculated
- ๐๐๐๐๐๐ ๐๐๐= ๐๐๐ข๐ ๐๐๐ ๐๐ก๐๐ฃ๐๐ /(๐๐๐ข๐ ๐๐๐ ๐ก๐๐ฃ๐๐ +๐น๐๐๐ ๐ ๐๐๐ ๐๐ก๐๐ฃ๐๐ )
- ๐ ๐๐๐๐๐= ๐๐๐ข๐ ๐๐๐ ๐๐ก๐๐ฃ๐๐ /(๐๐๐ข๐ ๐๐๐ ๐ก๐๐ฃ๐๐ +๐น๐๐๐ ๐ ๐๐๐๐๐ก๐๐ฃ๐๐ )
- ๐น1 ๐๐๐๐๐= 2โ ๐๐๐๐๐๐ ๐๐๐โ ๐ ๐๐๐๐๐๐๐๐๐๐๐ ๐๐๐+๐ ๐๐๐๐๐
In general, the model performs with 98% Precision, 98% Recall and 98% F1-Score.
The performance is also examined using the confusion matrix, that tells us the distribution of predicted labels v/s actual labels.
Using all bigrams, trigrams, and words will blow up the feature space and impact performance adversely. Hence the features are first shortlisted on the basis of most frequent features. This results in optimal performance both in terms of model accuracy and time taken.
There is a need to prune feature space to further remove redundancies. One approach could be through the use of maximal substrings. For eg - the trigram ' a ' will be a substring of ' a' always and can be removed. The size of the training data (for feature shortlisting) can be reduced and an optimal size of data can be explored. Furthermore, the accuracy of the model for French and Portuguese can be bettered through use of slightly more features for these two languages in particular.