Language Detection using the European Parliament Proceedings Parallel Corpus

The European Parliament Proceedings Parallel Corpus is a text dataset used for evaluating language detection engines. The 1.5 GB corpus includes 21 languages spoken in the EU. This project aims to build a machine learning model, trained on this dataset, that predicts the language of new, unseen text.
The training data can be downloaded here. Be sure to download the source resource (file size: 1.5 GB).
I use multinomial logistic regression instead of a naive Bayes classifier because it does not assume that the input features are conditionally independent given the class. This alone makes up for the longer time needed to train the multinomial logistic regression model. Unregularized logistic regression, however, does not do well when the features used to train it are highly correlated, and overfitting to the training data also has to be considered. We therefore use L2 (ridge) regularization with multinomial logistic regression to prevent overfitting and to help deal with correlated features. L2 regularization shrinks the regression coefficients toward zero; unlike L1 regularization, it does not make them sparse.
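For reference, the L2-regularized multinomial logistic regression objective can be written as below. The notation is introduced here for illustration and is not taken from the project: x_i is the feature vector of sample i, y_i its language label, w_k the weight vector of class k, K = 21 the number of classes, N the number of training samples, and \lambda the regularization strength (bias terms are omitted for brevity).

```latex
p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)},
\qquad
J(W) = -\sum_{i=1}^{N} \log p(y_i \mid x_i)
       + \frac{\lambda}{2} \sum_{k=1}^{K} \lVert w_k \rVert_2^2
```

The squared penalty pulls every weight toward zero. In scikit-learn the same trade-off is expressed through the inverse strength "C", so a larger C corresponds to a smaller \lambda.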
Around 26.5 MB of text files were randomly selected for each language, and the analysis was carried out on this subsample (~554 MB) of the 5 GB corpus.
Preprocessing consists of removing punctuation and digits from the extracted text.
The test data is preprocessed exactly like the training data: characters such as punctuation and digits, which add noise, are removed.
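One way to draw such a per-language subsample is sketched below. The function name, the byte budget, and the rule of skipping files that would overshoot the budget are assumptions made for illustration; the original selection code is not shown in this write-up.

```python
import os
import random


def sample_files(file_paths, budget_bytes, seed=0):
    """Randomly pick files whose combined size stays within budget_bytes."""
    rng = random.Random(seed)
    shuffled = list(file_paths)
    rng.shuffle(shuffled)

    chosen, total = [], 0
    for path in shuffled:
        size = os.path.getsize(path)
        if total + size > budget_bytes:
            continue  # skip files that would overshoot the budget
        chosen.append(path)
        total += size
    return chosen, total
```

For one language this could be called as `sample_files(paths_for_language, 26_500_000)`, where `paths_for_language` is a hypothetical list of that language's text files and 1 MB is taken to mean 10^6 bytes.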
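A minimal sketch of this cleaning step follows, assuming that only ASCII punctuation and digits are removed and that runs of whitespace are collapsed afterwards. The name `clean_text` is hypothetical. The same function would be applied to both the training and the test text.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(text):
    """Remove punctuation and digits, then collapse runs of whitespace."""
    return " ".join(text.translate(_DROP).split())
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotation marks and similar symbols would need to be added to the table explicitly.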
List of all the languages whose detection is supported:
- 'bg': Bulgarian
- 'cs': Czech
- 'da': Danish
- 'de': German
- 'el': Greek, Modern
- 'en': English
- 'es': Spanish
- 'et': Estonian
- 'fi': Finnish
- 'fr': French
- 'hu': Hungarian
- 'it': Italian
- 'lt': Lithuanian
- 'lv': Latvian
- 'nl': Dutch
- 'pl': Polish
- 'pt': Portuguese
- 'ro': Romanian
- 'sk': Slovak
- 'sl': Slovenian
- 'sv': Swedish
There are therefore 21 classes (language labels) that our classifier needs to identify correctly.
Step 4: A function joins all the words held in a dataframe for a given language and places them in a list.
Step 5: The list 't', which holds the lists of strings from every language, is converted into a single dataframe.
This dataframe, "trainingData", is the training data on which we train our model.
Character frequency analysis is undertaken using a logistic regression model. A bi-gram model of word pairs is considered. We use a pipeline to implement this model and use all CPU cores to build it. L2 regularization is used to prevent the model from overfitting the training data. The inverse of the regularization strength, "C", is set to 1.0, so that the model generalizes to new, unseen data.
Character frequency analysis and word frequency analysis are undertaken using a logistic regression model. The n-gram models tested are:
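Steps 4 and 5 could look roughly like the sketch below. The per-language frames, the "words" column, and the helper name `words_to_strings` are hypothetical stand-ins; only the names 't' and "trainingData" come from the description above.

```python
import pandas as pd


def words_to_strings(frame, lang):
    """Join the tokens of each row into one string and tag it with `lang`."""
    texts = frame["words"].apply(" ".join).tolist()
    return [(text, lang) for text in texts]


# Hypothetical per-language frames; each row holds a list of cleaned tokens.
frames = {
    "en": pd.DataFrame({"words": [["the", "house"], ["good", "morning"]]}),
    "de": pd.DataFrame({"words": [["das", "haus"], ["guten", "morgen"]]}),
}

# Step 4: one list of (text, label) pairs per language.
t = [words_to_strings(frame, lang) for lang, frame in frames.items()]

# Step 5: flatten the list of lists into a single training dataframe.
trainingData = pd.DataFrame(
    [row for rows in t for row in rows], columns=["text", "label"]
)
```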
- 1-gram Character frequency analysis
- 2-gram Character frequency analysis
- 4-gram Character frequency analysis
- 1-gram Word frequency analysis
- 2-gram Word frequency analysis
We use a pipeline to implement these models and use all CPU cores to build them. L2 regularization is used to prevent the models from overfitting the training data. The inverse of the regularization strength, "C", is set to 1.0 for all models, so that the models generalize to new, unseen data.
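The five models described above could be set up with a scikit-learn pipeline along the following lines. This is a sketch under assumptions: the write-up does not say which vectorizer was used, so `CountVectorizer` is a guess, as is counting n-grams of exactly length n rather than all lengths up to n. The sample sentences are made up. Parallelism is left at its default here; where an estimator supports it, all CPU cores are usually requested with `n_jobs=-1`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# (analyzer, n) pairs for the five models that were tested.
CONFIGS = {
    "char-1": ("char", 1),
    "char-2": ("char", 2),
    "char-4": ("char", 4),
    "word-1": ("word", 1),
    "word-2": ("word", 2),
}


def build_model(analyzer, n):
    """Count n-grams of length n and feed them to logistic regression."""
    return Pipeline([
        ("counts", CountVectorizer(analyzer=analyzer, ngram_range=(n, n))),
        # L2 is scikit-learn's default penalty; C is the inverse strength.
        ("clf", LogisticRegression(C=1.0, max_iter=1000)),
    ])


# Tiny made-up sample, only to show the fit and predict calls.
texts = [
    "the house is big", "good morning to you",
    "das haus ist gross", "guten morgen an alle",
    "la maison est grande", "bonjour tout le monde",
]
labels = ["en", "en", "de", "de", "fr", "fr"]

model = build_model(*CONFIGS["char-2"]).fit(texts, labels)
prediction = model.predict(["the morning is good"])[0]
```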
labels       precision    recall  f1-score   support
bg                1.00      1.00      1.00       997
cs                0.49      0.96      0.65       993
da                0.85      0.79      0.82       994
de                0.78      0.89      0.83       993
el                1.00      1.00      1.00       988
en                0.81      0.72      0.76       998
es                0.75      0.51      0.61       996
et                0.90      0.71      0.80       993
fi                0.84      0.96      0.90       995
fr                0.91      0.72      0.81       999
hu                0.91      0.97      0.94       998
it                0.90      0.48      0.63       996
lt                0.78      0.94      0.85       995
lv                0.95      0.94      0.95       978
nl                0.64      0.90      0.75       999
pl                0.95      0.97      0.96       997
pt                0.69      0.85      0.76       996
ro                0.73      0.88      0.80       927
sk                0.73      0.20      0.31       929
sl                0.90      0.80      0.85       998
sv                0.93      0.75      0.83       996
avg / total       0.83      0.81      0.80     20755
0.808961695977
A prediction accuracy of 80.896% was achieved on the test data with the 1-gram character logistic regression model.
labels       precision    recall  f1-score   support
bg                1.00      1.00      1.00       997
cs                0.62      0.98      0.76       993
da                0.90      0.90      0.90       994
de                0.89      0.96      0.92       993
el                1.00      1.00      1.00       988
en                0.96      0.89      0.92       998
es                0.93      0.72      0.81       996
et                0.95      0.83      0.88       993
fi                0.88      0.99      0.93       995
fr                0.94      0.93      0.93       999
hu                0.95      0.99      0.97       998
it                0.96      0.79      0.87       996
lt                0.91      0.96      0.94       995
lv                0.98      0.98      0.98       978
nl                0.83      0.94      0.88       999
pl                0.98      0.99      0.98       997
pt                0.79      0.94      0.86       996
ro                0.88      0.96      0.92       927
sk                0.97      0.41      0.58       929
sl                0.95      0.93      0.94       998
sv                0.96      0.86      0.91       996
avg / total       0.92      0.90      0.90     20755
0.903444953023
A prediction accuracy of 90.344% was achieved on the test data with the 2-gram character logistic regression model.
labels       precision    recall  f1-score   support
bg                1.00      1.00      1.00       997
cs                0.68      0.98      0.80       993
da                0.93      0.93      0.93       994
de                0.91      0.98      0.94       993
el                1.00      1.00      1.00       988
en                0.99      0.92      0.95       998
es                0.98      0.83      0.90       996
et                0.97      0.87      0.92       993
fi                0.90      0.99      0.95       995
fr                0.97      0.95      0.96       999
hu                0.95      0.99      0.97       998
it                0.98      0.88      0.93       996
lt                0.93      0.97      0.95       995
lv                0.99      0.99      0.99       978
nl                0.90      0.95      0.92       999
pl                0.98      0.99      0.98       997
pt                0.85      0.97      0.91       996
ro                0.92      0.98      0.95       927
sk                0.99      0.53      0.69       929
sl                0.96      0.95      0.95       998
sv                0.97      0.89      0.93       996
avg / total       0.94      0.93      0.93     20755
0.931775475789
A prediction accuracy of 93.1775% was achieved on the test data with the 4-gram character logistic regression model.
labels       precision    recall  f1-score   support
bg                1.00      1.00      1.00       997
cs                0.85      0.50      0.63       993
da                0.96      0.89      0.92       994
de                0.94      0.97      0.95       993
el                0.63      1.00      0.78       988
en                0.92      0.96      0.94       998
es                0.83      0.72      0.77       996
et                0.93      0.88      0.90       993
fi                0.91      0.82      0.86       995
fr                0.72      0.79      0.75       999
hu                0.98      0.94      0.96       998
it                0.86      0.92      0.89       996
lt                0.97      0.84      0.90       995
lv                0.79      0.91      0.85       978
nl                0.52      0.94      0.67       999
pl                0.95      0.81      0.87       997
pt                0.47      0.56      0.51       996
ro                0.99      0.84      0.91       927
sk                0.91      0.71      0.79       929
sl                0.78      0.44      0.56       998
sv                0.98      0.89      0.93       996
avg / total       0.85      0.83      0.83     20755
0.825439653096
A prediction accuracy of 82.5439% was achieved on the test data with the 1-gram word logistic regression model.
labels       precision    recall  f1-score   support
bg                1.00      1.00      1.00       997
cs                0.85      0.49      0.62       993
da                0.96      0.89      0.92       994
de                0.94      0.97      0.95       993
el                0.62      1.00      0.77       988
en                0.91      0.96      0.94       998
es                0.83      0.72      0.77       996
et                0.93      0.88      0.90       993
fi                0.91      0.82      0.86       995
fr                0.71      0.79      0.75       999
hu                0.99      0.94      0.96       998
it                0.86      0.92      0.89       996
lt                0.97      0.83      0.89       995
lv                0.79      0.90      0.84       978
nl                0.52      0.94      0.67       999
pl                0.95      0.81      0.87       997
pt                0.47      0.57      0.51       996
ro                0.99      0.84      0.91       927
sk                0.91      0.70      0.79       929
sl                0.78      0.44      0.56       998
sv                0.98      0.89      0.93       996
avg / total       0.85      0.82      0.83     20755
0.823849674777
A prediction accuracy of 82.3849% was achieved on the test data with the 2-gram word logistic regression model.
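Reports and accuracy figures of the kind shown above can be produced with scikit-learn's metrics module. The sketch below uses a handful of made-up labels in place of the real test set and model predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels standing in for the held-out test set and the model output.
y_true = ["cs", "cs", "sk", "sk", "en", "en"]
y_pred = ["cs", "cs", "cs", "sk", "en", "en"]

report = classification_report(y_true, y_pred)   # per-class precision/recall/F1
accuracy = accuracy_score(y_true, y_pred)        # fraction of correct labels

print(report)
print(accuracy)
```

In this toy example one Slovak string is predicted as Czech, so Czech keeps full recall but loses precision (2 of 3 predictions correct), while Slovak keeps full precision but loses recall (1 of 2 strings found). The same pattern appears for "cs" and "sk" in the character-level reports.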
The crosstab below shows the false positives and false negatives, which give some insight into the similarity between languages. "P" stands for predicted values and "A" stands for actual values in the crosstab.
- 399 Slovak strings were misclassified as Czech. This points to the two languages being closely related, which makes sense: the Czech Republic and Slovakia have a shared history that contributes to the similarities between the two languages.
Similarly, the following prominent trends emerged:
- 114 Spanish strings were misclassified as Portuguese
- 68 Estonian strings were misclassified as Finnish
- 44 Swedish strings were misclassified as Danish
- 42 Italian strings were misclassified as Romanian
- 30 Italian strings were misclassified as Portuguese
- 28 Dutch strings were misclassified as German
- 22 Danish strings were misclassified as Dutch
- 22 Estonian strings were misclassified as Lithuanian
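Counts like these can be read off a crosstab of actual against predicted labels. The sketch below builds one with pandas from made-up labels; the axis names "A" and "P" follow the convention described above, and everything else is illustrative.

```python
import pandas as pd

# Made-up labels standing in for the real test labels and predictions.
y_true = ["cs", "cs", "sk", "sk", "es", "pt"]
y_pred = ["cs", "cs", "cs", "sk", "pt", "pt"]

# Rows are actual labels (A), columns are predicted labels (P).
crosstab = pd.crosstab(
    pd.Series(y_true, name="A"),
    pd.Series(y_pred, name="P"),
)

# Off-diagonal cells count misclassifications, e.g. Slovak predicted as Czech.
sk_as_cs = crosstab.loc["sk", "cs"]
print(crosstab)
```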
By experimenting with hyperparameters, we see that character frequency analysis serves better than word frequency analysis for language detection. This reinforces the idea that learning to differentiate languages is mostly about learning the differences between the scripts of the languages rather than between their vocabularies.
It is therefore important to remove digits and common punctuation from the imported raw text, since these may be common across languages and therefore add noise to the training data. Removing them helps bring to the fore the differences between languages.
Multinomial logistic regression does poorly at differentiating between similar languages, as seen with Slovak and Czech, which were often misclassified.
The European Parliament corpus is sizable at about 5 GB. This would seem to be a data-at-scale problem requiring Big Data analysis. However, by means of sampling we can run the language detection classifier on a regular PC using the SciPy stack. Around 26.5 MB of text files were randomly selected for each language, and the analysis was carried out on this subsample (~554 MB) of the 5 GB corpus. We still obtain a model accuracy of 93.1775%, with the model generalizing to new, unseen data.
The models could be trained on a GPU instead of a CPU to reduce computation time.
The 4-gram character analysis model can be further improved by implementing a grid-search method that helps fine-tune the model hyperparameters.
We can also try higher-order n-gram models; however, the computational expense and time required to train them may be very high for a regular PC.
A neural network model, such as a recurrent neural network, could be implemented and may perform better. However, the computational expense of training a neural network on such a large dataset may not justify the gain in prediction accuracy.
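The grid search mentioned above could be sketched with scikit-learn's `GridSearchCV` as follows. The parameter grid, the two-fold cross-validation, the `CountVectorizer` step, and the toy sentences are all assumptions chosen to keep the example small; a real search would run on the training dataframe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("counts", CountVectorizer(analyzer="char")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: n-gram length and inverse regularization strength C.
param_grid = {
    "counts__ngram_range": [(2, 2), (4, 4)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Tiny made-up sample with four strings per language.
texts = [
    "the house is big", "good morning to you",
    "where is the station", "thank you very much",
    "das haus ist gross", "guten morgen an alle",
    "wo ist der bahnhof", "vielen dank an euch",
    "la maison est grande", "bonjour tout le monde",
    "ou se trouve la gare", "merci beaucoup a vous",
]
labels = ["en"] * 4 + ["de"] * 4 + ["fr"] * 4

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
best = search.best_params_
print(best)
```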