For the NER experiment, I used the CoNLL 2003 English dataset. The CoNLL 2003 data includes 1,393 English and 909 German news articles. Entities are annotated with four types: LOC (location), ORG (organisation), PER (person) and MISC (miscellaneous). This is an example sentence, where each line consists of [word] [POS tag] [chunk tag] [NER tag]:
    U.N. NNP I-NP I-ORG
    official NN I-NP O
    Ekeus NNP I-NP I-PER
    heads VBZ I-VP O
    for IN I-PP O
    Baghdad NNP I-NP I-LOC
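As a minimal sketch of how lines in this format could be read, the snippet below splits each line into its four fields. The `Token` type and the `parse_conll_lines` helper are hypothetical names introduced for illustration; they are not taken from the notebooks.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    chunk: str
    ner: str


def parse_conll_lines(lines: List[str]) -> List[Token]:
    """Parse '[word] [POS tag] [chunk tag] [NER tag]' lines into Token tuples."""
    tokens = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:  # skip blank lines and malformed rows
            continue
        tokens.append(Token(*parts))
    return tokens


sentence = [
    "U.N. NNP I-NP I-ORG",
    "official NN I-NP O",
    "Ekeus NNP I-NP I-PER",
    "heads VBZ I-VP O",
    "for IN I-PP O",
    "Baghdad NNP I-NP I-LOC",
]
tokens = parse_conll_lines(sentence)
entities = [(t.word, t.ner) for t in tokens if t.ner != "O"]
print(entities)  # [('U.N.', 'I-ORG'), ('Ekeus', 'I-PER'), ('Baghdad', 'I-LOC')]
```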
Preprocessed Data Shapes:
- X_train - (900, 204566)
- X_val - (900, 46665)
- X_test - (900, 51577)
- Y_train - (10, 204566)
- Y_val - (10, 46665)
- Y_test - (10, 51577)
Each word is mapped to a pre-trained feature vector of size 300, so a window of 3 words has a feature size of 300 × 3 = 900.
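The window construction can be sketched as follows. This is an illustration only: the `window_features` helper, the toy embeddings, and the use of zero vectors for sentence boundaries and unknown words are assumptions, since the text does not say how the notebooks handle those cases.

```python
from typing import Dict, List

EMBED_DIM = 300   # size of each pre-trained word vector (from the text)
WINDOW = 3        # words per window (from the text)


def window_features(words: List[str],
                    embeddings: Dict[str, List[float]],
                    dim: int = EMBED_DIM) -> List[List[float]]:
    """Return one concatenated feature vector per word (the centre of its window)."""
    zero = [0.0] * dim                       # assumption: zeros for padding / unknown words
    vecs = [embeddings.get(w, zero) for w in words]
    padded = [zero] + vecs + [zero]          # one pad on each side for a window of 3
    features = []
    for i in range(len(words)):
        left, centre, right = padded[i], padded[i + 1], padded[i + 2]
        features.append(left + centre + right)   # list concatenation: 3 * dim values
    return features


# Toy embeddings; real vectors would come from the pre-trained model.
toy = {"Ekeus": [0.1] * EMBED_DIM, "heads": [0.2] * EMBED_DIM}
feats = window_features(["Ekeus", "heads", "for"], toy)
print(len(feats), len(feats[0]))  # 3 900
```

Stacking these vectors as columns gives a matrix of shape (900, number of windows), which matches the layout of the shapes listed above.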
I experimented with several architectures. The part common to all of the networks is the following:
The input to the network is the pre-trained feature vector for each window. The input shape is (900, 204566), where 900 is the feature size of each window and 204566 is the total number of training windows. The hidden layers vary between architectures and are described a bit later. The final layer has size 10, one unit per NER tag. The output of this final layer is converted to probabilities, on which a cross-entropy loss is computed. Different losses, such as log likelihood and max margin, were tried.
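To make the last step concrete, here is a minimal sketch of turning the 10 final-layer outputs into probabilities and scoring them with cross-entropy. It assumes a softmax is used for the conversion; the architecture listing in this README shows a sigmoid on the final layer, so the notebooks may differ. The function names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log probability assigned to the true class."""
    return -math.log(probs[true_class] + 1e-12)   # small epsilon avoids log(0)


logits = [0.0] * 10      # one score per NER tag
logits[3] = 2.0          # the network favours tag 3
probs = softmax(logits)
loss_correct = cross_entropy(probs, true_class=3)
loss_wrong = cross_entropy(probs, true_class=7)
print(loss_correct < loss_wrong)  # True
```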
Changing the architecture is easy. It is specified as a list of layers, each with a size and an activation:
    nn_architecture = [
        {"layer_size": 900, "activation": "none"},
        {"layer_size": 300, "activation": "relu"},
        {"layer_size": 100, "activation": "relu"},
        {"layer_size": 10, "activation": "sigmoid"}
    ]
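As a rough illustration of what such a specification implies, the sketch below derives the weight-matrix shapes and the parameter count from the list. The `layer_shapes` helper is hypothetical, and the (outputs, inputs) weight layout is an assumption chosen to match inputs stored as columns of shape (900, number of windows).

```python
from typing import Dict, List, Tuple

nn_architecture = [
    {"layer_size": 900, "activation": "none"},
    {"layer_size": 300, "activation": "relu"},
    {"layer_size": 100, "activation": "relu"},
    {"layer_size": 10, "activation": "sigmoid"},
]


def layer_shapes(arch: List[Dict]) -> List[Tuple[int, int]]:
    """Weight shape (outputs, inputs) for each pair of consecutive layers."""
    shapes = []
    for prev, cur in zip(arch[:-1], arch[1:]):
        shapes.append((cur["layer_size"], prev["layer_size"]))
    return shapes


shapes = layer_shapes(nn_architecture)
# Each layer has rows * cols weights plus one bias per output unit.
n_params = sum(rows * cols + rows for rows, cols in shapes)
print(shapes)    # [(300, 900), (100, 300), (10, 100)]
print(n_params)  # 301410
```

Adding or removing a hidden layer then only requires editing the list; the shapes follow from it.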
Different activations (sigmoid, tanh, ReLU and leaky ReLU) were tried for the hidden layers.
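For reference, these activations can be written as scalar functions as in the sketch below. The leaky ReLU slope of 0.01 and the `"leaky_relu"` key are assumptions; the values used in the notebooks are not stated here.

```python
import math


def sigmoid(z: float) -> float:
    # Two branches so that math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z: float) -> float:
    return max(0.0, z)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # alpha is the slope for negative inputs (assumed value).
    return z if z > 0 else alpha * z


# Lookup by the names used in an architecture specification.
ACTIVATIONS = {
    "none": lambda z: z,
    "sigmoid": sigmoid,
    "tanh": math.tanh,
    "relu": relu,
    "leaky_relu": leaky_relu,
}

for name in ("sigmoid", "tanh", "relu", "leaky_relu"):
    print(name, ACTIVATIONS[name](-2.0), ACTIVATIONS[name](2.0))
```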
The dataset has a class imbalance issue.
There are various methods for tackling this issue:
- Duplicating the infrequent classes: does not provide any new information to the model.
- Downsampling the most frequent classes: discards a large amount of data.
- Focal loss: a good way of dealing with class imbalance. It puts more weight on hard or infrequent samples, so the model is pushed to focus on infrequent samples too.
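The focal loss idea from the list above can be sketched for a single sample as follows. The focusing parameter `gamma` and its value of 2 are assumptions for illustration; focal loss is not the method used in this project.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one sample, given the probability assigned to its true class.

    The factor (1 - p)^gamma shrinks the loss of well-classified samples,
    so hard (often infrequent) samples dominate the total.
    """
    p = min(max(p_true, 1e-12), 1.0)     # clamp to avoid log(0)
    return -((1.0 - p) ** gamma) * math.log(p)


easy = focal_loss(0.9)    # confidently correct sample
hard = focal_loss(0.1)    # badly misclassified sample
plain = focal_loss(0.9, gamma=0.0)   # gamma = 0 gives ordinary cross-entropy
print(easy < plain < hard)  # True
```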
I used the Synthetic Minority Oversampling Technique (SMOTE) approach [1]. SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbours. It then chooses one of those k neighbours, b, at random and connects a and b to form a line segment in the feature space. The synthetic instance is a point on that segment, that is, a convex combination of the two chosen instances a and b.
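The interpolation step described above can be sketched in a few lines. This is a simplified illustration with a hypothetical `smote_sample` helper and a brute-force neighbour search, not the implementation used in the notebooks.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def smote_sample(minority: List[Vector], k: int, rng: random.Random) -> Vector:
    """Create one synthetic instance from the samples of a single minority class."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    i = rng.randrange(len(minority))                 # pick instance a at random
    a = minority[i]
    others = [x for j, x in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda x: squared_distance(a, x))[:k]
    b = rng.choice(neighbours)                       # pick one of the k nearest, b
    lam = rng.random()                               # position on the segment a-b
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]


rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_sample(minority, k=2, rng=rng)
print(synthetic)
```

Because the new point is a convex combination of a and b, it always lies between two existing minority samples rather than duplicating one of them.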
This process is highly memory intensive. It also requires a minimum number of samples of each class to be present, since a class needs enough neighbours to interpolate between. I therefore divided the data into batches of 10,000 windows and applied SMOTE to each batch separately.
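A minimal sketch of the batching, assuming the windows are stored one per row (the transpose of the (900, N) layout used elsewhere). The helper names and the default of k = 5 neighbours are assumptions; how the notebooks treat a class with a single sample in a batch is not stated.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Tuple


def iter_batches(X: Sequence, y: Sequence[int],
                 batch_size: int = 10000) -> Iterator[Tuple[Sequence, Sequence[int]]]:
    """Yield consecutive (X, y) slices of at most batch_size windows."""
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


def usable_k(labels: Sequence[int], k: int = 5) -> int:
    """Largest neighbour count that every class in the batch can support.

    A class with n samples has only n - 1 possible neighbours, so the
    rarest class limits k. A result of 0 means some class cannot be
    interpolated in this batch.
    """
    smallest = min(Counter(labels).values())
    return min(k, smallest - 1)


# Toy data: 25 windows, batches of 10.
X = [[float(i)] for i in range(25)]
y = [0] * 20 + [1] * 5
sizes: List[int] = [len(yb) for _, yb in iter_batches(X, y, batch_size=10)]
print(sizes)                        # [10, 10, 5]
print(usable_k([0] * 8 + [1] * 3))  # 2
```

Each batch would then be resampled on its own, with the neighbour count reduced (or the rare class handled separately) when `usable_k` reports a value below the desired k.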
Before applying SMOTE to a 10000 batch:
    Counter({1: 843, 4: 41, 0: 23, 8: 22, 3: 19, 5: 18, 7: 16, 9: 12, 6: 5, 2: 1})
After applying SMOTE to a 10000 batch:
    Counter({3: 7585, 1: 7585, 8: 7585, 0: 7585, 7: 7585, 4: 7585, 5: 7585, 9: 7585, 6: 7585, 2: 7585})
Note that the classes with very few samples (such as 6, 2 and 9) are brought up to the same count as the most frequent class after applying SMOTE.
- CORNLL.ipynb: Preprocess the data and extract features
- NER_NN.ipynb: Neural Network Implementation
- NER_NN_balanced.ipynb: Neural Network Implementation with SMOTE
- Other .py files: Supporting code
[1] Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.