The spell checker currently collapses any run of more than two identical consecutive characters down to two (e.g., pproovaaaa becomes pproovaa).
In addition to that, it should collapse repeated characters at the beginning and at the end of a token down to a single character (e.g., pproovaaaa becomes proova).
This should help the stemmer to correctly stem tokens.
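A minimal sketch of both rules, assuming tokens are plain Python strings. The function name squeeze_repeats is hypothetical and is not taken from the existing spell checker.

```python
import re

# A run of 3 or more identical characters anywhere in the token.
_LONG_RUN = re.compile(r"(.)\1{2,}")
# A run of 2 or more identical characters at the very start / very end.
_LEADING_RUN = re.compile(r"^(.)\1+")
_TRAILING_RUN = re.compile(r"(.)\1+$")


def squeeze_repeats(token: str) -> str:
    """Hypothetical helper illustrating the current and the proposed rule."""
    token = _LONG_RUN.sub(r"\1\1", token)    # current rule:  pproovaaaa -> pproovaa
    token = _LEADING_RUN.sub(r"\1", token)   # proposed rule: pproovaa   -> proovaa
    token = _TRAILING_RUN.sub(r"\1", token)  # proposed rule: proovaa    -> proova
    return token
```

One thing to check before adopting this: the edge rule also shortens correctly spelled tokens that start or end with a double letter (for instance, "too" becomes "to"), so it may be worth measuring its effect on the stemmer output.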
Right now, the classifier can handle Unicode emojis such as 🥰 or 😡.
We shall improve the emoji handler with support for text emoticons, such as :) or :(.
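One possible approach, sketched below, is to replace each text emoticon with a placeholder token before tokenization. The emoticon list, the placeholder names and the function name replace_emoticons are all assumptions made for illustration. How the existing emoji handler represents emojis is not specified here.

```python
import re

# Assumed mapping from text emoticons to placeholder tokens (illustrative only).
EMOTICONS = {
    ":)": "EMOTICON_POSITIVE",
    ":-)": "EMOTICON_POSITIVE",
    ":D": "EMOTICON_POSITIVE",
    ":(": "EMOTICON_NEGATIVE",
    ":-(": "EMOTICON_NEGATIVE",
}

# Try longer emoticons first so the longest alternative wins at each position.
_EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text: str) -> str:
    """Replace each known emoticon with its placeholder, padded with spaces."""
    return _EMOTICON_PATTERN.sub(lambda m: f" {EMOTICONS[m.group(0)]} ", text)
```

This step would have to run before any punctuation stripping, otherwise the emoticons are lost before they can be matched.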
The test set is unavailable. Therefore, we should use the development set for training, testing and validation.
To do this, we should split the development set into training and test sets (80-20?). To validate the model, we should perform k-fold cross-validation on the training set. Testing should be done on the test set.
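The procedure above can be sketched with the standard library alone. The 80-20 ratio, k=5, the fixed seed and the function names split_dev_set and k_fold are assumptions, not decisions. The sketch does not stratify by class. If the project already depends on scikit-learn, its train_test_split and StratifiedKFold would be the more usual choice.

```python
import random


def split_dev_set(samples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the samples and hold out the last test_ratio share."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]


def k_fold(train, k=5):
    """Yield (fit, validation) pairs; every sample is validated exactly once."""
    for i in range(k):
        validation = train[i::k]  # every k-th sample, starting at offset i
        fit = [s for j, s in enumerate(train) if j % k != i]
        yield fit, validation


# Usage: the held-out test part is only touched once, after model selection.
train, test = split_dev_set(range(100))
for fit, validation in k_fold(train, k=5):
    pass  # train on `fit`, score on `validation`
```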
We could graph the length of the reviews in the data exploration phase. We could have two distributions, one for each class (negative and positive review length distribution). Is this useful?
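To judge whether the plot is worth it, the two distributions can first be computed as plain counts. The sketch below assumes reviews are available as (text, label) pairs and measures length in whitespace-separated tokens. The function name length_histogram and the bin width are illustrative.

```python
from collections import Counter


def length_histogram(reviews, bin_width=50):
    """Count reviews per class and per length bin.

    `reviews` is assumed to be an iterable of (text, label) pairs; the
    returned dict maps each label to a Counter keyed by the bin's lower bound.
    """
    histograms = {}
    for text, label in reviews:
        n_tokens = len(text.split())
        bin_start = (n_tokens // bin_width) * bin_width
        histograms.setdefault(label, Counter())[bin_start] += 1
    return histograms
```

If the two per-class histograms look clearly different, review length might be worth trying as a feature; if they overlap almost completely, the plot is probably only of descriptive interest.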