This dataset has been split into train and test set. The IMDB Movie ratings are discretised into 5 bins, with 0 as lowest and 4 as highest rating
The number of years of every movie’s released year from the median is calculated so we can have the ‘years’ feature in a Gaussian distribution
Class distribution is visualised through a pie and bar chart to check for class imbalance
Any redundant features that were observed from the visualisation of data are removed. The remaining continuous features are standardised so that their ranges are consistent
One Hot Encoding is used to represent categorical features in a continuous space. This step is done in a pipeline to ensure that the encoding scheme used for the testing data is consistent with that of the training data. This also allows better handling of new categorical attributes in the testing instances.
For all of the following models, the accuracy for testing instances and an average accuracy of 5-fold cross validation are used as the evaluation metric. The training and prediction runtime are also recorded to compare the models’ complexity. The following classifiers are built:
- 0R as baseline model
- Random Forest
- Bagging ensemble of Decision Trees
- Logistic Regression