- pandas
- nltk
- sklearn
- The application must be run on a machine with internet connectivity, because its core functionality depends on fetching data from the git repository.
- The unredactor.tsv file is assumed to contain bad lines or corrupted data; such lines are skipped while reading the file.
- Given the limited dataset and the quality of the data, the accuracy of the model is very low.
- Only a few of the possible data errors are handled, so bad data in the unredactor.tsv file can still cause errors that stop the application.
Note: Validation records are not used, as a RandomForestClassifier is used for training and prediction. The model is also not saved or improved upon between runs.
Input: None
Output: Dataframe containing the data from the tsv file
This function uses the raw URL of the unredactor.tsv file in the git repository to read the current data in the file. The data is read using the pandas library and the resulting dataframe is returned.
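The reading step described above could look roughly like the sketch below. It is not the project's actual code: `DATA_URL` is a hypothetical placeholder for the raw file URL, and the `source` parameter is added only so the sketch can be exercised offline. Skipping malformed rows relies on the `on_bad_lines="skip"` option of `pandas.read_csv` (pandas 1.3 or newer).

```python
import io

import pandas as pd

# Hypothetical placeholder: substitute the raw URL of unredactor.tsv.
DATA_URL = "https://example.com/unredactor.tsv"


def fetch_data(source=DATA_URL):
    """Read the tab-separated file into a dataframe, skipping malformed rows."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,          # headers are assigned later, during cleaning
        on_bad_lines="skip",  # rows with too many fields are dropped
    )


# Offline usage: the second row has extra fields and is skipped.
sample = io.StringIO(
    "user1\ttraining\tAlice\tI met XXXXX today\n"
    "bad\tline\twith\ttoo\tmany\tfields\n"
    "user2\ttesting\tBob\tXXX was late\n"
)
df = fetch_data(sample)
```

Note that rows with too few fields are not skipped by this option; pandas pads them with missing values instead.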
Input: Dataframe with data from tsv file
Output: Adjusted data frame with headers. Sentences are converted to lower case and lemmatized
This method sets the header for the dataframe and loops through all the sentences in it. Each sentence is converted to lower case and lemmatized with a lemmatizer from the nltk library, and the updated dataframe is returned.
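A minimal sketch of this cleaning step follows. The column names in `COLUMNS` are assumptions (the actual header names are not listed here), and the `lemmatize` parameter is a hypothetical addition that lets the sketch run without the WordNet corpus. When no lemmatizer is passed, it falls back to NLTK's `WordNetLemmatizer`, which needs `nltk.download("wordnet")` to have been run once.

```python
import pandas as pd

# Assumed column names; the real header names may differ.
COLUMNS = ["user", "split", "name", "sentence"]


def clean_data(df, lemmatize=None):
    """Set headers, lower-case each sentence, and lemmatize it word by word."""
    if lemmatize is None:
        # Imported lazily so the sketch loads even without the NLTK data.
        from nltk.stem import WordNetLemmatizer
        lemmatize = WordNetLemmatizer().lemmatize

    df = df.copy()
    df.columns = COLUMNS
    df["sentence"] = [
        " ".join(lemmatize(word) for word in str(sentence).lower().split())
        for sentence in df["sentence"]
    ]
    return df


# Usage with a stand-in lemmatizer that only strips a trailing "s".
raw = pd.DataFrame([["u1", "training", "Alice", "The Dogs chased XXXXX"]])
cleaned = clean_data(raw, lemmatize=lambda w: w[:-1] if w.endswith("s") else w)
```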
Input: Dataframe with clean data
Output: Dataframes containing rows for training and testing (VALIDATION ROWS EXCLUDED)
This method selects the training and testing rows from the dataframe. The two sets are stored in separate dataframes, which are returned.
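The selection could be sketched as below, assuming the dataframe has a column (here called `split`, a hypothetical name) whose values mark each row as `training`, `validation`, or `testing`.

```python
import pandas as pd


def setup_training_data(df, split_column="split"):
    """Return (training, testing) dataframes; validation rows are left out."""
    labels = df[split_column].str.strip().str.lower()
    training = df[labels == "training"].reset_index(drop=True)
    testing = df[labels == "testing"].reset_index(drop=True)
    return training, testing


# Usage on a tiny hand-made dataframe.
df = pd.DataFrame({
    "split": ["training", "validation", "testing", "training"],
    "name": ["alice", "bob", "carol", "dave"],
    "sentence": ["s1", "s2", "s3", "s4"],
})
train_df, test_df = setup_training_data(df)
```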
Input: Dataframes with training and testing data
Output: Prints 10 predicted names and returns precision, recall and F1 scores
Sentences from the training data are vectorized using a TF-IDF vectorizer and set as X. The redacted names from the data are set as Y.
A RandomForestClassifier is then initialized with a maximum depth of 70 and trained using X and Y.
Performing Prediction: In order to match the number of features, the vocabulary from the initial vectorizer is used to create a new vectorizer. This vectorizer is then used to vectorize the sentences from the testing dataframe, and the result is fed to the model to predict the names.
The first 10 predicted names are copied to an array and printed on the console. The predictions are then compared with the actual names from the testing data to compute the precision, recall and F1 scores, which are returned as output.
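The training and prediction steps above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the column names `sentence` and `name`, the `(1, 2)` n-gram range, the macro averaging of the scores, and the fixed `random_state` are all choices made for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support


def train_and_predict(train_df, test_df, max_depth=70):
    # Fit TF-IDF on the training sentences (n-gram range is an assumption).
    train_vec = TfidfVectorizer(ngram_range=(1, 2))
    X = train_vec.fit_transform(train_df["sentence"])
    y = train_df["name"]

    model = RandomForestClassifier(max_depth=max_depth, random_state=0)
    model.fit(X, y)

    # A second vectorizer pinned to the training vocabulary, so the test
    # matrix has the same number of columns as the training matrix.
    test_vec = TfidfVectorizer(
        ngram_range=(1, 2), vocabulary=train_vec.vocabulary_
    )
    X_test = test_vec.fit_transform(test_df["sentence"])
    predictions = model.predict(X_test)

    print(list(predictions[:10]))  # first 10 predicted names
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_df["name"], predictions, average="macro", zero_division=0
    )
    return precision, recall, f1


# Usage on toy data.
train_df = pd.DataFrame({
    "name": ["alice", "bob", "alice", "bob"],
    "sentence": ["xxxxx sings well", "xxx plays chess",
                 "xxxxx sings often", "xxx plays cards"],
})
test_df = pd.DataFrame({
    "name": ["alice", "bob"],
    "sentence": ["xxxxx sings loudly", "xxx plays games"],
})
scores = train_and_predict(train_df, test_df)
```

A more common alternative is to call `transform` on the already fitted vectorizer, which also keeps the feature count fixed and reuses the IDF weights learned from the training sentences.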
All the above functions are called in a sequence and the resultant scores are printed on the console as output.
test_fetch_data
This test case runs the fetch_data method and checks if the returned dataframe is not empty
test_setup_training_data
Data is fetched, cleaned and set up using the fetch_data, clean_data and setup_training_data functions. The returned dataframes are checked to confirm that they contain data.
test_train_and_predict
This test case runs the complete project (sequentially calling all the functions), then checks that the scores returned as output are less than 1.
Project was run on an e2-micro instance
Clone the repository using the command
git clone https://github.com/SSharath-Kumar/cs5293sp22-project3
Install the required dependencies using the command
pipenv install
The project is run using the command
pipenv run python unredactor.py
Test cases can be run using the below command
pipenv run python -m pytest
## Addendum
- MultinomialNB was also implemented, but as its scores were inconsistent or wrong, it was not used for the project.
- The RandomForestClassifier initially provided lower accuracy scores, but trying various max depth options (10, 20 .. 90) helped to improve the scores.
- The application currently has the max depth of the RandomForestClassifier set to 70 so that it can run on the standard e2-micro instance.
- A max depth of 90 provided better accuracy but brought down the GCP instance. To use it, increase your instance size, update line 63 in unredactor.py and run the application.
- For feature extraction, CountVectorizer was also implemented, but the TF-IDF vectorizer with n-grams provided better results.
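The max depth comparison mentioned above could be reproduced with a loop like the one below. This is a hedged sketch, not the procedure actually used: `sweep_max_depth` is a hypothetical helper, macro F1 stands in for whatever score was compared, and the test sentences are vectorized with the fitted vectorizer's `transform` for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score


def sweep_max_depth(train_texts, train_names, test_texts, test_names,
                    depths=range(10, 100, 10)):
    """Train one forest per max depth (10, 20 .. 90) and record macro F1."""
    vec = TfidfVectorizer(ngram_range=(1, 2))  # n-gram range is an assumption
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    results = {}
    for depth in depths:
        model = RandomForestClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, train_names)
        results[depth] = f1_score(
            test_names, model.predict(X_test),
            average="macro", zero_division=0,
        )
    return results


# Usage on toy data.
results = sweep_max_depth(
    ["xxxxx sings well", "xxx plays chess", "xxxxx sings often"],
    ["alice", "bob", "alice"],
    ["xxxxx sings loudly", "xxx plays games"],
    ["alice", "bob"],
)
best_depth = max(results, key=results.get)
```

On a small instance, memory use grows with tree depth and vocabulary size, so it may be worth running the larger depths one at a time.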