CS5293SP22 - Project 3

The Unredactor

Libraries used

  • pandas
  • nltk
  • sklearn

Model used - RandomForestClassifier

Assumptions

  • The application must be run on a machine with internet connectivity, since the core functionality depends on fetching data from the Git repository.
  • If the unredactor.tsv file contains bad lines or corrupted data, those lines are skipped while reading the file.
  • Given the limited dataset and the quality of the data, the accuracy of the model is very low.
  • Only a few data errors are handled; bad data in the unredactor.tsv file can still cause errors that stop the application.

Note: Validation records are not used, since a RandomForestClassifier is trained and used for prediction directly. The model is not saved or improved incrementally.

Functionality

fetch_data

Input: None

Output: Dataframe containing the data from the tsv file

This function reads the current data from the raw URL of the unredactor.tsv file in the Git repository. The data is read using the pandas library and returned as a dataframe.
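The read step can be sketched as below. The raw URL and the four-column layout are assumptions for illustration, not taken from the repository; `on_bad_lines="skip"` is what makes pandas drop corrupted rows, matching the assumption noted above.

```python
import io

import pandas as pd

# Assumed raw URL; the exact path in the repository may differ.
RAW_URL = "https://raw.githubusercontent.com/SSharath-Kumar/cs5293sp22-project3/main/unredactor.tsv"

def fetch_data(source=RAW_URL):
    # sep="\t" reads the TSV; on_bad_lines="skip" drops rows with extra fields
    return pd.read_csv(source, sep="\t", header=None, on_bad_lines="skip")

# Demonstration with an in-memory TSV: the second row has too many fields
sample = (
    "user1\ttraining\tAshton Kutcher\tgreat film\n"
    "user2\ttesting\tbroken\trow\twith\textra\tfields\n"
    "user3\ttesting\tJulia Roberts\tfine movie\n"
)
df = fetch_data(io.StringIO(sample))  # the malformed row is skipped
```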

clean_data

Input: Dataframe with data from tsv file

Output: Adjusted data frame with headers. Sentences are converted to lower case and lemmatized

This method sets the headers on the dataframe and loops through all of its sentences. Each sentence is converted to lower case, lemmatized using a lemmatizer from the nltk library, and the updated dataframe is returned.

setup_training_data

Input: Dataframe with clean data

Output: Dataframes containing rows for training and testing (VALIDATION ROWS EXCLUDED)

This method selects the training and testing rows from the dataframe, stores them in two separate dataframes, and returns both.
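The split can be sketched with a boolean mask on the column that labels each row; the column names here are assumed for illustration:

```python
import pandas as pd

# Toy dataframe in the general layout of unredactor.tsv (column names assumed)
df = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4"],
    "split": ["training", "validation", "testing", "training"],
    "name": ["Ashton Kutcher", "Julia Roberts", "Ashton Kutcher", "Julia Roberts"],
    "sentence": ["a", "b", "c", "d"],
})

# Keep only the rows needed; validation rows are intentionally excluded
train_df = df[df["split"] == "training"]
test_df = df[df["split"] == "testing"]
```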

train_and_predict

Input: Dataframes with training and testing data

Output: Prints 10 predicted names and returns precision, recall and F1 scores

Sentences from the training data are vectorized using a TF-IDF vectorizer and set as X. The redacted names from the data are set as Y.

A RandomForestClassifier is then initialized with a maximum depth of 70 and trained using X,Y.

Performing prediction: In order to match the number of features, the vocabulary from the initial vectorizer is used to create a new vectorizer. This vectorizer is then used to vectorize the sentences from the testing dataframe, which are fed as input to the model to predict the names.
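The train-then-predict flow, including the vocabulary-reuse trick, can be sketched as follows. The sentences and names are made up, and `REDACTED` is a stand-in for the redaction block characters:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

train_sentences = [
    "the performance by REDACTED was great",
    "REDACTED was terrible in this film",
    "REDACTED delivered a great performance",
    "a terrible film despite REDACTED",
]
train_names = ["Ashton Kutcher", "Julia Roberts", "Ashton Kutcher", "Julia Roberts"]

# X: TF-IDF features over the training sentences; Y: the redacted names
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(train_sentences)
model = RandomForestClassifier(max_depth=70, random_state=0)
model.fit(X, train_names)

# Reuse the fitted vocabulary so the test matrix has the same number of features
test_vectorizer = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vectorizer.vocabulary_)
X_test = test_vectorizer.fit_transform(["REDACTED gave a great performance"])
predictions = model.predict(X_test)
```

Fixing `vocabulary=` guarantees both matrices have identical width, which is what lets the fitted forest accept the test vectors.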

The first 10 names from the prediction are copied to an array and printed on the console. The predictions are then compared with the actual names from the testing data to compute the precision, recall and F1 scores, which are returned as output.
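The scoring step can be sketched with sklearn's metric functions on hypothetical predictions; the averaging strategy is an assumption, since the README does not specify one:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical predictions vs. ground truth from the testing rows
y_true = ["Ashton Kutcher", "Julia Roberts", "Ashton Kutcher"]
y_pred = ["Ashton Kutcher", "Ashton Kutcher", "Ashton Kutcher"]

# weighted averaging accounts for class imbalance among the names;
# zero_division=0 silences warnings for names never predicted
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
```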


All the above functions are called in a sequence and the resultant scores are printed on the console as output.

Test Cases

test_fetch_data

This test case runs the fetch_data method and checks that the returned dataframe is not empty.

test_setup_training_data

Data is fetched, cleaned and set up using the fetch_data, clean_data and setup_training_data functions. The returned dataframes are checked to ensure they contain data.

test_train_and_predict

This test case runs the complete project (sequentially calls all the functions), then checks that the scores returned as outputs are less than 1.


Steps for local deployment

The project was run on an e2-micro instance.

Clone the repository using the command

git clone https://github.com/SSharath-Kumar/cs5293sp22-project3

Install the required dependencies using the command

pipenv install

Running the project

The project is run using the command

pipenv run python unredactor.py

Running the test cases

Test cases can be run using the command below

pipenv run python -m pytest


Addendum

  • MultinomialNB was also implemented, but since its scores were inconsistent or wrong, it was not used for the project.
  • The RandomForestClassifier initially produced low accuracy scores, but trying various max depth options (10, 20, ..., 90) improved them.
  • The application currently sets the max depth of the RandomForestClassifier to 70 so it can run on a standard e2-micro instance.
  • A max depth of 90 provided better accuracy but was killing the GCP instance. You can increase your instance size, update line 63 of unredactor.py, and run the application.
  • For feature extraction, CountVectorizer was also implemented, but the TF-IDF vectorizer with n-grams provided better results.
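The max-depth sweep described above can be sketched on toy data; the dataset and cross-validated scoring here are stand-ins for the project's real TF-IDF features and accuracy measurements:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for the TF-IDF features and redacted names
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

scores = {}
for depth in range(10, 100, 10):  # the 10, 20, ..., 90 sweep
    clf = RandomForestClassifier(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(clf, X, y, cv=3).mean()

best_depth = max(scores, key=scores.get)
```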
