People tend to discuss and share opinions on social platforms, but such activities sometimes attract threats or harassment that discourage people from expressing themselves freely. Many social platforms therefore try to detect harassment and threats in conversations so that they can be stopped before they cause further damage. Toxicity detection in comments is one such methodology: it identifies the different types of conversations that can be classified as toxic in nature. To classify such comments effectively, we can use machine learning algorithms to determine their toxicity. In this project, a large set of toxic comments is used to train a Recurrent Neural Network (RNN) with a Bidirectional Long Short-Term Memory (LSTM) layer for this purpose.
- Python 3.7.0+
- TensorFlow 2.4.1+
- Keras 2.4.3+
- matplotlib 3.3.3+
- numpy 1.19.5+
- pandas 1.2.1+
- scikit-learn 0.24.1+
- nltk 3.5+
- spacy 3.0.3+
- textblob 0.15.3+
- gradio 1.5.3+
You can download the dataset from Kaggle using the download link on the dataset page.
- Navigate to the `Data` section
- In the `Data Explorer`, you will find four separate zip archives to download
- Download `test.csv.zip`, `test_labels.csv.zip` and `train.csv.zip`
- Extract the files
- Copy the CSV files to the `data` directory (you can verify the copy with the snippet below)
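Once the files are in place, you can quickly verify them with a short snippet like the following (an illustrative check, not part of the repository):

```python
import pandas as pd

# Load the extracted Kaggle CSVs from the data directory
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
test_labels = pd.read_csv("data/test_labels.csv")

print(train.shape, test.shape, test_labels.shape)
print(train.columns.tolist())   # inspect the comment text and label columns
```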
The following table enumerates the different classes (types) of comments:
| Toxic | Very Toxic | Obscene | Threat | Insult | Hate | Neutral |
|---|---|---|---|---|---|---|
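Note that these classes are not mutually exclusive: a single comment can, for example, be Toxic, Obscene and Insult at the same time, which makes this a multi-label classification problem. As a hypothetical illustration (the column order below is illustrative, not the dataset's actual layout):

```python
# Hypothetical multi-label target vector for one comment
#          Toxic  Very Toxic  Obscene  Threat  Insult  Hate
labels  = [1,     0,          1,       0,      1,      0]   # toxic, obscene and insulting
neutral = [0,     0,          0,       0,      0,      0]   # Neutral: no label set at all
```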
- Clone the repository

  ```
  git clone https://github.com/baishalidutta/Comments-Toxicity-Detection.git
  ```

- Install the required libraries

  ```
  pip3 install -r requirements.txt
  ```
- Clean text:
  - Lower-case all text
  - Remove uncommon signs
  - Expand abbreviations
  - Correct misspelled words
  - Remove punctuation
  - Remove emojis
  - Remove stop words
  - Apply lemmatisation
- Tokenize the text data
- Create an `Embedding Vector` using `GloVe.6B`
- Train a `Recurrent Neural Network (RNN)` with a `Bidirectional LSTM` layer (a sketch of the whole pipeline follows below)
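The following is a minimal sketch of how these steps could fit together using the libraries listed above; the vocabulary size, sequence length, layer sizes and the `clean_text` helper are illustrative assumptions, not the repository's actual code:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    """Lower-case, strip non-letters (punctuation, digits, emojis), drop stop words, lemmatise."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

# Two toy comments just to make the sketch runnable end to end
cleaned_comments = [clean_text(c) for c in ["You're awesome!", "I will hurt you"]]

# Tokenize the cleaned comments and pad them to a fixed length
MAX_WORDS, MAX_LEN = 20000, 200                        # illustrative sizes
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(cleaned_comments)
sequences = pad_sequences(tokenizer.texts_to_sequences(cleaned_comments), maxlen=MAX_LEN)

# Bidirectional LSTM on top of an embedding layer; in the real pipeline the
# embedding weights would be initialised from the glove.6B vectors
model = Sequential([
    Embedding(MAX_WORDS, 100, input_length=MAX_LEN),
    Bidirectional(LSTM(64)),
    Dropout(0.3),
    Dense(6, activation="sigmoid"),                    # one sigmoid per toxicity class;
])                                                     # "Neutral" = all outputs low
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Sigmoid outputs with binary cross-entropy (rather than a softmax) are used here because the classes are not mutually exclusive: each output independently estimates the probability of one toxicity type.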
Navigate to the `source` directory to execute the following scripts.
- To generate the model on your own, run

  ```
  python3 model_training.py
  ```

- You can also provide your own CSV data:

  ```
  python3 model_training.py --data=csv_file_location
  ```
- To evaluate any dataset using the pre-trained model (in the `model` directory), run

  ```
  python3 model_evaluation.py
  ```

Note that, for evaluation, `model_evaluation.py` will use `test.csv` and `test_labels.csv` (inside the `data` directory).
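If you are curious what such an evaluation roughly does, here is a hedged sketch; the model filename is hypothetical, the column names assume the Kaggle layout, and in practice the tokenizer fitted during training would be reused instead of refitted:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("model/toxicity_model.h5")          # hypothetical filename

test = pd.read_csv("data/test.csv")
labels = pd.read_csv("data/test_labels.csv")

# Kaggle marks rows that were never scored with -1 labels; keep only scored rows
label_cols = labels.columns.drop("id")
mask = (labels[label_cols] != -1).all(axis=1)

texts = test.loc[mask, "comment_text"].astype(str).tolist()
tokenizer = Tokenizer(num_words=20000)                 # in practice, reuse the tokenizer
tokenizer.fit_on_texts(texts)                          # fitted during training
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=200)

y_true = labels.loc[mask, label_cols].values
y_pred = model.predict(X)

print("macro ROC AUC:", roc_auc_score(y_true, y_pred, average="macro"))
```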
Alternatively, you can find the whole analysis in the notebook inside the `notebook` directory. To open the notebook, use either `jupyter notebook`, Google Colab, or any other IDE that supports notebooks, such as PyCharm Professional.
To run the web application locally, go to the `webapp` directory and execute:

```
python3 web_app.py
```

This will start a local server that you can access in your browser. You can type in any comment and see which toxicity classes the model assigns to it.
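Since `gradio` is among the requirements, an app of this kind can be as small as the sketch below; the model filename and the untrained placeholder tokenizer are illustrative assumptions, and the real `web_app.py` may differ:

```python
import gradio as gr
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

CLASSES = ["Toxic", "Very Toxic", "Obscene", "Threat", "Insult", "Hate"]
model = load_model("model/toxicity_model.h5")   # hypothetical filename

tokenizer = Tokenizer(num_words=20000)          # in practice, load the tokenizer
                                                # fitted during training

def classify(comment):
    # Turn the raw comment into a padded sequence and score it
    seq = pad_sequences(tokenizer.texts_to_sequences([comment]), maxlen=200)
    probs = model.predict(seq)[0]
    return {c: float(p) for c, p in zip(CLASSES, probs)}

# A text box in, a label widget with per-class confidences out
gr.Interface(fn=classify, inputs="text", outputs="label").launch()
```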
You can, alternatively, try out the hosted web application here.
Baishali Dutta ([email protected])
If you would like to contribute and improve the model further, check out the Contribution Guide.
This project is licensed under the Apache License, Version 2.0.