Text De-toxification with ConBEGPT

This project proposes a solution to the text de-toxification problem by leveraging Conditional BERT (Con. BERT) and GPT-2. The ConBEGPT network architecture is used to improve the quality of the text generated by Con. BERT and to produce more indirect transformations of it. The source code for Con. BERT is taken from a GitHub repository.

Important note: the project uses pseudo-absolute paths, so each notebook should be launched only once per kernel session. If you want to launch a notebook a second time, restart the kernel first.

Proposed Idea

While Con. BERT offers fast inference and performs well, it often struggles with phrases whose meaning is indirect. For instance, a toxic sentence such as They're all laughing at us, so we'll kick your ass is detoxified by Con. BERT as They're all laughing at us, so we'll kick your way. A more concise and semantically equivalent rewrite would be They're laughing at us, we'll show you.

To address this problem, we assume that an additional LLM can rewrite the text produced by Con. BERT, improving its quality and allowing more indirect transformations. It should also receive the initial sentence as input so that it can extract useful contextual information.

Technical Implementation

The ConBEGPT network architecture is illustrated in Figure 1 below.

[Figure 1: The ConBEGPT network architecture]

For the new LLM, we selected GPT-2 because it is easy to fine-tune through transfer learning and has an efficient inference time. To fine-tune the model, we constructed a new structure for the text input data and introduced new tokens (a registration sketch follows the list):

  • <T> - Initial toxic text, given as input by the user.
  • <NT> - Detoxified (supposedly non-toxic) text generated by Con. BERT.
  • <E> - Toxicity estimate of the detoxified text.
  • <F> - Final answer that we expect as the output of the model.
  • Example: <T>Spend fucking money on the neuroscience.<NT>Spend money on the neuroscience.<E>0.07<F>Sponsor neuroscience.
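
As an illustration only, here is a minimal sketch of how these tokens could be registered with a Hugging Face GPT-2 tokenizer and how one training line is assembled; the helper make_example is hypothetical and not part of the repository's wrapper code.

```python
# Minimal sketch, assuming the Hugging Face transformers GPT-2 implementation;
# this is illustrative and not the repository's own wrapper code.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<T>", "<NT>", "<E>", "<F>"]})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the four new tokens

def make_example(toxic: str, detoxified: str, toxicity: float, final: str) -> str:
    """Assemble one training line in the <T>...<NT>...<E>...<F>... format."""
    return f"<T>{toxic}<NT>{detoxified}<E>{toxicity:.2f}<F>{final}"

print(make_example("Spend fucking money on the neuroscience.",
                   "Spend money on the neuroscience.",
                   0.07,
                   "Sponsor neuroscience."))
```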

Data Description

The initial dataset, which can be found here, was preprocessed and simplified into a structure that is easier to interpret for our task. The preprocessed dataset is saved as /data/interim/filtered_preprocessed.csv and can be generated manually using the /notebooks/data_preprocessing.ipynb notebook.

The dataset contains the following columns:

Column           Type    Description
toxic_text       str     initial text with high toxicity
de-toxic_text    str     transformed text with lower toxicity
init_toxicity    float   toxicity level of toxic_text
detox_toxicity   float   toxicity level of the transformed text
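
For reference, the preprocessed file can be inspected directly with pandas; the path and column names below are the ones listed above.

```python
# Quick look at the preprocessed dataset described above.
import pandas as pd

df = pd.read_csv("data/interim/filtered_preprocessed.csv")
print(df[["toxic_text", "de-toxic_text", "init_toxicity", "detox_toxicity"]].head())
print(df[["init_toxicity", "detox_toxicity"]].describe())  # sanity-check the toxicity levels
```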

The GPT-2 training corpus, gpt2_corpus.txt, was constructed using the text structure defined above and was used during fine-tuning. You can generate it manually with the /notebooks/data_preprocessing.ipynb notebook. The notebook takes the toxic text from the created dataset for <T>, produces <NT> using the pretrained Con. BERT, estimates the toxicity of the Con. BERT output to obtain <E>, and finally takes <F> from the dataset.
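
A very rough sketch of that pipeline is shown below; conbert.detoxify and estimator.predict are placeholders for the project's actual wrappers (their real APIs are shown in the notebooks), and the output path is assumed.

```python
# Hypothetical sketch of building gpt2_corpus.txt from the preprocessed dataset.
# `conbert` and `estimator` stand in for the project's pretrained Con. BERT and
# toxicity-estimator wrappers; the method names used here are assumptions.
import pandas as pd

df = pd.read_csv("data/interim/filtered_preprocessed.csv")

with open("gpt2_corpus.txt", "w", encoding="utf-8") as corpus:
    for _, row in df.iterrows():
        toxic = row["toxic_text"]                 # <T>: original toxic text
        detoxified = conbert.detoxify(toxic)      # <NT>: Con. BERT output (placeholder call)
        toxicity = estimator.predict(detoxified)  # <E>: toxicity of the Con. BERT output (placeholder call)
        final = row["de-toxic_text"]              # <F>: reference rewrite from the dataset
        corpus.write(f"<T>{toxic}<NT>{detoxified}<E>{toxicity:.2f}<F>{final}\n")
```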

How to launch:

The model uses custom wrappers over the existing GPT-2 and BERT models, so the launch process is fairly simple and consists of three steps only:

  • Verify that you are using Python 3.7.x.
  • Install all the requirements from /REQUIREMENTS.txt.
  • Inference with the ConBEGPT model is described in /notebooks/conbegpt_inference.ipynb; a rough sketch of the GPT-2 side of inference follows below.
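
The sketch below shows roughly what the GPT-2 half of inference boils down to, using plain Hugging Face transformers; the checkpoint location models/Gpt2 is taken from the repository tree and is an assumption, and the project's own wrappers in the notebook handle these details for you.

```python
# Rough sketch of ConBEGPT-style generation with a fine-tuned GPT-2 checkpoint;
# the checkpoint path is assumed from the repository tree below.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("models/Gpt2")
model = GPT2LMHeadModel.from_pretrained("models/Gpt2")

# The prompt ends with <F>, so the model completes the final detoxified answer.
prompt = "<T>Spend fucking money on the neuroscience.<NT>Spend money on the neuroscience.<E>0.07<F>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))
```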

Structure

The structure of the repository is as follows:

text-detoxification
├── README.md, INSTRUCTIONS.md, REQUIREMENTS.txt  # Basic information about repository
│
├── data 
│   ├── interim  # Intermediate data that has been preprocessed
│   └── raw      # The original, immutable data
│
├── models  # Pretrained models and wrappers
│   ├── Conbert     # Conditional BERT model
│   ├── Gpt2        # GPT-2 model
│   └── Estimator   # toxicity estimator model
│
├── notebooks  
│   ├── conbegpt_inference.ipynb    # MAIN notebook, shows inference of the proposed model
│   ├── conbert_compile_vocab.ipynb # manually construct vocabulary for the Con. BERT model
│   ├── conbert_inference.ipynb     # shows the inference of con. BERT only
│   ├── data_preprocessing.ipynb    # manually generate gpt-2 train corpus and estimator dataset
│   ├── gpt2_fine_tune.ipynb        # manual fine-tune of the gpt-2 model
│   ├── gpt2_inference.ipynb        # shows the inference of gpt-2 only
│   ├── estimator_inference.ipynb   # shows the inference of estimator only
│   └── estimator_train.ipynb       # manual training of the estimator
│   
├── references   # Data dictionaries, manuals, and all other explanatory materials.
│
├── reports  # 2 reports required for the problem
│   └── figures  # Generated graphics and figures
│           
└── src
    │                 
    ├── data
    │   └── data_preproc.py  # Contains functions that are used in data_preprocessing.ipynb
    │
    ├── models
    │   ├── predict_model.py
    │   ├── download_models.py  # Used to automatically download and setup all the models
    │   └── train_model.py
    │   
    └── visualization   # Scripts to create exploratory and results oriented visualizations
        └── visualize.py
