Twitter Disaster Classification Using LSTM, Attention and Transformers


Author: TeYang, Lau
Last Updated: 8 January 2021



Please refer to this notebook on Kaggle for a more detailed description, analysis, and insights into the project.

Project Motivation

For this project, I applied sequence models to an NLP problem: classifying tweets by whether or not they are about real disaster events. This is easy for a human, but difficult for a computer, as language contains many complexities. A model thus has to take into account the sequential nature of the tweet, the numerical representation of each word, and the importance and contribution of the other words in the same sequence, since a word can have completely different meanings in two different contexts. Here, I use Long Short-Term Memory (LSTM), Attention, and **Transformer** models, which have produced state-of-the-art results on many NLP tasks.

This project also used to be a competition on Kaggle, but I had neither the time nor the knowledge to complete the challenge when I first joined. After finishing a deep learning specialization course and doing more reading on sequence models, I am back to tackle this problem again!

Project Goals

  1. Explore different sequence models (LSTM, Attention, Transformers) for an NLP sentence classification problem
  2. Preprocess and clean tweet data into an appropriate format for input to neural network models
  3. Understand word embeddings and how they are used to represent words as inputs to NLP models
  4. Engineer new features from the tweet data that can help improve model classification

Project Overview

  • Preprocess and clean the tweet data
  • Exploratory data analysis of the texts to understand their structure
  • Engineer meta-features from the text to train an additional model for ensembling
  • Wrangle the text into the appropriate format for model input (tokenization, padding; see the sketch after this list)
  • Use GloVe embeddings for word representation
  • Train LSTM, Bidirectional LSTM with Attention, and BERT models
  • Perform error analysis to examine the mistakes made by the models
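
As a rough illustration of the tokenization, padding, and embedding steps, here is a minimal sketch assuming a list of cleaned tweet strings called `tweets` and a local GloVe file `glove.6B.100d.txt` (both names are assumptions, not the notebook's exact setup):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 50     # assumed maximum tweet length in tokens
EMBED_DIM = 100  # must match the GloVe dimensionality used

tweets = ["forest fire near la ronge sask canada", "i love this weather"]

# Build an integer vocabulary from the corpus and pad to a fixed length
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(tweets)
padded = pad_sequences(tokenizer.texts_to_sequences(tweets),
                       maxlen=MAX_LEN, padding="post")

# Map each vocabulary word to its pretrained GloVe vector
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        glove[values[0]] = np.asarray(values[1:], dtype="float32")

vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]
```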

About this Dataset

The dataset contains 10,000 tweets that were classified as disaster or non-disaster. It was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website.

Exploratory Data Analysis

Most Common Bigrams and Trigrams

[Figures: the most frequent bigrams and trigrams in the tweets]
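
The counts behind these figures can be reproduced with something like the following sketch, using scikit-learn's CountVectorizer (the notebook's exact EDA code may differ, and `df["text"]` is an assumed column name):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=2, k=10):
    """Return the k most frequent n-grams across the given texts."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1  # total count per n-gram
    return sorted(zip(vec.get_feature_names_out(), counts),
                  key=lambda pair: pair[1], reverse=True)[:k]

# top_ngrams(df["text"], n=2) -> most common bigrams
# top_ngrams(df["text"], n=3) -> most common trigrams
```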



Text Preprocessing/Cleaning

  • Expand contractions

  • Remove emojis

  • Remove URLs

  • Remove punctuation except '!?', as these convey the intensity and tonality of a tweet

  • Replace 'amp' with 'and'

  • Word segmentation - segment words such as 'iwould' into 'i' and 'would'

  • Lemmatization - reduce inflected words to their root form (the verb part-of-speech tag is used here); a condensed sketch of these steps follows this list
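
A condensed sketch of the cleaning steps above, assuming the `contractions` package and NLTK's WordNet lemmatizer (the regexes and helper name are illustrative, and word segmentation is omitted for brevity):

```python
import re
import contractions                      # pip install contractions
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    text = contractions.fix(text)                   # "i'd" -> "i would"
    text = re.sub(r"http\S+|www\.\S+", "", text)    # remove URLs
    text = text.encode("ascii", "ignore").decode()  # drop emojis/non-ASCII
    text = re.sub(r"\bamp\b", "and", text)          # leftover HTML '&amp;'
    text = re.sub(r"[^\w\s!?]", "", text)           # keep only '!?' punctuation
    # reduce inflected words to their root form, using the verb POS tag
    return " ".join(lemmatizer.lemmatize(w, pos="v")
                    for w in text.lower().split())

# clean_tweet("I'd run!! http://t.co/xyz") -> "i would run!!"
```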

Most Common Words After Cleaning

[Figure: the most frequent words after cleaning]



LSTM
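
A minimal Keras sketch of an LSTM classifier, reusing `embedding_matrix`, `vocab_size`, `EMBED_DIM`, and `MAX_LEN` from the GloVe step above (the layer sizes are assumptions, not the notebook's exact architecture):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential([
    # frozen GloVe embeddings as the input representation
    Embedding(vocab_size, EMBED_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    LSTM(64),                        # final hidden state summarizes the tweet
    Dropout(0.3),
    Dense(1, activation="sigmoid"),  # disaster vs. non-disaster
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```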


Bidirectional LSTM with Attention
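
One common way to formulate the attention layer is to score each Bi-LSTM timestep, softmax the scores into weights, and take the weighted sum as the sentence representation; the sketch below follows that pattern and may differ from the notebook's exact mechanism:

```python
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Dense, Softmax)

inputs = Input(shape=(MAX_LEN,))
x = Embedding(vocab_size, EMBED_DIM, weights=[embedding_matrix],
              trainable=False)(inputs)
h = Bidirectional(LSTM(64, return_sequences=True))(x)  # (batch, T, 128)

scores = Dense(1)(h)                          # one score per timestep
weights = Softmax(axis=1)(scores)             # attention weights over timesteps
context = tf.reduce_sum(weights * h, axis=1)  # weighted sum -> (batch, 128)

outputs = Dense(1, activation="sigmoid")(context)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```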


BERT

BERT was fine-tuned for only 3-5 epochs and achieved a validation accuracy of 0.84. It achieved the same 0.84 accuracy on Kaggle's test set as well.
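
A sketch of BERT fine-tuning using the Hugging Face `transformers` library; `train_texts` and `train_labels` are hypothetical variables, and the notebook may load BERT differently (e.g. via TF Hub):

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                        num_labels=2)

# train_texts: list of cleaned tweets; train_labels: 0/1 disaster labels
enc = tokenizer(list(train_texts), truncation=True, padding=True,
                max_length=50, return_tensors="tf")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # small LR for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dict(enc), tf.constant(train_labels), epochs=3, batch_size=32)
```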


Error Analysis

The confusion matrix for BERT's validation predictions shows that there are more false negatives than false positives.
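
For reference, the counts behind such a confusion matrix can be pulled out with scikit-learn (the variable names here are illustrative):

```python
from sklearn.metrics import confusion_matrix

# y_val: true validation labels; y_pred: thresholded model predictions
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```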

Conclusion

NLP techniques have come a long way, from basic bag-of-words approaches to RNNs to transformers. The current state-of-the-art models use some form of transformer in the network, which improves performance on NLP tasks because processing is no longer sequential in nature. In this project, the BERT model, which uses the encoder part of the transformer, appears to perform the best, although its performance varies considerably across runs.

Nevertheless, NLP models and techniques are receiving more and more attention, and hopefully even better models will be created that contribute immensely to different fields and industries.
