covid19_fakenews_detection's Introduction

TathyaCov: Detecting Fake Tweets in the Times of COVID-19

DEMO VIDEO: https://youtu.be/pdWoBxBu9-k

This repository contains the implementation of the paper "No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection". The system classifies, in real time, whether a tweet contains a verifiable claim, and has been trained specifically to detect COVID-19-related fake news. We use AI-based techniques to process the tweet text and combine it with user features to classify each tweet as either REAL or FAKE. We handle tweets in three languages: English, Hindi and Bengali.

[Figure: flowchart of the system]

Structure:

Each folder has a detailed README explaining how to run its scripts.

  • For the dataset, refer to the data folder
  • To scrape and annotate more data, refer to the scraping_tools folder (we encourage extending the dataset with more annotations, both in the languages explored in this work and in others)
  • For the transformer-based classifiers, refer to the transformer_classifiers folder
  • For the ML-based models and the GUI implementation, refer to the GUI_MLModels folder

The following sections give a brief overview of the dataset and the methods used in our work.

Dataset:

We create the Indic-covidemic tweet dataset and use it for training and testing. We take the English tweets from the Infodemic dataset and scrape COVID-19-related Bengali and Hindi tweets from Twitter. Fresh annotations were made and incorporated to create the larger Indic dataset for this task. The scraping and parsing tools created for this purpose may be helpful for mining further Indic data. We have published our annotated dataset for research purposes; it can be found here.

Method:

We experimented with two different models for tweet classification. In the first setting, we use a mono-lingual model that handles English tweets. We then extend this by replacing the classifier with a multi-lingual one, which currently handles tweets in English, Hindi and Bengali. The essence of our approach lies in the features used for classification, the different classifiers, and the adaptations made to them for identifying fake tweets.

The architecture of the classifier is shown below.

[Figure: architecture of the mono-lingual classifier]

We use the following textual and user-related features for the classification task:
  • BERT-based sentence encoding of the tweets (TxtEmbd)
  • tweet features (twttxt)
  • user features (twtusr)

  • link score (FactVer) - Ratio of similarity calculated between a given tweet and the titles of results from a list of verified URLs, obtained by querying the tweet on the Google search engine (algorithm given below). We maintain a list of 50 URLs as verified sources.

    [Figure: link score algorithm]

  • bias score (Bias) - The probability of a tweet containing offensive language.
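The link score described in the list above can be sketched roughly as follows. This is a minimal illustration and not the repository's implementation: it assumes the score is the similarity mass of search results from verified sources divided by the similarity mass of all results, and it uses `difflib.SequenceMatcher` as a stand-in similarity measure. The names `VERIFIED_DOMAINS`, `similarity`, `is_verified` and `link_score` are hypothetical, the two domains are placeholders for the list of 50 verified sources, and the search results are passed in as `(url, title)` pairs rather than fetched from Google.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Placeholder domains (assumption): stand-ins for the 50 verified sources.
VERIFIED_DOMAINS = {"health-agency.example", "verified-news.example"}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in measure (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_verified(url: str) -> bool:
    """True if the URL's host is a verified domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in VERIFIED_DOMAINS)


def link_score(tweet: str, results: list[tuple[str, str]]) -> float:
    """Share of title similarity that comes from verified sources.

    `results` holds (url, title) pairs returned by a search for the tweet text.
    Returns 0.0 when there are no results or no title resembles the tweet.
    """
    total = sum(similarity(tweet, title) for _, title in results)
    if total == 0:
        return 0.0
    verified = sum(
        similarity(tweet, title) for url, title in results if is_verified(url)
    )
    return verified / total
```

Under these assumptions, a tweet whose matching search results come mostly from verified sources gets a score close to 1, while a tweet echoed only by unverified pages gets a score close to 0.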

[Figure: correlation plot of the features, mono-lingual setting]

The correlation plot shows that a subset of the user features and tweet features is helpful. We experimented with different classifiers; their results are given below.

[Figures: results of the mono-lingual and multi-lingual classifiers]
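The kind of analysis behind the correlation plot can be sketched as below. This is an illustration only: it assumes Pearson correlation between each numeric feature and a binary label (1 for FAKE, 0 for REAL), which may differ from the measure used in the paper. The function names `pearson` and `rank_features` are hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(
    features: dict[str, list[float]], labels: list[int]
) -> list[tuple[str, float]]:
    """Sort features by absolute correlation with the label, strongest first."""
    scores = {name: pearson(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Features near the top of the ranking are candidates for the subset worth keeping; a correlation close to zero suggests the feature carries little linear signal for the REAL/FAKE label on its own.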

Graphical User Interface (GUI):

We design a simple static HTML page that takes a tweet ID or URL as user input and detects whether the tweet is real or fake. Although our mono-lingual English classifier gave the best performance, even beating the state of the art, we chose the multi-lingual classifier for its wider applicability. Some snapshots of our demo are shown below:

[Screenshots: GUI output for a Hindi, a Bengali, and an English tweet]
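Since the GUI accepts either a tweet ID or a tweet URL, the input has to be normalised to an ID before the tweet can be looked up. A minimal sketch of that step is given below; it is not taken from the repository. It assumes the common `https://twitter.com/<user>/status/<id>` URL shape, and the helper name `extract_tweet_id` is hypothetical.

```python
import re
from typing import Optional

# Matches the usual tweet URL shape: twitter.com/<user>/status/<numeric id>
_TWEET_URL = re.compile(
    r"^https?://(?:www\.|mobile\.)?twitter\.com/[^/]+/status(?:es)?/([0-9]+)"
)


def extract_tweet_id(user_input: str) -> Optional[str]:
    """Return the numeric tweet ID from a raw ID or a tweet URL, else None."""
    text = user_input.strip()
    if re.fullmatch(r"[0-9]+", text):
        return text  # the input is already a bare ID
    match = _TWEET_URL.match(text)
    return match.group(1) if match else None
```

Returning `None` for unrecognised input lets the page show an error message instead of passing a malformed ID on to the classifier.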

Flask API:

The GUI is hosted on an IBM server (http://pca.sl.cloud9.ibm.com:1999/), which is accessible only from within the IBM domain.
process.py is working code that hosts the GUI on localhost. It can easily be modified to host the demo on any other server.

Citation:

If you find our work useful, please cite it as:

@misc{kar2020rumours,
      title={No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection}, 
      author={Debanjana Kar and Mohit Bhardwaj and Suranjana Samanta and Amar Prakash Azad},
      year={2020},
      eprint={2010.06906},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
