A Novel Methodology for Developing Automatic Harassment Classifiers for Twitter

Research conducted by Columbia University Speech Lab

Part of the Fourth Workshop on Online Abuse and Harms (co-hosted with EMNLP)

Developed by:

Ishaan Arora (MS '20, Columbia CS)
Julia Guo (BA '22, Columbia CS)

Advised by:

Sarah Ita Levitan (Professor, Hunter College; former postdoctoral researcher, Columbia CS)
Susan E. McGregor (Professor, Columbia Journalism & CS)
Julia Hirschberg (Professor, Columbia CS)

Overview

This repository contains the core code used to fetch tweet threads that potentially contain hate speech, drawing on Twitter archive data and the Twitter Search API. Two heuristics are currently used to select the data:

Get threads containing blocked and muted users from the target's Twitter archive
Get threads containing subtweets (tweets that mention the target's real name but not their username) from the Twitter Search API
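
The subtweet heuristic above can be illustrated with a minimal sketch. The function name is_subtweet and the exact matching rules (case-insensitive, whole-phrase match on the real name) are assumptions made for illustration; the repository's own check may differ.

```python
import re


def is_subtweet(text, real_name, user_name):
    """Return True if text names the person but never @-mentions them.

    Hypothetical helper, not taken from the repository.
    """
    # Real name matched case-insensitively as a whole phrase.
    name_pattern = re.compile(r"\b" + re.escape(real_name) + r"\b", re.IGNORECASE)
    # @username that is not merely the prefix of a longer handle.
    mention_pattern = re.compile(r"@" + re.escape(user_name) + r"(?!\w)", re.IGNORECASE)
    return bool(name_pattern.search(text)) and not mention_pattern.search(text)
```

With real_name="Jane Doe" and user_name="JaneDoe", a tweet such as "Jane Doe is wrong again" would count as a subtweet, while "@JaneDoe you are wrong, Jane Doe" would not.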

Structure

driver.py: Main driver script for scraping data from Twitter archive and Twitter Search API

utils.py: Helper functions for tweet retrieval, used in driver.py and ways_to_fetch_tweet_threads.py

ways_to_fetch_tweet_threads.py: Classes for fetching tweet threads (from Archive, Search, ...)

Requirements

tweepy = 3.9.0

Setting up environment

conda create -n <env_name> python pip
conda activate <env_name>
pip install -r requirements.txt

How to run code with Twitter archive data

First, edit the function get_auth_cred() in utils.py to supply your Twitter Developer API keys. If you don't have keys yet, or are unfamiliar with the Twitter API, consult the Twitter Developer documentation.
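
One way to avoid hard-coding keys is to read them from environment variables. The sketch below is only an illustration: the function name load_twitter_credentials, the environment variable names, and the returned dictionary are all assumptions, and the actual return value expected from get_auth_cred() in utils.py should be checked against the source.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
CREDENTIAL_ENV_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials():
    """Read the four Twitter API credentials from the environment."""
    missing = [name for name in CREDENTIAL_ENV_VARS if not os.environ.get(name)]
    if missing:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError("Missing Twitter credentials: " + ", ".join(missing))
    return {name: os.environ[name] for name in CREDENTIAL_ENV_VARS}
```

With tweepy 3.9.0, values like these are typically passed to tweepy.OAuthHandler(consumer_key, consumer_secret), followed by a call to set_access_token(access_token, access_token_secret) on the resulting handler.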

Next, download your Twitter archive; Twitter's help pages describe how to request it from your account settings.
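
For orientation, the archive's .js files are JSON arrays preceded by a JavaScript assignment. The sketch below shows one way to read them, assuming the layout "window.YTD.<name>.part0 = [ ... ]" and records keyed by "blocking" or "muting" with an "accountId" field. These details are assumptions about the archive format and may change; the helper names are hypothetical and are not the repository's own parsing code.

```python
import json


def parse_archive_js(text):
    """Parse the contents of a Twitter archive .js file into a list of records."""
    # Assumed layout: 'window.YTD.<name>.part0 = [ ... ]'.
    # Skip the JavaScript assignment and decode from the first '[' onward.
    start = text.index("[")
    return json.loads(text[start:])


def load_archive_js(path):
    """Read an archive file from disk and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_archive_js(f.read())


def account_ids(records, key):
    """Collect account IDs from records shaped like {key: {"accountId": "..."}}."""
    return {record[key]["accountId"] for record in records}
```

Under these assumptions, account_ids(load_archive_js("jd_block.js"), "blocking") would give the set of blocked account IDs, and the same call with "jd_mute.js" and "muting" would give the muted ones.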

Once setup is complete, run driver.py on the Twitter archive files (tweet, mute, block) as follows:

python driver.py --tweet_file=<tweet file path> --mute_file=<mute file path> --block_file=<block file path> --real_name=<real name> --user_name=<twitter username>

You can also provide optional arguments that set the number of tweets to fetch with each method: --mute_tweets_ct (muted), --block_tweets_ct (blocked), --other_tweets_ct (non-muted/non-blocked), and --subtweet_tweets_ct (subtweets). The default values are 100, 100, 200, and 100, respectively.

Here is an example command to run:

python driver.py \
  --tweet_file="jd_tweet.js" \
  --mute_file="jd_mute.js" \
  --block_file="jd_block.js" \
  --real_name="Jane Doe" \
  --user_name="JaneDoe" \
  --mute_tweets_ct=10 \
  --block_tweets_ct=20 \
  --other_tweets_ct=30 \
  --subtweet_tweets_ct=40
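
The arguments and defaults described above can be sketched with argparse. This is an illustrative reconstruction from the flags shown in this README, not the parser in driver.py; the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(description="Fetch tweet threads (sketch).")
    # Required inputs: archive files and the target's identity.
    parser.add_argument("--tweet_file", required=True, help="path to the archive tweet file")
    parser.add_argument("--mute_file", required=True, help="path to the archive mute file")
    parser.add_argument("--block_file", required=True, help="path to the archive block file")
    parser.add_argument("--real_name", required=True, help="target's real name")
    parser.add_argument("--user_name", required=True, help="target's Twitter username")
    # Optional per-method counts with the documented defaults.
    parser.add_argument("--mute_tweets_ct", type=int, default=100)
    parser.add_argument("--block_tweets_ct", type=int, default=100)
    parser.add_argument("--other_tweets_ct", type=int, default=200)
    parser.add_argument("--subtweet_tweets_ct", type=int, default=100)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
example_args = build_parser().parse_args([
    "--tweet_file=jd_tweet.js",
    "--mute_file=jd_mute.js",
    "--block_file=jd_block.js",
    "--real_name=Jane Doe",
    "--user_name=JaneDoe",
    "--mute_tweets_ct=10",
])
print(example_args.mute_tweets_ct, example_args.other_tweets_ct)  # prints: 10 200
```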

When the script finishes, it creates a new data file, dump/filtered.json, containing the tweet threads pulled from both the provided Twitter archive files and the Search API.

To add a new tweet filtering heuristic:

Define a new subclass of TweetThreadsFromSource in ways_to_fetch_tweet_threads.py

Example:

class TweetThreadsFromSearch(TweetThreadsFromSource):

Override the method get_tweet_threads_list in this newly defined class.
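
The two steps above can be sketched as follows. The base class here is a hypothetical stand-in so the example runs on its own; the real TweetThreadsFromSource in ways_to_fetch_tweet_threads.py may take different constructor arguments and return a different thread representation. The keyword heuristic, the class name TweetThreadsFromKeyword, and the list-of-strings thread format are likewise assumptions for illustration.

```python
class TweetThreadsFromSource:
    """Hypothetical stand-in for the repository's base class."""

    def get_tweet_threads_list(self):
        raise NotImplementedError


class TweetThreadsFromKeyword(TweetThreadsFromSource):
    """Example heuristic: keep threads in which any tweet contains a keyword."""

    def __init__(self, threads, keyword, max_threads=100):
        # threads: list of threads, each thread a list of tweet texts (assumed format).
        self.threads = threads
        self.keyword = keyword
        self.max_threads = max_threads

    def get_tweet_threads_list(self):
        keyword = self.keyword.lower()
        matches = [
            thread
            for thread in self.threads
            if any(keyword in tweet.lower() for tweet in thread)
        ]
        # Cap the result, mirroring the per-method tweet counts used by driver.py.
        return matches[: self.max_threads]
```

Keeping each heuristic behind the same get_tweet_threads_list method means the driver can treat every source uniformly, whatever logic sits behind it.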

Contact information:

Ishaan Arora: [email protected]
Julia Guo: [email protected]
