woah_emnlp_2020's Introduction

A Novel Methodology for Developing Automatic Harassment Classifiers for Twitter

Research conducted by Columbia University Speech Lab

Part of the Fourth Workshop on Online Abuse and Harms (co-hosted with EMNLP)

Development done by:

  • Ishaan Arora (MS '20, Columbia CS)
  • Julia Guo (BA '22, Columbia CS)

Advised by:

  • Sarah Ita Levitan (Professor, Hunter College; Previous postdoctorate, Columbia CS)
  • Susan E. McGregor (Professor, Columbia Journalism & CS)
  • Julia Hirschberg (Professor, Columbia CS)

Overview

This repository contains the core code used to fetch tweet threads that potentially contain hate speech, drawing on Twitter archive data and the Twitter Search API. The primary heuristics currently used to fetch this data are:

  • Get threads containing blocked and muted users from the target's Twitter archive
  • Get threads containing subtweets (mention of real name, but not of username) from the Twitter Search API
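The subtweet heuristic can be illustrated with a small check. This is only a sketch of the idea as described above (real name present, username absent); the function name `is_subtweet` and the case-insensitive matching are assumptions, not the repository's actual implementation.

```python
import re


def is_subtweet(text: str, real_name: str, user_name: str) -> bool:
    """Return True if `text` mentions the real name but not the @username.

    Hypothetical helper: the matching rules here are assumptions.
    """
    # Case-insensitive substring match on the real name.
    mentions_name = real_name.lower() in text.lower()
    # Match "@username" as a whole handle, so "@JaneDoe2" does not count.
    handle = re.compile(r"@" + re.escape(user_name) + r"(?!\w)", re.IGNORECASE)
    mentions_handle = handle.search(text) is not None
    return mentions_name and not mentions_handle
```

A tweet such as "Jane Doe is wrong again" would qualify under this sketch, while "@JaneDoe Jane Doe is wrong again" would not.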

Structure

driver.py: Main driver script for scraping data from Twitter archive and Twitter Search API

utils.py: Helper functions for tweet retrieval, used in driver.py and ways_to_fetch_tweet_threads.py

ways_to_fetch_tweet_threads.py: Classes for fetching tweet threads (from Archive, Search, ...)

Requirements

  • tweepy = 3.9.0

Setting up environment

conda create -n <env_name> python
conda activate <env_name>
pip install -r requirements.txt

How to run code with Twitter archive data

First, modify the function get_auth_cred() in utils.py to use your own Twitter Developer API keys. If you don't already have these, or are unfamiliar with the Twitter API, consult the Twitter Developer documentation.

Next, download your Twitter archive; Twitter's help pages describe how to request it.

After everything is set up and ready to go, you can run driver.py using Twitter archive files (tweet, mute, block) in the following manner:

python driver.py --tweet_file=<tweet file path> --mute_file=<mute file path> --block_file=<block file path> --real_name=<real name> --user_name=<twitter username>

You can also pass optional arguments that set the number of tweets to fetch with each method: --mute_tweets_ct (muted), --block_tweets_ct (blocked), --other_tweets_ct (non-muted/non-blocked), and --subtweet_tweets_ct (subtweets). The default values are 100, 100, 200, and 100, respectively.

Here is an example command to run:

python driver.py \
  --tweet_file="jd_tweet.js" \
  --mute_file="jd_mute.js" \
  --block_file="jd_block.js" \
  --real_name="Jane Doe" \
  --user_name="JaneDoe" \
  --mute_tweets_ct=10 \
  --block_tweets_ct=20 \
  --other_tweets_ct=30 \
  --subtweet_tweets_ct=40
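For reference, the command-line interface described above could be declared roughly as follows. This is a sketch assuming the driver uses argparse; the flag names and defaults come from this README, but the function name `build_parser`, the help strings, and the mapping of --other_tweets_ct to the non-muted/non-blocked count are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(
        description="Fetch tweet threads from a Twitter archive and the Search API."
    )
    # Required inputs: archive files and the target's identity.
    parser.add_argument("--tweet_file", required=True, help="path to the tweet archive file")
    parser.add_argument("--mute_file", required=True, help="path to the mute archive file")
    parser.add_argument("--block_file", required=True, help="path to the block archive file")
    parser.add_argument("--real_name", required=True, help="target's real name")
    parser.add_argument("--user_name", required=True, help="target's Twitter username")
    # Optional per-method tweet counts, with the documented defaults.
    parser.add_argument("--mute_tweets_ct", type=int, default=100)
    parser.add_argument("--block_tweets_ct", type=int, default=100)
    parser.add_argument("--other_tweets_ct", type=int, default=200)
    parser.add_argument("--subtweet_tweets_ct", type=int, default=100)
    return parser
```

Calling `build_parser().parse_args()` with only the five required flags would leave the four counts at their defaults.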

After the script finishes, it creates a new data file, dump/filtered.json, containing tweet threads pulled from both the provided Twitter archive files and the Search API.
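The output file can then be inspected with a few lines of Python. The sketch below assumes the file holds a JSON list of threads, each thread being a list of tweet dictionaries (the issue tracker describes the data as a list of lists of dictionaries); the helper names `load_threads` and `summarize` are hypothetical.

```python
import json
from pathlib import Path


def load_threads(path="dump/filtered.json"):
    """Load the serialized tweet threads from disk."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(threads):
    """Count threads and total tweets, assuming a list of lists of dicts."""
    return {
        "threads": len(threads),
        "tweets": sum(len(thread) for thread in threads),
    }
```

For example, `summarize(load_threads())` would report how many threads and tweets a run produced, provided the file layout matches the assumption above.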

To add a new tweet filtering heuristic:

Define a new subclass of TweetThreadsFromSource in ways_to_fetch_tweet_threads.py

Example:

class TweetThreadsFromSearch(TweetThreadsFromSource):

Override the method get_tweet_threads_list in this newly defined class.
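A fuller sketch of these two steps is shown below. The real TweetThreadsFromSource lives in ways_to_fetch_tweet_threads.py and its constructor and method signature are not shown in this README, so the base class here is a stand-in and the keyword-based subclass `TweetThreadsFromKeywords`, its arguments, and the `"text"` field are all hypothetical.

```python
from abc import ABC, abstractmethod


class TweetThreadsFromSource(ABC):
    """Stand-in for the repository's base class; the real one may differ."""

    @abstractmethod
    def get_tweet_threads_list(self):
        """Return a list of tweet threads (each a list of tweet dicts)."""


class TweetThreadsFromKeywords(TweetThreadsFromSource):
    """Hypothetical heuristic: keep threads where any tweet contains a keyword."""

    def __init__(self, threads, keywords):
        self.threads = threads
        self.keywords = [k.lower() for k in keywords]

    def get_tweet_threads_list(self):
        # Keep a thread if at least one of its tweets mentions a keyword.
        return [
            thread
            for thread in self.threads
            if any(
                keyword in tweet.get("text", "").lower()
                for tweet in thread
                for keyword in self.keywords
            )
        ]
```

Keeping every heuristic behind the same get_tweet_threads_list method means the driver can treat all sources uniformly, whatever the filtering logic inside.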

Contact information:

Ishaan Arora: [email protected]
Julia Guo: [email protected]

woah_emnlp_2020's Issues

Handle edge case in driver.py - take uniques

Before the final call to serialize_tweets() in driver.py, there is a slim chance that tweet threads from multiple sources are duplicated. A small enhancement is needed to keep only the unique entries in the list of lists of dictionaries.
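One possible way to take uniques is sketched below. It assumes each thread is JSON-serializable and that two threads count as duplicates when their contents are identical; the function name `unique_threads` is hypothetical, and keying on tweet IDs instead might suit the real data better.

```python
import json


def unique_threads(threads):
    """Drop duplicate threads from a list of lists of dicts, keeping order."""
    seen = set()
    unique = []
    for thread in threads:
        # Dicts are unhashable, so use a canonical JSON string as the key.
        key = json.dumps(thread, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(thread)
    return unique
```

Sorting keys during serialization makes the comparison independent of dictionary key order, so the same thread fetched from two sources collapses to one entry.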
