
tyjk / echoburst


A browser extension that uses sentiment analysis to find and highlight constructive comments on social media platforms that oppose the user's worldview, encouraging them to break out of the echo chambers the internet has allowed us to construct.

License: MIT License

Python 100.00%
echo-chamber conversation social-media nlp python

echoburst's Introduction

EchoBurst

Finding civilized conversation in a more polarized age


Welcome

Thank you for visiting the EchoBurst repository. This README details the broad purpose and objectives of the project, as well as where to find more detailed information on different aspects of the project.

The Revival

After a two-year hiatus, I finally feel I have the time, skills and data to make a real run at this project. The repo will be updated incrementally over the next month or so until summer break, when I hope to have time to dedicate to the project's technical development.

The Problem

In days past, we had limited choices in what media we consumed and whom we interacted with. This limited reach forced us to be less selective: we would talk to those who were closest, watch the channels we could afford and read the papers printed in our area. With the dawn of the internet, many hoped we would see an expansion in how many views a person had access to. We have found the opposite to be the case: since we can now choose from a functionally unlimited number of perspectives, we can customize whom and what we interact with to fit whatever views we already hold. This creates echo chambers of unparalleled fortitude, which greatly narrows our perspective and makes it easier to accept fake or misleading stories that align with our established worldview. It has become increasingly important that we encourage a more diverse media diet, and find ways to check the stories we hear against a diverse set of established sources.

Our Solution

This problem is too large to be solved by any one effort, but we hope to contribute in some small part. We aim to create a tool that works against the paradigm social media has created: increasingly isolated echo chambers and growing distrust and animosity towards those who dissent from our established beliefs. As explained in the project description, we hope to do this by making it easier to find comments that promote civil discussion (i.e., are not simply toxic, but contribute to the conversation) yet oppose the views of the user. In this way, we hope users will be able to expand their horizons without having to sift through hundreds of destructive and potentially hateful comments. Additionally, to prevent false equivalence between all positions and stories, this will be coupled with NLP-enabled fact checking and fake news detection. This process will rely on a wide range of established news sources, and we'll be working to ensure that it is done transparently and is loyal only to the truth, as much as it can be established.

Why It Matters

In an age where political and scientific discourse can literally reshape the face of the planet, our unwillingness to communicate with those we disagree with has caused views to polarize to an astounding degree, and discussion has broken down. If we isolate ourselves from everyone who disagrees with us, we greatly reduce our collective ability to effect change. It is the thesis of this project that most people generally want the same things: a better, healthier, fairer and safer world. Often we simply differ on how we believe this can be accomplished. Even in cases where prejudice and distrust infect our discourse, exposure and interaction between hostile groups often leads to the discovery of shared ground.

How To Contribute

For details on how to contribute, please see our CONTRIBUTING page. Anyone interested is encouraged to do so, and we especially need expertise in NLP and in how these models can be effectively integrated into a web extension.

Our concrete short and medium term goals have been posted in the Roadmap issue, where they can be discussed, checked off and modified as progress is made. This is of course subject to change, but we're hoping to follow the general timeline set out. A more general set of development stages can be found in the Wiki.

The wiki has been updated and now contains an outline of the new structure of the project, the different planned machine learning components, and a very rough outline of the desired stages of development.

echoburst's People

Contributors

annakrystalli, jelliotartz, tyjk


echoburst's Issues

Web Scraping

Web Scraping

This issue is primarily to ensure organization of any web scraping efforts. If you are going to try to scrape a URL, mention which one it is so others don't do the same.

Instructions

Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (aka web crawler) will navigate the site, and then lets you visually mark what content you want to scrape. This pattern will then be applied to other pages. Multiple patterns can be given to ensure proper scraping across multiple page formats. There ARE likely more efficient and clever methods of scraping, but this is the most feasible option I've found for people without specialized knowledge. If you have that specialized knowledge, please feel free to speak up and make suggestions.

Tutorial

Tutorial Video
Portia Documentation

Important Note
Make SURE that when you have the text highlighted, it's scraping text and only text. This will mean you won't have to worry about it scraping images or other undesirable content.

Also, if you are able to get all your data with only one sample (you can add to the sample by clicking the little four square icon near the minus sign), do that and name it field1. This provides a standard and makes cleaning easier. If this isn't possible though, no worries.

Running the Scraper

It's hard to tell how long the process will run for. It can take several hours to scrape one site, depending on its size, so keep that in mind when deciding how many sites you'll scrape. Once the scraper is running, it's a good idea to check the log as soon as you can to make sure that, in general, the scraper is doing what you want it to.

Uploading data

One thing that wasn't mentioned in the tutorial (whoops) was how to upload. Click on the item count once the scrape is complete, then go to the Export button in the top right. Select "JSONL" and download the file. Then upload it to the Data folder when finished.
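Once downloaded, a JSONL export is just one JSON object per line. A minimal loader like the following (a sketch of ours, not part of any existing script; the `field1` key follows the naming convention above) can be used to inspect the data before committing it:

```python
import json

def load_jsonl(path):
    """Load a JSONL export: one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Each record should then expose the scraped text under whatever field name was set in Portia (ideally field1).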

Thank you so much for your contribution!

code of conduct

Mind if I use your code of conduct as a template for Pi Reel?

Click here for more info on Pi Reel. It's still a work in progress.

Identification of Polarized Blog Posts

Labelling Blog Sites

We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, where we use heuristics and tags to classify far more text than we could possibly label manually, the idea being that the cost of potentially mislabelling some data is outweighed by the far greater volume. To do this we have targeted opinion blogs for three main reasons:

  • They contain far more text than a single social media comment
  • Posts on the same site should largely hold the same sentiment or point of view for a given topic
  • Unlike news articles, they should be very semantically similar to comments

We will need to scrape this data, meaning we first need to label potential target sites. To do this we need people to pick a topic, such as global warming, vaccination, religion/atheism or some other polarizing topic. Once that topic is decided on, try to find blogs that deal more or less exclusively with it, and determine the dominant sentiment of official posts on the site (not comments). Check that the sentiment is fairly consistent between posts and authors (if there's more than one).

Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format Topic of Blog Posts - Sentiment (e.g., Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning for the sake of easily running them through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) or those whose topic varies significantly.
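Because the file name encodes the label, downstream tooling can recover the topic and sentiment directly from it. A minimal sketch (the helper name is ours, purely illustrative):

```python
def parse_label_filename(filename):
    """Split a name like 'Climate Change - Denial.txt' into (topic, sentiment)."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension, if any
    topic, _, sentiment = stem.partition(" - ")
    return topic.strip(), sentiment.strip()
```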

What should be in the file

The first item is the domain of the website, which will be used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www', but simply the domain name, such as realclimate.org.

The next is the URL pattern for the blog posts. By this I mean the longest URL prefix common to all blog pages on the site. For example, on realclimate.org all of the blog posts can be found by year, e.g., http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common prefix would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but all valid post URLs MUST contain this sequence, which makes it easy for anyone scraping with Portia or another scraper to enter the sequence into the regex section when designing a spider and then set it loose. Finally, if you want, you can add a subjective evaluation of how extreme you believe the site to be in its position, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder, and remember to name your file with the topic and sentiment.
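The two rules above (domain restriction plus the common URL prefix) amount to a simple filter a crawler could apply to each candidate link. A sketch, with an illustrative function name of our own:

```python
from urllib.parse import urlparse

def matches_target(url, domain, prefix):
    """True if the URL is on the target domain and contains the common prefix."""
    netloc = urlparse(url).netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]  # treat www.realclimate.org and realclimate.org alike
    return netloc == domain and prefix in url
```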

The list of possible topics includes but is not limited to:

  • Climate Change - IsReal/Skeptic
  • Abortion - Pro-life/Pro-choice
  • Religion - Believers/Non-believers
  • Vaccines - Pro-vaccination/anti-vaccination
  • Guns - Pro-gun/Anti-gun
  • Drug Policy - Criminalization/Decriminalization and Legalization

We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:

  • These sorts of categories are quite general and tend to encompass many of the above topics
  • Defining what is Left vs. what is Right is more subjective and inconsistent from person to person.

If you choose to create your own topic, please keep in mind that it should be clear and unambiguous as well as broad; e.g., Yankees vs. Red Sox would not be a good topic, as it's very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while any self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.

Thank you for your efforts and patience.

Compiling YouTube video playlists

We're looking to extend data collection from the captions of YouTube videos.

As a start, it would be useful to gather playlists for the different topics. Currently, the most effective approach would be to curate playlists that are consistent in both topic and position, i.e., a separate playlist for climate change videos and another for climate change denial videos.

We are mainly interested in videos in which the captions are NOT autogenerated. However, because further down the line we might look into extracting useful data from autogenerated captions, it would also be useful to compile videos with autogenerated captions separately. So if you do come across them, just add them to a separate list (no need to separate that thematically at this point).

We're open to suggestions on the most effective way to centralise the resulting playlists. Let us know what you think. Otherwise, just drop a link to any playlists you create here for the time being.

Contributing to EchoBurst

How to Contribute Discussion and Questions

The README and CONTRIBUTING pages discuss how to get started contributing, but if you have any questions, comments or concerns about getting started, or even just about the project itself, post them here. If we get enough questions or recurring concerns, we'll add a FAQ page to the Wiki as well.

Topic Classification

Creating an Initial Topic Identification Model

We have created vector models in both Word2Vec and Doc2Vec, and now we are aiming to use these vectors as features for a classification or topic model that will correctly identify when a topic from a predefined list is being discussed in a comment. We are looking at different possibilities, including custom though imperfect datasets that use subreddit names as labels (generalized into broader topics), or possibly a classic dataset such as 20 Newsgroups as a proof of concept.

We will be using the gensim library to create the model and hope to have it completed by the end of the week.

Any expertise or advice on topic modeling would be appreciated.

Incentivization Brainstorming

A Discussion on how to Subvert Our Aversion to Dissenting Opinions

A primary problem with the proposed platform as it's conceptualized now is that most people will be extremely unwilling to engage with views they disagree with. Even if we employ toxicity filtering to make the experience more palatable, this aversion is a deeply ingrained defense mechanism that will be difficult to work around. It would be beneficial, then, to begin a conversation about how this might be approached.

As a starting point, we were thinking of using positive feedback and reward systems, similar to those employed by many mobile games and social media sites, in order to create a positive feedback cycle. A metric or score is usually a good place to start with this, and our current idea is to have that score be Viewpoint Variance. The idea behind this is that the greater the diversity of news sites you view and comments you read, the higher your score.

There are several technical challenges that would need to be addressed and capabilities that the app would need to have to make this work, but for this discussion we should keep it theoretical to start. This is probably the greatest challenge involved in the project as it's an attempt to subvert human nature, but if we can meet this challenge, it greatly opens up the potential for more widespread impact.

We would particularly love anyone with a background in behavioural psychology, reward systems or belief change to contribute, but this discussion has no prerequisites for posting. If you think you have an interesting idea or novel approach, or believe you can build on what we've already discussed, please comment.

NLP Models and Data Collection Discussion

A Discussion on the Best NLP and Data Collection Approaches

This is a place where we hope to generate discussion, with both experts and non-experts, on how we're planning to move forward in the immediate future towards classification models for topic modeling and sentiment analysis. We've included data collection in this, as none of it can proceed until we have some labelled data.

The scope of this discussion can include:

  • How we are labelling our data
  • How we are collecting/scraping this data
  • Our plans for topic modeling
  • Our plans for sentiment analysis
  • How we will be classifying the resulting models

A Brief Overview of Our Current Plan

  • Labelling: We're going to label our data by selecting blogs and websites (or sections of websites) that have a consistent sentiment and a coherent topic in line with our chosen topics (Full List). These will be collected via web scraping.
  • Scraping: We're thinking of using Portia, Beautiful Soup or possibly Selenium. This aspect is still being discussed and we should have a final plan within the next few days.
  • Topic Modeling: Our current plan is to use the Doc2Vec algorithm (specifically the gensim Python library). Each topic would be used as a tag, in addition to a unique tag for each document (blog post/article). However, we're also looking into the use of labelled LDA for this stage.
  • Sentiment Analysis: We have pretty firmly decided on Doc2Vec for this stage, as it's the state of the art for this sort of task. However, we have not decided between general sentiment detection (across all topics), topic-specific sentiment analysis (a separate sentiment model for each topic) or a hybrid of the two. We will likely test all of the above and find what works best for our purposes.
  • Classification: Selecting the classification algorithm should be a fairly trivial matter. Based on our research we suspect an SVM will perform best, or else Naive Bayes, but we'll try a broad range.

We welcome questions and suggestions with regards to these topics, so please feel free to drop a comment.
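To make the classification step concrete without committing to a library, here is a from-scratch multinomial Naive Bayes over bag-of-words with add-one smoothing. In practice we would reach for scikit-learn's MultinomialNB or an SVM, so treat this purely as an illustration of the technique:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words, with add-one smoothing.
    Illustrative only; a real run would use scikit-learn or an SVM."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # per-class document counts
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                # add-one smoothing keeps unseen words from zeroing the likelihood
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best
```

On the real data, the features would instead be the Doc2Vec vectors described above rather than raw word counts.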

Roadmap

Roadmap

This is an ideal set of steps we would take. What we focus on and when things are completed is subject to change.

March and April

  • Fix up the repo
  • Collect data, particularly social media data.
  • Read up on the latest NLP breakthroughs such as BERT, Transformers, etc.
  • Read up on some of the specific sub-problems such as text summarization and topic classification

May

  • Develop an effective political leaning classifier
  • Research methods of incorporating ML into web extensions, and how they should be structured to ensure they aren't resource intensive for the user
  • Develop an effective topic classifier
  • Create a dead simple testing platform and test the effectiveness of the combined leaning/topic models.

June

  • Develop an event classifier and determine the general feasibility of this segment of the project, as it's subject to external factors
  • Lay out a framework for the extension or application, and determine server requirements
  • Continue testing real world performance of existing classifiers using local testing platform
  • Build up a prototype extension with the existing models for very basic functionality

July

  • Ideally, soft launch of the MVP, though this is likely not feasible
  • Develop a set of text summarizers using a variety of parameters, data subsets and techniques if necessary
  • Continue developing extension. I'm going to learn to hate web programming all over again this summer
  • Establish the framework for developing the fake news classifier. Owing to the potential politicized subjectivity of what counts as fake news, this is an important step before development for the credibility of the project

August

  • Develop toxicity classifier
  • Continue working on extension
  • Develop fake news classifier

Working Open - How to get more contributors

Here are just some suggestions:

  • Add a CONTRIBUTING.md file linked in the README.md with clear instructions on how people can contribute and contact you
  • Move all gathered resources to the repo Wiki
  • Create an IRC or Gitter.im channel to have an open discussion
  • Move meeting notes from Google Docs to a public etherpad
  • Put a link to the etherpad in the README.md
  • Don't forget these: mozillascience/WOW-2017#26
  • Maybe have the Roadmap as an issue instead of a file, so that people can discuss it
  • Put a link to the Roadmap issue in the README.md
  • Use a simple style for issues labels
  • Create a project board to manage issues and track progress with columns such as "To-Do", "Doing", "Done" (more about Kanban boards)
