Labelling Blog Sites
We need labelled data for various topics and sentiments, and we need a lot of it. We have decided on a form of labelling called distant supervision, in which heuristics and tags are used to classify far more text than we could possibly label manually. The idea is that the cost of mislabelling some data is outweighed by the far greater volume. To do this, we have targeted opinion blogs for three main reasons:
- They contain far more text than a single social media comment
- Posts on the same site should largely hold the same sentiment or point of view for a given topic
- Unlike news articles, they should be very semantically similar to comments
We will need to scrape this data, which means we first need to label potential target sites. To do this, we need people to pick a polarizing topic, such as global warming, vaccination, or religion/atheism. Once a topic is decided on, try to find blogs that deal more or less exclusively with it, and determine the dominant sentiment of the official posts on the site (not the comments). Check that the sentiment is fairly consistent between posts and authors (if there is more than one).
Once a site or domain is determined to be a good target, enter the URL into a text file. The text file should be named in the format: Topic of Blog Posts - Sentiment (e.g. Climate Change - Denial, Abortion - Pro Choice, etc.). Each file should contain only one leaning, so that files can easily be run through any automated scraper we create. Avoid ambiguously leaning sites (those that post from both sides) and those whose topic varies significantly.
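A scraper could recover the topic and sentiment labels straight from a file name that follows this convention. The sketch below is only an illustration: the function name `parse_label_filename` and the `.txt` extension are assumptions, not part of any existing tooling.

```python
from pathlib import Path


def parse_label_filename(filename):
    """Split a 'Topic - Sentiment' file name into (topic, sentiment).

    Assumes the separator is ' - ' (with spaces), so hyphens inside a
    label such as 'Pro-choice' are left alone.
    """
    stem = Path(filename).stem  # drop the (assumed) .txt extension
    topic, sep, sentiment = stem.partition(" - ")
    topic, sentiment = topic.strip(), sentiment.strip()
    if not sep or not topic or not sentiment:
        raise ValueError(f"Expected 'Topic - Sentiment', got: {filename!r}")
    return topic, sentiment


labels = parse_label_filename("Climate Change - Denial.txt")
```

A file name without the separator raises `ValueError`, which makes misnamed files easy to catch before scraping starts.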
What should be in the file
The first entry is the domain of the website, which is used to limit where a crawler can go and which links it can follow. It should not include 'http://' or 'www.', just the domain name, such as realclimate.org.
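If you are copying the address from a browser, the bare domain can be derived mechanically. This is a minimal sketch using Python's standard library; the helper name `bare_domain` is hypothetical, and it only strips a leading 'www.', so other subdomains are kept as they are.

```python
from urllib.parse import urlsplit


def bare_domain(url):
    """Return the host name without scheme, path, or a leading 'www.'."""
    url = url.strip()
    if "://" not in url:
        # Without a scheme, urlsplit would treat the host as a path,
        # so mark the input as a network location first.
        url = "//" + url
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


domain = bare_domain("http://www.realclimate.org/index.php/archives/2017/05/")
```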
The next is the URL pattern for the blog posts. By this I mean the longest URL prefix shared by all blog pages on that site. For example, on realclimate.org all of the blog posts can be found by year, e.g. http://www.realclimate.org/index.php/archives/2017/05/ or http://www.realclimate.org/index.php/archives/2016/03/. Thus, the common URL would be http://www.realclimate.org/index.php/archives/20. This is not itself a valid URL, but every blog post URL MUST contain this sequence. This makes it easy for anyone scraping with Portia or another scraper to enter the sequence into the regex section when designing a spider and then set it loose.
Finally, if you want, you can add a subjective evaluation of how extreme you believe the site's position to be, with 1 being centrist and 5 being extremist. A template is available in the URL Dump folder. Remember to name your file with the topic and sentiment.
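The common URL pattern can be checked, or roughly derived, with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the non-post URL is made up for illustration, and the regex is anchored at the start of the URL, which is one reasonable reading of "must contain this sequence".

```python
import os
import re


def common_url_prefix(urls):
    """Return the longest character-wise prefix shared by all URLs."""
    return os.path.commonprefix(list(urls))


def prefix_pattern(prefix):
    """Compile a regex matching URLs that start with the literal prefix."""
    # re.escape keeps '.' and other special characters literal.
    return re.compile(re.escape(prefix))


samples = [
    "http://www.realclimate.org/index.php/archives/2017/05/",
    "http://www.realclimate.org/index.php/archives/2016/03/",
]
computed = common_url_prefix(samples)

pattern = prefix_pattern("http://www.realclimate.org/index.php/archives/20")
is_post = bool(pattern.match(samples[0]))
# Hypothetical non-post URL, used only to show a rejection.
is_other = bool(pattern.match("http://www.realclimate.org/index.php/about/"))
```

With only these two sample URLs the computed prefix ends in `201`, because both years fall in the 2010s. Sample posts from as wide a range of dates as you can before settling on a pattern, otherwise older or newer posts may be excluded.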
The list of possible topics includes but is not limited to:
- Climate Change - IsReal/Skeptic
- Abortion - Pro-life/Pro-choice
- Religion - Believers/Non-believers
- Vaccines - Pro-vaccination/Anti-vaccination
- Guns - Pro-gun/Anti-gun
- Drug Policy - Criminalization/Decriminalization and Legalization
We have deliberately stayed away from topics like Politics - Left/Right or Libertarian/Authoritarian for two reasons:
- These sorts of categories are quite general and tend to encompass many of the above topics
- Defining what is Left versus what is Right is more subjective and varies from person to person
If you choose to create your own topic, please keep in mind that it should be clear and unambiguous as well as broad. For example, Yankees vs. Red Sox would not be a good topic, as it is very specific. If you have any doubts, please comment on this issue with your suggested topic and we'll give you feedback. Also, while self-directed initiative is encouraged, keep in mind that we'd rather have a lot of data for just a few topics than sparser data for many topics.
Thank you for your efforts and patience.