paperboy

paperboy aims to be a comprehensive collection of webscraping scripts for news media sites. Many data scientists and researchers write their own code when they have to retrieve news media content from websites. At the end of a research project, this code often collects digital dust on researchers' hard drives instead of being made public for others to use. paperboy offers writers of webscraping scripts a clear path to publish their code and earn co-authorship on the package (see the For developers section). For users, the promise is simple: paperboy delivers news media data from many websites in a consistent format. Check which domains are already supported in the table below or with the command pb_available().

Installation

paperboy is not on CRAN yet. Install it via remotes (first install remotes via install.packages("remotes")):

remotes::install_github("JBGruber/paperboy")

For Users

Say you have a link to a news media article, for example from mediacloud.org. Supply one or more article links to the main function, pb_deliver:

library(paperboy)
df <- pb_deliver("https://tinyurl.com/386e98k5")
df
| url | expanded_url | domain | status | datetime | author | headline | text | misc |
|---|---|---|---|---|---|---|---|---|
| https://tinyurl.com/386e98k5 | https://www.theguardian.com/tv-and-radio/2021/jul/12/should-marge-divorce-homer | www.theguardian.com | 200 | 2021-07-12 12:00:13 | https://www.theguardian.com/profile/stuart-heritage | ’A woman trapped in an… | The Simpson couple have… | NULL |

The returned data.frame contains important meta information about the news items and their full text. Notice that the function had no problem reading the link, even though it was shortened. paperboy is an unfinished and highly experimental package at the moment, so you will often encounter this warning:

pb_deliver("google.com")
#> Warning in pb_deliver_paper.default(u, verbose
#> = verbose, ...): ...No method for domain
#> www.google.com yet, attempting generic approach
| url | expanded_url | domain | status | datetime | author | headline | text | misc |
|---|---|---|---|---|---|---|---|---|
| google.com | http://www.google.com/ | www.google.com | 200 | NA | NA | Google | © 2022 - Datenschutzerklärung - Nutzungsbedingungen | NULL |

The function still returns a data.frame, but important information is missing, in this case because it isn’t there. If you pass several URLs, the others are still processed normally. If there is a dead link in your url vector, its status will differ from 200 and its row will contain NAs.
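Because dead links and incomplete results end up in the same data.frame as good ones, it can help to split the result before analysis. The following is a minimal sketch in base R. The data.frame `res` is hand-made mock data with a subset of the columns shown above, so the example runs without network access, and the rule that a usable row has status 200 and non-missing text is an assumption for illustration, not something paperboy enforces:

```r
# Mock result mimicking some columns returned by pb_deliver
# (hand-made data, not real paperboy output).
res <- data.frame(
  url    = c("https://example.com/a",
             "https://example.com/dead",
             "https://example.com/b"),
  status = c(200L, 404L, 200L),
  text   = c("Full text of article a", NA, NA),
  stringsAsFactors = FALSE
)

# %in% is used instead of == so that an NA status counts as "not 200"
# rather than producing NA rows when subsetting.
is_ok <- res$status %in% 200L & !is.na(res$text)

ok     <- res[is_ok, ]   # fetched successfully and contains text
failed <- res[!is_ok, ]  # failed request or missing text

nrow(ok)    # number of usable articles
failed$url  # links to re-check or to parse manually
```

The links in `failed$url` are natural candidates for pb_collect and manual parsing.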

If you are unhappy with the results of the generic approach, you can use the second function in the package, pb_collect, to download the raw HTML and parse it yourself later:

pb_collect("google.com")
| url | expanded_url | domain | status | content_raw |
|---|---|---|---|---|
| google.com | http://www.google.com/ | www.google.com | 200 | <!doctype html><html itemscope… |

pb_collect uses concurrent requests to download many pages at the same time, which makes it very quick at collecting large amounts of data. You can then experiment with rvest or another package to extract the information you want from the content_raw column (df$content_raw if you stored the result in df).
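As a rough illustration of that parsing step, the sketch below uses rvest on a hand-written HTML string that stands in for one entry of content_raw. The HTML and the CSS selectors ("title", "article p") are assumptions made for this example; real outlets usually need their own selectors:

```r
library(rvest)

# Stand-in for one entry of the content_raw column
# (hand-written HTML, not a real download).
content_raw <- paste0(
  "<html><head><title>Example headline</title></head>",
  "<body><article><p>First paragraph.</p><p>Second paragraph.</p></article>",
  "</body></html>"
)

html <- read_html(content_raw)

# Headline: taken from <title> here; many sites need a more specific selector.
headline <- html_text2(html_element(html, "title"))

# Full text: collect all paragraphs inside <article> and join them.
paragraphs <- html_text2(html_elements(html, "article p"))
text <- paste(paragraphs, collapse = "\n")

headline
text
```

The same calls can be applied to each element of content_raw, for example with lapply, once the selectors work for a given outlet.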

For developers

If there is no scraper for a news site and you want to contribute one to this project, you can become a co-author of this package by adding it via a pull request. First check the available scrapers as well as open issues and pull requests. Open a new issue or comment on an existing one to communicate that you are working on a scraper (so that work isn’t done twice). Then pull a few articles with pb_collect and start parsing the HTML in the content_raw column (preferably with rvest).

Every webscraper should return a tibble with the following format:

| column | type | description |
|---|---|---|
| url | character | the original url fed to the scraper |
| expanded_url | character | the full url |
| domain | character | the domain |
| status | integer | http status code |
| datetime | as.POSIXct | publication datetime |
| headline | character | the headline |
| author | character | the author |
| text | character | the full text |
| misc | list | all other information that can be consistently found on a specific outlet |

Since some outlets provide additional information, the misc column was included so that it can be retained.
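While developing a scraper, it may be useful to check its output against this format before opening a pull request. The helper below is a hypothetical sketch in base R and is not part of paperboy; the function name `check_scraper_output` and the example row are made up for illustration. It works on a plain data.frame as well as on a tibble, since a tibble is also a data.frame:

```r
# Hypothetical helper (not part of paperboy): checks that a scraper's output
# has the columns and classes listed in the format table above.
check_scraper_output <- function(x) {
  expected <- c(
    url = "character", expanded_url = "character", domain = "character",
    status = "integer", datetime = "POSIXct", headline = "character",
    author = "character", text = "character", misc = "list"
  )
  missing_cols <- setdiff(names(expected), names(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  for (col in names(expected)) {
    if (!inherits(x[[col]], expected[[col]])) {
      stop("Column '", col, "' should be of class ", expected[[col]])
    }
  }
  invisible(TRUE)
}

# Made-up example row in the expected format.
example_row <- data.frame(
  url = "https://example.com/a",
  expanded_url = "https://example.com/a",
  domain = "example.com",
  status = 200L,
  datetime = as.POSIXct("2021-07-12 12:00:13", tz = "UTC"),
  headline = "Example headline",
  author = "Jane Doe",
  text = "Full text.",
  stringsAsFactors = FALSE
)
example_row$misc <- list(NULL)  # list column; empty for this outlet

check_scraper_output(example_row)  # passes silently, returns TRUE invisibly
```

A check like this catches common slips early, such as returning status as a character string or datetime as plain text.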

Available Scrapers

| domain | status | author | issues |
|---|---|---|---|
| buzzfeed.com | | | #1 |
| cbslnk.cbsileads.com | | | #1 |
| dailymail.co.uk | | @JBGruber | |
| decider.com | | | #1 |
| edition.cnn.com | | @JBGruber | |
| eu.usatoday.com | | @JBGruber | |
| faz.net | | @JBGruber | |
| forbes.com | | @JBGruber | #2 |
| fortune.com | | | #1 |
| ftw.usatoday.com | | @JBGruber | |
| huffingtonpost.co.uk | | @JBGruber | |
| lnk.techrepublic.com | | | #1 |
| marketwatch.com | | @JBGruber | |
| newsweek.com | | @JBGruber | |
| nypost.com | | @JBGruber | |
| nytimes.com | | @JBGruber | |
| pagesix.com | | | #1 |
| theguardian.com | | @JBGruber | |
| time.com | | | #1 |
| us.cnn.com | | @JBGruber | |
| usatoday.com | | @JBGruber | |
| washingtonpost.com | | @JBGruber | |
| wsj.com | | @JBGruber | |
| www.boston.com | | | #1 |
| www.bostonglobe.com | | | #1 |
| www.cbsnews.com | | @JBGruber | |
| www.cnet.com | | @JBGruber | |
| www.foxbusiness.com | | @JBGruber | |
| www.foxnews.com | | @JBGruber | |
| www.latimes.com | | @JBGruber | |
| www.msnbc.com | | | #1 |
| www.sfgate.com | | @JBGruber | |
| www.telegraph.co.uk | | @JBGruber | |
| www.thelily.com | | | #1 |
| www.thismorningwithgordondeal.com | | | #1 |
| www.tribpub.com | | | #1 |

The status column uses three states:

  • Runs without known issues
  • Runs with some issues
  • Currently not working, fix has been requested

Contributors

jbgruber, seankellyhp
