bryanwhiting / generalconference

Scrape General Conference Talks using R

Home Page: https://bryanwhiting.github.io/generalconference

License: Other

R 100.00%
data-science text-mining lds-scriptures general-conference tidytext r r-lang r-stats

generalconference's Issues

malformed authors field

Attempting to install via devtools::install_github("bryanwhiting/generalconference") yielded the following error:

checking DESCRIPTION meta-information
Malformed Authors@R field:
<text>:4:66: unexpected '<'
3: comment = c(URL = "<https://www.bryanwhiting.com>")),
4: person("Gospel Analysis", role = c("fnd"), comment = c(URL = <
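
The parser error points at an unquoted `<` in the second `person()` entry. A minimal sketch of the likely fix, assuming the URL only needs to be a quoted string; `https://example.org` is a placeholder, since the real URL is cut off in the error message:

```r
# Placeholder URL: the real value is truncated in the error output.
broken <- 'person("Gospel Analysis", role = c("fnd"), comment = c(URL = <https://example.org>))'
fixed  <- 'person("Gospel Analysis", role = c("fnd"), comment = c(URL = "https://example.org"))'

# TRUE when the text is syntactically valid R, FALSE otherwise.
parses <- function(txt) {
  !inherits(try(parse(text = txt), silent = TRUE), "try-error")
}

parses(broken)  # the bare '<' is a syntax error
parses(fixed)   # a quoted URL parses cleanly

# Evaluating the fixed entry yields a regular person object.
funder <- eval(parse(text = fixed))
funder$comment
```

Running the same `parses()` check over the whole Authors@R field before pushing would catch this class of error locally.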

Scrape Spanish

Pull a few years of Spanish data. Save to a different location.

  • Add functionality to looper for spa instead of eng
  • Download 4 years of data
  • Process and Save out
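
The talk URLs quoted elsewhere in these issues carry a `?lang=eng` query parameter, so one way to support Spanish is to make the language a function argument. A minimal sketch; `build_talk_url()` and `session_file()` are hypothetical helpers rather than functions from the package, and the per-language directory layout is an assumption:

```r
base_url <- "https://www.churchofjesuschrist.org"

# Hypothetical helper: append the language code to a talk path.
build_talk_url <- function(talk_path, lang = "eng") {
  paste0(base_url, talk_path, "?lang=", lang)
}

# Hypothetical helper: keep each language in its own directory.
session_file <- function(year, month, lang = "eng", root = "data") {
  file.path(root, lang, "sessions", sprintf("%04d%02d.rds", year, month))
}

build_talk_url("/study/general-conference/2021/04/11nelson", lang = "spa")
session_file(2021, 4, lang = "spa")
```

Defaulting `lang` to `"eng"` would leave existing English calls unchanged.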

Deprecate old extract_metadata

DEPRECATED 2021-08-30

#' extract_metadata <- function(rv_doc) {
#'   #' Extract title, author, and kicker from a url and return as a row in a
#'   #' dataframe.
#'
#'   # the .body-block contains the speech text. But the #p anchors
#'   # can be wrong.
#'   # returns list of p1, p2, p3... for new talks and p2, p3, p4 for old talks
#'   p_bodies <- rv_doc %>%
#'     html_elements(".body-block p") %>%
#'     html_attr("id")
#'
#'   # Explaining !("p1" %in% p_bodies):
#'   # Sometimes, the first paragraph isn't p1
#'   # e.g., "https://www.churchofjesuschrist.org/study/general-conference/2019/04/27homer"
#'   # First paragraph is "p20", then 2nd is "p1".
#'   if ("p1" %in% p_bodies) {
#'     # In new talks, #p1 is the paragraph text
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1")
#'     map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       return()
#'   } else if (p_bodies[1] == "p5") {
#'     # Some talks start at p5, and p1-4 are the title, author, author and kicker
#'     # url <- "https://www.churchofjesuschrist.org/study/liahona/2020/11/15cook?lang=eng"
#'     # rv_doc <- read_html(url)
#'     elements <- c("#p1", "#p2", "#p3", "#p4")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       rename(title1 = p1,
#'              author1 = p2,
#'              author2 = p3,
#'              kicker1 = p4)
#'   } else {
#'     # In older talks, #p1 is the author block
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1", "#p1")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), ""))
#'
#'     if (is.na(df$author1)) {
#'       df$author1 <- df$p1
#'     } else {
#'       url <- extract_url_from_rv_doc(rv_doc)
#'       message(
#'         "#p1 not in .body-block p: ", url,
#'         "\nPulled #p1 for metadata but author1 is not null."
#'       )
#'     }
#'     df %>%
#'       select(-p1) %>%
#'       return()
#'   }
#' }

four sessions don't process

Debug and create unit tests:

  • data/sessions/202010.rds
  • data/sessions/201604.rds
  • data/sessions/198110.rds
  • data/sessions/197110.rds
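
One way to start debugging is a small check that each saved session file loads into a non-empty data frame. A minimal sketch; `session_is_valid()` is a hypothetical helper, and the assumption that each `.rds` file holds a data frame is not confirmed here:

```r
# Hypothetical helper: TRUE when the file exists and holds a non-empty data frame.
session_is_valid <- function(path) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  session <- tryCatch(readRDS(path), error = function(e) NULL)
  is.data.frame(session) && nrow(session) > 0
}

failing <- c(
  "data/sessions/202010.rds",
  "data/sessions/201604.rds",
  "data/sessions/198110.rds",
  "data/sessions/197110.rds"
)

# Named logical vector: which of the four sessions load correctly.
vapply(failing, session_is_valid, logical(1))
```

Each `FALSE` could then become a failing expectation in a unit test until the underlying session is fixed.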

Next steps

Todo:

  • build looper for every year
  • save out every year or set of years?
  • Add paragraph_id
  • Clean names
  • Add "calling" field. (is this author2?)
  • Create README.rmd and save to github_markdown
  • fix the markdown_github error on pkgdown
  • move introduction to README.rmd. Build a separate "intro" that shows how to analyze data with tidytext
  • move developer notes to CONTRIBUTING.md
  • set up github actions
  • add coveralls
  • build hex
  • get to point of install_github()
  • figure out imports
  • Move data saver into the DATASET.R file, and all the functions should only return a dataset.
  • Figure out how to save individual datasets.

post processing: Clean up author names

  • Clean author1: Convert "By president Nelson" to "Russell M. Nelson" with a separate field for the calling, "President"
  • Standardize all callings, e.g., "Quorum of Twelve" and "Of the Quorum of the Twelve".
  • Compare compression sizes of unnested vs. nested data. It's possible nesting is just annoying end users.
  • Add the post-processing I'm using from the general conference blog, such as deep links.
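
The first item can be sketched as a split of the raw author string into a calling and a name. This is only an illustration: `split_author()` is a hypothetical helper, and the list of titles is an assumption that would need checking against the scraped data:

```r
# Assumed list of titles; extend as new ones show up in the data.
titles <- c("President", "Elder", "Sister", "Bishop")

# Hypothetical helper: split a raw author string into calling and name.
split_author <- function(author, known_titles = titles) {
  # Drop a leading "By" or "Presented by", ignoring case.
  cleaned <- sub("^\\s*(presented\\s+)?by\\s+", "", author, ignore.case = TRUE)
  # Compare the first word against the known titles, ignoring case.
  first_word <- sub("\\s.*$", "", cleaned)
  idx <- match(tolower(first_word), tolower(known_titles))
  data.frame(
    calling = known_titles[idx],  # NA when no known title is present
    name = ifelse(is.na(idx), cleaned, sub("^\\S+\\s+", "", cleaned)),
    stringsAsFactors = FALSE
  )
}

split_author(c("By president Nelson", "Presented by President Henry B. Eyring"))
```

Expanding a bare surname such as "Nelson" to a full name would still need a separate lookup table of speakers, which this sketch does not attempt.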

Consolidate documentation

Too many vignettes.

  • Move intro to a new file called "using general conference scrapers"
  • Rename "how to scrape" to "ways to scrape content"
  • Move Developers.md to Contributing.md
  • Organize vignettes a little more.

Are all the data being scraped?

https://www.churchofjesuschrist.org/study/general-conference/2020/10/45andersen.p5?lang=eng#p5
In the scraped data, this paragraph is truncated to just the text before the quote.

Same problem with this:
https://www.churchofjesuschrist.org/study/general-conference/2020/04/26renlund.p21?lang=eng#p21

I'm guessing this has to be due to the <sup> references being parsed.

A better approach would be to scrape the raw content and process it afterward. That way I don't have to rescrape all the content whenever the parsing changes.
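
If the `<sup>` guess is right, keeping the raw HTML lets the parsing be fixed without rescraping. A minimal sketch, assuming the footnote markers are `<sup>` elements inside the paragraph; the HTML snippet is invented for illustration and is not taken from the site:

```r
library(rvest)
library(xml2)

# Invented stand-in for raw HTML that would have been saved at scrape time.
raw_html <- paste0(
  '<div class="body-block"><p id="p5">He taught',
  '<sup class="marker">1</sup>: "Quoted text." Closing sentence.</p></div>'
)

# Later processing step: parse the stored HTML, drop the footnote markers,
# then extract the full paragraph text.
doc <- read_html(raw_html)
xml_remove(xml_find_all(doc, "//sup"))
paragraph <- html_text2(html_element(doc, "#p5"))
paragraph
```

Because the stored HTML is untouched, a different treatment of the markers can be tried later against the same files.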

Update home page

  • Add GitHub links
  • Move developers to contributing.md
  • Make code of conduct
  • Add my website to author nav.

home:
  strip_header: true

See pkgdown site on build_home().

Save footnotes as extra rows

If footnote text and links can be scraped, it would be great to add those as rows to the talk.

Then we can add a flag for is_footnote.
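
A minimal sketch of what that table could look like. The column names and the placeholder rows are assumptions for illustration, not the package's actual schema:

```r
# Placeholder rows; column names are assumptions, not the package's schema.
paragraphs <- data.frame(
  paragraph_id = c("p1", "p2"),
  text = c("First paragraph text", "Second paragraph text"),
  link = NA_character_,
  is_footnote = FALSE,
  stringsAsFactors = FALSE
)
footnotes <- data.frame(
  paragraph_id = "note1",
  text = "Footnote text",
  link = "https://example.org/reference",
  is_footnote = TRUE,
  stringsAsFactors = FALSE
)

# Footnotes become extra rows of the same talk.
talk <- rbind(paragraphs, footnotes)

# Body text only:
talk[!talk$is_footnote, ]
```

With the flag in place, text analysis can filter footnotes out while citation analysis can keep only them.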

refactoring

  • rename talk_urls to talk_url_stub
  • rename talk_urls to talk_path and session_url to session_path
  • refactor html_document to rv_document
  • Rename conference url scraper R script.

skipped on 2021 04

scrape_conference_talks(year = year, month = month, path = path)
#p1 not in .body-block p: https://www.churchofjesuschrist.org/study/general-conference/2021/04/14newman?lang=eng
Pulled #p1 for metadata but author1 is not null.
The following urls were skipped:
/study/general-conference/2021/04/11nelson
/study/general-conference/2021/04/12uchtdorf
/study/general-conference/2021/04/13jones
/study/general-conference/2021/04/14newman
/study/general-conference/2021/04/15stevenson
/study/general-conference/2021/04/16gong
/study/general-conference/2021/04/17eyring
/study/general-conference/2021/04/21oaks
/study/general-conference/2021/04/22larson
/study/general-conference/2021/04/23holland
/study/general-conference/2021/04/24becerra
/study/general-conference/2021/04/25renlund
/study/general-conference/2021/04/26andersen
/study/general-conference/2021/04/27mutombo
/study/general-conference/2021/04/28ballard
/study/general-conference/2021/04/31cook
/study/general-conference/2021/04/32corbitt
/study/general-conference/2021/04/33nielsen
/study/general-conference/2021/04/34eyring
/study/general-conference/2021/04/35oaks
/study/general-conference/2021/04/36nelson
/study/general-conference/2021/04/41soares
/study/general-conference/2021/04/42aburto
/study/general-conference/2021/04/43palmer
/study/general-conference/2021/04/44dube
/study/general-conference/2021/04/45teixeira
/study/general-conference/2021/04/46wakolo
/study/general-conference/2021/04/47wong
/study/general-conference/2021/04/48teh
/study/general-conference/2021/04/49nelson
/study/general-conference/2021/04/51oaks
/study/general-conference/2021/04/52rasband
/study/general-conference/2021/04/53dyches
/study/general-conference/2021/04/54christofferson
/study/general-conference/2021/04/55walker
/study/general-conference/2021/04/56bednar
/study/general-conference/2021/04/57nelson
Saved out to:data/sessions/202104.rds

Unit test not working ans$author1

This unit test fails for all authors. Switched to nchar() for the time being.

>   expect_equal(ans$author1, "Presented by President Henry B. Eyring")
Error: ans$author1 (`actual`) not equal to "Presented by President Henry B. Eyring" (`expected`).

`actual`:   "Presented by President Henry B. Eyring"
`expected`: "Presented by President Henry B. Eyring"
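
Two strings that print identically but compare unequal often differ in an invisible character. One possible cause, not confirmed for this data, is a non-breaking space (U+00A0) in the scraped HTML. A minimal sketch of how to check, with the non-breaking space inserted here purely as an assumption:

```r
# Assumption for illustration: the scraped string contains a non-breaking space.
actual   <- "Presented by President Henry\u00a0B. Eyring"
expected <- "Presented by President Henry B. Eyring"

identical(actual, expected)          # FALSE, although both print alike
nchar(actual) == nchar(expected)     # TRUE, so an nchar() check still passes

# Locate the differing character and inspect its code point.
diff_pos <- which(utf8ToInt(actual) != utf8ToInt(expected))
utf8ToInt(actual)[diff_pos]          # 160 is U+00A0, the non-breaking space

# Hypothetical helper: normalise non-breaking spaces before comparing.
normalize_spaces <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)
identical(normalize_spaces(actual), expected)
```

Running `utf8ToInt()` on the real `ans$author1` would show whether this, or some other invisible character, is what the test is tripping over.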

Slow down the crawler

Print out URLs one by one. Add a system pause between requests. I think my IP address is getting blocked.
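
A minimal sketch of a throttled loop along those lines. `crawl_politely()` is a hypothetical helper and the two-second default is an arbitrary assumption; the fetch function is passed in so the loop can be tried offline:

```r
# Hypothetical helper: fetch URLs one at a time with a pause in between.
# `fetch` is any function that takes a URL, e.g. rvest::read_html.
crawl_politely <- function(urls, fetch, delay_seconds = 2) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (i in seq_along(urls)) {
    message("[", i, "/", length(urls), "] ", urls[i])
    # Wrap in list() so that a NULL from a failed fetch keeps its slot.
    results[i] <- list(tryCatch(fetch(urls[i]), error = function(e) NULL))
    if (i < length(urls)) {
      Sys.sleep(delay_seconds)  # pause between requests
    }
  }
  results
}

# Offline demonstration with a stand-in fetch function.
demo <- crawl_politely(c("/a", "/b"), fetch = toupper, delay_seconds = 0.1)
```

Entries left as `NULL` mark the URLs that failed, which gives a ready-made list to retry.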
