
bryanwhiting / generalconference

Stars: 4 · Watchers: 1 · Forks: 0 · Size: 41.48 MB

Scrape General Conference Talks using R

Home Page: https://bryanwhiting.github.io/generalconference

License: Other

R 100.00%
data-science text-mining lds-scriptures general-conference tidytext r r-lang r-stats

generalconference's People

Contributors: bryanwhiting
Stargazers: 4
Watchers: 1

generalconference's Issues

skipped on 2021 04

scrape_conference_talks(year = year, month = month, path = path)
#p1 not in .body-block p: https://www.churchofjesuschrist.org/study/general-conference/2021/04/14newman?lang=eng
Pulled #p1 for metadata but author1 is not null.
The following urls were skipped:
/study/general-conference/2021/04/11nelson
/study/general-conference/2021/04/12uchtdorf
/study/general-conference/2021/04/13jones
/study/general-conference/2021/04/14newman
/study/general-conference/2021/04/15stevenson
/study/general-conference/2021/04/16gong
/study/general-conference/2021/04/17eyring
/study/general-conference/2021/04/21oaks
/study/general-conference/2021/04/22larson
/study/general-conference/2021/04/23holland
/study/general-conference/2021/04/24becerra
/study/general-conference/2021/04/25renlund
/study/general-conference/2021/04/26andersen
/study/general-conference/2021/04/27mutombo
/study/general-conference/2021/04/28ballard
/study/general-conference/2021/04/31cook
/study/general-conference/2021/04/32corbitt
/study/general-conference/2021/04/33nielsen
/study/general-conference/2021/04/34eyring
/study/general-conference/2021/04/35oaks
/study/general-conference/2021/04/36nelson
/study/general-conference/2021/04/41soares
/study/general-conference/2021/04/42aburto
/study/general-conference/2021/04/43palmer
/study/general-conference/2021/04/44dube
/study/general-conference/2021/04/45teixeira
/study/general-conference/2021/04/46wakolo
/study/general-conference/2021/04/47wong
/study/general-conference/2021/04/48teh
/study/general-conference/2021/04/49nelson
/study/general-conference/2021/04/51oaks
/study/general-conference/2021/04/52rasband
/study/general-conference/2021/04/53dyches
/study/general-conference/2021/04/54christofferson
/study/general-conference/2021/04/55walker
/study/general-conference/2021/04/56bednar
/study/general-conference/2021/04/57nelson
Saved out to:data/sessions/202104.rds
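
A minimal way to reproduce the skip for one of these talks, using the same ".body-block p" selector the message refers to (a diagnostic sketch; the interpretation of the skip condition is an assumption based on the message above):

library(rvest)

# Check whether the #p1 anchor exists inside .body-block paragraphs, which is
# the condition behind the "#p1 not in .body-block p" message.
url <- "https://www.churchofjesuschrist.org/study/general-conference/2021/04/14newman?lang=eng"
doc <- read_html(url)

p_ids <- doc %>%
  html_elements(".body-block p") %>%
  html_attr("id")

"p1" %in% p_ids  # FALSE would explain why the talk was skipped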

Deprecate old extract_metadata

DEPRECATED 2021-08-30

#' extract_metadata <- function(rv_doc) {
#'   #' Extract title, author, and kicker from a url and return as a row in a
#'   #' dataframe.
#'
#'   # The .body-block contains the speech text. But the #p anchors
#'   # can be wrong.
#'   # Returns list of p1, p2, p3... for new talks and p2, p3, p4 for old talks
#'   p_bodies <- rv_doc %>%
#'     html_elements(".body-block p") %>%
#'     html_attr("id")
#'
#'   # Explaining !("p1" %in% p_bodies):
#'   # Sometimes, the first paragraph isn't p1,
#'   # e.g., "https://www.churchofjesuschrist.org/study/general-conference/2019/04/27homer"
#'   # First paragraph is "p20", then 2nd is "p1".
#'   if ("p1" %in% p_bodies) {
#'     # In new talks, #p1 is the paragraph text
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1")
#'     map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       return()
#'   } else if (p_bodies[1] == "p5") {
#'     # Some talks start at p5, and p1-p4 are the title, author, author, and kicker
#'     # url <- "https://www.churchofjesuschrist.org/study/liahona/2020/11/15cook?lang=eng"
#'     # rv_doc <- read_html(url)
#'     elements <- c("#p1", "#p2", "#p3", "#p4")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       rename(title1 = p1,
#'              author1 = p2,
#'              author2 = p3,
#'              kicker1 = p4)
#'   } else {
#'     # In older talks, #p1 is the author block
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1", "#p1")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), ""))
#'
#'     if (is.na(df$author1)) {
#'       df$author1 <- df$p1
#'     } else {
#'       url <- extract_url_from_rv_doc(rv_doc)
#'       message(
#'         "#p1 not in .body-block p: ", url,
#'         "\nPulled #p1 for metadata but author1 is not null."
#'       )
#'     }
#'     df %>%
#'       select(-p1) %>%
#'       return()
#'   }
#' }

post processing: Clean up author names

  • Clean author1: convert "By president Nelson" to "Russell M. Nelson", with a separate field for the calling ("President") (see the sketch after this list)
  • Standardize all callings, e.g., "Quorum of Twelve" vs. "Of the Quorum of the Twelve", etc.
  • Compare compressed sizes of the unnested data vs. the nested data. It's possible nesting is just annoying for end users.
  • Add the post-processing I'm using for the general conference blog, such as deep links.
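
A minimal sketch of what that cleanup could look like, assuming the scraped data has a raw author1 column; clean_author_names is a hypothetical helper, not an existing package function:

library(dplyr)
library(stringr)

# Hypothetical helper: strip the leading "By", pull out the calling, and keep
# the remaining text as the cleaned author name.
# e.g., "By President Russell M. Nelson" -> calling = "President",
#       author = "Russell M. Nelson"
clean_author_names <- function(df) {
  callings <- "President|Elder|Bishop|Sister|Brother"
  df %>%
    mutate(
      author_raw = str_squish(str_remove(author1, regex("^By\\s+", ignore_case = TRUE))),
      calling    = str_extract(author_raw, regex(paste0("^(", callings, ")"), ignore_case = TRUE)),
      author     = str_squish(str_remove(author_raw, regex(paste0("^(", callings, ")\\s*"), ignore_case = TRUE)))
    )
}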

Slow down the crawler

Print out URLs one by one and add a system pause between requests; I think the IP address is getting blocked (see the sketch below).
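
A sketch of the polite-crawling loop; talk_urls is an assumed vector of talk page URLs, and Sys.sleep() is the base-R pause mentioned above:

library(rvest)

# talk_urls: vector of talk page URLs (assumed to exist already).
# Print each URL as it is fetched and pause between requests so the crawler
# doesn't hammer the server.
for (url in talk_urls) {
  message("Scraping: ", url)
  doc <- read_html(url)
  # ... extract metadata and paragraphs here ...
  Sys.sleep(runif(1, min = 2, max = 5))  # random 2-5 second pause
}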

Are all the data being scraped?

https://www.churchofjesuschrist.org/study/general-conference/2020/10/45andersen.p5?lang=eng#p5
This paragraph is shortened in the data to just the text that comes before the quotation.

Same problem with this:
https://www.churchofjesuschrist.org/study/general-conference/2020/04/26renlund.p21?lang=eng#p21

I'm guessing this is caused by how the <sup> footnote references are parsed.

A better approach would be to scrape the raw content and process it afterward; that way the content doesn't need to be rescraped every time the parsing changes (see the sketch below).
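
A sketch of scraping raw content first and parsing later; save_raw_talk and the output directory are assumptions, not existing package pieces:

library(xml2)

# Hypothetical helper: download a talk once and keep the raw HTML on disk, so
# parsing can be redone offline as many times as needed.
save_raw_talk <- function(url, out_dir = "data/raw_html") {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  doc <- read_html(url)
  out <- file.path(out_dir, paste0(gsub("[^A-Za-z0-9]+", "_", url), ".html"))
  write_html(doc, out)
  invisible(out)
}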

Save footnotes as extra rows

If footnote text and links can be scraped, it would be great to add those as rows to the talk.

Then we can add a flag for is_footnote.
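
A sketch of what that could look like; the ".notes li" selector is a guess at the footnote markup, not a confirmed selector:

library(rvest)
library(tibble)

# Hypothetical helper: pull footnote text and return it as extra rows that can
# be bound to the paragraph rows, flagged with is_footnote = TRUE.
extract_footnotes <- function(rv_doc) {
  notes <- rv_doc %>%
    html_elements(".notes li") %>%  # assumed selector for the footnote list
    html_text2()
  tibble(text = notes, is_footnote = TRUE)
}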

Consolidate documentation

Too many vignettes.

  • Move intro to a new file called "using general conference scrapers"
  • Rename "how to scrape" to "ways to scrape content"
  • Move Developers.md to Contributing.md
  • Organize vignettes a little more.

refactoring

  • rename talk_urls to talk_url_stub
  • rename talk_urls to talk_path and session_url to session_path
  • refactor html_document to rv_document
  • Rename conference url scraper R script.

Next steps

Todo:

  • build looper for every year (see the sketch after this list)
  • save out every year, or a set of years?
  • Add paragraph_id
  • Clean names
  • Add "calling" field (is this author2?)
  • Create README.rmd and save to github_markdown
  • fix the markdown_github error on pkgdown
  • move introduction to README.rmd; build a separate "intro" vignette that shows how to analyze the data with tidytext
  • move developer notes to CONTRIBUTING.md
  • set up github actions
  • add coveralls
  • build hex
  • get to the point where install_github() works
  • figure out imports
  • Move the data saver into the DATASET.R file; all the other functions should only return a dataset.
  • Figure out how to save individual datasets.
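
A sketch of the year/month looper, reusing the scrape_conference_talks() call from the log above; the year range and output path are assumptions:

library(purrr)

# Conferences happen in April and October; loop over every year/month pair and
# save one .rds per session.
years  <- 1971:2021
months <- c("04", "10")

walk(years, function(year) {
  walk(months, function(month) {
    scrape_conference_talks(year = year, month = month,
                            path = "data/sessions")
  })
})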

four sessions don't process

Debug and create unit tests (a regression-test sketch follows the list):

  • data/sessions/202010.rds
  • data/sessions/201604.rds
  • data/sessions/198110.rds
  • data/sessions/197110.rds
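
A sketch of such a regression test, assuming each saved .rds is a data frame with an author1 column, as in the extract_metadata code above:

library(testthat)

failing_sessions <- c(
  "data/sessions/202010.rds",
  "data/sessions/201604.rds",
  "data/sessions/198110.rds",
  "data/sessions/197110.rds"
)

test_that("previously failing sessions load and have authors", {
  for (path in failing_sessions) {
    df <- readRDS(path)
    expect_gt(nrow(df), 0)
    expect_false(any(is.na(df$author1)))
    expect_true(all(nchar(df$author1) > 0))
  }
})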

Unit test not working ans$author1

This unit test fails for all authors even though the printed strings look identical, which points to an invisible-character difference (e.g., a non-breaking space) rather than a real mismatch. Switched to comparing nchar() for the time being.

>   expect_equal(ans$author1, "Presented by President Henry B. Eyring")
Error: ans$author1 (`actual`) not equal to "Presented by President Henry B. Eyring" (`expected`).

`actual`:   "Presented by President Henry B. Eyring"
`expected`: "Presented by President Henry B. Eyring"
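
If it is an invisible-character mismatch, normalizing whitespace before comparing would make the assertion meaningful again (a sketch; the non-breaking-space diagnosis is a guess):

library(stringr)
library(testthat)

# Replace non-breaking spaces and collapse runs of whitespace before comparing.
normalize_ws <- function(x) {
  str_squish(str_replace_all(x, "\u00a0", " "))
}

expect_equal(
  normalize_ws(ans$author1),
  "Presented by President Henry B. Eyring"
)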

malformed authors field

Attempting to install via devtools::install_github("bryanwhiting/generalconference") yielded the following error:

checking DESCRIPTION meta-information
Malformed Authors@R field: <text>:4:66: unexpected '<'
3:   comment = c(URL = "<https://www.bryanwhiting.com>")),
4:   person("Gospel Analysis", role = c("fnd"), comment = c(URL = <
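
The parser appears to be choking on an unquoted URL in the second person's comment field. A well-formed Authors@R value would look something like the sketch below; the author names are assumptions, and the second URL is a placeholder because the real one is truncated in the error output:

# Sketch of the Authors@R value in DESCRIPTION. URLs inside comment must be
# quoted strings.
c(
  person("Bryan", "Whiting",
         role = c("aut", "cre"),
         comment = c(URL = "https://www.bryanwhiting.com")),
  person("Gospel Analysis",
         role = "fnd",
         # placeholder: the real URL is truncated in the error message
         comment = c(URL = "https://example.com"))
)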

Update home page

  • Add GitHub links
  • Move developers to contributing.md
  • Make code of conduct
  • Add my website to the author nav.

In _pkgdown.yml:

home:
  strip_header: true

See the pkgdown documentation on build_home().

Scrape Spanish

Pull a few years of Spanish data. Save to a different location.

  • Add functionality to the looper for spa instead of eng (see the URL sketch after this list)
  • Download 4 years of data
  • Process and save out
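
A sketch of the language switch, based on the ?lang= query parameter visible in the URLs above; build_talk_url is a hypothetical helper:

# Hypothetical helper: the same talk path serves any translation via the lang
# query parameter, so the looper only needs to pass lang = "spa".
build_talk_url <- function(path, lang = "eng") {
  paste0("https://www.churchofjesuschrist.org", path, "?lang=", lang)
}

build_talk_url("/study/general-conference/2021/04/11nelson", lang = "spa")
# "https://www.churchofjesuschrist.org/study/general-conference/2021/04/11nelson?lang=spa"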
