Refactor test-scrape_talk.R to pull urls just once

These tests rely on rv_docs. We could speed up testing by reading
each URL once and reusing the parsed documents as test "cases".
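
One way to read each URL only once is a small cache shared by the test file. This is a minimal sketch, not the package's code: `rv_doc_cache` and `get_rv_doc()` are hypothetical names, and the `reader` argument is an assumption added so the helper can be exercised without network access.

```r
# Hypothetical cache: each URL is read once, then reused by every test case.
rv_doc_cache <- new.env(parent = emptyenv())

get_rv_doc <- function(url, reader = rvest::read_html) {
  # Only call the reader when this URL has not been seen before
  if (!exists(url, envir = rv_doc_cache, inherits = FALSE)) {
    assign(url, reader(url), envir = rv_doc_cache)
  }
  get(url, envir = rv_doc_cache, inherits = FALSE)
}
```

Each test would then call `get_rv_doc(url)` instead of reading the page itself, so repeated cases against the same talk trigger a single download.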

malformed authors field

Attempting to install via devtools::install_github("bryanwhiting/generalconference") yielded the following error:

checking DESCRIPTION meta-information
Malformed Authors@R field:
  <text>:4:66: unexpected '<'
3: comment = c(URL = "<https://www.bryanwhiting.com>")),
4: person("Gospel Analysis", role = c("fnd"), comment = c(URL = <
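
The parse error points at an unquoted URL in the second person() entry. A minimal sketch of a well-formed entry, assuming the fix is simply to quote the value; the URL below is a placeholder because the real one is cut off in the error output.

```r
# Hypothetical corrected entry for the Authors@R field in DESCRIPTION.
# "https://example.org" is a placeholder, not the project's actual URL.
funder <- person(
  "Gospel Analysis",
  role = c("fnd"),
  comment = c(URL = "https://example.org")
)
print(funder)
```

Evaluating the entry in an R console like this is a quick way to confirm that it parses before running the package check again.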

Add documentation on how to build README.md

Add this step to contributing.md.
Find a way to automate this without the knit button.
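
A possible way to build the README without the knit button is to render it from the console or a script. This is a sketch only: `build_readme_md()` is a hypothetical helper, and it assumes README.Rmd lives in the package root and that rmarkdown is installed.

```r
# Hypothetical helper: render README.Rmd to README.md without RStudio's knit button.
build_readme_md <- function(input = "README.Rmd") {
  if (!file.exists(input)) {
    stop("Cannot find ", input, "; run this from the package root.")
  }
  # github_document produces GitHub-flavored markdown (README.md)
  rmarkdown::render(input, output_format = "github_document", quiet = TRUE)
}
```

devtools::build_readme() covers the same need and may be the simpler option if devtools is already part of the workflow.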

Scrape Spanish

Pull a few years of Spanish data. Save to a different location.

Add functionality to looper for spa instead of eng
Download 4 years of data
Process and Save out

Deprecate old extract_metadata

DEPRECATED 2021-08-30

#' extract_metadata <- function(rv_doc) {
#'   #' Extract title, author, and kicker from a url and return as a row in a
#'   #' dataframe.
#'
#'   # the .body-block contains the speech text. But the #p anchors
#'   # can be wrong.
#'   # returns list of p1, p2, p3... for new talks and p2, p3, p4 for old talks
#'   p_bodies <- rv_doc %>%
#'     html_elements(".body-block p") %>%
#'     html_attr("id")
#'
#'   # Explaining !("p1" %in% p_bodies):
#'   # Sometimes, the first paragraph isn't p1
#'   # e.g., "https://www.churchofjesuschrist.org/study/general-conference/2019/04/27homer"
#'   # First paragraph is "p20", then 2nd is "p1".
#'   if ("p1" %in% p_bodies) {
#'     # In new talks, #p1 is the paragraph text
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1")
#'     map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       return()
#'   } else if (p_bodies[1] == "p5") {
#'     # Some talks start at p5, and p1-4 are the title, author, author and kicker
#'     # url <- "https://www.churchofjesuschrist.org/study/liahona/2020/11/15cook?lang=eng"
#'     # rv_doc <- read_html(url)
#'     elements <- c("#p1", "#p2", "#p3", "#p4")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       rename(title1 = p1,
#'              author1 = p2,
#'              author2 = p3,
#'              kicker1 = p4)
#'   } else {
#'     # In older talks, #p1 is the author block
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1", "#p1")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), ""))
#'
#'     if (is.na(df$author1)) {
#'       df$author1 <- df$p1
#'     } else {
#'       url <- extract_url_from_rv_doc(rv_doc)
#'       message(
#'         "#p1 not in .body-block p: ", url,
#'         "\nPulled #p1 for metadata but author1 is not null."
#'       )
#'     }
#'     df %>%
#'       select(-p1) %>%
#'       return()
#'   }
#' }

Add GitHub Actions to automate unit tests and pull requests

Run unit tests
Run build:check

four sessions don't process

Debug and create unit tests:

data/sessions/202010.rds
data/sessions/201604.rds
data/sessions/198110.rds
data/sessions/197110.rds

build looper for every year
save out every year or set of years?
Add paragraph_id
Clean names
Add "calling" field. (is this author2?)
Create README.rmd and save to github_markdown
fix the markdown_github error on pkgdown
move introduction to README.rmd. Build a separate "intro" that shows how to analyze data with tidytext
move developer notes to CONTRIBUTING.md

set up github actions
add coveralls
build hex
get to point of install_github()
figure out imports
Move data saver into the DATASET.R file, and all the functions should only return a dataset.
Figure out how to save individual datasets.

Add hexagon badge

post processing: Clean up author names

Clean author1: Convert "By President Nelson" to "Russell M. Nelson" with a separate field for the calling "President"
Standardize all callings. e.g., "Quorum of Twelve" and "Of the Quorum of the Twelve", etc.
Compare compression sizes of unnested data vs not. It's possible nesting is just annoying end users.
Add post processing I'm using from general conference blog, such as deep link.

Consolidate documentation

Too many vignettes.

Move intro to a new file called "using general conference scrapers"
Rename "how to scrape" to "ways to scrape content"
Move Developers.md to Contributing.md
Organize vignettes a little more.

Are all the data being scraped?

https://www.churchofjesuschrist.org/study/general-conference/2020/10/45andersen.p5?lang=eng#p5
This paragraph is shortened in the data to just the stuff before the quote.

Same problem with this:
https://www.churchofjesuschrist.org/study/general-conference/2020/04/26renlund.p21?lang=eng#p21

I'm guessing this has to be due to the <sup> references being parsed.

Better thing to do would be to scrape the raw content and process it afterward. That way I don't have to rescrape all the content.
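
If the `<sup>` footnote markers are indeed the cause, one option is to drop them from the parsed document before extracting paragraph text. This is a rough sketch under that assumption: the HTML string is made up, `strip_sup()` is a hypothetical helper, and it uses xml2 (which rvest builds on).

```r
library(xml2)

# Made-up paragraph with a footnote marker in the middle of the text
html <- '<div class="body-block"><p id="p5">First sentence.<sup>1</sup> Second sentence.</p></div>'
doc <- read_html(html)

# Hypothetical helper: remove every <sup> node, keeping the text around it.
# Note that xml2 documents are modified in place.
strip_sup <- function(doc) {
  xml_remove(xml_find_all(doc, ".//sup"))
  doc
}

text <- xml_text(xml_find_first(strip_sup(doc), ".//p[@id='p5']"))
print(text)
```

Saving the raw HTML first and running a step like this afterward would allow the cleaning rules to change without rescraping.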

Add Google analytics

A Google analytics user ID if you want to track the people who are using your site

template:
  params:
    ganalytics: UA-000000-01

https://pkgdown.r-lib.org/articles/pkgdown.html

Add metadata on sessions (num talks, num_sessions, etc.)

conference: number of talks
session: number of talks, paragraphs, words
paragraph: number of words
paragraph: p_num in session
session number (e.g., 178th general conference)
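
The counts listed above could be derived from a paragraph-level table. This is a sketch in base R on toy data; the column names (`session_id`, `talk_id`, `text`) are assumptions and may not match the package's datasets.

```r
# Toy paragraph-level data; column names are hypothetical.
paragraphs <- data.frame(
  session_id = c(1, 1, 1, 2),
  talk_id = c("a", "a", "b", "c"),
  text = c("one two", "three", "four five six", "seven"),
  stringsAsFactors = FALSE
)

# paragraph: number of words (whitespace-separated tokens)
paragraphs$n_words <- lengths(strsplit(trimws(paragraphs$text), "\\s+"))

# paragraph: position within its session
paragraphs$p_num <- ave(
  seq_along(paragraphs$session_id), paragraphs$session_id, FUN = seq_along
)

# session: number of talks, paragraphs, and words
session_stats <- data.frame(
  session_id = sort(unique(paragraphs$session_id)),
  n_talks = as.vector(tapply(paragraphs$talk_id, paragraphs$session_id,
                             function(x) length(unique(x)))),
  n_paragraphs = as.vector(tapply(paragraphs$text, paragraphs$session_id, length)),
  n_words = as.vector(tapply(paragraphs$n_words, paragraphs$session_id, sum))
)
print(session_stats)
```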

Add coveralls badges

Update home page

Add GitHub links
Move developers to contributing.md
Make code of conduct
Add my website to author nav.

home:
  strip_header: true

See pkgdown site on build_home().

pkgdown::build_news not working

file's not updating appropriately. whatever...

fix it?

rename talk_urls to talk_url_stub
rename talk_urls to talk_path and session_url to session_path
refactor html_document to rv_document
Rename conference url scraper R script.

skipped on 2021 04

scrape_conference_talks(year = year, month = month, path = path)
#p1 not in .body-block p: https://www.churchofjesuschrist.org/study/general-conference/2021/04/14newman?lang=eng
Pulled #p1 for metadata but author1 is not null.
The following urls were skipped:
/study/general-conference/2021/04/11nelson
/study/general-conference/2021/04/12uchtdorf
/study/general-conference/2021/04/13jones
/study/general-conference/2021/04/14newman
/study/general-conference/2021/04/15stevenson
/study/general-conference/2021/04/16gong
/study/general-conference/2021/04/17eyring
/study/general-conference/2021/04/21oaks
/study/general-conference/2021/04/22larson
/study/general-conference/2021/04/23holland
/study/general-conference/2021/04/24becerra
/study/general-conference/2021/04/25renlund
/study/general-conference/2021/04/26andersen
/study/general-conference/2021/04/27mutombo
/study/general-conference/2021/04/28ballard
/study/general-conference/2021/04/31cook
/study/general-conference/2021/04/32corbitt
/study/general-conference/2021/04/33nielsen
/study/general-conference/2021/04/34eyring
/study/general-conference/2021/04/35oaks
/study/general-conference/2021/04/36nelson
/study/general-conference/2021/04/41soares
/study/general-conference/2021/04/42aburto
/study/general-conference/2021/04/43palmer
/study/general-conference/2021/04/44dube
/study/general-conference/2021/04/45teixeira
/study/general-conference/2021/04/46wakolo
/study/general-conference/2021/04/47wong
/study/general-conference/2021/04/48teh
/study/general-conference/2021/04/49nelson
/study/general-conference/2021/04/51oaks
/study/general-conference/2021/04/52rasband
/study/general-conference/2021/04/53dyches
/study/general-conference/2021/04/54christofferson
/study/general-conference/2021/04/55walker
/study/general-conference/2021/04/56bednar
/study/general-conference/2021/04/57nelson
Saved out to:data/sessions/202104.rds

Unit test not working ans$author1

This unit test fails for all authors. Switched to nchar() for time being.

>   expect_equal(ans$author1, "Presented by President Henry B. Eyring")
Error: ans$author1 (`actual`) not equal to "Presented by President Henry B. Eyring" (`expected`).

`actual`:   "Presented by President Henry B. Eyring"
`expected`: "Presented by President Henry B. Eyring"
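
When two strings print identically but compare unequal, an invisible character is a common cause. The sketch below assumes a non-breaking space (U+00A0) in the scraped value; that is a guess, not something the error output confirms, and `normalize_nbsp()` is a hypothetical helper.

```r
# Assumption: the scraped value contains a non-breaking space (U+00A0).
actual <- "Presented by President Henry\u00a0B. Eyring"
expected <- "Presented by President Henry B. Eyring"

identical(actual, expected)

# Compare code points to locate the characters that differ
which(utf8ToInt(actual) != utf8ToInt(expected))

# Hypothetical helper: replace non-breaking spaces with regular spaces
normalize_nbsp <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)

identical(normalize_nbsp(actual), expected)
```

If the code points do differ, normalizing the scraped text before the comparison would let the test go back to expect_equal() instead of nchar().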

Slow down the crawler

Print out URLs one by one. Add a system pause between requests. I think my IP address is getting blocked.
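
A throttled loop along these lines might be enough. This is a sketch, not the package's crawler: `polite_map()` and its arguments are hypothetical names, and the two-second default delay is an arbitrary starting point.

```r
# Hypothetical throttled loop: log each URL, fetch it, then pause.
polite_map <- function(urls, fetch, delay_secs = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    message("Fetching ", i, "/", length(urls), ": ", urls[[i]])
    results[[i]] <- fetch(urls[[i]])
    # Pause between requests, but not after the last one
    if (i < length(urls)) Sys.sleep(delay_secs)
  }
  names(results) <- urls
  results
}
```

Here `fetch` would be whatever function reads a single talk, so the delay can be tuned without touching the scraping logic.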
