bryanwhiting / generalconference
Scrape General Conference Talks using R
Home Page: https://bryanwhiting.github.io/generalconference
License: Other
These tests rely on rv_docs. We could speed up testing by reading in URLs and defining "cases".
Attempting to install via devtools::install_github("bryanwhiting/generalconference") yielded the following error:
checking DESCRIPTION meta-information
Malformed Authors@R field:
<text>:4:66: unexpected '<'
3: comment = c(URL = "<https://www.bryanwhiting.com>")),
4: person("Gospel Analysis", role = c("fnd"), comment = c(URL = <
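The parser stops at an unquoted `<` after `URL =` on line 4 of the field, so the likely fix is to make that URL a quoted string. Below is a minimal sketch of a well-formed Authors@R value. The given/family split, the roles of the first person, the email, and the second URL are placeholders (assumptions), not the package's real metadata.

```r
# Sketch of a well-formed Authors@R value.
# Assumptions: roles "aut"/"cre", the email, and the second URL are placeholders.
authors <- c(
  person("Bryan", "Whiting",
         role = c("aut", "cre"),
         email = "author@example.com",  # placeholder
         comment = c(URL = "https://www.bryanwhiting.com")),
  person("Gospel Analysis",
         role = c("fnd"),
         comment = c(URL = "https://example.org"))  # placeholder, must be quoted
)

# Evaluating the expression in an R session catches quoting mistakes
# before install_github() ever sees the DESCRIPTION file.
print(authors)
```

Pasting the expression after `Authors@R:` in DESCRIPTION and evaluating it locally first is a cheap way to confirm that it parses.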
Pull a few years of Spanish data and save them to a different location. Use spa instead of eng.
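A minimal sketch of how the language code could drive both the URL and the save location. Both helpers, `talk_url()` and `session_file()`, are hypothetical and not part of the package. The per-language directory layout is an assumption as well.

```r
# Hypothetical helpers, not part of the package.

# Build a talk URL for a given language code ("eng", "spa", ...).
talk_url <- function(path, lang = "eng") {
  paste0("https://www.churchofjesuschrist.org", path, "?lang=", lang)
}

# Assumed layout: one directory per language, so Spanish data never
# overwrites the English files.
session_file <- function(year, month, lang = "eng", root = "data") {
  file.path(root, lang, "sessions", sprintf("%d%02d.rds", year, month))
}

talk_url("/study/general-conference/2021/04/11nelson", lang = "spa")
session_file(2021, 4, lang = "spa")
```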
#' extract_metadata <- function(rv_doc) {
#'   #' Extract title, author, and kicker from a url and return as a row in a
#'   #' dataframe.
#'
#'   # the .body-block contains the speech text. But the #p anchors
#'   # can be wrong.
#'   # returns list of p1, p2, p3... for new talks and p2, p3, p4 for old talks
#'   p_bodies <- rv_doc %>%
#'     html_elements(".body-block p") %>%
#'     html_attr("id")
#'
#'   # Explaining !("p1" %in% p_bodies):
#'   # Sometimes, the first paragraph isn't p1
#'   # e.g., "https://www.churchofjesuschrist.org/study/general-conference/2019/04/27homer"
#'   # First paragraph is "p20", then 2nd is "p1".
#'   if ("p1" %in% p_bodies) {
#'     # In new talks, #p1 is the paragraph text
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1")
#'     map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       return()
#'   } else if (p_bodies[1] == "p5") {
#'     # Some talks start at p5, and p1-4 are the title, author, author and kicker
#'     # url <- "https://www.churchofjesuschrist.org/study/liahona/2020/11/15cook?lang=eng"
#'     # rv_doc <- read_html(url)
#'     elements <- c("#p1", "#p2", "#p3", "#p4")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), "")) %>%
#'       rename(title1 = p1,
#'              author1 = p2,
#'              author2 = p3,
#'              kicker1 = p4)
#'   } else {
#'     # In older talks, #p1 is the author block
#'     elements <- c("#title1", "#author1", "#author2", "#kicker1", "#p1")
#'     df <- map_dfc(elements, ~ extract_element(rv_doc = rv_doc, element = .)) %>%
#'       rename_all(~ str_replace(., fixed("#"), ""))
#'
#'     if (is.na(df$author1)) {
#'       df$author1 <- df$p1
#'     } else {
#'       url <- extract_url_from_rv_doc(rv_doc)
#'       message(
#'         "#p1 not in .body-block p: ", url,
#'         "\nPulled #p1 for metadata but author1 is not null."
#'       )
#'     }
#'     df %>%
#'       select(-p1) %>%
#'       return()
#'   }
#' }
Debug and create unit tests:
When building pkgdown, I get the error above.
Todo:
CONTRIBUTING.md
set up github actions
add coveralls
build hex
get to point of install_github()
figure out imports
Move data saver into the DATASET.R file, and all the functions should only return a dataset.
Figure out how to save individual datasets.
Too many vignettes.
https://www.churchofjesuschrist.org/study/general-conference/2020/10/45andersen.p5?lang=eng#p5
In the data, this paragraph is truncated to just the text before the quote.
The same problem occurs here:
https://www.churchofjesuschrist.org/study/general-conference/2020/04/26renlund.p21?lang=eng#p21
I'm guessing this is due to the <sup> references being parsed.
A better approach would be to scrape the raw content and process it afterward. That way I don't have to rescrape all the content.
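A minimal sketch of the "store raw, parse later" idea, assuming the truncation really does come from the `<sup>` footnote markers, which is unconfirmed. `parse_paragraphs()` is a hypothetical helper, and the HTML snippet is a toy stand-in for a saved talk page.

```r
library(xml2)
library(rvest)

# Hypothetical helper: parse stored raw HTML (a file path or a string)
# and return one row per body paragraph.
parse_paragraphs <- function(x) {
  doc <- read_html(x)
  # Drop footnote markers first so they cannot split or truncate the text.
  xml_remove(xml_find_all(doc, "//sup"))
  nodes <- html_elements(doc, ".body-block p")
  data.frame(
    id = html_attr(nodes, "id"),
    text = html_text2(nodes),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for a page that was scraped once and saved to disk.
raw <- '<div class="body-block"><p id="p5">He said, <q>come</q><sup>1</sup> and stay.</p></div>'
raw_file <- tempfile(fileext = ".html")
writeLines(raw, raw_file)

# Re-parsing from the saved file needs no network access.
paragraphs <- parse_paragraphs(raw_file)
print(paragraphs)
```

With the raw files kept on disk, a parsing fix like this can be re-run over every talk without hitting the website again.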
A Google Analytics user ID if you want to track the people who are using your site
template:
  params:
    ganalytics: UA-000000-01
home:
  strip_header: true
See pkgdown site on build_home().
file's not updating appropriately. whatever...
Causes problems with install_github('bryanwhiting/generalconference')
If footnote text and links can be scraped, it would be great to add those as rows to the talk.
Then we can add a flag for is_footnote.
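A minimal sketch of what the combined table could look like. The column names and the toy rows are assumptions for illustration, not the package's actual schema or real talk content.

```r
# Toy data; column names are assumptions, not the package's schema.
paragraphs <- data.frame(
  id = c("p1", "p2"),
  text = c("First paragraph.", "Second paragraph."),
  stringsAsFactors = FALSE
)
footnotes <- data.frame(
  id = "note1",
  text = "Toy footnote text.",
  stringsAsFactors = FALSE
)

# Flag each source before stacking so the rows stay distinguishable.
paragraphs$is_footnote <- FALSE
footnotes$is_footnote <- TRUE
talk <- rbind(paragraphs, footnotes)

# Body text only, when footnotes are not wanted:
body_only <- talk[!talk$is_footnote, ]
```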
scrape_conference_talks(year = year, month = month, path = path)
#p1 not in .body-block p: https://www.churchofjesuschrist.org/study/general-conference/2021/04/14newman?lang=eng
Pulled #p1 for metadata but author1 is not null.
The following urls were skipped:
/study/general-conference/2021/04/11nelson
/study/general-conference/2021/04/12uchtdorf
/study/general-conference/2021/04/13jones
/study/general-conference/2021/04/14newman
/study/general-conference/2021/04/15stevenson
/study/general-conference/2021/04/16gong
/study/general-conference/2021/04/17eyring
/study/general-conference/2021/04/21oaks
/study/general-conference/2021/04/22larson
/study/general-conference/2021/04/23holland
/study/general-conference/2021/04/24becerra
/study/general-conference/2021/04/25renlund
/study/general-conference/2021/04/26andersen
/study/general-conference/2021/04/27mutombo
/study/general-conference/2021/04/28ballard
/study/general-conference/2021/04/31cook
/study/general-conference/2021/04/32corbitt
/study/general-conference/2021/04/33nielsen
/study/general-conference/2021/04/34eyring
/study/general-conference/2021/04/35oaks
/study/general-conference/2021/04/36nelson
/study/general-conference/2021/04/41soares
/study/general-conference/2021/04/42aburto
/study/general-conference/2021/04/43palmer
/study/general-conference/2021/04/44dube
/study/general-conference/2021/04/45teixeira
/study/general-conference/2021/04/46wakolo
/study/general-conference/2021/04/47wong
/study/general-conference/2021/04/48teh
/study/general-conference/2021/04/49nelson
/study/general-conference/2021/04/51oaks
/study/general-conference/2021/04/52rasband
/study/general-conference/2021/04/53dyches
/study/general-conference/2021/04/54christofferson
/study/general-conference/2021/04/55walker
/study/general-conference/2021/04/56bednar
/study/general-conference/2021/04/57nelson
Saved out to:data/sessions/202104.rds
This unit test fails for all authors. Switched to nchar() for the time being.
> expect_equal(ans$author1, "Presented by President Henry B. Eyring")
Error: ans$author1 (`actual`) not equal to "Presented by President Henry B. Eyring" (`expected`).
`actual`: "Presented by President Henry B. Eyring"
`expected`: "Presented by President Henry B. Eyring"
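Two strings that print identically but compare unequal often differ in an invisible character. One possible cause here, which is an assumption and not confirmed by the output above, is a no-break space (U+00A0) in the scraped name. A small sketch of how to check for that and normalize it, where `normalize_spaces()` is a hypothetical helper:

```r
# Assumed reproduction: the scraped value contains a no-break space.
actual <- "Presented by President Henry\u00a0B. Eyring"
expected <- "Presented by President Henry B. Eyring"

same_before <- identical(actual, expected)

# Code points present in `actual` but not in `expected` expose the culprit.
odd_points <- setdiff(utf8ToInt(actual), utf8ToInt(expected))

# Hypothetical helper: replace no-break spaces with ordinary spaces.
normalize_spaces <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)
same_after <- identical(normalize_spaces(actual), expected)

print(list(same_before = same_before, odd_points = odd_points, same_after = same_after))
```

If `odd_points` comes back empty on the real data, the difference lies elsewhere (for example in encoding marks), and comparing `charToRaw()` of both strings would be the next thing to try.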
Print out URLs one by one and add a system pause. I think the IP address is getting blocked.
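A minimal sketch of such a loop, assuming a per-URL scraping function is passed in. `scrape_politely()` and its `scrape_one` argument are hypothetical names, and the one-second default delay is an arbitrary choice.

```r
# Hypothetical wrapper: logs each URL, pauses between requests, and keeps
# going when a single URL fails.
scrape_politely <- function(urls, scrape_one, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    message("Scraping ", i, "/", length(urls), ": ", urls[[i]])
    value <- tryCatch(
      scrape_one(urls[[i]]),
      error = function(e) {
        message("  skipped: ", conditionMessage(e))
        NULL
      }
    )
    # Assign via list() so a NULL result does not delete the slot.
    results[i] <- list(value)
    if (i < length(urls)) Sys.sleep(delay)
  }
  names(results) <- urls
  results
}

# Usage with a stand-in scraper that fails on one URL.
demo <- scrape_politely(
  c("a", "b"),
  function(u) if (u == "b") stop("blocked") else toupper(u),
  delay = 0
)
```

Printing the URL before each request makes it easy to see exactly where the scraper stalls if the server starts refusing connections.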