An R package to help detect linkrot, which is when links on a web page break because the pages they point to have been taken down or moved.
Very much a concept. I wrote it to detect linkrot on my personal blog and it works for my needs. Feel free to contribute.
This package is only available on GitHub. Install from an R session with:
install.packages("remotes")
remotes::install_github("matt-dray/linkrot")
Pass a webpage URL to detect_rot() and get a tibble with each link on that page and its response status code (ideally we want 200).
Here’s a check on one of my older blog posts. The printout tells you the URL you’re looking at, with a period printed for each successful check.
library(linkrot)
page <- "https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/"
rot_page <- detect_rot(page)
#> Checking <https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/> ..............................
rot_page
#> # A tibble: 30 x 6
#> page link_url link_text response_code response_catego… response_success
#> <chr> <chr> <chr> <dbl> <chr> <lgl>
#> 1 https:… https://ww… R statis… 200 Success TRUE
#> 2 https:… https://en… Star Tre… 200 Success TRUE
#> 3 https:… http://www… Star Tre… 200 Success TRUE
#> 4 https:… https://gi… regex 200 Success TRUE
#> 5 https:… http://vit… tidy 200 Success TRUE
#> 6 https:… https://en… Wikipedia 200 Success TRUE
#> 7 https:… http://sel… Selector… 200 Success TRUE
#> 8 https:… https://cr… how-to v… 404 Client error FALSE
#> 9 https:… https://ww… htmlwidg… 200 Success TRUE
#> 10 https:… https://gi… ggsci 200 Success TRUE
#> # … with 20 more rows
Uh-oh, at least one is broken: it has a response_code of 404.
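To list only the broken links, you could subset on the response_success column. This is a minimal sketch in base R, assuming the column names shown in the printout above. It uses a small hand-made data frame as a stand-in for the detect_rot() output so that it runs offline; the example.com URLs and values are made up for illustration.

```r
# Hypothetical stand-in for detect_rot() output; values are illustrative only.
# Column names are copied from the printout above.
rot_page <- data.frame(
  page = "https://example.com/post/",
  link_url = c(
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c"
  ),
  link_text = c("link a", "link b", "link c"),
  response_code = c(200, 404, 200),
  response_success = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the rows where the request did not succeed
broken <- rot_page[
  !rot_page$response_success,
  c("link_url", "link_text", "response_code")
]

broken
```

With a real result you would skip the stand-in and subset rot_page directly.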
You could iterate over multiple pages with {purrr}:
pages <- c(
"https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/",
"https://www.rostrum.blog/2018/04/27/two-dogs-in-toilet-elderly-lady-involved/",
"https://www.rostrum.blog/2018/05/19/pokeballs-in-super-smash-bros/"
)
library(purrr)
rot_pages <- set_names(map(pages, detect_rot), basename(pages))
#> Checking <https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/> ..............................
#> Checking <https://www.rostrum.blog/2018/04/27/two-dogs-in-toilet-elderly-lady-involved/> ........................................
#> Checking <https://www.rostrum.blog/2018/05/19/pokeballs-in-super-smash-bros/> .....................
rot_pages
#> $`r-trek-exploring-stardates`
#> # A tibble: 30 x 6
#> page link_url link_text response_code response_catego… response_success
#> <chr> <chr> <chr> <dbl> <chr> <lgl>
#> 1 https:… https://ww… R statis… 200 Success TRUE
#> 2 https:… https://en… Star Tre… 200 Success TRUE
#> 3 https:… http://www… Star Tre… 200 Success TRUE
#> 4 https:… https://gi… regex 200 Success TRUE
#> 5 https:… http://vit… tidy 200 Success TRUE
#> 6 https:… https://en… Wikipedia 200 Success TRUE
#> 7 https:… http://sel… Selector… 200 Success TRUE
#> 8 https:… https://cr… how-to v… 404 Client error FALSE
#> 9 https:… https://ww… htmlwidg… 200 Success TRUE
#> 10 https:… https://gi… ggsci 200 Success TRUE
#> # … with 20 more rows
#>
#> $`two-dogs-in-toilet-elderly-lady-involved`
#> # A tibble: 40 x 6
#> page link_url link_text response_code response_catego… response_success
#> <chr> <chr> <chr> <dbl> <chr> <lgl>
#> 1 https:/… https://w… @mattdray 200 Success TRUE
#> 2 https:/… https://d… the Lond… 200 Success TRUE
#> 3 https:/… https://g… the sf p… 200 Success TRUE
#> 4 https:/… https://r… interact… 200 Success TRUE
#> 5 https:/… https://e… eastings… 200 Success TRUE
#> 6 https:/… https://e… latitude 200 Success TRUE
#> 7 https:/… https://e… longitude 200 Success TRUE
#> 8 https:/… https://r… leaflet 200 Success TRUE
#> 9 https:/… https://w… R 200 Success TRUE
#> 10 https:/… https://g… sf (‘sim… 200 Success TRUE
#> # … with 30 more rows
#>
#> $`pokeballs-in-super-smash-bros`
#> # A tibble: 21 x 6
#> page link_url link_text response_code response_catego… response_success
#> <chr> <chr> <chr> <dbl> <chr> <lgl>
#> 1 https:… https://en… Super Sm… 200 Success TRUE
#> 2 https:… https://en… Super Sm… 400 Client error FALSE
#> 3 https:… https://en… SSB Mele… 200 Success TRUE
#> 4 https:… https://en… SSB Braw… 200 Success TRUE
#> 5 https:… https://en… SSB ‘4’,… 200 Success TRUE
#> 6 https:… https://ww… a series… 200 Success TRUE
#> 7 https:… https://en… the Supe… 200 Success TRUE
#> 8 https:… https://en… Zelda 200 Success TRUE
#> 9 https:… https://en… EarthBou… 200 Success TRUE
#> 10 https:… https://en… the Poké… 400 Client error FALSE
#> # … with 11 more rows
Uh-oh, more broken links.
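One way to summarise a multi-page check is to count the failures per page and stack everything into a single table. The sketch below is base R and assumes a named list of data frames with the same columns as the printouts above. The list here is a hand-made stand-in so the example runs offline; the page names, URLs and values are hypothetical.

```r
# Hypothetical stand-in for the named list built with map() and set_names();
# values are illustrative only.
rot_pages <- list(
  `post-a` = data.frame(
    page = "https://example.com/post-a/",
    link_url = c("https://example.com/1", "https://example.com/2"),
    response_code = c(200, 404),
    response_success = c(TRUE, FALSE),
    stringsAsFactors = FALSE
  ),
  `post-b` = data.frame(
    page = "https://example.com/post-b/",
    link_url = c("https://example.com/3", "https://example.com/4"),
    response_code = c(400, 200),
    response_success = c(FALSE, TRUE),
    stringsAsFactors = FALSE
  )
)

# Count the broken links on each page
broken_counts <- vapply(
  rot_pages,
  function(x) sum(!x$response_success),
  integer(1)
)

# Stack the per-page results into one data frame
all_links <- do.call(rbind, rot_pages)

# Keep only the links that failed, across all pages
all_broken <- all_links[!all_links$response_success, ]

broken_counts
all_broken
```

The page column already records where each link came from, so the stacked table stays traceable after binding.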
Please note that the {linkrot} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.