Giter Site home page Giter Site logo

drumguests's Introduction

Analysis of The Drum hosts, panellists and guests

This script scrapes data on the hosts, panellists and guests of The Drum from the ABC website. If you just want to grab some tidy data, it’s currently in drum_tidy.csv. It goes back to 27 April 2018 (as at 2019-05-13).

Note: the formatted datetimes in the dt column are in UTC! You’ll need to convert them to "Australia/Sydney" before using them.

To grab the data from the ABC site yourself, run this notebook!

Let’s scrape data from the ABC website and find out how often people appear!

drum_url = 'http://www.abc.net.au/news/programs/the-drum/'
pages = 1:10
episodes_id = 'collectionId-4'

# download data
episodes =
  map_dfr(pages, function(x) {
    episode_page =
      read_html(glue('{drum_url}?page={x}')) %>%
      html_nodes(glue('#{episodes_id} article'))
    tibble(
      title       = episode_page %>% html_nodes('h3') %>% html_text(),
      description = episode_page %>% html_nodes('p') %>% html_text())
  }) %>%
  print()
#> # A tibble: 250 x 2
#>    title                    description                                    
#>    <chr>                    <chr>                                          
#>  1 "\n\n The Drum Friday M… Host: Ellen Fanning Panel: Kate Mills, David M…
#>  2 "\n\n Health Care Speci… Host: Ellen Fanning Panel: Pat Turner, Profess…
#>  3 "\n\n The Drum Wednesda… Host: Kathryn Robinson Panel: Geraldine Doogue…
#>  4 "\n\n Corangamite Speci… Host: Ellen Fanning Panel: Dr Fiona Gray, Gabr…
#>  5 "\n\n The Drum Monday M… "Host: Kathryn Robinson Panel: Kathryn Greiner…
#>  6 "\n\n The Drum Friday M… Host: Ellen Fanning Panel: Nicki Hutley, Peter…
#>  7 "\n\n The Drum Thursday… Host: Kathryn Robinson Panel: Robyn Parker, Ja…
#>  8 "\n\n The Drum Wednesda… Host: Kathryn Robinson Panel: Amanda Rose, Kat…
#>  9 "\n\n The Drum Tuesday … Host: Ellen Fanning Panel: Jenna Price, Scott …
#> 10 "\n\n The Drum Monday A… In a special episode, our panel of Indigenous …
#> # … with 240 more rows

Okay, let’s tidy it up and get the good bits out (regex makes me cry).

episodes %<>%
  # format the date
  mutate(
    ep_date = str_replace_all(title, c("\n\n The Drum " = "", " \n" = "", "- " = "", "\\s$" = "")),
    dt = parse_date_time(ep_date, orders = "A, B d", tz = "Australia/Sydney")) %>%
  # isolate the host and people
  mutate(
    host = str_extract(description, regex("(?<=Host: )(.*)(?= Panel:)",
      dotall = TRUE)),
    panel = str_extract(description,
      regex(paste0("(?<=Panel: )(.*)(?=( Guest:| Guests:| Interview with:|",
        "The panel|We have))"),
      ignore_case = TRUE, dotall = TRUE))) %>%
  # separate guest and/or interviewees...
  mutate(
    guest = str_extract(panel, regex("(?<=Guest: )(.*)$", dotall = TRUE, ignore_case = TRUE)),
    interviewee = str_extract(panel, regex("(?<=Interview with: )(.*)$", dotall = TRUE, ignore_case = TRUE))) %>%
  # ... and remove them from the panel
  mutate(
    panel = str_replace(panel, regex("Guest: (.*)$"), ""),
    panel = str_replace(panel, regex("Interview with: (.*)$"), ""),
    panel = str_replace(panel, "\\.$", "")) %>%
  select(ep_date, dt, host, panel, guest, interviewee) %>%
  print()
#> Warning: 17 failed to parse.
#> # A tibble: 250 x 6
#>    ep_date    dt                  host    panel         guest   interviewee
#>    <chr>      <dttm>              <chr>   <chr>         <chr>   <chr>      
#>  1 Friday May NA                  Ellen … "Kate Mills,… <NA>    <NA>       
#>  2 "\n\n Hea… NA                  Ellen … "Pat Turner,… <NA>    <NA>       
#>  3 Wednesday  NA                  Kathry… "Geraldine D… <NA>    <NA>       
#>  4 "\n\n Cor… NA                  Ellen … "Dr Fiona Gr… <NA>    <NA>       
#>  5 Monday Ma… 2019-05-06 00:00:00 Kathry… "Kathryn Gre… "John … <NA>       
#>  6 Friday Ma… 2019-05-03 00:00:00 Ellen … "Nicki Hutle… "Ece T… <NA>       
#>  7 Thursday … 2019-05-02 00:00:00 Kathry… "Robyn Parke… "Shane… <NA>       
#>  8 Wednesday… 2019-05-01 00:00:00 Kathry… "Amanda Rose… <NA>    <NA>       
#>  9 Tuesday A… 2019-04-30 00:00:00 Ellen … "Jenna Price… "Sarah… <NA>       
#> 10 Monday Ap… 2019-04-29 00:00:00 <NA>    <NA>          <NA>    <NA>       
#> # … with 240 more rows

Okay, now let’s break these names up:

episodes %<>%
  gather(key = "role", value = "name", host, panel, guest, interviewee) %>%
  separate_rows(name, sep = ", and |, | and ") %>%
  # remove any trailing spaces that snuck in
  mutate(name = str_replace_all(name, "\\s$", "")) %T>%
  write_csv('drum_tidy.csv') %T>%
  print()
#> # A tibble: 1,733 x 4
#>    ep_date                    dt                  role  name            
#>    <chr>                      <dttm>              <chr> <chr>           
#>  1 Friday May                 NA                  host  Ellen Fanning   
#>  2 "\n\n Health Care Special" NA                  host  Ellen Fanning   
#>  3 Wednesday                  NA                  host  Kathryn Robinson
#>  4 "\n\n Corangamite Special" NA                  host  Ellen Fanning   
#>  5 Monday May 6               2019-05-06 00:00:00 host  Kathryn Robinson
#>  6 Friday May 3               2019-05-03 00:00:00 host  Ellen Fanning   
#>  7 Thursday May 2             2019-05-02 00:00:00 host  Kathryn Robinson
#>  8 Wednesday May 1            2019-05-01 00:00:00 host  Kathryn Robinson
#>  9 Tuesday April 30           2019-04-30 00:00:00 host  Ellen Fanning   
#> 10 Monday April 29            2019-04-29 00:00:00 host  <NA>            
#> # … with 1,723 more rows

Nowe we can visualise. For example, here are hosts by frequency:

episodes %>%
  filter(role == "host") %>%
  group_by(name) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  drop_na(name) %T>%
  print() %>%
  {
    ggplot(., aes(x = reorder(name, n), y = n)) +
      geom_col() +
      coord_flip() +
      theme_minimal() +
      labs(
        x = 'Host',
        y = 'Number of appearances',
        title = 'The Drum hosts by appearance over the last year')
  }
#> # A tibble: 10 x 2
#>    name                  n
#>    <chr>             <int>
#>  1 Adam Spencer         21
#>  2 Craig Reucassel      12
#>  3 Dr Norman Swan        1
#>  4 Ellen Fanning        93
#>  5 Fran Kelly            1
#>  6 John Barron           6
#>  7 Julia Baird          71
#>  8 Kathryn Robinson     10
#>  9 Peter van Onselen    23
#> 10 Sarah Dingle          7

And here’s guests, panellists and interviewees:

episodes %>%
  filter(role != "host") %>%
  group_by(name, role) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  drop_na(name) %>%
  top_n(30, n) %T>%
  print() %>%
  {
    ggplot(., aes(x = reorder(name, n), y = n)) +
      geom_col() +
      coord_flip() +
      theme_minimal(base_size = 8) +
      labs(
        x = 'Host',
        y = 'Number of appearances',
        title = 'Top 30 non-host appearance on THe Drum over the last year')
  }
#> # A tibble: 31 x 3
#>    name                role      n
#>    <chr>               <chr> <int>
#>  1 Adrian Piccoli      panel     9
#>  2 Avril Henry         panel     8
#>  3 Bridie Jabour       panel     8
#>  4 Caroline Overington panel     7
#>  5 Craig Chung         panel    10
#>  6 David Marr          panel     6
#>  7 Greg Sheridan       panel     9
#>  8 Jane Caro           panel     7
#>  9 Jenna Price         panel     7
#> 10 Jennifer Hewett     panel     6
#> # … with 21 more rows

drumguests's People

Contributors

jimjam-slam avatar mdsumner avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Forkers

mdsumner

drumguests's Issues

Dates for episodes that have "special" names

Some episodes aren't listed on the site by their air date but by another name, such as "Candidates Special" or "Brisbane Special":

screenshot of episode listings on The Drum website, showing "Candidates Special" and "The Drum Wednesday April 17"

The current code scrapes the title in order to infer the date, so epsidoes with these "special" names are listed as having a missing date.

Since the initial release, post timestamps have been added to the site listings. Assuming these happen on the same day as the airing (and they look like they line up with the air time, so probably), these are likely a more reliable way to get episode date. The notebook should be updated to get the current title purely as its own column and to separately scrape the timestamp 🙂

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.