Scrapes Glassdoor company reviews in R (using rvest) and writes all reviews to a CSV, as prep for text mining.

Home Page: https://mguideng.github.io/2019-02-27-scrape-glassdoor-gdscrapeR/

Maria Guideng

rvest-scrape-glassdoor

😴 Oh hi. This repo is no longer maintained. Use the gdscrapeR package instead. 👻

About

Scrape Glassdoor.com for company reviews. Prep text data for text analytics.

Demo

Take Tesla, for example. The URL to scrape will be: https://www.glassdoor.com/Reviews/Tesla-Reviews-E43129.htm

Here's a screenshot of the text to extract:

gd-tesla

Web scraper function

Extract company reviews for the following:

  • Total reviews - by full & part-time workers only
  • Date - of when review was posted
  • Summary - e.g., "Amazing Tesla"
  • Rating - star rating between 1.0 and 5.0
  • Title - e.g., "Current Employee - Anonymous Employee"
  • Pros - upsides of the workplace
  • Cons - downsides of the workplace
  • Helpful - number of times the review was marked helpful, if any

#### SCRAPE ####
# Packages
library(rvest)    #scrape
library(purrr)    #iterate scraping by map_df()

# Set URL
baseurl <- "https://www.glassdoor.com/Reviews/"
company <- "Tesla-Reviews-E43129"
sort <- ".htm?sort.sortType=RD&sort.ascending=true"

# Total number of reviews. This determines the maximum number of result pages to iterate over.
totalreviews <- read_html(paste(baseurl, company, sort, sep="")) %>% 
  html_nodes(".margBot.minor") %>% 
  html_text() %>% 
  sub(" reviews", "", .) %>% 
  sub(",", "", .) %>% 
  as.integer()

maxresults <- as.integer(ceiling(totalreviews/10))    #10 reviews per page, round up to whole number

# Scraping function to create dataframe of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- map_df(1:maxresults, function(i) {
  
  Sys.sleep(sample(seq(1, 5, by=0.01), 1))    #be a polite bot. ~12 mins to run with this system sleeper
  
  cat("boom! ")   #progress indicator
  
  pg <- read_html(paste(baseurl, company, "_P", i, sort, sep=""))   #pagination (_P1 to _P163)
  
  data.frame(rev.date = html_text(html_nodes(pg, ".date.subtle.small, .featuredFlag")),
             rev.sum = html_text(html_nodes(pg, ".reviewLink .summary:not([class*='hidden'])")),
             rev.rating = html_attr(html_nodes(pg, ".gdStars.gdRatings.sm .rating .value-title"), "title"),
             rev.title = html_text(html_nodes(pg, "#ReviewsFeed .hideHH")),
             rev.pros = html_text(html_nodes(pg, "#ReviewsFeed .pros:not([class*='hidden'])")),
             rev.cons = html_text(html_nodes(pg, "#ReviewsFeed .cons:not([class*='hidden'])")),
             rev.helpf = html_text(html_nodes(pg, ".tight")),
             stringsAsFactors = FALSE)
})
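
The pagination arithmetic above can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical review count of 1627 (the real value comes from the scraped page); it reuses the `baseurl`, `company`, and `sort` values from the scraper and builds one URL per results page.

```r
# Hypothetical review count; the scraper reads the real value from the page.
totalreviews <- 1627L

baseurl <- "https://www.glassdoor.com/Reviews/"
company <- "Tesla-Reviews-E43129"
sort    <- ".htm?sort.sortType=RD&sort.ascending=true"

# 10 reviews per page, so round up to cover a partial last page.
maxresults <- as.integer(ceiling(totalreviews / 10))

# One URL per results page: _P1, _P2, ..., _P<maxresults>.
page_urls <- paste0(baseurl, company, "_P", seq_len(maxresults), sort)

maxresults     # 163
page_urls[1]   # first page, suffix "_P1"
```

With 1627 reviews the last page holds only 7 of them, which is why `ceiling()` is used rather than integer division.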

RegEx

Use regular expressions to clean and extract additional variables:

  • Reviewer ID (1 to N reviewers by date, sorted from first to last)
  • Year (from Date)
  • Location (e.g., Palo Alto, CA)
  • Position (e.g., Project Manager)
  • Status (current or former employee)

#### REGEX ####
# Packages
library(stringr)    #pattern matching functions

# Clean: Helpful
df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))

# Add: ID
df$rev.id <- as.numeric(rownames(df))

# Extract: Year, Position, Location, Status
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))

df$rev.pos <- sub(".* Employee - ", "", df$rev.title)
df$rev.pos <- sub(" in .*", "", df$rev.pos)

df$rev.loc <- sub(".*\\ in ", "", df$rev.title)
df$rev.loc <- ifelse(df$rev.loc %in%
                       (grep("Former Employee|Current Employee", df$rev.loc, value = TRUE)),
                     "Not Given", df$rev.loc)

df$rev.stat <- str_extract(df$rev.title, ".* Employee -")
df$rev.stat <- sub(" Employee -", "", df$rev.stat)
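
To see what these patterns do, here is a small worked example on two made-up rows. The sample values (including the "Helpful (12)" text) are assumptions about the shape of the scraped fields, not real reviews. To stay in base R, the sketch uses `grepl()` in place of the `%in%`/`grep()` test and a single `sub()` with a capture group in place of `str_extract()`.

```r
# Hypothetical sample rows shaped like the scraped fields (not real reviews).
df <- data.frame(
  rev.date  = c("Feb 27, 2019", "Jan 3, 2018"),
  rev.title = c("Current Employee - Project Manager in Palo Alto, CA",
                "Former Employee - Anonymous Employee"),
  rev.helpf = c("Helpful (12)", ""),
  stringsAsFactors = FALSE
)

# Helpful: drop every non-digit, then convert ("" becomes NA).
df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))

# Year: keep what follows the last ", " in the date.
df$rev.year <- as.numeric(sub(".*, ", "", df$rev.date))

# Position: strip the status prefix, then any trailing " in <location>".
df$rev.pos <- sub(" in .*", "", sub(".* Employee - ", "", df$rev.title))

# Location: text after " in "; titles without one are left whole and flagged.
df$rev.loc <- sub(".* in ", "", df$rev.title)
df$rev.loc <- ifelse(grepl("Former Employee|Current Employee", df$rev.loc),
                     "Not Given", df$rev.loc)

# Status: base-R stand-in for str_extract() followed by sub().
df$rev.stat <- sub("^(.*) Employee -.*$", "\\1", df$rev.title)

df[, c("rev.year", "rev.pos", "rev.loc", "rev.stat", "rev.helpf")]
```

The first row yields position "Project Manager", location "Palo Alto, CA", and status "Current". The second has no " in " part, so its location becomes "Not Given" and its helpful count is `NA`.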

Output

df-tesla

#### EXPORT ####
write.csv(df, "rvest-scrape-glassdoor-output.csv")  #to csv
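
One thing to be aware of when reading the file back: by default `write.csv()` also writes the row names as an unnamed first column, which duplicates `rev.id`. A minimal sketch on a made-up two-row data frame, written to a temporary file (the second summary is a hypothetical value):

```r
# Tiny stand-in for the scraped data frame.
df <- data.frame(rev.id = 1:2,
                 rev.sum = c("Amazing Tesla", "Great place"),
                 stringsAsFactors = FALSE)

out <- tempfile(fileext = ".csv")

# Default: row names are written as an extra, unnamed first column.
write.csv(df, out)
with_rownames <- names(read.csv(out))      # "X" "rev.id" "rev.sum"

# row.names = FALSE writes only the data frame's own columns.
write.csv(df, out, row.names = FALSE)
without_rownames <- names(read.csv(out))   # "rev.id" "rev.sum"

unlink(out)
```

Whether to pass `row.names = FALSE` is a matter of preference; it only matters if the extra `X` column gets in the way downstream.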

Notes

rvest-scrape-glassdoor's Issues

Trouble Getting Started

I hope you are doing well! I'm working on a project where I'm hoping to rely on Glassdoor data, so I came across gdscrapeR on GitHub. As a relative novice in R, I was wondering whether you thought I would have sufficient R knowledge to utilize gdscrapeR and build out a dataset. I basically tried copy-and-pasting the code into R and tried running it on my own computer but encountered some error messages, so I was wondering if I was missing a step/if just copy-pasting and running is insufficient to get things going. Thanks!

Script no longer works

Hi, I just happened to be using this and realized the scripts no longer work. Wondering if it is due to R package updates or Glassdoor closing their API. Thanks
