Maria Guideng
Oh hi. This repo is no longer maintained. Go here for the gdscrapeR package instead.
Scrape Glassdoor.com for company reviews. Prep text data for text analytics.
Take Tesla, for example. The URL to scrape will be: https://www.glassdoor.com/Reviews/Tesla-Reviews-E43129.htm
Here's a screenshot of the text to extract:
Web scraper function
Extract company reviews for the following:
- Total reviews - count of reviews by full- and part-time workers only
- Date - when the review was posted
- Summary - e.g., "Amazing Tesla"
- Rating - star rating between 1.0 and 5.0
- Title - e.g., "Current Employee - Anonymous Employee"
- Pros - upsides of the workplace
- Cons - downsides of the workplace
- Helpful - number of users who marked the review as helpful, if any
```r
#### SCRAPE ####
# Packages
library(rvest) #scrape
library(purrr) #iterate scraping by map_df()
# Set URL
baseurl <- "https://www.glassdoor.com/Reviews/"
company <- "Tesla-Reviews-E43129"
sort <- ".htm?sort.sortType=RD&sort.ascending=true"
# How many total reviews? This determines the maximum number of result pages to iterate over.
totalreviews <- read_html(paste(baseurl, company, sort, sep="")) %>%
  html_nodes(".margBot.minor") %>%
  html_text() %>%
  sub(" reviews", "", .) %>%
  gsub(",", "", .) %>% #strip thousands separators
  as.integer()
maxresults <- as.integer(ceiling(totalreviews/10)) #10 reviews per page, round up to whole number
# Scraping function to create dataframe of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- map_df(1:maxresults, function(i) {
  Sys.sleep(sample(seq(1, 5, by=0.01), 1)) #be a polite bot. ~12 mins to run with this system sleeper
  cat("boom! ") #progress indicator
  pg <- read_html(paste(baseurl, company, "_P", i, sort, sep="")) #pagination (_P1 to _P163)
  data.frame(rev.date   = html_text(html_nodes(pg, ".date.subtle.small, .featuredFlag")),
             rev.sum    = html_text(html_nodes(pg, ".reviewLink .summary:not([class*='hidden'])")),
             rev.rating = html_attr(html_nodes(pg, ".gdStars.gdRatings.sm .rating .value-title"), "title"),
             rev.title  = html_text(html_nodes(pg, "#ReviewsFeed .hideHH")),
             rev.pros   = html_text(html_nodes(pg, "#ReviewsFeed .pros:not([class*='hidden'])")),
             rev.cons   = html_text(html_nodes(pg, "#ReviewsFeed .cons:not([class*='hidden'])")),
             rev.helpf  = html_text(html_nodes(pg, ".tight")),
             stringsAsFactors = FALSE)
})
```
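A multi-hour scrape can die on a single failed request. As a defensive variation on the loop above (not part of the original script; `read_page_safely` is a made-up helper name), each page read can be wrapped in `tryCatch` so an errored page is skipped instead of aborting the run:

```r
# Hypothetical fail-soft variation: a page that errors is skipped, not fatal.
read_page_safely <- function(url) {
  tryCatch(read_html(url), error = function(e) {
    message("failed to read: ", url) #log and move on
    NULL
  })
}

df <- map_df(1:maxresults, function(i) {
  Sys.sleep(sample(seq(1, 5, by=0.01), 1)) #same polite delay as above
  pg <- read_page_safely(paste(baseurl, company, "_P", i, sort, sep=""))
  if (is.null(pg)) return(data.frame()) #contributes zero rows for failed pages
  data.frame(rev.date = html_text(html_nodes(pg, ".date.subtle.small, .featuredFlag")),
             #...remaining fields exactly as in the scraper above...
             stringsAsFactors = FALSE)
})
```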
RegEx
Use regular expressions to clean the text and extract additional variables:
- Reviewer ID (1 to N reviewers by date, sorted from first to last)
- Year (from Date)
- Location (e.g., Palo Alto, CA)
- Position (e.g., Project Manager)
- Status (current or former employee)
```r
#### REGEX ####
# Packages
library(stringr) #pattern matching functions
# Clean: Helpful
df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))
# Add: ID
df$rev.id <- as.numeric(rownames(df))
# Extract: Year, Position, Location, Status
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))
df$rev.pos <- sub(".* Employee - ", "", df$rev.title)
df$rev.pos <- sub(" in .*", "", df$rev.pos)
df$rev.loc <- sub(".* in ", "", df$rev.title)
df$rev.loc <- ifelse(grepl("Former Employee|Current Employee", df$rev.loc),
                     "Not Given", df$rev.loc) #titles without a location keep the status text, so flag them
df$rev.stat <- str_extract(df$rev.title, ".* Employee -")
df$rev.stat <- sub(" Employee -", "", df$rev.stat)
```
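As a sanity check on those patterns, here's what each step returns for a made-up title string in the format Glassdoor used at the time:

```r
# Illustrative only; not run against live data.
title <- "Current Employee - Project Manager in Palo Alto, CA"

sub(".* Employee - ", "", title)                            #> "Project Manager in Palo Alto, CA"
sub(" in .*", "", sub(".* Employee - ", "", title))         #> "Project Manager"
sub(".* in ", "", title)                                    #> "Palo Alto, CA"
sub(" Employee -", "", str_extract(title, ".* Employee -")) #> "Current"
```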
Output
```r
#### EXPORT ####
write.csv(df, "rvest-scrape-glassdoor-output.csv") #to csv
```
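Optionally (not in the original script), an .rds copy can be saved alongside the CSV so the data frame reloads with column types intact:

```r
saveRDS(df, "rvest-scrape-glassdoor-output.rds") #reload later with readRDS()
```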
Notes
- Write-up of demo: "Scrape Glassdoor Company Reviews in R Using the gdscrapeR Package".
- Updated 9/23/2018 to add star "rating" field. Initial commit 8/1/2018.