
ensembles4612 / analysis_and_modeling_on_data_science_jobs_canada


I did exploratory data analysis on scraped job listings from Glassdoor, and built a salary prediction model deployed as a client-facing forecast tool.

Jupyter Notebook 99.39% Python 0.61%
salary-prediction data-science web-scraping lasso-regression random-forest svm model-deployment mahcine-learning data-cleaning


Analysis of Data Science Jobs in Canada with a Salary Prediction Flask API Deployed on Heroku

Table of Contents

  1. Project Highlights
  2. References
  3. Web Scraping
  4. Data Cleaning
  5. Exploratory Data Analysis
  6. Model Building
  7. Model Performance
  8. Productionization

Project Highlights

  • Scraped job listings for 3 target positions from Glassdoor over a one-month period using Python and Selenium
  • Cleaned, visualized and analyzed the job listing data in a variety of ways using matplotlib, seaborn, wordcloud, etc.
  • Built salary prediction models for various data analysis positions in Canada using Multiple Linear Regression, Lasso, Random Forest and SVM
  • Fine-tuned Lasso, Random Forest and SVM using GridSearchCV to find the best model (MAE ~ $16k)
  • Deployed the SVM model (deployment repo) as a client-facing salary prediction tool on this website using Flask and Heroku

References

Language: Python 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, wordcloud, nltk.corpus, nltk.tokenize, missingno, dython.nominal, joblib, selenium, flask
Project inspired by: https://github.com/PlayingNumbers/ds_salary_proj
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2

Web Scraping

I adapted the Selenium web scraper to collect full-time job postings from glassdoor.ca for 3 target positions, "data scientist", "data analyst" and "statistician", over a one-month period (2020-05-09 to 2020-06-09). See code here.

For each job, I obtained the following: Job Title, Salary Estimate, Job Description, Rating, Company Name, Location, Company Headquarters, Company Size, Company Founded Date, Type of Ownership, Industry, Sector, Revenue, Competitors. Other positions also appeared in the search results for the 3 target positions, so I included the following in the dataset for analysis and modeling as well: data engineer, machine learning engineer, research scientist, business intelligence analyst, manager analytics and actuarial analyst.

Data Cleaning

After scraping, I cleaned the data (see code here) before building the salary prediction models. The cleaning steps included:

  • Combined the 3 datasets and deleted duplicate rows
  • Parsed numeric data out of Salary
  • Removed rows with NAs in Salary and created the column Average Salary
  • Removed the unwanted Rating from the Company Name text
  • Transformed Founded date into Company Age
  • Mapped similar scraped job titles to one simplified job title for each position
  • Created the column Seniority from Job Title
  • Created the column Job Description Length from Job Description
  • Created indicator columns for whether each skill was listed in Job Description: Python, RStudio, Excel, AWS, SAS, PyTorch, SQL
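A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook code: the salary string format, the seniority keywords and the skill patterns are all assumptions, and the function names are hypothetical.

```python
import re

# Hypothetical patterns: the README names these skills but not how they were matched.
SKILL_PATTERNS = {
    "python": r"\bpython\b",
    "rstudio": r"\br[\s-]?studio\b",
    "excel": r"\bexcel\b",
    "aws": r"\baws\b",
    "sas": r"\bsas\b",
    "pytorch": r"\bpytorch\b",
    "sql": r"\bsql\b",
}


def parse_salary(text):
    """Return (low, high, average) in dollars from a string such as
    'CA$60K-CA$80K (Glassdoor est.)' (assumed format), or None if no figure is found."""
    amounts = [int(n) * 1000 for n in re.findall(r"(\d+)\s*[kK]", text)]
    if not amounts:
        return None
    low, high = amounts[0], amounts[-1]
    return low, high, (low + high) / 2


def seniority(title):
    """Map a job title to a coarse seniority level using assumed keywords."""
    t = title.lower()
    if re.search(r"\b(senior|sr|lead|principal)\b", t):
        return "senior"
    if re.search(r"\b(junior|jr)\b", t):
        return "junior"
    if re.search(r"\bintern(ship)?\b", t):
        return "intern"
    return "unspecified"


def skill_flags(description):
    """Return a 0/1 flag per skill, 1 if the skill is mentioned in the description."""
    return {
        skill: int(bool(re.search(pattern, description, flags=re.IGNORECASE)))
        for skill, pattern in SKILL_PATTERNS.items()
    }


salary = parse_salary("CA$60K-CA$80K (Glassdoor est.)")
level = seniority("Sr. Data Analyst")
flags = skill_flags("Experience with Python, SQL and Excel is required.")
```

Word boundaries in the patterns keep "excellent" from counting as Excel and "international" from counting as an intern role.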

For missing values in predictor variables, I did the following:

  • Checked whether the NAs in each predictor variable were randomly distributed, and deleted the variables where they were not
  • Deleted variables that had too many NAs or too many levels

Exploratory Data Analysis

After cleaning the scraped data, I did some brief data visualization and analysis. Below are some of the plots and findings from the one month of Glassdoor data. See code here.

  • On average, 1, 16 and 12 new full-time jobs were posted per day for statistician, data analyst and data scientist respectively:

alt text

  • Most postings did not specify seniority. Among those that did, senior positions were in around 10 times more demand than junior and intern positions combined:

alt text

  • The boxplot of salary by seniority is most informative for data analyst, since this position had enough data. For data analyst, the median salary rose by only approx. 1k from intern to junior, but by more than 10k from junior to senior:

alt text

  • The graph below shows the percentage of job descriptions that listed each skill, by position.
    • Excel was most in demand for actuarial analyst and data analyst, listed in more than 70% of job descriptions for both positions. It was least in demand for machine learning engineer (less than 30%)
    • Python was most in demand for data engineer (approx. 85%), followed by data scientist and machine learning engineer (approx. 80%)
    • R/RStudio was listed for data analyst, data engineer and data scientist (all under 10%)
    • SAS was most in demand for statistician (almost 80%), then actuarial analyst (less than 60%)
    • SQL was most in demand for data engineer (approx. 85%), then data analyst and business intelligence analyst (approx. 65%)

alt text

  • Top 10 companies that posted the most jobs, with their company ratings:

  • Total number of jobs posted by sector and by job title, with salary distribution:

  • Total number of jobs posted by location and by job title, with salary distribution:

  • Word clouds of job descriptions for Data Analyst (left), Statistician (center) and Data Scientist (right):

Model Building

I did the following before building models:

  • Train/test split: I split the data into a training set (80%) and a test set (20%).
  • Choosing predictor variables from the correlation matrix: to avoid bias (keeping the test data unseen when relating predictors to the response variable, average salary), I made the correlation heatmap using only the training data. Based on the heatmap, I included the following variables in the model: Job title, Seniority, Location, Company Size, Type of Ownership, python, rstudio, sql, aws, pytorch, sas, excel. The heatmap is shown below:

alt text

  • Transforming the categorical variables into dummy variables.
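As a rough sketch of the split and dummy-variable steps, assuming pandas and scikit-learn: the toy rows, the column names and the `random_state` value are made up for illustration and are not taken from the project notebooks.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows standing in for the cleaned job listings (values are made up).
df = pd.DataFrame({
    "job_title": ["data analyst", "data scientist", "statistician",
                  "data analyst", "data engineer"] * 4,
    "seniority": ["junior", "senior", "unspecified", "senior", "unspecified"] * 4,
    "python": [0, 1, 0, 1, 1] * 4,
    "avg_salary": [55000, 95000, 70000, 80000, 90000] * 4,
})

X = df.drop(columns="avg_salary")
y = df["avg_salary"]

# 80/20 split first, so nothing derived from the test rows influences later choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Encode the categorical columns as dummy variables on the training set.
categorical = ["job_title", "seniority"]
X_train_enc = pd.get_dummies(X_train, columns=categorical)

# Align the test set to the training columns; categories missing from the
# test rows become all-zero columns, and unseen ones are dropped.
X_test_enc = pd.get_dummies(X_test, columns=categorical).reindex(
    columns=X_train_enc.columns, fill_value=0
)
```

Reindexing the test set against the training columns keeps both matrices the same shape even when a category shows up in only one of the two splits.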

I tried 4 different models and evaluated them using Mean Absolute Error. They are:

  • Multiple Linear Regression: baseline model
  • Lasso Regression: the data was very sparse because of the many categorical variables, so I tried Lasso
  • Random Forest
  • Support Vector Regressor
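For reference, Mean Absolute Error is the average of the absolute differences between actual and predicted salaries. A minimal sketch, with made-up numbers and a hypothetical function name:

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted| over all samples."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative salaries only; the absolute errors here are 5000, 10000 and 10000.
actual = [60000, 80000, 100000]
predicted = [65000, 70000, 110000]
error = mae(actual, predicted)
```

Because the error is in the same unit as the salary, an MAE of 16000 means the predictions are off by about $16k on average.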

Model Performance

I fine-tuned the following 3 models using GridSearchCV with 10-fold cross-validation to find the best parameters. The test errors are:

  • Lasso Regression: MAE = 16784
  • Random Forest: MAE = 17745
  • Support Vector Regressor: MAE = 16288
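The tuning step can be sketched as follows, assuming scikit-learn. The synthetic data and the parameter grid are placeholders chosen for illustration; the grids used in the project are in the notebooks and are not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for the encoded job listings.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; scikit-learn maximizes scores, so MAE is negated.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}
search = GridSearchCV(
    SVR(),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=10,  # 10-fold cross-validation on the training set
)
search.fit(X_train, y_train)

# The refitted best estimator is evaluated once on the held-out test set.
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_, round(test_mae, 2))
```

The same pattern applies to Lasso and Random Forest by swapping the estimator and its grid.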

The graph below plots predicted vs. actual average salary on the test set for the 3 models. SVM had the lowest test error of the 3 approaches.

alt text

  • Reflection on model performance: the graph above shows that the 3 models only predicted salaries well in the range of approx. 50k to 90k. A likely reason is that most of the scraped jobs had an average salary in this range, as the salary distribution graph below shows. One way to address this is to scrape more listings with salaries outside this range, add them to the dataset and retrain the models.

alt text

Productionization

I built a Flask API endpoint around the SVM model and deployed it on Heroku as a client-facing salary prediction website here. The deployment code is in this repo. On the website, you choose values from the drop-down lists and submit them. The Flask app then takes the request with those values and returns an estimated salary, which is shown on the website as below:

alt text
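The request flow described above might look roughly like this. It is a sketch under assumptions, not the deployed code: the `/predict` route, the JSON field names and the `predict_salary` stand-in are all hypothetical, and the real app would load the fitted SVM model (for example with joblib) instead of using a fixed rule.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_salary(features):
    """Hypothetical stand-in for the trained model's predict call."""
    base = 60000
    bonus = 15000 if features.get("seniority") == "senior" else 0
    return base + bonus


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted values and return the estimate as JSON.
    features = request.get_json(force=True)
    return jsonify({"estimated_salary": predict_salary(features)})


# Exercise the endpoint with Flask's test client, without starting a server.
with app.test_client() as client:
    response = client.post(
        "/predict",
        json={"job_title": "data analyst", "seniority": "senior"},
    )
    result = response.get_json()
```

Keeping the model call behind a single function makes it easy to swap the stand-in for the real model without touching the route.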
