anouk2311 / indeed-job-listings

This repository contains the entire workflow for our Online Data Collection & Management and Data Preparation & Workflow Management group projects (group 3).

Languages: Jupyter Notebook 17.67%, Makefile 6.90%, R 75.43%
Topics: job-market, netherlands, data-analist, marketing-analist, marketeer, data-scientists

indeed-job-listings's Introduction

Are students learning the right skills for their future job?

This repository contains the entire workflow for our Online Data Collection & Management and Data Preparation & Workflow Management group projects (group 3). For our project, we scraped Indeed.com for several marketing analytics related job postings in the Netherlands.

Motivation

The job market is becoming ever more competitive, and, as a student nowadays, it is harder and harder to land the job of your dreams. COVID-19 has made it even more challenging for recent graduates to find a fitting role after graduation. Many students complain about the gap between the skills learned during their studies and the skills companies actually require in the workplace. This motivates our investigation into Indeed.com job vacancy postings: to find the skills that are actually required for different job types.

As Marketing Analytics students, we have both the marketing and analytical knowledge to succeed in a range of job types in the field of marketing. We are interested in what the actually required skills are for jobs in fields related to our studies. Becoming a data scientist, a data analist, a marketing analist or maybe a marketeer after our studies will require different skills and competencies, and we would like to know to what degree we possess these skills and to what extent our study program effectively prepares us for the job market.

We aspire to find insights that help ourselves, as well as our fellow students, decide which skills to improve and perhaps which skills to forego in order to land a first job. We designed our investigation so that the methodology can be reused by anyone, for any location in the world and any job title on Indeed.com. For our fellow students, the keyword and location analyses provide easily interpretable figures showing which skills are in demand for each role, together with the best locations for jobs in the Netherlands. The entire workflow can be tweaked to your needs by simply changing the search term in the scraper. This makes our project valuable for job seekers everywhere, as they can reproduce our data scraping and analysis for their specific job wishes.

Method and results

For this project we decided to narrow down our investigation to the Netherlands. Three of our four members originate from the Netherlands, and since we and our classmates are all studying at Tilburg University in the Netherlands right now, job options in the Netherlands are most relevant to us. Furthermore, we decided to investigate the four jobs most closely related to our master's program in Marketing Analytics: data scientist, marketeer, data analist and marketing analist (we keep the Dutch spellings, as these are the literal search terms we used on Indeed). By not including too many locations and job searches, the project workflow stays much smoother and easier to reproduce for anyone interested in doing so.

We chose a keyword frequency analysis on what we believe to be the most common technical skills as our main tool for analyzing the different job searches. These keywords include programming languages such as R and Python and knowledge of visualization programs such as Power BI and Tableau. In our experience, many of our fellow students find it difficult to bridge the gap between academic knowledge and the actual skills required in the job market. By comparing this core set of technical skills per job title, we can provide insights into the skills actually valued by employers. This will help students make the right decision on how to spend their precious time and prepare themselves optimally for the beginning of their careers.

First, we built a web scraper that collects the vital information of each job posting returned by a specific job search. We used the BeautifulSoup package to collect job ids, job titles, company names, dates, job summaries and salaries (if available). Afterwards, we scraped the job descriptions of the same search results into a separate dataset, using a Chrome webdriver with Python Selenium. Due to restrictions with captchas, we collected multiple small batches of data per search term.

In R we merged the batches into one file per search term by joining the datasets on the unique job id that identifies each separate job advert on Indeed. After merging the files into one dataset per search term, we cleaned the data by removing duplicate entries and by tidying messy location strings. Location names such as Amsterdam-Zuid or Velsen-Noord were unwanted because they would appear as distinct locations in our analysis, so we wrote a function that removes such unnecessary suffixes from the location strings. We added a function that replaces unspecified locations such as Nederland or Randstad with Online, signifying that a job did not specify a location.
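The project implements this cleaning step in R; as a rough illustration of the same logic, here is a hypothetical Python sketch (function and set names are ours, not from the repository):

```python
import re

# Unspecified locations that should be mapped to "Online" (assumed list).
UNSPECIFIED = {"Nederland", "Randstad"}

def clean_location(raw):
    """Trim compass-point suffixes (e.g. 'Amsterdam-Zuid' -> 'Amsterdam')
    and map unspecified locations to 'Online'."""
    loc = raw.strip()
    if loc in UNSPECIFIED:
        return "Online"
    # Strip only a trailing '-Noord', '-Zuid', '-Oost' or '-West',
    # so names like 'Noordwijk' or 'Noord-Holland' stay intact.
    return re.sub(r"-(Noord|Zuid|Oost|West)$", "", loc, flags=re.IGNORECASE)
```

Anchoring the pattern to the end of the string avoids mangling city names that merely contain a compass word.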

The final step in the cleaning process was the salary data, which was quite sparse and also very messy. Roughly three-quarters of it was missing, and the available values used different measures: some jobs listed hourly rates, whereas others used monthly or yearly income. We wrote a function that strips unnecessary character strings from the salary data and converts every measure to yearly income as a standard. For hourly rates we assumed a 40-hour work week and multiplied the hourly rate by 40 * 4 * 12 to get a yearly income; for monthly income, we simply multiplied by 12. Quite a few jobs gave a salary range instead of a fixed number; for those, we took the middle of the range. We kept the observations with missing salary values in the dataset for the first part of the analysis, because dropping them would significantly reduce the number of observations.
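The conversion rules above (the project's version is an R function) can be sketched in Python as follows; the function names are ours:

```python
def to_yearly(amount, unit):
    """Convert an hourly or monthly salary figure to yearly income,
    using the 40-hour week / 4-week month / 12-month year approximation."""
    if unit == "hour":
        return amount * 40 * 4 * 12
    if unit == "month":
        return amount * 12
    return amount  # already a yearly figure

def midpoint(low, high):
    """For a salary range, take the middle of the range."""
    return (low + high) / 2
```

Note that 40 * 4 * 12 = 1920 working hours per year, a deliberate simplification rather than an exact calendar count.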

After cleaning the data we could start with the analysis. The most important part was the keyword analysis, in which we counted the occurrences of each skill in the job advert descriptions. Since some adverts mentioned the same word two or more times, we also computed the percentage of job ads in which a keyword occurred. These two measures produced very similar results in the end, but we decided to keep both. The final part of the keyword analysis combined the four separate keyword-percentage plots into one single plot, showing the occurrences of each skill in each job side by side for easy comparison.
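A minimal Python sketch of the two measures, assuming an illustrative keyword list (the repository's actual list and R code may differ); word boundaries keep a one-letter skill like "R" from matching inside other words:

```python
import re

KEYWORDS = ["SQL", "Python", "R", "Excel", "Tableau"]  # illustrative subset

def keyword_stats(descriptions, keywords=KEYWORDS):
    """Return raw occurrence counts and the share of ads mentioning each skill."""
    stats = {}
    for kw in keywords:
        pattern = re.compile(r"\b%s\b" % re.escape(kw), re.IGNORECASE)
        counts = [len(pattern.findall(d)) for d in descriptions]
        stats[kw] = {
            "occurrences": sum(counts),  # total mentions across all ads
            "pct_of_ads": 100 * sum(c > 0 for c in counts) / len(descriptions),
        }
    return stats
```

An ad repeating "Python" twice raises `occurrences` by two but `pct_of_ads` only once, which is exactly why the two measures can diverge.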

Afterwards, we proceeded with the location analysis, in which we examined the frequency of locations for each separate job search, resulting in a distinct plot per search with the associated location frequencies. The last step was again to combine the different plots into one single plot to facilitate comparison; the final plot gave a good overview of the top cities for each job. We finalized our analysis with a salary analysis for the different jobs: we first removed the observations with no salary data, then determined the top locations salary-wise for each job title, and ended with a combined plot of the top-paying locations per job type.
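The frequency step is a plain tally of cleaned location strings; a hypothetical Python equivalent of what the R code plots (function name is ours):

```python
from collections import Counter

def top_locations(locations, n=5):
    """Count how often each (cleaned) location appears for one job search
    and return the n most frequent as (location, count) pairs."""
    return Counter(locations).most_common(n)
```

For example, `top_locations(["Amsterdam", "Utrecht", "Amsterdam"], 2)` yields Amsterdam first with a count of 2.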

In the following section we give a short overview of the main findings from our analysis. Due to limited salary data, many locations had only a single job ad with salary information, and a single observation is not a meaningful average. We therefore only incorporated locations with a minimum of 3 job ads when analyzing the average salaries per location.
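The minimum-ads rule can be sketched as follows (a hypothetical Python rendering of the filtering described above; names are ours):

```python
from collections import defaultdict

def avg_salary_by_location(rows, min_ads=3):
    """Average yearly salary per location, keeping only locations with at
    least `min_ads` non-missing salary observations."""
    by_loc = defaultdict(list)
    for loc, salary in rows:
        if salary is not None:  # drop ads with missing salary data
            by_loc[loc].append(salary)
    return {loc: sum(s) / len(s)
            for loc, s in by_loc.items() if len(s) >= min_ads}
```

Locations below the threshold simply drop out of the result, which is why relaxing `min_ads` trades plot coverage against the reliability of each average.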

  • We find that for data analist jobs, handling databases with SQL is the skill most sought after by employers, followed by Microsoft Excel, Python and R. Amsterdam currently has the most vacancies for a data analist, followed by Utrecht, Rotterdam, Den Haag and Eindhoven. Zaandam and Amsterdam are the cities with the highest average salaries, with Apeldoorn another attractive option. These top 3 cities all have average salaries close to or even over 60,000 euros per year, considerably more than the national average.

  • For becoming a data scientist, Python is the skill students will need most, followed by SQL, R, machine learning and Microsoft Excel, in that order. When searching for a job as a data scientist, we found that Amsterdam currently has the most vacancies in the Netherlands, followed by Utrecht, Leiden, Rotterdam and Delft. The top locations for data scientists in terms of salary are Amsterdam and Zaandam; in both, the average salary for data scientists is over 60,000 euros annually, well above the average earnings for the entire country.

  • If you have the ambition to become a marketeer, employers will most often ask you to handle Excel, followed by HTML; some knowledge of SQL and R is also sought. For future marketeers, Amsterdam is the place with the most vacancies at this moment, followed by Utrecht, Rotterdam, Eindhoven, Nijmegen and Den Bosch. The highest-paid marketeer jobs are generally found in Zaandam and Schiedam, with Zaandam offering salaries over 60,000 euros, nearly twice the average salary for marketeers. Beyond these two cities, Utrecht, Amsterdam, Rotterdam, Nijmegen and Eindhoven are all fairly similar in terms of salaries, with average annual incomes close to 40,000 euros.

  • Lastly, for future marketing analists, it is useful to learn Excel, SQL, Python, R and Tableau. When searching for a job as a marketing analist, we found that Amsterdam has the most vacancies, followed by Rotterdam, Utrecht, Den Haag and Eindhoven. For marketing analists, when looking for the highest salaries on average, the best places to find a job are Amsterdam, Almere, Den Haag and Utrecht.

Repository overview

Our repository has the following structure:


├───docs
└───src
    ├───analysis
    ├───collection
    └───data-preparation
    

Running instructions

For this project, we made use of GNU Make. This means that the whole process, from loading the data to running the analyses, can be done at once by running the makefile. If you are not (yet) familiar with makefiles, we advise you to take a look at the following tutorial before running the makefile. We also advise you to have a look at the makefile itself before you run it.

Before running the makefile, make sure you have the following packages installed:

For Python:

    pip install requests
    pip install bs4
    pip install DateTime
    pip install selenium
    pip install pandas

For R:

    install.packages("tidyverse")
    install.packages("tidyr")
    install.packages("stringr")
    install.packages("reshape")
    install.packages("knitr")
    install.packages("dplyr")

Next to these packages, you will need ChromeDriver to run the scraper. Because ChromeDriver is operating-system specific, we refer you to the following tutorial. You will also need to install a TeX distribution (a typesetting system); see the tutorial on how to install one.

More resources

For this project, we did not use related papers or previous studies. Tilburg Science Hub did help us greatly in understanding how to deploy this project from beginning to end. Have a look at the website to learn more about how to efficiently carry out data- and computation-intensive projects.

About

This project has been conducted by students from Tilburg University for the two courses Data Preparation & Workflow Management and Online Data Collection & Management, both instructed by Hannes Datta. All team members were involved in building, developing and optimizing the workflow, and in cleaning, analyzing and reporting the data. The following 4 students contributed to the project: Georgiana Hutanu, Anouk Heemskerk, Alan Rijnders and Renee Nieuwkoop.

indeed-job-listings's People

Contributors

alantjee, anouk2311, georgianahutanu, reneen1998


indeed-job-listings's Issues

Data cleaning dirty location goes wrong

Hi! I was checking the data_clean.R code and our remove_dirty_location also removes 'Noord' from Noord-Holland and Noordwijk so they end up as -holland and wijk. So maybe we should not delete the words first, but directly replace the whole string?

E.g. so replace Amsterdam-noord by Amsterdam, instead of removing noord everywhere.

Salary cleaning in cleaning file

Hi guys, most of the code with functions seems to work so far; only in the cleaning file, where I added the salary cleaning step, does it go wrong at the salary cleaning function (the 4th one). If you could have a look as well, that would be great. I'm trying to fix it right now but don't seem to be getting any closer.

clean data functions

Hi! I tried to change the clean_data.R file into a file with functions for cleaning the data. However, apparently there are some problems with using dplyr within a function in R. Can one of you take a look at this as well? I don't see how to fix it. Right now, it doesn't create a new column named location_trimmed.

I also included a prototype function to run the cleaning functions on all datasets. But we should first fix that dplyr problem, before testing if this works:
(Screenshot 2021-03-23 at 12:21:30 attached)

Format changes analysis

Hey guys,

I made some changes in the formatting of the Rmarkdown so that the pdf becomes more readable (Like blank lines between header and text, new chapters on new pages and plots to stop floating to the right). Please take a look at it and let me know if you still see some things that need changing. Thanks!

Combined plot salary analysis

Our combined plot for the salary analysis for top locations salary wise only shows 3 cities because they are the only 3 cities that show up in all 4 plots. If we relax the filter of 3 job postings per location we will get a plot with more cities in it but some of these cities will have only one job ad per search term and thus an average is not very useful in this case.

What do you guys want, keep the plots like this with only a few cities to be compared, or remove the minimum number of job ads needed per location to get plots with more cities in the plot but single job ads having higher influence on average salaries.

download_data issue

Hi guys, I just tested download_data.R and it seems that there is still an issue for the marketing-analist data, I only get the listings and the first 3 descriptions. Can you have a look at it?

Last things Readme

  • Update the repository overview <-- I'll pick that up
  • Integrate analysis salary part in the results overview <-- I'll pick it up when salary analyses is fixed
  • Should we give a description on how to run the makefile? Or is that 'common sense'
  • Overall read through and last checks

Almost done! Great job everyone :)

Download data improved file

Hi! Could you please check if the new download_data.R file works for you? I included functions and it now downloads the data directly from Google Drive instead of Github.

If it works, we can delete all datasets from Github so they are not public anymore.

Error in makefile

Not sure how to add the analysis and analysis/output folders in gen. Tried it with directory.R now (see workflow), but still gives an error.

(screenshot of the error attached)

Create driver object in selenium scraper gives error

The code we now have is "driver = webdriver.Chrome()". I don't know how it runs on your computers, but I have to put in my path between the brackets, e.g. "driver = webdriver.Chrome('/usr/local/bin/chromedriver')".

The code has to run on all computers without adjustments, right?

Salary cleaning needs to be performed in a subsequent file.

Hi, I am currently updating the analysis scripts to incorporate all 4 datasets and the keyword analysis for each job search term. However, the datasets I use should not have the salary cleaning applied, because it reduces the number of observations massively. I think it is better to separate the cleaning steps: the location-string cleaning and duplicate removal in one file, and the salary cleaning in the salary analysis file.

Switching of the review/location

Hey guys, at the #getjoblocation part and the #getcompanyreview part in the scraper, the locations and the reviews get mixed up with each other. I think I found the problem, but I don't know how to fix it. Could one of you have a look at it? See the pictures. text.splitlines()[1] gives the reviews in this case and text.splitlines()[2] gives the location.

This is why these two get mixed up: not every vacancy has a review component.

(screenshots attached)

Documentation for ODCM is completely done, Readme needs additions

Hey guys,

The documentation (datasheet) for ODCM is imo completely done, please do have a look before the deadline to make any enhancements or changes.

The Readme file has already been filled in for the most part, but it still needs the section where we explain exactly how to run everything. I will start working on that. Maybe it's a good idea to shorten the Method & Results part a little bit? That makes it easier to read.

Error marketeer data

Hi! I was checking for that error in line 5 of the marketeer data, and I think there is something wrong with the separator. Not sure why, because we did the same with this data set as we did with the others.

(screenshots attached)

How to name ambiguous location strings

We now give the name Unknown to all locations that are not in a specific city. Should we maybe change this to Remote or Online or something similar?

last parts frequency and location

The keyword analysis and location frequency are now largely turned into functions. It would be nice if you could check whether it runs on your own computer as well. Some of the last parts I did not turn into functions yet, so the code could still be more efficient, but we already reduced it by half, so we are on the right track.
