The original project goal was to gather insight into what is needed to become a Data Scientist in Canada.
The project has since grown and now aims to capture the skills required for a variety of roles across multiple locations in an analytics web app. The project description will soon be updated to reflect the current goals.
Below is the development process so far; any part of it is subject to change as development continues.
Data was collected using a Selenium web crawler. Every available job posting matching the search term "Data Science" was gathered. The exact procedure can be found in the get_data notebook.
Updates have been made to integrate a data pipeline. The pipeline scrapes data and walks it through preprocessing all the way to model prediction, making database calls as needed.
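The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: `scrape`, `preprocess`, `predict`, and `run_pipeline` are hypothetical names, the scraper and model are replaced by stand-ins, and an in-memory SQLite database is assumed in place of the real one.

```python
import sqlite3


def scrape():
    """Stand-in for the Selenium crawler; returns one hard-coded example posting."""
    return [{"title": "Data Scientist",
             "description": "Must know Python. We offer benefits."}]


def preprocess(posting):
    """Split a description into lowercase sentences (simplified for illustration)."""
    return [s.strip().lower() for s in posting["description"].split(".") if s.strip()]


def predict(sentence):
    """Stand-in for the trained model: a keyword rule replaces the classifier."""
    return "qualification" if "must" in sentence else "description"


def run_pipeline(conn):
    """Scrape, preprocess, predict, and store every labelled sentence."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences (title TEXT, sentence TEXT, label TEXT)"
    )
    for posting in scrape():
        for sentence in preprocess(posting):
            conn.execute(
                "INSERT INTO sentences VALUES (?, ?, ?)",
                (posting["title"], sentence, predict(sentence)),
            )
    conn.commit()
    return conn.execute(
        "SELECT sentence, label FROM sentences ORDER BY rowid"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline(sqlite3.connect(":memory:")):
        print(row)
```

Keeping each stage as a separate function means the stand-ins can be swapped for the real crawler and model without touching the database logic.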
Multiple cleansing procedures were applied, such as:
- Removing major punctuation marks
- Formatting data into a pandas-readable format
- Lemmatisation
- Appropriately splitting the data
- Eliminating unrelated roles
The procedures can be found in the formatting notebook.
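A few of the cleansing steps listed above can be sketched with the standard library alone. This is an illustrative sketch, not the code from the formatting notebook: the function names and the `RELATED_KEYWORDS` list are assumptions, and lemmatisation is left out because it needs an NLP library such as NLTK or spaCy.

```python
import re

# Hypothetical keyword list for spotting related roles; not taken from the project.
RELATED_KEYWORDS = ("data scientist", "data science", "machine learning")


def split_sentences(text):
    """Split a posting into sentences after ., ! or ?, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def strip_punctuation(sentence):
    """Lowercase and drop punctuation, keeping + and # for terms like c++ and c#."""
    cleaned = re.sub(r"[^\w\s+#]", " ", sentence.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def is_related_role(title):
    """Return True when the job title contains one of the assumed keywords."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in RELATED_KEYWORDS)


def clean_posting(title, description):
    """Return cleaned sentences for related roles, or an empty list otherwise."""
    if not is_related_role(title):
        return []
    return [strip_punctuation(s) for s in split_sentences(description)]


if __name__ == "__main__":
    print(clean_posting("Senior Data Scientist", "Build models. Know SQL, Python."))
```

The cleaned sentences are plain strings, so they can be loaded straight into a DataFrame column for the later modelling steps.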
Manually finding the qualifications for each role within the data (404 job postings) would be extremely time-consuming and cumbersome. Therefore, natural language processing was used to distinguish qualification sentences from general job description sentences.
A wide variety of models were tested and interpreted:
- Naive Bayes classifier
- Linear Kernel Support Vector Machine
- Gaussian Radial Basis Kernel Support Vector Machine
- Long Short-Term Memory (LSTM) Neural Network
- BERT transfer learning
Results plateaued at roughly the same level for both SVM models and the Naive Bayes classifier (~88% accuracy on the test set). Unfortunately, the LSTM network did not work well with the data, achieving the lowest accuracy at 83%. BERT transfer learning was found to be highly accurate for this text classification task (94% accuracy) and will be used in production.
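To illustrate the sentence classification task, below is a minimal multinomial Naive Bayes sketch written with the standard library only. It is not the model evaluated above: the class name, the label names, and the six training sentences are invented for illustration, and the accuracy figures reported in this README do not apply to it.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in sentence.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for this example.
TRAIN = [
    ("bachelor degree in computer science required", "qualification"),
    ("3 years of experience with python and sql", "qualification"),
    ("experience with machine learning frameworks", "qualification"),
    ("we are a fast growing company in toronto", "description"),
    ("you will join a collaborative team", "description"),
    ("our office offers free lunch and great benefits", "description"),
]

if __name__ == "__main__":
    texts, labels = zip(*TRAIN)
    model = NaiveBayesSentenceClassifier().fit(texts, labels)
    print(model.predict("experience with sql required"))
```

Scores are summed in log space so that long sentences do not underflow to zero probability.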
- Word Bigrams/Trigrams to determine common terminology of desired role
- Pie chart to encapsulate most desired technical skills
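Counting word bigrams and trigrams, as in the first item above, can be sketched as follows. The helper names `ngrams` and `top_ngrams` are hypothetical, and whitespace tokenisation is assumed here for simplicity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(sentences, n, k=3):
    """Count n-grams across sentences and return the k most common."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts.most_common(k)


if __name__ == "__main__":
    example = ["machine learning models", "deep machine learning"]
    print(top_ngrams(example, n=2, k=1))
```

The same counts could also feed the pie chart of most desired technical skills once the n-grams are filtered down to skill terms.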
To do
- Location filtering
- Packages/frameworks utilized
- Cloud technologies
- Statistical concepts (where applicable)