I am a data scientist currently working in fraud analytics where I lead teams in finding and deterring identity theft. I have a background in Neuroscience and a passion for problem solving and solution oriented thinking.
I analysed three, manually labelled clickbait datasets using natural language processing. Utilizing tf-idf and naive bayes I was able to train a fast, light-weight, and powerful statistical model that identifies clickbait with ~90% accuracy. Using flask and gunicorn, I hosted the model on a heroku server and set up a POST API endpoint. Then, using bootstrap's precompiled destributions of NODE.js and CSS I created a simple front end web application.
My partner and I built an ensemble, voting-classifier using 3 transfer learning models (VGG16, DenseNet121, MobileNetV2) to predict the presence of pneumonia from x-rays of children's (age 1-5) chests. Given the use case, we optimised for both recall and accuracy. Initially, we optimised for recall and acheived a test prediciton containing no false negatives. However, we also wanted to insure that medical professionals weren't being inundated with false positives so we included accuracy as an evaluation metric.
- Accuracy: 0.9038
- Recall: 0.9897
In late 2019, the world was hit by the Sars-Covid-2. To prevent the spread of the disease, US state governments locked down the economy in March 2020. As a result of the pandemic, millions of people lost their jobs. Using the current population survey (CPS) in additon to information scraped from news sources and covid data from government agencies, this research identified key early indicators that US household may lose their employment and in turn aid in the allocation of resources before the need is dire.
- The part time workers even holding multiple jobs were hardest hit by unemployment.
- The older the house hold primary earner was the more likely they were to become unemployed.
- Interestingly, the industry had people worked had little bearing on the probability of them becoming unemployed.
The present research utilized the Kaggle King County data set to train a linear regression model of housing prices in that region. The model generated 420 features via interactions and polynomial columns as well as some non-linear transformations of the square footage columns. In particular, I utilized geohashing to increase the spatial resolution beyond zipcode and create a more accurate prediction.
I like to continually develop my coding skill set and am currently expanding my knowledge and experience in a variety of languages
- Python (and it's associated data science libraries)
- SQL
- R
- C
- Octave/Matlab
- Javascript
- HTML
- CSS