Leondra R. Gonzalez's Projects
Capstone project #2 for the Harvard University Professional Certificate in Data Science
A Python XGBoost model built on Amazon SageMaker, using EC2 instances and S3 buckets to prepare, partition, train, tune, and evaluate the model. The project predicts which bank customers will sign up for a financial product.
An exploratory analysis of the Kaggle bikeshare dataset applying linear regression models, which turn out to be suboptimal for predicting the number of bikes rented.
Leveraging random forest regression and XGBoost algorithms with cross-validation and grid search to tune the best-performing model on the Boston Housing dataset. Analyzed and visualized the most statistically significant features for both models. Achieved an RMSE of roughly $2K.
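The tuning workflow described above can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code: synthetic data stands in for the Boston Housing dataset (its loader has been removed from recent scikit-learn releases), and the hyperparameter grid is made up.

```python
# Hypothetical sketch: cross-validated grid search over a random forest
# regressor. Synthetic data stands in for Boston Housing; grid values
# are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                  # stand-in features
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)   # stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_tr, y_tr)

# evaluate the tuned model on held-out data
rmse = np.sqrt(mean_squared_error(y_te, grid.best_estimator_.predict(X_te)))
print(grid.best_params_, round(rmse, 3))
```

The same `GridSearchCV` wrapper applies to an XGBoost regressor by swapping in a different estimator and parameter grid.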
My first attempt at implementing a neural network using the Boston housing data set from the MASS library.
Candy Crush Level Difficulty Analysis
This is a descriptive and exploratory data analysis project from DataCamp which explores real data on every Chipotle location to identify franchising opportunities. The goal is to scout out the next Chipotle location using interactive maps (i.e., Leaflet) and external data to compare proposed locations on several important factors, such as proximity to current Chipotle locations, the distribution of the state's population, and the distance from interstates and tourist attractions.
Utilizing tools such as Spark, Python (PySpark), SQL, and Databricks, performed logistic regression on customer data to predict customers at a higher risk of churning, then applied the model to an unseen "new customers" dataset.
Data Visualizations
A cluster analysis leveraging the k-means algorithm to determine which degrees are likely to yield which levels of income, based on historical data.
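The clustering step might look roughly like this; the salary figures below are invented for illustration and are not the project's data:

```python
# Illustrative sketch: k-means on (starting, mid-career) median salary
# features per degree. All numbers are made up.
import numpy as np
from sklearn.cluster import KMeans

salaries = np.array([
    [46_000, 77_000],   # hypothetical humanities-style degree
    [44_000, 71_000],
    [61_000, 105_000],  # hypothetical engineering-style degree
    [64_000, 110_000],
    [38_000, 60_000],   # hypothetical lower-earning degree
    [40_000, 63_000],
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(salaries)
print(km.labels_)            # cluster assignment per degree
print(km.cluster_centers_)   # the income level each cluster represents
```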
Analysis of Disney's top-grossing films (adjusted for inflation) in Python, using regression to estimate the effect of film genre on a film's success. The project includes fitting a regression on the data, as well as bootstrap regression to determine confidence intervals for the intercept and coefficients.
Data Science & Machine Learning Data Capstone based on the Moneyball dataset
Used NLP techniques (tokenization, stemming, TF-IDF vectorization) and clustering algorithms (k-means and hierarchical clustering) to mine the "similarities" between films based on their plots as provided by IMDb and Wikipedia. The dataset contains the titles of the top 100 movies on IMDb.
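A minimal sketch of that pipeline (TF-IDF vectorization followed by k-means), using toy plot summaries in place of the IMDb/Wikipedia text:

```python
# Minimal sketch of the TF-IDF + k-means pipeline on toy plot summaries;
# the real project clusters the top 100 IMDb film plots.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "a detective hunts a serial killer in the city",
    "a police detective investigates a string of murders",
    "two robots explore space and discover a distant planet",
    "an astronaut crew travels through space to a new planet",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(plots)          # sparse TF-IDF document-term matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # plots with overlapping vocabulary share a cluster
```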
This is my first attempt at a KNN model, in which I classify the purchase of caravan insurance in the Caravan dataset (ISLR package).
Video games are big business: the global gaming market is projected to be worth more than $300 billion by 2027 according to Mordor Intelligence. With so much money at stake, the major game publishers are increasingly incentivized to create the next big hit. But are games getting better, or has the golden age of video games already passed? In this project, I explore the top 400 best-selling video games created between 1977 and 2020. This is achieved by comparing gaming sales data with critic and user review data. In doing so, we can discover whether video games have improved as the gaming market has grown. Each table is limited to 400 rows for this experiment, but the complete dataset with over 13,000 games can be found on Kaggle.
Analysis of the co-occurrence network of characters in the Game of Thrones books. Here, two characters are considered to co-occur if their names appear within 15 words of one another in the books. This project utilized graph analysis and modeling frameworks such as Google's PageRank algorithm.
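The ranking step can be illustrated with a tiny hand-rolled PageRank (power iteration) on a made-up co-occurrence graph; the real project ranks a much larger character network, and the names and edges here are purely illustrative:

```python
# Hand-rolled PageRank via power iteration on a made-up co-occurrence
# graph. Co-occurrence is symmetric, so each edge is added in both
# directions. Names and edges are illustrative only.
edges = [("Ned", "Robert"), ("Ned", "Jon"),
         ("Robert", "Cersei"), ("Cersei", "Jaime")]

nodes = sorted({n for e in edges for n in e})
neighbors = {n: [] for n in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

d = 0.85                                   # damping factor
rank = {n: 1 / len(nodes) for n in nodes}
for _ in range(50):                        # power iteration
    new = {n: (1 - d) / len(nodes) for n in nodes}
    for src in nodes:
        for dst in neighbors[src]:
            new[dst] += d * rank[src] / len(neighbors[src])
    rank = new

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # most central first
```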
Capstone Submission #1 for the Harvard University Professional Certificate in Data Science.
"What Your Heart Is Telling You" Logit Model
Predicting the number of crew members required to man a Hyundai cruise ship, based on information such as the number of cabins and passengers, using linear regression. Leveraged SQL and PySpark.
Used SQL in Jupyter Notebooks to analyze and explore international debt data.
My first attempt at building an SVM model, optimizing the cost and gamma parameters of the Gaussian kernel via grid search.
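That tuning procedure looks roughly like the sketch below, shown here with scikit-learn on synthetic toy data; the C and gamma grids are made up for illustration:

```python
# Hedged sketch: grid search over the cost (C) and gamma parameters of an
# RBF ("Gaussian") kernel SVM, on synthetic data. Grid values are made up.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```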
Association rule mining using the Apriori algorithm
Multi-touch attribution models, including Markov chains
Two A/B tests, measuring the difference in (1) average 1-day and (2) 7-day player retention between the control group (old player level) and the new version (new player level)
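A retention comparison like this is often summarized with a two-proportion z-test; the sketch below uses invented retention counts (not the project's data) and a hand-rolled pooled-standard-error statistic:

```python
# Illustrative two-proportion z-test for a retention A/B comparison.
# All counts below are invented, not the project's data.
from math import sqrt

def z_test(success_a, n_a, success_b, n_b):
    """z statistic for the difference of two proportions (pooled SE)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))       # pooled standard error
    return (p_a - p_b) / se

# hypothetical 1-day retention: control vs new version
z1 = z_test(success_a=20_100, n_a=44_700, success_b=20_000, n_b=45_500)
print(round(z1, 2))
```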
Given the large number of movies and series available on Netflix, it is a perfect opportunity to dive into the entertainment industry with an analysis of Netflix content durations. This analysis aims to understand trends in content duration on the Netflix platform from 2011 through 2020.
An analysis and prediction of taxi fares based on 2013 NYC data using decision trees and random forests.