Gary Waiyaki's Projects
Work through this exercise to hone your visualization skills and your understanding of Bayesian hyperparameter optimization in Python for a LightGBM model.
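As a minimal sketch of the idea, assuming the bayesian-optimization package (`bayes_opt`) and a synthetic dataset standing in for the project's data, the tuning loop might look like this:

```python
# Hedged sketch: tune two LightGBM hyperparameters with Bayesian optimization.
# Assumes `pip install lightgbm bayesian-optimization`; the data is synthetic.
from bayes_opt import BayesianOptimization
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

def lgbm_cv(num_leaves, learning_rate):
    # Objective: mean cross-validated accuracy for a given hyperparameter pair.
    model = LGBMClassifier(
        num_leaves=int(num_leaves),
        learning_rate=learning_rate,
        n_estimators=100,
        random_state=42,
    )
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

optimizer = BayesianOptimization(
    f=lgbm_cv,
    pbounds={"num_leaves": (8, 64), "learning_rate": (0.01, 0.3)},
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=10)
print(optimizer.max)  # best score and hyperparameters found
```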
In this project you will: clean and transform data (handling missing values, removing duplicates, etc.); visualize relationships in the data (correlation heatmaps, pairplots, etc.); pre-process the data and split it into training and testing sets; and finally present and share your findings.
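A hedged sketch of that workflow, with a placeholder `data.csv` and a hypothetical `target` column rather than the project's actual dataset:

```python
# Sketch of the cleaning -> visualization -> split workflow; the file and
# column names are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")

# Cleaning: drop duplicates, fill missing numeric values with column medians
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Visualization: correlation heatmap and pairplot
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
sns.pairplot(df)
plt.show()

# Split into training and testing sets (assumes a 'target' column)
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```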
Time Series Project on UK Electricity Consumption 2009-2023
Kenny's Source Control with Git Public Repo
Practice what you've learned about cosine similarity by completing this exercise. While working through it, you'll see how cosine similarity is calculated on a numeric dataset and explore its utility for record matching and NLP projects.
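For reference, a minimal sketch of the calculation itself, on two made-up numeric vectors:

```python
# Cosine similarity between numeric records, both by hand with NumPy and
# with scikit-learn; the two vectors are invented examples.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])

# By definition: dot product divided by the product of the norms
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D arrays (rows = records)
sk = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, sk)  # both close to 1.0: the vectors point in nearly the same direction
```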
This case study explores K-Means clustering: you'll find a suitable value for K using the Elbow method, the Silhouette method, and the Gap statistic, and visualize the clusters with Principal Component Analysis (PCA), working with real data containing information on marketing newsletters and email campaigns as well as transaction-level customer data.
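A minimal sketch of the elbow and silhouette steps plus the PCA visualization (the gap statistic has no scikit-learn implementation, so it's omitted here); synthetic blobs stand in for the customer data:

```python
# Choose K via elbow and silhouette, then project the clusters to 2-D with PCA.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

inertias, silhouettes = [], []
ks = range(2, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                         # elbow method input
    silhouettes.append(silhouette_score(X, km.labels_))  # silhouette method

best_k = ks[silhouettes.index(max(silhouettes))]
labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)

# Visualize clusters in the first two principal components
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```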
The case study will involve your use of the full data science pipeline, from importing, loading, and cleaning the data right through to modeling and drawing conclusions. In it, you'll use decision trees to implement the supervised learning method of classification.
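A minimal sketch of that classification step, with the Iris dataset standing in for the case study's data:

```python
# Supervised classification with a decision tree on a stand-in dataset.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```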
Keen to put what you've learned about Euclidean and Manhattan distance to the test? This exercise asks you to apply these two distance metrics to the same dataset and visualize the results.
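As a quick sketch of how the two metrics differ on the same pair of points, using SciPy's distance functions:

```python
# Euclidean vs Manhattan distance on one made-up pair of points.
import numpy as np
from scipy.spatial.distance import cityblock, euclidean

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

print(euclidean(p, q))  # sqrt((4-1)^2 + (6-2)^2) = 5.0
print(cityblock(p, q))  # |4-1| + |6-2| = 7.0
```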
In this case study, you'll learn more about frequentist inference. There are two parts to the case study. In Part A, you'll learn the Pythonic implementation of the concepts underlying frequentist inference. In Part B, you'll apply those implementations to a real-world scenario.
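As one small example of the kind of building block Part A covers, a one-sample t-test on simulated data (the sample here is synthetic):

```python
# Frequentist building block: a one-sample t-test with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=12.0, scale=3.0, size=50)

# H0: the population mean is 10; the p-value tells us how surprising the
# observed sample mean would be if H0 were true.
t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)
print(t_stat, p_value)
```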
In this exercise, you will gain a full understanding of how gradient boosting works to improve predictions based on information from the residuals. First, you'll apply this method to a regression problem, then to a classification problem using the Titanic dataset.
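To make the residual idea concrete, here is a hand-rolled sketch of two boosting rounds for regression, on synthetic data rather than the exercise's:

```python
# Gradient boosting by hand: each new tree fits the previous model's residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
pred = np.full_like(y, y.mean())  # round 0: constant prediction

for _ in range(2):
    residuals = y - pred                         # what's left to explain
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)      # correct toward the residuals

print("MSE after boosting:", np.mean((y - pred) ** 2))
```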
In this exercise, you'll use grid search to identify the optimal number of neighbors for a K-nearest neighbors model.
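A minimal sketch of that search with scikit-learn, using the Iris dataset as a stand-in:

```python
# Grid search over the number of neighbors for KNN.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```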
Sharpen your data wrangling skills by completing this mini-project.
In this case study, you'll use Random Forest and logistic regression to understand the scope of the coronavirus outbreak, using data from December and January of 2020.
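A hedged sketch of the comparison pattern, fitting both models on the same split; synthetic data stands in for the case study's dataset:

```python
# Fit Random Forest and logistic regression on the same split and compare.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```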
In this case study, you'll analyze whether there is a significant difference between the ratings on these two platforms that would justify choosing one over the other. If there's not, you can always just flip a coin to pick which platform to use at random.
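One common way to make such a comparison is a two-sample t-test; a minimal sketch on invented rating arrays:

```python
# Two-sample t-test on ratings from two platforms (data invented here).
import numpy as np
from scipy import stats

ratings_a = np.array([4.1, 3.9, 4.5, 4.2, 3.8, 4.0, 4.3])
ratings_b = np.array([3.7, 4.0, 3.9, 4.1, 3.6, 3.8, 4.0])

t_stat, p_value = stats.ttest_ind(ratings_a, ratings_b, equal_var=False)
print(p_value)  # p >= 0.05 -> no significant difference; a coin flip will do
```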
This case study explores which properties of red wines are associated with higher alcohol content.
We are going to scrape some financial data (stock prices) from Yahoo Finance, using requests and Beautiful Soup to fetch and parse the information.
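A minimal sketch of the requests-plus-parsing pattern; the URL, headers, and what you ultimately select are assumptions, since Yahoo's markup changes often:

```python
# Fetch a quote page and parse it with Beautiful Soup. Treat this as the
# general pattern, not a recipe tied to Yahoo's current page structure.
import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/AAPL"  # hypothetical target page
headers = {"User-Agent": "Mozilla/5.0"}       # many sites reject bare clients

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)  # inspect the page, then select the price element
```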
In this case study, you'll use MySQL, phpMyAdmin, Jupyter Notebook, and SQLite to tackle a series of challenges on a database containing information about a country club.
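As a sketch of the SQLite side from a notebook; the database file, table, and column names here are hypothetical:

```python
# Run a challenge-style query against SQLite with Python's standard library.
import sqlite3

conn = sqlite3.connect("country_club.db")  # placeholder database file
cur = conn.cursor()
cur.execute("""
    SELECT name, membercost
    FROM Facilities
    WHERE membercost > 0
""")
for row in cur.fetchall():
    print(row)
conn.close()
```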
In this exercise, you will make like your great data storyteller forebears and tell a compelling story about a dataset of interest to you.
The King County, Washington House dataset is a collection of records about single-family homes sold in King County, Washington, between 2014 and 2015.
As a US government data scientist, you'll analyze historical sales data from Cowboy Cigarettes (est. 1890) spanning 1949-1960. Your goal is to predict sales trends in the early 1960s for a report on public health and cigarette companies.
Time Series Analysis and Forecasting for sales data
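For time-series forecasting work like these last two projects, a minimal sketch of a seasonal ARIMA fit with statsmodels; the monthly series and the model orders are placeholders, not fitted choices:

```python
# Seasonal ARIMA forecast sketch; synthetic monthly sales with trend + seasonality.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("1949-01", periods=144, freq="MS")
rng = np.random.default_rng(42)
sales = pd.Series(
    np.linspace(100, 200, 144)
    + 10 * np.sin(2 * np.pi * np.arange(144) / 12)
    + rng.normal(scale=3, size=144),
    index=idx,
)

results = SARIMAX(sales, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
forecast = results.forecast(steps=24)  # predict the next two years
print(forecast.head())
```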