TidyX

Hosts

Ellis Hughes and Patrick Ward.

Ellis has been working with R since 2015 and has a background working as a statistical programmer in support of both Statistical Genetics and HIV Vaccines, and currently works as a Data Science Lead. He also runs the Seattle UseR Group.

Patrick's current work centers on research and development in professional sport, with an emphasis on data analysis in American football. Previously, he was a sport scientist in the Nike Sports Research Lab. His research interests include training and competition analysis as they apply to athlete health, injury, and performance.

Description

The goal of TidyX is to explain how R code works. We focus on explaining topics we find interesting or submissions from our viewers. Historically, we explained how submissions to the #TidyTuesday Project worked to help promote the great work being done there.

In this repository, you will find copies of the code we've explained, and the code we wrote to show the concept on a new dataset.

To submit code for review, email us at [email protected]

To watch more episodes, go to our YouTube channel.

Patreon

If you appreciate what we are doing and would like to support TidyX, please consider signing up to be a patron through Patreon.

https://www.patreon.com/Tidy_Explained

TidyX Episodes

  • Episode 1: Introduction and Treemaps!

  • Episode 2: The Office, Sentiment, and Wine

  • Episode 3: TBI, Polar Plots and the NBA

  • Episode 4: A New Hope, {Patchwork} and Interactive Plots

  • Episode 5: Tour de France and {gganimate}

  • Episode 6: Lollipop Charts

  • Episode 7: GDPR Faceting

  • Episode 8: Broadway Line Tracing

  • Episode 9: Tables and Animal Crossing

  • Episode 10: Volcanoes and Plotly

    • Ellis and Patrick explore this week's #TidyTuesday dataset!
  • Episode 11: Times Series and Bayes

  • Episode 12: Cocktails with Thomas Mock

  • Episode 13: Marble Races and Bump Plots

  • Episode 14: African American Achievements

  • Episode 15: Juneteenth and Census Tables

    • Ellis and Patrick show US Census tables in a report, broken down into divisions, highlighting values using {colortable}
    • Source Code
  • Episode 16: Caribou Migrations and NBA Shots on Basket

  • Episode 17: Uncanny X-men and Feature Engineering

  • Episode 18: Coffee and Random Forest

  • Episode 19: Astronauts and Dashboards

  • Episode 20: Cocktails with David Robinson

  • Episode 21: The Birds

  • Episode 22: European Energy and Ball Hogs

  • Episode 23: Mailbag and Expected Wins

    • Ellis and Patrick go into our mailbag and focus on a request we recently had on loops and functions.
    • Source Code
  • Episode 24: Waffle plots and Shiny

  • Episode 25: Intro To Shiny

    • This is the start of a series of episodes covering more in-depth uses of {Shiny}, an R package by Joe Cheng for creating web applications. In this episode we cover the basics of Shiny and explain the concept of reactive programming.
    • Source Code
  • Episode 26: Labels and ShinyCARMELO - Part 1

  • Episode 27: LIX and ShinyCARMELO - Part 2

  • Episode 28: Nearest Neighbors and ReactiveValues

    • This week Ellis and Patrick explore how to perform career analysis and projections using the KNN algorithm. Using those concepts, we jump into part three of our Shiny demo series, where we have Shiny execute a KNN for our input players. We show how to create an action button to execute our code, and use reactiveValues to store the results for plotting!
    • Source Code
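A minimal sketch of the KNN step described above - not the episode's actual code - using the {class} package and the built-in iris data as a stand-in for the player statistics:

```r
# KNN classification sketch with the {class} package.
# iris stands in for the player-statistics data used in the episode.
library(class)

set.seed(2020)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, 1:4]
test  <- iris[-train_idx, 1:4]

# Predict species for the held-out rows using the 5 nearest neighbors
pred <- knn(train, test, cl = iris$Species[train_idx], k = 5)

# Misclassification rate on the hold-out set
mean(pred != iris$Species[-train_idx])
```

In the app, the `test` rows would come from the user's inputs instead of a hold-out split.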
  • Episode 29: Palettes and Random Effects

  • Episode 30: Tweet Sentiment

    • Patrick and Ellis were inspired by all the sentiment analysis performed for #TidyTuesday this week, so we decided to look at tweets and comment on additional things to be aware of when doing sentiment analysis. Using {rtweet}, we pull over 50,000 tweets that used #Debate2020 and discuss how context is incredibly important to analysis.
    • Source Code
  • Episode 31: Reactable

    • This week's #TidyTuesday dataset was on NCAA Women's Basketball Tournament appearances. Patrick and Ellis have shown in the past how tables can be used for data visualization, and wanted to learn another package for doing so. {reactable} is a great-looking package, so we spend some time showing how to use it, apply column definitions, and even embed html widgets within the table!
    • Source Code
  • Episode 32: Shiny with Eric Nantz

    • This week's #TidyTuesday dataset was a super fun one. Ellis and Patrick are joined by Eric Nantz, who created a Shiny app to explore and animate the data. We talk through several new Shiny concepts, like using {golem}, crosstalk, and other Shiny packages like {bs4Dash}!

    • UseR Highlighted: Eric Nantz

    • Source Code

  • Episode 33: Beer and State Maps

  • Episode 34: Wind and Maps

  • Episode 35: Rectangles

  • Episode 36: Animated Plotly

    • This week's #TidyTuesday dataset was on mobile and landline subscriptions across the world. We saw lots of animated plots this week and wanted to add our own. Using {plotly}, we make an interactive plot that animates across time to show how GDP relates to the raw subscription numbers. We also do some exploration with line plots.
    • Source Code
  • Episode 37: Code Review

    • Looking back at one's code can show you just how far you have come. Sparked by a conversation between Ben Baldwin (@benbaldwin), Patrick, and Ellis, this week's episode is on code review and refactoring. Ben went into his past and furnished a set of code for us to refactor. In the spirit of things, neither of us looked closely at the code ahead of time; we recorded our initial reactions and the process of refactoring Ben's code into a function that could be applied to multiple datasets!
    • UseR Highlighted: Ben Baldwin
    • Original Tweet
    • Tweet Source Code
    • TidyX Source Code
  • Episode 38: Polar Plots

  • Episode 39: Imputing Missingness

    • This week we reach into our mailbag to answer a request from Eric Fletcher (@iamericfletcher) on imputing NAs. In this video we scrape 2013 draft data and use various techniques to impute missing times for the three-cone event. We also attempt to discuss Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) - but we decide at the end to leave it to the professionals.
    • Source Code
  • Episode 40: Inspiring Women and Plotly

  • Episode 41: Worm Charts with Alice Sweeting

    • Alice Sweeting (@alicesweeting) joins us as a guest explainer this week! We are very excited to have her on as she explains how she created a worm chart of a Super Netball game! She talks about common techniques she uses to process data, mixing base R with the tidyverse. Then we spend some time discussing Alice's background, current role, and advice for folks looking to get started in sports analytics or R programming in general.

    • UseR Highlighted: Alice Sweeting

    • Source Code

  • Episode 42: Highlighting Lines

  • Episode 43: Funnel Plots, Plotly, and Hockey

    • With no #TidyTuesday dataset this week, we continue working through plotly - this time using a tool known as a funnel plot.
    • Source Code
  • Episode 44: Transit Costs, steps, and Plotly Maps

  • Episode 45: NHL Pythagorean Wins and Regression

    • This week we reflect back on the past year and combine techniques from multiple episodes. We scrape multiple tables from the Hockey Reference website, use regular expressions to clean and organize the data, and use for loops to determine the optimal pythagorean win exponent. We visualize the data using several different techniques, like scatter and lollipop charts. We show some fun tools for regularizing values in linear regressions and how to predict and visualize the results.
    • Source Code
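The exponent search described above can be sketched like this, with toy goals-for/goals-against numbers standing in for the scraped Hockey Reference tables:

```r
# Pythagorean expectation sketch: win% ~ GF^k / (GF^k + GA^k).
# Toy goals-for/goals-against data stands in for the scraped NHL tables.
teams <- data.frame(
  gf  = c(250, 230, 210, 190),
  ga  = c(200, 220, 230, 240),
  wpc = c(0.65, 0.52, 0.45, 0.38)  # observed win percentage
)

# Grid search over candidate exponents for the lowest squared error
exponents <- seq(1, 4, by = 0.05)
sse <- sapply(exponents, function(k) {
  pred <- teams$gf^k / (teams$gf^k + teams$ga^k)
  sum((teams$wpc - pred)^2)
})
best_k <- exponents[which.min(sse)]
best_k
```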
  • Episode 46: Circle Plots, NHL Salaries, and Logistic Regression

  • Episode 47: NHL Win Probabilities and GT Tables

    • This week we play with a new technique for optimization: the optim function! We scrape the 2019-2020 NHL season to generate power rankings for every NHL team plus a home-ice edge. We can use this to predict each team's winning probability! We then combine that with season summary data to generate a pretty {gt} table!
    • Source Code
  • Episode 48: NBA Point Simulations

    • In this episode we show how to scrape the current NBA season's scores and then build a simple game simulator. Using {purrr} with some base R functions, we generate outputs and show how to simulate thousands of games to generate outcome predictions.
    • Source Code
  • Episode 49: MLB Batting Simulations

    • We continue looking at simulations this week, but this time for individual players. Using {Lahman}, we pull the 2019 MLB player batting stats and visualize them using histograms and density plots. Next, to generate confidence intervals around batting averages, we use rbinom() combined with techniques from the {tidyverse} to make simulation easy. Finally, we visualize the data using {gt} combined with {sparkline}.
    • Source Code
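The rbinom() idea can be sketched in a few lines (toy numbers, not the episode's {Lahman} data): simulate many seasons for a hitter and take quantiles of the simulated averages.

```r
# Simulating a hitter's season with rbinom() to get an interval
# around a .300 batting average over 500 at-bats (toy numbers).
set.seed(2019)
n_sims  <- 10000
sim_avg <- rbinom(n_sims, size = 500, prob = 0.300) / 500

# A 95% simulation interval for the observed batting average
quantile(sim_avg, c(0.025, 0.975))
```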
  • Episode 50: MLB Batting Simulations

    • Another MLB batting episode. This time we use the James-Stein estimator (paper below) to apply a shrinkage estimate to player batting averages and get a "true" estimate with the luck removed. Using {Lahman}, we pull the 2018 MLB player batting stats and explain how to implement the estimator. Next, we compare the estimates against the 2019 season. Finally, we visualize the data using {gt}, with header spans and cell styling. For the grand finale, we combine this {gt} table of batting averages with plots using {patchwork}!
    • Source Code
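A sketch of the James-Stein shrinkage step, using Efron & Morris-style toy averages rather than the {Lahman} data:

```r
# James-Stein shrinkage sketch for batting averages (Efron & Morris style).
# Toy averages over 45 at-bats stand in for the {Lahman} data.
avg   <- c(0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.244, 0.222)
n_ab  <- 45
p_bar <- mean(avg)

# Binomial sampling variance of a single average
sigma2 <- p_bar * (1 - p_bar) / n_ab

# Shrinkage factor: how far each average moves toward the grand mean
c_shrink <- 1 - (length(avg) - 3) * sigma2 / sum((avg - p_bar)^2)
js_avg   <- p_bar + c_shrink * (avg - p_bar)
round(js_avg, 3)
```

The smaller `c_shrink` is, the harder each raw average is pulled toward the grand mean.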
  • Episode 51: Deploying Models with Shiny

    • Sharing the results of a modeling effort is an important skill for any data scientist. However, just sharing the weight of each predictor is often not enough to get buy-in from stakeholders who are understandably skeptical of your results. Using the power of Shiny, you can show your stakeholders exactly how your model interprets and then predicts the results. In this episode, we use the {palmerpenguins} package with {randomForest} to build a model that predicts the species of a new penguin. With Shiny, we then deploy our model to let users record a new penguin's attributes and see whether the model thinks it is an Adelie, Chinstrap, or Gentoo! The output is a boxplot indicating the model's probability for each species given the inputs.
    • Source Code
  • Episode 52: Too Many Gentoo with Xaringan

    • There are too many Gentoo, your PI proclaims. This week's episode, Patrick and Ellis talk about how to use the {xaringan} package to produce reproducible HTML presentations using Rmarkdown syntax. We discuss how we looked at "raw" tech data and used summary statistics to compare against the gold-standard {palmerpenguins} package from Dr. Allison Horst and Dr. Alison Hill, with data from Dr. Kristen Gorman. We use last week's highly powerful machine learning model to generate predictions of species and build a confusion matrix of our data versus the predictions. Finally, we talk about the value of basing your presentation on Rmd and being able to update it at the click of a button.
    • Source Code
  • Episode 53: MLB Pitch Classification Introduction

    • This week we start a series on using machine learning to automate pitch classification. In this first episode, we discuss ways to start looking at your data and questions to formulate. We use hierarchical clustering a few different ways to start to see relationships between the different pitch types and the statistics that were captured around each pitch!
    • Source Code
  • Episode 54: MLB Pitch Classification 2 - KNN, Caret and UMAP

    • In the second episode on using machine learning to automate pitch classification from PitchF/X data, we apply the K-nearest-neighbors algorithm as our first attempt at classification. We start by using the results from our naive hierarchical clustering to select 4 groups and apply the KNN algorithm. We then look at how we could evaluate the performance of the model, both with total misclassification and within-class misclassification. Then we use {caret} to optimize for the best clustering and compare the results. Finally, we use UMAP to perform dimensionality reduction, visualizing multiple dimensions as two and viewing relationships within the clusters.
    • Source Code
  • Episode 55: MLB Pitch Classification 3 - Decision Trees, Random Forests, optimization

    • For the third episode in the series on using machine learning to automate pitch classification from PitchF/X data, we talk about decision trees and their famous variant: random forests. We start by discussing what a decision tree is and its value. We visualize the results and discuss the quality of the fit. Then we expand on decision trees using the random forest algorithm and discuss its performance. Finally, we use {caret} and {doParallel} to do a grid search for the optimal mtry, using parallel processes to speed up the search!
    • Source Code
  • Episode 56: MLB Pitch Classification 4 - XGBoost

    • We now turn to the famous XGBoost algorithm for the fourth episode in our series on using machine learning to automate pitch classification from PitchF/X data. We start by training with default parameters and observe some tricks to make training faster. Then we use {caret} and {doParallel} to do a grid search for optimal training settings and discuss the merits and disadvantages of using ever more complicated ML models.
    • Source Code
  • Episode 57: MLB Pitch Classification 5 - Naive Bayes Classification

    • We naively turn to Bayes... okay, I'm done. In this episode we use the Naive Bayes classifier from the {e1071} package to classify pitches from our PitchF/X data. We briefly discuss how this algorithm works and review its performance against the other tree-based algorithms we've used so far.
    • Source Code
  • Episode 58: MLB Pitch Classification 6 - TensorFlow

    • The next model type is one that has generated a lot of excitement over the last decade with the promise of "AI": deep learning. Using the {keras} package from RStudio, we attempt to train a model to automate pitch classification from PitchF/X data. We talk about the differences to consider when building a deep learning algorithm and the data prep that must be done. We finally review the results and talk a bit about black-box ML models.
    • Source Code
  • Episode 59: MLB Pitch Classification 7 - Class Imbalance and Model Evaluation Intro

    • Throughout this series, we've been attempting to predict pitch type using PitchF/X data. However, we have not directly addressed a major flaw in our data: class imbalance. The four-seam fastball makes up nearly 37% of our data! In this episode we apply a couple of techniques to help address the class imbalance and look at ways to evaluate our models' performance. We talk about the pros and cons to consider, and set up for the last episode of the series.
    • Source Code
  • Episode 60: MLB Pitch Classification 8 - Model Evaluation and Visualization

    • This week we apply everything we have learned over the last several weeks to pick the best model for our project. As a reminder, we are attempting to predict pitch type using a subset of PitchF/X data. We productionalize our evaluations by writing a series of functions that allow quick iteration across multiple input types and capture of information. Finally, we visualize the evaluations using two {gt} tables. Thank you all so much for joining us for this mini-series on ML models and being with us as we hit episode 60. It has been a wonderful ride!
    • Source Code
  • Episode 61: Data Cleaning - Regular Expressions

    • Okay, we've gotta say it - there is nothing "regular" about regular expressions. BUT that does not mean they are not an incredibly valuable tool in your programming toolbox. In this episode we go through how to apply regular expressions to a dataset and talk through some of the common tokens you might use along the way.
    • Source Code
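A couple of the common tokens discussed can be tried in base R; the strings here are made up for illustration:

```r
# A few common regex tokens applied with base R.
x <- c("LeBron James, 38 pts", "Jayson Tatum, 41 pts")

# \\d+ matches one or more digits; extract the point totals
pts <- regmatches(x, regexpr("\\d+", x))
as.integer(pts)   # 38 41

# ^([^,]+) captures everything before the first comma; extract names
sub("^([^,]+),.*$", "\\1", x)
```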
  • Episode 62: Data Cleaning - REGEX applied & stringr

    • This week we continue using regex, this time applying it to generate data for plots. Additionally, we discuss techniques such as grouping, and use the {stringr} package for its str_* variants of the base R regex functions.
    • Source Code
  • Episode 63: Data Cleaning - REGEX lookarounds & Player Gantt Charts

    • We lookaround with regex this week, showing an alternative to setting anchors in your regular expressions: lookarounds. We apply this to extracting player substitutions. Then we calculate the number of stints and their durations to create a player Gantt chart for Game 2 of the Eastern Conference Playoffs between the Miami Heat and Milwaukee Bucks.
    • Source Code
  • Episode 64: Data Cleaning - Ugly Excel Files Part 1

    • Ugly data. Ugly EXCEL data. That's pretty common to come across as a data scientist. People unfamiliar with how to format data are often the ones creating the excel files you work with. This week, Patrick and Ellis talk through some techniques to handle these files and turn them into usable data. Patrick wrote up this week's example, parsing through the data to generate a nice data.frame from the ugly excel example.
    • Source Code
  • Episode 65: Data Cleaning - Ugly Excel Files Part 2

    • This week Ellis works through the ugly excel file, writing out the code live as he goes, and explaining how to break up the parsing into nice, bite-size pieces and generalize them. Patrick is there asking questions and clarifying how things worked. At the end of the cast they end up with similar data.frames, ready to munge for final processing.
    • Source Code
  • Episode 66: Data Cleaning - Ugly Excel Files Part 3

    • Now that we have the excel file into a nice format, we go over the final pieces of processing to turn the incorrectly formatted fields into usable data. We talk about generating date objects, ifelse vs if_else, and have some fun!
    • Source Code
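The ifelse vs if_else distinction mentioned above comes up most often with dates - a quick sketch:

```r
# ifelse() vs dplyr::if_else(): the base version drops the Date class,
# while if_else() is type-strict and preserves it.
library(dplyr)

d      <- as.Date(c("2021-01-01", "2021-06-15"))
cutoff <- as.Date("2021-03-01")

# Base ifelse() strips attributes: you get bare day counts, not dates
base_res <- ifelse(d > cutoff, d, cutoff)

# dplyr's if_else() keeps the Date class (and errors on mixed types)
tidy_res <- if_else(d > cutoff, d, cutoff)

class(base_res)  # "numeric"
class(tidy_res)  # "Date"
```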
  • Episode 67: Data Cleaning - Viewer Submitted Excel File

    • For the first time in over a year and 65 episodes, Ellis and Patrick are in the same room! This week they work on a viewer-submitted excel file. After last week's episode, we put out a call to our viewers to submit the ugly data they see so we can try to help. GitHub user MikePrt submitted a file from the UK's Office for National Statistics (ONS) as an example. We extract the data and produce a simple plot.
    • Source Code
  • Episode 68: Data Cleaning - Ugly Excel Files Part 4 - Saving Outputs

    • We continue our series on data cleaning and discuss sharing your outputs. Patrick and Ellis go over a few output file formats and two different excel libraries that give you differing levels of control over the outputs.
    • Source Code
  • Episode 69: Modern Pentathlons with Mara Averick

    • Ellis and Patrick are joined today by Mara Averick, a Developer Advocate at RStudio. We conclude our series on messy excel data by talking through cleaning an excel file from the UIPM and reasoning out what the fields and scoring are. Then we talk about Mara's role, career history, and the advice she has for our viewers.
    • UseR Highlighted: Mara Averick
    • Source Code
  • Episode 70: Databases with {dplyr}

    • Making friends with your friendly database administrator is a great way to improve your effectiveness as a data scientist in your organization. But what do you do if you don't know any SQL? We present {dbplyr} by the folks at RStudio: easily connect to, interact with, and send queries to databases using familiar dplyr syntax and commands.
    • Source Code
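A minimal {dbplyr} sketch, using an in-memory SQLite database (assuming {RSQLite} is installed) rather than a production database:

```r
# dbplyr sketch: query an in-memory SQLite database with dplyr verbs.
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference; the dplyr verbs are translated to SQL
# and only run against the database when collect() is called
result <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
result
```

Calling `show_query()` instead of `collect()` prints the generated SQL, which is a handy way to learn what dbplyr is doing.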
  • Episode 71: Databases in R | Exploring Your Database with NBA data

    • Being handed a database without knowing its contents or where to start can be daunting. We talk about techniques for exploring it just like any other dataset: listing the tables in the database, getting the column names, and writing SQL to pull the head of a table.
    • Source Code
  • Episode 72: Databases in R | Shiny and Databases

    • The fastest way for a data scientist to multiply their impact is to let their customers do the analysis themselves (with guardrails, of course). Shiny provides a great user interface; combining this with some basic queries your clients may want improves response time and lets them search to their heart's content. This week we show a simple way to add interactivity with your database, using {shiny} to query each team's mean point differential at home across the 2001-2002 seasons.
    • Source Code
  • Episode 73: Databases in R | Shiny,Databases, and Reactive Polling

    • Now that we have a Shiny app that lets users access and interact with the data in our database, how do we make sure the user interface is showing the most up-to-date information for selection? This is done through reactive polling - a timed check that looks for updates to the database and refreshes the UI selection interface accordingly. We discuss the benefits and how to use the reactivePoll function combined with an observeEvent function to really supercharge our Shiny app!
    • Source Code
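The reactivePoll/observeEvent pairing might be wired up like this sketch; the `con` connection and `comments` table are placeholders, not the episode's actual setup:

```r
# reactivePoll sketch: check the database every 5 seconds and re-query
# only when the cheap check value changes.
library(shiny)

server <- function(input, output, session) {
  comments <- reactivePoll(
    intervalMillis = 5000,
    session = session,
    # Cheap check: has the row count changed? (con/table are placeholders)
    checkFunc = function() DBI::dbGetQuery(con, "SELECT COUNT(*) FROM comments"),
    # Expensive read: only runs when checkFunc's value changes
    valueFunc = function() DBI::dbGetQuery(con, "SELECT * FROM comments")
  )

  # Refresh the selection UI whenever new rows appear
  observeEvent(comments(), {
    updateSelectInput(session, "comment_id", choices = comments()$id)
  })
}
```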
  • Episode 74: Databases with R | Joins in SQL vs Local

    • Continuing the SQL/database saga, we look at joins. We scrape a bunch of play-by-play information and game info and generate a database from it. We then compare the speed of joining tables locally versus within the SQL database!
    • Source Code
  • Episode 75: Databases with R | Joins, databases, and commits in Shiny

    • Now that we have a database full of data and a Shiny app to play with it, how do we capture and share information across our users via the database? In this episode we create a sample database filled with play-by-play NBA data and a Shiny app that allows a coach or SME to review the data and add comments as they go. Then they can decide to commit their thoughts, saving them for the future!
    • Source Code
  • Episode 76: Databases with R | Polling databases in Shiny

    • In Episode 75 we introduced the idea of committing changes from a Shiny app to a database. But what about scenarios with multiple users? Ellis and Patrick explore an idea for polling the database and pushing updates that were committed to the database into the active views of the other users. We use reactive polling, as introduced in Episode 73, along with updating reactiveValues.
    • Source Code
  • Episode 77: Tidymodels - LM

    • tidymodels is an ecosystem of packages developed by RStudio (Max Kuhn and Julia Silge, to name a few) to help folks apply good modeling practices, from cleaned data through to a fully productionalized model. We are going to step through and learn how to apply tidymodels together. The first episode applies a simple linear model and compares it with the base R method!
    • Source Code
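A sketch of that comparison, fitting the same model with base lm() and with {parsnip} (one piece of the tidymodels ecosystem), using mtcars as a stand-in dataset:

```r
# Fitting the same linear model with base R and with {parsnip}.
library(parsnip)

# Base R
base_fit <- lm(mpg ~ wt + hp, data = mtcars)

# tidymodels: specify the model and engine, then fit it
tm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# Same coefficients either way; the tidymodels object wraps the lm fit
coef(base_fit)
coef(tm_fit$fit)
```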
  • Episode 78: Tidymodels - Splits and Recipes

    • In the second episode of our tidymodels series, we discuss how to set up your test/train splits as well as data preprocessing using the {recipes} package in conjunction with {workflows}! This smooths out and applies good practices simply and effectively, making data prep for modeling a breeze.
    • Source Code
  • Episode 79: Tidymodels - Cross-validation and Metrics

    • In the third episode on tidymodels, we continue our data prep and model training by exploring cross-validation and metric evaluation. Ellis and Patrick show how to set up a 5-fold cross-validation set on your training split and fit a tidymodels workflow! We finish by showing how to display and extract model evaluation metrics.
    • Source Code
  • Episode 80: Tidymodels - Decision Trees and Tuning

    • In the fourth episode on tidymodels, we sort out how to tune a model's parameters using the {tune} package. We set up a grid to train across and select the best model based on model metrics. We then retrain this model on the full training set and evaluate its performance against the final test set.
    • Source Code
  • Episode 81: Tidymodels - Logistic Regression with GLM

    • This week we look at how to perform a logistic regression using the tidymodels framework. In the fifth episode on tidymodels, we show how to set up a logistic regression using GLM, perform a custom test/train split on the data, and calculate metrics such as ROC AUC, kappa, and accuracy. We visualize the performance and evaluate how well our model performed.
    • Source Code
  • Episode 82: Tidymodels - Logistic Regression with GLM

    • Continuing our look at classification models via tidymodels, this week we tackle a multiclass problem using random forests. We show how to tune your model, extract the optimal workflow, train it on your full training set, and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 83: Tidymodels - Naive Bayes of Penguins

    • Naive Bayes is the model we apply in this week's tidymodels series. We look at how to perform multiclass classification using Naive Bayes via the {discrim} package from tidymodels, with the {klaR} package supplying the engine. We show how to evaluate your model using 5-fold cross-validation, then train it on your full training set and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 84: Tidymodels - Workflow Sets and model selection

    • Tidymodels makes it simple to try a multitude of model types by separating the preprocessing from the model specification and creating a standardized way to apply different models. Workflow sets take this a step further, letting you train and compare these models at the same time, just like tuning. Using data from Kaggle, we fit three model types, select the best workflow, train it on our full training set, and compare it against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
  • Episode 85: Tidymodels - Tuning Workflow Sets

    • In this episode we show how you can use workflow sets along with tuning to create optimal models. Using wine data from Kaggle, we look at two different recipes and three different models requiring different levels of tuning. We select the best workflow and optimal tuned parameters, train on our full training set, and compare against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
  • Episode 86: Tidymodels - Julia Silge and Tune Racing

    • This week we are thrilled to have Dr. Julia Silge from RStudio join us to talk about tidymodels. Julia is one of the software engineers we have to thank for tidymodels, the ecosystem of packages that helps us perform our data preprocessing and modeling steps with ease! In this episode we have a short interview with Julia where she talks a bit about her background, her current role, and tidymodels. We then jump into explaining some code she wrote and shared in one of her own screencasts on training an XGBoost model to predict home runs. One unique part is that Julia applies tune racing, making the tuning run faster using some clever comparisons that ensure only the best models continue to be trained across all cross-validation folds. Patrick and Ellis ask questions throughout on how the code works and Julia's philosophies.
    • Julia Silge's Blog Post on Racing Methods
  • Episode 87: Advent of Code Day 6 - Efficient Problem Solving

    • This week we take a look at a problem from the Advent of Code, specifically Day 6. Advent of Code is a fun time of year when the data science community comes together to solve a series of 25 problems posed by Eric Wastl. The goal is to see who can solve the problems quickly and efficiently. It also provides an opportunity to work on problems unlike most of what you see in your day-to-day job. We work on finding an efficient solution to Day 6 - Lanternfish. The fish reproduce at a standard rate, but calculating how many exist after a certain number of days is trivial for a small number of days and quickly becomes too large for your computer if you approach the problem the wrong way!
    • Source Code
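The efficient approach is to track counts per timer value instead of individual fish - a sketch of the idea (not the episode's code):

```r
# Lanternfish sketch: instead of tracking every fish, track how many
# fish have each timer value (0-8), so each day is a 9-element update.
simulate_fish <- function(timers, days) {
  counts <- tabulate(timers + 1, nbins = 9)  # counts[i] = fish with timer i-1
  for (d in seq_len(days)) {
    spawning <- counts[1]                # fish at timer 0 spawn today
    counts   <- c(counts[-1], spawning)  # everyone ticks down; newborns at 8
    counts[7] <- counts[7] + spawning    # parents reset to timer 6
  }
  sum(counts)
}

# The Advent of Code example: 5 fish grow to 26 fish after 18 days
simulate_fish(c(3, 4, 3, 1, 2), 18)
```

Because the state is just nine counts, 256 days is as cheap as 18.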
  • Episode 88: Advent of Code Day 7 - For Loops and Lookup Vectors

    • We work on finding an efficient solution to Advent of Code Day 7 - Whales. We need to find the most efficient position to align a series of crab submarines, under several different constraints, in order to escape. We discuss how to set up an efficient for loop and create a lookup vector!
    • Source Code
  • Episode 89: Tables for Research

    • We reach into our mailbag this week to answer a question from one of our viewers. In a past episode we talked about how you can extract coefficients from your fitted models using the {broom} package. But how would you turn that into a publication-ready table? In this episode we use {gt} by Rich Iannone to convert our coefficients data.frame into a nice, publication-ready table!
    • Source Code
  • Episode 90: Rmarkdown Guide - RMD Formatting

    • Rmarkdown is an incredible tool, widely used by R analysts to combine prose and code into a beautiful symphony of reproducible outputs and information sharing. However, some of the setup can be confusing for a newcomer, so we are starting a series covering the knowledge users need to get going on their Rmarkdown journey. This week we start with the bones and structure of Rmarkdown documents: markdown syntax, setting up your text to format as expected, and adding code chunks!
    • Source Code
  • Episode 91: Rmarkdown Guide - Code Chunk Options & Figure Options

    • Continuing where we left off in our Rmarkdown series, we talk through common chunk options that modify how your code and its outputs appear in the resulting document, and whether the code even gets run at all. Then we cover common chunk options for figure outputs that are incredibly useful! Finally, we start an Rmarkdown report to demonstrate how we would use these options in a real report.
    • Source Code
  • Episode 92: Rmarkdown Guide - Formatting Tabs for HTML outputs

    • This week's episode features a trick for making tabsets in your Rmarkdown html outputs, as well as some advice on organizing your code within an Rmarkdown document. Using the palmerpenguins dataset, we show how to make your code chunks super easy to update and what to think about when building your output.
    • Source Code
  • Episode 93: Rmarkdown Guide - YAML Header

    • The YAML header controls the macro-level behaviors of your Rmarkdown document: the output type, title, author, date, custom styling, table of contents, and more. In this episode we cover the basic YAML header contents and how to add this customization to your Rmarkdown documents. We also show two example outputs, one for HTML and one for Word.
    • Source Code
  • Episode 94: Rmarkdown Guide - Parameterized Reports

    • Parameterized reports allow data scientists to multiply their impact by reducing the work needed to produce new reports. Using the YAML header, a data scientist can set parameters that change based on user inputs to create customized reports at the click of a button. In this episode we go over the basics of adding a parameter, how to set the input either interactively or programmatically, and how to use the parameter in your code. Then we create a custom example, pulling NBA basketball data for multiple years and displaying a team of interest.
    • Source Code
  • Episode 95: Rmarkdown Guide - Interactive Reports with htmlwidgets

    • So far in our series on Rmarkdown, we have covered ways to generate reports, sometimes running them dynamically with parameters. This week we cover how to generate HTML reports with embedded interactivity from htmlwidgets. These widgets let users inspect and explore the data embedded in the report. This technique is used a lot, and there are a number of htmlwidgets in the R ecosystem. In this episode we demonstrate how to explore baseball data using interactive plots from plotly and datatables from the DT package.
    • Source Code
  • Episode 96: Rmarkdown Guide - ASIS Outputs

    • This week we discuss a fun Rmarkdown chunk option: results. This little argument can have a big impact on the look and output of our Rmarkdown reports, and gives the developer a bunch of power to change the behavior and content of the report based on the results of the code. It can also make what would otherwise be a tedious task in Rmarkdown super fast!
    • Source Code
  • Episode 97: Sampling, Simulation, and Intro to Bayes - Base R Distributions

    • A powerful tool in the R toolbox is the set of distribution functions included in base R. These functions allow data scientists to explore a variety of potential distributions to simulate data and explore possibilities. This week we go over the meaning of the p, q, d, and r prefixes of the distribution functions and work through examples of how to use them using baseball data from the {Lahman} package.
    • Source Code
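    As a quick illustration of the prefixes discussed above, here is a minimal sketch using the normal distribution; the same p/q/d/r pattern applies to the binomial, Poisson, gamma, and other base R distributions:

    ```r
    # d* = density, p* = cumulative probability, q* = quantile, r* = random draws
    dnorm(0)        # density of the standard normal at 0 (~0.399)
    pnorm(1.96)     # P(Z <= 1.96) (~0.975)
    qnorm(0.975)    # the quantile with 97.5% of the distribution below it (~1.96)

    set.seed(42)
    rnorm(5, mean = 100, sd = 15)  # five random draws from N(100, 15)
    ```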
  • Episode 98: Sampling, Simluation, and Intro to Bayes - Sampling and Bootstraps

    • sample is a fun and useful base R function that selects a sample of n values from a vector at random. This is the foundation for setting up bootstrap resamples of existing datasets. This week we go through the differences between simulation and resampling, and walk through some simple resampling setups that will be the foundation for the next few episodes.
    • Source Code
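    A minimal sketch of the resampling idea described above, using a made-up vector of batting averages (illustrative data, not from the episode):

    ```r
    set.seed(2021)
    batting_avg <- c(0.251, 0.301, 0.275, 0.312, 0.264, 0.289, 0.243, 0.298)

    # one bootstrap resample: draw n values WITH replacement
    sample(batting_avg, replace = TRUE)

    # repeat many times to build a bootstrap distribution of the mean
    boot_means <- replicate(1000, mean(sample(batting_avg, replace = TRUE)))
    quantile(boot_means, probs = c(0.025, 0.975))  # 95% bootstrap interval
    ```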
  • Episode 99: Sampling, Simulation, and Intro to Bayes - Basic Bayes

    • Applying what we have learned these last few weeks, we are ready for Bayesian statistics and Bayes' theorem! This week we work through the concept behind Bayes and attempt to talk through it in more approachable terms. We then apply the theorem to a few different cases to help solidify our understanding.
    • Source Code
  • Episode 100: Sampling, Simulation, and Intro to Bayes - Beta Bayes

    • Continuing our series on Bayes, this week we learn about the conjugate prior of the binomial distribution: the beta distribution! Applying what we learned about Bayes' theorem last week, we work through an example evaluating the performance of a basketball player in a drill where the average participant hits 65% of their shots, and this player hit 16 of 20. We discuss how to calculate credible intervals, and update our analysis as we get more data on this player!
    • Source Code
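    The beta-binomial update described above can be sketched in a few lines. The prior parameters here (a Beta roughly centered at the 65% drill average) are illustrative assumptions, not the exact values used in the episode:

    ```r
    # assumed prior: Beta(13, 7), which has mean 13 / (13 + 7) = 0.65
    a <- 13
    b <- 7

    # observe 16 makes in 20 attempts; conjugacy gives Beta(a + 16, b + 4)
    post_a <- a + 16
    post_b <- b + (20 - 16)

    post_a / (post_a + post_b)               # posterior mean shooting percentage
    qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval
    ```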
  • Episode 101: Sampling, Simulation, and Intro to Bayes - Poisson/Gamma

    • Ever wonder how you could estimate the probability of a rate? Enter the Poisson distribution. Armed with "lambda", representing both the mean and variance of the distribution, we are able to simulate and calculate probabilities for the number of occurrences, such as points scored in a game by a player. However, to apply Bayes' theorem and get credible intervals, we need a continuous prior; enter the conjugate prior, the gamma distribution. We use it to perform Bayesian updating and calculate credible intervals that give us insight into a new player on our pretend basketball team.
    • Source Code
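    A sketch of the gamma-Poisson update described above; the prior parameters and the point totals are illustrative assumptions, not the episode's actual numbers:

    ```r
    # assumed prior on the scoring rate: Gamma(shape = 20, rate = 1),
    # i.e. a prior belief of about 20 points per game
    shape0 <- 20
    rate0  <- 1

    # points scored by the new player over five games (made-up data)
    pts <- c(18, 22, 25, 17, 21)

    # conjugate update: shape gains the total count, rate gains the sample size
    shape1 <- shape0 + sum(pts)
    rate1  <- rate0 + length(pts)

    shape1 / rate1                           # posterior mean points per game
    qgamma(c(0.025, 0.975), shape1, rate1)   # 95% credible interval for the rate
    ```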
  • Episode 102: Sampling, Simulation, and Intro to Bayes - Normal-Normal Conjugate

    • This week we take a look at the most common, but also potentially the most confusing, distribution for our purposes: the normal distribution. We discuss how a Bayesian looks at and uses a normal distribution, where the mean and standard deviation each have their own distribution. A simplifying assumption lets us work through a simple problem this week, where we determine the probability of a basketball player being above average in a made-up efficiency metric, and we demonstrate Bayesian updating as we gain new information on the player.

    • Source Code

  • Episode 103: Sampling, Simulation, and Intro to Bayes - Normal-Gibbs Sampler

    • In the final episode of this series on Bayes, we apply learnings from several prior episodes to a new technique: Gibbs sampling. This tool is used when multiple parameters are being evaluated, each with its own distribution. We continue with last week's example, but demonstrate how a Gibbs sampler can generate a posterior distribution without having fixed the mean and standard deviation of a player's efficiency metric. We also show a simple function that wraps what we have learned in a simple API.

    • Source Code

  • Episode 104: R Classes and Objects - dates and POSIXt

    • This week we go on a date. Well, we talk about a date. Okay, okay, we talk about how to look at and use date and datetime objects in R. We start with a high-level overview of the object systems that exist in R, and then reach into our mailbag to answer a question about lubridate. We talk about the fundamentals of Date and POSIXt objects and ways to use them. Then we go over some of the difficulties of their behavior and how the {lubridate} package makes dealing with dates much simpler.

    • Source Code

  • Episode 105: R Classes and Objects - Base

    • In the past 104 episodes, we realized we never spent time going over the base object types in R, how to build them up, and how to access them. This is something we have done in every episode, but we decided this was the week to go over the mechanics of how it all works. We use four base object types (logical, integer, numeric, and character) and show you how to build vectors, matrices, and data.frames. We also go over how we think about objects.

    • Source Code

  • Episode 106: R Classes and Objects - Factors

    • Until R 4.0.0, "why is stringsAsFactors TRUE by default?" was a common question from many a new R programmer. In this episode we discuss the mysterious factor object in base R. Why does it exist, how do you use it, and how do you work with it are the questions we attempt to answer. We demonstrate converting vectors to and from factors, how factors impact regression models, and how to use factors to generate plots!

    • Source Code
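    A minimal sketch of the factor behaviors discussed above:

    ```r
    x <- c("low", "high", "med", "low")

    # by default levels are alphabetical; set them explicitly to control order
    f <- factor(x, levels = c("low", "med", "high"))
    levels(f)      # "low" "med" "high"
    as.integer(f)  # the underlying integer codes: 1 3 2 1

    # relevel() changes the reference level, which matters in regression models
    relevel(f, ref = "med")
    ```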

  • Episode 107: R Classes and Objects - Lists, Part 1

    • Listy, list, lists. This episode we talk about one of our favorite, most flexible objects in R: the list. These objects can do almost anything, because they just don't care. Ellis and Patrick talk about how to create lists, how they can nest and contain different object types, extracting their contents, and iterating over them. They discuss the {purrr} package and its valuable map family of functions, comparing them to some of the apply family of functions.

    • Source Code
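    A small sketch of the list creation, extraction, and iteration ideas above (purrr assumed installed):

    ```r
    library(purrr)

    # lists can nest and hold different types
    lst <- list(nums = 1:5, chars = letters[1:3], nested = list(flag = TRUE))

    lst$nums            # extract an element by name
    lst[["chars"]][2]   # "b": [[ ]] reaches the element, [ ] would keep a list

    # iterate: base apply family vs purrr's map family
    lapply(lst, length)
    map_int(lst, length)  # like lapply, but type-stable (always returns integers)
    ```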

  • Episode 108: R Classes and Objects - Lists, Part 2

    • Listy, list, lists. AGAIN. This episode we continue our talk about lists. Last week we showed some methods to create and work with lists, and this week we show a variety of ways that lists can be used. We demonstrate summary statistics gathering, recording model results, and even looping over a list to generate a PDF report!

    • Source Code

  • Episode 109: R Classes and Objects - Making an S3 Object, Part 1

    • So far we have discussed the EXISTING objects included in base R. But our viewers may remember mention of additional object systems: S3, S4, RC, and R6. In this episode we introduce the idea of making your own object in the S3 object system. Ever wonder how a tibble was made and how so many functions "just work" with it? Here we start to give you some insight into this idea by creating our own object and its own print method. Then we demo how to write a function to serve as a constructor for that object!

    • Source Code
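    The constructor-plus-print-method pattern described above can be sketched like this; the class name and fields are hypothetical, not necessarily those used in the episode:

    ```r
    # constructor: validate inputs, then attach a class attribute
    new_player <- function(name, points) {
      stopifnot(is.character(name), is.numeric(points))
      structure(list(name = name, points = points), class = "player")
    }

    # an S3 method is just a function named generic.class
    print.player <- function(x, ...) {
      cat("Player:", x$name, "- avg points:", mean(x$points), "\n")
      invisible(x)
    }

    p <- new_player("A. Hooper", c(12, 18, 15))
    p  # auto-printing dispatches to print.player
    ```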

  • Episode 110: R Classes and Objects - Making an S3 Object - Part 2 - S3 Tournament

    • We extend the idea of creating our own objects this week by demonstrating "S3 in practice". We pretend to be a data scientist for a local sports betting company. The season has just ended for a local sports league, and we want to predict who will win the whole enchilada. First we need to sort out how to simulate a single game. We create objects representing teams, and a series of functions to predict team performance and, eventually, a game winner!

    • Source Code

  • Episode 111: Nate Latshaw, UFC Data, and data.table

    • This week we are joined by the one and only Nate Latshaw. Nate is a software engineer and open source contributor, making amazing visualizations of UFC data in R. Some of Nate's work includes a complex shiny app with a lot of different ways to explore UFC fighter data. This week we are walked through how some of the visualizations are made, get a quick introduction to data.table, and get an inside look at how Nate creates such amazing visualizations. After the code, we talk about Nate's career, his experience in the open source community, and advice for those looking to start their own open source work!

    • Nate can be found at @NateLatshaw

    • Source Code

  • Episode 112: R Classes and Objects - Making an S3 Object - Part 3 - S3 Tournament

    • This is the final episode on creating and applying S3 objects. We discuss comments we received from viewers asking why we used S3 objects rather than a named list, and then get down to business completing our single-elimination tournament. We create an object to represent a matchup, then abstract up to a tournament round, and finally the full tournament.

    • Source Code

  • TidyX Episode 113 | R Classes and Objects - Making an S4 Object - Part 1

    • We move on to the next, and possibly one of the more divisive (is that possible?), object systems in R: the S4 system. This system takes the free-wheeling S3 object class and says no more. Everything must be clearly defined up front, from the contents of your object to its methods. We discuss some basics of why we use objects before getting into the nitty gritty of creating a few objects in the S4 system. We create a "print" method to demonstrate how to create a custom method, and show how to make a constructor.

    • Source Code
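    A minimal S4 sketch of the ideas above: slots declared up front, a show method (S4's analogue of print), and a small constructor. The class and slot names are hypothetical:

    ```r
    # everything is declared up front: the class and the types of its slots
    setClass("Team", slots = c(name = "character", rating = "numeric"))

    # define a show() method so our object prints nicely
    setMethod("show", "Team", function(object) {
      cat("Team", object@name, "with rating", object@rating, "\n")
    })

    # a constructor wrapping new() keeps object creation tidy
    Team <- function(name, rating) new("Team", name = name, rating = rating)

    Team("Seattle", 1520)  # dispatches to our show() method
    ```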

  • TidyX Episode 114 | camcorder R package

    • We take a step away from R objects to talk about a package Ellis has been developing for the past two years: camcorder. Ellis talks about why he wrote the package, the ideas behind it, and why folks might find value in it. He walks through an example of how to use the package, and gives a few call outs to folks that have supported the project throughout its two years.

    • Source Code

  • TidyX Episode 115 | R Classes and Objects - Making an S4 Object, Part 2 - S4 Tournament

    • We finally come back to S4 objects this week and talk about how one might use an S4 object IRL. We look back at what we did for S3 objects and decide to use the same context, but this time solve the problem in S4 instead of S3. We talk about creating new S4 generics and methods, and incorporate some viewer suggestions to create an object holding the results of a simulated game! We also open with a quick tangent about an amusing thread by Danielle Navarro (https://twitter.com/djnavarro/status/1565515145488797696) on S3 chaos.

    • Source Code

  • TidyX Episode 116 | R Classes and Objects - Making an S4 Object, Part 3 - S4 Tournament

    • This week we close out our S4 discussion by finalizing our code to simulate tournaments. Last week we built up simulated games; now we simulate tournament rounds, create new S4 classes and methods, and ultimately simulate a tournament. We update how we call the likely winner of a tournament by simulating the tournament 1000 times.

    • Source Code

  • TidyX Episode 117 | Creating Participant IDs

    • Ever wonder how you can use tidyverse tools to create unique identifiers for your experiment records? Wonder no longer. This week we show you how to use the lesser-known cur_group_id() function to get a group id number to serve as a participant ID. Then we demonstrate how you can also use joins, and finally discuss creating an index for grouping observations of the same participant using integer division!

    • Source Code
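    A small sketch of the cur_group_id() approach described above (dplyr assumed; the data are made up):

    ```r
    library(dplyr)

    obs <- tibble(
      name  = c("Ann", "Ann", "Bo", "Bo", "Cy"),
      dob   = c("1990-01-01", "1990-01-01", "1985-06-15", "1985-06-15", "2000-12-31"),
      value = c(10, 12, 9, 11, 14)
    )

    obs %>%
      group_by(name, dob) %>%
      mutate(participant_id = cur_group_id()) %>%  # one integer id per group
      ungroup()
    ```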

  • TidyX Episode 118 | Windowing Functions with {zoo} and tidyverse

    • What technology lets you see through a wall? Windows. This episode we take a look at the ever-useful tidyverse and how we can calculate values across rolling windows. We celebrate Albert Pujols hitting 700 career home runs by looking at his career home runs and working through examples of different windowing calculations. We show how these calculations can be used in your visualizations to add context.

    • Source Code
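    A sketch of one windowing calculation from the toolbox above, using {zoo}'s rollmean inside a dplyr pipeline (the home run counts are made up, not Pujols's actual totals):

    ```r
    library(dplyr)
    library(zoo)

    hr <- tibble(
      season    = 2001:2008,
      home_runs = c(37, 34, 43, 46, 41, 49, 32, 37)
    )

    hr %>%
      mutate(
        # trailing 3-season rolling average; NA until the window fills
        rolling_avg  = rollmean(home_runs, k = 3, fill = NA, align = "right"),
        career_total = cumsum(home_runs)
      )
    ```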

  • TidyX Episode 119 | Slice n' Dicing data with tidyverse

    • This week we look at a common function we use to select random subsets of data in the tidyverse: slice_sample. This function is the successor to sample_n and sample_frac, and allows us to quickly and easily grab n rows or a proportion of the data in a single line of code. We go over a few different arguments and setups people might use with these functions!

    • Source Code
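    The slice_sample() setups mentioned above, sketched on the built-in mtcars dataset:

    ```r
    library(dplyr)

    mtcars %>% slice_sample(n = 3)        # three random rows
    mtcars %>% slice_sample(prop = 0.25)  # a random 25% of the rows

    # bootstrap-style sampling, with replacement
    mtcars %>% slice_sample(n = 40, replace = TRUE)

    # grouped sampling: n rows per group
    mtcars %>% group_by(cyl) %>% slice_sample(n = 2)
    ```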

  • TidyX Episode 120 | Working with columns in Tidyverse

    • Selecting, renaming, and moving columns around is an incredibly common task for data scientists. So much so that there are loads of little helpers embedded in the tidyverse to improve quality of life. This episode we highlight some of these helpers, such as starts_with, ends_with, matches, and where, along with super important functions such as relocate and rename to move columns around and rename them, respectively. Finally, we close by going over the differences between the dplyr::pull function and the purrr::pluck function.

    • Source Code

  • TidyX Episode 121 | Tell me what you want - user submitted data

    • This week we get into some data engineering problems provided by a viewer! Ellis and Patrick are given some wide data containing simulated observations of a few patients after surgery. Our job is to turn this into useful long data based on what we were provided. Using tools from the last few weeks, we demonstrate how to use mutate, relocate, rename, pivot_longer, and pivot_wider. We show two approaches, one using more advanced regex and pivoting tools, to make the data useful.

    • Source Code

  • TidyX Episode 122 | Event based data and filtering

    • Event-based time series data is a super common type of data. But it does come with some unique challenges. This week we talk through some techniques a person might use to explore and filter this data into something more useful. We simulate event data where three participants have N observations and at any one of these observations an event may occur. We calculate number of events, time between events, how to get n observations post each event, and how to grab observations from two named events!

    • Source Code

  • TidyX Episode 123 | Criss Cross Apple Sauce - Crossing in Tidyverse

    • Crossing vectors and dataframes to generate new data or compare existing data is a very common practice in data analysis. Whether generating values to allow you to grid search or comparing values, there are helpers in R to make this process much easier. The tidyr crossing function and its familiars make this process a piece of cake. We explore the behaviors of these functions and give an example of how they can be useful!

    • Source Code
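    A small sketch of crossing() for building grid-search-style combinations, as described above (the parameter names are illustrative):

    ```r
    library(tidyr)

    # every combination of the supplied values: 3 x 2 = 6 rows
    params <- crossing(mtry = c(2, 4, 6), ntree = c(100, 500))
    params
    nrow(params)  # 6

    # crossing() also de-duplicates its inputs before combining
    crossing(x = c(1, 1, 2), y = c("a", "b"))  # 2 x 2 = 4 rows
    ```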

  • TidyX Episode 124 | Combining Multiple Conditions

    • This episode we work through a problem that was submitted by a colleague of Patrick: "I have multiple different potential values that I want to report based on a reference value. What would be a good way to combine them? case_when and ifelse don't seem to be doing it". We walk through the scenario, explain why case_when and ifelse fail, and provide a few solutions!

    • Source Code

  • TidyX Episode 125 | Combining Multiple Conditions, Followup

    • We reach into our mailbag to answer questions submitted by our viewers from our last episode. We go over a suggestion from @datadavidz to use the enframe function, explain how to un-rowwise your tibble, and give a solution to a similar problem submitted by Jeff Rothschild!

    • Source Code

  • TidyX Episode 126 | Keeping duplicates on pivoting

    • This week we pick up a problem that you too may have faced - pivoting your data and not getting the expected format due to some unexpected content in your data. This week we go through an example from Patrick, where we want to pivot values and keep the duplicated values independent. We work through a few different approaches to explain the thought process and how you too can preserve duplicates on pivot_wider.

    • Source Code

  • TidyX Episode 127 | Fuzzy Wuzzy Joiny Tools

    • How do you match two datasets that have ever-so-slightly different spellings for the values you want to match on? In comes fuzzy matching! This week we pick up a question from one of our patreon patrons on how you can match the names of different sports ball players across multiple sources! We generate a simple example using a "name bank" (reference dataset) along with some simulated scraped data and show you two ways to do the matching. Ellis shows us how to use agrep/agrepl from base R, and Patrick walks through an example with the {fuzzyjoin} package!

    • Source Code
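    A quick sketch of the base R side of this, using agrep()/agrepl() for approximate matching within an edit distance (the names and misspellings are made up):

    ```r
    name_bank <- c("Stephen Curry", "LeBron James", "Nikola Jokic")
    scraped   <- c("Steph Curry", "Lebron James", "N. Jokic")

    # agrep() returns matches within max.distance (a fraction of pattern length)
    agrep("Lebron James", name_bank, max.distance = 0.2, value = TRUE)

    # agrepl() returns a logical vector, handy inside dplyr::filter()
    agrepl("Steph Curry", name_bank, max.distance = 0.3)
    ```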

  • TidyX Episode 128 | Data formats as data - AOC Day 1

    • We solve Day 1 of Advent of Code in two ways this week. Ellis builds his approach in base R, applying a loop and pre-allocating a vector, while Patrick reads the data in as a data.frame and applies tidyverse functions to come to the same conclusions. We discuss how the format of the data can itself be informative, and how you should approach processing when that matters.

    • Source Code

  • TidyX Episode 129 | Generating Snowflakes

  • TidyX Episode 130 | Independent Interactive Reports with Plotly

    • Ellis and Patrick got a question from a viewer asking how they can share interactive reports with their stakeholders without using shiny! Well, the answer is right in front of us in the use of Rmarkdown to generate html reports combined with the power of htmlwidgets from plotly. We generate a report that can be shared through a single file, that provides some fun interactivity to look at baseball batting averages.
    • Source Code
  • TidyX Episode 131 | Player Selection in Shiny

    • This week we work on a problem most sports scientists have likely dealt with: how to select a player by name when there are multiple players with the same name! We show two ways, first using selectInput and creating unique records for each player in the selection choices, and second using DT and the ability datatables have to act as reactive inputs when clicked.
    • Source Code
  • TidyX Episode 132 | Fuzzy Matching Shiny

    • Expanding on what was done in episode 127, and taking the theme from the last few episodes on shiny, we demonstrate how you too can create a shiny app that empowers your non-programmer team to perform their own fuzzy matching. We use a collection of different techniques including uiOutput, datatables, and a download handler!
    • Source Code
  • TidyX Episode 133 | Intro to Flexdashboard - Flexing your Dashboard

    • Somehow, for 132 episodes, we have not done a flexdashboard! This changes now. A flexdashboard is an advanced Rmarkdown document that allows you to create serverless dashboards. Nicely format and display your content for your stakeholders in interactive websites, and move away from building them manually or in Excel.
    • Source Code
  • TidyX Episode 134 | Conditional Styling with DT

    • DT gives users a lot of power to quickly make interactive tables in R. However, that is not its only superpower. With a number of formatting functions to style content, covering both visual display and string formatting, there are many options for a power user. We go through the basics and some advanced skills like formatting based on another column or styling an entire row.
    • Source Code
  • TidyX Episode 135 | Github cron jobs

    • This week, we discuss using cron jobs in Github to automate the process of scraping webpages at a set cadence (e.g., every morning at 6am).
    • Source Code
  • TidyX Episode 136 | Fuzzy Joins on Dates

    • Sometimes. That's the thing. Some Times. We answer a viewer question extending a prior episode: how do I join participant sample data to the closest date of an event within N days? We work through the problem solving and try to give you the tools to solve this problem too!
    • Source Code
  • TidyX Episode 137 | Magically Multiplying Tabbed Reports

    • How can you easily generate a report with the same number of tabs as there are players on your team? This week we answer a viewer-submitted question on how you can build up tabbed reports without manually writing each tab. We show how to take advantage of the results="asis" chunk option in rmarkdown to make your reports quickly and easily!
    • Source Code
  • TidyX Episode 138 | Interactive Magically Multiplying Tabbed Reports

    • Last week we answered how to easily generate a report with the same number of tabs as there are players on your team, but we got a great question from a viewer: these tabs are great, but how do I add plotly or other interactive content to them? Just converting the histogram into a plotly using ggplotly fails. This is where the concept of child documents comes in. We give a brief introduction to child Rmarkdown documents and their value, then show how you can make your magically multiplying tabbed report interactive!
    • Source Code
  • TidyX Episode 139 | Normalizing Z-Scores, pitfalls and generalizing

    • Training ML models often requires users to normalize their inputs, and z-scores are a powerful tool for this. But how do you normalize when you want to use only a subset of your dataset to compute the normalization, and then apply it to the next set for prediction? In this episode we talk through building a function (or function factory) to do this easily and flexibly!
    • Source Code
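    The function-factory idea described above can be sketched like this: capture the training set's mean and sd once, then return a function that applies them to any new data. This is a hypothetical helper, not the episode's exact code:

    ```r
    # build a z-score normalizer from the training data only
    make_z_scorer <- function(train) {
      m <- mean(train, na.rm = TRUE)
      s <- sd(train, na.rm = TRUE)
      function(new_x) (new_x - m) / s  # the returned function "remembers" m and s
    }

    set.seed(1)
    train <- rnorm(100, mean = 50, sd = 10)
    test  <- rnorm(20,  mean = 50, sd = 10)

    z <- make_z_scorer(train)
    z(test)  # test set scaled by the TRAINING mean/sd, avoiding leakage
    ```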
  • TidyX Episode 140 | Data Restructuring via Splits

    • A viewer sent in an interesting data processing question they had, where they wanted to take values from 4 columns and format them into a single column. Patrick and Ellis show several different approaches one could take to create a new data structure. If you have a data engineering question, please like, subscribe, and send us an email with your question, sample data, and example output!
    • Source Code
  • TidyX Episode 141 | Building Function Factories

    • In Episode 139 (bit.ly/TidyX_Ep139) we introduced the idea of a function factory - a function that returns another function. This is a powerful idea that allows you to set default values and apply them to new sets of data! This is a concept found all over programming, and you've probably been using them without realizing it! We do an intro to how they work and then ask you - what do you want to know about these cool powerhouses?
    • Source Code
  • TidyX Episode 142 | Data from your Lists 4 ways

    • Loops are a super powerful and flexible tool, but pulling information out of each iteration is not always simple. Through the lens of a home-baked, artisanal k-folds for loop, we discuss four different methods you could use to store the predicted values from each fold to investigate later.
    • Source Code
  • TidyX Episode 143 | For in for(loops)

    • Decisions in your code setup impact how you can structure your code later on. Loops are a flow-control structure that can have a massive impact on what you can do downstream, so how you set up your loop strongly affects code readability. Ellis and Patrick explore four different techniques for setting up the seemingly simple for(i in x).
    • Source Code
  • TidyX Episode 144 | Nested for loops for Simulation

    • For loops are useful in a variety of scenarios, and don't have to be restricted to a single layer. Nested for loops allow you to iterate over multiple combinations of variables easily and quickly. In this episode we show how you might use a for loop to generate simulated data, then apply three models to the newly simulated data to let you investigate your methods. We show where this approach can generalize to other data simulation and modeling techniques.
    • Source Code
  • TidyX Episode 145 | Multi-Input Shiny

    • Most shiny apps are set up with single inputs, each creating its own cascade of effects. However, sometimes the UI of your app may require one input to impact another. In this episode we introduce the idea of multiple inputs causing a change to a single value!
    • Source Code
  • TidyX Episode 146 | Can You Download Handler This?

    • Shiny apps are a useful tool for your stakeholders or downstream users to self service and explore data. However, that is rarely where it ends. How do you get the data out of your shiny app and into their reports? We show two methods in this episode, capturing results and saving them into a PDF, and taking an automatic screenshot of the app as is!
    • Source Code
  • TidyX Episode 147 | Shiny Markdown Report

    • Expanding on Episode 146, we show you some fun new techniques in shiny for generating outputs for your stakeholders. First, we talk about how you can use {shinyjs} by Dean Attali to enable or disable buttons in shiny apps, helping ensure your users aren't clicking buttons at the wrong time. Then we get into the purpose of the episode, showing how you can execute an Rmarkdown (or quarto) report through your shiny app and present the values from the shiny app in a nice way!
    • Source Code
  • TidyX Episode 148 | Shiny Model Builder

    • We embark on a journey of integrating statistics into a shiny app. But rather than displaying pre-defined models, we give the power to our users! In this episode we work on the scaffolding for the next several episodes of discussing shiny techniques as well as modeling tools. The code to upload a user-defined file, update your input options based on the user provided data, and generate a simple LM are all things you'll learn in this episode!
    • Source Code
  • TidyX Episode 149 | Shiny Model Builder - Test/Train Splits

    • Our second episode on embedding statistical tests in a shiny app. This time we apply good practice to our model building by splitting the dataset into testing and training sets. Using a slider, the user can select the proportion of data that goes into the training set used to fit the model, leaving a test set on which they can evaluate the performance of the regression. We also add some additional tabs to our shiny app to keep our users from becoming overwhelmed!
    • Source Code
  • TidyX Episode 150 | Shiny Model Builder - Predictor Selection Protections

    • We have made it to 150 episodes! We are so grateful for the support we have received thus far and hope to continue doing this for a long time to come! In this episode we show how to protect our users from making a very understandable mistake given the original construction of the app. We show two methods for ensuring the user is unable to select the outcome variable among the predictor variables, updating the selectable values.
    • Source Code
  • TidyX Episode 151 | Shiny Model Builder - Tree Based Models

    • At long last, after setting up our shiny app to perform splits and protect our users, we get into providing different model selections! In this episode we focus on two types of tree-based models - a decision tree and random forest. We talk about how you set up your code to easily apply these different model functions, and have dynamic outputs based on the selected model, even a plot that only appears when a decision tree is selected!
    • Source Code
  • TidyX Episode 152 | Formula 1 Data Packaging - Getting the data

    • This week we start a series on building a data package! A data package is a great tool for sharing versioned data with a variety of people very easily. It is also a great way to learn about package-building concepts. Our data package will contain historical Formula 1 data scraped from online resources, so this first episode in the series focuses on getting the data from a website using {rvest}. Finally, we close by showing how to make a fun plot of each year's constructors' champion as a waffle plot, coloring each tile with the average livery color of the team.
    • Source Code
  • TidyX Episode 153 | Formula 1 Data Packaging - Initializing the package

    • Last week we showed how to scrape the web to pull historical F1 Championship data into our R session. This week we build the initial files to create our data package! We show the tools available to create a basic R package using the usethis package. We talk through the basic files required of any R package, and show how to store last week's R code and the resulting data in a way that can be called directly from the installed package!
    • Source Code
  • TidyX Episode 154 | Formula 1 Data Packaging - Documenting and Sharing

    • Now that we have the rough shell of a data package with the minimum requirements, let's make it into an honest-to-god, shareable, and useful package. We do this in a few ways: first we document our package using Roxygen to make man pages, then update our DESCRIPTION to include the packages we use, and finally offer up a use case for the data in a vignette! Join us as we talk through all these steps and how to do them!
    • Source Code
  • TidyX Episode 155 | R Packages - Functions

    • Our last series covered building a useful data package, but what about writing a normal package? This new series is about writing an R package based on functions from TidyX! This episode we talk through a) what a function is, b) how to think about a function, and c) writing cohesive functions.
    • Source Code
  • TidyX Episode 156 | R Packages - Roxygenizing your Package

    • Providing a minimum set of documentation for your package is made incredibly easy through the use of the {roxygen2} package. We write specially formatted comments, called roxygen blocks, before each of the functions we would like to document and voila we get our help pages! We discuss a number of basic roxygen tags to use and a few methods for approaching documentation!
    • Source Code
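As a sketch of the technique, a minimal roxygen block over a made-up `add()` function might look like this (the function and its documentation are hypothetical illustrations, not the episode's actual code):

```r
# A hypothetical function documented with a roxygen block.
# Running devtools::document() would turn these comments into a help page.

#' Add two numbers
#'
#' @param x A numeric vector.
#' @param y A numeric vector the same length as `x`.
#' @return The elementwise sum of `x` and `y`.
#' @examples
#' add(1, 2)
#' @export
add <- function(x, y) {
  x + y
}
```

Each `@` tag maps to a section of the generated help page: `@param` documents arguments, `@return` the value, and `@export` adds the function to the NAMESPACE.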
  • TidyX Episode 157 | R Packages - Making an example of your Package

    • For the most part, our journey of package documentation is coming to an end: a user can see the functions, go to a help page, and see what should be entered into every argument. HOWEVER, that's still not enough! Remember, users are not you; they need to see how a function should be used to get a better idea of the inputs and expected outputs. To that end, we show how to use the @examples tag, along with some additional interesting roxygen tags like @seealso and @noRd, and how to link to other function documentation.
    • Source Code
  • TidyX Episode 158 | R Packages - Write a vignette for yourself

    • Now that our package functions are fully documented, our job as package authors is done, right? Wrong. The next level is at least one vignette. A vignette is long-form documentation included in your package to help folks understand not only how each function works, but how the functions work together and how to apply your package more generally. In this episode we show how to create and document your vignettes!
    • Source Code
  • TidyX Episode 159 | R Packages - Feeling a little Testy

    • Now that our package is documented, both at the function level and as a whole, and contains data and useful functions, are we done? Well, the fact that this video exists should tell you the answer is NO! All of this is great, but how do we know our functions are giving us correct answers, and that as we update or modify them they continue to behave as expected? Using the {testthat} package, we can write useful tests that get run as often as we'd like to confirm our expectations!
    • Source Code
  • TidyX Episode 160 | Shiny URL Queries

    • Ever wonder how you could set up your Shiny app to open with the values you want to look at already selected? It can be such a pain to always re-enter values into your app. In this week's episode, we show how to set up an observer that checks the URL query string when someone opens your Shiny app and pre-populates the inputs with the selected values. We also hint at other ways this could be used.
    • Source Code
  • TidyX Episode 161 | Shinylive - Is this thing on?

    • Deploying Shiny apps can be a pain: you need a server that is always running, the correct version of R, and a way to make sure folks have access to it. Or you can ask your users to install R and Shiny locally, and we all know how that goes. Well, as of posit::conf 2023, shinylive is officially available for R, letting a Shiny app run directly in the browser! We take the opportunity to learn how to make a shinylive app and take you through the steps of hosting it on GitHub Pages! Many thanks to Rami Krispin for his tutorial.
    • Source Code
  • TidyX Episode 162 | Advanced Shiny - Web Scraping and Dynamic Linking

    • Sometimes you need to dynamically pull content for your Shiny app because you don't have it all locally. The source could be a database, an internal API, or, in our case, an external website. Using Hockey-Reference.com, we create a simple Shiny application that lets us get the list of games from a season and then pull game-level information such as scoring and penalties. Of course, we want to give credit and let our users see more information, so we create a nice link back to Hockey-Reference.com that takes the user directly to the game page.
    • Source Code
  • TidyX Episode 163 | Advanced Shiny - Player links in DataTable

    • Moving to player-level data after last week's game-level data, we pull all the players for the 2024 hockey season and generate a simple Shiny app letting you filter down to just a few players to inspect. Again, the Shiny app shows only a small amount of information, and we want to let the user see more, so we provide a link in our DataTable back to the player page on Hockey-Reference to learn more about that particular player.
    • Source Code
  • TidyX Episode 164 | Advanced Shiny - Running Multiple Linked Shiny Apps

    • Running multiple Shiny apps and setting them up to link back to one another may sound like a tall task. BUT it may be more approachable and powerful than you think. Building on the last few episodes, and sprinkling in some magic from episode 160, we update our Shiny apps from episodes 162 and 163 to actually link together, adding a player roster in the games app that links into the player app. We also show how to run multiple apps using the {callr} R package!
    • Source Code
  • TidyX Episode 165 | The Power of Plotting Compels You

    • Sometimes you just want to make a simple, easy plot without having to load all those libraries. Other times you don't have access to all those libraries and you STILL need to make those nice plots. This week we talk a bit about Base R scatter plot building capabilities. We cover the basics of getting started, some tips and tricks, how to add a legend, and finally setting those nice labels, only using base R graphic tools.
    • Source Code
  • TidyX Episode 166 | The Line Plot Saga

    • Working off of last week's base R scatter plots, we look into how one might generate a line plot, often used for some process or time series. We take the Lahman batting datasets and generate trend data for hits from the year 2000 to 2020. We lean into base R and show some fun ways to group and summarize data using the aggregate() function. We make line plots of hits, show how to add confidence and prediction intervals, and finally show how to save your hard work!
    • Source Code
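The aggregate-then-plot pattern described above can be sketched entirely in base R. This example uses the built-in ChickWeight dataset as a stand-in for the Lahman batting data used in the episode:

```r
# Group-and-summarize with base R: mean weight at each measurement day.
# The formula interface reads as "summarize weight by Time".
mean_wt <- aggregate(weight ~ Time, data = ChickWeight, FUN = mean)

# Base R line plot of the resulting trend
plot(mean_wt$Time, mean_wt$weight, type = "l",
     xlab = "Day", ylab = "Mean weight (g)",
     main = "Mean chick weight over time")
```

Swapping `FUN = mean` for `sum` or `median` changes the summary without touching the rest of the call.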
  • TidyX Episode 167 | Grand Slam - Knocking it Out of the Park with Base R Density Plots

    • In this thrilling R episode, we step up to the plate and hit a grand slam with base R's powerful capabilities for crafting stunning density and distribution plots. Using Lahman database baseball batting stats from 2010-2019, we dive into the world of data visualization, showcasing techniques like box plots, bar plots, and histograms. Special attention is given to the grandeur of density plots, all presented with the flair of RStats mastery. Join us on this home run journey into the art of visualizing data distributions with base R!
    • Source Code
  • TidyX Episode 168 | Hall of Fame Showdown - Base R Plot Edition

    • We dive deep into the world of baseball statistics using Lahman database batting data from 1980 to 2004. This episode focuses on comparing players who made it to the Hall of Fame with those who didn't, bringing you insightful visuals using base R plots. Explore the percentage of player inductions, analyze the distribution of Hall of Fame votes, and uncover the nuances of batting averages in this exciting exploration. The episode also features engaging strip charts, informative text plots, and an interactive experience, providing a zesty journey into the realm of baseball analytics.
    • Source Code
  • TidyX Episode 169 | Predicting Hall Of Famers in 20 Minutes

    • We undertake a predictive analysis focused on forecasting potential inductees for the Baseball Hall of Fame class of 2024, and explain it in only 20 minutes! Using the tidyverse for data processing and base R for model fitting and prediction, we generate predictions for newly eligible players as well as players still waiting to be inducted. The exploration centers on significant player statistics, offering insights into the determinants of Hall of Fame selection. Join us for a fast-paced look at baseball analytics and predictive modeling.
    • Source Code
  • TidyX Episode 170 | Beyond Basic For loops - Tidy Expressions

    • We jump into the intricacies of for loops, pushing beyond the basics into the realm of tidy expressions using tidyverse functions, the double curly-brace ("curly-curly") embrace, and str2lang(). By doing this, we can leverage tidyverse functions for efficient coding inside custom functions while maintaining the non-standard evaluation tricks used by the tidyverse!
    • Source Code
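A minimal sketch of the curly-curly embrace, using a hypothetical `summarize_by()` helper (not the episode's actual code): `{{ }}` forwards an unquoted column name from the caller into tidyverse verbs.

```r
library(dplyr)

# {{ }} (curly-curly) passes the caller's bare column names through to
# group_by() and summarize(), preserving non-standard evaluation.
summarize_by <- function(data, group_col, value_col) {
  data |>
    group_by({{ group_col }}) |>
    summarize(mean_value = mean({{ value_col }}), .groups = "drop")
}

# Callers write bare column names, just like with dplyr itself
summarize_by(mtcars, cyl, mpg)
```

Without `{{ }}`, `group_by(group_col)` would look for a column literally named `group_col` and fail.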
  • TidyX Episode 171 | Bae in the Fast Lane: Bayesian linear regression in 20-Minutes

    • Learn Bayes Regression in just 20 minutes! Leveraging the power of R and key libraries like tidyverse, rstanarm, tidybayes, and bayesplot, we guide you through fitting a Bayesian model for predicting car mileage based on weight on the mtcars dataset. Uncover insights as we interpret credible intervals, explore the posterior distribution, and make predictions with uncertainty.
    • Source Code
  • TidyX Episode 172 | 20 minutes to Predict MLB HOF Pitchers - Class of 2024

    • Join us as we look into the numbers behind predicting MLB Hall of Fame pitchers! This episode includes crafting a dataset from the {Lahman} package, creating logistic regression models, and finally assessing them via model summary tools and visualization techniques. Stay tuned for insights and adjustments as we navigate the challenges of forecasting HOF greatness!
    • Source Code
  • TidyX Episode 173 | Pitch into the Bayes - "20" minute MLB Hall of Fame Pitchers predictions

    • Step up to the mound in TidyX Episode 173 as we predict MLB Hall of Fame pitchers using the power of Bayesian models! Join us as we switch up our game plan, leaving no curveball unturned with rstanarm. We inspect the models and results with prediction intervals and probabilities, bringing a new dimension to player forecasts! We show how to apply this to new players, from randomly selected to a Seattle Favorite - King Felix.
    • Source Code
  • TidyX Episode 174 | AI Speed Ball: Predicting the 2024 Pitcher HOF Class in 20 Minutes

    • We're bringing the heat with AI! Join us as we step up to the plate and predict the next MLB Hall of Fame pitchers using the power of TensorFlow and Keras. With a killer convolutional neural network in our arsenal, we're ready to knock it out of the park! We go over normalization techniques, how to set up your model, and use it to predict who should be in and who will be out! Don't miss this action-packed inning!
    • Source Code
  • TidyX Episode 175 | Strike Zone Shenanigans: Tidyverse Takes on Hall of Fame Hurlers

    • We explore the world of data modeling using the tidyverse and {purrr} to predict the next MLB Hall of Fame pitchers. Stay tuned for some fascinating insights into our modeling process! We use the same datasets as in the last several weeks and apply logic and code to create, evaluate, and tune our models.
    • Source Code
  • TidyX Episode 176 | Are you Sure?

    • In this episode, we use the power of random forests and Bayesian statistics to compare pitchers' likelihoods of making it into the Hall of Fame! We show how to run simple simulations of individual player performance and differences, and finally write a function that lets you easily compare players.
    • Source Code
  • TidyX Episode 177 | Who's Next? FIBA API Viewer Question

    • We tackle a real-world challenge brought to us by our viewer, Cohen MacDonald. Cohen found an undocumented API with a bunch of game data from FIBA and has some great ideas on what to do with it. However, there's one problem: the dataset does not record which players are on the court at what time, just who subs in or out. With an intriguing problem statement and example code from Cohen in hand, we delve into the intricacies of FIBA basketball game data. See how we harness the power of for loops to iteratively update values, addressing Cohen's question about player substitutions and lineup analysis.
    • Source Code
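The core loop idea can be sketched on a toy version of the substitution problem: walk the sub events in order and update an on-court roster as you go (the event data here is made up for illustration, not the FIBA API's actual shape):

```r
# Hypothetical sub events, in game order
events <- data.frame(
  player = c("A", "B", "A", "C"),
  action = c("in", "in", "out", "in"),
  stringsAsFactors = FALSE
)

# Iteratively update the set of players on the court
on_court <- character(0)
for (i in seq_len(nrow(events))) {
  if (events$action[i] == "in") {
    on_court <- union(on_court, events$player[i])
  } else {
    on_court <- setdiff(on_court, events$player[i])
  }
}
on_court  # players on court after all events: "B" "C"
```

Recording `on_court` at each iteration (rather than only at the end) yields the lineup at every moment of the game.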
  • TidyX Episode 178 | Player Time Chart - FIBA API Part 2

    • In this follow-up to Episode 177, we dive deeper into the intricacies of FIBA basketball game data. Building upon our previous exploration, we refine our methods to generate insightful player time charts. Join us as we unravel the complexities of lineup analysis and visualize player dynamics over the course of a game. Get ready for another insightful episode of TidyX!
    • Source Code
  • TidyX Episode 179 | How many SpaghettiOs does it take to write LOTR?

    • We embark on a hilarious journey to answer the age-old question: how many SpaghettiOs would it take to write a whole book? Inspired by abstract_tyler's Instagram reel (https://www.instagram.com/p/C6hUeRVp24H/), we use the power of R to find out! Prepare for some serious spaghetti-fueled fun as we delve into data wrangling in R. We'll tackle skills like joining datasets, calculating frequencies, and writing functions to automate the analysis.

    • Source Code

  • TidyX Episode 180 | How much stuff have we sent to Space?

    • Ever wondered how much stuff has rocketed into space? This episode we do a 180 and return to how TidyX started, exploring a TidyTuesday dataset of objects launched into space! We'll learn how to wrangle the data, calculate launch counts by year, and create visualizations with ggplot2. Plus, we'll discover a cool trick for faceting plots with independent y-axes, and finally show a fun way to interact with facets using the trelliscopejs package. Join us for a stellar exploration of space exploration data!

    • Source Code

  • TidyX Episode 181 | I Likert Coffee

    • Calling all coffee lovers! ☕️ This episode of TidyX gets down to the grounds of coffee expertise with a TidyTuesday survey. We'll brew up some data analysis to see if age affects how people rate their coffee knowledge. Get ready for Likert scales, data wrangling, and statistical throwdowns to see which age group claims the coffee crown!

    • Source Code

  • TidyX Episode 182 | Turbocharge Your Simulations with Parallel Processing! ⚡️

    • Ever feel like your simulations take forever to run? This TidyX episode injects a dose of speed with parallel processing using the snowfall package! We'll revisit nested for loops for simulation, then supercharge them to run across multiple cores. Learn how to run simulations in parallel for faster results using the snowfall package, and combine and analyze simulation outputs for deeper insights.

    • Source Code
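The speed-up pattern above can be sketched with base R's {parallel} package (bundled with every R installation), shown here as a stand-in for the {snowfall} workflow used in the episode:

```r
library(parallel)

cl <- makeCluster(2)                        # start 2 worker processes
sims <- parLapply(cl, 1:100, function(i) {
  mean(rnorm(1000))                         # one toy simulation per iteration
})
stopCluster(cl)                             # always release the workers

# Combine the per-iteration results for analysis
summary(unlist(sims))
```

The structure mirrors {snowfall}'s `sfInit()` / `sfLapply()` / `sfStop()` triple: set up a cluster, map the simulation function across iterations, then tear the cluster down.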

  • TidyX Episode 183 | Within-group regression using {purrr}

    • Unleash the power of {purrr} to perform within-group regressions! This episode we explore fitting separate linear models for different groups in your data, using the Palmer Penguins dataset as an example. Using map(), we quickly build models, extract key statistics, and visualize how groups differ. Join us to start the journey of mastering this great package and becoming a {purrr}fect data scientist!

    • Source Code
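The split-then-map pattern at the heart of the episode can be sketched in a few lines. To stay self-contained, this uses the built-in mtcars data rather than Palmer Penguins:

```r
library(purrr)

# One data frame per cylinder group, then one linear model per group
models <- mtcars |>
  split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))

# Extract the slope on weight from every group's model
map_dbl(models, \(m) coef(m)[["wt"]])
```

Each element of `models` is a full `lm` object, so anything you would do with one model (`summary()`, `predict()`, `broom::tidy()`) can be mapped across all groups the same way.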

  • TidyX Episode 184 | Hello Kitty: Intro to {purrr}

    • This intro highlights {purrr}'s core functionality and the different ways to write the functions you pass it, from named to anonymous functions, along with keeping output types consistent and applying functions to filter and pull contents out of lists. Learn the basics so you can apply these techniques to more complicated structures!

    • Source Code
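The different function-writing styles mentioned above can be shown side by side with the typed `map_dbl()` variant, which guarantees a double vector back:

```r
library(purrr)

# Three equivalent ways to hand map() a function:
map_dbl(1:3, sqrt)          # a named function
map_dbl(1:3, \(x) x^2)      # an anonymous function (R >= 4.1)
map_dbl(1:3, ~ .x + 1)      # purrr's formula shorthand
```

Plain `map()` always returns a list; the `map_dbl()` / `map_chr()` / `map_lgl()` variants enforce the output type and error loudly if an element doesn't fit.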

  • TidyX Episode 185 | Independence Days with {purrr}

    • Using Wikipedia's list of independence days, we'll show you how to use some advanced {purrr} to work with the data, construct new functions, and transform data extracted from webpages into usable formats. We aim to test the amusing quip that, with this dataset, every 4 days some country is celebrating its independence from the UK!

    • Source Code

tidyx's People

Contributors

pw2, thebioengineer


tidyx's Issues

Real messy data

publishedweek2420212.xlsx

This is a classic example of what the UK Government statistics organisation, the Office for National Statistics (ONS), produces. Great data, but in a very "untidy" spreadsheet. For example, look at the "weekly figures 2021" sheet. In the first column we have both age groups (combined and by gender) and also regions. The other column headings are dates.

Shiny data dictionary creator

An idea you might want to tackle once is a Shiny app that generates a data dictionary from a datafile that is uploaded by the user (name, description, data type). Maybe something like the data dictionaries that come with TidyTuesday datasets? As an option a user would also be able to add a description to the variables in the dictionary, or if labels were assigned this would serve as the description field. For factor variables the different values could also be shown. Such a tool would be very useful for non-data scientist researchers to document their data in order to facilitate data re-use via open data repositories. Keep up the good work!

data cleaning

Data cleaning is super important. We should do a series on data cleaning for some messy data set.

Multi User DB editing

Hi Patrick and Ellis,

Loving the series on using databases. If possible could you cover a situation where you could have multiple people simultaneously editing data from the same database.

Using the example from episode 75.

2 different coaches, in different Shiny sessions, happen to be going through and providing comments on the same game at the same time. How would the following situation resolve?

(events happen in order)

  1. Both coaches (on separate machines) open up the shiny app.
  2. Both coaches are presented with pbp DT in the app that has no comments.
  3. Coach 1 adds a comment to play 1 and commits it to the db.
  4. Coach 2 then also wants to give a comment to play 1. Keeping in mind, when this coach pulled the data from the DB, there was no comment for play 1.
  5. Coach 2 then writes their comment, and presses commit.

In the above situation, I think your code as of episode 75 would mean any of Coach 1's comments made between Coach 2 pulling the data from the database and Coach 2 pressing their commit button get overwritten by Coach 2's commit.

There are a few different solutions that could work for this specific situation, give each coach a specific column, store comments made by different coaches in different tables etc.

I'm interested to see if there's a reasonable way, similar to episode 73, where you could do some sort of reactive polling to provide a way of interacting with the database in a similar vein to multiple people editing a Google Sheet at once. I understand it isn't a realistic use case to see what other users are typing character by character, but at least something to see if the data in your session of the app is "out of date", i.e. someone else has edited/commented/added to that data.

Thanks again for all your time and effort with the TidyX series!

Question on Episode 124

I probably didn't understand the problem completely but I was curious after watching Episode 124 why not just make a 2 column tibble and then group_by fruit and summarize toString?

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.1
#> Warning: package 'tibble' was built under R version 4.2.1
#> Warning: package 'tidyr' was built under R version 4.2.1
#> Warning: package 'dplyr' was built under R version 4.2.1

lst1 <- list(winter = c("apple", "orange", "banana"), 
            summer = c("pear", "kiwi", "grape", "apple"),
            spring = c("cherry", "apple", "grape", "banana"))

lst1 |> 
 enframe(name = "season", value = "fruit") |>
 unnest(fruit) |>
 group_by(fruit) |>
 summarize(season = toString(season))
#> # A tibble: 7 × 2
#>   fruit  season                
#>   <chr>  <chr>                 
#> 1 apple  winter, summer, spring
#> 2 banana winter, spring        
#> 3 cherry spring                
#> 4 grape  summer, spring        
#> 5 kiwi   summer                
#> 6 orange winter                
#> 7 pear   summer

Created on 2022-11-15 by the reprex package (v2.0.0)

tidymodels versions

I was inspired by your pitch f/x videos to make several tidymodels versions. If interested, you can take a look here:
https://github.com/datadavidz/pitch_fx

moneyball_part2 is the KNN/UMAP
moneyball_part3 is the Decision Tree/Random Forest
moneyball_part4 is the xgboost
moneyball_part5 is the Naive Bayes

Disclaimer: I am a novice at tidymodels.

I found the results similar/same as what you showed using caret or the native packages. I like the consistency in the tidymodels framework when switching across different models.

-Dave

Databases

Using databases, accessing, writing, joins, etc.

Feature engineering

A cool segment would be to go over feature engineering approaches and variable selection.

customising a full report

I am actually in need of this and think it would be a good idea to show how to customise an R Markdown report to follow branding guidelines. This would cover colour palettes, setting headers and footers, section pictures, referencing charts and tables, and so on. I have been researching the topic and still haven't figured it all out.

produce excel report with formatted fill colors

Hi Ellis and Patrick! I have a question, that might be interesting for a future episode? :)

My question relates to Excel files that we produce on a weekly basis, where cells whose values differ from the previous week are highlighted with Excel's fill format. I am going through the openxlsx documentation but do not seem to find anything that addresses this question. Perhaps you guys can help :)

Cheers and thanks!
Adrian

REF: Episode 132 - How to include AND, OR, NOT ops in the [SEARCH] field of the Output Shiny Form?

Hi Ellis & Pat,

Episode 132: EXCELLENT!!!.

Q: How can I include AND, OR, and NOT operators in a query in the [SEARCH] field (shown at the top of the output results Shiny form)?

I saw that a space between two words in [SEARCH] acts as an AND operator, e.g. [ fran big ]. Great! But what about a more complex search query, e.g. [ bu OR han ] to search for and show only rows with "bulls" OR with "Hanson"?

SFd99
latest R, RStudio + Ubuntu Linux 20.04
San Francisco
