tidymodels / workshops

Website and materials for tidymodels workshops

Home Page: https://workshops.tidymodels.org

License: Creative Commons Attribution Share Alike 4.0 International

HTML 5.29% CSS 8.26% R 0.92% SCSS 0.54% JavaScript 84.98%

workshops's Introduction

workshops

This repo contains tutorial materials for machine learning with tidymodels.

Organization

This repo is organized into directories:

slides/ has Quarto files for the latest version of our slides.
classwork/ contains Quarto files prepared for you to work along with the slides.
archive/ is the location for older versions of this workshop.

Code of Conduct

Please note that the workshops project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Archiving Notes

To archive previous workshop notes:

Make a subdirectory in archive/ called YYYY-MM-workshop-name.
Copy the contents of slides/ into archive/YYYY-MM-workshop-name.
Copy the contents of classwork/ into archive/YYYY-MM-workshop-name.
Copy index.qmd into archive/YYYY-MM-workshop-name.
In index.qmd, remove slides/ from links to slides.
In _quarto.yml:
- add an entry "archive/YYYY-MM-workshop-name/*qmd" under render.
- add an entry "archive/YYYY-MM-workshop-name/classwork/*qmd" under resources.
In archive/YYYY-MM-workshop-name/, add a _metadata.yml file with the contents

execute:
  freeze: true

In the command line, run quarto render archive/YYYY-MM-workshop-name. This will regenerate the workshop slides under docs/archive/YYYY-MM-workshop-name.
Check that:
- Running quarto render didn't change any files in docs/ outside of docs/archive/.
- The generated slides are added to _freeze/archive/YYYY-MM-workshop-name rather than in archive/YYYY-MM-workshop-name.
- The generated slides work (specifically, that file paths to figures resolve correctly).
In index.qmd, add an entry in H2 "Past workshops" like [M YYYY](archive/YYYY-MM-workshop-name/) in workshop-name
If you are adding slides in a language other than English, update the navbar link in _quarto.yml.

Once the above changes are merged to main, make a GitHub Release noting the big-picture changes since the previous iteration of the workshop.
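The copy steps above could be scripted roughly as follows. This is a sketch, not part of the repo: `archive_workshop()` is a hypothetical helper, and it assumes classwork is kept in a `classwork/` subfolder of the archive, as the `resources` entry in `_quarto.yml` suggests. The edits to `_quarto.yml` and `index.qmd` are still made by hand.

```r
# Hypothetical helper: copy the current materials into a dated archive folder.
archive_workshop <- function(name, root = ".") {
  dest <- file.path(root, "archive", name)
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)

  # Copy the contents of slides/ directly into the archive folder.
  slide_files <- list.files(file.path(root, "slides"), full.names = TRUE)
  file.copy(slide_files, dest, recursive = TRUE)

  # Assumption: classwork/ is copied as a subfolder of the archive.
  file.copy(file.path(root, "classwork"), dest, recursive = TRUE)

  # Copy the landing page alongside the slides.
  file.copy(file.path(root, "index.qmd"), dest)

  # Freeze execution so the archived slides are not re-run on later renders.
  writeLines(c("execute:", "  freeze: true"), file.path(dest, "_metadata.yml"))

  invisible(dest)
}

# Example call (then run `quarto render archive/YYYY-MM-workshop-name`):
# archive_workshop("YYYY-MM-workshop-name")
```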

workshops's People

Contributors

Stargazers

Watchers

Forkers

davisvaughan dgrtwo spoese xuran114 jonthegeek mjzenz coco90417 fgazzelloni jbarbagallo hezibu ajaykmehta jonathanbratt tdubois-forks jorgedelro rserran gast1111 tfulge shadle naimzaa96 collinberke austinwpearce epiheather mcnanton bcjaeger syi824 ramanthevulcan elenapodleski javorraca ramnathv 27vale nextmarte diraol cmusso86 tuqmano nalcan dalessandrini lucianea nicholustintzaw edgararuiz acohenstat yupeng80 cristal307 fvd

workshops's Issues

Add material for end of Day 2

Either more slides or a .qmd to work through

day 1 - discuss why some variables are excluded

I would love if we could add a slide after the introduction of the data, but before this line, sparking inquiry about why we are excluding these variables

Akin to how it was done in https://emilhvitfeldt.github.io/useR2020-text-modeling-tutorial/#24 and the following couple of slides

Originally posted by @EmilHvitfeldt in #108 (comment)

Deck 4 - Embarrassingly parallel

Let's edit this slide, or maybe make two, to show parallel processing for Windows as well as Mac. (Linux? Probably those people already know how to do this on their computer. 😆 )
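One cross-platform option, sketched here with base R's parallel package rather than the deck's actual code: a PSOCK cluster works on Windows, macOS, and Linux, whereas forking (for example `mclapply()` with more than one core) is not available on Windows. The worker count and the toy task are placeholders.

```r
library(parallel)

# A PSOCK cluster launches separate R processes, so it works on
# Windows as well as macOS and Linux. Two workers is an arbitrary choice.
cl <- makePSOCKcluster(2)

# Toy task standing in for model fits on resamples.
squares <- parLapply(cl, 1:4, function(i) i^2)

# Always release the workers when finished.
stopCluster(cl)
```

For foreach-based backends, the same kind of cluster object can be registered with `doParallel::registerDoParallel(cl)` before the resampling or tuning call.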

prep and bake

There were a lot of questions about how to see the results of the recipe (around slide 25 of 05). It might be helpful to show a slide about prep and bake(new_data = NULL).
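A minimal sketch of what such a slide could show, using `mtcars` as a stand-in for the workshop data: `prep()` estimates the recipe's steps from the training set, and `bake(new_data = NULL)` returns the processed training set.

```r
library(recipes)

# mtcars is a placeholder for the workshop data.
rec <- recipe(mpg ~ ., data = mtcars)
rec <- step_normalize(rec, all_numeric_predictors())

# prep() estimates the means and standard deviations from the training data.
rec_prepped <- prep(rec)

# new_data = NULL returns the processed training data.
baked <- bake(rec_prepped, new_data = NULL)
head(baked)
```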

Don't we want the "output" of code to be encased in a light grey background?

Right now it just "floats" on the page, which looks very strange.

I was expecting some kind of background like:

I think ideally it would get rendered in the same box as the code itself, like on our pkgdown sites (i.e. like the example above)

distance annotation is missing

05 slide 46 has an annotation pointing to https://workshops.tidymodels.org/slides/annotations.html#distance, but there isn't a distance annotation on that page.

all chunks have labels

It's a lot to change but it is much more hygienic.

Move info on rank deficient fit earlier

Folks want to understand what the rank deficient fit means the first time they see it, so we should move that annotation up to the first fit_resamples() in Deck 5.

2023 - conf - describing taxi "pre-processing"

The mutate(month = factor(...)) and drop_na() steps were both described as no-no's when showing the slide. Might be worth thinking about 1) whether we want to do both of these (or do them in the internals of data_taxi()?) and/or 2) how we might describe when and why this is fine for the purposes of the workshop.

day 1 - reverse factor levels when plotting in "Your Data Budget"

Following Hannah's lead in #113, let's forcats::fct_rev(tip) in plotting code for the second "Your Data Budget" deck.
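For reference, a small sketch of what `fct_rev()` does. The `tip` vector below is a made-up stand-in for the outcome column, not the taxi data itself.

```r
library(forcats)

# Hypothetical stand-in for the `tip` outcome.
tip <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))

levels(tip)           # original level order
levels(fct_rev(tip))  # reversed level order, as proposed for the plots
```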

Deck 5 - Don't use `encoded` workflow in workflow-set, label set elements sequentially for the plot

convert character to factor in tree frog data

Related to tidymodels/hardhat#213

We should convert character predictor columns to factor (and comment on why we are doing this).

We still will fix the underlying function but this will make the initial session go easier.
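One way the conversion could look, sketched in base R on a made-up data frame (the column names and values are illustrative, not the actual tree frog data):

```r
# Hypothetical data frame standing in for the tree frog data.
frogs <- data.frame(
  treatment = c("a", "b", "a"),
  age = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; leave other columns alone.
is_chr <- vapply(frogs, is.character, logical(1))
frogs[is_chr] <- lapply(frogs[is_chr], factor)

str(frogs)
```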

Deck 4 - Move first `fit_resamples()` to right after CV, talk about bootstraps / validation set after obs-vs-pred plot

Shrink height in `hexes()`

If we shrink the height= modifier in hexes() from 1.16 to 1.10, I think the spacing between the hexes and the first element is better. I think this bugs me and @juliasilge 😛

We'd have to rerender the full site

Before

After

Prep at least a `.qmd` for end of first day

If we get through the first day really quickly, we will want some back-up content. Let's prep at least an .Rmd file to walk through more content. First idea: using the tree frogs data to introduce stacks

in whole game diagrams, increase font for model blocks

take a look at hockey explainer results

#62 (comment)

Getting Help slide seems confusing

I'm not sure I understand the difference between the first two bullets. Plus, don't we have sticky notes? Should we refer to that instead?

Make `.qmd` files for participants

Copy code from slides, clean up, add in "your turn", etc

2023 conf - whole game slides

This is more of a personal opinion

I think it would be neat if the 3 models (diamonds) were contained in a container to show it is "a pool of possible models"

2023 - All intro slide files should be prefixed

right now all the advanced slides are prefixed advanced-, we should do the same with intro slides once the dust settles

Deck 3 - Speaker note on how workflows/hardhat handles levels better than `model.matrix()`

Slide 24

It would be nice to have a speaker note for this bullet that says exactly how workflows is better, so we don't have to think about it while we are up there.

For the speaker notes, hardhat::scream() does these two nice things:

Enforces that new levels are not allowed at prediction time (this is an optional check that can be turned off)
Restores missing levels that were present at fit time, but happen to be missing at prediction time (like, if your "new" data just doesn't have an instance of that level)
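The second behavior could be demonstrated in a speaker note with a toy example like the one below. The data are invented for illustration, and the sketch assumes `scream()` is called with its default arguments.

```r
library(hardhat)

# Toy training data; its zero-row prototype records the factor levels.
train <- data.frame(x = factor(c("a", "b", "c")))
ptype <- vctrs::vec_ptype(train)

# "New" data that happens to lack any instance of level "c".
new_data <- data.frame(x = factor(c("a", "b")))

# scream() validates new_data against the prototype from fit time.
checked <- scream(new_data, ptype)
levels(checked$x)
```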

2023 - conf - direct users to install Rtools

This will be surfaced when people install.packages("pak") if people don't have it installed, but worth calling out that it may be an issue and how to address it.

your time - when is a good time to split your data

It says "when is a good time to split your data"

I think we need to emphasize that we are asking time in relation to the modeling workflow/whole game

Deck 1 - Finalize workshop policies

Any final tweaking, plus "whatever we are saying about masking" - see the last bullet in the slide

Make sure modeldatatoo works with more restrictive file permissions

We had an issue where someone's computer wouldn't let {pins} save the file

2023 NYR final touches to deck 1

outline / "tentative plan"
reference to cloud instance?
install packages / package versions

Deck 4 - Include histogram/density chart of outcome when talking about stratification

When we talk about stratification here for the first time on slide 35, I feel like it would be useful to have an image of the outcome handy so we can talk about how stratification preserves the outcome distribution in each split

Something like:

ggplot(tree_frogs) + geom_histogram(aes(latency))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

make an animation for racing

See #107 (comment)

Maybe include a slide in deck 6 to talk about boosted tree vs random forest

i.e. both are built from decision trees, but one builds sequentially improving trees and the other builds independent trees

I think we could use a single slide that just has an image on it like this one from this article
https://medium.com/geekculture/xgboost-versus-random-forest-898e42870f30

I quite like this image actually

2023 - Add slide about seeds in intro slides

I'm thinking that we should have a slide about seeds in the intro slides.

Originally posted by @topepo in #106 (comment)
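A slide like that could use a tiny base R illustration such as this one (the seed value 123 is arbitrary): resetting the seed before a random operation makes the result reproducible.

```r
# Setting the same seed before each draw gives the same "random" result.
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)
second <- sample(1:100, 5)

identical(first, second)
```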

Deck 4 - `rank_results()` slide should mention the `rank_metric` argument, and what the default is

I think it would be helpful to have on the screen what rank_results() uses by default (the first metric in the set), and how to change that with rank_metric

Slide 52

Add 404 page

Since we are archiving slides over time, we are bound to run into people using dead links. A 404 page pointing to the archive should fix part of this problem

Add notes about partial pooling

In deck 5 @topepo. Might be an annotation but some speaker notes would also be good because it is somewhat complex

Classwork 3: Confusing verbiage

In this classwork it says "fit with parsnip". Users at this point have just been introduced to parsnip, and may be a little confused about how one fits using parsnip.

workshops/classwork/03-classwork.qmd

Line 74 in e2254c4

Fit with parsnip:

I would suggest that we change this line to "fit parsnip specification"

Deck 3 - Merge back to back hands on?

These slides are back to back. Should we just merge them into one slide with more hands on time?

Fill in Deck 1 - The Whole Game slide

It is still empty

using modeldatatoo

pins is a tricky install for some folks with locked-down laptops, and modeldatatoo needs it for the data_taxi() backend.

There may be ways to get around that in modeldatatoo, but also maybe another argument for urging folks to transition to the cloud instance🙈

Have an intuitive visual description of what degrees of freedom means

In the tuning slides, I think a visual depiction would help learners grasp what the degrees of freedom means for a model.

Here are two ideas:

Include this image from your book as a slide (with a talk track about "how bendy" it is):

Visualize how the effect of distance to goal is nonlinear, with a plot like this one:

library(tidyverse)

# Distance from each shot to the goal, taken here to sit at x = 89, y = 0.
example_data <- nhl_train %>% 
  mutate(distance = sqrt((89 - abs(coord_x))^2 + abs(coord_y)^2))

# Share of shots on goal within each distance bucket.
example_data %>%
  group_by(distance = cut(distance, c(0, seq(10, 60, 5), 100))) %>%
  summarize(pct_on_goal = mean(on_goal == "yes"), n = n()) %>%
  mutate(distance = fct_recode(distance, "<10" = "(0,10]", ">60" = "(60,100]")) %>%
  ggplot(aes(distance, pct_on_goal)) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_line(group = 1, linewidth = 2) +
  scale_y_continuous(labels = scales::percent) +
  expand_limits(y = 0) +
  labs(x = "Distance to goal (bucketed)",
       y = "% of shots in this bucket that are on goal")

cc @juliasilge

Deck 3 - Model explanations

I think my biggest meta comment about deck 3 is that I feel like we don't explain the models we are talking about.

We said that our prereq for this workshop is basic tidyverse knowledge, so I don't think we can assume people know how rpart works.

Like, in this slide we fit an rpart model and then the next few slides use predict() on it and show some of the rpart plotting methods, but I don't see a place where we really stop and discuss what this kind of model does

day 1 - consider how much to peek at test data

re: the slide where we print out taxi_test

I find it weird we are like "don't look at the testing data", and then we turn around and look at it 😆

Maybe we should add a "don't try this at home" for this slide? 😄

Originally posted by @EmilHvitfeldt in #108 (comment)

Should we just print out the dim()? Or even just say it's a data frame?

2023 - Add spinning wheel taxi svg

haha i'm with it! i'd say let's do it

Originally posted by @simonpcouch in #108 (comment)

2023 - conf - Update references to "tomorrow" to "Advanced tidymodels"

We sometimes refer to "day 2" or "tomorrow". For the workshops in Chicago, this should be updated to refer to "Advanced tidymodels". This issue is to keep track of such references, which we leave in for NYR but need to update afterwards.

Deck 4 has the following on random forests:

Often works well without tuning hyperparameters (more on this tomorrow!), as long as there are enough trees

Mention that if you use RStudio Cloud you don't need to install packages

To incentivize people to use it, which is good all around:

Less painful on the wifi vs everyone downloading packages all at once
Don't have to worry about installation issues

We have plenty of room on this slide to say something about it

use a common "figures" directory

In each qmd file's yaml:

knitr:
  opts_chunk: 
    echo: true
    collapse: true
    comment: "#>"
    fig.path: "figures/"

choosing the package to install remotes

All 3 of the 3 users who had issues during the first classwork ran into problems because of pak installation or credential setup. For at least two of them, switching to devtools::install_github() fixed the issue without having to go down the gitcreds rabbit hole.

Fill in Deck 3 - TMwR workflow diagrams

Make decisions about all countdown timers

Some long, some short, etc