juliasilge / supervised-ml-case-studies-course

Supervised machine learning case studies in R! 💫 A free interactive tidymodels course

Home Page: https://supervised-ml-course.netlify.app/

License: MIT License

R 24.49% JavaScript 29.47% CSS 35.35% Sass 10.69%
rstats machine-learning supervised-machine-learning online-course

supervised-ml-case-studies-course's Introduction

Welcome to the course repo for Supervised Machine Learning: Case Studies in R! 🎉

You can access this course for free online.

This course approaches supervised machine learning using the tidymodels framework.

The interactive course site is built on the amazing framework created by Ines Montani, originally built for her spaCy course. The front-end is powered by Gatsby and Reveal.js and the back-end code execution uses Binder. Florencia D'Andrea helped port the course materials and made the fun logo.


To learn more about building a course on this framework, see Ines's starter repos for making courses in Python and R, and her explanation of how the framework works at the original course repo. The original version of this course based on the R package caret is available here.

Please note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

The course material in this course is licensed CC-BY-SA, meaning you are free to use it, change it, and remix it as long as you give appropriate credit and distribute any new materials under the same license. The code is MIT-licensed.


supervised-ml-case-studies-course's Issues

Chapter 3.6 Error in initial_split when running locally

Love the tidymodels course!

I came across an error when running the following code snippet in chapter 3.6 locally.

vote_split <- voters_select %>%
    initial_split(p = 0.8,
                  strata = turnout16_2016)

Error in initial_split(., p = 0.8, strata = turnout16_2016, ) : argument 2 matches multiple formal arguments

When I changed the code snippet to the following, everything worked as expected:

vote_split <- voters_select %>%
    initial_split(prop = 0.8,
                  strata = turnout16_2016)

I'm not sure whether "p" was previously an accepted abbreviation of "prop", but changing "p" to "prop" fixed the error.

Issue in chapter 2 (Remote counting)

I'm following the supervised ML case studies course. I'm trying to run the following code:

stackoverflow %>%
    count(Remote)

# and

simple_glm <- stackoverflow %>%
    select(-Respondent) %>%
    glm(Remote ~ .,
        family = "binomial",
        data = .)

I'm getting this message: "Error: Column Remote is unknown"

Since I'm a very new R user, this issue could be a very basic one.

Thanks,

Federico

Chapter 2, Section 4 - copy-paste error?

The instruction "In the calls to count(), check out the distributions for remote status first, and then country." looks like it's repeated in chapter 2, section 4. Copy-paste issue?

Chapter 4, Exercise 13: xgboost performing better than GBM

Hey Julia!

THANK YOU SO MUCH for an amazing course. I have been impressed end to end and I am definitely going to share it with a bunch of folks.

I got tripped up on something and wanted to see if it's a mistake, maybe with the saved model RDS, or updated default parameters.

In Chapter 4, Exercise 13, the RMSE for the xgboost predictions (12.3472006) is smaller than the RMSE for the GBM predictions (12.4721538), but the lesson goes on to say that GBM performed best.
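For reference, here is a toy sketch of how the two sets of predictions can be compared with yardstick; the values and column names below are made up, not the course's saved models. The model with the smaller RMSE is the better performer.

library(dplyr)
library(yardstick)

# made-up predictions standing in for the xgboost and GBM results
results <- tibble(
    truth     = c(10.0, 12.0, 14.0, 16.0),
    .pred_xgb = c(10.2, 11.8, 14.1, 15.7),
    .pred_gbm = c(10.5, 11.5, 14.6, 15.2)
)

# smaller RMSE = better fit
rmse(results, truth = truth, estimate = .pred_xgb)
rmse(results, truth = truth, estimate = .pred_gbm)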

You can test by going to https://supervised-ml-course.netlify.com/chapter4 in incognito mode, then clicking "Show solution" and "Run Code".


Please let me know if I can help with a fix!

Error in predict.randomForest when running Chapter 1 locally

First of all, this tutorial is amazing. Content, pace, level of detail. I love it.

I encountered one issue with random forest when following the code of Chapter 1 locally. I'm on tidymodels 0.1.2 and randomForest 4.6-14 running in Windows.

While I found a solution by mutating the character columns in car_vars to factors, I have no idea why the code that works on Netlify did not work locally.

Running predict(fit_rf, car_train) returned:

Error in predict.randomForest(object = object$fit, newdata = new_data) : 
New factor levels not present in the training data

To reproduce:

install.packages(c("tidymodels","randomForest"))
library(tidymodels)
csv_url <- "https://raw.githubusercontent.com/juliasilge/supervised-ML-case-studies-course/master/data/cars2018.csv"
download.file(csv_url,"cars.csv")
cars <- readr::read_csv("cars.csv")

set.seed(1234)

car_vars <- cars %>%
  select(-model, -model_index)

car_split <- car_vars %>%
  initial_split(prop = 0.8,
                strata = aspiration)

car_train <- training(car_split)

rf_mod <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("randomForest")

fit_rf <- rf_mod %>%
  fit(log(mpg) ~ ., 
      data = car_train) 

results <- car_train %>%
  mutate(mpg = log(mpg)) %>%
  bind_cols(predict(fit_rf, car_train) %>%
              rename(.pred_rf = .pred))

What I noticed is that all the levels in str(fit_rf[["fit"]][["forest"]][["xlevels"]]) were numeric (contrary to the model stored in data/c1_fit_rf.rds). Maybe someone here could explain why, since randomForest is new to me?

The solution to the error was to convert the character columns to factors: car_vars <- mutate(car_vars, across(where(is.character), as.factor))

Chapter 2, step 12: internal error: unknown 'composition' type

Something appears to be going wrong at this stage in the solution code

predict(stack_glm, stack_test)

Error: Internal error: Unknown composition type.
Traceback:

  1. predict(stack_glm, stack_test)
  2. predict.workflow(stack_glm, stack_test)
  3. forge_predictors(new_data, workflow)
  4. hardhat::forge(new_data, blueprint = mold$blueprint)
  5. forge.data.frame(new_data, blueprint = mold$blueprint)
  6. blueprint$forge$process(blueprint = blueprint, predictors = predictors,
    . outcomes = outcomes, extras = extras)
  7. forge_recipe_default_process_predictors(blueprint = blueprint,
    . predictors = predictors)
  8. recompose(predictors, blueprint$composition)
  9. abort("Internal error: Unknown composition type.")
  10. signal_abort(cnd)

Chapter 1, Exercise 5, Slide 5 - variable continuity

A very finicky point, but I mention it for continuity. Through the rest of the chapter, the variables for the training and testing set are "car_train" and "car_test", respectively, but on this slide, "car_training" and "car_testing" are used.

juice() is deprecated - Chp 2 Sec 7

Recipes v0.1.14 has deprecated juice() in favor of bake(x, new_data = NULL). Chapter 2, section 7, has the first reference to juice(). Code and explanatory text will both need updating.
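A minimal sketch of the replacement call, using a toy recipe rather than the chapter's objects (requires recipes 0.1.14 or later):

library(recipes)

# toy data and recipe, just to show the replacement call
toy <- data.frame(y = c(1, 2, 3), x = c(0.5, 1.5, 2.5))
toy_prep <- prep(recipe(y ~ x, data = toy))

# old (now deprecated): juice(toy_prep)
# new equivalent:
bake(toy_prep, new_data = NULL)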

Long success messages can break mid-word

Long success messages can break mid-word, like this example from Chapter 4, section 11:

Great job! In this case study, you will evaluate this model, plus two kinds of gradient boosting models.

I think one well-placed call to stringr::str_wrap would fix them all (and preserve the emojis, hopefully), but I can't find where to place it, or how to test it.
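For illustration, a small sketch of what str_wrap does with one of the long messages (the width here is just an example value, not the course's actual setting):

library(stringr)

msg <- "Great job! In this case study, you will evaluate this model, plus two kinds of gradient boosting models."

# wraps at word boundaries instead of breaking mid-word
cat(str_wrap(msg, width = 50))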


e1071 package is required

Some exercises are not working due to this error:

Error: package e1071 is required
(e.g., Exercise 11 of chapter 2)

When I tried to install this package on Binder, everything failed.

Factor/character errors

Hello!
I'm going over Chapter 1. In sections 8 & 9:

results <- car_test %>%
    mutate(MPG = log(MPG)) %>%
    bind_cols(predict(fit_lm, car_test) %>%
                  rename(.pred_lm = .pred)) %>%
    bind_cols(predict(fit_rf, car_test) %>%
                  rename(.pred_rf = .pred))

Produces:

Error in predict.randomForest(object = object$fit, newdata = new_data) : 
  New factor levels not present in the training data

This was solved prior to splitting the data by

car_vars<- cars2018 %>%
  select(-Model, -`Model Index`) %>% 
  mutate(across(where(is.character), as.factor))

However, in section 11


car_boot <- bootstraps(car_train)
rf_res <- rf_mod %>%
    fit_resamples(
        MPG ~ .,
        resamples = car_boot,
        control = control_resamples(save_pred = TRUE)
    )

produces

x Bootstrap01: formula: Error: Functions involving factors or characters have been detected on the RHS of formula. These are not allowed when indicators = "none". Functions involving factors were detected for the following columns: 'Lockup Torque Converter', 'Recommended Fuel', 'Fuel injection'.

I did notice that the tidymodels version for the course is 0.1.0 and mine is 0.1.1. Is it just a version issue, or do you have any advice on how to solve the previous error message?
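One possible workaround, sketched below (not the course's official fix, and assuming the car_train and rf_mod objects from above): move preprocessing into a recipe and resample a workflow, so the factor columns with spaces in their names are not handled through the formula interface.

library(tidymodels)

# assumes car_train and rf_mod exist as defined earlier in this issue
car_recipe <- recipe(MPG ~ ., data = car_train)

rf_wf <- workflow() %>%
    add_recipe(car_recipe) %>%
    add_model(rf_mod)

car_boot <- bootstraps(car_train)

rf_res <- rf_wf %>%
    fit_resamples(
        resamples = car_boot,
        control = control_resamples(save_pred = TRUE)
    )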

Regards,
Maria

Confusing name of variable

Hi,

in the 4th case study, the section "4 - Tidy the survey data" says:

  • Group by age and summarize the value column to see how the overall agreement with all questions varied by age.
  • Count the value column to check out how many respondents agreed or disagreed overall.

The real name of the variable is rating and not value, since that is the name given to it in the pivot_longer() call in the previous chunk of code. I realized later that value is probably written on purpose to emphasize that it is the column containing the values and not the keys. Still, since it is written with the syntax usually used for literal variable names, it seems a little confusing, at least to me.
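A toy sketch of the naming point (the data and column names below are made up, not the course's survey data): whatever name is passed to values_to becomes the column you group by and summarize.

library(dplyr)
library(tidyr)

survey <- tibble(age = c(20, 20, 30, 30),
                 q1  = c(4, 2, 5, 3),
                 q2  = c(5, 1, 4, 2))

survey_tidy <- survey %>%
    pivot_longer(-age, names_to = "question", values_to = "rating")

# the long column is therefore called rating, not value
survey_tidy %>%
    group_by(age) %>%
    summarise(rating = mean(rating))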

Great course! :)

Which tests do/do not work?

Hello @flor14! I have been going through the course, fixing various bits. I am noticing that various tests do not currently work, but I can't tell what it is about the tests that leads them to not work. Do you know? It doesn't seem like it is single vs. double quotes, or emoji. Do you have other ideas?

Error in Chapter 3 model when running locally

I am trying to run the following code from Chapter 3 on my machine.

set.seed(234)

rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

And I receive the following error related to the metrics argument:

Error: 
The combination of metric functions must be:
- only numeric metrics
- a mix of class metrics and class probability metrics

The following metric function types are being mixed:
- prob (roc_auc)
- class (sens)
- other (spec <namespace:read>)

The same happens when I am running the glm model.

If I remove the metrics argument, everything works fine. But then I have to find alternative ways of computing roc_auc, sens, and spec (e.g., as shown by @juliasilge elsewhere).
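A minimal sketch of one possible workaround (assuming the vote_wf and vote_folds objects from the chapter): the "other (spec <namespace:read>)" line in the error suggests spec is being picked up from readr's namespace rather than yardstick's, so namespace-qualifying the metric functions may avoid the clash.

library(tidymodels)

# qualify sens and spec so yardstick's class metrics are used, not readr::spec
# (assumes vote_wf and vote_folds exist as above)
rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, yardstick::sens, yardstick::spec),
        control = control_resamples(save_pred = TRUE)
    )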

Chapter 2 - recipes::step_downsample is deprecated

Chapter 2 is calling recipes::step_downsample(), which is now deprecated in favor of themis::step_downsample(), for users running current tidymodels versions locally. A simple call to library(themis) in the setup should quiet this problem. 😄 Thanks for making this course available!
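A toy sketch of the updated call (the data frame and outcome column below are made up, not the chapter's Stack Overflow data): loading themis makes step_downsample() available from its new home.

library(recipes)
library(themis)

# made-up imbalanced data
toy <- data.frame(
    outcome = factor(c(rep("yes", 8), rep("no", 2))),
    x       = rnorm(10)
)

down_rec <- recipe(outcome ~ x, data = toy) %>%
    step_downsample(outcome) %>%
    prep()

# the downsampled training data now has balanced classes
bake(down_rec, new_data = NULL)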

Ch. 4 Part 10 Typo

"If you stop ๐Ÿ›‘ to think aout" should read If you stop ๐Ÿ›‘ to think about"

Chapter 3.14.2 R Session Aborted when running locally

I have the following code, which matches the course solution:

library(tidymodels)
library(themis)

vote_train <- readRDS("data/c3_train_10_percent.rds")

vote_folds <- vfold_cv(vote_train, v = 10)

vote_recipe <- recipe(turnout16_2016 ~ ., data = vote_train) %>% 
    step_upsample(turnout16_2016)

rf_spec <- rand_forest() %>%
    set_engine("ranger") %>%
    set_mode("classification")

vote_wf <- workflow() %>%
    add_recipe(vote_recipe) %>%
    add_model(rf_spec)

set.seed(234)
rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

glimpse(rf_res)

When I run the following code chunk locally, it results in R Session Aborted.

rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

I'm not sure if it is related to the ranger library, but I did not get any errors when running 3.14.1 (i.e., logistic regression), or when I updated set_engine("ranger") to set_engine("randomForest") for 3.14.2.

I can also confirm that the package versions in my local environment align with those used in the course.

Chapter 4, exercise 3 - unnecessary instruction to load tidyverse

In exc_04_03.R, there's an instruction to load the tidyverse. But tidyverse is already loaded in line 1 of the snippet (for readr::read_csv).

There's corresponding copy for that instruction in chapter_4.md, too: "Load the tidyverse package, for functions to manipulate data from dplyr and tidyr and visualize data from ggplot2."

Happy to submit a pull req. Really appreciate the effort you've put into the course, and especially the work in porting it and re-launching it.

Exercise 2.11

Dataset c2_testing_one_percent.rds is missing or has another name. Also, it is never used in exercise 2.11 parts 1 and 2.

chapter 1, exercise 8 - Wrong rds file?

Running exercises 8 and 9 in chapter 1 produces similar errors due to the training and testing objects not existing in the environment:

Error: Evaluation error: object 'training' not found.
Traceback:

1. car_train %>% mutate(`Linear regression` = predict(fit_lm, training), 
 .     `Random forest` = predict(fit_rf, training))
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. mutate(., `Linear regression` = predict(fit_lm, training), `Random forest` = predict(fit_rf, 
 .     training))
10. mutate.tbl_df(., `Linear regression` = predict(fit_lm, training), 
  .     `Random forest` = predict(fit_rf, training))
11. mutate_impl(.data, dots, caller_env())
