juliasilge / supervised-ml-case-studies-course

Supervised machine learning case studies in R! 💫 A free interactive tidymodels course

Home Page: https://supervised-ml-course.netlify.app/

License: MIT License

R 24.49% JavaScript 29.47% CSS 35.35% Sass 10.69%
rstats machine-learning supervised-machine-learning online-course

supervised-ml-case-studies-course's Introduction

Welcome to the course repo for Supervised Machine Learning: Case Studies in R! 🎉

You can access this course for free online.

This course approaches supervised machine learning using the tidymodels framework.

The interactive course site is built on the amazing framework created by Ines Montani, originally built for her spaCy course. The front-end is powered by Gatsby and Reveal.js and the back-end code execution uses Binder. Florencia D'Andrea helped port the course materials and made the fun logo.


To learn more about building a course on this framework, see Ines's starter repos for making courses in Python and R, and her explanation of how the framework works at the original course repo. The original version of this course based on the R package caret is available here.

Please note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

The course material in this course is licensed CC-BY-SA, meaning you are free to use it, change it, and remix it as long as you give appropriate credit and distribute any new materials under the same license. The code is MIT-licensed.


supervised-ml-case-studies-course's Issues

Chapter 3.6 Error in initial_split when running locally

Love the tidymodels course!

I came across an error when running the following code snippet in chapter 3.6 locally.

vote_split <- voters_select %>%
    initial_split(p = 0.8,
                  strata = turnout16_2016)

Error in initial_split(., p = 0.8, strata = turnout16_2016, ) : argument 2 matches multiple formal arguments

When I changed the code snippet to the following, everything worked as expected:

vote_split <- voters_select %>%
    initial_split(prop = 0.8,
                  strata = turnout16_2016)

I'm not sure whether "p" was previously an accepted abbreviation of "prop", but changing "p" to "prop" fixed the error.

Issue in chapter 2 (Remote counting)

I'm following the supervised ML case studies course. I'm trying to run the following code:

stackoverflow %>%
    count(Remote)

# and

simple_glm <- stackoverflow %>%
    select(-Respondent) %>%
    glm(Remote ~ .,
        family = "binomial",
        data = .)

I'm getting this message: "Error: Column Remote is unknown"

Since I'm a very new R user, this issue could be a very basic one.

Thanks,

Federico

Chapter 2, Section 4 - copy-paste error?

The instruction "In the calls to count(), check out the distributions for remote status first, and then country." looks like it's repeated in chapter 2, section 4. Copy-paste issue?

Chapter 4, Exercise 13: xgboost performing better than GBM

Hey Julia!

THANK YOU SO MUCH for an amazing course. I have been impressed end to end and I am definitely going to share it with a bunch of folks.

I got tripped up on something and wanted to see if it's a mistake, maybe with the saved model RDS, or updated default parameters.

In Chapter 4, Exercise 13, the RMSE for the xgboost predictions (12.3472006) is smaller than the RMSE for the GBM predictions (12.4721538), but the lesson goes on to say that GBM performed best.
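For reference, here is a toy sketch of how the two sets of predictions can be compared with yardstick; the values and column names below are made up, not the course's saved models. The model with the smaller RMSE is the better performer.

library(dplyr)
library(yardstick)

# made-up predictions standing in for the xgboost and GBM results
results <- tibble(
    truth     = c(10.0, 12.0, 14.0, 16.0),
    .pred_xgb = c(10.2, 11.8, 14.1, 15.7),
    .pred_gbm = c(10.5, 11.5, 14.6, 15.2)
)

# smaller RMSE = better fit
rmse(results, truth = truth, estimate = .pred_xgb)
rmse(results, truth = truth, estimate = .pred_gbm)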

You can test by going to https://supervised-ml-course.netlify.com/chapter4 in incognito mode, then clicking "Show solution" and "Run Code".


Please let me know if I can help with a fix!

Error in predict.randomForest when running Chapter 1 locally

First of all, this tutorial is amazing. Content, pace, level of detail. I love it.

I encountered one issue with random forest when following the code of Chapter 1 locally. I'm on tidymodels 0.1.2 and randomForest 4.6-14 running in Windows.

While I found a solution by mutating the character columns in car_vars to factors, I have no idea why the code that works on Netlify did not work locally.

Running predict(fit_rf, car_train) returned:

Error in predict.randomForest(object = object$fit, newdata = new_data) : 
New factor levels not present in the training data

To reproduce:

install.packages(c("tidymodels","randomForest"))
library(tidymodels)
csv_url <- "https://raw.githubusercontent.com/juliasilge/supervised-ML-case-studies-course/master/data/cars2018.csv"
download.file(csv_url,"cars.csv")
cars <- readr::read_csv("cars.csv")

set.seed(1234)

car_vars <- cars %>%
  select(-model, -model_index)

car_split <- car_vars %>%
  initial_split(prop = 0.8,
                strata = aspiration)

car_train <- training(car_split)

rf_mod <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("randomForest")

fit_rf <- rf_mod %>%
  fit(log(mpg) ~ ., 
      data = car_train) 

results <- car_train %>%
  mutate(mpg = log(mpg)) %>%
  bind_cols(predict(fit_rf, car_train) %>%
              rename(.pred_rf = .pred))

What I noticed is that all the levels in str(fit_rf[["fit"]][["forest"]][["xlevels"]]) were numeric (contrary to the model stored in data/c1_fit_rf.rds). Maybe someone here could explain why, since randomForest is new to me?

The solution to the error was to convert the character columns to factors: car_vars <- mutate(car_vars, across(where(is.character), as.factor))

Chapter 2, step 12: internal error: unknown 'composition' type

Something appears to be going wrong at this stage in the solution code

predict(stack_glm, stack_test)

Error: Internal error: Unknown composition type.
Traceback:

  1. predict(stack_glm, stack_test)
  2. predict.workflow(stack_glm, stack_test)
  3. forge_predictors(new_data, workflow)
  4. hardhat::forge(new_data, blueprint = mold$blueprint)
  5. forge.data.frame(new_data, blueprint = mold$blueprint)
  6. blueprint$forge$process(blueprint = blueprint, predictors = predictors,
    . outcomes = outcomes, extras = extras)
  7. forge_recipe_default_process_predictors(blueprint = blueprint,
    . predictors = predictors)
  8. recompose(predictors, blueprint$composition)
  9. abort("Internal error: Unknown composition type.")
  10. signal_abort(cnd)

Chapter 1, Exercise 5, Slide 5 - variable continuity

A very finicky point, but I mention it for continuity. Through the rest of the chapter, the variables for the training and testing set are "car_train" and "car_test", respectively, but on this slide, "car_training" and "car_testing" are used.

juice() is deprecated - Chp 2 Sec 7

Recipes v0.1.14 has deprecated juice() in favor of bake(x, new_data = NULL). Chapter 2, section 7, has the first reference to juice(). Code and explanatory text will both need updating.
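A minimal sketch of the replacement call, using a toy recipe rather than the chapter's objects (requires recipes 0.1.14 or later):

library(recipes)

# toy data and recipe, just to show the replacement call
toy <- data.frame(y = c(1, 2, 3), x = c(0.5, 1.5, 2.5))
toy_prep <- prep(recipe(y ~ x, data = toy))

# old (now deprecated): juice(toy_prep)
# new equivalent:
bake(toy_prep, new_data = NULL)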

Long success messages can break mid-word

Long success messages can break mid-word, like this example from Chapter 4, section 11:

Great job! In this case study, you will evaluate this model, plus two kinds of gradient boosting models.

I think one well-placed call to stringr::str_wrap would fix them all (and preserve the emojis, hopefully), but I can't find where to place it, or how to test it.
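For illustration, a small sketch of what str_wrap does with one of the long messages (the width here is just an example value, not the course's actual setting):

library(stringr)

msg <- "Great job! In this case study, you will evaluate this model, plus two kinds of gradient boosting models."

# wraps at word boundaries instead of breaking mid-word
cat(str_wrap(msg, width = 50))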


e1071 package is required

Some exercises are not working due to this error:

Error: package e1071 is required
(e.g., Exercise 11 of chapter 2)

When I tried to install this package on Binder, everything failed.

Factor/character errors

Hello!
I'm going over Chapter 1. In sections 8 & 9:

results <- car_test %>%
    mutate(MPG = log(MPG)) %>%
    bind_cols(predict(fit_lm, car_test) %>%
                  rename(.pred_lm = .pred)) %>%
    bind_cols(predict(fit_rf, car_test) %>%
                  rename(.pred_rf = .pred))

Produces:

Error in predict.randomForest(object = object$fit, newdata = new_data) : 
  New factor levels not present in the training data

This was solved prior to splitting the data by

car_vars<- cars2018 %>%
  select(-Model, -`Model Index`) %>% 
  mutate(across(where(is.character), as.factor))

However, in section 11


car_boot <- bootstraps(car_train)
rf_res <- rf_mod %>%
    fit_resamples(
        MPG ~ .,
        resamples = car_boot,
        control = control_resamples(save_pred = TRUE)
    )

produces

x Bootstrap01: formula: Error: Functions involving factors or characters have been detected on the RHS of formula. These are not allowed when indicators = "none". Functions involving factors were detected for the following columns: 'Lockup Torque Converter', 'Recommended Fuel', 'Fuel injection'.

I did notice that the tidymodels version for the course is 0.1.0 and mine is 0.1.1. Is it just a version issue, or do you have any advice on how to solve the previous error message?
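One possible workaround, sketched below (not the course's official fix, and assuming the car_train and rf_mod objects from above): move preprocessing into a recipe and resample a workflow, so the factor columns with spaces in their names are not handled through the formula interface.

library(tidymodels)

# assumes car_train and rf_mod exist as defined earlier in this issue
car_recipe <- recipe(MPG ~ ., data = car_train)

rf_wf <- workflow() %>%
    add_recipe(car_recipe) %>%
    add_model(rf_mod)

car_boot <- bootstraps(car_train)

rf_res <- rf_wf %>%
    fit_resamples(
        resamples = car_boot,
        control = control_resamples(save_pred = TRUE)
    )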

Regards,
Maria

Confusing name of variable

Hi,

in the 4th case study, the section "4 - Tidy the survey data" says:

  • Group by age and summarize the value column to see how the overall agreement with all questions varied by age.
  • Count the value column to check out how many respondents agreed or disagreed overall.

The real name of the variable is rating and not value, since that is the name given to it in the pivot_longer() call in the previous chunk of code. I realized later that value is probably written on purpose to emphasize that it is the column containing the values and not the keys. Still, since it is written with the syntax usually used for literal variable names, it seems a little confusing, at least to me.
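A toy sketch of the naming point (the data and column names below are made up, not the course's survey data): whatever name is passed to values_to becomes the column you group by and summarize.

library(dplyr)
library(tidyr)

survey <- tibble(age = c(20, 20, 30, 30),
                 q1  = c(4, 2, 5, 3),
                 q2  = c(5, 1, 4, 2))

survey_tidy <- survey %>%
    pivot_longer(-age, names_to = "question", values_to = "rating")

# the long column is therefore called rating, not value
survey_tidy %>%
    group_by(age) %>%
    summarise(rating = mean(rating))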

Great course! :)

Which tests do/do not work?

Hello @flor14! I have been going through the course, fixing various bits. I am noticing that various tests do not currently work, but I can't tell what it is about the tests that leads them to not work. Do you know? It doesn't seem like it is single vs. double quotes, or emoji. Do you have other ideas?

Error in Chapter 3 model when running locally

I am trying to run the following code from Chapter 3 on my machine.

set.seed(234)

rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

And I receive the following error related to the metrics argument:

Error: 
The combination of metric functions must be:
- only numeric metrics
- a mix of class metrics and class probability metrics

The following metric function types are being mixed:
- prob (roc_auc)
- class (sens)
- other (spec <namespace:read>)

The same happens when I am running the glm model.

If I remove the metrics argument, everything works fine. But then I have to find alternative ways of computing roc_auc, sens, and spec (e.g., as shown by @juliasilge elsewhere).
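A minimal sketch of one possible workaround (assuming the vote_wf and vote_folds objects from the chapter): the "other (spec <namespace:read>)" line in the error suggests spec is being picked up from readr's namespace rather than yardstick's, so namespace-qualifying the metric functions may avoid the clash.

library(tidymodels)

# qualify sens and spec so yardstick's class metrics are used, not readr::spec
# (assumes vote_wf and vote_folds exist as above)
rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, yardstick::sens, yardstick::spec),
        control = control_resamples(save_pred = TRUE)
    )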

Chapter 2 - recipes::step_downsample is deprecated

Chapter 2 is calling recipes::step_downsample(), which is now deprecated in favor of themis::step_downsample(), for users running current tidymodels versions locally. A simple call to library(themis) in the setup should quiet this problem. 😄 Thanks for making this course available!
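A toy sketch of the updated call (the data frame and outcome column below are made up, not the chapter's Stack Overflow data): loading themis makes step_downsample() available from its new home.

library(recipes)
library(themis)

# made-up imbalanced data
toy <- data.frame(
    outcome = factor(c(rep("yes", 8), rep("no", 2))),
    x       = rnorm(10)
)

down_rec <- recipe(outcome ~ x, data = toy) %>%
    step_downsample(outcome) %>%
    prep()

# the downsampled training data now has balanced classes
bake(down_rec, new_data = NULL)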

Ch. 4 Part 10 Typo

"If you stop ๐Ÿ›‘ to think aout" should read If you stop ๐Ÿ›‘ to think about"

Chapter 3.14.2 R Session Aborted when running locally

I have the following code, which matches the course solution:

library(tidymodels)
library(themis)

vote_train <- readRDS("data/c3_train_10_percent.rds")

vote_folds <- vfold_cv(vote_train, v = 10)

vote_recipe <- recipe(turnout16_2016 ~ ., data = vote_train) %>% 
    step_upsample(turnout16_2016)

rf_spec <- rand_forest() %>%
    set_engine("ranger") %>%
    set_mode("classification")

vote_wf <- workflow() %>%
    add_recipe(vote_recipe) %>%
    add_model(rf_spec)

set.seed(234)
rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

glimpse(rf_res)

When I run the following code chunk locally, it results in R Session Aborted.

rf_res <- vote_wf %>%
    fit_resamples(
        vote_folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
    )

I'm not sure if it is related to the ranger library, but I did not get any errors when running 3.14.1 (i.e., logistic regression), or when I updated set_engine("ranger") to set_engine("randomForest") for 3.14.2.

I can also confirm that the package versions in my local environment align with those used in the course.

Chapter 4, exercise 3 - unnecessary instruction to load tidyverse

In exc_04_03.R, there's an instruction to load the tidyverse. But tidyverse is already loaded in line 1 of the snippet (for readr::read_csv).

There's corresponding copy for that instruction in chapter_4.md, too: "Load the tidyverse package, for functions to manipulate data from dplyr and tidyr and visualize data from ggplot2."

Happy to submit a pull req. Really appreciate the effort you've put into the course, and especially the work in porting it and re-launching it.

Exercise 2.11

Dataset c2_testing_one_percent.rds is missing or has another name. Also, it is never used in exercise 2.11 parts 1 and 2.

chapter 1, exercise 8 - Wrong rds file?

Running exercises 8 and 9 in chapter 1 produces similar errors due to the training and testing objects not existing in the environment:

Error: Evaluation error: object 'training' not found.
Traceback:

1. car_train %>% mutate(`Linear regression` = predict(fit_lm, training), 
 .     `Random forest` = predict(fit_rf, training))
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. mutate(., `Linear regression` = predict(fit_lm, training), `Random forest` = predict(fit_rf, 
 .     training))
10. mutate.tbl_df(., `Linear regression` = predict(fit_lm, training), 
  .     `Random forest` = predict(fit_rf, training))
11. mutate_impl(.data, dots, caller_env())
