dlab-berkeley / r-machine-learning-legacy

D-Lab's 6 hour introduction to machine learning in R. Learn the fundamentals of machine learning, regression, and classification, using tidymodels in R.

License: Other

R 100.00%
machine-learning r data-science tidymodels classification regression rsample

r-machine-learning-legacy's Introduction

D-Lab R Machine Learning with tidymodels

License: CC BY 4.0

This repository contains the materials for D-Lab's R Machine Learning with tidymodels. Prior experience with the concepts in R Fundamentals and Data Wrangling and Manipulation in R is assumed.

Workshop Goals

In this workshop, we provide an introduction to machine learning algorithms by making use of the tidymodels package. First, we discuss what machine learning is, what problems it works well for, and what problems it might work less well for. Then, we'll explore the tidymodels framework to learn how to fit machine learning models in R. Finally, we will apply the tidymodels framework to explore multiple machine learning algorithms in R.

By the end of the workshop, learners should feel prepared to explore machine learning approaches for their data problems.

Familiarity with R programming and data wrangling is assumed. If you are not familiar with the materials in Data Wrangling and Manipulation in R, we recommend attending that workshop first. In addition, this workshop focuses on how to implement machine learning approaches. Learners will likely benefit from previous exposure to statistics.

Installation Instructions

We will use RStudio to go through the workshop materials, which requires the installation of both the R language and the RStudio software. Complete the following steps:

  1. Download R: Follow the links according to the operating system that you are running. Download the package, and install R onto your computer. You should install the most recent version (at least version 4.0).

  2. Download RStudio: Install RStudio Desktop. This should be free. Do this after you have already installed R. The D-Lab strongly recommends an RStudio edition of 2022.02.0+443 "Prairie Trillium" or higher.

  3. Download these workshop materials:

  • Click the green "Code" button in the top right of the repository information.
  • Click "Download Zip".
  • Extract this file to a folder on your computer where you can easily access it (we recommend Desktop).
  4. Optional: if you're familiar with git, you can instead clone this repository by opening a terminal and entering git clone git@github.com:dlab-berkeley/R-Machine-Learning.git.

  5. Be sure to run the install.R script in the repository so that all necessary packages are installed.
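
The contents of install.R are not reproduced on this page. As a minimal sketch of what such a script might do, the snippet below works out which workshop packages are still missing so that only those need to be installed. The package list is the one quoted in the installation issue further down this page and may not match the current install.R; `missing_packages()` and `install_missing()` are hypothetical helper names, not functions from the repository.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the missing packages, if there are any.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Package list quoted in the installation issue on this page (assumed, may be outdated).
workshop_pkgs <- c(
  "tidyverse", "tidymodels", "here", "pROC", "glmnet", "ranger", "rpart",
  "xgboost", "rpart.plot", "doParallel", "palmerpenguins", "ISLR2", "klaR",
  "stacks"
)

# Report what is missing without installing anything yet.
to_install <- missing_packages(workshop_pkgs)
print(to_install)
```

Calling `install_missing(workshop_pkgs)` would then install whatever `to_install` lists. Checking first avoids reinstalling large packages such as tidyverse on every run.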

Is R Not Working on Your Laptop?

This workshop makes use of many packages within the R ecosystem. For that reason, we recommend using R on your local machine.

If you do not have R installed and the materials loaded on your computer by the time the workshop starts, we strongly recommend using the UC Berkeley DataHub to run the materials for these lessons. You can access the DataHub by clicking the following button:

DataHub

Some users may have to click the link twice if the materials do not load initially.

The DataHub downloads this repository, along with any necessary packages, and allows you to run the materials in an RStudio instance on UC Berkeley's servers. No installation is needed from your end - you only need an internet browser and a CalNet ID to log in. By using the DataHub, you can save your work and come back to it at any time. When you want to return to your saved work, go straight to DataHub, sign in, and click on the R-Machine-Learning folder.

If you don't have a Berkeley CalNet ID, you can still run these lessons in the cloud, by clicking this button:

Binder

If you are loading Binder with this repository for the first time, it may take a few minutes to set up. Binder operates similarly to the D-Lab DataHub, but on a different set of servers. Unlike the DataHub, however, Binder does not let you save your work.

Run the Code

Now that you have all the required software and materials, you need to run the code:

  1. Launch the RStudio software.

  2. Use the file navigator to find the R-Machine-Learning folder you downloaded from GitHub.

  3. Open up the file corresponding to the part of the workshop you're attending.

  4. If necessary, run install.R to make sure the requisite packages are installed. This should not be necessary on Binder.

  5. Place your cursor on a given line and press "Command + Enter" (Mac) or "Control + Enter" (PC) to run an individual line of code.

  6. The solutions folder contains the solutions to the challenge problems.

Additional Resources

This workshop draws heavily on the following resources:

Other D-Lab R Workshops

Basic Competency

Intermediate/Advanced Competency

Contributors

Previous iterations of D-Lab's Machine Learning with R were created by:

r-machine-learning-legacy's People

Contributors

asteves avatar averysaurus avatar ck37 avatar henchc avatar heroashman avatar jaeyk avatar pssachdeva avatar


r-machine-learning-legacy's Issues

Error producing metrics and confusion plot

In lines 251-256 of 04-decision-trees.rmd these two plots run into errors. I think it's because tree_fit_viz_metr and tree_fit_viz_mat are not defined already. Here is the relevant code:

# Metrics
(tree_fit_viz_metr + labs(title = "Non-tuned")) / (visualize_class_eval(tree_fit_tuned) + labs(title = "Tuned"))

# Confusion matrix
(tree_fit_viz_mat + labs(title = "Non-tuned")) / (visualize_class_conf(tree_fit_tuned) + labs(title = "Tuned"))

Stacks Still broken

Error message

Error: The inputted candidates argument was not generated with the appropriate control settings. Please see ?control_stack
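
The workshop's stacking code is not shown in this issue, so the following is only a sketch of the pattern the error message points to: candidates passed to `add_candidates()` generally need to be fit with one of the stacks control helpers, such as `control_stack_resamples()` or `control_stack_grid()`, so that predictions and workflows are saved. The model, formula, and fold count below are assumptions chosen for illustration.

```r
library(tidymodels)
library(stacks)

set.seed(23)

# Illustrative data: penguins with missing bill lengths dropped.
penguins <- palmerpenguins::penguins %>%
  filter(!is.na(bill_length_mm))

folds <- vfold_cv(penguins, v = 5)

# A simple candidate model (assumed for illustration, not the workshop's model).
lm_wf <- workflow() %>%
  add_formula(body_mass_g ~ bill_length_mm + flipper_length_mm) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# The control object makes fit_resamples() keep what stacks needs.
lm_res <- fit_resamples(
  lm_wf,
  resamples = folds,
  control = control_stack_resamples()
)

# With the control settings in place, the candidate can be added to a stack.
model_stack <- stacks() %>%
  add_candidates(lm_res)
```

For models tuned with `tune_grid()`, the analogous helper is `control_stack_grid()`.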

Typo in Part3

Line 70 in Part3.md should be set_mode rather than set_model.
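
Line 70 of Part3.md is not reproduced here. As a hedged illustration of the corrected call, a parsnip model specification sets its mode with `set_mode()`; the decision tree and the "rpart" engine below are assumed stand-ins for whatever model Part3 actually uses.

```r
library(parsnip)

# set_mode() (not set_model()) declares whether the model does
# classification or regression.
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Inspect the mode stored on the specification.
tree_spec$mode
```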

Clarify installation workflow

From the instructions it is unclear which step is supposed to be performed first:

  1. The first line in Part1 renv::init()
  2. The 5th step in installation instructions install.packages(c("tidyverse", "tidymodels", "here","pROC","glmnet", "ranger", "rpart", "xgboost","rpart.plot", "doParallel", "palmerpenguins", "ISLR2", "klaR", "stacks"))

Solutions for 04_regularization

Solutions for 04_regularization should include:

# Import data
penguins <- palmerpenguins::penguins %>%
  filter(!is.na(bill_length_mm))

# Set seed
set.seed(23)

# Perform split
penguin_split <- penguins %>% initial_split(prop = 0.80)
penguins_train <- training(penguin_split)
penguins_test <- testing(penguin_split)

Participants will run into an error if missing values in bill_length_mm are not dropped.

Binder link option doesn't give an RStudio environment to run code

In the installation instructions the step:

If you don't have a Berkeley CalNet ID, you can still run these lessons in the cloud by clicking this button: Binder_link_here

This leads you to a Jupyter environment that cannot run code in the .Rmd lesson files.

I suggest we remove this option.

Workshop Title

Workshop title should be "R-Introduction-to-Machine-Learning-with-tidymodels"

recommendations for additional restructuring

I delivered the workshop with these current materials, and things went pretty smoothly, but it's too much material for two 3-hour workshops.

I'd recommend splitting this workshop into either three or four 2-hour workshops. This could look something like:

  • Part 1: introduction and regression
  • Part 2: preprocessing
  • Part 3: regularization and cross-validation
  • Part 4: more models

Perhaps Parts 1/2 could be merged into a single workshop.

I'd also recommend beefing up the "more models" section to more appropriately fill a 2-hour slot: more details on logistic regression, naive Bayes, and random forest.

04.Decision_Trees.RMD `makeCluster()` function error

Running the part 4 .RMD locally, I'm getting an error on line 197 with:

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

I believe this function optimizes a local environment for heavier computation? I've dug into the docs and arguments get deep into CS, beyond my scope at this time. It's possible to continue downstream in the workflow without these lines being run, which is what I've done so far.
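
For context, `makeCluster()` starts a set of worker R processes on the local machine and `registerDoParallel()` registers them as the backend for foreach-based code, so the step is a speed-up rather than a requirement. The sketch below is one possible way to make it fail softly; `start_cluster()` is a hypothetical helper, not part of the workshop materials.

```r
library(parallel)

# Hypothetical helper: start a local cluster, or return NULL if setup fails.
start_cluster <- function() {
  n_cores <- detectCores()
  # detectCores() can return NA; fall back to a single worker in that case.
  n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
  tryCatch(
    makeCluster(n_workers),
    error = function(e) {
      message("Cluster setup failed, running sequentially: ", conditionMessage(e))
      NULL
    }
  )
}

cl <- start_cluster()

if (!is.null(cl)) {
  # Register the workers only if doParallel is installed.
  if (requireNamespace("doParallel", quietly = TRUE)) {
    doParallel::registerDoParallel(cl)
  }
  # ... the tuning code would run here ...
  stopCluster(cl)
}
```

If `cl` is `NULL`, the rest of the notebook can still run; it simply uses a single process and takes longer.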

Use `renv` to ensure that attendees have the required packages

While the README contains instructions on installing the necessary packages for the workshop, it might be useful to have an renv setup so that users could simply run a renv::init() at the beginning of the first notebook in order to make sure all necessary packages are installed.

Add datahub and binder links and icons

@pssachdeva I just noticed this repo is missing binderhub links in the README: https://github.com/dlab-berkeley/R-Machine-Learning/blob/ee158d8506d68e72498a36d7484a403ed5b4b506/README.md?plain=1#L81-L87

Any chance you could add it and test it in time for today's workshop at 10am since we have folks from partner organizations who may need a binderized version?

Also, it would be great if you could add the iconified buttons like the ones shown in the R Fundamentals README.

Thanks!

Hard stop errors when running .RMD(s) on my local environment.

  • Overview-1.RMD
    needed to:

install.packages("rlang")

before installation of all packages chunk, or else ggplot2 would not load.

  • 04- Decision Trees, ln. 196
    Error in makePSOCKcluster(names = spec, ...) : Cluster setup failed. 4 of 4 workers failed to connect.

Error did not break the .RMD workflow.

  • 05 - Random Forest
    update_model() function, same error as above

  • 06- xgboost
    update_model(), same error, ln. 351

  • 09 - hlclust

ln. 37:
Error: Problem with mutate() input ..1.
x there is no package called ‘BBmisc’
ℹ Input ..1 is across(is.numeric, BBmisc::normalize)

install.packages("BBmisc")

library(BBmisc)

and np.

Simplify and update renv package installation process

The renv can be simplified to remove unnecessary packages (to speed up (re-)installation) and replaced with an exact step-by-step process for people to follow. Some participants struggled with package installation as it is currently written.

https://github.com/dlab-berkeley/R-Machine-Learning/blob/ee158d8506d68e72498a36d7484a403ed5b4b506/README.md?plain=1#L43-L75

Also, install_github() should be replaced with standard install.packages() where possible, e.g. remotes::install_github("tidymodels/discrim") can just be install.packages("discrim") since the standard v1.0.0 package has been released now.

This caused additional issues because it assumes they have already run install.packages("remotes") because it is not explicitly stated in the instructions, so add that if there's a good reason to keep any install_github calls around.

Participant Instructions to run 01_overview.RMD

It may help students and instructors if we include the "Participant Instructions" closer to the top of the README, in large type, with a message something like:

"If you are working in a local computing environment, be sure to prepare for the workshop by running all the code in the 01.Overview.RMD before our first meeting. Doing this could take a few minutes, and is needed for you to run the workshop's code on your own computer."

Datahub-Rstudio package loading issues

When I open this workshop in datahub via the gitpuller, and run the first line:

renv::init()

I get the error message:

> knitr::opts_chunk$set(echo = TRUE)
Error in loadNamespace(x) : there is no package called ‘knitr’

When I try to install knitr manually, it takes a very long time.
