data-lessons / gapminder-r Goto Github PK

View Code? Open in Web Editor NEW

4.0 4.0 9.0 34.83 MB

Home Page: http://data-lessons.github.io/gapminder-R/

License: Other

HTML 69.42% Makefile 0.59% R 0.76% Shell 0.02% CSS 18.12% JavaScript 11.09%

gapminder-r's People

Contributors

Stargazers

Watchers

Forkers

michaellevy noddymaree izahn devbioinfoguy hbs-rcs tracykteal smartgamer anhnguyendepocen ds777

gapminder-r's Issues

Move first dplyr exercise to after mutate

Need to move MCQ: Data Reduction {.challenge} to after the mutate portion of the lesson.

tidy exercise

The first challenge in the tidy data lesson, "Gather and plot" is contrived and awkward. A simple exercise that has students identify what's untidy about a table and work through the arguments to gather is needed.

Change wording of first dplyr exercise

Feedback from social scientists in room that you can't calculate daily income from gdp per capita.

Change wording to:

Produce a data.frame with only the names and years of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.

Tip: The gdpPercap variable is annual gdp per person. You’ll need to adjust.
Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.
What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?

Pre-workshop install & update check

Per-workshop instructions: install R and RStudio or update to the latest version of R and RStudio. Install tidyverse and update.packages. If library(tidyverse) produces anything but the usual, show up 30 minutes early.

Switch all the base read/write functions to tidyverse versions

Eg read.csv -> read_csv

plot types

Add breakdown of 1 vs. 2 vs. 3+ variable plots, with continuous and categorical variables. Make a table that they can see with a suggested first plot for each and the corresponding mappings and geom. I think this goes after introducing ggplot and geoms with scatterplots.

Add spreadsheet and OpenRefine modules

I think this is the repo that @tracykteal has taught in the morning of this workshop. @tracykteal, in talking with @devbioinfoguy et al, we were thinking about folding that into this repo. One advantage of leaving it elsewhere is that it can be developed independently and this workshop could use those advances. OTOH, having it here might promote tighter coupling.

Also, Is there a version of the spreadsheet lesson you like for this workshop?

Add MCQs to day 2 lessons

New "More fundamentals of R" module

I think there should be an additional module between ggplot and dplyr (which usually falls first thing day 2) that reviews the basics, refreshes students to R, and goes beyond what we can/should do in the intro R and data-types and subsetting lessons.

Why? By the time we get to the end of data-types and subsetting, students are bored of the basics (even if they recognize their importance, and even if they're struggling to keep up), and for that reason I've tried to minimize what's covered in those lessons, which means leaving out some important things, like NA-handling. Also, students come back on day 2 and try to jump right into dplyr and the onramp is too sharp/it's disconnected from what they just learned in ggplot. This will let them warmup, let us introduce more fundamentals, and also get to ggplot faster on day 1.

Use `<-` for assignment throughout

Use <- for assignment all the way through might help learners distinguish column creation vs data.frame assignment in mutate and summarize.

Can I add transitions to the next lesson at the bottom of each page?

e.g. at the bottom of intro rstudio I would link to project management.

tidyverse introduction

This should probably come at the beginning of ggplot, including install tidyverse.

Materials for teaching RMarkdown and KnitR

There has been some discussion about including RMarkdown and/or KnitR in this lesson. @marianschmi lessons might be a good resource for this

https://github.com/marschmi/RMarkdown_May11_2016

Intro R HBS debrief ideas

?seq challenge question: prepare people that didn't teach but need to learn (be explicit on this)
indexes divisible by three challenge bonus question: needs a little more work or better phrasing
more emphasis on c() function, its purpose and use
- In introducing vectors, perhaps show several ways to create one
Perhaps a graphic of $ and [] operations to illustrate what is happening
Use message, warning, and stop to demo them. Prepares for messages returned with read_xxx.

Rmarkdown revamp

The RMarkdown/knitr lesson needs work. In particular, the default document in RStudio when students do new -> Rmarkdown is overwhelming to newcomers, especially the chunk options. There are some ideas for development in code/RMarkdown_ideas.Rmd.

Transition data OpenRefine -> disk -> R

Per @ErinBecker's suggestion (#2)

Replace variance and relative variation exercises

Variance is tricky for some learners, and the wide -> long exercise is too contrived.

Make a directory of all functions taught

More exercises, more multi-level exercises

These lessons work best with lots of exercises -- shout-out and multiple choice questions frequently interspersed and a couple students-go-code ("Challenge") exercises in each lesson. We need more of everything, especially in the second half of the lessons (~after Data.frame manipulation).

In all challenge exercises, some students finish quickly and others take longer. To deal with this, each should have at least one "bonus" harder challenge to keep the faster students occupied while the others work, and ideally a second level "advanced" challenge too.

Timing and emphasis

Could we get ggplot and dplyr in day 1? If those are each half a day, that really only leaves a few hours for tidyr, statistics, knitr, and the capstone project. A single downloaded data directory will help; the spreadsheet and OpenRefine lessons can be streamlined... we don't need to belabor those points, just hit the essentials on spreadsheets and text clustering with OpenRefine. Likewise for the intro R lesson -- construct the lesson so teachers will move swiftly through it. Fold the projects lesson into another -- each transition creates a hurdle that takes time.

dplyr HBS debrief ideas

In addition to #29 and #30:

Can challenges be re-written to be a little clearer?
I (ML) think an overarching subsection early on, clarifying especially the data.frame-centric nature of dplyr and tidyverse (always DF in; always DF out) might be helpful in general and with the tricky assignment of the data.frame vs assignment to a column within a data.frame. Also, “here are five fundamental tasks you’ll learn to do in data.frames. Dplyr has one function for each of these tasks.”
Variance exercise: may need to reword parts to make more clear the goal of the exercise. "Can you make a plot that shows the distribution of country level gdp by without summarizing beforehand?"
Consider changing variance to mean. Variance might be one weight too many for some.
Graphic like PIPE graphic in SWC Unix materials
Typo in code chunk in pipe section
How prominent should the pipe be? Much disagreement about this. Also, how much should we make a point of nesting and repeated assigning vs. just demoing the "right" (piped) way to do things?

Clarifying where data exists and assignment vs printing

More clarity on where data exists … in console, in memory, in disk, etc…
Do we need to change lessons so that we're not always printing to the screen, but saving the data, to reinforce saving data back to the global environment (people stumbled in exercises as they weren't assigning operations to variables, but instead just printing to the console)

Add some explicit work-with-a-partner exercises

Instructor Notes

It seems the webpages have several purposes that potentially pull them in different directions: learner notes for during the class, stand-alone notes for self-study or for learners to return to after a class, and instructor notes for in class. Even collapsing the first and second purposes, the lessons are far too verbose for instructors to use as a class plan. Instructors end up scrolling through and looking for visual clues like headers, plots, code chunks, and exercises, but this doesn't always work well.

@tracykteal, SWC/DC folks must've given this some thought.... Is there a best practice? Seems like we could write a script to extract all the headers, code-chunk names, and exercise headers and make bullet-point instructor notes from those. That would automate the process and keep the instructor notes in sync with the lessons. I know some science of teaching folks advocate keeping expected times to reach various points in lesson plans, perhaps we should think about including that too, especially given the tendency for the early lessons to expand and go too slowly and the later lessons to get compressed.

Polish writing

There are places, especially in the second day's lessons, where the writing is more bullet-points-for-instructors and less readable-for-students.

arguments to tidy::gather

Students have been struggling with the arguments to gather, especially key and value.

Davis pilot workshop reflection

Feedback

For the post-workshop survey, there was no option for ours, and the drop-down choose-your-workshop question is required with no "other" option, so I asked people to choose the Philipines workshop and make a note that they were in Davis. Hopefully it's not to hard to separate those out now, but there should probably be an other option for that question. Also not sure why ours didn't show up there.
Here are the feedback results from end of morning both days and end of afternoon day 1.
I got an email this afternoon from a professor who was in the workshop and loved it and did the capstone project Saturday morning after. :)

Reflections

In general, I thought it went well, learners seemed happy with it. Day 1 especially (spreadsheets, openrefine, and R lessons 1-4) I think is pretty solid as is.

Day 2 could use some tinkering. Single-table dplyr took the whole morning. Afternoon consisted of tidyr::gather, statistical modeling, writing functions, and dynamic documents, in that order.

By the end of day 2, students were pretty fried. I don't know if there is a way around that: Forging new neural connections for two days is just exhausting. But the (my) tendency to cram material in the second afternoon needs to be avoided. Reserving space for a capstone exercise might help with this, or students might be too spent to do that kind of independent work at the end. An alternative is a showcase of possible next steps: Here's the kind of natural language processing you can do in R (showing without teaching) and your first resource to start learning it, and the same for social network analysis, structural equation modeling, etc.

Lesson 5 - `dplyr`

People like learning dplyr, understandably so. It handles most of what most people do. The basic structure is good, I think.

Piping of data.frames to the first argument in the subsequent function didn't sink in with some students, even though I felt like I went over it quite a few times. In exercises, several students would include data.frames in functions that were receiving them from a pipe. I think this is a symptom of the various arguments to dplyr functions not being clear enough, and the issue below about the structure of mutate and summarise being different than the others is part of this. Introducing all the verbs with intermediate assignment and then introducing piping at the very end might help.

The structure of mutate and summarise is different than the other verbs because they contain a colName = that the others don't. Maybe pointing explicitly to that syntactical difference a couple times -- "these two functions create new columns, and we give those columns names with colName =" -- would help. Assignment to columns within piped functions and assigning the resulting data.frame to a variable is complicated, and at least some learners have a hard time groking the component parts.

Piping to head at the end of dplyr chains inevitably leads to students copy-and-pasting code and assigning the head of a data.frame to a variable. head should be taught, with str and summary, but we can keep it separate from piping by using tbl_df's nice printing. Maybe start the dplyr lesson with conversion to tbl_df. It's conceptually easy and would only take 30 seconds at the beginning of the lesson and would avoid headaches further along.

Lesson 6 - `tidyr`

Something was missing from the gather part of lesson 6. I was trying to move quickly and so gave a pretty quick explanation and worked one example before giving the students an exercise and moving on. I don't think the motivation was clear, and a lot of students had trouble with the various arguments (key and value especially) to gather. A second example, perhaps bigger and more realistic would be useful. Separately from this lesson a student asked about working with three-dimensional arrays in R, he had subject-by-time-by-electrode data... tidying a dataset like that could be cool. Making a stronger connection between tidy data and ggplot might help motivate. E.g. If you wanted to plot this wide data and map the various conditions to color, how would you do it in ggplot? You can't easily, but with gather you can convert it to the form that ggplot (and lm and more) expect.

Lesson 9 - statistical modeling

The social scientists were hungry for this, as we rely heavily on statistical models. The content that is there worked well, and I love the connection with ggplot. Introducing a few more functions (t-test, anova) might be useful and low cost.

Lesson 7 - writing functions

I rushed through this. Students were able to write their own (F_to_C) function and source their code/functions.R files, so they actually got quite a bit rather quickly. Some saw the payoff in terms of organization, but we need a better motivating function after the temperature conversion examples. Something that makes learners say "oh yeah, I do that over and over, it would be great to write one function and just be able to call that."

Lesson 8 - dynamic documents

This lesson needs some improvement. Making our own custom .Rmd template will help; that way we can introduce students gently (the first code chunk in the default template is probably overwhelming!). I'd start with basics of markdown and later introduce code chunks and then code chunk options.

Part of the problem is that this forces a break from the model of the rest of the workshop, especially if the instructor has been piping a live-script to learners' browsers. Not sure what to do with that, but again the custom template might help by getting instructor and learners doing the same things in the same place.

What's missing?

Text processing. A brief introduction to paste, gsub, grep, etc. would be useful to many. I haven't used stringr, but I understand it uses a more consistent syntax than the base functions so would likely provide a gentler introduction.
lists and lapply. Not sure this belongs in the first two days, but it enables automated read/write and so much more.

Pare down "Data types and subsetting"

We have decided to wholeheartedly embrace dplyr (in "Data.frame manipulation"), so some of the subsetting here is redundant. A possible plan for things that should stay:

Vectors and vectorization
- numeric indexing
- logical indexing
A data.frame is vectors as columns in a table
- $ extraction; no [] or 2-d extraction
Inspecting data frames (especially str and summary, head is less important since we'll be working with tibbles and their friendly print method).
read csv

Make it clear this is a sandbox, not supported lessons

Per @ErinBecker any repo within the data-lessons organization isn't guaranteed support, as this is a sandbox for development work.

We should make this clear in the readme and in the intro/about sections or other headers of individual sections.

Overarching narrative/motivation/agenda

As @devbioinfoguy put it, "More explanation at start of workshop on overarching narrative on R and project process: cleaning, inspection/QA, basic analysis, viz, etc." Could come at the beginning of R or in spreadsheets. Projects goes with this. Maybe that does go in spreadsheets and rather than spreadsheets it's something like "Organizing Principles".

Clarify MCQ data reduction

$/day and $/year is confusing. Instruct learners to use pencil and paper to draw out the steps.