gapminder-r's People

Contributors

aammd, aaren, abbycabs, abought, amarder, bkatiemills, claresloggett, devbioinfoguy, fmichonneau, griffinp, hdashnow, izahn, jdblischak, jpallen, liz-is, michaellevy, mikblack, naupaka, nfaux, pbanaszkiewicz, petebachant, pipitone, sritchie73, synesthesiam, tbekolay, tomwright01, tracykteal, twitwi, wking

gapminder-r's Issues

tidy exercise

The first challenge in the tidy data lesson, "Gather and plot", is contrived and awkward. We need a simpler exercise that has students identify what's untidy about a table and work through the arguments to gather.

Change wording of first dplyr exercise

Feedback from social scientists in the room: you can't calculate daily income directly from annual per-capita GDP.

Change wording to:

Produce a data.frame with only the names and years of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.

Tip: The gdpPercap variable is annual gdp per person. You’ll need to adjust.
Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.
What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?
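A minimal sketch of one solution, assuming a gapminder-style data.frame with country, year, and gdpPercap (annual per-capita GDP) columns; the tiny data.frame here is invented for illustration:

```r
library(dplyr)

# invented gapminder-like data for illustration
gap <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1952, 2007, 1952, 2007),
  gdpPercap = c(300, 5000, 250, 360)
)

# names and years of countries where per-capita GDP is under
# a dollar a day, sorted from most- to least-recent
poorest <- gap %>%
  filter(gdpPercap / 365 < 1) %>%   # annual -> daily adjustment
  select(country, year) %>%
  arrange(desc(year))
poorest
```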

Pre-workshop install & update check

Pre-workshop instructions: install R and RStudio, or update both to their latest versions. Install tidyverse and run update.packages(). If library(tidyverse) produces anything but the usual startup messages, show up 30 minutes early.

plot types

Add breakdown of 1 vs. 2 vs. 3+ variable plots, with continuous and categorical variables. Make a table that they can see with a suggested first plot for each and the corresponding mappings and geom. I think this goes after introducing ggplot and geoms with scatterplots.

Add spreadsheet and OpenRefine modules

I think this is the repo that @tracykteal has taught in the morning of this workshop. @tracykteal, in talking with @devbioinfoguy et al, we were thinking about folding that into this repo. One advantage of leaving it elsewhere is that it can be developed independently and this workshop could use those advances. OTOH, having it here might promote tighter coupling.

Also, is there a version of the spreadsheet lesson you like for this workshop?

New "More fundamentals of R" module

I think there should be an additional module between ggplot and dplyr (which usually falls first thing day 2) that reviews the basics, refreshes students to R, and goes beyond what we can/should do in the intro R and data-types and subsetting lessons.

Why? By the time we get to the end of data-types and subsetting, students are bored of the basics (even if they recognize their importance, and even if they're struggling to keep up), and for that reason I've tried to minimize what's covered in those lessons, which means leaving out some important things, like NA-handling. Also, students come back on day 2 and try to jump right into dplyr, and the onramp is too sharp/it's disconnected from what they just learned in ggplot. This will let them warm up, let us introduce more fundamentals, and get to ggplot faster on day 1.

Use `<-` for assignment throughout

Using <- for assignment all the way through might help learners distinguish column creation (with =) inside mutate and summarize from assignment of a whole data.frame.
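A small sketch of the distinction, using an invented data.frame:

```r
library(dplyr)

df <- data.frame(x = 1:3)

# `<-` assigns a whole data.frame to a name;
# inside mutate(), `=` names a new column instead
df2 <- df %>% mutate(y = x * 2)
df2$y
```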

Intro R HBS debrief ideas

  • ?seq challenge question: prepare people who didn't teach but need to learn (be explicit about this)
  • indexes divisible by three challenge bonus question: needs a little more work or better phrasing
  • more emphasis on c() function, its purpose and use
    • In introducing vectors, perhaps show several ways to create one
  • Perhaps a graphic of $ and [] operations to illustrate what is happening
  • Use message, warning, and stop to demo them. Prepares for messages returned with read_xxx.
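A sketch of the three condition signals named in the last bullet, the same kinds learners later see in read_xxx output:

```r
# message: purely informational; execution continues
message("just letting you know")

# warning: flags a possible problem; execution still continues
warning("something looks off")

# stop: raises an error and halts, unless caught
result <- tryCatch(
  stop("fatal problem"),
  error = function(e) paste("caught:", conditionMessage(e))
)
result
```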

Rmarkdown revamp

The RMarkdown/knitr lesson needs work. In particular, the default document in RStudio when students do new -> Rmarkdown is overwhelming to newcomers, especially the chunk options. There are some ideas for development in code/RMarkdown_ideas.Rmd.

More exercises, more multi-level exercises

These lessons work best with lots of exercises -- shout-out and multiple choice questions frequently interspersed and a couple students-go-code ("Challenge") exercises in each lesson. We need more of everything, especially in the second half of the lessons (~after Data.frame manipulation).

In all challenge exercises, some students finish quickly and others take longer. To deal with this, each should have at least one "bonus" harder challenge to keep the faster students occupied while the others work, and ideally a second level "advanced" challenge too.

Timing and emphasis

Could we get ggplot and dplyr in day 1? If those are each half a day, that really only leaves a few hours for tidyr, statistics, knitr, and the capstone project. A single downloaded data directory will help; the spreadsheet and OpenRefine lessons can be streamlined... we don't need to belabor those points, just hit the essentials on spreadsheets and text clustering with OpenRefine. Likewise for the intro R lesson -- construct the lesson so teachers will move swiftly through it. Fold the projects lesson into another -- each transition creates a hurdle that takes time.

dplyr HBS debrief ideas

In addition to #29 and #30:

  • Can challenges be re-written to be a little clearer?
  • I (ML) think an overarching subsection early on, clarifying especially the data.frame-centric nature of dplyr and tidyverse (always DF in; always DF out) might be helpful in general and with the tricky assignment of the data.frame vs assignment to a column within a data.frame. Also, “here are five fundamental tasks you’ll learn to do in data.frames. Dplyr has one function for each of these tasks.”
  • Variance exercise: may need rewording to make the goal of the exercise clearer, e.g. "Can you make a plot that shows the distribution of country-level gdp without summarizing beforehand?"
  • Consider changing variance to mean. Variance might be one weight too many for some.
  • Graphic like PIPE graphic in SWC Unix materials
  • Typo in code chunk in pipe section
  • How prominent should the pipe be? Much disagreement about this. Also, how much should we make a point of nesting and repeated assigning vs. just demoing the "right" (piped) way to do things?

Clarifying where data exists and assignment vs printing

  • More clarity on where data exists: in the console, in memory, on disk, etc.
  • Should we change lessons so that we're not always printing to the screen but instead saving results, to reinforce assigning data back to the global environment? (People stumbled in exercises because they printed operations to the console instead of assigning them to variables.)
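One way to demo the printing-vs-assigning distinction in base R:

```r
x <- c(1, 5, 3)
sort(x)        # in the console this prints a sorted copy; x is unchanged
x              # still 1 5 3

x <- sort(x)   # assignment saves the result back to x
x              # now 1 3 5
```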

Instructor Notes

It seems the webpages have several purposes that potentially pull them in different directions: learner notes for during the class, stand-alone notes for self-study or for learners to return to after a class, and instructor notes for in class. Even collapsing the first and second purposes, the lessons are far too verbose for instructors to use as a class plan. Instructors end up scrolling through and looking for visual clues like headers, plots, code chunks, and exercises, but this doesn't always work well.

@tracykteal, SWC/DC folks must've given this some thought.... Is there a best practice? Seems like we could write a script to extract all the headers, code-chunk names, and exercise headers and make bullet-point instructor notes from those. That would automate the process and keep the instructor notes in sync with the lessons. I know some science of teaching folks advocate keeping expected times to reach various points in lesson plans, perhaps we should think about including that too, especially given the tendency for the early lessons to expand and go too slowly and the later lessons to get compressed.

Polish writing

There are places, especially in the second day's lessons, where the writing is more bullet-points-for-instructors and less readable-for-students.

Davis pilot workshop reflection

Feedback

  • For the post-workshop survey, there was no option for ours, and the drop-down choose-your-workshop question is required with no "other" option, so I asked people to choose the Philippines workshop and make a note that they were in Davis. Hopefully it's not too hard to separate those out now, but there should probably be an "other" option for that question. Also not sure why ours didn't show up there.
  • Here are the feedback results from end of morning both days and end of afternoon day 1.
  • I got an email this afternoon from a professor who was in the workshop and loved it and did the capstone project Saturday morning after. :)

Reflections

In general, I thought it went well; learners seemed happy with it. Day 1 especially (spreadsheets, OpenRefine, and R lessons 1-4) I think is pretty solid as is.

Day 2 could use some tinkering. Single-table dplyr took the whole morning. Afternoon consisted of tidyr::gather, statistical modeling, writing functions, and dynamic documents, in that order.

By the end of day 2, students were pretty fried. I don't know if there is a way around that: Forging new neural connections for two days is just exhausting. But the (my) tendency to cram material in the second afternoon needs to be avoided. Reserving space for a capstone exercise might help with this, or students might be too spent to do that kind of independent work at the end. An alternative is a showcase of possible next steps: Here's the kind of natural language processing you can do in R (showing without teaching) and your first resource to start learning it, and the same for social network analysis, structural equation modeling, etc.

Lesson 5 - dplyr

People like learning dplyr, understandably so. It handles most of what most people do. The basic structure is good, I think.

Piping of data.frames to the first argument in the subsequent function didn't sink in with some students, even though I felt like I went over it quite a few times. In exercises, several students would include data.frames in functions that were receiving them from a pipe. I think this is a symptom of the various arguments to dplyr functions not being clear enough, and the issue below about the structure of mutate and summarise being different than the others is part of this. Introducing all the verbs with intermediate assignment and then introducing piping at the very end might help.

The structure of mutate and summarise is different than the other verbs because they contain a colName = that the others don't. Maybe pointing explicitly to that syntactical difference a couple times -- "these two functions create new columns, and we give those columns names with colName =" -- would help. Assigning to columns within piped functions while also assigning the resulting data.frame to a variable is complicated, and at least some learners have a hard time grokking the component parts.

Piping to head at the end of dplyr chains inevitably leads to students copy-and-pasting code and assigning the head of a data.frame to a variable. head should be taught, with str and summary, but we can keep it separate from piping by using tbl_df's nice printing. Maybe start the dplyr lesson with conversion to tbl_df. It's conceptually easy and would only take 30 seconds at the beginning of the lesson and would avoid headaches further along.
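A sketch of that 30-second conversion; as_tibble() is the current spelling of the tbl_df() conversion mentioned above (the data.frame here is invented):

```r
library(dplyr)

# a data.frame long enough that printing it floods the console
gap <- data.frame(country = rep(letters[1:5], each = 20),
                  x       = rnorm(100))

gap <- as_tibble(gap)  # modern name for the tbl_df() conversion
gap                    # prints only the first 10 rows; no head() needed
```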

Lesson 6 - tidyr

Something was missing from the gather part of lesson 6. I was trying to move quickly, so I gave a pretty quick explanation and worked one example before giving the students an exercise and moving on. I don't think the motivation was clear, and a lot of students had trouble with the various arguments (key and value especially) to gather. A second example, perhaps bigger and more realistic, would be useful. Separately from this lesson, a student asked about working with three-dimensional arrays in R; he had subject-by-time-by-electrode data. Tidying a dataset like that could be cool. Making a stronger connection between tidy data and ggplot might help motivate it. E.g., if you wanted to plot this wide data and map the various conditions to color, how would you do it in ggplot? You can't easily, but with gather you can convert it to the form that ggplot (and lm and more) expect.
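A second, condition-to-color-motivated example might look like this sketch (the wide data.frame is invented):

```r
library(tidyr)

# invented wide data: one column per measurement condition
wide <- data.frame(subject = 1:3,
                   control = c(5, 6, 7),
                   treated = c(9, 8, 10))

# key = name for the column that holds the old column headers;
# value = name for the column that holds the cell values
long <- gather(wide, key = condition, value = score, control, treated)
long
# `condition` can now be mapped to color in ggplot
```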

Lesson 9 - statistical modeling

The social scientists were hungry for this, as we rely heavily on statistical models. The content that is there worked well, and I love the connection with ggplot. Introducing a few more functions (t-test, anova) might be useful and low cost.
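A sketch of the two extra functions suggested, using invented data; both are in base R, so the cost really is low:

```r
x <- c(5.1, 4.8, 5.6, 5.0)
y <- c(6.2, 6.8, 6.1, 6.5)

t.test(x, y)  # two-sample t-test

df <- data.frame(val = c(x, y),
                 grp = rep(c("a", "b"), each = 4))
summary(aov(val ~ grp, data = df))  # one-way anova
```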

Lesson 7 - writing functions

I rushed through this. Students were able to write their own (F_to_C) function and source their code/functions.R files, so they actually got quite a bit rather quickly. Some saw the payoff in terms of organization, but we need a better motivating function after the temperature conversion examples. Something that makes learners say "oh yeah, I do that over and over, it would be great to write one function and just be able to call that."
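For reference, the F_to_C function students write is along these lines (the body here is the standard conversion formula, not copied from the lesson):

```r
# convert Fahrenheit to Celsius
F_to_C <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}

F_to_C(212)  # 100
F_to_C(32)   # 0
```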

Lesson 8 - dynamic documents

This lesson needs some improvement. Making our own custom .Rmd template will help; that way we can introduce students gently (the first code chunk in the default template is probably overwhelming!). I'd start with basics of markdown and later introduce code chunks and then code chunk options.

Part of the problem is that this forces a break from the model of the rest of the workshop, especially if the instructor has been piping a live-script to learners' browsers. Not sure what to do with that, but again the custom template might help by getting instructor and learners doing the same things in the same place.

What's missing?

  • Text processing. A brief introduction to paste, gsub, grep, etc. would be useful to many. I haven't used stringr, but I understand it uses a more consistent syntax than the base functions so would likely provide a gentler introduction.
  • lists and lapply. Not sure this belongs in the first two days, but it enables automated read/write and so much more.
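A sketch of what those two bullets could cover, with invented file names:

```r
# base-R text processing
files <- c("data_2015.csv", "data_2016.csv", "notes.txt")
grep("csv$", files, value = TRUE)      # find matching names
gsub("data_", "", files)               # substitute text
paste("year", 2015:2016, sep = "-")    # build strings

# lapply: apply a function to each element, get a list back
# (the pattern behind automated read/write)
lapply(1:3, function(i) i^2)
```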

Pare down "Data types and subsetting"

We have decided to wholeheartedly embrace dplyr (in "Data.frame manipulation"), so some of the subsetting here is redundant. A possible plan for things that should stay:

  • Vectors and vectorization
    • numeric indexing
    • logical indexing
  • A data.frame is vectors as columns in a table
    • $ extraction; no [] or 2-d extraction
  • Inspecting data frames (especially str and summary; head is less important since we'll be working with tibbles and their friendly print method)
  • read csv
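A sketch of the pared-down subsetting content, using invented vectors:

```r
# vectors and vectorization
v <- c(10, 20, 30, 40)
v[c(1, 3)]   # numeric indexing: 10 30
v[v > 15]    # logical indexing: 20 30 40

# a data.frame is vectors as columns in a table
df <- data.frame(a = v, b = letters[1:4])
df$a         # $ extraction of one column; no [] or 2-d extraction

# inspecting data frames
str(df)
summary(df)
```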

Overarching narrative/motivation/agenda

As @devbioinfoguy put it, "More explanation at start of workshop on overarching narrative on R and project process: cleaning, inspection/QA, basic analysis, viz, etc." Could come at the beginning of R or in spreadsheets. Projects goes with this. Maybe that does go in spreadsheets and rather than spreadsheets it's something like "Organizing Principles".
