data-lessons / gapminder-R

Home Page: http://data-lessons.github.io/gapminder-R/
License: Other
Need to move the "MCQ: Data Reduction" {.challenge} block to after the mutate portion of the lesson.
The first challenge in the tidy data lesson, "Gather and plot", is contrived and awkward. A simple exercise that has students identify what's untidy about a table and work through the arguments to gather is needed.
Feedback from social scientists in the room: you can't calculate daily income from GDP per capita.
Change wording to:
Produce a data.frame with only the names and years of countries where per-capita GDP is less than a dollar a day, sorted from most- to least-recent.
Tip: The gdpPercap variable is annual GDP per person. You'll need to adjust.
Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.
What is the annual per-capita GDP, rounded to the nearest dollar, of the first row in the data.frame?
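A minimal sketch of one way to solve the reworded challenge, assuming dplyr and the gapminder data.frame are loaded:

```r
library(dplyr)

gapminder %>%
  filter(gdpPercap / 365 < 1) %>%   # convert annual GDP per capita to daily
  select(country, year) %>%
  arrange(desc(year))               # most- to least-recent
```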
Per-workshop instructions: install R and RStudio or update to the latest versions of both. Install tidyverse and run update.packages(). If library(tidyverse) produces anything but the usual output, show up 30 minutes early.
E.g., read.csv -> read_csv.
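As a sketch of the swap, assuming readr is attached and a hypothetical data/gapminder.csv exists:

```r
library(readr)

# Base version returns a data.frame silently:
gap <- read.csv("data/gapminder.csv")

# readr version returns a tibble and prints a column-type message,
# which is why students benefit from seeing message output beforehand:
gap <- read_csv("data/gapminder.csv")
```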
Add a breakdown of 1- vs. 2- vs. 3+-variable plots, with continuous and categorical variables. Make a reference table with a suggested first plot for each case and the corresponding mappings and geom. I think this goes after introducing ggplot and geoms with scatterplots.
I think this is the repo that @tracykteal has taught in the morning of this workshop. @tracykteal, in talking with @devbioinfoguy et al, we were thinking about folding that into this repo. One advantage of leaving it elsewhere is that it can be developed independently and this workshop could use those advances. OTOH, having it here might promote tighter coupling.
Also, Is there a version of the spreadsheet lesson you like for this workshop?
I think there should be an additional module between ggplot and dplyr (which usually falls first thing day 2) that reviews the basics, refreshes students to R, and goes beyond what we can/should do in the intro R and data-types and subsetting lessons.
Why? By the time we get to the end of data-types and subsetting, students are bored of the basics (even if they recognize their importance, and even if they're struggling to keep up), and for that reason I've tried to minimize what's covered in those lessons, which means leaving out some important things, like NA-handling. Also, students come back on day 2 and try to jump right into dplyr, and the on-ramp is too sharp/it's disconnected from what they just learned in ggplot. This will let them warm up, let us introduce more fundamentals, and also get to ggplot faster on day 1.
Using <- for assignment all the way through might help learners distinguish column creation vs. data.frame assignment in mutate and summarize.
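A minimal illustration of the distinction, assuming dplyr and gapminder are loaded:

```r
library(dplyr)

# `<-` assigns the resulting data.frame; `=` names a new column inside mutate():
gap_daily <- gapminder %>%
  mutate(gdpPercapDay = gdpPercap / 365)
```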
E.g., at the bottom of intro-RStudio I would link to project management.
This should probably come at the beginning of ggplot, including installing tidyverse.
There has been some discussion about including RMarkdown and/or knitr in this lesson. @marianschmi's lessons might be a good resource for this.
The c() function, its purpose and use.
message, warning, and stop -- demo them. Prepares students for the messages returned by read_xxx.
The RMarkdown/knitr lesson needs work. In particular, the default document in RStudio when students do New -> R Markdown is overwhelming to newcomers, especially the chunk options. There are some ideas for development in code/RMarkdown_ideas.Rmd.
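A minimal sketch of such a demo, using a made-up function f() for illustration:

```r
f <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")       # fatal: halts execution
  message("Checking x...")                            # informational, like read_csv's output
  if (x < 0) warning("x is negative; taking abs()")   # recoverable problem
  sqrt(abs(x))
}

f(4)    # message, then 2
f(-4)   # message and warning, then 2
```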
Per @ErinBecker's suggestion (#2): variance is tricky for some learners, and the wide -> long exercise is too contrived.
These lessons work best with lots of exercises -- shout-out and multiple choice questions frequently interspersed and a couple students-go-code ("Challenge") exercises in each lesson. We need more of everything, especially in the second half of the lessons (~after Data.frame manipulation).
In all challenge exercises, some students finish quickly and others take longer. To deal with this, each should have at least one "bonus" harder challenge to keep the faster students occupied while the others work, and ideally a second level "advanced" challenge too.
Could we get ggplot and dplyr in day 1? If those are each half a day, that really only leaves a few hours for tidyr, statistics, knitr, and the capstone project. A single downloaded data directory will help; the spreadsheet and OpenRefine lessons can be streamlined... we don't need to belabor those points, just hit the essentials on spreadsheets and text clustering with OpenRefine. Likewise for the intro R lesson -- construct the lesson so teachers will move swiftly through it. Fold the projects lesson into another -- each transition creates a hurdle that takes time.
It seems the webpages have several purposes that potentially pull them in different directions: learner notes for during the class, stand-alone notes for self-study or for learners to return to after a class, and instructor notes for in class. Even collapsing the first and second purposes, the lessons are far too verbose for instructors to use as a class plan. Instructors end up scrolling through and looking for visual clues like headers, plots, code chunks, and exercises, but this doesn't always work well.
@tracykteal, SWC/DC folks must've given this some thought.... Is there a best practice? Seems like we could write a script to extract all the headers, code-chunk names, and exercise headers and make bullet-point instructor notes from those. That would automate the process and keep the instructor notes in sync with the lessons. I know some science of teaching folks advocate keeping expected times to reach various points in lesson plans, perhaps we should think about including that too, especially given the tendency for the early lessons to expand and go too slowly and the later lessons to get compressed.
There are places, especially in the second day's lessons, where the writing is more bullet-points-for-instructors and less readable-for-students.
Students have been struggling with the arguments to gather, especially key and value.
In general, I thought it went well; learners seemed happy with it. Day 1 especially (spreadsheets, OpenRefine, and R lessons 1-4) I think is pretty solid as is.
Day 2 could use some tinkering. Single-table dplyr took the whole morning. Afternoon consisted of tidyr::gather, statistical modeling, writing functions, and dynamic documents, in that order.
By the end of day 2, students were pretty fried. I don't know if there is a way around that: Forging new neural connections for two days is just exhausting. But the (my) tendency to cram material in the second afternoon needs to be avoided. Reserving space for a capstone exercise might help with this, or students might be too spent to do that kind of independent work at the end. An alternative is a showcase of possible next steps: Here's the kind of natural language processing you can do in R (showing without teaching) and your first resource to start learning it, and the same for social network analysis, structural equation modeling, etc.
dplyr
People like learning dplyr, understandably so. It handles most of what most people do. The basic structure is good, I think.
Piping of data.frames to the first argument of the subsequent function didn't sink in with some students, even though I felt like I went over it quite a few times. In exercises, several students would include data.frames in functions that were receiving them from a pipe. I think this is a symptom of the various arguments to dplyr functions not being clear enough, and the issue below about the structure of mutate and summarise being different from the others is part of this. Introducing all the verbs with intermediate assignment and then introducing piping at the very end might help.
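A sketch of the confusion, assuming dplyr and gapminder are loaded:

```r
library(dplyr)

# Intermediate-assignment form (one idea: teach all the verbs this way first):
gap_2007 <- filter(gapminder, year == 2007)

# Piped form: the pipe supplies gapminder as the first argument.
gapminder %>% filter(year == 2007)

# The recurring student error repeats the data.frame inside the piped call:
# gapminder %>% filter(gapminder, year == 2007)   # wrong: data passed twice
```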
The structure of mutate and summarise is different from the other verbs because they contain a colName = that the others don't. Maybe pointing explicitly to that syntactical difference a couple of times -- "these two functions create new columns, and we give those columns names with colName =" -- would help. Assignment to columns within piped functions and assigning the resulting data.frame to a variable is complicated, and at least some learners have a hard time grokking the component parts.
Piping to head at the end of dplyr chains inevitably leads to students copy-and-pasting code and assigning the head of a data.frame to a variable. head should be taught, with str and summary, but we can keep it separate from piping by using tbl_df's nice printing. Maybe start the dplyr lesson with conversion to tbl_df. It's conceptually easy, would take only 30 seconds at the beginning of the lesson, and would avoid headaches further along.
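A sketch of that lesson opener, assuming dplyr is loaded:

```r
library(dplyr)

gapminder <- tbl_df(gapminder)  # in current dplyr, as_tibble() is the preferred spelling
gapminder                       # prints only the first 10 rows plus column types,
                                # so piping to head() is unnecessary
```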
tidyr
Something was missing from the gather part of lesson 6. I was trying to move quickly and so gave a pretty quick explanation and worked one example before giving the students an exercise and moving on. I don't think the motivation was clear, and a lot of students had trouble with the various arguments (key and value especially) to gather. A second example, perhaps bigger and more realistic, would be useful. Separately from this lesson, a student asked about working with three-dimensional arrays in R; he had subject-by-time-by-electrode data... tidying a dataset like that could be cool. Making a stronger connection between tidy data and ggplot might help motivate, e.g.: if you wanted to plot this wide data and map the various conditions to color, how would you do it in ggplot? You can't easily, but with gather you can convert it to the form that ggplot (and lm and more) expect.
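A hedged sketch of that motivating example, using a made-up wide table of scores by condition:

```r
library(tidyr)
library(ggplot2)

scores_wide <- data.frame(
  subject   = 1:4,
  control   = c(80, 75, 90, 85),
  treatment = c(88, 79, 95, 91)
)

# key names the new grouping column; value names the new measurement column:
scores_long <- gather(scores_wide, key = condition, value = score,
                      control, treatment)

# Now condition can be mapped to color, which the wide form couldn't do:
ggplot(scores_long, aes(x = subject, y = score, color = condition)) +
  geom_point()
```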
The social scientists were hungry for this, as we rely heavily on statistical models. The content that is there worked well, and I love the connection with ggplot. Introducing a few more functions (t-test, anova) might be useful and low cost.
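A low-cost sketch of those additions, assuming the gapminder data; t.test() and aov() use the same formula interface as lm():

```r
# Two-sample t-test: does life expectancy differ between two continents?
two <- droplevels(subset(gapminder, continent %in% c("Europe", "Africa")))
t.test(lifeExp ~ continent, data = two)

# One-way ANOVA across all continents:
summary(aov(lifeExp ~ continent, data = gapminder))
```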
I rushed through this. Students were able to write their own F_to_C function and source their code/functions.R files, so they actually got quite a bit rather quickly. Some saw the payoff in terms of organization, but we need a better motivating function after the temperature-conversion examples. Something that makes learners say "oh yeah, I do that over and over, it would be great to write one function and just be able to call that."
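For reference, a minimal sketch of the function and workflow as described:

```r
# Contents of code/functions.R -- the function students wrote:
F_to_C <- function(temp_F) {
  (temp_F - 32) * 5 / 9
}

# In the analysis script, load and call it:
# source("code/functions.R")
# F_to_C(212)   # 100
```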
This lesson needs some improvement. Making our own custom .Rmd template will help; that way we can introduce students gently (the first code chunk in the default template is probably overwhelming!). I'd start with basics of markdown and later introduce code chunks and then code chunk options.
Part of the problem is that this forces a break from the model of the rest of the workshop, especially if the instructor has been piping a live-script to learners' browsers. Not sure what to do with that, but again the custom template might help by getting instructor and learners doing the same things in the same place.
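A hedged sketch of what such a gentler custom template might contain -- markdown first, one tiny chunk, no chunk options (all content here is illustrative):

````markdown
---
title: "My analysis"
output: html_document
---

## Introduction

Plain markdown first: *emphasis*, **bold**, and lists.

```{r}
1 + 1
```
````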
paste, gsub, grep, etc. would be useful to many. I haven't used stringr, but I understand it uses a more consistent syntax than the base functions, so it would likely provide a gentler introduction.
lapply: not sure this belongs in the first two days, but it enables automated read/write and so much more.
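A quick sketch of those functions on gapminder-style country names (the stringr call is an assumption on my part, since I haven't used it):

```r
library(stringr)

countries <- c("United States", "United Kingdom", "South Africa")

paste(countries, collapse = ", ")         # combine strings
gsub(" ", "_", countries)                 # find-and-replace
grep("United", countries, value = TRUE)   # pattern matching
str_replace_all(countries, " ", "_")      # stringr's consistent str_* naming

# lapply enables automated reading of many files, e.g.:
# files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# dfs <- lapply(files, read.csv)
```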
(in "Data.frame manipulation"), so some of the subsetting here is redundant. A possible plan for things that should stay:
str
and summary
, head
is less important since we'll be working with tibbles and their friendly print method).Per @ErinBecker any repo within the data-lessons organization isn't guaranteed support, as this is a sandbox for development work.
We should make this clear in the readme and in the intro/about sections or other headers of individual sections.
As @devbioinfoguy put it, "More explanation at start of workshop on overarching narrative on R and project process: cleaning, inspection/QA, basic analysis, viz, etc." Could come at the beginning of R or in spreadsheets. Projects goes with this. Maybe that does go in the spreadsheets lesson, which could then be retitled something like "Organizing Principles".