swcarpentry / r-novice-gapminder

162.0 23.0 530.0 168.33 MB

R for Reproducible Scientific Analysis

Home Page: http://swcarpentry.github.io/r-novice-gapminder/

License: Other

Shell 1.01% R 93.46% TeX 5.52%

carpentries software-carpentry lesson r data-wrangling data-visualisation data-visualization english programming stable

r-novice-gapminder's Issues

Tie back to shell session?

In the context of a SWC workshop it would be nice to tie this lesson to the shell by calling Rscript from the shell. It's a lot to ask because it requires source() (conceptually) and commandArgs() (to be interesting). Maybe mention it in further readings?
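As a rough illustration of what such a tie-in could look like, here is a minimal sketch of a script meant to be run with Rscript. The file name summarise-args.R and the helper mean_of_args are assumptions made for this example and are not part of the lesson.

```r
# Hypothetical script; the file name summarise-args.R is an assumption.
# From the shell it could be run as: Rscript summarise-args.R 1 2 3

# Convert command-line strings to numbers and return their mean.
mean_of_args <- function(args) {
  mean(as.numeric(args))
}

# trailingOnly = TRUE drops R's own arguments and keeps only the user's.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) > 0) {
  cat("Mean:", mean_of_args(args), "\n")
}
```

Keeping the calculation in a function means the same file can also be loaded with source() from an interactive session.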

Spell check?

I would be happy to do a global spell check run on the Rmd files. If this is a good idea, let me know when origin/gh-pages is ready for a PR with lots of small changes.

Challenges and the materials in general should tell a story about analysing the gapminder dataset

When editing the materials and challenges, I think it's useful to keep in mind how they relate back to the central thread of the gapminder data and the kind of analysis we can do with it.

Discuss.

Can we delete the xx-*-.md files?

Are we done with e.g. 01-rstudio-intro.md now that we have 01-rstudio-intro.Rmd?

Should we include plotting focused primarily on ggplot or base graphics?

Discussion in #1 seems to indicate that many students like learning ggplot, and if they have no prior experience with base R plotting then it is not so hard to pick up. However, it may also be too much to cover along with everything else, and novices will run into cases where it helps to understand how base graphics work as well. How should we focus the lessons?

06-data-subsetting: order of operations

Currently, the callout on order of operations in 06-data-subsetting says:

remember the order of operations. : is really a function, so what happens is it takes its first argument as -1, and second as 3...

I have two concerns with this:

"is really a function" doesn't seem germane: as I understand it, all operators are functions, but only some have lower precedence than the unary minus sign.
I don't think we can ask the students to "remember" order of operations here. Even if the students knew where : fit in the hierarchy from high-precedence operators like ^ to low-precedence operators like =, they probably wouldn't anticipate that R gives very different precedences to unary and binary minus signs:

> -1:3      # Unary "minus" evaluated before `:`
[1] -1  0  1  2  3

> 0 - 1:3   # Binary "minus" evaluated after `:`
[1] -1 -2 -3

Given that we probably don't want a digression into R's operator precedence, I'm not sure what the solution is (other than recommending that students always use parentheses around arguments to : when math is happening nearby). I thought I'd raise the issue to see what other SWC folk think.
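A short sketch of the parentheses recommendation, using a made-up vector x; the variable names are illustrative only.

```r
x <- c(10, 20, 30, 40, 50)

# Unary minus binds more tightly than `:`, so -1:3 means (-1):3.
unary <- -1:3

# Parentheses make the intended sequence explicit.
explicit <- -(1:3)

# Dropping the first three elements needs the parenthesised form.
dropped <- x[-(1:3)]

# x[-1:3] mixes negative and positive indices, which R rejects;
# tryCatch captures the error message instead of stopping the script.
mixed <- tryCatch(x[-1:3], error = function(e) conditionMessage(e))
```

Writing the parentheses every time avoids having to recall where `:` sits relative to the two kinds of minus.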

Unnumbered challenge in 06-data-subsetting

06-data-subsetting has an unnumbered challenge between challenge 1 and 2. This challenge also lacks a solution at the bottom of the page.

Update MAKEFILE to check knitr version

make preview will only compile the code blocks in the challenges and callouts correctly if knitr is version 1.10.12 or higher. Currently, this requires installing from github: (devtools::install_github("yihui/knitr")).

I've written an R script that will throw an error if the knitr version is too low (tools/check_knitr_version.R) in Pull Request #41 , but I wasn't able to successfully incorporate it into the Makefile to prevent future contributors from clobbering the code block rendering.
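For readers unfamiliar with version checks in R, the following is a minimal sketch of the idea. It is not the contents of tools/check_knitr_version.R; the function name meets_minimum is hypothetical, and only the minimum version number comes from the text above.

```r
# Hypothetical helper: TRUE when the installed version is at least the required one.
meets_minimum <- function(installed, required = "1.10.12") {
  package_version(installed) >= package_version(required)
}

# Only check when knitr is actually installed on this machine.
if (requireNamespace("knitr", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("knitr"))
  if (!meets_minimum(installed)) {
    stop("knitr ", installed, " is too old; version 1.10.12 or higher is needed")
  }
}
```

One possible way to wire this in, offered as an assumption rather than a tested fix, is to have the Makefile run the check with Rscript before building so that a failure stops make.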

Conversion from R-novice-inflammation

It looks like a lot of work has been put into the r-novice-inflammation lessons over the last few months, and they now cover a lot more R-specific material (as opposed to a literal translation of the python materials).

I'm wondering if pulling in those lessons, and replacing the Inflammation data with the gapminder data is a good place to start for paring this material down to a half day workshop? We could then have extra lessons for instructors who want to run R over a full day (e.g. a ggplot2 lesson and knitr lesson).

Links are out of date in reference.md

The content and links in reference.md need to be updated to match the new lesson order and names.

Link GitHub repo page to rendered HTML

At the top of https://github.com/swcarpentry/r-novice-gapminder, please add a link to http://swcarpentry.github.io/r-novice-gapminder/index.html so people know where to find the rendered version. I can't put in a PR for that, only those with commit access can add it (right after where it says 'Introduction to R for non-programmers using gapminder data.')

Update topic timings

The header material for each topic file includes the expected amount of time each lesson should take. Most of them are the default: 15 minutes. @hdashnow can you update each of the topics with realistic times? You've had the most experience running this lesson material.

Pull in lesson-example CONTRIBUTING.md and LAYOUT.md

This information has been updated since the repository was started. Key information includes R Markdown usage. Pull from https://github.com/swcarpentry/lesson-example

make problems: .html not produced by 'make preview'

The original issue was about 'make check' failing. That got filed as an issue in the lesson template repo at swcarpentry/DEPRECATED-lesson-template#299

I was unable to get 'make preview' to work. I created a new file with a .Rmd extension, and ran 'make preview' and it did not create a corresponding .html file. I removed all the .html files, even. R is in my path.

I am doing this on a fresh installation of El Capitan, with freshly installed R, RStudio, Pandoc, Anaconda[23], etc. All of the necessary R libraries are installed, and if I manually knit the new .Rmd file, it produces the proper output.

I may not be able to get to this this evening, but I'll try it again, just to be sure it was not an oversight on my part.

one day approach to teaching material

I just finished teaching a one day version (~6 hrs including breaks) of these materials. I thought it might be useful to share what I did for others that might be looking for a trimmed down version of these thorough materials. The workshop started with a morning of bash, then an afternoon and morning of R, and then an afternoon of git. That turned out to be a mistake for the person teaching git, but whatever. My motivation was to give people the minimum that they needed to get going with R.

Lesson 1: I quickly went over the material without using git
Lesson 2: skipped
Lesson 4: More or less went up to but not including factors
Lesson 5: Combined with the material from Lesson 4
Lesson 9: Combined with the material from Lesson 4
Lesson 3: Discussed using the help window in RStudio to find help with read.csv
Lesson 6: Skipped using which, "Handling special values", "Factor subsetting", "Matrix subsetting", "List subsetting"
Lesson 7: Got to the gdp calculator at the end of the first day and picked up there on the second day
Lesson 10: Only taught if ... else in the context of the more sophisticated gdp calculator
Lesson 11: skipped
Lesson 12: skipped
Lesson 13: Followed pretty literally
Lesson 8: Followed pretty literally
Lesson 14: skipped
Lesson 15: skipped
I also added about 20-30 min on knitr showing how the plots from Lesson 8 could be put into a document.

I realize that I skipped lists and matrices and barely introduced factors to say they are categorical data types. To get through the dplyr and ggplot2 stuff those just aren't needed and you can go a long way in R without needing them.

When teaching the function component I tried to build up the gdp calculator piece by piece. I would show them how to do the year and have them do the country. Towards the end of the first day I could sense that they weren't getting it and they were glazing over. So I had them get in pairs and alternate explaining each line to each other. They really perked up and seemed to have more confidence. We repeated this the following morning to rebuild what we had done and to go forward with the if statements.

Having taught some version of these materials twice now, I fear that a lot of the SWC materials have become bloated beyond what is truly necessary to get someone going, and that affected what I covered and how I approached the materials.

Lesson on packages and package ecosystem

I'm loath to add anything to this, but I think an R intro could use an introduction to packages and the R package ecosystem. This would NOT be about how to create packages, but about navigating repos, finding packages and understanding them as collections of functions.

Here are some initial thoughts. Feedback?

Time: 25 mins?
Goals: Students should be able to:

Understand the difference between installing and loading packages.
See what packages are on their machine and loaded.
Deal with Error: could not find function by loading/installing packages.
Identify multiple sources of R packages (CRAN, BioConductor, ROpenSci, Github)
Install a package with install.packages.
View the source code of a package function.
Locally and on github (via the metacran github org)
Install a package with devtools::install_github
Install a package downloaded from a website.
Change CRAN mirrors or package repos.
View package Index, DESCRIPTION, and vignettes both on the web and locally.
Identify when a package may have a non-R dependency (actually installing may be too much).
Find packages suitable for a task using CRAN web views, r-pkg.org, and web search
Identify a person or resource to contact with questions about packages and bug reports.

(Yes, this is the ambitious version)
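To make the installed-versus-loaded distinction in the goals above concrete, here is a small sketch using base R only. The helper names is_installed and is_attached are hypothetical and exist only for this example.

```r
# Installed: the package can be found in one of the library paths.
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

# Loaded with library(): the package appears on the search path.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_installed("stats")               # stats ships with R
is_attached("stats")                # and is attached in a default session
is_installed("notARealPackage123")  # a made-up name that should not exist

# Typing a function's name without parentheses prints its source code.
sd
```

A "could not find function" error usually means the package is either not installed or installed but not yet attached, and these two checks tell the cases apart.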

Challenges:

Find a package for a given task (likely related to the field of people at the workshop)
Install package via CRAN/BioC and github
Find the appropriate package help listserv for a package
An ambitious idea is to make a fake SWC package that has the answer to questions embedded. e.g., "Who is the maintainer of swctools?", "File a bug report for swctools.", "The source for swctools::somefn contains coordinates to treasure..."

Notes:

Package installation is a notoriously finicky part of the workshop due to wi-fi issues, students lacking permissions on lab-issued computers, etc. Hopefully this lesson is taught later in the second day, after those first install issues are identified and dealt with.
It's important to make a point about the human side of the package ecosystem. Authors are humans, quality is not guaranteed. Things like number of users, package age and active maintenance are signals of quality. (Related to Mozilla's Web Literacy Map concept of credibility).
I imagine that field-specific Data Carpentry workshops would customize this, especially on the BioC/genomics side (CC @tracykteal)

Should packrat content be included or removed?

Via discussion in #1 it seems like people want to take it out, especially for novice materials.

Error in 05-data-structures-part2

The exercise where we get learners to rbind to an existing dataframe is wrong.

df <- data.frame(id = c('a', 'b', 'c', 'd', 'e', 'f'), x = 1:6, y = c(214:219))
df

df <- rbind(df, list("g", 11, 42))

Should give a warning about an invalid factor level, and the following:

class(df$id)

Should give us "factor" but instead has "character" in the .md and .html versions.

This is used to motivate the use of stringsAsFactors = FALSE, which seems to have already been on before this code was run.

I'd submit a pull request, but the .Rmd is correct, just the parsing of it is wrong.

Maybe whoever built the lessons from html has this globally enabled?
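The difference can be reproduced independently of any global setting by passing stringsAsFactors explicitly. This is a sketch based on the data frame in this issue; the wrapper make_df is a hypothetical helper added for the comparison.

```r
# Build the example data frame with an explicit stringsAsFactors setting.
make_df <- function(as_factors) {
  data.frame(id = c("a", "b", "c", "d", "e", "f"),
             x = 1:6,
             y = 214:219,
             stringsAsFactors = as_factors)
}

# With factors, "g" is not an existing level of id, so R is expected
# to warn about an invalid factor level when the row is added.
with_factors <- make_df(TRUE)
with_factors <- rbind(with_factors, list("g", 11, 42))
class(with_factors$id)

# With plain strings, the new row is added without complaint.
with_strings <- make_df(FALSE)
with_strings <- rbind(with_strings, list("g", 11, 42))
class(with_strings$id)
```

If the rendered lesson shows the second behaviour where the first is expected, the build environment was probably running with the option turned off.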

Remove extra files in repo that have been superseded by content within lessons

There are a couple of files that are no longer needed. I think it would be safe to get rid of or else transition the content into one of the first few lesson modules. At any rate, their content is generally out of date compared to the current version of the materials.

Need to be removed or transferred:

motivation.md
OUTLINE.md
plan.md

From make check:

ERROR: Validation failed for ./motivation.md: Could not automatically identify correct template.
ERROR: Validation failed for ./OUTLINE.md: Could not automatically identify correct template.
ERROR: Validation failed for ./plan.md: Could not automatically identify correct template.

Repetition/modification of code example

In 13-dplyr.Rmd there's a code example that reads:

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_pop=mean(pop),
              sd_pop=sd(pop))

In context, this should possibly be:

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))

or it might be an unintended duplication, which could be removed.

make preview fails

Looks like one of the python scripts isn't happy:

pandoc -s -t html \
    --template=_layouts/page \
    --filter=tools/filters/blockquote2div.py \
    --filter=tools/filters/id4glossary.py \
    -Vheader="$(cat _includes/header.html)" -Vbanner="$(cat _includes/banner.html)" -Vfooter="$(cat _includes/footer.html)" -Vjavascript="$(cat _includes/javascript.html)" \
    -o 01-rstudio-intro.html 01-rstudio-intro.md
pandoc: Error running filter tools/filters/blockquote2div.py
fd:4: hPutBuf: resource vanished (Broken pipe)
make: *** [01-rstudio-intro.html] Error 83

Convert non-topic Markdown files to Rmarkdown?

Following the discussion in Issue #17, I propose we convert the non-topic markdown files (e.g. LAYOUT.md, index.md, CONTRIBUTING.md, etc.) to R markdown files. That way we can add "*.md" to the .gitignore, to simplify the process for future contributors (who should then only ever edit the .Rmd files).

08-plot-ggplot2 Switch axis for life exp. & GDP?

The examples show GDP on the y-axis and life expectancy on the x-axis. As many discussions (elsewhere) concern the utility of GDP in predicting life expectancy, should GDP instead be on the x-axis?

Separate challenge chunks

For many of the back to back challenges, there is only one {.challenge} markdown block. Clarity would be increased by separating out into a challenge block for each individual challenge. See the end of 01-rstudio-intro.Rmd for an example.

Keep development in gh-pages branch or move it to master?

Currently all of the content (from the COMBINE repo) is on the gh-pages branch. Should we keep it there, and delete master, or occasionally merge back to master, or...?

Simplify downloading of gapminder-FiveYearData.csv

We currently ask students to download the zip file for the raw data:
https://github.com/swcarpentry/r-novice-gapminder/blame/gh-pages/02-project-intro.md#L159

Unzipping is an unnecessary pain here. It may be easier to download the raw version directly with a "right click -> save as" using this direct link:
https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv

Should we be using .Rmd or .md files?

Just noticed that all the files are .md and .html -- should we be using .Rmd instead?

tidyr lesson - inconsistencies in dataframe atomic types

When you load the gap_wide data from the .csv file in the repo, columns 37 and 38 are read in as integers and columns 1 and 2 are read in as factors. When you follow along in the lesson, columns 1 and 2 of gap_wide_new are characters and columns 37 and 38 are numeric. This means the all.equal() function won't work until you make those changes in the gap_wide data frame.
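A toy reproduction of the mismatch, assuming two small data frames in place of gap_wide and gap_wide_new; the column names and values here are invented for illustration.

```r
# Stand-in for data read from CSV: a factor column and an integer column.
from_csv <- data.frame(continent = factor(c("Africa", "Asia")),
                       pop = c(10L, 20L))

# Stand-in for data rebuilt in the lesson: character and double columns.
rebuilt <- data.frame(continent = c("Africa", "Asia"),
                      pop = c(10, 20),
                      stringsAsFactors = FALSE)

# The factor/character difference alone is enough to make this fail.
before <- isTRUE(all.equal(from_csv, rebuilt))

# Convert the columns so both data frames use the same types.
from_csv$continent <- as.character(from_csv$continent)
from_csv$pop <- as.numeric(from_csv$pop)

after <- isTRUE(all.equal(from_csv, rebuilt))
```

Reading the CSV with stringsAsFactors = FALSE in the first place would avoid the factor conversion step.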

Solutions for challenges

Many students asked for solutions to the challenges. This is particularly important when we don't get time to discuss the later challenges with the whole class. It's also handy for new instructors teaching the first workshop.

I suggest putting a link to a page of challenges and solutions at the end of each lesson.

Diagrams do not render properly on dplyr and tidyr lessons

Perhaps you can figure this one out @remi-daigle?

Seems to be an issue with the pandoc -> html -> jekyll workflow somewhere... The diagrams using DiagrammeR don't display properly. See http://swcarpentry.github.io/r-novice-gapminder/13-dplyr.html and http://swcarpentry.github.io/r-novice-gapminder/14-tidyr.html

Add motivational slides

We ought to have a short motivational slideshow that we can use at the start of a lesson. From LAYOUT.md:

Every lesson must include a short slide deck in motivation.md suitable for a short presentation (3 minutes or less) that the instructor can use to explain to learners how knowing the subject will help them.

I was thinking that #37 would make really great material for such a thing. The original authors could work on that, or if people agree we can merge that in and someone else can make it into slides
cc @hdashnow @SamPenrose

Knitr lessons

It would be really nice to have a small lesson on the use of knitr for report generation within RStudio. @nfaux ran a workshop last week with a knitr lesson, but there are not yet any corresponding lesson materials. We've got his project file to use as reference materials.

Create an outline.md with an overview of the lessons in this repo, which to include in a standard SWC workshop, and in what order

There is a ton of great content in this repo. Perhaps a great place to start getting it organized would be an outline.md document that lists the repos, and proposes an order and a core set. Then we can focus on getting these core modules polished and move some of the others into supplementary or additional materials sections, as discussed in the comments on #10.

Archival PRs

Over in the git-novice lessons they use closed PRs to keep an archive of past workshops, to track how different people implement and change the lessons. Shall we try a similar approach here?

I will try this approach for this week's SWC workshop at Simon Fraser University

Initial conversion of md lessons to Rmd

@sritchie73 has agreed to lead off the MozSL Sprint with some conversions of the lessons from md to Rmd.

Add general description and layout of functions

The start of the 07-functions lesson needs some text about what a function is, and its general form (like the if/else lesson).

Data types lesson 4

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html
R has 5 basic atomic types (meaning they can’t be broken down into anything smaller):

(there is a problem with indentation and how the atomic types are counted)
It looks like this:

logical (e.g., TRUE, FALSE)
numeric
integer (e.g, 2L, as.integer(3))
double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called “character” in R) (e.g, "a", "swc", 'This is a cat')

But should look more like this:

logical (e.g., TRUE, FALSE)
numeric
    integer (e.g, 2L, as.integer(3))
    double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called "character" in R) (e.g, "a", "swc", 'This is a cat')

Reference materials

Hi all,

I'm not sure whether this is more appropriate as an issue or a pull request, but here is the repository for the gapminder lessons I developed for the February workshop we held in Melbourne: https://github.com/resbaz/r-novice-gapminder/

Also some acknowledgements and attributions: These materials were heavily based on materials originally written by @dfalster and @richfitz and modified by @dbarneche for a Software Carpentry R workshop run in Sydney last October (https://github.com/dbarneche/2014-10-31-USyd), and on some of the intermediate R materials (particularly the data structures lesson) written by @karthik (still part of the bc repo: https://github.com/swcarpentry/bc/tree/gh-pages/intermediate/r).

Introduce recycling rule earlier

Right now, the recycling rule is first introduced explicitly in lesson 6, in a section on subsetting, in a subsection on logical operators. Recycling isn't primarily a property of subsetting or logical operations, however, and so my sense is that it could seem like a pretty big digression for the students. We're asking them to do (somewhat advanced) vector operations for the first time in a very unfamiliar context.

What do the project maintainers think about moving section 9 (vector operations, including the recycling rule) up before the section on subsetting? From what I can tell, section 9 doesn't really depend on sections 6-8, except for one advanced example at the very end that uses a function.
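For context, the recycling rule can be shown with plain arithmetic before any subsetting is involved. A minimal sketch with made-up vectors:

```r
x <- 1:6

# The shorter vector c(10, 20) is reused until it matches the length of x:
# 1+10, 2+20, 3+10, 4+20, 5+10, 6+20
recycled <- x + c(10, 20)

# The same rule applies to logical indices: c(TRUE, FALSE) is repeated,
# so every other element is kept.
every_other <- x[c(TRUE, FALSE)]
```

Seeing the arithmetic case first may make the logical-index case in the subsetting lesson feel less like a digression.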

Add more easy challenges

In many of the lessons, the first challenge jumps straight into asking participants to go beyond what was just demonstrated and integrating ideas from other lessons.

I propose making sure there is at least one challenge for each major learning milestone that just revises the topic. This challenge should only reinforce what was just demonstrated, and then the later challenges can slowly expand on this and ask the students to use critical thinking, draw from past topics, do some research etc. The difficulty of the challenges should ramp up gradually.

Also keep in mind using the gapminder dataset as a theme throughout challenges where appropriate: #20

Here's an example of a lesson that I think has this problem (although many of them do, maybe others would like to mention specific cases):
http://swcarpentry.github.io/r-novice-gapminder/07-functions.html

I actually tried to fix this one in the last workshop by pulling in some inflammation examples, but it's still not quite right.

equal 01-rstudio-intro

The first lesson has a tip that says:
you should never use == to compare two numbers unless they are integers.
Instead you should use the all.equal function.

But in R

3.555555555555 == 3.555555555554
[1] FALSE
all.equal(3.555555555555,3.555555555554)
[1] TRUE

From R documentation
Description

all.equal(x, y) is a utility to compare R objects x and y testing ‘near equality’. If they are different, comparison is still made to some extent, and a report of the differences is returned. Do not use all.equal directly in if expressions—either use isTRUE(all.equal(....)) or identical if appropriate.

I think that tip should be removed or updated; using all.equal in this situation is not advisable.
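The pattern recommended by the documentation quoted above can be sketched as follows; the values are chosen only to show floating-point rounding.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because of floating-point rounding.
exact <- a == b

# isTRUE(all.equal(...)) tests near equality and always yields TRUE or FALSE.
near <- isTRUE(all.equal(a, b))

# When values differ, all.equal returns a description instead of FALSE,
# which is why it should not be used directly inside if ().
report <- all.equal(1, 1.1)
is.character(report)
```

An updated tip could point learners at isTRUE(all.equal(x, y)) and note that the result depends on a tolerance.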

Add "*.html" to .gitignore?

Should we add "*.html" to .gitignore? When a core maintainer wants to add the HTML, they can use "git add -f".

Plot example as soon as possible

Summary

Something that I loved about the Software Carpentry lessons the first time I read them is that they mention plots as soon as possible, and I think doing the same would be a great improvement to the R gapminder lesson.

Description

Add a "motivational" topic before 01-rstudio-intro.md. Something like: "At data/gapminder-FiveYearData.csv you will find some information about many countries. Get the population of X and create a plot of it using plot(c(P1, P2, P3, ..., PN)). Congrats, you just wrote your first R code and created a plot with it. Can you plot the population of Y? And the population for all the countries in less than one minute? In the next chapters you will learn a few things about R and at the end you will be able to create a plot for each country in less than one minute."

Improve prerequisites

We have

Have attended the Shell and Git sessions.

as prerequisites and I think that we can

reword the Shell requisite to something like "understand the concepts of files and directories (including the working directory)",
drop the Git requisite since learners will only need it at https://swcarpentry.github.io/r-novice-gapminder/02-project-intro.html.

Move while loop to a callout?

This might be a contentious suggestion... but what about removing while loops from the core lesson, and just mentioning them in a callout? I think while loops are the least useful for beginners. It almost seems like they are just in there for completeness.

I suggest removing the while loop part of the lesson and putting in a callout with links to extra reading for people who are interested.

Thoughts?

Lesson 4: Data Structures: factor is not a data structure

Under 'Data Structures' heading, factor is listed as a data structure, but factor is a data type.
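One way to see the point is to inspect a factor directly. This sketch uses an invented coat vector and only base R functions:

```r
coat <- factor(c("calico", "black", "tabby", "black"))

class(coat)    # the class attribute is what marks it as a factor
typeof(coat)   # the underlying storage is an integer vector
levels(coat)   # the labels that the integer codes point to

# The integer codes index into the (sorted) levels.
codes <- as.integer(coat)
```

Since the underlying object is an ordinary integer vector with extra attributes, describing factor as a type rather than a separate structure seems reasonable.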

Connecting RStudio with git

I'm teaching a workshop where we've done unix and then git and now the r-novice-gapminder material. Tomorrow, I'd really like to show the students how to integrate git with RStudio. Today we started down this path, but when we went into the preferences, it was clear that OS X wouldn't show the /usr/bin/ folder which is where git is stored. Of course, I've long since forgotten how to do this. I'm not sure where it is stored with the Windows installer or how to point RStudio there. Any suggestions on what needs to be done? If anyone can help me, I'll happily file a pull request to include the instructions in 01-rstudio-intro.Rmd once the workshop is over.

Reduce R content to fit in one full day. Potentially create a mid-day end point for half-day workshops.

Just bringing across some discussion from the mailing list about how long the R material should go for.

My summary of that discussion is some agreement that it should go for one day. It may be useful to make it easily run as a half day workshop by creating a natural stopping point mid-way. Any additional lessons that don't fit in a day will be listed as optional extras.

I'm somewhat surprised that you're planning to run the R content in half a day. We've never run less than a full day of R. And I'm a bit worried students wouldn't get up to the juicy bits in half a day.

If you plan on doing this, I would suggest that you organise the lessons so that a subset of them could be taught as a half-day workshop or the full set as a whole day. You could have a "capstone" exercise/topic at the end of the half day content so it still feels like a fully rounded-off session.

What do you think?

Warm regards,
Harriet

Hi Harriet -

That's a good point; I agree completely. Especially with SQL being off the list, I would imagine that almost all workshops will expand the scripting material (R/python) to fill that space. I do also think though that it would be good to have a split point (perhaps with a capstone as you mention), since it may not always be a single contiguous day of R material vs two half-days.

But that said, I think we should probably shoot for having no more than one day's worth of material in the core novice lessons (and other content could go into extra optional lessons or intermediate lessons if people teach an all-R for 2 days type event).

Best,
Naupaka

Please do not change the CSV file on the fly in the lesson

Screenshot of 04-data-structures-part1.html

Description

The lesson has

Go back to your text editor and add this line to feline-data.csv:
tabby,2.3 or 2.4,TRUE
Reload your cats data like before, and check what type of data we find in the weight column:
cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight[1])
[1] "double"
Oh no, our weights aren’t the double type anymore! If we try to do the same math we did on them before, we run into trouble:
cats$weight[1] + cats$weight[2]
[1] 7.1

The text doesn't match with the code examples. We should use data/feline-data2.csv or something like that to avoid the problem.
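The behaviour the lesson text describes can be reproduced without editing any file on disk. In this sketch the CSV contents are inline toy data rather than the real feline-data.csv, and only the "2.3 or 2.4" entry comes from the lesson.

```r
# Inline stand-ins for the clean and edited files.
clean_csv <- "coat,weight\ncalico,2.1\nblack,5.0\ntabby,3.2"
messy_csv <- paste0(clean_csv, "\ntabby,2.3 or 2.4")

clean <- read.csv(text = clean_csv, stringsAsFactors = FALSE)
messy <- read.csv(text = messy_csv, stringsAsFactors = FALSE)

typeof(clean$weight)   # every weight parses as a number
class(messy$weight)    # one non-numeric entry changes the whole column
```

Keeping the edited data in a second file, as suggested, would have the same effect while leaving the original untouched for later sections.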

Duplicate heading on 04-data-structures-part1.Rmd

$ git log --oneline -1
0c89e17 Updating HTML
$ grep -n "## Factors" 04-data-structures-part1.Rmd
226:## Factors
241:## Factors

Having two headings with the same name doesn't make sense to me. I think that the first one should be "Data Frame".
