
swcarpentry / r-novice-gapminder


R for Reproducible Scientific Analysis

Home Page: http://swcarpentry.github.io/r-novice-gapminder/

License: Other

Shell 1.02% R 93.41% TeX 5.57%
carpentries software-carpentry lesson r data-wrangling data-visualisation data-visualization english programming stable

r-novice-gapminder's Introduction


R for Reproducible Scientific Analysis

An introduction to R for non-programmers using the Gapminder data. Please see https://swcarpentry.github.io/r-novice-gapminder for a rendered version of this material, the lesson template documentation for instructions on formatting, building, and submitting material, or run make in this directory for a list of helpful commands.

The goal of this lesson is to teach novice programmers to write modular code and to follow best practices when using R for data analysis. R is widely used across scientific disciplines, both for statistical analysis and for its array of third-party packages. We find that many scientists who come to Software Carpentry workshops use R and want to learn more. These materials aim to give attendees a strong foundation in the fundamentals of R and to teach best practices for scientific computing: breaking down analyses into modular units, task automation, and encapsulation.

Note that this workshop focuses on the fundamentals of the programming language R, and not on statistical analysis.

The lesson contains more material than can be taught in a day. The [instructor notes page]({{ page.root }}/guide) has some suggested lesson plans suitable for a one-day or half-day workshop.

A variety of third party packages are used throughout this workshop. These are not necessarily the best, nor are they comprehensive, but they are packages we find useful, and have been chosen primarily for their usability.


r-novice-gapminder's People

Contributors

aammd, aaren, abought, annajiat, ateucher, bbolker, bkatiemills, brynnelliott, claresloggett, fmichonneau, griffinp, hdashnow, jcoliver, jrnold, katrinleinweber, kbroman, matthieu-bruneaux, mawds, mkuzak, naupaka, nfaux, nlesniak, remi-daigle, rgaiacs, rmcd1024, sritchie73, tem11010, tomwright01, vince-p, zkamvar


r-novice-gapminder's Issues

08-plot-ggplot2 Switch axis for life exp. & GDP?

The examples show GDP on the y-axis and life expectancy on the x-axis. As many discussions (elsewhere) concern the utility of GDP in predicting life expectancy, should GDP instead be on the x-axis?

equal 01-rstudio-intro

The first lesson has a tip that says:
you should never use == to compare two numbers unless they are integers.
Instead you should use the all.equal function.

But in R

3.555555555555 == 3.555555555554
[1] FALSE
all.equal(3.555555555555,3.555555555554)
[1] TRUE

From R documentation
Description

all.equal(x, y) is a utility to compare R objects x and y testing ‘near equality’. If they are different, comparison is still made to some extent, and a report of the differences is returned. Do not use all.equal directly in if expressions—either use isTRUE(all.equal(....)) or identical if appropriate.

I think that tip should be removed or updated. Using all.equal directly in this situation is not advisable.
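The point raised above can be illustrated with a short sketch. It reuses the two numbers from this issue; the variable names are only illustrative. It shows that all.equal() tests near equality within a tolerance, and that it returns a descriptive string rather than FALSE when the values differ, which is why the documentation recommends wrapping it in isTRUE().

```r
x <- 3.555555555555
y <- 3.555555555554

# Exact comparison: the two doubles differ in the 12th decimal place.
exact_equal <- x == y

# Near equality with the default tolerance (about 1.5e-8).
near_equal <- isTRUE(all.equal(x, y))

# A much stricter tolerance makes the same pair count as different.
strict_equal <- isTRUE(all.equal(x, y, tolerance = 1e-15))

# When values differ, all.equal() returns a character description,
# not FALSE, so it should not be used bare inside if().
res <- all.equal(1, 1.1)

print(exact_equal)
print(near_equal)
print(strict_equal)
print(res)
```

Whether a tip should recommend isTRUE(all.equal(...)) or identical() depends on whether a tolerance is actually wanted, which is a decision for the lesson maintainers.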

Error in 05-data-structures-part2

The exercise where we get learners to rbind to an existing dataframe is wrong.

df <- data.frame(id = c('a', 'b', 'c', 'd', 'e', 'f'), x = 1:6, y = c(214:219))
df
df <- rbind(df, list("g", 11, 42))

Should give a warning (the new value is not a valid factor level, so an NA is generated) and the following:

class(df$id)

Should give us "factor" but instead has "character" in the .md and .html versions.

This is used to motivate the use of stringsAsFactors = FALSE, which seems to have already been turned on before this code was run.

I'd submit a pull request, but the .Rmd is correct, just the parsing of it is wrong.

Maybe whoever built the lessons from html has this globally enabled?
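A minimal sketch of the two behaviours being discussed, with stringsAsFactors set explicitly so the result does not depend on the R version or on any global option. The data frames are shortened versions of the one in this issue, and the object names are illustrative.

```r
# With factors: "g" is not an existing level, so rbind() warns
# and stores NA in the id column of the new row.
df_factor <- data.frame(id = c("a", "b", "c"), x = 1:3,
                        stringsAsFactors = TRUE)
id_class_factor <- class(df_factor$id)
df_factor2 <- suppressWarnings(rbind(df_factor, list("g", 11)))

# With characters: the new row is appended without complaint.
df_char <- data.frame(id = c("a", "b", "c"), x = 1:3,
                      stringsAsFactors = FALSE)
id_class_char <- class(df_char$id)
df_char2 <- rbind(df_char, list("g", 11))

print(id_class_factor)
print(df_factor2)
print(id_class_char)
print(df_char2)
```

If the rendered lesson shows "character" where the .Rmd expects "factor", the build environment most likely had the second behaviour as its default.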

Update topic timings

The header material for each topic file includes the expected amount of time each lesson should take. Most of them are the default: 15 minutes. @hdashnow can you update each of the topics with realistic times? You've had the most experience running this lesson material.

Archival PRs

Over in the git-novice lessons they use closed PRs to keep an archive of workshops, to track how different people implement and change the lessons. Shall we try a similar approach here?

I will try this approach for this week's SWC workshop at Simon Fraser University

Reference materials

Hi all,

I'm not sure whether this is more appropriate as an issue or a pull request, but here is the repository for the gapminder lessons I developed for the February workshop we held in Melbourne: https://github.com/resbaz/r-novice-gapminder/

Also some acknowledgements and attributions: These materials were heavily based on materials originally written by @dfalster and @richfitz and modified by @dbarneche for a Software Carpentry R workshop run in Sydney last October (https://github.com/dbarneche/2014-10-31-USyd), and on some of the intermediate R materials (particularly the data structures lesson) written by @karthik (still part of the bc repo: https://github.com/swcarpentry/bc/tree/gh-pages/intermediate/r).

Tie back to shell session?

In the context of a SWC workshop it would be nice to tie this lesson to the shell by calling Rscript from the shell. It's a lot to ask because it requires source() (conceptually) and commandArgs() (to be interesting). Maybe mention it in further readings?
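As a rough sketch of what such a tie-in could look like, the script below reads its arguments with commandArgs(). The file name gdp-args.R, the helper parse_args(), and the country/year arguments are all hypothetical and not part of the lesson.

```r
# Hypothetical script, saved for example as gdp-args.R and run from
# the shell as:  Rscript gdp-args.R Canada 2007

# Turn the raw character arguments into a named list.
parse_args <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript gdp-args.R <country> <year>")
  }
  list(country = args[1], year = as.integer(args[2]))
}

# trailingOnly = TRUE drops R's own options and keeps only the
# arguments that follow the script name.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 2) {
  parsed <- parse_args(args)
  cat("Country:", parsed$country, "Year:", parsed$year, "\n")
} else {
  cat("No country/year given; nothing to do.\n")
}
```

A real version would presumably source() the lesson's functions and call them with the parsed values, which is the conceptual step mentioned above.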

tidyr lesson - inconsistencies in dataframe atomic types

When you load the gap_wide data from the .csv file in the repo, columns 37 and 38 are read in as integers and columns 1 and 2 are read in as factors. When you follow along in the lesson, columns 1 and 2 of gap_wide_new are characters and columns 37 and 38 are numeric. This means the all.equal() function won't work until you make those changes in the gap_wide dataframe.
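The kind of conversion described here might look like the sketch below. It uses two tiny made-up data frames rather than the real gap_wide data, so the column names and values are placeholders only.

```r
# Stand-in for data read from CSV with factor and integer columns.
from_csv <- data.frame(country = factor(c("A", "B")),
                       value = c(10L, 20L))

# Stand-in for data built in the lesson with character and double columns.
from_lesson <- data.frame(country = c("A", "B"),
                          value = c(10, 20),
                          stringsAsFactors = FALSE)

# Before conversion all.equal() reports differences (a character vector).
before <- all.equal(from_csv, from_lesson)

# Convert the column types to match, then compare again.
from_csv$country <- as.character(from_csv$country)
from_csv$value <- as.numeric(from_csv$value)
after <- all.equal(from_csv, from_lesson)

print(before)
print(after)
```

Reading the file with stringsAsFactors = FALSE in the first place would avoid the factor half of the mismatch.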

Repetition/modification of code example

In 13-dplyr.Rmd there's a code example that reads:

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_pop=mean(pop),
              sd_pop=sd(pop))

In context, this should possibly be:

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))

or it might be an unintended duplication, which could be removed.

Connecting RStudio with git

I'm teaching a workshop where we've done unix and then git and now the r-novice-gapminder material. Tomorrow, I'd really like to show the students how to integrate git with RStudio. Today we started down this path, but when we went into the preferences, it was clear that OS X wouldn't show the /usr/bin/ folder, which is where git is stored. Of course, I've long since forgotten how to do this. I'm also not sure where git is stored by the Windows installer or how to point RStudio there. Any suggestions on what needs to be done? If anyone can help me, I'll happily file a pull request to include the instructions in 01-rstudio-intro.Rmd once the workshop is over.

Introduce recycling rule earlier

Right now, the recycling rule is first introduced explicitly in lesson 6, in a section on subsetting, in a subsection on logical operators. Recycling isn't primarily a property of subsetting or logical operations, however, and so my sense is that it could seem like a pretty big digression for the students. We're asking them to do (somewhat advanced) vector operations for the first time in a very unfamiliar context.

What do the project maintainers think about moving section 9 (vector operations, including the recycling rule) up before the section on subsetting? From what I can tell, section 9 doesn't really depend on sections 6-8, except for one advanced example at the very end that uses a function.
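For readers unfamiliar with the rule under discussion, here is a small sketch of recycling in plain arithmetic, in a comparison, and in logical subsetting. The vectors are arbitrary examples, not taken from the lesson.

```r
x <- 1:6

# Arithmetic: the shorter vector c(10, 20) is repeated three times.
sums <- x + c(10, 20)

# Comparison: c(2, 5) is recycled in the same way.
bigger <- x > c(2, 5)

# Logical subsetting: c(TRUE, FALSE) is recycled to keep every other element.
every_other <- x[c(TRUE, FALSE)]

# When the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (suppressed here).
uneven <- suppressWarnings(1:5 + 1:2)

print(sums)
print(bigger)
print(every_other)
print(uneven)
```

The subsetting case is the one learners currently meet first, which is the ordering concern raised in this issue.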

one day approach to teaching material

I just finished teaching a one day version (~6 hrs including breaks) of these materials. I thought it might be useful to share what I did for others that might be looking for a trimmed down version of these thorough materials. The workshop started with a morning of bash, then an afternoon and morning of R, and then an afternoon of git. That turned out to be a mistake for the person teaching git, but whatever. My motivation was to give people the minimum that they needed to get going with R.

  • Lesson 1: I quickly went over the material without using git
  • Lesson 2: skipped
  • Lesson 4: More or less went up to but not including factors
  • Lesson 5: Combined with the material from Lesson 4
  • Lesson 9: Combined with the material from Lesson 4
  • Lesson 3: Discussed using the help window in RStudio to find help with read.csv
  • Lesson 6: Skipped using which, "Handling special values", "Factor subsetting", "Matrix subsetting", "List subsetting"
  • Lesson 7: Got to the gdp calculator at the end of the first day and picked up there on the second day
  • Lesson 10: Only taught if ... else in the context of the more sophisticated gdp calculator
  • Lesson 11: skipped
  • Lesson 12: skipped
  • Lesson 13: Followed pretty literally
  • Lesson 8: Followed pretty literally
  • Lesson 14: skipped
  • Lesson 15: skipped
  • I also added about 20-30 min on knitr showing how the plots from Lesson 8 could be put into a document.

I realize that I skipped lists and matrices and barely introduced factors to say they are categorical data types. To get through the dplyr and ggplot2 stuff those just aren't needed and you can go a long way in R without needing them.

When teaching the function component I tried to build up the gdp calculator piece by piece. I would show them how to do the year and have them do the country. Towards the end of the first day I could sense that they weren't getting it and they were glazing over. So I had them get in pairs and alternate explaining each line to each other. They really perked up and seemed to have more confidence. We repeated this the following morning to rebuild what we had done and to go forward with the if statements.

Having taught some version of these materials twice now, I fear that a lot of the SWC materials have become bloated beyond what is truly necessary to get someone going, and that affected what I taught and how I approached the materials.

Lesson on packages and package ecosystem

I'm loath to add anything to this, but I think an R intro could use an introduction to packages and the R package ecosystem. This would NOT be about how to create packages, but about navigating repos, finding packages, and understanding them as collections of functions.

Here are some initial thoughts. Feedback?

Time: 25 mins?
Goals: Students should be able to:

  • Understand the difference between installing and loading packages.
  • See what packages are on their machine and loaded.
  • Deal with Error: could not find function by loading/installing packages.
  • Identify multiple sources of R packages (CRAN, BioConductor, ROpenSci, Github)
  • Install a package with install.packages.
  • View the source code of a package function, both locally and on GitHub (via the metacran GitHub org).
  • Install a package with devtools::install_github
  • Install a package downloaded from a website.
  • Change CRAN mirrors or package repos.
  • View package Index, DESCRIPTION, and vignettes both on the web and locally.
  • Identify when a package may have a non-R dependency (actually installing may be too much).
  • Find packages suitable for a task using CRAN Task Views, r-pkg.org, and web search
  • Identify a person or resource to contact with questions about packages and bug reports.

(Yes, this is the ambitious version)
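The first few goals could be demonstrated along these lines. This is only a sketch: is_installed() is a hypothetical helper, "notARealPackage12345" is a deliberately made-up name, and the stats package is used because it ships with R and needs no network access.

```r
# Installed means the package exists in a library on disk.
# requireNamespace() returns FALSE instead of failing when it does not.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

have_stats <- is_installed("stats")
have_fake <- is_installed("notARealPackage12345")

# Loaded and attached means library() has put it on the search path,
# so its functions can be called without the pkg:: prefix.
library(stats)
attached <- "package:stats" %in% search()

# List a few of the packages installed on this machine.
print(head(rownames(installed.packages())))

# Typing a function name without parentheses prints its source.
print(stats::sd)

print(have_stats)
print(have_fake)
print(attached)
```

In a workshop, install.packages() for a missing package would be the natural next step, network permitting.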

Challenges:

  • Find a package for a given task (likely related to the field of people at the workshop)
  • Install package via CRAN/BioC and github
  • Find the appropriate package help listserv for a package
  • An ambitious idea is to make a fake SWC package that has the answer to questions embedded. e.g., "Who is the maintainer of swctools?", "File a bug report for swctools.", "The source for swctools::somefn contains coordinates to treasure..."

Notes:

  • Package installation is a notoriously finicky part of the workshop due to wi-fi issues, students lacking permissions on lab-issued computers, etc. Hopefully this lesson is taught later in the second day, after those first install issues are identified and dealt with.
  • It's important to make a point about the human side of the package ecosystem. Authors are humans, quality is not guaranteed. Things like number of users, package age and active maintenance are signals of quality. (Related to Mozilla's Web Literacy Map concept of credibility).
  • I imagine that field-specific Data Carpentry workshops would customize this, especially on the BioC/genomics side (CC @tracykteal)

Should we include plotting focused primarily on ggplot or base graphics?

Discussion in #1 seems to indicate that many students like learning ggplot, and if they have no prior experience with base R plotting then it is not so hard to pick up. However, it may also be too much to cover along with everything else, and novices will run into cases where it is helpful to understand some of how base graphics work as well. How should we focus the lessons?

Add more easy challenges

In many of the lessons, the first challenge jumps straight into asking participants to go beyond what was just demonstrated and integrating ideas from other lessons.

I propose making sure there is at least one challenge for each major learning milestone that just revises the topic. This challenge should only reinforce what was just demonstrated, and then the later challenges can slowly expand on this and ask the students to use critical thinking, draw from past topics, do some research, etc. The difficulty of the challenges should ramp up gradually.

Also keep in mind using the gapminder dataset as a theme throughout challenges where appropriate: #20

Here's an example of a lesson that I think has this problem (although many of them do, maybe others would like to mention specific cases):
http://swcarpentry.github.io/r-novice-gapminder/07-functions.html

I actually tried to fix this one in the last workshop by pulling in some inflammation examples, but it's still not quite right.

Add "*.html" to .gitignore?

Should we add "*.html" to .gitignore? When a core maintainer wants to add the HTML, they can use "git add -f".

Update MAKEFILE to check knitr version

make preview will only compile the code blocks in the challenges and callouts correctly if knitr is version 1.10.12 or higher. Currently, this requires installing from github: (devtools::install_github("yihui/knitr")).

I've written an R script that will throw an error if the knitr version is too low (tools/check_knitr_version.R) in Pull Request #41, but I wasn't able to successfully incorporate it into the Makefile to prevent future contributors from clobbering the code block rendering.
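The contents of tools/check_knitr_version.R are not shown here, so the following is only a guess at the general shape such a check could take. The helpers meets_minimum() and check_package_version() are hypothetical names; packageVersion() and package_version() are base R.

```r
# Compare version strings component by component. Comparing them as
# plain strings would be wrong: "1.10.5" sorts after "1.10.12".
meets_minimum <- function(installed, required) {
  package_version(installed) >= package_version(required)
}

# Stop with an error if a package is missing or too old.
check_package_version <- function(pkg, required) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(pkg, " is not installed")
  }
  installed <- as.character(packageVersion(pkg))
  if (!meets_minimum(installed, required)) {
    stop(pkg, " ", installed, " is older than required ", required)
  }
  invisible(TRUE)
}

too_old <- meets_minimum("1.10.5", "1.10.12")
new_enough <- meets_minimum("1.11", "1.10.12")
print(too_old)
print(new_enough)
```

Calling check_package_version("knitr", "1.10.12") from a script run with Rscript would make the process exit with a non-zero status on failure, which is what a Makefile rule needs in order to stop the build.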

Spell check?

I would be happy to do a global spell check run on the Rmd files. If this is a good idea, let me know when origin/gh-pages is ready for a PR with lots of small changes.

Duplicate heading on 04-data-structures-part1.Rmd

$ git log --oneline -1
0c89e17 Updating HTML
$ grep -n "## Factors" 04-data-structures-part1.Rmd
226:## Factors
241:## Factors

Having two headings with the same name doesn't make sense to me. I think that the first one should be "Data Frame".

Conversion from R-novice-inflammation

It looks like a lot of work has been put into the r-novice-inflammation lessons over the last few months, and they now cover a lot more R-specific material (as opposed to a literal translation of the python materials).

I'm wondering if pulling in those lessons, and replacing the Inflammation data with the gapminder data is a good place to start for paring this material down to a half day workshop? We could then have extra lessons for instructors who want to run R over a full day (e.g. a ggplot2 lesson and knitr lesson).

make problems: .html not produced by 'make preview'

The original issue was about 'make check' failing. That got filed as an issue in the lesson template repo at swcarpentry/DEPRECATED-lesson-template#299

I was unable to get 'make preview' to work. I created a new file with a .Rmd extension, and ran 'make preview' and it did not create a corresponding .html file. I removed all the .html files, even. R is in my path.

I am doing this on a fresh installation of El Capitan, with freshly installed R, RStudio, Pandoc, Anaconda[23], etc. All of the necessary R libraries are installed, and if I manually knit the new .Rmd file, it produces the proper output.

I may not be able to get to this this evening, but I'll try it again, just to be sure it was not an oversight on my part.

Convert non-topic Markdown files to Rmarkdown?

Following the discussion in Issue #17, I propose we convert the non-topic markdown files (e.g. LAYOUT.md, index.md, CONTRIBUTING.md, etc.) to R markdown files. That way we can add "*.md" to the .gitignore, to simplify the process for future contributors (who should then only ever edit the .Rmd files).

Reduce R content to fit in one full day. Potentially create a mid-day end point for half-day workshops.

Just bringing across some discussion from the mailing list about how long the R material should go for.

My summary of that discussion is some agreement that it should go for one day. It may be useful to make it easily run as a half day workshop by creating a natural stopping point mid-way. Any additional lessons that don't fit in a day will be listed as optional extras.

I'm somewhat surprised that you're planning to run the R content in half a day. We've never run less than a full day of R. And I'm a bit worried students wouldn't get up to the juicy bits in half a day.

If you plan on doing this, I would suggest that you organise the lessons so that a subset of them could be taught as a half-day workshop or the full set as a whole day. You could have a "capstone" exercise/topic at the end of the half day content so it still feels like a fully rounded-off session.

What do you think?

Warm regards,
Harriet

Hi Harriet -

That's a good point; I agree completely. Especially with SQL being off the list, I would imagine that almost all workshops will expand the scripting material (R/python) to fill that space. I do also think though that it would be good to have a split point (perhaps with a capstone as you mention), since it may not always be a single contiguous day of R material vs two half-days.

But that said, I think we should probably shoot for having no more than one day's worth of material in the core novice lessons (and other content could go into extra optional lessons or intermediate lessons if people teach an all-R, two-day type of event).

Best,
Naupaka

Remove extra files in repo that have been superseded by content within lessons

There are a couple of files that are no longer needed. I think it would be safe to get rid of or else transition the content into one of the first few lesson modules. At any rate, their content is generally out of date compared to the current version of the materials.

Need to be removed or transferred:

  • motivation.md
  • OUTLINE.md
  • plan.md

From make check:

ERROR: Validation failed for ./motivation.md: Could not automatically identify correct template.
ERROR: Validation failed for ./OUTLINE.md: Could not automatically identify correct template.
ERROR: Validation failed for ./plan.md: Could not automatically identify correct template.

06-data-subsetting: order of operations

Currently, the callout on order of operations in 06-data-subsetting says:

remember the order of operations. : is really a function, so what happens is it takes its first argument as -1, and second as 3...

I have two concerns with this:

  1. "is really a function" doesn't seem germane: as I understand it, all operators are functions, but only some have lower precedence than the unary minus sign.

  2. I don't think we can ask the students to "remember" order of operations here. Even if the students knew where : fit in the hierarchy from high-precedence operators like ^ to low-precedence operators like =, they probably wouldn't anticipate that R gives very different precedences to unary and binary minus signs:

> -1:3      # Unary "minus" evaluated before `:`
[1] -1  0  1  2  3

> 0 - 1:3   # Binary "minus" evaluated after `:`
[1] -1 -2 -3

Given that we probably don't want a digression into R's operator precedence, I'm not sure what the solution is (other than recommending that students always use parentheses around arguments to : when math is happening nearby). I thought I'd raise the issue to see what other SWC folk think.
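The parentheses recommendation can be shown with a few lines. This sketch adds one common case that is not in the original issue, 1:n-1 versus 1:(n-1), and the variable names are arbitrary.

```r
# Unary minus binds tighter than `:`, so this is (-1):3.
a <- -1:3

# Parentheses make the other reading explicit.
b <- -(1:3)

# Binary minus binds looser than `:`, so this is 0 - (1:3).
c_vals <- 0 - 1:3

# The same trap with a variable: (1:n) - 1 versus 1:(n - 1).
n <- 5
d <- 1:n - 1
e <- 1:(n - 1)

print(a)
print(b)
print(c_vals)
print(d)
print(e)
```

Writing the parentheses every time means learners never have to recall where `:` sits in the precedence table.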

Please do not change the CSV file on the fly for the lesson

[Screenshot of 04-data-structures-part1.html, taken 2016-05-14]

Description

The lesson has

Go back to your text editor and add this line to feline-data.csv:

tabby,2.3 or 2.4,TRUE

Reload your cats data like before, and check what type of data we find in the weight column:

cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight[1])
[1] "double"

Oh no, our weights aren’t the double type anymore! If we try to do the same math we did on them before, we run into trouble:

cats$weight[1] + cats$weight[2]
[1] 7.1

The text doesn't match with the code examples. We should use data/feline-data2.csv or something like that to avoid the problem.
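What the lesson text intends to show can be reproduced without editing any file on disk, for example by reading the CSV from a string. The sketch below assumes a three-column layout (coat, weight, likes_string) with illustrative values chosen so the first two weights sum to 7.1 as in the output above. stringsAsFactors is set explicitly so the result does not depend on the R version.

```r
# Original data: weight is purely numeric.
clean_csv <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1"

# Same data plus the extra row with a non-numeric weight.
messy_csv <- paste0(clean_csv, "\ntabby,2.3 or 2.4,TRUE")

cats <- read.csv(text = clean_csv, stringsAsFactors = FALSE)
cats_bad <- read.csv(text = messy_csv, stringsAsFactors = FALSE)

# One bad entry turns the whole column into text.
clean_type <- typeof(cats$weight)
messy_type <- typeof(cats_bad$weight)

# Arithmetic on the text column now fails; capture the error.
result <- tryCatch(cats_bad$weight[1] + cats_bad$weight[2],
                   error = function(e) e)

print(clean_type)
print(messy_type)
print(conditionMessage(result))
```

A second file such as data/feline-data2.csv, as suggested above, would achieve the same separation between the clean and modified data.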

Knitr lessons

It would be really nice to have a small lesson on the use of knitr for report generation within RStudio. @nfaux ran a workshop last week with a knitr lesson, but there are not yet any corresponding lesson materials. We've got his project file to use as reference materials.

Move while loop to a callout?

This might be a contentious suggestion... but what about removing while loops from the core lesson, and just mentioning them in a callout? I think while loops are the least useful for beginners. It almost seems like they are just in there for completeness.

I suggest removing the while loop part of the lesson and putting in a callout with links to extra readings for people who are interested.

Thoughts?

Solutions for challenges

Many students asked for solutions to the challenges. This is particularly important when we don't get time to discuss the later challenges with the whole class. It's also handy for new instructors teaching the first workshop.

I suggest putting a link to a page of challenges and solutions at the end of each lesson.

Data types lesson 4

http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html
R has 5 basic atomic types (meaning they can’t be broken down into anything smaller):

(there is a problem with indentation and how the atomic types are counted)
It looks like this:

logical (e.g., TRUE, FALSE)
numeric
integer (e.g, 2L, as.integer(3))
double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called “character” in R) (e.g, "a", "swc", 'This is a cat')

But should look more like this:

logical (e.g., TRUE, FALSE)
numeric
    integer (e.g, 2L, as.integer(3))
    double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called "character" in R) (e.g, "a", "swc", 'This is a cat')

make preview fails

Looks like one of the python scripts isn't happy:

pandoc -s -t html \
    --template=_layouts/page \
    --filter=tools/filters/blockquote2div.py \
    --filter=tools/filters/id4glossary.py \
    -Vheader="$(cat _includes/header.html)" -Vbanner="$(cat _includes/banner.html)" -Vfooter="$(cat _includes/footer.html)" -Vjavascript="$(cat _includes/javascript.html)" \
    -o 01-rstudio-intro.html 01-rstudio-intro.md
pandoc: Error running filter tools/filters/blockquote2div.py
fd:4: hPutBuf: resource vanished (Broken pipe)
make: *** [01-rstudio-intro.html] Error 83

Separate challenge chunks

For many of the back to back challenges, there is only one {.challenge} markdown block. Clarity would be increased by separating out into a challenge block for each individual challenge. See the end of 01-rstudio-intro.Rmd for an example.

Plot example as soon as possible

Summary

Something I loved about the Software Carpentry lessons the first time I read them is that they mention plots as soon as possible, and I think doing the same would be a great improvement to the R Gapminder lesson.

Description

Add a "motivational" topic before 01-rstudio-intro.md. Something like: "At data/gapminder-FiveYearData.csv you will find some information about many countries. Get the population of X and create a plot of it using plot(c(P1, P2, P3, ..., PN)). Congrats, you just wrote your first R code and created a plot with it. Can you plot the population of Y? And the population for all the countries in less than one minute? In the next chapters you will learn a few things about R and at the end you will be able to create a plot for each country in less than one minute."

Add motivational slides

We ought to have a short motivational slideshow that we can use at the start of a lesson. From LAYOUT.md:

Every lesson must include a short slide deck in motivation.md suitable for a short presentation (3 minutes or less) that the instructor can use to explain to learners how knowing the subject will help them.

I was thinking that #37 would make really great material for such a thing. The original authors could work on that, or if people agree we can merge that in and someone else can make it into slides
cc @hdashnow @SamPenrose
