swcarpentry / r-novice-gapminder
R for Reproducible Scientific Analysis
Home Page: http://swcarpentry.github.io/r-novice-gapminder/
License: Other
In the context of a SWC workshop it would be nice to tie this lesson to the shell by calling Rscript from the shell. It's a lot to ask because it requires source() (conceptually) and commandArgs() (to be interesting). Maybe mention it in further readings?
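As a rough illustration of what tying the lesson to the shell might look like, here is a minimal sketch that reads command-line arguments with commandArgs(). The script name gdp-summary.R, the helper parse_year() and the default year are hypothetical, made up for this example.

```r
# Hypothetical script: save as gdp-summary.R and run from the shell with
#   Rscript gdp-summary.R 1952

# Turn the first command-line argument into a year, falling back to a
# default when no argument is supplied (parse_year is illustrative only).
parse_year <- function(args, default = 2007L) {
  if (length(args) == 0) {
    return(default)
  }
  as.integer(args[1])
}

# trailingOnly = TRUE drops R's own arguments and keeps only those
# given after the script name.
args <- commandArgs(trailingOnly = TRUE)
year <- parse_year(args)
cat("Summarising data for year", year, "\n")
```

Keeping the argument handling in a small function means the same file can also be loaded with source() in an interactive session and the function tried out by hand.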
I would be happy to do a global spell check run on the Rmd files. If this is a good idea, let me know when origin/gh-pages is ready for a PR with lots of small changes.
When editing the materials and challenges, I think it's useful to keep in mind how they relate back to the central thread of the gapminder data and the kind of analysis we can do with it.
Discuss.
Are we done with e.g. 01-rstudio-intro.md now that we have 01-rstudio-intro.Rmd?
Discussion in #1 seems to indicate that many students like learning ggplot, and if they have no prior experience with base R plotting then it is not so hard to pick up. However, it may also be too much to cover along with everything else, and novices will run into cases where it is helpful to understand some of how base graphics work as well. How should we focus the lessons?
Currently, the callout on order of operations in 06-data-subsetting says:
remember the order of operations. : is really a function, so what happens is it takes its first argument as -1, and second as 3...
I have two concerns with this:
"is really a function" doesn't seem germane: as I understand it, all operators are functions, but only some have lower precedence than the unary minus sign.
I don't think we can ask the students to "remember" order of operations here. Even if the students knew where : fit in the hierarchy from high-precedence operators like ^ to low-precedence operators like =, they probably wouldn't anticipate that R gives very different precedences to unary and binary minus signs:
> -1:3 # Unary "minus" evaluated before `:`
[1] -1 0 1 2 3
> 0 - 1:3 # Binary "minus" evaluated after `:`
[1] -1 -2 -3
Given that we probably don't want a digression into R's operator precedence, I'm not sure what the solution is (other than recommending that students always use parentheses around arguments to : when math is happening nearby). I thought I'd raise the issue to see what other SWC folk think.
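To make the parentheses recommendation concrete, here is a small sketch in the subsetting context. The vector x is invented for illustration.

```r
x <- c(10, 20, 30, 40, 50)

# Parentheses make the intended grouping explicit.
neg_seq <- -(1:3)      # negate the whole sequence: -1 -2 -3
mixed_seq <- (-1):3    # what -1:3 actually means: -1 0 1 2 3

# Dropping the first three elements works with the parenthesised form.
dropped <- x[-(1:3)]
print(dropped)

# Without parentheses the index mixes negative and positive values,
# which R rejects, so the error is caught here instead of stopping.
res <- tryCatch(x[-1:3], error = function(e) "error")
print(res)
```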
06-data-subsetting has an unnumbered challenge between challenge 1 and 2. This challenge also lacks a solution at the bottom of the page.
make preview will only compile the code blocks in the challenges and callouts correctly if knitr is version 1.10.12 or higher. Currently, this requires installing from github: (devtools::install_github("yihui/knitr")).
I've written an R script that will throw an error if the knitr version is too low (tools/check_knitr_version.R) in Pull Request #41, but I wasn't able to successfully incorporate it into the Makefile to prevent future contributors from clobbering the code block rendering.
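For readers wondering what such a version check could look like, here is a minimal sketch. It is not the contents of tools/check_knitr_version.R; the function name check_min_version() is hypothetical, and the version strings are passed in so the logic can be tried without knitr installed.

```r
# Stop with an informative error when an installed version is too old.
# `installed` and `required` are version strings such as "1.10.12".
check_min_version <- function(installed, required) {
  if (package_version(installed) < package_version(required)) {
    stop("Version ", installed, " is too old; need ", required, " or higher.")
  }
  invisible(TRUE)
}

# In a real check, `installed` would come from
#   as.character(utils::packageVersion("knitr"))
check_min_version("1.11", "1.10.12")
```

Comparing with package_version() rather than plain strings matters because "1.9" sorts after "1.10.12" as text but is the older version.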
It looks like a lot of work has been put into the r-novice-inflammation lessons over the last few months, and they now cover a lot more R-specific material (as opposed to a literal translation of the python materials).
I'm wondering if pulling in those lessons, and replacing the Inflammation data with the gapminder data is a good place to start for paring this material down to a half day workshop? We could then have extra lessons for instructors who want to run R over a full day (e.g. a ggplot2 lesson and knitr lesson).
The content and links in reference.md need to be updated to match the new lesson order and names.
At the top of https://github.com/swcarpentry/r-novice-gapminder, please add a link to http://swcarpentry.github.io/r-novice-gapminder/index.html so people know where to find the rendered version. I can't put in a PR for that, only those with commit access can add it (right after where it says 'Introduction to R for non-programmers using gapminder data.')
The header material for each topic file includes the expected amount of time each lesson should take. Most of them are the default: 15 minutes. @hdashnow can you update each of the topics with realistic times? You've had the most experience running this lesson material.
This information has been updated since the repository was started. Key information includes R markdown usage. Pull from https://github.com/swcarpentry/lesson-example
The original issue was about 'make check' failing. That got filed as an issue in the lesson template repo at swcarpentry/DEPRECATED-lesson-template#299
I was unable to get 'make preview' to work. I created a new file with a .Rmd extension, and ran 'make preview' and it did not create a corresponding .html file. I removed all the .html files, even. R is in my path.
I am doing this on a fresh installation of El Capitan, with freshly installed R, RStudio, Pandoc, Anaconda[23], etc. All of the necessary R libraries are installed, and if I manually knit the new .Rmd file, it produces the proper output.
I may not be able to get to this this evening, but I'll try it again, just to be sure it was not an oversight on my part.
I just finished teaching a one day version (~6 hrs including breaks) of these materials. I thought it might be useful to share what I did for others that might be looking for a trimmed down version of these thorough materials. The workshop started with a morning of bash, then an afternoon and morning of R, and then an afternoon of git. That turned out to be a mistake for the person teaching git, but whatever. My motivation was to give people the minimum that they needed to get going with R.
read.csv
which, "Handling special values", "Factor subsetting", "Matrix subsetting", "List subsetting"
if ... else in the context of the more sophisticated gdp calculator
knitr showing how the plots from Lesson 8 could be put into a document.
I realize that I skipped lists and matrices and barely introduced factors to say they are categorical data types. To get through the dplyr and ggplot2 stuff those just aren't needed and you can go a long way in R without needing them.
When teaching the function component I tried to build up the gdp calculator piece by piece. I would show them how to do the year and have them do the country. Towards the end of the first day I could sense that they weren't getting it and they were glazing over. So I had them get in pairs and alternate explaining each line to each other. They really perked up and seemed to have more confidence. We repeated this the following morning to rebuild what we had done and to go forward with the if statements.
Having taught some version of these materials twice now, I fear that a lot of the SWC materials have become bloated beyond what is truly necessary to get someone going, and that affected what I covered and how I approached the materials.
I'm loath to add anything to this, but I think an R intro could use an introduction to packages and the R package ecosystem. This would NOT be about how to create packages, but about navigating repos, finding packages and understanding them as collections of functions.
Here are some initial thoughts. Feedback?
Time: 25 mins?
Goals: Students should be able to:
Error: could not find function by loading/installing packages.
(Yes, this is the ambitious version)
Challenges:
swctools?", "File a bug report for swctools.", "The source for swctools::somefn contains coordinates to treasure..."
Notes:
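A possible starting point for the "could not find function" goal, sketched with a package that ships with R so nothing needs installing. The file name is invented, and the sketch assumes a default session in which the tools package is not attached.

```r
# file_ext() lives in the tools package, which ships with R but is not
# attached in a default session, so calling it directly fails there.
msg <- tryCatch(file_ext("gapminder.csv"),
                error = function(e) conditionMessage(e))
print(msg)

# Either attach the package with library() ...
library(tools)
ext <- file_ext("gapminder.csv")

# ... or call the function with an explicit namespace prefix.
ext2 <- tools::file_ext("gapminder.csv")

# Packages that do not ship with R need installing first, for example:
#   install.packages("dplyr")
```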
Via discussion in #1 it seems like people want to take it out, especially for novice materials.
The exercise where we get learners to rbind to an existing dataframe is wrong.
df <- data.frame(id = c('a', 'b', 'c', 'd', 'e', 'f'), x = 1:6, y = c(214:219))
df
df <- rbind(df, list("g", 11, 42))
Should give an error and the following:
class(df$id)
Should give us "factor" but instead has "character" in the .md and .html versions.
This is used to motivate the use of stringsAsFactors = FALSE, which seems to have already been set before this code was run.
I'd submit a pull request, but the .Rmd is correct, just the parsing of it is wrong.
Maybe whoever built the lessons from html has this globally enabled?
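One way to make the exercise independent of whatever default is in effect when the lesson is built would be to state stringsAsFactors explicitly. The sketch below reuses the data frame from the exercise; the helper make_df() is hypothetical.

```r
# Build the same data frame twice, stating stringsAsFactors explicitly
# rather than relying on the session default.
make_df <- function(as_factors) {
  data.frame(id = c("a", "b", "c", "d", "e", "f"),
             x = 1:6,
             y = 214:219,
             stringsAsFactors = as_factors)
}

df_factor <- make_df(TRUE)
print(class(df_factor$id))

df_char <- make_df(FALSE)
print(class(df_char$id))

# With a character column the new row is added as expected.
df_char <- rbind(df_char, list("g", 11, 42))

# With a factor column, "g" is not an existing level. The lesson expects
# R to complain here, so any complaint is captured rather than printed.
complaint <- tryCatch(rbind(df_factor, list("g", 11, 42)),
                      warning = function(w) conditionMessage(w),
                      error = function(e) conditionMessage(e))
```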
There are a couple of files that are no longer needed. I think it would be safe to get rid of them, or else transition the content into one of the first few lesson modules. At any rate, their content is generally out of date compared to the current version of the materials.
Need to be removed or transferred:
From make check:
ERROR: Validation failed for ./motivation.md: Could not automatically identify correct template.
ERROR: Validation failed for ./OUTLINE.md: Could not automatically identify correct template.
ERROR: Validation failed for ./plan.md: Could not automatically identify correct template.
In 13-dplyr.Rmd there's a code example that reads:
gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_pop=mean(pop),
              sd_pop=sd(pop))
In context, this should possibly be:
gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))
or it might be an unintended duplication, which could be removed.
Looks like one of the python scripts isn't happy:
pandoc -s -t html \
--template=_layouts/page \
--filter=tools/filters/blockquote2div.py \
--filter=tools/filters/id4glossary.py \
-Vheader="$(cat _includes/header.html)" -Vbanner="$(cat _includes/banner.html)" -Vfooter="$(cat _includes/footer.html)" -Vjavascript="$(cat _includes/javascript.html)" \
-o 01-rstudio-intro.html 01-rstudio-intro.md
pandoc: Error running filter tools/filters/blockquote2div.py
fd:4: hPutBuf: resource vanished (Broken pipe)
make: *** [01-rstudio-intro.html] Error 83
Following the discussion in Issue #17, I propose we convert the non-topic markdown files (e.g. LAYOUT.md, index.md, CONTRIBUTING.md, etc.) to R markdown files. That way we can add "*.md" to the .gitignore, to simplify the process for future contributors (who should then only ever edit the .Rmd files).
The examples show GDP on the y-axis and life expectancy on the x-axis. As many discussions (elsewhere) concern the utility of GDP in predicting life expectancy, should GDP instead be on the x-axis?
For many of the back to back challenges, there is only one {.challenge} markdown block. Clarity would be increased by separating out into a challenge block for each individual challenge. See the end of 01-rstudio-intro.Rmd for an example.
Currently all of the content (from the COMBINE repo) is on the gh-pages branch. Should we keep it there, and delete master, or occasionally merge back to master, or...?
We currently ask students to download the zip file for the raw data:
https://github.com/swcarpentry/r-novice-gapminder/blame/gh-pages/02-project-intro.md#L159
Unzipping is an unnecessary pain here. It may be easier to download the raw
version directly with a "right click -> save as" using this direct link:
https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv
Just noticed that all the files are .md and .html -- should we be using .Rmd instead?
When you load the gap_wide data from the .csv file in the repo, columns 37 and 38 are read in as integers and columns 1 and 2 are read in as factors. When you follow along in the lesson, columns 1 and 2 of gap_wide_new are characters and columns 37 and 38 are numeric. This means the all.equal() function won't work until you make those changes in the gap_wide dataframe.
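The kind of conversion needed can be sketched on a toy example. The two small data frames below are invented stand-ins (the column names and values are not the real gap_wide columns).

```r
# Two toy data frames with the same values but different column types.
from_csv <- data.frame(continent = factor(c("Africa", "Asia")),
                       pop_1952 = c(100L, 200L))
from_lesson <- data.frame(continent = c("Africa", "Asia"),
                          pop_1952 = c(100, 200),
                          stringsAsFactors = FALSE)

# all.equal() reports differences instead of returning TRUE.
before <- isTRUE(all.equal(from_csv, from_lesson))
print(before)

# Convert the types to match, then compare again.
from_csv$continent <- as.character(from_csv$continent)
from_csv$pop_1952 <- as.numeric(from_csv$pop_1952)
after <- isTRUE(all.equal(from_csv, from_lesson))
print(after)
```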
Many students asked for solutions to the challenges. This is particularly important when we don't get time to discuss the later challenges with the whole class. It's also handy for new instructors teaching the first workshop.
I suggest putting a link to a page of challenges and solutions at the end of each lesson.
Perhaps you can figure this one out @remi-daigle?
Seems to be an issue with the pandoc -> html -> jekyll workflow somewhere... The diagrams using DiagrammeR don't display properly. See http://swcarpentry.github.io/r-novice-gapminder/13-dplyr.html and http://swcarpentry.github.io/r-novice-gapminder/14-tidyr.html
We ought to have a short motivational slideshow that we can use at the start of a lesson. From LAYOUT.md:
Every lesson must include a short slide deck in motivation.md suitable for a short presentation (3 minutes or less) that the instructor can use to explain to learners how knowing the subject will help them.
I was thinking that #37 would make really great material for such a thing. The original authors could work on that, or if people agree we can merge that in and someone else can make it into slides
cc @hdashnow @SamPenrose
It would be really nice to have a small lesson on the use of knitr for report generation within RStudio. @nfaux ran a workshop last week with a knitr lesson, but there are not yet any corresponding lesson materials. We've got his project file to use as reference materials.
There is a ton of great content in this repo. Perhaps a great place to start getting it organized would be an outline.md document that lists the repos, and proposes an order and a core set. Then we can focus on getting these core modules polished and move some of the others in to supplementary or additional materials sections? As discussed in the comments on #10
Over in the git-novice lessons they use closed PRs to keep an archive of workshops, to track how different people implement and change the lessons. Shall we try a similar approach here?
I will try this approach for this week's SWC workshop at Simon Fraser University
@sritchie73 has agreed to lead off the MozSL Sprint with some conversions of the lessons from md to Rmd.
The start of the 07-functions lesson needs some text about what a function is, and its general form (like the if/else lesson).
http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html
R has 5 basic atomic types (meaning they can’t be broken down into anything smaller):
(there is a problem with indentation and how the atomic types are counted)
It looks like this:
logical (e.g., TRUE, FALSE)
numeric
integer (e.g, 2L, as.integer(3))
double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called “character” in R) (e.g, "a", "swc", 'This is a cat')
But should look more like this:
logical (e.g., TRUE, FALSE)
numeric
    integer (e.g, 2L, as.integer(3))
    double (i.e. decimal) (e.g, -24.57, 2.0, pi)
complex (i.e. complex numbers) (e.g, 1 + 0i, 1 + 4i)
text (called "character" in R) (e.g, "a", "swc", 'This is a cat')
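A quick way to check how R itself names the five types in the lesson's list is to call typeof() on one example of each, as in this short sketch using values from the list.

```r
# One example value for each type named in the lesson's list.
values <- list(TRUE, 2L, -24.57, 1 + 4i, "swc")

# typeof() reports the type R uses to store each value.
types <- vapply(values, typeof, character(1))
print(types)
# "logical" "integer" "double" "complex" "character"
```

Note that typeof() never returns "numeric": integer and double are reported separately, which is why they count as two of the five.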
Hi all,
I'm not sure whether this is more appropriate as an issue or a pull request, but here is the repository for the gapminder lessons I developed for the February workshop we held in Melbourne: https://github.com/resbaz/r-novice-gapminder/
Also some acknowledgements and attributions: These materials were heavily based on materials originally written by @dfalster and @richfitz and modified by @dbarneche for a Software Carpentry R workshop run in Sydney last October (https://github.com/dbarneche/2014-10-31-USyd), and on some of the intermediate R materials (particularly the data structures lesson) written by @karthik (still part of the bc repo: https://github.com/swcarpentry/bc/tree/gh-pages/intermediate/r).
Right now, the recycling rule is first introduced explicitly in lesson 6, in a section on subsetting, in a subsection on logical operators. Recycling isn't primarily a property of subsetting or logical operations, however, and so my sense is that it could seem like a pretty big digression for the students. We're asking them to do (somewhat advanced) vector operations for the first time in a very unfamiliar context.
What do the project maintainers think about moving section 9 (vector operations, including the recycling rule) up before the section on subsetting? From what I can tell, section 9 doesn't really depend on sections 6-8, except for one advanced example at the very end that uses a function.
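For reference, the recycling rule itself can be shown without any subsetting or functions, which supports introducing it earlier. A minimal sketch with made-up vectors:

```r
long <- 1:6
short <- 1:2

# The shorter vector is reused until it matches the longer one,
# so this adds 1 2 1 2 1 2 to 1 2 3 4 5 6.
recycled_sum <- long + short
print(recycled_sum)   # 2 4 4 6 6 8

# A single value is recycled in the same way in a comparison.
is_big <- long > 4
print(is_big)         # FALSE FALSE FALSE FALSE TRUE TRUE
```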
In many of the lessons, the first challenge jumps straight into asking participants to go beyond what was just demonstrated and integrating ideas from other lessons.
I propose making sure there is at least one challenge for each major learning milestone that just revises the topic. This challenge should only reinforce what was just demonstrated, and then the later challenges can slowly expand on this and ask the students to use critical thinking, draw from past topics, do some research etc. The difficulty of the challenges should ramp up gradually.
Also keep in mind using the gapminder dataset as a theme throughout challenges where appropriate: #20
Here's an example of a lesson that I think has this problem (although many of them do, maybe others would like to mention specific cases):
http://swcarpentry.github.io/r-novice-gapminder/07-functions.html
I actually tried to fix this one in the last workshop by pulling in some inflammation examples, but it's still not quite right.
The first lesson has a tip that says:
you should never use == to compare two numbers unless they are integers.
Instead you should use the all.equal function.
But in R
3.555555555555 == 3.555555555554
[1] FALSE
all.equal(3.555555555555,3.555555555554)
[1] TRUE
From R documentation
Description
all.equal(x, y) is a utility to compare R objects x and y testing ‘near equality’. If they are different, comparison is still made to some extent, and a report of the differences is returned. Do not use all.equal directly in if expressions—either use isTRUE(all.equal(....)) or identical if appropriate.
I think that tip should be removed or updated. Using all.equal on its own in this situation is not advisable.
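The behaviour described above can be sketched as follows, using the numbers from this issue together with the isTRUE(all.equal(...)) pattern that the R documentation recommends. The stricter tolerance value is an arbitrary choice for illustration.

```r
a <- 3.555555555555
b <- 3.555555555554

exact <- a == b                                      # FALSE: exact comparison
near <- isTRUE(all.equal(a, b))                      # TRUE: within the default tolerance
strict <- isTRUE(all.equal(a, b, tolerance = 1e-15)) # FALSE: stricter tolerance

# The case the tip is presumably aimed at: rounding error in arithmetic.
sum_exact <- (0.1 + 0.2) == 0.3                      # FALSE
sum_near <- isTRUE(all.equal(0.1 + 0.2, 0.3))        # TRUE

print(c(exact, near, strict, sum_exact, sum_near))
```

Wrapping the call in isTRUE() matters because all.equal() returns a character description of the difference, not FALSE, when the values are not nearly equal.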
Should we add "*.html" to .gitignore? When a core maintainer wanted to add the HTML, they could use "git add -f".
Something that I loved about the Software Carpentry lessons the first time I read them is that they mention plots as soon as possible, and I think this would be a great improvement to the R Gapminder lesson.
Add a "motivational" topic before 01-rstudio-intro.md. Something like: "At data/gapminder-FiveYearData.csv you will find some information about many countries. Get the population of X and create a plot of it using plot(c(P1, P2, P3, ..., PN)). Congrats, you just wrote your first R code and created a plot with it. Can you plot the population of Y? And the population for all the countries in less than one minute? In the next chapters you will learn a few things about R and at the end you will be able to create a plot for each country in less than one minute."
We have
Have attended the Shell and Git sessions.
as prerequisites and I think that we can
This might be a contentious suggestion... but what about removing while loops from the core lesson, and just mentioning them in a callout? I think while loops are the least useful for beginners. It almost seems like they are just in there for completeness.
I suggest removing the while loop part of the lesson, and putting in a callout with links to extra readings for people who are interested.
Thoughts?
Under 'Data Structures' heading, factor is listed as a data structure, but factor is a data type.
I'm teaching a workshop where we've done unix and then git and now the r-novice-gapminder material. Tomorrow, I'd really like to show the students how to integrate git with RStudio. Today we started down this path, but when we went into the preferences, it was clear that OS X wouldn't show the /usr/bin/ folder, which is where git is stored. Of course, I've long since forgotten how to do this. I'm not sure where it is stored with the Windows installer or how to point RStudio there. Any suggestions on what needs to be done? If anyone can help me, I'll happily file a pull request to include the instructions in 01-rstudio-intro.Rmd once the workshop is over.
Just bringing across some discussion from the mailing list about how long the R material should go for.
My summary of that discussion is some agreement that it should go for one day. It may be useful to make it easily run as a half day workshop by creating a natural stopping point mid-way. Any additional lessons that don't fit in a day will be listed as optional extras.
I'm somewhat surprised that you're planning to run the R content in half a day. We've never run less than a full day of R. And I'm a bit worried students wouldn't get up to the juicy bits in half a day.
If you plan on doing this, I would suggest that you organise the lessons so that a subset of them could be taught as a half-day workshop or the full set as a whole day. You could have a "capstone" exercise/topic at the end of the half day content so it still feels like a fully rounded-off session.
What do you think?
Warm regards,
Harriet
Hi Harriet -
That's a good point; I agree completely. Especially with SQL being off the list, I would imagine that almost all workshops will expand the scripting material (R/python) to fill that space. I do also think though that it would be good to have a split point (perhaps with a capstone as you mention), since it may not always be a single contiguous day of R material vs two half-days.
But that said, I think we should probably shoot for having no more than one day's worth of material in the core novice lessons (and other content could go into extra optional lessons or intermediate lessons if people teach an all-R for 2 days type event).
Best,
Naupaka
Screenshot of 04-data-structures-part1.html
The lesson has
Go back to your text editor and add this line to feline-data.csv:
tabby,2.3 or 2.4,TRUE
Reload your cats data like before, and check what type of data we find in the weight column:
cats <- read.csv(file="data/feline-data.csv")
typeof(cats$weight[1])
[1] "double"
Oh no, our weights aren’t the double type anymore! If we try to do the same math we did on them before, we run into trouble:
cats$weight[1] + cats$weight[2]
[1] 7.1
The text doesn't match the code examples. We should use data/feline-data2.csv or something like that to avoid the problem.
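The mismatch can be reproduced without touching any files by reading the two versions of the data from text. This is only a sketch: the CSV contents below are assumed for illustration and may differ from the real data/feline-data.csv, and stringsAsFactors is set explicitly so the result does not depend on the session default.

```r
# Assumed contents of the original file (illustrative only).
original <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1"

# The edited version, with the extra line from the lesson appended.
edited <- paste0(original, "\ntabby,2.3 or 2.4,TRUE")

cats <- read.csv(text = original, stringsAsFactors = FALSE)
print(typeof(cats$weight))                 # "double"
total <- cats$weight[1] + cats$weight[2]   # arithmetic works

cats2 <- read.csv(text = edited, stringsAsFactors = FALSE)
print(typeof(cats2$weight))                # "character"
```

Keeping the edited data separate, as a second file or object, lets the lesson output show the change in type rather than the unchanged "double".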
$ git log --oneline -1
0c89e17 Updating HTML
$ grep -n "## Factors" 04-data-structures-part1.Rmd
226:## Factors
241:## Factors
Having these headings with the same name doesn't make sense to me. I think that the first one should be "Data Frame".