datacarpentry / r-socialsci

R for Social Scientists

Home Page: https://datacarpentry.org/r-socialsci/

License: Other

R 99.94% Shell 0.06%

carpentries data-carpentry lesson r data-visualisation data-wrangling data-visualization english social-sciences stable

r-socialsci's Introduction

r-socialsci

Lesson on R for social scientists. Please see https://datacarpentry.org/r-socialsci/ for a rendered version of this lesson.

This is an introduction to R designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). The lessons cover some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, how to calculate summary statistics from a data frame, and a brief introduction to plotting.

The instructor notes page has some tips about how best to teach this workshop.

Maintainers:

r-socialsci's People

richdeto nsbowden daniellequinn gtlaflair murraycadzow sashasaurus johnsolk elizabethwilliams8 petersmyth12 elinw kevin-vilbig jkaupp mcomsa potterzot dhicks ekgade wy2288 matrix2reality benmarwick rltillett dmerson chriseshleman angela-li katiecoburn hdekk resourcefulsquirrel donalus andrewsanchez pow123 dkermer cuihuash neildaviesevans cwickham kerchner adivea marwahaha jmjamison shawnjanzen kliegl rochellelundy andrew66882011 jessespencersmith lilianhj pkiraly katrinleinweber maczokni bricakeld ucla-data-science-center ajstewartlang jborycz 3mmarand atraxler francescavantaggiato husseingb erslayton monkmanmh emljames kisplab davidrvera kessonovitch ciakovx heijer smoser11 ameliakallaher williamngiam jgblanc aoling2 caldisskjelmann gunzivan28 mireiavalle jrmuirhead laurabotzet chi-hsiangwang bmillerlab julvania nmarchio vguetler kcovarrubias18 susansayre rcurty carlosug robertn01 ayalhassan zjsteyn bkmgit cengel sefabey timmarchand atheobold fmichonneau annajiat gurpreet2301 menghamo kelseygonzalez nataliablock dafnevk jonjab ablankson mvail danielagawehns

r-socialsci's Issues

n_row / 2 is a decimal

In the starting with data episode, one of the exercises has the learners get the middle row by using n_row / 2. There are 131 rows in the dataframe, so the result is 65.5. Using this in subsetting, however, gives the 65th row for most learners. Some get an error.
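
A minimal sketch of the behaviour described above, assuming a plain base R data frame (the `interviews_df` name and its `id` column are hypothetical stand-ins for the lesson data). Base R truncates a fractional index toward zero, which would explain the 65th row; stricter data frame classes may reject a non-whole index, which could be why some learners see an error instead.

```r
# Hypothetical stand-in for the lesson data: 131 rows, as in the issue.
interviews_df <- data.frame(id = seq_len(131))

n_rows <- nrow(interviews_df)
middle <- n_rows / 2                 # 65.5, not a whole number

# Base R truncates fractional indices toward zero, so this is row 65.
truncated_row <- interviews_df[middle, , drop = FALSE]

# Rounding explicitly makes the choice of "middle" row visible.
middle_row <- interviews_df[ceiling(middle), , drop = FALSE]
```

Wrapping the division in `ceiling()`, `floor()`, or `round()` is one way the exercise could give every learner the same row.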

Blank plot in Episode 04 ggplot

First plot (unnamed-chunk-3) is blank.

http://www.datacarpentry.org/r-socialsci/04-Data-visualisation-with-ggplot2/

Tweak data output in episode 2

Currently, many datasets are shown in their entirety and it's distracting, we need to use head(), or adjust options() to limit the size of the outputs.
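
The two options mentioned above could look roughly like this. It is only a sketch: the `survey` data frame is hypothetical, and `max.print` is one of several base R options that limit printed output.

```r
# Hypothetical data frame standing in for a lesson dataset.
survey <- data.frame(id = seq_len(500), score = rep(1:5, times = 100))

# Show only the first six rows instead of the whole table.
first_six <- head(survey)

# Ask for a specific number of rows.
first_three <- head(survey, n = 3)

# Cap how many entries print() emits, then restore the previous setting.
old_options <- options(max.print = 50)
print(survey)
options(old_options)
```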

fix datasets for last plots in ggplot lesson

Error in rendered page

In the "Extracting subsets from vectors" section of 01-R-basics there is an error message because of a mis-spelled variable name. This PR fixes:
#43

Some RStudio Working Directory clarity in pre-requisite episode

Here is the instance that might warrant clarification

" You can change it from the menu items for the tab, or more likely it will change when you create a project with its own folder as we will be doing later. " I think the first it refers to the current working directory, but may not be crystal clear for new users. Also, menu items for the tab is a little unclear. In the Sessions menu there is a Set Working Directory option, but I am not sure if this is what is being referenced. I think this is an important point (the whole working directory idea) for new users.

introduce r syntax highlighting

Code chunks should have the tag

.language-r

as described here. This will make the code show up with standard R color coding and other standard aesthetics.

Data (surveys_complete) missing in Episode 04 ggplot

surveys_complete data is missing

http://www.datacarpentry.org/r-socialsci/04-Data-visualisation-with-ggplot2/

flip order of %in% example

possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat")

vs c("car", "bicycle", "motorcycle", "truck", "boat") %in% possessions
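
To illustrate why the order matters, here is a small sketch with made-up values (the contents of `possessions` are an assumption, not the lesson data). The left-hand side of `%in%` determines the length and meaning of the result.

```r
# Hypothetical items owned by one household.
possessions <- c("bicycle", "radio", "car")
vehicles <- c("car", "bicycle", "motorcycle", "truck", "boat")

# One result per possession: is each owned item a vehicle?
owned_is_vehicle <- possessions %in% vehicles

# One result per vehicle type: does the household own it?
vehicle_is_owned <- vehicles %in% possessions
```

The first form answers a question about the household's items, the second about the list of vehicle types, so the two are not interchangeable.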

Convert code chunks in 03-Introducing-dplyr-and-tidyr

See #23 for example

Edit typos, formatting, and style of pre-requisites md file

This issue is meant to address some of the basic items in the Lesson Release Checklist for this repository.

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

| SWC legacy labels | New 'The Carpentries' labels |
|---|---|
| bug | type:bug |
| discussion | type:discussion |
| enhancement | type:enhancement |
| help-wanted | help wanted |
| newcomer-friendly | good first issue |
| template-and-tools | type:template and tools |
| work-in-progress | status:in progress |

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.

database episode

fix formatting issues
put database file in a data/ folder to follow good practices of working directory organization mentioned in first episode

add output cells

For each input chunk of code, show the output that will be produced. This will help the learners and the instructors to know that they are getting the expected output.

Output code should be marked with .output: as described here.

Does the lesson need the SPSS file?

Is it necessary to cover how to read an SPSS file in this lesson? What is the motivation behind it?

fix inline nrow calls

`posessions` typo in episode 01

posessions should be possessions

Inconsistent header usage in Episode 03 dplyr & tidyr

The headers in episode 3 seem inconsistent. Pipes are a double header (## Pipes), while the mutate function is triple (### Mutate), and the summarize function is quadruple (#### The summarize() function). Is there a style guide for when to use which headers?

Convert code chunks in 02-Reading-text-files

Code chunks like this

~~~
library(ggplot2)
library(readr)
SAFI_results <- read_csv("SAFI_results.csv")
~~~

need to be converted to:

```{r}
library(ggplot2)
library(readr)
SAFI_results <- read_csv("SAFI_results.csv")
```

`!` operator mentioned in Episode 03 dplyr & tidyr

The ! operator is mentioned on line 294, but it may be useful to introduce it earlier, and in a full section rather than a code comment. I propose adding it to Episode 01 R Basics around line 416, where other operators like <, >, ==, and != are introduced.
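
A short section on `!` might use examples along these lines. This is only a sketch; the `rooms` and `items` vectors are invented for illustration.

```r
# Hypothetical survey values with one missing entry.
rooms <- c(1, 3, NA, 5)

# ! flips TRUE to FALSE and FALSE to TRUE.
not_true <- !TRUE

# Combined with is.na(), it selects the values that are present.
has_value <- !is.na(rooms)
rooms_present <- rooms[has_value]

# Combined with %in%, it selects the values NOT in a set.
items <- c("car", "radio", "boat")
not_vehicle <- !(items %in% c("car", "boat"))
```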

http://www.datacarpentry.org/r-socialsci/03-Introducing-dplyr-and-tidyr/
http://www.datacarpentry.org/r-socialsci/01-R-basics/

data types and structures

The discussion of data types and data structures in "Vectors and data types" could be clarified. Perhaps even defining these terms before using them would help. Also note that the first sentence of the section reads "A vector is the most common and basic data type in R, and is pretty much the workhorse of R." Perhaps this should be changed to "basic data structure".
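
One way to make the distinction concrete is sketched below, with invented values: the data type describes what kind of value is stored, while the data structure describes how values are organised.

```r
# A vector is a data structure; its elements share one data type.
ages <- c(23, 35, 41)
ages_type <- typeof(ages)            # "double"

villages <- c("Chirodzo", "God", "Ruaca")
villages_type <- typeof(villages)    # "character"

# Mixing types in a vector coerces everything to a single type.
mixed <- c(1, "a", TRUE)
mixed_type <- typeof(mixed)          # "character"

# A list is a different structure: each element keeps its own type.
record <- list(age = 23, village = "God")

# A data frame is a structure built from equal-length vectors.
people <- data.frame(age = ages, village = villages)
```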

dplyr/tidyr episode should be rewritten to use the SAFI dataset

the same dataset should be used across lessons and episodes in Data Carpentry workshops.

Help with proofreading and adapting the content with social science examples

So myself and @langtonhugh have just completed our data carpentry training instructor 2-day workshop, and in looking to contribute to something for our checkout tasks, we came across this re-writing of the R material for social sciences.

We thought that the best use of our skills would be to contribute to this by helping proof-read, and amend the sessions with social-sciences related examples. We just wanted to ask if there is any direction on where to focus, or if you had any particular tasks/ requirements in mind? I had a look at the "reading in data" section, and have some ideas in mind, and have spotted some typos, so happy to make some changes there and then submit a pull request, but also open to steer to focus on other elements if that would be more helpful? Let us know!

Thanks for contributing! If this contribution is for instructor training, please send an email to [email protected] with a link to this contribution so we can record your progress. You’ve completed your contribution step for instructor checkout just by submitting this contribution.

Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution. Although not all contributions can be incorporated into the lesson materials, we appreciate your time and effort to improve the curriculum. If you have any questions about the lesson maintenance process or would like to volunteer your time as a contribution reviewer, please contact Kate Hertweck ([email protected]).

render exercises as exercises

Some of the exercises (e.g. in this episode) are not formatted as exercises. Exercises and their solutions should be marked with specific syntax detailed here.

Create a code handout

Similarly to the R ecology lesson, this lesson should have a code handout with the skeleton of code chunks, links to data files, etc.

`|` character on Spanish keyboards

The Spanish Mac keyboard does not have a | key. This character can be created using:

alt + 1

Add to instructor notes?

04, "percent of each house type"

In the Barplot section of the ggplot lesson, the narrative text says that we want to create a dataframe with "the percent of each house type in each village." However, the code calculates percentages across the entire dataset, not by village. Note that the totals for each village are clearly well below 100%.

The code should be

percent_wall_type <- interviews_plotting %>%
    filter(respondent_wall_type != "cement") %>%
    count(village, respondent_wall_type) %>%
    group_by(village) %>%
    mutate(percent = n / sum(n)) %>%
    ungroup()
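
The difference between the two calculations can be checked on a toy table. The sketch below uses base R rather than dplyr, and the village names and counts are invented; `ave()` plays the role of the grouped `sum(n)`.

```r
# Hypothetical counts of wall types per village.
counts <- data.frame(
  village = c("A", "A", "B", "B"),
  respondent_wall_type = c("burntbricks", "muddaub", "burntbricks", "muddaub"),
  n = c(30, 10, 15, 45)
)

# Dividing by the grand total gives shares of the whole dataset.
counts$percent_overall <- counts$n / sum(counts$n)

# Dividing by each village's own total gives shares within the village.
counts$percent_by_village <- counts$n / ave(counts$n, counts$village, FUN = sum)

# Per-village sums: well below 1 for the first, exactly 1 for the second.
overall_totals <- tapply(counts$percent_overall, counts$village, sum)
village_totals <- tapply(counts$percent_by_village, counts$village, sum)
```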

Convert code chunks in 06-Using-relational-database-with-R

See #23 for an example

06-Relational Database: Clarify wording and code (specific cases)

There are a few places that could use a better description of the code being run:

line 79
results <- dbSendQuery(mydb, "SELECT * FROM Question1")
lines 156-165
dbfile_new = "a_newdb.sqlite"
mydb_new = dbConnect(dbDriver("SQLite"), dbfile_new)

dbWriteTable(conn = mydb_new , name = "SN7577", value = "SN7577.csv",
row.names = FALSE, header = TRUE)

dbWriteTable(conn = mydb_new , name = "Q1", value = Q1,
row.names = FALSE)

dbListTables(mydb_new)

line 180
tbl(mydb_dplyr, sql("SELECT count(*) from SN7577"))
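
As a possible starting point for a fuller description of these calls, here is a commented sketch of the query life cycle. It assumes the DBI and RSQLite packages are installed, uses an in-memory database instead of the lesson's file, and fills a `Question1` table with invented rows.

```r
library(DBI)

# Open a connection; ":memory:" creates a temporary SQLite database
# (an assumption for this sketch, the lesson connects to a file).
mydb <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical contents standing in for the lesson's Question1 table.
dbWriteTable(mydb, "Question1",
             data.frame(id = 1:3, label = c("a", "b", "c")))

# dbSendQuery() submits the query but does not return any rows yet.
results <- dbSendQuery(mydb, "SELECT * FROM Question1")

# dbFetch() pulls the rows into an ordinary data frame.
question1 <- dbFetch(results)

# Release the result set, then close the connection.
dbClearResult(results)
dbDisconnect(mydb)

# A closed connection is no longer valid; reconnect before further queries.
connection_open <- dbIsValid(mydb)
```

Once the rows are in `question1`, later chunks can keep working with that data frame without the connection, which is one reason to close it promptly.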

Add details re: line 104 - why should the connection be closed? Does it need to be reopened to run code in following chunks?
"Once you have retrieved the data you should close the connection."

Clarify lines 177 and 202 - unsure what they are trying to say
"as is the mthod for running queries. However using the 'tbl' functionwe still need to provide avalid SQL string."
"Notice that on the nrow command we get NA rather than a count of rows. Thisis because dplyr doesn't hold the full table even after the 'Select * ...' "

General advice on working directory organization is missing

Something like this: http://www.datacarpentry.org/R-ecology-lesson/00-before-we-start.html#getting_set_up
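
A sketch of what such advice might demonstrate is below. The folder names are assumptions rather than the lesson's prescribed layout, and the project is created under a temporary directory so nothing is left behind.

```r
# Hypothetical project folder, created in a temporary location.
project_dir <- file.path(tempdir(), "r-workshop")
dir.create(project_dir, showWarnings = FALSE)

# Separate folders for raw data, scripts, and figures.
for (folder in c("data", "scripts", "figures")) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# Make the project the working directory, remembering the old one.
original_wd <- getwd()
setwd(project_dir)

# Paths can now be written relative to the project root.
data_path <- file.path("data", "SAFI_results.csv")

# Restore the previous working directory.
setwd(original_wd)
```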

missing SPSS version of the data file

One of the exercises in episode 02 refers to loading the data from an SPSS output file.

Use the import dataset wizard to import the SN7577_spss.sav dataset.

I didn't find this file in the data subfolder or in the Python version of the lesson. Does this need to be added to the repo?

Setup page is missing content

Page is blank, please add content!

http://www.datacarpentry.org/r-socialsci/setup/

Date/time section in episode 2 is too long?

the date/time section is quite extensive, and some details could be simplified. Specific learning objectives for working with date/time data should be articulated so this section can be made more precise.

The episodes are in the wrong folder

All the episodes need to be transferred to the _episodes_rmd folder, and renamed to end in an Rmd extension.

read_csv vs read.csv

This always leads to a long conversation because learners have heard of one or the other of these and want to talk about the differences. Think about how this can be incorporated into the curriculum as an example of reading help files and understanding default options.
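
One default that could anchor such a discussion is how column names are handled. The sketch below covers only base R's `read.csv()` with an invented two-row CSV; `readr::read_csv()` is not called here, though by default it keeps column names as written and returns a tibble.

```r
# Hypothetical CSV content with spaces in the column names.
csv_text <- "village name,no members\nGod,3\nChirodzo,7"

# Defaults can be inspected from the console as well as via ?read.csv.
header_default <- formals(read.csv)$header
check_names_default <- formals(read.table)$check.names

# With the default check.names = TRUE, names are made syntactically valid.
default_names <- names(read.csv(text = csv_text))

# Turning the option off keeps the names exactly as they appear in the file.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
```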

add caveat about manually reformatting spreadsheets

Episode 2 has a good exercise for fixing formatting issues in a badly formatted spreadsheet, however this may lead the learner to infer that manually formatting a spreadsheet is good practice (despite the fact that it's not reproducible). It would be good to note here the following (with a little exposition of each):

This exercise is provided as an example of what good formatting looks like and why that formatting is needed.
A note about reproducibility.
It's best practice to format your data sheets like this from the beginning, but if you can't, at least save a raw version of your data before making any manual edits.
Some note about how we'll be learning about OpenRefine for making more reproducible changes to data sheets later in the workshop.

discuss why these formatting problems are problematic

Episode 2 has this text:

The problems that we can see are as follows:
White space to the top and to the left of the data.
There are two header line types with different data items in each
One of the header line has two separate data items

but it doesn't explain why these are problematic. It would be good to have more exposition about why these are problems (ie. what will happen to the user downstream if they have these formatting issues in their data sheets).

missing link for dataset description (7577)

In episode 02, there's supposed to be a link to a description of the example dataset:

Full details of the SN7577 dataset are available [here]

but the URL is missing. I'm not sure what this is supposed to be, so I'm leaving it blank for my current batch of reformatting edits.

It's possible that this information should also be added to the https://github.com/datacarpentry/python-socialsci lesson, too.

rename episode 02

from "reading text files" to "reading CSV files"

reading text files would suggest we are reading full-text data.

Markdown not rendering correctly in Episode 02 Reading Text Files

R Markdown is not rendering correctly somewhere around line 153:
"```{r, eval = FALSE} SN7577_tab[, -1] # The whole data.frame, except the first column SN7577_tab[-c(7:34786), ] # Equivalent to head(surveys)"

http://www.datacarpentry.org/r-socialsci/02-Reading-text-files/

Include language on tibbles in the 02-starting-with-data lesson?

In the 02-starting-with-data lesson the language is about data frames, but the output window and examples are displayed as tibbles. This seems a little confusing. Would it be clearer to emphasize the tibble portion a little earlier?

[This is submitted as part of the instructor training closeout.]

Prerequisites episode 404s

The prerequisites episode leads to a 404 page.

http://www.datacarpentry.org/r-socialsci/00-Pre-requisites/

Convert code chunks in 04-Data-visualisation-with-ggplot2

See #23 for example

missing part of solution in renaming factors

In the solution section for the renaming factors exercise, we're missing the solution for part 1 (where you rename the factors, before plotting them in a specific order). The method is explained in the lesson (memb_assoc[is.na(memb_assoc)] <- "undetermined"), but it would be helpful to have it in the solutions for helpers who haven't been following along with every part of the lesson.
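
A possible shape for the missing part of the solution is sketched below. The starting values and the capitalised level names are assumptions; only the `is.na()` replacement line comes from the lesson as quoted above.

```r
# Hypothetical responses, with NA where no answer was recorded.
memb_assoc <- c("yes", NA, "no", "yes", NA)

# Replace the missing values with an explicit label (from the lesson).
memb_assoc[is.na(memb_assoc)] <- "undetermined"

# Convert to a factor; levels are sorted alphabetically by default.
memb_assoc <- factor(memb_assoc)
original_levels <- levels(memb_assoc)   # "no" "undetermined" "yes"

# Rename the levels, matching the alphabetical order above.
levels(memb_assoc) <- c("No", "Undetermined", "Yes")

# Reorder the levels so plots show them in a chosen sequence.
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
```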

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off
during preparation for release

Scheduled Freeze Date: 2018-04-27
Scheduled Release Date: 2018-04-30

Checklist of tasks to complete before release:

01-R basics.md editing typos, formatting, style

Addresses these issues associated with lesson release checklist