
datacarpentry / r-socialsci


R for Social Scientists

Home Page: https://datacarpentry.org/r-socialsci/

License: Other

R 99.94% Shell 0.06%
carpentries data-carpentry lesson r data-visualisation data-wrangling data-visualization english social-sciences stable

r-socialsci's Introduction


r-socialsci

Lesson on R for social scientists. Please see https://datacarpentry.org/r-socialsci/ for a rendered version of this lesson.

This is an introduction to R designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). The lessons cover some basic information about R syntax, the RStudio interface, and move through how to import CSV files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, how to calculate summary statistics from a data frame, and a brief introduction to plotting.

The instructor notes page has some tips about how best to teach this workshop.

Maintainers:

r-socialsci's People

Contributors

agully1, ahobert, angela-li, aranganath24, atheobold, bbartholdy, bkmgit, brynnelliott, caldisskjelmann, cengel, cforgaci, dmerson, eirini-zormpa, elliewix, erinbecker, fmichonneau, gtlaflair, ha0ye, jessesadler, juanfung, katrinleinweber, kelseygonzalez, maneesha, martinolmos, monkmanmh, ndporter, petersmyth12, serahkiburu, steltenpower, zkamvar

r-socialsci's Issues

n_row / 2 is a decimal

In the starting with data episode, one of the exercises has the learners get the middle row by using n_row / 2. There are 131 rows in the dataframe, so the result is 65.5. Using this in subsetting, however, gives the 65th row for most learners. Some get an error.
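A minimal sketch of the behaviour described above, using a toy one-column data frame (`interviews` and `key_id` are stand-in names, not the lesson's data). Base R truncates a fractional index, which would explain the 65th row. The errors some learners see may depend on the object type or package versions, but that is an assumption. Rounding explicitly with `ceiling()` is one possible fix, not necessarily the one the maintainers would choose.

```r
# Toy stand-in for the lesson's data frame: 131 rows, as described in the issue.
interviews <- data.frame(key_id = seq_len(131))

n_row <- nrow(interviews)
half <- n_row / 2                  # 65.5, not a whole number

# Base R truncates a fractional index toward zero, so this selects row 65.
truncated_row <- interviews[half, "key_id"]

# Rounding explicitly makes the choice visible to learners.
middle <- ceiling(n_row / 2)       # 66, the true middle of 131 rows
middle_row <- interviews[middle, , drop = FALSE]
print(middle_row)
```

Using `n_row %/% 2` or `round()` would work as well; the point is that the exercise should state which rounding it expects.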

Tweak data output in episode 2

Currently, many datasets are shown in their entirety, which is distracting. We need to use head(), or adjust options(), to limit the size of the outputs.
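The two approaches mentioned here could look roughly like this. The data frame is a toy stand-in with invented column names, and `max.print` is only one of the options that could be adjusted; which one suits the lesson's rendering setup is left open.

```r
# Toy stand-in for a lesson dataset (column names are illustrative only).
interviews <- data.frame(
  key_id = seq_len(131),
  no_membrs = rep(c(3, 7, 10), length.out = 131)
)

# Option 1: show only the first few rows.
first_rows <- head(interviews, n = 6)
print(first_rows)

# Option 2: cap how many entries print() will show for the session.
# options() returns the previous values, so they can be restored afterwards.
old_opts <- options(max.print = 60)
print(interviews)                  # output is truncated with a note
options(old_opts)
```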

Some RStudio Working Directory clarity in pre-requisite episode

Here is the instance that might warrant clarification

"You can change it from the menu items for the tab, or more likely it will change when you create a project with its own folder as we will be doing later." I think the first "it" refers to the current working directory, but this may not be crystal clear for new users. Also, "menu items for the tab" is a little unclear. In the Session menu there is a Set Working Directory option, but I am not sure if this is what is being referenced. I think this is an important point (the whole working directory idea) for new users.

flip order of %in% example

possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat")

vs c("car", "bicycle", "motorcycle", "truck", "boat") %in% possessions
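To make the difference between the two orders concrete, here is a small sketch with a made-up `possessions` vector (in the lesson it is a column, so the values below are assumptions). The result always has one element per item on the left-hand side of `%in%`.

```r
# Hypothetical values; the lesson's possessions column will differ.
possessions <- c("bicycle", "radio", "table", "mobile_phone")
vehicles <- c("car", "bicycle", "motorcycle", "truck", "boat")

# One answer per possession: is this possession a vehicle?
owned_is_vehicle <- possessions %in% vehicles
print(owned_is_vehicle)            # TRUE FALSE FALSE FALSE

# One answer per vehicle: is this vehicle among the possessions?
vehicle_is_owned <- vehicles %in% possessions
print(vehicle_is_owned)            # FALSE TRUE FALSE FALSE FALSE
```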

Transition to standardized GitHub labels

The lesson infrastructure committee unanimously approved the proposal of using the same set of labels across all our repositories during its last meeting on May 23rd, 2018.

This repository has now been converted to use the standard set of labels.

If this repository used the previous set of recommended labels by Software Carpentry, they have been converted to the new one using the following rules:

SWC legacy labels New 'The Carpentries' labels
bug type:bug
discussion type:discussion
enhancement type:enhancement
help-wanted help wanted
newcomer-friendly good first issue
template-and-tools type:template and tools
work-in-progress status:in progress

The label instructor-training was removed as it is not used in the workflow of certifying new instructors anymore. The label question was left as is when it was in use, and removed otherwise. If your repository used custom labels (and issues were flagged with these labels), they were left as is.

The lesson infrastructure committee hopes the standard set of labels will make it easier for you to manage the issues you receive on the repositories you manage.

The lesson infrastructure committee will evaluate how the labels are being used in the next few months and we will solicit your feedback at this stage. In the meantime, if you have any questions or concerns, please leave a comment on this issue.

-- The Lesson Infrastructure subcommittee

PS: we will close this issue in 30 days if there is no activity.

database episode

  • fix formatting issues
  • put database file in a data/ folder to follow good practices of working directory organization mentioned in first episode

add output cells

For each input chunk of code, show the output that will be produced. This will help the learners and the instructors to know that they are getting the expected output.

Output code should be marked with .output: as described here.

See also datacarpentry/python-socialsci#11

Inconsistent header usage in Episode 03 dplyr & tidyr

The headers in episode 3 seem inconsistent. Pipes are a double header (## Pipes), while the mutate function is triple (### Mutate), and the summarize function is quadruple (#### The summarize() function). Is there a style guide for when to use which headers?

Convert code chunks in 02-Reading-text-files

Code chunks like this

~~~
library(ggplot2)
library(readr)
SAFI_results <- read_csv("SAFI_results.csv")
~~~

need to be converted to:

```{r}
library(ggplot2)
library(readr)
SAFI_results <- read_csv("SAFI_results.csv")
```

data types and structures

The discussion of data types and data structures in "Vectors and data types" could be clarified. Perhaps even defining these terms before using them would help. Also note that the first sentence of the section reads "A vector is the most common and basic data type in R, and is pretty much the workhorse of R." Perhaps "basic data type" should be changed to "basic data structure".
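One way the distinction could be illustrated for learners is sketched below. The variable names and values are invented for the example. `typeof()` reports the type of the values, while the vector or list is the structure that holds them.

```r
# A vector is a structure; the values inside it share one data type.
hh_members <- c(3, 7, 10, 6)
members_type <- typeof(hh_members)       # "double"

wall_type <- c("muddaub", "burntbricks", "sunbricks")
wall_value_type <- typeof(wall_type)     # "character"

# Mixing types in one vector forces coercion to a single type.
mixed <- c(1, "a")
mixed_type <- typeof(mixed)              # "character"

# A list is a different structure that can hold values of different types.
respondent <- list(id = 1L, village = "God")
id_type <- typeof(respondent$id)         # "integer"
village_type <- typeof(respondent$village)  # "character"

print(c(members_type, wall_value_type, mixed_type, id_type, village_type))
```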

Help with proofreading and adapting the content with social science examples

So myself and @langtonhugh have just completed our data carpentry training instructor 2-day workshop, and in looking to contribute to something for our checkout tasks, we came across this re-writing of the R material for social sciences.

We thought that the best use of our skills would be to contribute to this by helping proof-read, and amend the sessions with social-sciences related examples. We just wanted to ask if there is any direction on where to focus, or if you had any particular tasks/ requirements in mind? I had a look at the "reading in data" section, and have some ideas in mind, and have spotted some typos, so happy to make some changes there and then submit a pull request, but also open to steer to focus on other elements if that would be more helpful? Let us know!


Thanks for contributing! If this contribution is for instructor training, please send an email to [email protected] with a link to this contribution so we can record your progress. You’ve completed your contribution step for instructor checkout just by submitting this contribution.

Please keep in mind that lesson maintainers are volunteers and it may be some time before they can respond to your contribution. Although not all contributions can be incorporated into the lesson materials, we appreciate your time and effort to improve the curriculum. If you have any questions about the lesson maintenance process or would like to volunteer your time as a contribution reviewer, please contact Kate Hertweck ([email protected]).


Create a code handout

Similarly to the R ecology lesson, this lesson should have a code handout with the skeleton of code chunks, links to data files, etc.

04, "percent of each house type"

In the Barplot section of the ggplot lesson, the narrative text says that we want to create a dataframe with "the percent of each house type in each village." However, the code calculates percentages across the entire dataset, not by village. Note that the totals for each village are clearly well below 100%.

The code should be

percent_wall_type <- interviews_plotting %>%
    filter(respondent_wall_type != "cement") %>%
    count(village, respondent_wall_type) %>%
    group_by(village) %>%
    mutate(percent = n / sum(n)) %>%
    ungroup()
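As a quick check of the point being made, the sketch below uses base R's `prop.table()` rather than the dplyr pipeline (a deliberate swap so the example runs without extra packages) on a small invented dataset. Proportions computed over the whole dataset do not sum to 1 within a village, while proportions computed per village do.

```r
# Invented toy data; the real interviews_plotting data frame is much larger.
interviews_plotting <- data.frame(
  village = c("God", "God", "God", "Chirodzo", "Chirodzo",
              "Ruaca", "Ruaca", "Ruaca", "Ruaca"),
  respondent_wall_type = c("muddaub", "burntbricks", "muddaub", "sunbricks",
                           "burntbricks", "muddaub", "muddaub", "cement",
                           "sunbricks")
)

no_cement <- subset(interviews_plotting, respondent_wall_type != "cement")
counts <- table(no_cement$village, no_cement$respondent_wall_type)

# Proportions of the whole dataset: each village's row sums to less than 1.
overall <- prop.table(counts)
print(rowSums(overall))

# Proportions within each village (margin = 1 means by row): rows sum to 1.
by_village <- prop.table(counts, margin = 1)
print(rowSums(by_village))
```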

06-Relational Database: Clarify wording and code (specific cases)

There are a few places that could use a better description of the code being run:

  • line 79
    results <- dbSendQuery(mydb, "SELECT * FROM Question1")

  • lines 156-165
    dbfile_new = "a_newdb.sqlite"
    mydb_new = dbConnect(dbDriver("SQLite"), dbfile_new)

    dbWriteTable(conn = mydb_new , name = "SN7577", value = "SN7577.csv",
                 row.names = FALSE, header = TRUE)

    dbWriteTable(conn = mydb_new , name = "Q1", value = Q1,
                 row.names = FALSE)

    dbListTables(mydb_new)

  • line 180
    tbl(mydb_dplyr, sql("SELECT count(*) from SN7577"))

Add details re: line 104 - why should the connection be closed? Does it need to be reopened to run code in following chunks?
"Once you have retrieved the data you should close the connection."

Clarify lines 177 and 202 - unsure what they are trying to say
"as is the mthod for running queries. However using the 'tbl' functionwe still need to provide avalid SQL string."
"Notice that on the nrow command we get NA rather than a count of rows. Thisis because dplyr doesn't hold the full table even after the 'Select * ...' "

missing SPSS version of the data file

One of the exercises in episode 02 refers to loading the data from an SPSS output file.

Use the import dataset wizard to import the SN7577_spss.sav dataset.

I didn't find this file in the data subfolder or in the python version of the lesson. Does this need to be added to the repo?

Date/time section in episode 2 is too long?

the date/time section is quite extensive, and some details could be simplified. Specific learning objectives for working with date/time data should be articulated so this section can be made more precise.

read_csv vs read.csv

This always leads to a long conversation because learners have heard of one or the other of these and want to talk about the differences. Think about how this can be incorporated into the curriculum as an example of reading helpfiles and understanding default options.
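A sketch of how default options could be explored with `read.csv` alone, after reading `?read.csv`. The CSV text is invented, and only base R is used so the example runs without readr. Learners could then compare these defaults with what the `read_csv` help page says.

```r
# Invented two-row CSV with a space in one column name.
csv_text <- "village,no membrs\nGod,3\nChirodzo,7"

# Default check.names = TRUE rewrites names into syntactically valid ones.
default_names <- read.csv(text = csv_text)
print(names(default_names))        # "village" "no.membrs"

# Turning the default off keeps the names exactly as written in the file.
kept_names <- read.csv(text = csv_text, check.names = FALSE)
print(names(kept_names))           # "village" "no membrs"

# stringsAsFactors controls whether text columns become factors.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(as_fct$village))       # "factor"
print(class(as_chr$village))       # "character"
```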

add caveat about manually reformatting spreadsheets

Episode 2 has a good exercise for fixing formatting issues in a badly formatted spreadsheet, however this may lead the learner to infer that manually formatting a spreadsheet is good practice (despite the fact that it's not reproducible). It would be good to note here the following (with a little exposition of each):

  • This exercise is provided as an example of what good formatting looks like and why that formatting is needed.
  • A note about reproducibility.
  • It's best practice to format your data sheets like this from the beginning, but if you can't, at least save a raw version of your data before making any manual edits.
  • Some note about how we'll be learning about OpenRefine for making more reproducible changes to data sheets later in the workshop.

discuss why these formatting problems are problematic

Episode 2 has this text:

The problems that we can see are as follows:
White space to the top and to the left of the data.
There are two header line types with different data items in each
One of the header line has two separate data items

but it doesn't explain why these are problematic. It would be good to have more exposition about why these are problems (ie. what will happen to the user downstream if they have these formatting issues in their data sheets).
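One downstream consequence that could be shown is sketched below with an invented snippet of CSV text containing two header lines. The second header line is read as data, so every column becomes character and numeric summaries no longer work. The column names and the `skip`-based workaround are illustrative assumptions, not the lesson's own solution.

```r
# Invented example: two header lines above the actual data.
messy <- "respondent,household\nid,members\n1,3\n2,7"

# The second header line is treated as a data row, so columns are character.
as_read <- read.csv(text = messy)
print(str(as_read))
print(class(as_read$household))    # "character"

# Skipping both header lines and naming the columns restores numeric columns.
clean <- read.csv(text = messy, skip = 2, header = FALSE,
                  col.names = c("id", "members"))
print(class(clean$members))        # "integer"
print(mean(clean$members))         # 5
```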

rename episode 02

from "reading text files" to "reading CSV files"

reading text files would suggest we are reading full-text data.

missing part of solution in renaming factors

In the solution section for the renaming factors exercise, we're missing the solution for part 1 (where you rename the factors, before plotting them in a specific order). The method is explained in the lesson (memb_assoc[is.na(memb_assoc)] <- "undetermined"), but it would be helpful to have it in the solutions for helpers who haven't been following along with every part of the lesson.
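A possible shape for the missing part of the solution is sketched below on an invented vector. The exact level names and ordering the exercise asks for are assumptions here, so they should be checked against the episode before this is added.

```r
# Invented values standing in for the lesson's memb_assoc column.
memb_assoc <- c("yes", NA, "no", "yes", NA, "no")

# Step from the lesson: replace missing values, then convert to a factor.
memb_assoc[is.na(memb_assoc)] <- "undetermined"
memb_assoc <- as.factor(memb_assoc)

# Rename each level by name, so the result does not depend on level order.
levels(memb_assoc)[levels(memb_assoc) == "no"] <- "No"
levels(memb_assoc)[levels(memb_assoc) == "yes"] <- "Yes"
levels(memb_assoc)[levels(memb_assoc) == "undetermined"] <- "Undetermined"

# Reorder the levels for plotting (this ordering is an assumption).
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
print(table(memb_assoc))
```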

Loading SPSS version of data assumes RStudio

The first exercise in episode 02 asks the student to load "data/SN7577_spss.sav" using the RStudio import wizard.

Would it be worthwhile to show how to load such datasets without using RStudio's import wizard? For example foreign::read.spss or haven::read_spss. The latter is included as part of the tidyverse.

Lesson release checklist

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off
during preparation for release

Scheduled Freeze Date: 2018-04-27
Scheduled Release Date: 2018-04-30

Checklist of tasks to complete before release:

  • check that the learning objectives reflect the content of the lessons
  • check that learning objectives are phrased as statements using action words
  • check for typos
  • check that the live coding examples work as expected
  • if example code generates warnings, explain in narrative and instructor notes
  • check that challenges and their solutions work as expected
  • check that the challenges test skills that have been seen
  • check that the setup instructions are up to date (e.g., update version numbers)
  • check that data is available and mentions of the data in the lessons are accurate
  • check that the instructor guide is up to date with the content of the lessons
  • check that all the links within the lessons work (this should be automated)
  • check that the cheat sheets included in lessons are up to date (e.g., RStudio updates them regularly)
  • check that language is clear and free of idioms and colloquialisms
  • make sure formatting of the code in the lesson looks good (e.g. line breaks)
  • check for clarity and flow of narrative
  • update README as needed
  • fill out “overview” for each module - minutes needed for teaching and exercises, questions and learning objectives
  • check that contributor guidelines are clear and consistent
  • clean up files (e.g. delete deprecated files, ensure filenames are consistent)
  • update the release notes (NEWS)
  • tag release on GitHub
