README for Getting and Cleaning Data Course Project

mjdata
July 23, 2015

Getting and Cleaning Data Course Project README

Submitted files

tidy_wide.txt: A tidy data sets produced by run_analysis.R. It can be read into R with read.table(header=TRUE)

The following five files are submitted in the repo

run_analysis.R: Cleans and combines the raw data sets, then generates a tidy data text file "tidy_wide.txt" that meets the principles of tidy data.
CodeBook.Rmd: A R Markdown script that creates CodeBook.md
CodeBook.md: Describes the variables, the data, and transformations or works performed in run_analysis.R to clean up the data.
README.Rmd: A R Markdown script that creates README.md
README.md: Describes what the script run_analysis.R do and explains how the three scripts work and how they are connected.

Notes

run_analysis() runs the script run_analysis.R
Raw data sets are in the folder "./data" with subdirectories "./data/train" and "./data/test" under my working directory.
README.md and codeBook.md are complement with each other.
The scrips run on Windows 7 (64 bits), R v3.2.1, RStudio, text encoding UTF-8.
The raw data for the project was downloaded manually from the url and unzipped into a folder "./data" under my working directory.
References cited in CodeBook.md and README.md include:1, 2, 3

What does the cleaning script run_analysis.R do?

The cleaning script run_analysis.R gathers all the relevant information of a single observational unit from multiple connected tables or files; cleans the raw data sets; and creates a tidy data sets "tidy_wide.txt" that meets the tidy data principles.

Extracts only the measurements on the mean and standard deviation for each measurement from the traing and the test sets.
Merges the extracted training and test sets to create one data set.
Uses descriptive activity names to name the activities in the data set.
Creates a tidy data set with the average of each variable for each activity and each subject.
Labels the tidy data set with new descriptive variable names.
Writes the tidy data sets in a file called "tidy_wide.txt" in the working directory.

require packages {data.table, readr, plyr, dplyr, stringr, reshape2, tidyr}

1. Extracts the data sets

Reading in the raw data sets into R with read.table()and read_table()

Raw datasets: A single type of observational unit is spread out over the multiple tables or files:

'README.txt': Information on the raw data sets.
'features_info.txt': Information about the feutures variable
'features.txt': Names of all 561 feature variables
'activity_labels.txt': Links the activity labels with their 6 activity names.
'train/X_train.txt': Training data sets,(7352, 561) table, with no column names. Each row respresents the values of the 561 features variables for each measurement.
'train/y_train.txt': 7352 long activity labels of each measurement in training data set with range from 1 to 6.
'train/subject_train.txt': 7352 long subject ID numbers ranges from 1 to 30. Each row identifies the subject who performed an activity for each window sample in Training data.
'test/X_test.txt': Test data set, (2947, 561) table, with no column names. Each row respresents the values of the 561 features variables for each measurement.
'test/y_test.txt': 2947 long activity label of each measurement in test data set.
'train/subject_train.txt': 2947 long subject ID numbers ranges from 1 to 30. Each row identifies the subject who performed an activity for each measurement.

####Notes on the raw data sets

-- The features values are normalized and bounded within [-1, 1].

-- Each features vector is a row on the train and test data text file.

-- The files in "/tain/Inertial Signals" and "/test/Inertial Signals" are not used for this course project.

-- A full description of the raw data sets is available at

Extracting only the measurements on the mean and standard deviation for each measurement from the traing and the test sets.

-- Extracting the features variables only with "mean()" or "std()" in the name, using str_detect{stringr}:

-- 66 features varaibles are extracted form the 561 features vector.   
-- 'extract' index vector is:   
  c(1, 2, 3, 4, 5, 6, 41, 42, 43, 44, 45, 46, 81, 82, 83, 84, 85, 86, 121, 122, 123, 124, 125, 126, 161, 162, 163, 164, 165, 166, 201, 202, 214, 215, 227, 228, 240, 241, 253, 254, 266, 267, 268, 269, 270, 271, 345, 346, 347, 348, 349, 350, 424, 425, 426, 427, 428, 429, 503, 504, 516, 517, 529, 530, 542, 543)

-- Extracting raw train and test data by subsetting with 'extract' index

    Dimension of the extracted tain table: (7352, 66) 
    Dimension of the extracted test table: (2947, 66)

Notes: meanFreq() was not meant to be selected, as my understanding.

2. Merges the extracted training and test sets to create one data set.

Merging the extracted train/test data tables with the corresponding acvtivity and subject data respectively by column, using bind_cols{dplyr}

    Dimension of the combind extracted tain data frame: (7352, 68)   
    Dimension of the combind extracted test data frame: (2947, 68)

-- Labeling the first two columns with "activity" and "subject"

Creating one big data frame by combining the merged train and test data frames by row, using bind_rows{dplyr}

    Dimension of the one big tidy data sets: (10299, 68)

-- Labeling the columns[3:68] with a tranformed features names

-- The original features names are transformed using str_replace{stringr} and str_sub{stringr} in addition to some corrections.

CodeBook has a detailed description** of the transformed features variabls including the schema of the name construcion levels and their meanings.

3. Names the activities in the data set useing descriptive activity names

-- Transforming the activity lables to lower case by str_to_loser{stringr}

-- Replacing the activity integer values in all rows of the big data sets with activity labels ("walking", "walking_upstairs", "walking_downstairs", "sitting", "standing", "laying")

####Notes: The merged one big data is tidy data:

All relevant data of the single observational unit are in one table
One column per variable and one variable per column
One observation per row and one row per observations

4. Creates a tidy data set with the average of each variable for each activity and each subject.

Melting the one big table using melt{reshape2} with (activity, subject) as ID field, creating a 4-column long table

-- Dimension of melted table: (679734, 4) 
-- Column names: ("activity", "subject", "variable", "value")

Casting the long table using cast{reshape2) with the ID field, with mean of the original 'variable' averaged over the time windows for each activity and each subject, creating a "wide" tidy data table

-- Dimension of the new tidy table: (180, 68)

5. Labels the tidy data set with new variable names.

Creating new variable names for the new variables

-- The new variable names[3:68] are created by adding "avarage" at the end of the transformed features names.

-- The transformed features names are made by using str_replace{stringr} and str_sub{stringr} in addition to some corrections to the original feature names.

Labeling new tidy table by replacing the names of colums[3:68] with the new variable names.

-- names of the first 3 columns:

"activity", "subject", "time.body.accelerat.x.mean.average"

The new labeling names describe appropriately that the observations in the tidy data sets "tidy_wide.txt" are the mean values of the original 66 features variables averaged over the time windows for each activity and each subject.

CodeBook describes in detail how the new lables desribe properly the tidy data sets.

6. Writes the tidy data sets into a file called "tidy_wide.txt" in my working directory using write.table with row.names=FALSE.

"tidy_wide.txt" meets the tidy data principles.

The data file "tidy_wide.txt" meets the principles of tidy data, as described in the Coursera "Getting and cleaning data" week 1 course lecture and in the references 1, 2,3

Each variable measured is in one column;
Each different observation of that variable is in a different row;
Each type of observational unit forms a table:
One file per table, one table per file
CodeBook still has the specific description of the tidy data file contents.

mjdata / getclean_project Goto Github PK

getclean_project's Introduction