GetCleanData

Repo of the coursework for the "Getting and Cleaning Data" course project (part of the Coursera "Data Science Specialization").

Included files/directories:

'UCI HAR Dataset' directory

This is the dataset, unzipped after downloading it from the site prescribed on the project page.

The files from this dataset that are of particular interest include:

  • 'README.txt'

      The README file included in the original dataset
    
  • 'activity_labels.txt'

      The activity codes (an integer in the range 1:6) and their corresponding descriptive names
    
  • 'features.txt'

      The column numbers and descriptive labels for the main data files ('X_train.txt' & 'X_test.txt'): 561 rows and 2 columns.
    
  • 'features_info.txt'

      Information about what the labels contained in 'features.txt' mean.
    
  • '/train/subject_train.txt' & '/test/subject_test.txt'

      Appears to be the subject ID that corresponds to each row in the data files: the row counts match, and the values are integers in the range 1:30.
    
  • '/train/X_train.txt' & '/test/X_test.txt'

      These appear to be the main data files. The number of columns in each file matches the number of rows in 'features.txt', and I assumed that both are in the same order.
    
  • '/train/y_train.txt' & '/test/y_test.txt'

      Appears to be the activity code that corresponds to each row of the data files: the row counts match, and the values fall within the range found in 'activity_labels.txt'.
    

The 'README.txt' file in this directory contains further information about the dataset. The authors of this dataset have asked that anyone using it reference the following publication:

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012

'CodeBook.md' file

Describes the variables, the data, and any transformations or work that was performed to clean up the data.

'run_analysis.R' file

This is the R script that performs the required data transformations - see below for a detailed description of how it achieves its task.

'summarised_analysis.txt' file

The file that was generated to satisfy point 5 of the project requirements (see below)

How does the 'run_analysis.R' script work?

These were the project requirements:

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

It seemed logical to carry out the steps in the following order: 4->2->1->3->5.

Note that the number of rows in the file 'features.txt' corresponds to the number of columns in both the '/train/X_train.txt' & '/test/X_test.txt' files, so I used 'features.txt' to generate the column names of those files.
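As a rough illustration of this naming step (not the actual code in 'run_analysis.R'), the sketch below uses small inline tables in place of 'features.txt' and 'X_train.txt'. The variable names `features` and `x_train` are hypothetical, and the same-order assumption is the one described above.

```r
# Toy stand-in for 'features.txt': column index and descriptive label.
features <- read.table(
  text = "1 tBodyAcc-mean()-X
2 tBodyAcc-std()-X
3 tBodyAcc-max()-X",
  col.names = c("index", "label"),
  stringsAsFactors = FALSE
)

# Toy stand-in for 'X_train.txt': one column per row of 'features'.
x_train <- read.table(
  text = "0.28 -0.99 -0.93
0.27 -0.98 -0.94"
)

# Assumption: the i-th feature label names the i-th data column.
stopifnot(ncol(x_train) == nrow(features))
names(x_train) <- features$label
print(names(x_train))
```

With the real files, the `text = ...` arguments would be replaced by the paths inside the 'UCI HAR Dataset' directory.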

Given that:

  1. the number of rows in '/train/subject_train.txt' & '/test/subject_test.txt' matches the number of rows in '/train/X_train.txt' & '/test/X_test.txt', and
  2. the values in '/train/subject_train.txt' & '/test/subject_test.txt' are integers between 1 and 30,

it seemed reasonable to assume that these files hold the subject identifiers for the corresponding observation files. I therefore combined them with the observation files to identify which observations belong to which subject.

I then used similar reasoning to match up the activity codes from '/train/y_train.txt' & '/test/y_test.txt' to the observations.
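The matching of subjects and activity codes to observations can be sketched as a column-wise bind. This is a minimal example on toy data, assuming that row order is the only link between the files; the object names are hypothetical and the real script may do this differently.

```r
# Toy stand-ins for 'subject_train.txt', 'y_train.txt' and 'X_train.txt'.
subject <- data.frame(subject = c(1L, 1L, 3L))
activity <- data.frame(activity = c(5L, 4L, 1L))
measurements <- data.frame(
  "tBodyAcc-mean()-X" = c(0.28, 0.27, 0.29),
  "tBodyAcc-std()-X" = c(-0.99, -0.98, -0.97),
  check.names = FALSE
)

# The row-by-row correspondence is an assumption; at least check the counts.
stopifnot(nrow(subject) == nrow(measurements),
          nrow(activity) == nrow(measurements))

# Bind column-wise so each observation carries its subject and activity code.
train <- cbind(subject, activity, measurements)
print(train)
```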

For each of the training and testing data, I only kept the columns that had "mean", "Mean" or "std" in the name. This also included the relevant columns that were the result of Fourier analysis of the data. I decided NOT to exclude them because the instructions did not make it clear whether they should be included or not. It is easier to ignore data that is included than to regenerate data that you excluded but later decide you really want, which is why I erred on the side of including these columns.
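A possible way to express this column filter is a single regular expression over the column names. The sketch below runs on a handful of example names and is an illustration only; `col_names`, `keep` and `kept` are hypothetical names.

```r
# A few example column names, including the two identifier columns.
col_names <- c("subject", "activity",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X",
               "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Keep the identifier columns plus any name containing "mean", "Mean" or "std".
keep <- grepl("mean|Mean|std", col_names) |
  col_names %in% c("subject", "activity")
kept <- col_names[keep]
print(kept)
```

Only 'tBodyAcc-max()-X' is dropped here; the frequency-domain and angle columns are retained, in line with the decision to err on the side of inclusion.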

Once the data tables were appropriately labelled and assembled for training and testing separately, it was very easy to merge them into one data set using the merge() command. I was then able to replace the activity codes with their descriptions as per the definitions contained in 'activity_labels.txt'.
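The combining and relabelling steps might look roughly like the sketch below. Note that it stacks the two toy tables with rbind() as a simple stand-in rather than reproducing the merge() call mentioned above, and that all object names are hypothetical.

```r
# Stand-in for 'activity_labels.txt': activity code and descriptive name.
activity_labels <- data.frame(
  code = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Toy training and testing tables with identical columns.
train <- data.frame(subject = c(1L, 1L), activity = c(5L, 4L),
                    value = c(0.28, 0.27))
test <- data.frame(subject = c(2L, 2L), activity = c(1L, 6L),
                   value = c(0.26, 0.25))

# Stack the rows (swapped-in technique: rbind() instead of merge()).
combined <- rbind(train, test)

# Replace each activity code with its descriptive name.
combined$activity <- factor(combined$activity,
                            levels = activity_labels$code,
                            labels = activity_labels$name)
print(combined)
```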

Step 5 I had the most trouble with, because for some reason I couldn't get 'dplyr' to work. I was eventually able to use the 'reshape' package to generate the appropriate summary table, and then write it out to disk using the write.table() function (using the option row.names=FALSE).
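For readers without the 'reshape' package, the same kind of summary can be sketched with base R's aggregate(). This is a swapped-in technique, not what the script does, and the column names `mean_x` and `std_x` are made up for the example.

```r
# Toy labelled data set: two identifier columns and two measurements.
labelled <- data.frame(
  subject  = c(1L, 1L, 1L, 2L),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  mean_x   = c(0.2, 0.4, 0.1, 0.5),
  std_x    = c(-0.9, -0.7, -0.8, -0.6)
)

# Average every measurement column within each subject/activity pair.
summary_table <- aggregate(. ~ subject + activity, data = labelled, FUN = mean)

# Write the result without row names, as described above.
out_file <- tempfile(fileext = ".txt")
write.table(summary_table, out_file, row.names = FALSE)
print(summary_table)
```

The result has one row per subject/activity combination that occurs in the data, and one column per averaged measurement.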

Instructions for reading the data summary file

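Assuming that the working directory is set to the directory where 'summarised_analysis.txt' resides, the summary table can be read into memory with the following command (header = TRUE is needed because the file was written with row.names=FALSE, so read.table() does not detect the header row on its own):

read.table("summarised_analysis.txt", header = TRUE)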
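The effect of the header argument on a file written with row.names=FALSE can be checked with a small round trip through a temporary file. This is a sketch on toy data, and the object names are hypothetical.

```r
# Write a tiny table the same way the summary file was written.
summary_table <- data.frame(subject = c(1L, 2L), mean_x = c(0.3, 0.5))
tmp <- tempfile(fileext = ".txt")
write.table(summary_table, tmp, row.names = FALSE)

# With header = TRUE the first line supplies the column names.
with_header <- read.table(tmp, header = TRUE)

# Without it, the header line is read as an ordinary data row.
without_header <- read.table(tmp)

print(names(with_header))
print(names(without_header))
print(nrow(without_header))
```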
