coursera-practicalMachineLearning

##Goal of the project In this project, we're given a training set at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and a test set at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv from the source http://groupware.les.inf.puc-rio.br/har These sets contains data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants who are asked to perform barbell lifts correctly and incorrectly in 5 different ways. Our goal is to predict the manner in which the people did the exercise, i.e. the "classe" variable in the training set. After training and creating a model, we will use this model to predict 20 different test cases in the test data set. We will submit our final predictions to Coursera and see how well we did :)

##Preparation for R Markdown

library(knitr)
opts_chunk$set(echo = TRUE, message=FALSE, results = 'hold')

##Loading all the needed libraries first

library(caret)
library(randomForest)
library(corrplot)

##Setting the seed

set.seed(1535)

##Downloading testing and training data

Let's download the 2 files from the internet first.

download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "/Users/aslisabanci/Documents/pml/trainingData.csv", method="curl")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "/Users/aslisabanci/Documents/pml/testingData.csv", method="curl")

Now we should get what's inside the downloaded training .csv file. When reading the files, we'll also specify a vector of values that should be treated as NAs.

trainingData = read.csv("/Users/aslisabanci/Documents/pml/trainingData.csv", na.strings = c("NA", "#DIV/0!",""))
dim(trainingData)

We see that the training data has 19622 observations of 160 variables.

Let's do the same for our testing .csv file.

testingData = read.csv("/Users/aslisabanci/Documents/pml/testingData.csv", na.strings = c("NA", "#DIV/0!",""))
dim(testingData)

We see that the testing data has 20 observations of 160 variables.

#Reducing the number of predictors Looking at the predictors, logically the first 5 of them (id, user name and timestamps) shouldn't have anything to do with our model. So let's remove them from both the training and testing data sets.

trainingData<-trainingData[, -c(1:5)]
testingData<-testingData[, -c(1:5)]

Now, we're left with 155 columns.

Another thing that draws our attention in the data is that there are many NA values. Let's remove columns which have more than 50% of "NA".

naColumns <- colSums(is.na(trainingData)) > nrow(trainingData) * .5
trainingData <- trainingData[,!naColumns]

Now our training data has 55 columns.

Lastly, let's check for columns with near zero variance and remove those columns, too.

nearZeroVarColumns <- nearZeroVar(trainingData)
trainingData <- trainingData[, -nearZeroVarColumns]

Now, our training data has 54 columns.

We should do the same steps on our testing data, too.

testingData <- testingData[,!naColumns]
testingData <- testingData[, -nearZeroVarColumns]

Now, our testing data also has 54 columns. We're satisfied with the cleaning part so far, and we're good to go :)

##Cross Validation In order to avoid overfitting, it's good to do cross validation on our training data. Also, we would like to see our model's accuracy on a cross validation set, before we apply it on the final testing set.

So we'll split our cleaned training data into two sets, named: cleanTraining and crossValidation.

inTrain <- createDataPartition(trainingData$classe, p=0.7, list=FALSE)
cleanTraining <- trainingData[inTrain,]
crossValidation <- trainingData[-inTrain,]

Our clean training set has 13737 observations and our cross validation set has 5885 observations.

##Fitting a model We'll build a random forest model because we're given a non-linear problem and we know that tree methods usually work well with high accuracy, with these kinds of problems. And, compared to a single tree model, random forest's accuracy would be better.

model = randomForest(classe ~., data=cleanTraining)
model

Our OOB error rate of .34% looks promising and tells us that we can continue using this model.

Now it's time we get our predictions:

prediction <- predict(model, crossValidation)
confMatrix <- confusionMatrix(prediction,crossValidation$classe)
outOfSampleError <- 1-sum(diag(confMatrix$table))/sum(confMatrix$table)
outOfSampleError

Out of sample error is calculated as 0.0022.

Let's look at our confusion matrix in detail:

print(confMatrix)

Our model's accuracy is 0.9978 which is pretty good :)

#Getting final predictions on the test set Now we can apply our prediction model to the testing data with 20 observations.

testSetPrediction <- predict(model, testingData)
testSetPrediction

#Saving final predictions on to the files We'll define the function we're given by the Coursera team, to write the predictions in separate files

pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}

Now, let's call this function with our testSetPrediction vector

pml_write_files(testSetPrediction)

##Final validation We've submitted these predictions to Coursera and our model passed the validation %100!

aslisabanci / coursera-practicalmachinelearning Goto Github PK

coursera-practicalmachinelearning's Introduction

coursera-practicalMachineLearning

coursera-practicalmachinelearning's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent