Introduction

In this study by Velloso et al., accelerometers were placed in various positions on subjects performing weight lifting exercises. The aim is to identify which method was being used during the exercise.

The sensors were placed on:

- the belt
- the upper arm
- the forearm (glove)
- the dumbbell

An experienced weight lifter observed the exercise and classified it as:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front

This is the classe variable that we are trying to predict.

Data Processing

Download

The training and test data (the latter reserved for the final test) are downloaded, if required, and loaded.

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFile <- "pml-training.csv"
if (!file.exists(trainFile)) {
    download.file(trainURL, trainFile)
}
trainingAll <- read.csv(trainFile)

finalTestURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
finalTestFile <- "pml-testing.csv"
if (!file.exists(finalTestFile)) {
    download.file(finalTestURL, finalTestFile)
}
finalTesting <- read.csv(finalTestFile)

Variable Selection

Variables that are almost entirely missing (90% or more NA) in the final testing data are removed from both sets.

NAMean <- sapply(finalTesting, function(x){mean(is.na(x))})
notNA <- (NAMean < 0.9)
trainingAll <- trainingAll[,notNA]
finalTesting <- finalTesting[,notNA]

The first seven columns, which contain the weight lifters' names, timestamps, and similar bookkeeping information, are also removed.

trainingAll <- trainingAll[,-c(1:7)]
finalTesting <- finalTesting[,-c(1:7)]
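As a quick sanity check (not shown in the original output), the dimensions of the reduced data sets can be inspected; the caret summaries below report 52 predictors, so 53 columns are expected in each set.

# Expect 53 columns in each set: 52 predictors plus the outcome/id column.
dim(trainingAll)
dim(finalTesting)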

Cross Validation

The training data is split into three parts:

- a base model training set (60%)
- a base model testing / stack training set (20%)
- a stacked model testing set (20%)

library(caret)
set.seed(271001)
# 60% of the data for base model training
inTrain = createDataPartition(trainingAll$classe, p = 0.6, list = FALSE)
training <- trainingAll[inTrain,]
trainingTMP <- trainingAll[-inTrain,]
# split the remaining 40% equally: stack training (20%) and stacked model testing (20%)
inStack = createDataPartition(trainingTMP$classe, p = 1/2, list = FALSE)
trainingStack <- trainingTMP[inStack,]
trainingTest <- trainingTMP[-inStack,]
rm(trainingTMP)

Exploratory Analysis

Feature plot

None of the individual features shows much potential for separating the classe variable; the six features shown below are among the best.

featurePlot(x = training[,c("yaw_belt", "accel_belt_z", "magnet_belt_x", "magnet_belt_y", "magnet_arm_y", "magnet_forearm_x")], y = training$classe, labels = c("classe",""))

Scatter plot

A scatter plot using two of these variables shows a large amount of overlap between the classe values. While there may be differences in variability between the classes, this will not help in identifying individual points. There is not much evidence to suggest that a linear model would have much success, so non-linear models will also be tested.

library(ggplot2)
ggplot(training, aes(x = magnet_belt_y, y = magnet_arm_y, colour = classe)) +
    geom_point(alpha = 0.3) + labs(title = "Scatterplot showing very little separation")

Model Building

Four individual models were fitted. Each was then evaluated for out-of-sample error on the base model testing / stack training set (trainingStack), following the pattern sketched below.
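A minimal sketch of that evaluation pattern is shown here; the helper oosError is illustrative only and is not used in the analysis that follows.

# Illustrative helper: predict on the held-out stack training set and
# return the estimated out-of-sample error (1 - accuracy).
oosError <- function(model, newdata = trainingStack) {
    pred <- predict(model, newdata = newdata)
    cm <- confusionMatrix(pred, newdata$classe)
    1 - as.numeric(cm$overall["Accuracy"])
}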

Decision Tree

The first model tested was a simple decision tree.

set.seed(43770)
rpartModel <- train(classe ~ ., data = training, method = "rpart")
rpartModel
## CART 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.03595159  0.5147208  0.36911608
##   0.06063123  0.4119124  0.20362821
##   0.11770290  0.3253111  0.06482018
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03595159.

The tree generated is shown below.

library(rattle)
fancyRpartPlot(rpartModel$finalModel)

Model testing

rpartPred <- predict(rpartModel, newdata = trainingStack)
rpartCM <- confusionMatrix(rpartPred, trainingStack$classe) #50%
rpartCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1009  318  307  300   79
##          B   19  274   21  108  108
##          C   72  167  356  235  196
##          D    0    0    0    0    0
##          E   16    0    0    0  338
## 
## Overall Statistics
##                                           
##                Accuracy : 0.504           
##                  95% CI : (0.4882, 0.5197)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3522          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9041  0.36100  0.52047   0.0000  0.46879
## Specificity            0.6423  0.91909  0.79315   1.0000  0.99500
## Pos Pred Value         0.5012  0.51698  0.34698      NaN  0.95480
## Neg Pred Value         0.9440  0.85706  0.88678   0.8361  0.89269
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2572  0.06984  0.09075   0.0000  0.08616
## Detection Prevalence   0.5131  0.13510  0.26153   0.0000  0.09024
## Balanced Accuracy      0.7732  0.64005  0.65681   0.5000  0.73190

Accuracy is very low, at only 50%. Note that there is not a single prediction of classe = “D”. The out-of-sample error is estimated to be 50%.

Linear Discriminant Analysis

The second model attempted was Linear Discriminant Analysis.

set.seed(43770)
ldaModel <- train(classe ~ ., data = training, method = "lda")
ldaModel
## Linear Discriminant Analysis 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7006017  0.6214287
## 
## 

Model testing

ldaPred <- predict(ldaModel, newdata = trainingStack)
ldaCM <- confusionMatrix(ldaPred, trainingStack$classe) #70%
ldaCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 889 120  71  41  29
##          B  25 494  68  25 137
##          C  99  88 452  91  60
##          D  99  30  76 464  57
##          E   4  27  17  22 438
## 
## Overall Statistics
##                                         
##                Accuracy : 0.6977        
##                  95% CI : (0.683, 0.712)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.6175        
##  Mcnemar's Test P-Value : < 2.2e-16     
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7966   0.6509   0.6608   0.7216   0.6075
## Specificity            0.9070   0.9194   0.8956   0.9201   0.9781
## Pos Pred Value         0.7730   0.6595   0.5722   0.6391   0.8622
## Neg Pred Value         0.9181   0.9165   0.9259   0.9440   0.9171
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2266   0.1259   0.1152   0.1183   0.1116
## Detection Prevalence   0.2931   0.1909   0.2014   0.1851   0.1295
## Balanced Accuracy      0.8518   0.7851   0.7782   0.8209   0.7928

There is some improvement in accuracy, to 70%. The out-of-sample error is estimated to be 30%.

Stochastic Gradient Boosting

As basic models were not highly successful, more sophisticated models were tested, starting with a boosted model, Stochastic Gradient Boosting.

set.seed(43770)
gbmModel <- train(classe ~ ., data = training, method = "gbm", verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7532751  0.6875371
##   1                  100      0.8179083  0.7696817
##   1                  150      0.8477424  0.8073977
##   2                   50      0.8501228  0.8102000
##   2                  100      0.8994659  0.8728019
##   2                  150      0.9258328  0.9061706
##   3                   50      0.8894213  0.8600255
##   3                  100      0.9346021  0.9172661
##   3                  150      0.9538152  0.9415799
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Model testing

gbmPred <- predict(gbmModel, newdata = trainingStack)
gbmCM <- confusionMatrix(gbmPred, trainingStack$classe) #96%

There is a large improvement in accuracy, to 96%, although there is still a little room for improvement. The out-of-sample error is estimated to be 4%.

Random Forest

The final base model tried was a Random Forest, as this algorithm is known to be very successful for classification.

set.seed(43770)
rfModel <- train(classe ~ ., data = training, method = "rf")
rfModel
## Random Forest 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9850077  0.9810360
##   27    0.9865813  0.9830265
##   52    0.9778256  0.9719505
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

Model testing

rfPred <- predict(rfModel, newdata = trainingStack)
rfCM <- confusionMatrix(rfPred, trainingStack$classe) #98.9%

This model shows the highest accuracy, at 99%. The out-of-sample error is estimated to be 1%.

Confusion Matrices

The difference in performance between the models can be seen clearly in a comparison of the confusion matrices below.
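As a compact numeric companion to the matrices, the four accuracy estimates can also be collected into one table (a short sketch using the confusionMatrix objects computed above):

# Side-by-side out-of-sample accuracy estimates from the stack training set.
data.frame(
    Model    = c("Decision Tree", "Linear Discriminant Analysis",
                 "Stochastic Gradient Boosting", "Random Forest"),
    Accuracy = c(rpartCM$overall["Accuracy"], ldaCM$overall["Accuracy"],
                 gbmCM$overall["Accuracy"], rfCM$overall["Accuracy"])
)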

library(dplyr)
# Normalise each confusion matrix by its row (Prediction) totals, so each row
# shows how the predictions for that class are distributed over the actual classes.
normCM1 <- as.data.frame(rpartCM$table / apply(rpartCM$table, 1, sum))
normCM1$Model = 1
normCM2 <- as.data.frame(ldaCM$table / apply(ldaCM$table, 1, sum))
normCM2$Model = 2
normCM3 <- as.data.frame(gbmCM$table / apply(gbmCM$table, 1, sum))
normCM3$Model = 3
normCM4 <- as.data.frame(rfCM$table / apply(rfCM$table, 1, sum))
normCM4$Model = 4
normCM <- bind_rows(normCM1, normCM2, normCM3, normCM4)
modelNames <- factor(1:4, labels = c("Decision Tree", "Linear Discriminant Analysis", "Stochastic Gradient Boosting", "Random Forest"))
normCM$Model <- modelNames[normCM$Model]
# The decision tree never predicts "D", so its "D" row is 0/0 = NaN; treat these as 0.
normCM$Freq[is.na(normCM$Freq)] <- 0
ggplot(normCM, aes(y = Prediction, x = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = sprintf("%1.2f",Freq)), vjust = 1) +
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_y_discrete(limits = c("E", "D", "C", "B", "A")) +
    labs(x = "Actual", title = "Normalized Confusion Matrix", fill="") +
    facet_wrap(~Model)

Model Stacking

Although the Random Forest model has very high accuracy, it was decided to see whether model stacking could improve it further. The predictions that the four base models make on the base model testing / stack training set are themselves used as predictors.

The models were stacked using a Random Forest Model.

stackDF <- data.frame(rpartPred, ldaPred, gbmPred, rfPred, classe = trainingStack$classe)
combMod <- train(classe ~ ., data = stackDF, method = "rf")

Stacked Model testing

The function stackedPredict obtains the predictions from the four base models and passes these to the stacked model.

stackedPredict <- function(newdata) {
    rpartPred <- predict(rpartModel, newdata = newdata)
    ldaPred <- predict(ldaModel, newdata = newdata)
    gbmPred <- predict(gbmModel, newdata = newdata)
    rfPred <- predict(rfModel, newdata = newdata)
    stackDF <- data.frame(rpartPred, ldaPred, gbmPred, rfPred)
    predict(combMod, newdata = stackDF)    
}
combPred <- stackedPredict(trainingTest)
stackedCM <- confusionMatrix(combPred, trainingTest$classe) #99.1

There is no significant improvement over the Random Forest model alone, with accuracy of 99%.

This gives a final out-of-sample error estimate of 1%.
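To back up the comparison, the Random Forest model alone can also be scored on the held-out trainingTest set and placed next to the stacked result (a sketch; these lines were not part of the original run):

# Compare the plain Random Forest to the stacked model on trainingTest.
rfTestPred <- predict(rfModel, newdata = trainingTest)
rfTestCM <- confusionMatrix(rfTestPred, trainingTest$classe)
c(RandomForest = unname(rfTestCM$overall["Accuracy"]),
  Stacked = unname(stackedCM$overall["Accuracy"]))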

Final Testing

This stacked model was used to make the predictions for the final test set.

As these final test predictions are the answers for a Coursera Quiz, the honour code prevents me from sharing them, sorry!

However, I will confirm that they were accurate. Further, they were identical to the predictions made using just the Random Forest model (as checked after the code below).

finalPred <- stackedPredict(finalTesting)
data.frame(classe = finalPred)
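The claim that the stacked predictions match the Random-Forest-only predictions can be checked directly (a small sketch, assuming the models above are still in memory):

# TRUE if the stacked and Random-Forest-only predictions agree on every final test case.
all(finalPred == predict(rfModel, newdata = finalTesting))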

Citation

The data for this study comes, with thanks, from:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.

Read more: http://groupware.les.inf.puc-rio.br/har