In this study by Velloso et al., accelerometers were placed in various positions on subjects performing weight lifting exercises. The aim is to identify which method is being used during the exercise.
The sensors were placed on:
Upper Arm
Forearm
Belt
Dumbbell
An experienced weight lifter observed the exercise and classified it as:
A - Correct method
B - Throwing elbows
C - Only lifting dumbbell halfway
D - Only lowering dumbbell halfway
E - Throwing hips
This is the classe variable that we are trying to predict.
The training data and the test data (reserved for the final test) are downloaded, if required, and then loaded.
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFile <- "pml-training.csv"
if (!file.exists(trainFile)) {
    download.file(trainURL, trainFile)
}
trainingAll <- read.csv(trainFile)
finalTestURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
finalTestFile <- "pml-testing.csv"
if (!file.exists(finalTestFile)) {
    download.file(finalTestURL, finalTestFile)
}
finalTesting <- read.csv(finalTestFile)
Variables that are almost entirely missing in the final testing data are removed from both sets.
NAMean <- sapply(finalTesting, function(x){mean(is.na(x))})
notNA <- (NAMean < 0.9)
trainingAll <- trainingAll[,notNA]
finalTesting <- finalTesting[,notNA]
The first seven columns, which contain the weight lifters' names, timestamps, and similar metadata, are also removed.
trainingAll <- trainingAll[,-c(1:7)]
finalTesting <- finalTesting[,-c(1:7)]
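As a quick sanity check, the dimensions of the two sets can be compared; both should now contain 53 columns, the 52 remaining sensor measurements plus one extra column (the classe outcome in the training data and the problem id in the final test data).
# Both sets should now have the same number of columns: 52 predictors plus one extra
dim(trainingAll)
dim(finalTesting)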
Training data is split into three parts:
60% Training
20% Base Model Testing and Stack Training (half of the remaining 40%)
20% Stacked Model Testing
library(caret)
set.seed(271001)
inTrain = createDataPartition(trainingAll$classe, p = 0.6, list = FALSE)
training <- trainingAll[inTrain,]
trainingTMP <- trainingAll[-inTrain,]
inStack = createDataPartition(trainingTMP$classe, p = 1/2, list = FALSE)
trainingStack <- trainingTMP[inStack,]
trainingTest <- trainingTMP[-inStack,]
rm(trainingTMP)
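A quick count of the rows (just a convenience check; splitSizes is not used later) confirms the approximate 60/20/20 split.
# Confirm the approximate 60/20/20 split of the training data
splitSizes <- c(training = nrow(training), stack = nrow(trainingStack), test = nrow(trainingTest))
round(splitSizes / sum(splitSizes), 2)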
None of the individual features shows much potential for separating the classe variable; the six features shown below are among the best.
featurePlot(x = training[,c("yaw_belt", "accel_belt_z", "magnet_belt_x", "magnet_belt_y", "magnet_arm_y", "magnet_forearm_x")], y = training$classe, labels = c("classe",""))
A scatter plot of two of these variables shows a large amount of overlap between the classe values. While there may be differences in variability between the classes, this will not help in identifying individual points. There is little evidence to suggest that a linear model would have much success, so non-linear models will be tested.
library(ggplot2)
ggplot(training, aes(x = magnet_belt_y, y = magnet_arm_y, colour = classe)) +
    geom_point(alpha = 0.3) + labs(title = "Scatterplot showing very little separation")
Four individual models were tested. Each was evaluated for out-of-sample error using the Base Model Testing and Stack Training set.
The first model tested was a simple Decision Tree.
set.seed(43770)
rpartModel <- train(classe ~ ., data = training, method = "rpart")
rpartModel
## CART
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03595159 0.5147208 0.36911608
## 0.06063123 0.4119124 0.20362821
## 0.11770290 0.3253111 0.06482018
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03595159.
The tree generated is shown below.
library(rattle)
fancyRpartPlot(rpartModel$finalModel)
rpartPred <- predict(rpartModel, newdata = trainingStack)
rpartCM <- confusionMatrix(rpartPred, trainingStack$classe) #50%
rpartCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1009 318 307 300 79
## B 19 274 21 108 108
## C 72 167 356 235 196
## D 0 0 0 0 0
## E 16 0 0 0 338
##
## Overall Statistics
##
## Accuracy : 0.504
## 95% CI : (0.4882, 0.5197)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3522
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9041 0.36100 0.52047 0.0000 0.46879
## Specificity 0.6423 0.91909 0.79315 1.0000 0.99500
## Pos Pred Value 0.5012 0.51698 0.34698 NaN 0.95480
## Neg Pred Value 0.9440 0.85706 0.88678 0.8361 0.89269
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2572 0.06984 0.09075 0.0000 0.08616
## Detection Prevalence 0.5131 0.13510 0.26153 0.0000 0.09024
## Balanced Accuracy 0.7732 0.64005 0.65681 0.5000 0.73190
Accuracy is very low, at only 50%. Note that there is not a single prediction of classe = “D”. Out-of-sample error is estimated to be 50%.
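Here, and for the models that follow, the out-of-sample error estimate is simply one minus the accuracy on the held-out set, for example:
# Estimated out-of-sample error = 1 - held-out accuracy
1 - rpartCM$overall["Accuracy"]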
The second model attempted was Linear Discriminant Analysis.
set.seed(43770)
ldaModel <- train(classe ~ ., data = training, method = "lda")
ldaModel
## Linear Discriminant Analysis
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7006017 0.6214287
##
##
ldaPred <- predict(ldaModel, newdata = trainingStack)
ldaCM <- confusionMatrix(ldaPred, trainingStack$classe) #70%
ldaCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 889 120 71 41 29
## B 25 494 68 25 137
## C 99 88 452 91 60
## D 99 30 76 464 57
## E 4 27 17 22 438
##
## Overall Statistics
##
## Accuracy : 0.6977
## 95% CI : (0.683, 0.712)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6175
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7966 0.6509 0.6608 0.7216 0.6075
## Specificity 0.9070 0.9194 0.8956 0.9201 0.9781
## Pos Pred Value 0.7730 0.6595 0.5722 0.6391 0.8622
## Neg Pred Value 0.9181 0.9165 0.9259 0.9440 0.9171
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2266 0.1259 0.1152 0.1183 0.1116
## Detection Prevalence 0.2931 0.1909 0.2014 0.1851 0.1295
## Balanced Accuracy 0.8518 0.7851 0.7782 0.8209 0.7928
There is some improvement in accuracy, at 70%. Out-of-sample error is estimated to be 30%.
As the basic models were not highly successful, more sophisticated models were tested, starting with a boosted model: Stochastic Gradient Boosting.
set.seed(43770)
gbmModel <- train(classe ~ ., data = training, method = "gbm", verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7532751 0.6875371
## 1 100 0.8179083 0.7696817
## 1 150 0.8477424 0.8073977
## 2 50 0.8501228 0.8102000
## 2 100 0.8994659 0.8728019
## 2 150 0.9258328 0.9061706
## 3 50 0.8894213 0.8600255
## 3 100 0.9346021 0.9172661
## 3 150 0.9538152 0.9415799
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
gbmPred <- predict(gbmModel, newdata = trainingStack)
gbmCM <- confusionMatrix(gbmPred, trainingStack$classe) #96%
There is a large improvement in accuracy, to 96%. However, there is still a little room for improvement. Out-of-sample error is estimated to be 4%.
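The full confusion matrix is not reproduced here, but the overall accuracy behind the 96% figure can be extracted directly from it:
# Overall held-out accuracy of the boosted model (roughly 96%)
gbmCM$overall["Accuracy"]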
The final (base) model tried was Random Forest, as this is known to be a very successful classification algorithm.
set.seed(43770)
rfModel <- train(classe ~ ., data = training, method = "rf")
rfModel
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9850077 0.9810360
## 27 0.9865813 0.9830265
## 52 0.9778256 0.9719505
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
rfPred <- predict(rfModel, newdata = trainingStack)
rfCM <- confusionMatrix(rfPred, trainingStack$classe) #98.9%
This model shows the highest accuracy, at 99%. Out-of-sample error is estimated to be 1%.
The difference in the success of the models can be seen clearly in a comparison of the confusion matrices below.
library(dplyr)
normCM1 <- as.data.frame(rpartCM$table / apply(rpartCM$table, 1, sum))
normCM1$Model = 1
normCM2 <- as.data.frame(ldaCM$table / apply(ldaCM$table, 1, sum))
normCM2$Model = 2
normCM3 <- as.data.frame(gbmCM$table / apply(gbmCM$table, 1, sum))
normCM3$Model = 3
normCM4 <- as.data.frame(rfCM$table / apply(rfCM$table, 1, sum))
normCM4$Model = 4
normCM <- bind_rows(normCM1, normCM2, normCM3, normCM4)
modelNames <- factor(1:4, labels = c("Decision Tree", "Linear Discriminant Analysis", "Stochastic Gradient Boosting", "Random Forest"))
normCM$Model <- modelNames[normCM$Model]
normCM$Freq[is.na(normCM$Freq)] <- 0
ggplot(normCM, aes(y = Prediction, x = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = sprintf("%1.2f", Freq)), vjust = 1) +
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_y_discrete(limits = c("E", "D", "C", "B", "A")) +
    labs(x = "Actual", title = "Normalized Confusion Matrix", fill = "") +
    facet_wrap(~Model)
Although the Random Forest model has very high accuracy, it was decided to test whether model stacking could improve it further. The predictions made by the four models on the Base Model Testing and Stack Training set are themselves used as predictors.
The models were stacked using a Random Forest Model.
stackDF <- data.frame(rpartPred, ldaPred, gbmPred, rfPred, classe = trainingStack$classe)
combMod <- train(classe ~ ., data = stackDF, method = "rf")
The function stackedPredict obtains the predictions from the four base models and passes them to the stacked model.
stackedPredict <- function(newdata) {
    rpartPred <- predict(rpartModel, newdata = newdata)
    ldaPred <- predict(ldaModel, newdata = newdata)
    gbmPred <- predict(gbmModel, newdata = newdata)
    rfPred <- predict(rfModel, newdata = newdata)
    stackDF <- data.frame(rpartPred, ldaPred, gbmPred, rfPred)
    predict(combMod, newdata = stackDF)
}
combPred <- stackedPredict(trainingTest)
stackedCM <- confusionMatrix(combPred, trainingTest$classe) #99.1
There is no significant improvement over the Random Forest model, with an accuracy of 99%.
This gives a final out-of-sample error estimate of 1%.
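For a like-for-like comparison, the Random Forest model alone can also be scored on the same Stacked Model Testing set (rfTestPred below is just a temporary prediction vector); its accuracy should be essentially the same as the stacked model's.
# Score the plain Random Forest on the same held-out set used to evaluate the stack
rfTestPred <- predict(rfModel, newdata = trainingTest)
confusionMatrix(rfTestPred, trainingTest$classe)$overall["Accuracy"]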
The stacked model was used to make the predictions for the final test set.
As these final test predictions are the answers for a Coursera Quiz, the honour code prevents me from sharing them, sorry!
However, I will confirm that they were accurate. Furthermore, they were identical to the predictions made using the Random Forest model alone.
finalPred <- stackedPredict(finalTesting)
data.frame(classe = finalPred)
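The agreement with the plain Random Forest model can be verified programmatically, without revealing the predictions themselves:
# Check that the stacked predictions match the plain Random Forest predictions
all(finalPred == predict(rfModel, newdata = finalTesting))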
The data for this study comes, with thanks, from:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz4e2aw89ds