Introduction

In this study by Veloso et al., accelerometers were placed at various positions on subjects performing weight lifting exercises. The aim is to identify how the exercise is being performed.

The sensors were placed on the belt, arm, forearm and dumbbell.

An experienced weight lifter observed the exercise and classified it as one of five classes, A to E, where A is the correct execution and B to E are common mistakes.

This classification is stored in the classe variable, which is the outcome to be predicted.

Data Processing

Download

The training data and the test data (reserved for the final test) are downloaded if not already present, then loaded.

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainFile <- "pml-training.csv"
if (!file.exists(trainFile)) {
    download.file(trainURL, trainFile)
}
trainingAll <- read.csv(trainFile)

finalTestURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
finalTestFile <- "pml-testing.csv"
if (!file.exists(finalTestFile)) {
    download.file(finalTestURL, finalTestFile)
}
finalTesting <- read.csv(finalTestFile)

Variable Selection

Variables that are missing in the final testing data (at least 90% of their values are NA) are removed from both sets.

NAMean <- sapply(finalTesting, function(x){mean(is.na(x))})
notNA <- (NAMean < 0.9)
trainingAll <- trainingAll[,notNA]
finalTesting <- finalTesting[,notNA]
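
The filter works because the mean of a logical vector is the fraction of TRUE values, so mean(is.na(x)) is the share of missing entries in a column. A minimal sketch on a toy data frame (toyTest and its values are made up for illustration and are not part of the study data):

```r
# Toy stand-in for finalTesting (hypothetical values)
toyTest <- data.frame(
    a = c(1, 2, 3, 4),
    b = c(NA, NA, NA, NA),
    c = c(1, NA, 3, 4)
)

# Fraction of missing values in each column: a = 0, b = 1, c = 0.25
toyNAMean <- sapply(toyTest, function(x) mean(is.na(x)))

# Keep only columns with fewer than 90% missing values
toyKeep <- toyNAMean < 0.9
toyClean <- toyTest[, toyKeep, drop = FALSE]
names(toyClean)
```

Here column b is dropped because it is entirely NA, while column c survives with a quarter of its values missing.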

The first seven columns, which contain the weight lifters' names, dates, etc., are also removed.

trainingAll <- trainingAll[,-c(1:7)]
finalTesting <- finalTesting[,-c(1:7)]
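
One practical note when re-running this code: since R 4.0.0, read.csv no longer converts strings to factors by default, so classe may be read as a character column. caret's classification and plotting functions generally expect a factor outcome. A minimal sketch of the conversion, using a toy data frame (the name toy and its values are assumptions for illustration):

```r
# Toy stand-in for trainingAll, with classe read as character
toy <- data.frame(
    classe = c("A", "B", "A", "E"),
    x = 1:4,
    stringsAsFactors = FALSE
)

# Convert the outcome to a factor; levels are sorted alphabetically
toy$classe <- factor(toy$classe)
levels(toy$classe)
```

On the real data the equivalent step would presumably be trainingAll$classe <- factor(trainingAll$classe), applied before the data is partitioned.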

Cross Validation

The training data is split into three parts: training (60%), trainingStack (20%) and trainingTest (20%).

library(caret)
set.seed(271001)
inTrain <- createDataPartition(trainingAll$classe, p = 0.6, list = FALSE)
training <- trainingAll[inTrain,]
trainingTMP <- trainingAll[-inTrain,]
inStack <- createDataPartition(trainingTMP$classe, p = 1/2, list = FALSE)
trainingStack <- trainingTMP[inStack,]
trainingTest <- trainingTMP[-inStack,]
rm(trainingTMP)
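
For a factor outcome, createDataPartition samples within each class, so every part keeps roughly the same class mix as the full data. The idea can be illustrated in base R. This is a simplified sketch of stratified sampling, not caret's implementation, and the labels y are hypothetical:

```r
set.seed(1)

# Hypothetical class labels standing in for trainingAll$classe
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 60% of the row indices separately within each class
idx <- unlist(lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = round(0.6 * length(i)))]
}), use.names = FALSE)

table(y[idx])    # 30 A, 18 B, 12 C
table(y[-idx])   # 20 A, 12 B, 8 C
```

Both parts keep the 5:3:2 class ratio of the full label vector, which is the property that makes the later accuracy estimates comparable across the three parts.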

Exploratory Analysis

Feature plot

None of the individual features shows much potential for separating the classes. The six features shown below are among the best.

featurePlot(x = training[,c("yaw_belt", "accel_belt_z", "magnet_belt_x", "magnet_belt_y", "magnet_arm_y", "magnet_forearm_x")], y = training$classe, labels = c("classe",""))

Scatter plot

A scatter plot of two of these variables shows a large amount of overlap between the classes. While the variability may differ between classes, this will not help in classifying individual points. There is little evidence that a linear model would succeed, so non-linear models will be tested.

library(ggplot2)
ggplot(training, aes(x = magnet_belt_y, y = magnet_arm_y, colour = classe)) +
    geom_point(alpha = 0.3) + labs(title = "Scatterplot showing very little separation")
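
The remark about differing variability can also be checked numerically by computing a per-class standard deviation. A minimal sketch on simulated data (the data frame sim and its values are invented for illustration and are not results from the study):

```r
set.seed(42)

# Simulated stand-in: two classes with the same mean but different spread
sim <- data.frame(
    classe = rep(c("A", "B"), each = 200),
    magnet_belt_y = c(rnorm(200, mean = 600, sd = 5),
                      rnorm(200, mean = 600, sd = 25))
)

# Standard deviation of the feature within each class
spread <- aggregate(magnet_belt_y ~ classe, data = sim, FUN = sd)
spread
```

In this simulation the classes differ clearly in spread but not in location, so a single observation near the shared mean could belong to either class. That is the situation described above. On the real data the same call would presumably use data = training.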