In this post I want to show how three machine learning algorithms, random forest, boosted trees, and linear discriminant analysis, compare to a stacked ensemble of all three. Load the Alzheimer's data using the following commands:
library(caret)
library(gbm)
set.seed(3433)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
Set the seed to 62433 and predict diagnosis with all the other variables using a random forest (“rf”), boosted trees (“gbm”) and linear discriminant analysis (“lda”) model. Stack the predictions together using random forests (“rf”). What is the resulting accuracy on the test set? Is it better or worse than each of the individual predictions?
set.seed(62433)
rfmodel  <- suppressMessages(train(diagnosis ~ ., data = training, method = "rf"))
gbmmodel <- suppressMessages(train(diagnosis ~ ., data = training, method = "gbm"))
ldamodel <- suppressMessages(train(diagnosis ~ ., data = training, method = "lda"))
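The prediction and stacking step isn't shown above, but the accuracy checks below reference rfresult, gbmresult, ldaresult, and combined.result. Here is a minimal sketch that produces those objects, assuming the usual approach for this exercise: predict on the test set with each model, combine the three predictions into a data frame, and fit a random forest on top of them (the names combined.data and combined.model are my own; the rest match the variables used below).

# Predict on the test set with each individual model
rfresult  <- predict(rfmodel,  testing)
gbmresult <- predict(gbmmodel, testing)
ldaresult <- predict(ldamodel, testing)

# Stack the three predictions: train a random forest that takes the
# individual predictions as features (combined.data / combined.model
# are assumed names for this sketch)
combined.data   <- data.frame(rfresult, gbmresult, ldaresult,
                              diagnosis = testing$diagnosis)
combined.model  <- suppressMessages(train(diagnosis ~ ., data = combined.data,
                                          method = "rf"))
combined.result <- predict(combined.model, combined.data)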
Now let's find the accuracy of each set of predictions. These results will vary a little depending on the versions of the packages you are using, but they shouldn't be far apart.
confusionMatrix(testing$diagnosis, rfresult)$overall['Accuracy']
confusionMatrix(testing$diagnosis, gbmresult)$overall['Accuracy']
confusionMatrix(testing$diagnosis, ldaresult)$overall['Accuracy']
confusionMatrix(testing$diagnosis, combined.result)$overall['Accuracy']
The output shows that the stacked prediction has higher accuracy than each of the individual algorithms.