The caret package (short for classification and regression training) contains functions to streamline
the model training process for complex regression and classification problems. The package utilizes
a number of R packages but tries not to load them all at package start-up¹. The package “suggests”
field includes 27 packages. caret loads packages as needed and assumes that they are installed.
To install the caret package, issue the command below in R to ensure that all the needed packages are installed.
install.packages("caret", dependencies = c("Depends", "Suggests"))
The main help pages for the package are at http://caret.r-forge.r-project.org/
Here, there are extended examples and a large amount of information that was previously found in the
package vignettes.
caret has several functions that attempt to streamline the model building and evaluation process,
as well as feature selection and other techniques.
One of the primary tools in the package is the train function, which can be used to:
• evaluate, using resampling, the effect of model tuning parameters on performance
• choose the “optimal” model across these parameters
• estimate model performance from a training set
¹ By removing formal package dependencies, the package start-up time can be greatly decreased.
There are options for customizing almost every step of this process (e.g. the resampling technique,
the choice of the optimal parameters, etc.). To demonstrate this function, the Sonar data from the
mlbench package will be used.
The Sonar data consist of 208 data points collected on 60 predictors. The goal is to predict the two
classes (M for metal cylinder or R for rock).
First, we split the data into two groups: a training set and a test set. To do this, the createDataPartition
function is used:
library(mlbench)
library(caret)
set.seed(100)
data("Sonar")
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
str(inTrain)
 int [1:157, 1] 1 2 3 4 5 6 7 11 13 16 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "Resample1"
The output is a set of integers for the rows of Sonar that belong in the training set; setting list = FALSE returns them as a matrix rather than a list. By default, createDataPartition does a stratified random split of the data. To partition the data:
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
nrow(training)
[1] 157
nrow(testing)
[1] 51
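As a quick check on the stratification described above, the class proportions in the full data and in the two partitions can be compared. This is a minimal sketch that repeats the split so that it runs on its own; it assumes the caret and mlbench packages are installed, and the name props is illustrative rather than taken from the original.

```r
library(mlbench)
library(caret)

data("Sonar")
set.seed(100)
inTrain  <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTrain, ]
testing  <- Sonar[-inTrain, ]

# Class proportions in the full data and in each partition. With a
# stratified split, the three rows should be close to one another.
props <- rbind(
  all      = prop.table(table(Sonar$Class)),
  training = prop.table(table(training$Class)),
  testing  = prop.table(table(testing$Class))
)
round(props, 3)
```

Because the split is made within each class, the M/R balance of the training set stays close to that of the full data, which matters when the classes are less evenly balanced than they are here.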
To tune a model using resampling, the train function can be used. More details on this function
can be found at http://caret.r-forge.r-project.org/training.html
Here, a partial least squares discriminant analysis (PLSDA) model will be tuned over the number
of PLS components that should be retained. The most basic syntax to do this is:
plsFit <- train(Class ~ ., data = training, method = "pls", preProc = c("center", "scale"))
Here, the preProc argument centers and scales the predictors for the training set and all future samples.
Take a look at the plsFit object:
plsFit
Partial Least Squares
157 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered (60), scaled (60)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 157, 157, 157, 157, 157, 157, …
Resampling results across tuning parameters:
  ncomp  Accuracy   Kappa
  1      0.7473460  0.4946851
  2      0.7798675  0.5604342
  3      0.7750962  0.5497151
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 2.
However, we would probably like to customize it in a few ways:
• expand the set of PLS models that the function evaluates. By default, the function will tune
over three values of each tuning parameter.
• change the type of resampling used. The simple bootstrap is used by default. We will have the
function use three repeats of 10-fold cross-validation.
• change the methods for measuring performance. If unspecified, overall accuracy and the Kappa
statistic are computed. For regression models, root mean squared error and R² are computed.
Here, the function will be altered to estimate the area under the ROC curve, the sensitivity
and specificity.
To change the candidate values of the tuning parameter, either the tuneLength or the tuneGrid argument can be used. The train function can generate a candidate set of parameter values, and the tuneLength argument controls how many are evaluated. In the case of PLS, the function uses a sequence of integers from 1 to tuneLength, so setting tuneLength = 15 evaluates all integers between 1 and 15.
The tuneGrid argument is used when specific values are desired. A data frame is used where each row is a tuning parameter setting and each column is a tuning parameter. An example is used below to illustrate this.
The syntax for the model would then be:
plsFit <- train(Class ~ ., data = training, method = "pls", tuneLength = 15, preProc = c("center", "scale"))
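The tuneGrid alternative described above can be sketched with base R's expand.grid. The object names plsGrid and rdaGridWide are illustrative and do not come from the original text; the column names are assumed to match the tuning parameter names that train reports (ncomp for PLS, gamma and lambda for regularized discriminant analysis).

```r
# One tuning parameter: a single column, one row per candidate value.
plsGrid <- expand.grid(ncomp = 1:15)

# Two tuning parameters: expand.grid() builds every combination, so each
# row is one candidate setting and each column is one tuning parameter.
rdaGridWide <- expand.grid(gamma = (0:4) / 4, lambda = c(0.5, 0.75))

nrow(plsGrid)
nrow(rdaGridWide)
head(rdaGridWide)

# Illustrative use with train() (not run here):
# plsFit <- train(Class ~ ., data = training, method = "pls",
#                 tuneGrid = plsGrid, preProc = c("center", "scale"))
```

For PLS, this grid covers the same candidates as tuneLength = 15; tuneGrid becomes more useful when the candidate values are irregular or when only some combinations of several parameters are of interest.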
To modify the resampling method, the trainControl function is used. The option method controls
the type of resampling and defaults to "boot". Another method, "repeatedcv", is used to specify
repeated K-fold cross-validation (and the argument repeats controls the number of repetitions).
K is controlled by the number argument and defaults to 10. The new syntax is then:
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
plsFit <- train(Class ~ ., data = training, method = "pls", tuneLength = 15, trControl = ctrl, preProc = c("center", "scale"))
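To get a feel for what three repeats of 10-fold cross-validation mean in terms of resamples, caret's createMultiFolds function can be used to build such folds directly. This is only an illustration under the assumption that caret and mlbench are installed: it uses the full Sonar data for brevity, and it is not claimed to reproduce the exact folds that train would generate.

```r
library(mlbench)
library(caret)

data("Sonar")
set.seed(123)

# 10 folds repeated 3 times gives 30 resamples. Each list element holds
# the row indices used for model fitting in that resample; the remaining
# rows are held out for performance estimation.
folds <- createMultiFolds(Sonar$Class, k = 10, times = 3)

length(folds)
head(names(folds), 3)
range(lengths(folds))
```

Each resample fits the model on roughly nine tenths of the rows, which is why the printed "Summary of sample sizes" under repeated cross-validation is smaller than the full training set.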
Finally, to choose different measures of performance, additional arguments are given to trainControl.
The summaryFunction argument is used to pass in a function that takes the observed and predicted
values and estimates some measure of performance. Two such functions are already included in the
package: defaultSummary and twoClassSummary. The latter will compute measures specific to two-class
problems, such as the area under the ROC curve, the sensitivity and specificity. Since the ROC
curve is based on the predicted class probabilities (which are not computed automatically), another
option is required. The classProbs = TRUE option is used to include these calculations.
Lastly, the function will pick the tuning parameters associated with the best results. Since we are
using custom performance measures, the criterion that should be optimized must also be specified.
In the call to train, we can use metric = "ROC" to do this.
The final model would then be:
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
plsFit <- train(Class ~ ., data = training, method = "pls", tuneLength = 15, trControl = ctrl, metric = "ROC", preProc = c("center", "scale"))
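For readers curious about what a summaryFunction looks like, the sketch below defines a hypothetical one named accSummary (it is not part of caret). It assumes the interface used by defaultSummary and twoClassSummary: a data frame with factor columns obs and pred, plus the arguments lev and model, returning a named numeric vector.

```r
# Hypothetical custom summary function (not part of caret).
# `data` is assumed to contain factor columns `obs` (observed classes)
# and `pred` (predicted classes) that share the same levels.
accSummary <- function(data, lev = NULL, model = NULL) {
  c(Accuracy = mean(data$obs == data$pred))
}

# Toy input with made-up labels, only to show the calling convention.
toy <- data.frame(
  obs  = factor(c("M", "M", "R", "R"), levels = c("M", "R")),
  pred = factor(c("M", "R", "R", "R"), levels = c("M", "R"))
)
accSummary(toy, lev = levels(toy$obs))
```

The name of the returned element is what would be passed as the metric argument of train, in the same way that "ROC" refers to an element returned by twoClassSummary.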
A final model is then fit to the whole training set using this specification, and this is the model that is
used to predict future samples. The package has several functions for visualizing the results. One method for doing this is the plot function for train objects. The command plot(plsFit) produces a plot
that shows the relationship between the resampled performance values and the number of PLS
components.
To predict new samples, predict.train can be used. For classification models, the default behavior
is to calculate the predicted class. The option type = "prob" can be used to compute class
probabilities from the model. For example:
plsClasses <- predict(plsFit, newdata = testing)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
str(plsClasses)
 Factor w/ 2 levels "M","R": 1 1 1 2 2 2 2 1 2 2 ...
head(plsProbs)
           M         R
8  0.5680417 0.4319583
9  0.5970792 0.4029208
10 0.5047394 0.4952606
12 0.4618328 0.5381672
14 0.3499746 0.6500254
15 0.4113770 0.5886230
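The two kinds of prediction can be related to each other: for the six rows printed above, picking the class with the larger probability gives the same labels as the first six predicted classes. The sketch below retypes those six rows by hand; the names probs and classes are illustrative, and the argmax rule is offered as an illustration of these rows rather than as a statement of how every model in caret assigns classes.

```r
# The six probability rows shown above, retyped by hand.
probs <- data.frame(
  M = c(0.5680417, 0.5970792, 0.5047394, 0.4618328, 0.3499746, 0.4113770),
  R = c(0.4319583, 0.4029208, 0.4952606, 0.5381672, 0.6500254, 0.5886230),
  row.names = c("8", "9", "10", "12", "14", "15")
)

# For each row, take the column (class) with the largest probability.
best    <- max.col(as.matrix(probs), ties.method = "first")
classes <- factor(colnames(probs)[best], levels = c("M", "R"))

classes              # M M M R R R
as.integer(classes)  # 1 1 1 2 2 2, matching the start of str(plsClasses)
```

Working with the probabilities directly also makes it possible to use a cutoff other than 0.5 when one type of error is more costly than the other.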
caret contains a function to compute the confusion matrix and associated statistics for the model fit:
confusionMatrix(data = plsClasses, testing$Class)
Confusion Matrix and Statistics
          Reference
Prediction  M  R
         M 14  5
         R 13 19
Accuracy : 0.6471
95% CI : (0.5007, 0.7757)
No Information Rate : 0.5294
P-Value [Acc > NIR] : 0.06052
Kappa : 0.3045
Mcnemar's Test P-Value : 0.09896
Sensitivity : 0.5185
Specificity : 0.7917
Pos Pred Value : 0.7368
Neg Pred Value : 0.5937
Prevalence : 0.5294
Detection Rate : 0.2745
Detection Prevalence : 0.3725
Balanced Accuracy : 0.6551
'Positive' Class : M
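Several of the statistics reported above follow directly from the four cell counts. As a worked check, the sketch below rebuilds the table by hand in base R and recomputes accuracy, sensitivity, specificity and Kappa with M as the positive class; the object names are illustrative.

```r
# Confusion matrix from the PLS model: rows are predictions, columns are
# the reference (observed) classes.
cm <- matrix(c(14, 13, 5, 19), nrow = 2,
             dimnames = list(Prediction = c("M", "R"),
                             Reference  = c("M", "R")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                # (14 + 19) / 51
sensitivity <- cm["M", "M"] / sum(cm[, "M"])    # 14 / 27
specificity <- cm["R", "R"] / sum(cm[, "R"])    # 19 / 24

# Kappa compares observed agreement with the agreement expected by
# chance, computed from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Kappa = kappa), 4)
```

The rounded values agree with the printed output (0.6471, 0.5185, 0.7917 and 0.3045), and the balanced accuracy of 0.6551 is the mean of the sensitivity and specificity.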
To fit another model to the data, train can be invoked with minimal changes. Lists of models
available can be found at: http://caret.r-forge.r-project.org/modelList.html and
http://caret.r-forge.r-project.org/bytag.html
For example, to fit a regularized discriminant model to these data, the following syntax can be used.
To illustrate the tuneGrid argument, a custom grid is used:
rdaGrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
set.seed(123)
rdaFit <- train(Class ~ ., data = training, method = "rda", tuneGrid = rdaGrid, trControl = ctrl, metric = "ROC")
Take a look at rdaFit:
rdaFit
Regularized Discriminant Analysis
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 142, 141, 141, 141, 142, 142, …
Resampling results across tuning parameters:
  gamma  ROC        Sens       Spec
  0.00   0.8612517  0.8250000  0.7946429
  0.25   0.9117560  0.8810185  0.7416667
  0.50   0.9132688  0.8935185  0.7327381
  0.75   0.8998843  0.8777778  0.7238095
  1.00   0.7741319  0.7212963  0.6523810
Tuning parameter 'lambda' was held constant at a value of 0.75
ROC was used to select the optimal model using the largest value.
The final values used for the model were gamma = 0.5 and lambda = 0.75.
Now let's predict the test set using the new model:
rdaClasses <- predict(rdaFit, newdata = testing)
confusionMatrix(rdaClasses, testing$Class)
Confusion Matrix and Statistics
          Reference
Prediction  M  R
         M 18  8
         R  9 16
Accuracy : 0.6667
95% CI : (0.5208, 0.7924)
No Information Rate : 0.5294
P-Value [Acc > NIR] : 0.03308
Kappa : 0.3326
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.6667
Specificity : 0.6667
Pos Pred Value : 0.6923
Neg Pred Value : 0.6400
Prevalence : 0.5294
Detection Rate : 0.3529
Detection Prevalence : 0.5098
Balanced Accuracy : 0.6667
'Positive' Class : M
How do these models compare in terms of their resampling results? The resamples function can be
used to collect, summarize and contrast the resampling results. Since the random number seeds
were initialized to the same value prior to calling train, the same folds were used for each model.
To assemble them:
resamps <- resamples(list(pls = plsFit, rda = rdaFit))
There are several functions to visualize these results. For example, a Bland-Altman type plot can
be created using xyplot(resamps, what = "BlandAltman").
Since, for each resample, there are paired results, a paired t-test can be used to assess whether there
is a difference in the average resampled area under the ROC curve. The diff.resamples function
can be used to compute this:
diffs <- diff(resamps)
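The idea behind the paired comparison can be sketched in base R. The values below are synthetic and generated only for illustration (they are not results from the models above); 30 values per model are used to mirror 10 folds repeated 3 times, and the names rocA and rocB are hypothetical.

```r
set.seed(1)

# Synthetic per-resample ROC values for two models evaluated on the
# same 30 resamples (illustration only, not real results).
rocA <- runif(30, min = 0.80, max = 0.95)
rocB <- rocA - rnorm(30, mean = 0.02, sd = 0.03)

# Because both models saw the same resamples, the values are paired.
paired <- t.test(rocA, rocB, paired = TRUE)

# A paired t-test is equivalent to a one-sample t-test on the
# per-resample differences.
oneSample <- t.test(rocA - rocB)

c(paired = unname(paired$statistic), oneSample = unname(oneSample$statistic))
```

Pairing removes the resample-to-resample variation that both models share, which is why using the same folds for each model makes the comparison more sensitive.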