Machine Learning in R with caret
Machine Learning is one of the main uses of the R programming language. There are many Machine Learning models, and even more R packages implementing them; caret is one of the most important.
Although having many libraries that implement models is a good thing, it also has its downsides. For me, the main one is that the way you find the optimal hyperparameter values changes from library to library, which makes the process really tedious.
Luckily, in R we have caret: in my opinion, a fantastic package that lets you apply more than 230 Machine Learning models, all of them through the same interface, making your work much easier. In addition, caret includes many helpful functions for the different stages of the Machine Learning process: feature selection, data splitting, model validation, etc.
As you can see, caret is a fantastic package and, without a doubt, one of the packages you should know in depth if you use R. That is exactly what we are going to do in this post. Let's get to it!
How to prepare data with caret
Before building a Machine Learning model, we must first perform several steps, such as feature selection, imputation of missing values, creation of dummy variables, etc.
These are steps that you generally have to code yourself. Luckily, caret offers functions that help with most of the data preparation process. Let's get started with feature selection!
How to perform Feature Selection
One of the fundamental checks in variable selection is whether a variable's variance is zero or close to zero. This makes perfect sense: if the variance is close to zero, there is not much variation in the data, that is, almost all observations have similar values.
Therefore, variables with (near-)zero variance are usually discarded, since it is very likely that they only add noise to our model (although you should also consider the scale of the variable).
Checking whether variables have near-zero variance with caret is very simple: we just have to use the nearZeroVar function.
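As a quick illustration of the idea, here is a hypothetical toy example (the data frame and column names are made up for this sketch). By default, nearZeroVar returns the indices of the offending columns:

```r
library(caret)

# Hypothetical toy data: 'constant' takes the value 1 in 99 of 100 rows
toy = data.frame(
  constant = c(rep(1, 99), 2),
  noisy    = rnorm(100)
)

# By default, nearZeroVar returns the indices of near-zero-variance columns
nearZeroVar(toy)  # returns 1, i.e. the 'constant' column
```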
Let’s see how it works with an example from the Sacramento dataset, which includes information on house prices in Sacramento. First of all, in case you are not familiar with the dataset, let’s visualize it:
library(caret)
data(Sacramento)
str(Sacramento[1:3,])
'data.frame':	3 obs. of  9 variables:
 $ city     : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34
 $ zip      : Factor w/ 68 levels "z95603","z95608",..: 64 52 44
 $ beds     : int  2 3 2
 $ baths    : num  1 1 1
 $ sqft     : int  836 1167 796
 $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3
 $ price    : int  59222 68212 68880
 $ latitude : num  38.6 38.5 38.6
 $ longitude: num  -121 -121 -121
As we can see, we have several numerical variables (number of bathrooms, number of beds, price, latitude, and longitude). Let’s see if they have zero variance or not.
numeric_cols = sapply(Sacramento, is.numeric)
variance = nearZeroVar(Sacramento[numeric_cols], saveMetrics = T)
variance
As we can see, if we pass the saveMetrics argument, the function also returns the values it used for the calculations. One of the returned columns is nzv (near-zero variance), which is FALSE in all cases.
So, at least for now, we can use all of our numeric variables in our model.
Another important question is the correlation between variables. Several models, such as linear regression and logistic regression, do not work well with strongly correlated predictors.
So, let's see how to check the correlation between variables in R with caret.
How to find correlated variables with caret
Finding correlated variables in R with caret is very easy: you just have to pass a correlation matrix to the findCorrelation function. With this, caret will tell us which variables to eliminate (if any).
Let’s see how it works:
sacramento_cor = cor(Sacramento[numeric_cols])
findCorrelation(sacramento_cor)
As we can see, in this case, there are no correlated variables, so caret tells us that there is no variable to eliminate. However, if we create a new correlated variable, we will see how it would tell us that there are problems. Let’s see:
fake_data = data.frame(
  variable1 = 1:20,
  variable2 = (1:20) * 2,
  variable3 = runif(20),
  variable4 = runif(20) * runif(20)
)
findCorrelation(cor(fake_data), verbose = T, names = T)
Compare row 1 and column 2 with corr 1
  Means: 0.438 vs 0.269 so flagging column 1
All correlations <= 0.9
[1] "variable1"
As we can see, the findCorrelation function identifies that variable1 is perfectly correlated with variable2, and indicates that it should be removed. But what if a variable were a linear combination of other variables? That is, imagine a variable5 that is the sum of variable1 and twice variable3. This would still be a problem, even though no pair of variables is perfectly correlated.
Well, precisely to detect these cases, caret includes the findLinearCombos function. Let's see how it works:
# I create fake data
fake_data$variable5 = fake_data$variable1 + 2 * fake_data$variable3

# I check if there are any linear combinations
findLinearCombos(fake_data)
$linearCombos
$linearCombos[[1]]
[1] 2 1

$linearCombos[[2]]
[1] 5 1 3


$remove
[1] 2 5
As we can see, the findLinearCombos function tells us that column 2 is a linear combination of column 1, and that column 5 is a linear combination of columns 1 and 3. That is why it recommends eliminating columns 2 and 5.
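Acting on the recommendation is then a one-liner; a small sketch, reusing the fake_data frame from above (with variable5 already added):

```r
# Drop the columns flagged in the $remove element of the result
combo_info = findLinearCombos(fake_data)
fake_data_clean = fake_data[, -combo_info$remove]

# Only variable1, variable3 and variable4 remain
colnames(fake_data_clean)
```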
As we can see, thanks to the caret package we can address very important questions of the feature selection process in R, and in a very simple way.
But that's not all: R's caret package also helps a lot with data transformation. Let's see how it does it!
How to transform data with caret
Among the transformations we usually apply in Machine Learning, the main ones are:
- Creating dummy variables: many models cannot work with categorical variables directly. Instead, we dummify them, that is, we create n-1 columns (where n is the number of categories), each of which indicates the presence (1) or absence (0) of a particular value. Although many implementations (such as logistic regression) do this themselves, others (such as xgboost) require you to do it manually.
- Data scaling: consists of normalizing the scale of the data. This is very important in algorithms such as regularized models (Ridge, Lasso and Elastic Net) or kNN, among others.
- Imputation of missing values: the vast majority of models (except those based on trees) cannot work with missing values. That is why, when we have missing values, we either impute them or eliminate those observations or variables. Luckily, caret makes it very easy to impute missing values using various types of models.
- Dimensionality reduction: when we work on a problem with many variables, it is usually interesting to reduce their number while keeping as much variability as possible. This is usually done with a principal component analysis (PCA).
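Most of these transformations go through caret's preProcess function, and several can even be chained in a single call. A minimal sketch (the combination of method values shown here is an illustrative choice, all of them standard preProcess options):

```r
library(caret)
data(Sacramento)

# Chain several transformations in one preProcess call:
# impute medians, then center and scale the numeric columns
pre = preProcess(Sacramento, method = c("medianImpute", "center", "scale"))

# The transformations are applied later, via predict()
sacramento_ready = predict(pre, Sacramento)
summary(sacramento_ready$sqft)  # numeric columns now have mean 0 and sd 1
```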
So, let’s see how to do all these types of transformations in our machine learning models in R with caret, the vast majority of them with the same function:
How to create dummy variables with caret
Creating dummy variables with caret is very simple: we just have to use the dummyVars function and then apply predict to obtain the resulting data.

pre_dummy = dummyVars(price ~ type, data = Sacramento)
sacramento_dummy = predict(pre_dummy, Sacramento)
head(sacramento_dummy)
  type.Condo type.Multi_Family type.Residential
1          0                 0                1
2          0                 0                1
3          0                 0                1
4          0                 0                1
5          0                 0                1
6          1                 0                0
As we can see, caret has converted a single column (type) into three columns (one per category), each of them binary. However, it has not eliminated one of the categories, which creates redundancy. After all:
Condo = 0 & Multi_Family = 0 --> Residential = 1 .
Luckily, we can indicate this with the fullRank = TRUE argument, which drops the first category.

pre_dummy = dummyVars(price ~ type, data = Sacramento, fullRank = T)
sacramento_dummy = predict(pre_dummy, Sacramento)
head(sacramento_dummy)

  type.Multi_Family type.Residential
1                 0                1
2                 0                1
3                 0                1
4                 0                1
5                 0                1
6                 0                0
How to scale data
To scale the data, we simply have to pass the appropriate value to the method parameter of caret's preProcess function. This function accepts three main values:
- center: subtracts the mean from the values, so that they all have mean 0.
- scale: divides the values by the standard deviation, so that the data has standard deviation 1.
- range: normalizes the data so that it ranges from 0 to 1.
preProcess(Sacramento, method = "center")
Created from 932 samples and 9 variables

Pre-processing:
  - centered (6)
  - ignored (3)
As we can see, caret has centered the data of 6 variables (the numeric ones), ignoring the other 3. But we see the message, not the data. Why?
The reason is that the preProcess function does not transform the data at that moment; the transformation is meant to be applied later, in the training (or inference) process.
However, we can see how our data looks after applying the transformation. To do this, we pass the result of the preprocessing, together with our data, to the predict function. Let's see how it works.
preprocess = preProcess(Sacramento, method = "center")
predict(preprocess, Sacramento)[1:10,]
As we can see, now caret does return all the data with the processing already applied (in this case, having subtracted the mean). Let’s see, for example, how we would normalize the data.
preprocess = preProcess(Sacramento, method = "range")
Sacramento_processed = predict(preprocess, Sacramento)

cat("--- Raw data ---", "\n",
    "Min:", min(Sacramento$sqft), "\n",
    "Max:", max(Sacramento$sqft), "\n", "\n",
    "--- Processed data ---", "\n",
    "Min:", min(Sacramento_processed$sqft), "\n",
    "Max:", max(Sacramento_processed$sqft)
)
--- Raw data ---
 Min: 484
 Max: 4878

 --- Processed data ---
 Min: 0
 Max: 1
As we can see, we have normalized the data with a single line of code. But this is not all: the preProcess function allows much more, such as imputing missing values. Let's see.
How to impute missing values with caret
To impute missing values with caret, we again use the preProcess function. In this case, there are different values that we can pass to the method parameter:
- knnImpute: uses the kNN algorithm to impute missing values. As you may know, the kNN algorithm requires you to indicate the number of neighbors to use in the prediction, so if we use the knnImpute method we also have to pass the k parameter. Note that knnImpute also centers and scales the data.
- bagImpute: uses bagged decision trees to impute the missing values.
- medianImpute: as its name suggests, imputes the median (for numeric variables). This is usually preferable to imputing the mean, since the mean can be affected by outliers.
Let's see how missing value imputation works with caret in practice. First of all, we are going to "remove" some data from our dataset to simulate missing values:

sacramento_missing = Sacramento
sacramento_missing[c(1, 4, 5), c("city", "beds", "baths", "sqft")] = NA
colSums(is.na(sacramento_missing))

     city       zip      beds     baths      sqft      type     price  latitude longitude
        3         0         3         3         3         0         0         0         0
As we can see, we now have 4 variables with 3 missing values each. Let’s see how each imputation method works:
# Perform the imputation
pre_knn = preProcess(sacramento_missing, method = "knnImpute", k = 2)
pre_bag = preProcess(sacramento_missing, method = "bagImpute")
pre_median = preProcess(sacramento_missing, method = "medianImpute")

# Obtain the imputed data
imputed_knn = predict(pre_knn, sacramento_missing)
imputed_bag = predict(pre_bag, sacramento_missing)
imputed_median = predict(pre_median, sacramento_missing)

# Compare with the real values
print(Sacramento[c(1,4,5), c(1,3,4,5)])
print(imputed_knn[c(1,4,5), c(1,3,4,5)]) # uses normalized data
print(imputed_bag[c(1,4,5), c(1,3,4,5)])
print(imputed_median[c(1,4,5), c(1,3,4,5)])
As we can see, we have been able to carry out the imputation of the missing values in a very simple way. So far we have already seen a lot of things for data preprocessing with caret: variable selection, data transformation, imputation of missing data… But there is still more! With caret you can do cool things like using a PCA for dimensionality reduction. Let’s see how it works!
How to reduce dimensionality
When we work on Machine Learning problems with many variables, we often run into trouble, because the vast majority of models do not work well with many predictor variables and, if they do, they require a lot of data.
In these cases, a good option is usually to apply a dimensionality reduction method, such as principal component analysis or PCA.
Luckily, applying a PCA to our R dataset is very easy thanks to caret. To do this, we simply have to pass the value "pca" to the method parameter of the preProcess function. Likewise, with the thresh parameter we can indicate the percentage of variability that we want to keep.
pre_pca = preProcess(Sacramento, method = "pca", thresh = 0.8)
predict(pre_pca, Sacramento)
As we can see, now the dataset has 6 columns instead of 9. Yes, I know, this is not the best example in which applying a PCA adds a lot of value, but, as we can see, we can do it and in a very simple way thanks to caret.
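If, instead of a variance threshold, you want a fixed number of components, preProcess also accepts the pcaComp parameter (which takes precedence over thresh); a quick sketch:

```r
# Keep exactly two principal components, regardless of variance explained
pre_pca2 = preProcess(Sacramento, method = "pca", pcaComp = 2)
sacramento_pca2 = predict(pre_pca2, Sacramento)

# The numeric columns are replaced by the components PC1 and PC2
head(sacramento_pca2)
```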
With this, we have already seen all the options that the caret library offers for data transformation. But the options go much, much further, especially in modeling. Let’s see what it offers.
How to create machine learning models with caret
Choose the machine learning model to use
When we want to create a Machine Learning model in R, we generally load a library that contains the algorithm that interests us. For example, if we want to use a Random Forest, we will load the randomForest package, while for AdaBoost we will need a different package with its own interface.
And here the first problem arises: each package is different and has its own implementation. Some require you to pass a formula, others want the predictors and the dependent variable separately, some handle dummification and others do not...
In addition, each model has its own hyperparameters, and the way to tune them changes from package to package.
Well, creating machine learning models in R with caret is very simple, since caret unifies the way of creating and optimizing the hyperparameters of 238 different models.
So, if we want to create a machine learning model with caret, the first thing is to find out what that model is called within caret. We can look it up in caret's list of available models. For example, there we will see that the randomForest model from the randomForest library goes by the name rf.
How to partition data in train and test
Once we have chosen our model, we have to split the data into train and test sets. To do this, caret offers a very useful function called createDataPartition.
The function is very simple: we just pass it our dependent variable and the proportion of data that we want in the training set (generally between 0.7 and 0.8 of the total).
With this, the createDataPartition function returns the indices of the observations that should go into the training partition. By default, the result is returned as a list, which I personally don't like. Luckily, we can avoid this with the list = FALSE parameter.
Let's see how to split our data between train and test with caret:

train_index = createDataPartition(Sacramento$price, p = 0.8, list = FALSE)
train = Sacramento[train_index, ]
test = Sacramento[-train_index, ]

cat('Train rows: ', nrow(train), "\n",
    'Test rows: ', nrow(test), sep = "")
Train rows: 747
Test rows: 185
As we can see, we have been able to create the data partition in a super simple way in caret. Having seen this, let’s see how to train a machine learning model in R with caret.
How to train a machine learning model with caret
Once we have chosen the model, we can train it very easily with the train function. We simply pass the independent variables on one side and the dependent variable on the other, and indicate the model in the method parameter.

Sacramento$zip = NULL
Sacramento$city = NULL

indep_var = colnames(Sacramento) != "price"

model_rf = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = 'rf'
)
model_rf
Random Forest

932 samples
  6 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 932, 932, 932, 932, 932, 932, ...
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE
  2     76275.05  0.6534992  54482.84
  4     77632.80  0.6416936  55744.72
  6     78933.34  0.6310063  56864.50

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
As we can see, with the train function we have not only created the model (in this case a Random Forest), but also performed a small tuning of the mtry parameter (the number of variables randomly sampled as split candidates in each tree) and obtained the main performance metrics of the model (RMSE, R² and MAE).
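By default, caret tries just three values of each tunable parameter. Before building a full custom grid, a quick way to widen this automatic search is train's tuneLength parameter; a sketch (reusing the indep_var selector defined above):

```r
# tuneLength asks caret to evaluate this many candidate values
# of each tuning parameter (here, 5 values of mtry instead of 3)
model_rf5 = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = 'rf',
  tuneLength = 5
)
model_rf5
```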
But there is more: when creating our model, we can tell caret to apply one of the transformations we saw previously. For that, we simply pass the preProcess argument to the train function.
For example, suppose we are going to use the kNN algorithm, which requires the data to be normalized. Let's see how we can preprocess the data in the training call itself:
# Keep the numeric variables
num_cols = sapply(Sacramento, is.numeric)
Sacramento_num = Sacramento[num_cols]

# Separate dependent and independent variables
indep_var = colnames(Sacramento_num) != "price"

model_knn = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  preProcess = "range",
  method = 'knn'
)
model_knn
k-Nearest Neighbors

932 samples
  6 predictor

Pre-processing: re-scaling to [0, 1] (6)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 932, 932, 932, 932, 932, 932, ...
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE
  5  39670.84  0.9157169  26896.41
  7  39718.85  0.9183391  27052.33
  9  40274.35  0.9188837  27325.56

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
As we can see, the model has normalized the 6 variables, fit kNN models with different values of k, and decided that the optimal value is k = 5.
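Once a model is trained, caret also lets us inspect which predictors matter most through its varImp function; a quick sketch (using the model_knn object from above — for models without a native importance measure, caret falls back to a filter-based importance):

```r
# Variable importance, scaled from 0 to 100 by default
importance = varImp(model_knn)
print(importance)

# caret also provides a plot method for the result
plot(importance)
```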
And, best of all, there is still more: caret makes tuning a model much easier using grid search. Let’s see how.
How to optimize the hyperparameters of a model with caret
As we have seen, when we make a model in caret, it directly applies a default tuning. However, we may be interested in controlling what values these hyperparameters take. Well, doing this with caret is very simple.
In order to test different parameters, we must first create our own Grid Search. When we do an optimization by Grid Search we basically test all the possible combinations of all the hyperparameters that we indicate.
For example, suppose we want to fit a Lasso regression and tune the lambda parameter, which indicates the strength of the penalty.
Important: the parameters that we can tune for each model appear in the list of available models.
To build the grid, we simply pass every value of each parameter that we want to test to the expand.grid function. Important: if there are parameters for which we only want a single value, we still have to include them. Let's see how it's done:
tunegrid = expand.grid(
  alpha = 1,  # alpha = 1 corresponds to Lasso in glmnet
  lambda = c(0, 1, 100, 200, 500, 1000, 2000, 5000, 10000, 50000)
)

model_lasso = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = "glmnet",
  family = "gaussian",
  tuneGrid = tunegrid
)
model_lasso
glmnet

932 samples
  3 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 932, 932, 932, 932, 932, 932, ...
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE
      0   83549.32  0.5893103  61258.44
      1   83549.32  0.5893103  61258.44
    100   83549.32  0.5893103  61258.44
    200   83549.08  0.5893124  61258.51
    500   83524.09  0.5894426  61275.17
   1000   83508.27  0.5894367  61319.27
   2000   83566.59  0.5887109  61473.51
   5000   84354.89  0.5812886  62298.71
  10000   85149.90  0.5764404  63031.99
  50000   97271.45  0.5764554  72190.38

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 1000.
As you can see, performing a Grid Search with caret is very simple. And yes, although this is already a lot, there is still more: caret also lets you perform cross-validation. Let's see how to do it.
How to do Cross Validation with caret
To perform cross-validation in R with caret, we simply call the trainControl function and pass its result to our train call.
Within the trainControl function we can configure many of the things that interest us, such as the resampling method to use or how many times to repeat it.
The most typical choice is to set the method to repeatedcv, which performs repeated cross-validation, although we can also bootstrap by setting the value to boot.
Also, if our data is imbalanced, we can rebalance it in different ways using the sampling parameter. The sampling types it allows are down for downsampling, up for upsampling, and the specific sampling algorithms smote and rose.
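For a classification problem with imbalanced classes, the control object might look like this; a sketch (the specific resampling settings are illustrative):

```r
# 5-fold CV, downsampling the majority class within each resample
ctrl_down = trainControl(
  method = "cv",
  number = 5,
  sampling = "down"
)

# ctrl_down would then be passed to train() via the trControl argument
```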
Let’s see how it works by applying it to the Lasso regression example we created previously:
fitControl = trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10
)

cv_model_lasso = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = 'glmnet',
  family = 'gaussian',
  tuneGrid = tunegrid,
  trControl = fitControl
)
cv_model_lasso
glmnet

932 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 839, 840, 838, 839, 839, 839, ...
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE
      0   82431.41  0.6050358  60544.17
      1   82431.41  0.6050358  60544.17
    100   82431.41  0.6050358  60544.17
    200   82431.36  0.6050361  60544.15
    500   82428.08  0.6050931  60566.41
   1000   82445.83  0.6050029  60618.18
   2000   82539.04  0.6043956  60790.61
   5000   83361.05  0.5979379  61635.24
  10000   84335.30  0.5926398  62447.44
  50000   97474.98  0.5926398  72309.98

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 500.
As we can see, each model has now been evaluated with 10-fold cross-validation, repeated 10 times. And, of course, the model errors for the different lambda values have changed.
Finally, I would like to comment on an important question for the training of our machine learning models in R with caret: parallel training.
How to train Machine Learning models in R in parallel
When we create models, they can take time to execute, especially if we carry out a very extensive Grid Search (besides, it must be said, caret is not very fast).
Luckily, caret offers us the option of parallelizing the models, in such a way that we can make many more models in less time.
To check this, let’s see how long it takes to create a Lasso regression with many hyperparameters if we do not parallelize the model:
tic = Sys.time()

tunegrid = expand.grid(
  alpha = seq(0, 1, 0.1),
  lambda = c(0, 1, 100, 200, 500, 1000, 2000, 5000, 10000, 50000)
)

fitControl = trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10
)

cv_model_lasso = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = 'glmnet',
  family = 'gaussian',
  tuneGrid = tunegrid,
  trControl = fitControl
)

toc = Sys.time()
cat("Total time:", toc - tic)
Total time: 20.35262
As we can see, it took a little over 20 seconds to complete the entire process. But what if we parallelize it?
Parallelizing a model in R with caret is very simple: you just have to create a cluster with the doParallel library and stop it once training is done.
The cluster can be created as follows:

library(doParallel)
cl = makePSOCKcluster(5)
registerDoParallel(cl)
Now that we have created the cluster, we can run the same code as before, which will now be parallelized automatically.
tic = Sys.time()

tunegrid = expand.grid(
  alpha = seq(0, 1, 0.1),
  lambda = c(0, 1, 100, 200, 500, 1000, 2000, 5000, 10000, 50000)
)

fitControl = trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10
)

cv_model_lasso_par = train(
  x = Sacramento[indep_var],
  y = Sacramento$price,
  method = 'glmnet',
  family = 'gaussian',
  tuneGrid = tunegrid,
  trControl = fitControl
)

toc = Sys.time()
cat("Total time:", toc - tic)
Total time: 9.741221
As we can see, the model now took only about 9 seconds to train, that is, less than half the time of the non-parallel run. All with 2 very simple lines of code. And note: this applies to all 238 models that caret includes.
Finally, we have to stop the cluster, which we can do with the following function:

stopCluster(cl)
As you can see, caret offers very interesting advantages. Finally, we come to the final stretch of this post, where we will see how to make predictions with caret, as well as how to evaluate the performance of an ML model. Let's go!
How to make predictions and measure predictive capacity of the model with caret in R
To make predictions, we pass the new data and our model to the predict function, just like with any other model in R. For example, predicting the price of the first 100 houses (here using the cross-validated Lasso model from before):

pred = predict(cv_model_lasso, Sacramento[1:100, ])
head(pred)

       1        2        3        4        5        6
141811.1 168743.7 135557.5 144312.5 135713.9 161708.4
Likewise, caret also offers functions to measure the predictive capacity of a model, depending on the type of the target variable. For numeric targets, we can use the RMSE function or the defaultSummary function, which returns the main metrics (RMSE, R² and MAE).
I personally tend to prefer the RMSE function, basically because RMSE is the metric I generally use to measure predictive capacity, and because it is easier to call than defaultSummary, which requires you to build a data frame first. Let's see how they work:
print("Use of defaultSummary")
defaultSummary(data = data.frame(obs = Sacramento$price[1:100], pred = pred))

print("Use of RMSE")
RMSE(pred, Sacramento$price[1:100])
[1] "Use of defaultSummary"
        RMSE     Rsquared          MAE
5.615570e+04 3.388848e-01 4.676571e+04

[1] "Use of RMSE"
[1] 56155.7
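A middle ground between the two, if you don't want to build a data frame, is caret's postResample function, which also computes RMSE, R² and MAE from two vectors; a sketch (reusing the pred vector from above):

```r
# postResample takes the predictions and the observed values directly
# and returns a named vector with RMSE, Rsquared and MAE
postResample(pred, Sacramento$price[1:100])
```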
Likewise, for categorical targets caret offers the confusionMatrix function, which computes the confusion matrix along with all the metrics associated with it.
pred_fake = factor(round(runif(100)))
real_fake = factor(round(runif(100)))
confusionMatrix(pred_fake, real_fake)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 25 28
         1 21 26

               Accuracy : 0.51
                 95% CI : (0.408, 0.6114)
    No Information Rate : 0.54
    P-Value [Acc > NIR] : 0.7591

                  Kappa : 0.0247
 Mcnemar's Test P-Value : 0.3914

            Sensitivity : 0.5435
            Specificity : 0.4815
         Pos Pred Value : 0.4717
         Neg Pred Value : 0.5532
             Prevalence : 0.4600
         Detection Rate : 0.2500
   Detection Prevalence : 0.5300
      Balanced Accuracy : 0.5125

       'Positive' Class : 0
Although it is not a real case, we see that caret offers a lot of information with just one line of code.
In short, if you are going to do machine learning with R, caret is a package you should know. Not only does it unify many models under the same interface, but it also standardizes very useful things such as hyperparameter optimization and cross-validation, and it lets you train all these models in a very simple way.
As if that were not enough, it includes several functions with which, very simply, we can evaluate how good our model is.
In short, caret is a very good package and I hope this post has shown you all the potential it has. See you in the next post!