A lot of attention is usually paid to **machine learning, but very little to data preparation** (skewness removal, missing-value imputation, feature engineering).

This project aims to **show how much properly preparing the data matters**, regardless of the model used:

- Machine Learning
- Black boxes (SVM, NN, XGBoost…)
- White boxes (Elastic Net, Lasso, decision trees…)

- Data preparation
- Feature Engineering

To that end, working from a Kaggle competition, **I applied 10 different machine learning models** (SVM, XGBoost, Lasso, etc.), **each to two datasets:** one with thorough data preparation and one without.


This project aims to **emphasize the importance** of **good missing-value handling, feature generation and feature selection**. To do so, I will predict the price of houses in Iowa using several different machine learning algorithms.

The idea is that by the end of this kernel you will have a good grasp of how important good data preprocessing is and how it can affect the outcome of your model.

The models we will compare are the following:

- Linear model
- Lasso Regression
- Ridge Regression
- Elastic-Net Regression
- Decision Trees
- Random Forest
- k-NN
- XGBoost
- Support Vector Machines
- Neural Network

As the aim is to show the importance of data preprocessing, we will require two datasets:

- **Dataset with raw data**. In this case we will do minimal processing (such as a basic NA imputation). This will be Dataset 1.
- **Dataset with *cleaned* data**. In this case we will undertake a more thorough process: NA imputation, data preprocessing, feature engineering and so on. This will be Dataset 2.

So let’s get started.

These are the libraries we will use.

```
library(plyr)
library(dplyr)
library(randomForest)
library(ggplot2)
library(psych)
library(caret)
library(xgboost)
library(rpart)
library(neuralnet)
library(nnet)
library(tidyr)
library(forcats)
cat("\014")
```

Now I will read both files.

```
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)
```

We will delete the Id variable for both train and test.

```
index_train <- train$Id
train$Id <- NULL
test$Id <- NULL
```

In order to merge the two datasets, I will first create the SalePrice variable in the test set. Then I will bind them together.

```
test$SalePrice <- NA
total <- rbind(train, test)
```

Let’s see how complete the data is.

```
withNA <- which(colSums(is.na(total)) > 0)
sort(colSums(sapply(total[withNA], is.na)), decreasing = TRUE)
```

```
## PoolQC MiscFeature Alley Fence SalePrice
## 2909 2814 2721 2348 1459
## FireplaceQu LotFrontage GarageYrBlt GarageFinish GarageQual
## 1420 486 159 159 159
## GarageCond GarageType BsmtCond BsmtExposure BsmtQual
## 159 157 82 82 81
## BsmtFinType2 BsmtFinType1 MasVnrType MasVnrArea MSZoning
## 80 79 24 23 4
## Utilities BsmtFullBath BsmtHalfBath Functional Exterior1st
## 2 2 2 2 1
## Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
```

If you check the data description provided in the competition, you will see that for many of these variables an NA actually means that the house does not have that feature. Therefore, to be fair, we will recode those NAs as a “None” category in both datasets.

By reading the data description you will see that the variables whose NAs should be “None” are the following: PoolQC, MiscFeature, Alley, Fence, FireplaceQu, GarageFinish, GarageQual, GarageCond, GarageType, BsmtCond, BsmtExposure, BsmtQual, BsmtFinType1 and BsmtFinType2.

We will coerce these variables’ NAs into “None”.

```
change <- c("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageFinish", "GarageQual", "GarageCond", "GarageType", "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType1", "BsmtFinType2")
# Replace the NAs of every listed variable with "None"
for (col in change) {
  total[[col]][is.na(total[[col]])] <- "None"
}
```

If we now recount the NAs in each column, we will see that they have significantly decreased.

```
withNA <- which(colSums(is.na(total)) > 0)
sort(colSums(sapply(total[withNA], is.na)), decreasing = TRUE)
```

```
## SalePrice LotFrontage GarageYrBlt MasVnrType MasVnrArea
## 1459 486 159 24 23
## MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 1 1 1 1 1
## TotalBsmtSF Electrical KitchenQual GarageCars GarageArea
## 1 1 1 1 1
## SaleType
## 1
```

Now we will create two different datasets. In each we will undertake a different value imputation procedure, as stated in the introduction.

To be clear, this is not the way we should work. I am doing it only to show how differently models perform when the data is clean compared to when it is not. Cleaning is tedious, but it is the way to go.

That being said, we will coerce the NAs into the mode of each variable. We use the mode instead of the mean mainly because, if there were an outlier, the mean would be pulled toward that extreme value. The mode, on the other hand, would barely change.

We will exclude “SalePrice” from the imputation for an obvious reason: it is the variable we want to predict.

Besides, as base R does not include a function to calculate the mode, we will create it ourselves. In my case, I got it from Stack Overflow. Thanks, Ken Williams, for the contribution.

```
# Creating the Mode function
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
```
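
As a quick illustration of why the mode is the safer choice here, compare it with the mean on a toy vector containing one extreme value (the function is repeated so the snippet stands alone):

```r
# Mode shrugs off a single outlier; the mean does not
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
x <- c(10, 10, 10, 12, 500)
Mode(x)  # 10
mean(x)  # 108.4
```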

```
# Applying it to the variables (Dataset 1)
total_noclean <- total
for (col in names(withNA)) {
  if (col != "SalePrice") {
    total_noclean[[col]][is.na(total_noclean[[col]])] <- Mode(na.omit(total_noclean[[col]]))
  }
}
```

Now we should have no more variables to impute. We rerun the code just to make sure.

```
withNA2 <- which(colSums(is.na(total_noclean)) > 0)
sort(colSums(sapply(total_noclean[withNA2], is.na)), decreasing = TRUE)
```

```
## SalePrice
## 1459
```

As you can see, we are done here. Simple, isn’t it? Of course it is; that is also why, most of the time, it won’t work well ;)

First, we must be aware that in the previous step of coercing NAs into “None” we might have missed some interesting things. **Just by looking at the NA counts we got when we first checked the completeness of the data, you can spot some odd things**:

- There were 157 NAs in “GarageType” but 159 in the rest of the Garage variables. Maybe there are 2 observations whose data was entered incorrectly.
- BsmtFinType2 had 80 NAs, while BsmtQual had 81 and the rest of the Bsmt variables 82.

Besides, we will also have to “predict”, to the extent possible, the value of the missing data by means other than the mode.
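From here on, the code works on a second copy of the merged data, `total_clean` (Dataset 2). The kernel does not show its initialization at this point; presumably it is created as a plain copy, along the lines of:

```r
# Dataset 2: the "cleaned" dataset starts as a copy of the merged data
# (which already has the "None" recoding applied)
total_clean <- total
```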

As mentioned before, GarageType had two fewer NAs than the rest of the Garage variables. We will check those two observations:

`total_clean[total_clean$GarageFinish == "None" & total_clean$GarageType != "None" ,c ("GarageFinish","GarageQual","GarageCond","GarageType")]`

```
## GarageFinish GarageQual GarageCond GarageType
## 2127 None None None Detchd
## 2577 None None None Detchd
```

As we suspected, these two observations’ GarageType should also be considered “None”.

`total_clean[total_clean$GarageFinish == "None" & total_clean$GarageType != "None" ,"GarageType"] <- "None"`

Other variables concerning the garage are GarageCars, GarageArea and GarageYrBlt. We will check whether any observation marked “None” in the other Garage variables nevertheless has non-zero values in these.

```
total_clean %>%
filter(GarageType == "None") %>%
filter(GarageCars!=0, GarageArea!=0) %>%
filter(!is.na(GarageCars) |!is.na(GarageArea) | !is.na(GarageYrBlt)) %>%
select(GarageCars,GarageArea,GarageYrBlt, GarageType,YearBuilt)
```

```
## GarageCars GarageArea GarageYrBlt GarageType YearBuilt
## 1 1 360 NA None 1910
```

It is clear that in this case GarageType should not be “None”. We will impute the mode of GarageType instead.

`Mode(total_clean$GarageType)`

`## [1] "Attchd"`

`total_clean[total_clean$GarageType == "None" & total_clean$GarageCars == 1 & total_clean$GarageArea == 360 & total_clean$YearBuilt == 1910, "GarageType"] <- Mode(total_clean$GarageType) `

Finally, we will have to impute “GarageYrBlt”. We will first check that there is no observation with an NA in this variable but valid values in the rest of the Garage variables.

```
total_clean %>%
filter(is.na(GarageYrBlt) == TRUE ) %>%
filter(GarageArea != "None", GarageCars != 0, GarageFinish != "None", GarageQual != "None", GarageArea != 0)
```

```
## [1] MSSubClass MSZoning LotFrontage LotArea Street
## [6] Alley LotShape LandContour Utilities LotConfig
## [11] LandSlope Neighborhood Condition1 Condition2 BldgType
## [16] HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## [21] RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## [26] MasVnrArea ExterQual ExterCond Foundation BsmtQual
## [31] BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## [36] BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## [41] CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## [46] GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## [51] BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## [56] Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## [61] GarageCars GarageArea GarageQual GarageCond PavedDrive
## [66] WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## [71] PoolArea PoolQC Fence MiscFeature MiscVal
## [76] MoSold YrSold SaleType SaleCondition SalePrice
## <0 rows> (or 0-length row.names)
```

Now, what do we do with the Garage Year Built? We will impute the year the house was built, unless the house has been remodeled, in which case we will impute the remodeling year.

It makes sense that the garage and the house were both built in the same year and that if you remodel the house it is highly likely you will also remodel the garage.

```
for (i in 1:nrow(total_clean)) {
  if (is.na(total_clean$GarageYrBlt[i])) {
    total_clean$GarageYrBlt[i] <- ifelse(!is.na(total_clean$YearRemodAdd[i]),
                                         total_clean$YearRemodAdd[i],
                                         total_clean$YearBuilt[i])
  }
}
```
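
The same imputation can also be written without an explicit loop. A vectorized sketch, shown here on a toy data frame with the same column names as the kernel’s:

```r
# Toy data standing in for total_clean
total_clean <- data.frame(GarageYrBlt  = c(NA, 1990, NA),
                          YearRemodAdd = c(2005, 1991, NA),
                          YearBuilt    = c(2000, 1990, 1950))
# Fill missing garage years from the remodel year, falling back to the build year
na_idx <- is.na(total_clean$GarageYrBlt)
fallback <- ifelse(is.na(total_clean$YearRemodAdd), total_clean$YearBuilt, total_clean$YearRemodAdd)
total_clean$GarageYrBlt[na_idx] <- fallback[na_idx]
total_clean$GarageYrBlt  # 2005 1990 1950
```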

Finally, we will have to deal with the observations that still have missing data in GarageCars and GarageArea. We will check their values:

```
total_clean %>%
filter(is.na(GarageCars) | is.na(GarageArea)) %>%
select("GarageCars","GarageArea","GarageFinish","GarageQual","GarageCond","GarageType")
```

```
## GarageCars GarageArea GarageFinish GarageQual GarageCond GarageType
## 1 NA NA None None None None
```

As we can see, both missing values belong to the same observation. As all the other variables suggest that this house does not have a garage, we will impute a 0 in both variables.

```
total_clean$GarageCars[is.na(total_clean$GarageCars)] <- 0
total_clean$GarageArea[is.na(total_clean$GarageArea)] <- 0
```

Now we will do the same with the Basement variables. In this case, there might be some observations that we can recover.

```
total_clean %>%
select(BsmtCond,BsmtExposure,BsmtQual,BsmtFinType2,BsmtFinType1) %>%
filter(BsmtFinType1 != "None") %>%
filter(BsmtCond == "None" | BsmtExposure == "None" | BsmtQual == "None" | BsmtFinType2 == "None" )
```

```
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1
## 1 TA No Gd None GLQ
## 2 TA None Gd Unf Unf
## 3 TA None Gd Unf Unf
## 4 None Mn Gd Rec GLQ
## 5 None No TA Unf BLQ
## 6 Fa No None Unf Unf
## 7 TA No None Unf Unf
## 8 TA None Gd Unf Unf
## 9 None Av TA Unf ALQ
```

As you can see, there are 9 observations where we have imputed “None” when we actually shouldn’t have. In these cases, we will impute the corresponding mode.

```
total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtCond == "None", "BsmtCond"] <- Mode(total_clean$BsmtCond)
total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtExposure == "None", "BsmtExposure"] <- Mode(total_clean$BsmtExposure)
total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtQual == "None", "BsmtQual"] <- Mode(total_clean$BsmtQual)
total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtFinType2 == "None", "BsmtFinType2"] <- Mode(total_clean$BsmtFinType2)
```

Now we will handle other Basement variables that have missing data.

`total_clean[(is.na(total_clean$BsmtFullBath)|is.na(total_clean$BsmtHalfBath)|is.na(total_clean$BsmtFinSF1)|is.na(total_clean$BsmtFinSF2)|is.na(total_clean$BsmtUnfSF)|is.na(total_clean$TotalBsmtSF)), c("BsmtQual", "BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF")]`

```
## BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121 None NA NA NA NA NA
## 2189 None NA NA 0 0 0
## TotalBsmtSF
## 2121 NA
## 2189 0
```

As you can see, all these variables’ NAs correspond to two observations, and neither of them has a basement. Thus, these values should be 0.

```
total_clean[2121, c("BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF")] <- 0
total_clean[2189, c("BsmtFullBath", "BsmtHalfBath")] <- 0
```

Once again, there is a mismatch: 24 missing values in “MasVnrType” but just 23 in “MasVnrArea”. We will try to figure out that extra missing value.

`total_clean[is.na(total_clean$MasVnrType) & !is.na(total_clean$MasVnrArea), c("MasVnrType","MasVnrArea")]`

```
## MasVnrType MasVnrArea
## 2611 <NA> 198
```

As we don’t have enough information to impute a precise value, we will just impute the mode.

`total_clean[is.na(total_clean$MasVnrType) & !is.na(total_clean$MasVnrArea), "MasVnrType"] <- Mode(total_clean$MasVnrType[total_clean$MasVnrType!="None"])`

Finally, for the remaining variables we will impute “None” in the case of “MasVnrType” and 0 in the case of “MasVnrArea”.

```
total_clean$MasVnrType[is.na(total_clean$MasVnrType)] <- "None"
total_clean$MasVnrArea[is.na(total_clean$MasVnrArea)] <- 0
```

Even though we have already imputed the missing values of the Pool-related variables, since there is more than one such variable we can cross-check the internal coherence of their values.

`total_clean[total_clean$PoolQC == "None" & total_clean$PoolArea>0, c("PoolQC","PoolArea","OverallQual")]`

```
## PoolQC PoolArea OverallQual
## 2421 None 368 4
## 2504 None 444 6
## 2600 None 561 3
```

As we can see, there are 3 values that have been incorrectly imputed, so we will fix them. We will check whether the overall quality of the house is a good proxy variable for the pool quality.

`table(total_clean$PoolQC,total_clean$OverallQual)`

```
##
## 1 2 3 4 5 6 7 8 9 10
## Ex 0 0 0 0 0 0 0 2 0 2
## Fa 0 0 0 0 0 1 1 0 0 0
## Gd 0 0 0 0 0 2 1 0 0 1
## None 4 13 40 226 825 728 598 340 107 28
```

`chisq.test(total_clean$PoolQC , total_clean$OverallQual)`

```
## Warning in chisq.test(total_clean$PoolQC, total_clean$OverallQual): Chi-
## squared approximation may be incorrect
```

```
##
## Pearson's Chi-squared test
##
## data: total_clean$PoolQC and total_clean$OverallQual
## X-squared = 126.16, df = 27, p-value = 9.196e-15
```

As we have no better proxy, **we will use the overall quality**. We have to take into account that OverallQual goes from 1 to 10, while PoolQC has just 5 categories. So we will divide the overall quality by 2, round it up, and assign the corresponding level.

```
total_clean$PoolQC[2421] <- "Fa"
total_clean$PoolQC[2504] <- "TA"
total_clean$PoolQC[2600] <- "Fa"
```
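
The hard-coded values above follow directly from the halving rule. A small sketch of that mapping (the level names are taken from the PoolQC encoding in the data description):

```r
# Map OverallQual (1-10) onto the 5-level quality scale by halving and rounding up
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")
overall <- c(4, 6, 3)                 # OverallQual of rows 2421, 2504, 2600
quality_levels[ceiling(overall / 2)]  # "Fa" "TA" "Fa"
```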

There are three Lot variables: “LotConfig”, “LotShape” and “LotFrontage”. The first two do not have a “None” level among their categories, so this time they give us no way to estimate LotFrontage.

According to the variable descriptions, LotFrontage is the “Linear feet of street connected to property”. Thus it makes sense that it varies by Neighborhood (more residential neighborhoods should have larger lot frontages). We will check whether this is right with a simple ANOVA.

`summary(aov(LotFrontage ~ Neighborhood, total_clean))`

```
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 24 354702 14779 36.66 <2e-16 ***
## Residuals 2408 970700 403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 486 observations deleted due to missingness
```

`ggplot(total_clean, aes(Neighborhood, LotFrontage)) + geom_bar(stat = "summary", fun.y = "median") + coord_flip() + labs(title="LotFrontage median by neighborhood", x="")`

`## Warning: Removed 486 rows containing non-finite values (stat_summary).`

As there is indeed significant variation across neighborhoods, we will impute the neighborhood median to the observations with an NA in LotFrontage.

```
for (i in 1:nrow(total_clean)) {
  if (is.na(total_clean$LotFrontage[i])) {
    total_clean$LotFrontage[i] <- median(total_clean$LotFrontage[total_clean$Neighborhood == total_clean$Neighborhood[i]], na.rm = TRUE)
  }
}
```
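
Base R’s `ave()` can express the same neighborhood-median imputation without an explicit loop. A sketch on toy data (the column names mirror the kernel’s):

```r
# Toy data: one missing LotFrontage in each neighborhood group
total_clean <- data.frame(LotFrontage  = c(60, NA, 80, 100, NA),
                          Neighborhood = c("A", "A", "A", "B", "B"))
# Within each neighborhood, replace NAs with that group's median
total_clean$LotFrontage <- ave(total_clean$LotFrontage, total_clean$Neighborhood,
                               FUN = function(x) replace(x, is.na(x), median(x, na.rm = TRUE)))
total_clean$LotFrontage  # 60 70 80 100 100
```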

According to the variable description, MSZoning refers to “the general zoning classification of the sale”. Thus, the zoning should be highly correlated with the Neighborhood. We will check that.

`chisq.test(total_clean$Neighborhood,total_clean$MSZoning)`

```
## Warning in chisq.test(total_clean$Neighborhood, total_clean$MSZoning): Chi-
## squared approximation may be incorrect
```

```
##
## Pearson's Chi-squared test
##
## data: total_clean$Neighborhood and total_clean$MSZoning
## X-squared = 5011.1, df = 96, p-value < 2.2e-16
```

`table(total_clean$Neighborhood,total_clean$MSZoning)`

```
##
## C (all) FV RH RL RM
## Blmngtn 0 0 0 25 3
## Blueste 0 0 0 0 10
## BrDale 0 0 0 0 30
## BrkSide 0 0 0 43 65
## ClearCr 0 0 0 44 0
## CollgCr 0 0 0 253 14
## Crawfor 0 0 2 91 10
## Edwards 0 0 2 180 12
## Gilbert 0 0 0 165 0
## IDOTRR 22 0 0 0 68
## MeadowV 0 0 0 0 37
## Mitchel 0 0 0 104 9
## NAmes 0 0 7 436 0
## NoRidge 0 0 0 71 0
## NPkVill 0 0 0 23 0
## NridgHt 0 0 0 165 1
## NWAmes 0 0 0 131 0
## OldTown 2 0 0 39 198
## Sawyer 0 0 0 148 3
## SawyerW 0 0 6 119 0
## Somerst 0 139 0 43 0
## StoneBr 0 0 0 51 0
## SWISU 1 0 9 38 0
## Timber 0 0 0 72 0
## Veenker 0 0 0 24 0
```

As you can see, the supposition was right. We will therefore impute MSZoning according to the neighborhood. To do so, we first check which neighborhoods the missing observations belong to.

`total_clean[is.na(total_clean$MSZoning), c("MSZoning", "Neighborhood")]`

```
## MSZoning Neighborhood
## 1916 <NA> IDOTRR
## 2217 <NA> IDOTRR
## 2251 <NA> IDOTRR
## 2905 <NA> Mitchel
```

Now we simply impute the values by looking at the table above.

```
total_clean$MSZoning[c(1916,2217,2251)] <- "RM"
total_clean$MSZoning[2905] <- "RL"
```

There are four variables regarding the exterior: “Exterior1st”, “Exterior2nd”, “ExterCond” and “ExterQual”. The first two refer to the material covering the house and the other two to how well it is preserved. Let’s check those variables to see if we have any clue.

`total_clean[is.na(total_clean$Exterior1st)|is.na(total_clean$Exterior2nd), c("Exterior1st","Exterior2nd","ExterCond","ExterQual","Neighborhood")]`

```
## Exterior1st Exterior2nd ExterCond ExterQual Neighborhood
## 2152 <NA> <NA> TA TA Edwards
```

As we can see, both NAs belong to the same observation, and both condition variables are “TA”. I suspect that the condition of the exterior might be correlated with the material, as some materials are better than others and thus stay in better condition.

Besides, we will check whether both materials are usually the same, so that we only have to analyze one variable.

`table(total_clean$Exterior1st, total_clean$Exterior2nd)[1:10,1:10]`

```
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc
## AsbShng 35 0 0 0 0 1 0 0
## AsphShn 0 2 0 0 0 0 0 0
## BrkComm 0 0 4 0 0 0 0 0
## BrkFace 1 0 0 44 0 0 3 0
## CBlock 0 0 0 0 1 0 0 0
## CemntBd 0 0 0 0 0 124 0 0
## HdBoard 0 1 0 1 0 0 383 6
## ImStucc 0 0 0 0 0 0 0 1
## MetalSd 0 1 0 0 1 0 3 0
## Plywood 0 0 18 0 1 0 6 4
##
## MetalSd Other
## AsbShng 0 0
## AsphShn 0 0
## BrkComm 0 0
## BrkFace 3 0
## CBlock 0 0
## CemntBd 0 0
## HdBoard 1 0
## ImStucc 0 0
## MetalSd 437 0
## Plywood 0 0
```

As we can see, in most cases both materials are the same. Now, let’s guess the material by considering the Neighborhood.

`table(total_clean$Exterior1st[total_clean$Neighborhood == "Edwards"], total_clean$ExterQual[total_clean$Neighborhood == "Edwards"])`

```
##
## Ex Fa Gd TA
## AsbShng 0 0 0 7
## AsphShn 0 0 0 1
## BrkComm 0 0 0 2
## BrkFace 0 0 0 4
## CemntBd 3 0 0 0
## HdBoard 0 0 4 17
## MetalSd 0 1 0 36
## Plywood 0 0 1 15
## Stucco 1 0 0 5
## VinylSd 0 0 9 29
## Wd Sdng 0 1 1 43
## WdShing 0 0 0 13
```

As we can see, most of the houses in that neighborhood with a “TA” exterior quality have “Wd Sdng” as the exterior material. Thus, we will impute this value.

`total_clean[is.na(total_clean$Exterior1st)|is.na(total_clean$Exterior2nd), c("Exterior1st","Exterior2nd")] <- "Wd Sdng"`

**Utilities** Let’s see which is the most common type of utility.

`summary(as.factor(total_clean$Utilities))`

```
## AllPub NoSeWa NA's
## 2916 1 2
```

As we can see, virtually all observations have the same value. This variable is therefore uninformative and we should drop it. As this is something we haven’t done in Dataset 1 (see the other tab), I won’t change it there.

`total_clean$Utilities <- NULL`

**Functional** The variable description says “Home functionality (Assume typical unless deductions are warranted)”. Thus, we will consider the two NAs as typical.

`total_clean$Functional[is.na(total_clean$Functional)] <- "Typ"`

**Electrical** It refers to the electrical system of the house. I suspect that most houses will have similar electrical systems.

`summary(as.factor(total_clean$Electrical))`

```
## FuseA FuseF FuseP Mix SBrkr NA's
## 188 50 8 1 2671 1
```

As the vast majority of houses have the same electrical system, we will impute that value (the mode).

`total_clean$Electrical[is.na(total_clean$Electrical)] <- "SBrkr"`

**Kitchen Quality** As we have done before, we will use the overall quality of the house as a proxy variable for the quality of the kitchen.

`total_clean[is.na(total_clean$KitchenQual), c("OverallQual")]`

`## [1] 5`

`total_clean$KitchenQual[is.na(total_clean$KitchenQual)] <- "TA"`

**Sale Type** As we have no good way to evaluate the type of sale, we will simply impute the mode of the variable.

`total_clean$SaleType[is.na(total_clean$SaleType)] <- Mode(total_clean$SaleType)`

Finally, we will check that everything is OK.

```
withNA3 <- which(colSums(is.na(total_clean)) > 0)
sort(colSums(sapply(total_clean[withNA3], is.na)), decreasing = TRUE)
```

```
## SalePrice
## 1459
```

By looking at the variable descriptions you will see that most categorical variables share a very similar encoding: Ex, Gd, TA, Fa, Po, NA (now None). As they all have the same levels, I will call them “standardized”. The variables with this encoding are ordinal. However, as most ML algorithms work with numerical variables, we will coerce them into integers (from 0 for None to 5 for Ex).

In this sense, we distinguish three types of character variables:

- Variables with standardized ordinal values
- Non-standardized ordinal variables
- Nominal variables

We will split the relabeling into these three categories:

These refer to the condition of features that not all houses have, such as the pool, as well as of features that all houses have, such as the kitchen. By checking the variable descriptions you get that these variables are the following: BsmtQual, BsmtCond, FireplaceQu, GarageQual, GarageCond, PoolQC, ExterQual, ExterCond, HeatingQC, KitchenQual.

For both datasets we will relabel these variables, considering “None” as 0 and “Ex” as 5. To save some space I will show the code only once.

```
relabel <- c("BsmtQual", "BsmtCond", "FireplaceQu", "GarageQual", "GarageCond", "PoolQC", "ExterQual", "ExterCond", "HeatingQC", "KitchenQual")
quality <- c("None" = 0, "Po" = 1, "Fa" = 2, "TA" = 3, "Gd" = 4, "Ex" = 5)
for (col in relabel) {
  total_clean[[col]] <- as.integer(revalue(total_clean[[col]], quality))
}
```

We will check whether the process has been correctly carried out by inspecting the levels of one of the variables we have changed.

`levels(as.factor(total_clean$BsmtQual))`

`## [1] "0" "2" "3" "4" "5"`

In this case, as they don’t follow the same pattern, we will relabel them one by one. The variables to be revalued are found by looking at the variables’ descriptions. Again, we carry out the process for both datasets, but I will only show one.

```
#Street
total_noclean$Street <- as.integer(revalue(total_noclean$Street,c("Pave" = 0,"Grvl" = 1)))
#Alley
total_noclean$Alley <- as.integer(revalue(total_noclean$Alley,c("None" = 0,"Pave" = 1,"Grvl" = 2)))
#LotShape
total_noclean$LotShape <- as.integer(revalue(total_noclean$LotShape,c("IR3" = 0,"IR2" = 1,"IR1" = 2,"Reg" = 3)))
#BsmtExposure
total_noclean$BsmtExposure <- as.integer(revalue(total_noclean$BsmtExposure,c("None" = 0, "No" = 1, "Mn" = 2, "Av" = 3, "Gd" = 4)))
#BsmtFinType1
total_noclean$BsmtFinType1 <- as.integer(revalue(total_noclean$BsmtFinType1,c("None" = 0, "Unf" = 1, "LwQ" = 2, "Rec" = 3, "BLQ" = 4, "ALQ" = 5, "GLQ" = 6)))
#BsmtFinType2
total_noclean$BsmtFinType2 <- as.integer(revalue(total_noclean$BsmtFinType2,c("None" = 0, "Unf" = 1, "LwQ" = 2, "Rec" = 3, "BLQ" = 4, "ALQ" = 5, "GLQ" = 6)))
#CentralAir
total_noclean$CentralAir <- as.integer(revalue(total_noclean$CentralAir,c("N" = 0, "Y" = 1)))
#Functional
total_noclean$Functional <- as.integer(revalue(total_noclean$Functional,c("Sal" = 0, "Sev" = 1, "Maj2" = 2, "Maj1" = 3, "Mod" = 4, "Min2" = 5, "Min1" = 6, "Typ" = 7)))
```

`## The following `from` values were not present in `x`: Sal`

```
#GarageFinish
total_noclean$GarageFinish <- as.integer(revalue(total_noclean$GarageFinish,c("None"= 0, "Unf" = 1, "RFn" = 2, "Fin" = 3)))
#PavedDrive
total_noclean$PavedDrive <- as.integer(revalue(total_noclean$PavedDrive,c("N"= 0, "P" = 1, "Y" = 2)))
```


As there should no longer be any character variables that are actually ordinal, we will convert all remaining character variables into factors.

```
# Index of non-numeric variables
character <- which(sapply(total_clean, is.character))
total_clean[character] <- lapply(total_clean[character], as.factor)
total_clean$SalePrice <- as.numeric(total_clean$SalePrice)
```

To ensure everything is OK, we will check whether any character variable remains.

`which(sapply(total_clean, is.character))`

`## named integer(0)`

We can finally analyze the structure of the data.

`str(total_clean)`

```
## 'data.frame': 2919 obs. of 79 variables:
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ LotFrontage : num 65 80 68 60 84 85 75 80 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Alley : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LotShape : int 3 3 2 2 2 2 3 2 3 3 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : int 4 3 4 3 4 3 4 3 3 3 ...
## $ ExterCond : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : int 4 4 4 3 4 4 5 4 3 3 ...
## $ BsmtCond : int 3 3 3 4 3 3 3 3 3 3 ...
## $ BsmtExposure : int 1 4 2 1 3 1 3 2 1 1 ...
## $ BsmtFinType1 : int 6 5 6 5 6 6 6 5 1 6 ...
## $ BsmtFinSF1 : num 706 978 486 216 655 ...
## $ BsmtFinType2 : int 1 1 1 1 1 1 1 4 1 1 ...
## $ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num 856 1262 920 756 1145 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : int 5 5 5 4 5 5 5 5 4 5 ...
## $ CentralAir : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : int 4 3 4 4 4 3 4 3 3 3 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : int 7 7 7 7 7 7 7 7 6 7 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : int 0 3 3 4 3 0 4 3 3 3 ...
## $ GarageType : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : int 2 2 2 1 2 1 2 2 1 2 ...
## $ GarageCars : num 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : int 3 3 3 3 3 3 3 3 2 4 ...
## $ GarageCond : int 3 3 3 3 3 3 3 3 3 3 ...
## $ PavedDrive : int 2 2 2 2 2 2 2 2 2 2 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
## $ MiscFeature : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
```

As we can see, there are two variables that are really factors but are stored as integers: the month and the year sold. We will convert the month here; the year sold will be converted after we use it to compute the house’s age.

```
total_clean$MoSold <- as.factor(total_clean$MoSold)
total_noclean$MoSold <- as.factor(total_noclean$MoSold)
```

**Note**. In order to show the importance of feature engineering, I will only apply feature engineering changes to the 2nd dataset.

One of the most important considerations when purchasing a house is its age. Newer houses are usually more expensive, because they are built with newer, unused materials and more advanced technology. A remodeled house, on the other hand, is usually not as expensive as a new house, but more expensive than an old one, for the same reasons.

Thus, we will calculate the age of the house.

```
#We calculate the age as years since the last remodel
total_clean$Age <- total_clean$YrSold - total_clean$YearRemodAdd
#Now that Age is computed, we convert YrSold to factor
total_clean$YrSold <- as.factor(total_clean$YrSold)
total_noclean$YrSold <- as.factor(total_noclean$YrSold)
```

We will plot the data to see whether age and price are negatively correlated.

`ggplot(total_clean, aes(Age, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "lm", se=FALSE) + labs(title="House price by age", x ="House Age")`

`## Warning: Removed 1459 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 1459 rows containing missing values (geom_point).`

As you can see, generally the older the house, the lower the sale price.

It is obvious that the bigger the house, the higher the price. However, we do not have a single variable that summarizes the total square footage of the house. Thus, we will create it.

```
total_clean$TotalSF <- total_clean$GrLivArea + total_clean$TotalBsmtSF + total_clean$OpenPorchSF + total_clean$WoodDeckSF + total_clean$EnclosedPorch + total_clean$X3SsnPorch + total_clean$ScreenPorch + total_clean$PoolArea
ggplot(total_clean, aes(TotalSF, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + labs(title="Houseprice by total square feet")
```

`## Warning: Removed 1459 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 1459 rows containing missing values (geom_point).`

`cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )`

`## [1] 0.7800884`

As we can see, although a couple of observations seem to follow a strange pattern, the two variables do have a strong positive relationship. Thus, we will analyze those two cases and remove them if warranted.

`total_clean[total_clean$TotalSF>7500, c("GrLivArea","TotalBsmtSF","OpenPorchSF","WoodDeckSF","PoolArea","SalePrice") ]`

```
## GrLivArea TotalBsmtSF OpenPorchSF WoodDeckSF PoolArea SalePrice
## 524 4676 3138 406 208 0 184750
## 1183 4476 2396 78 171 555 745000
## 1299 5642 6110 292 214 480 160000
## 2550 5095 5095 484 546 0 NA
```

We see that the “problematic” values are observations 524 and 1299. We will analyze these observations in more detail, together with observation 2550 (which we have to predict), just to check whether they share any similarities.

`t(total_clean[c(524,1299,2550),])`

```
## 524 1299 2550
## MSSubClass "60" "60" "20"
## MSZoning "RL" "RL" "RL"
## LotFrontage "130" "313" "128"
## LotArea "40094" "63887" "39290"
## Street "0" "0" "0"
## Alley "0" "0" "0"
## LotShape "2" "0" "2"
## LandContour "Bnk" "Bnk" "Bnk"
## LotConfig "Inside" "Corner" "Inside"
## LandSlope "Gtl" "Gtl" "Gtl"
## Neighborhood "Edwards" "Edwards" "Edwards"
## Condition1 "PosN" "Feedr" "Norm"
## Condition2 "PosN" "Norm" "Norm"
## BldgType "1Fam" "1Fam" "1Fam"
## HouseStyle "2Story" "2Story" "1Story"
## OverallQual "10" "10" "10"
## OverallCond "5" "5" "5"
## YearBuilt "2007" "2008" "2008"
## YearRemodAdd "2008" "2008" "2009"
## RoofStyle "Hip" "Hip" "Hip"
## RoofMatl "CompShg" "ClyTile" "CompShg"
## Exterior1st "CemntBd" "Stucco" "CemntBd"
## Exterior2nd "CmentBd" "Stucco" "CmentBd"
## MasVnrType "Stone" "Stone" "Stone"
## MasVnrArea " 762" " 796" "1224"
## ExterQual "5" "5" "5"
## ExterCond "3" "3" "3"
## Foundation "PConc" "PConc" "PConc"
## BsmtQual "5" "5" "5"
## BsmtCond "3" "3" "3"
## BsmtExposure "4" "4" "4"
## BsmtFinType1 "6" "6" "6"
## BsmtFinSF1 "2260" "5644" "4010"
## BsmtFinType2 "1" "1" "1"
## BsmtFinSF2 "0" "0" "0"
## BsmtUnfSF " 878" " 466" "1085"
## TotalBsmtSF "3138" "6110" "5095"
## Heating "GasA" "GasA" "GasA"
## HeatingQC "5" "5" "5"
## CentralAir "1" "1" "1"
## Electrical "SBrkr" "SBrkr" "SBrkr"
## X1stFlrSF "3138" "4692" "5095"
## X2ndFlrSF "1538" " 950" " 0"
## LowQualFinSF "0" "0" "0"
## GrLivArea "4676" "5642" "5095"
## BsmtFullBath "1" "2" "1"
## BsmtHalfBath "0" "0" "1"
## FullBath "3" "2" "2"
## HalfBath "1" "1" "1"
## BedroomAbvGr "3" "3" "2"
## KitchenAbvGr "1" "1" "1"
## KitchenQual "5" "5" "5"
## TotRmsAbvGrd "11" "12" "15"
## Functional "7" "7" "7"
## Fireplaces "1" "3" "2"
## FireplaceQu "4" "4" "4"
## GarageType "BuiltIn" "Attchd" "Attchd"
## GarageYrBlt "2007" "2008" "2008"
## GarageFinish "3" "3" "3"
## GarageCars "3" "2" "3"
## GarageArea " 884" "1418" "1154"
## GarageQual "3" "3" "3"
## GarageCond "3" "3" "3"
## PavedDrive "2" "2" "2"
## WoodDeckSF "208" "214" "546"
## OpenPorchSF "406" "292" "484"
## EnclosedPorch "0" "0" "0"
## X3SsnPorch "0" "0" "0"
## ScreenPorch "0" "0" "0"
## PoolArea " 0" "480" " 0"
## PoolQC "0" "4" "0"
## Fence "None" "None" "None"
## MiscFeature "None" "None" "None"
## MiscVal " 0" " 0" "17000"
## MoSold "10" "1" "10"
## YrSold "2007" "2008" "2007"
## SaleType "New" "New" "New"
## SaleCondition "Partial" "Partial" "Partial"
## SalePrice "184750" "160000" NA
## Age "-1" " 0" "-2"
## TotalSF " 8428" "12738" "11220"
```

If you look closer you will find some things in common:

- They all share the same Neighborhood.
- They all were sold before the house was finished (negative age).
- The SaleCondition was “Partial”.

According to the data description, a Partial sale means that “Home was not completed when last assessed”. Thus, the SalePrice of these observations may not represent the actual market value of the house, but rather the amount of money already paid in advance.

We should definitely remove these two observations, as they are outliers. However, they are very useful for predicting the value of house 2550, as it was sold under similar conditions.

Let’s see how the relationship looks without these two observations.

```
total_clean <- total_clean[-c(524,1299),]
ggplot(total_clean, aes(TotalSF, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + labs(title="Houseprice by total square feet (adjusted)")
```

`## Warning: Removed 1459 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 1459 rows containing missing values (geom_point).`

`cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )`

`## [1] 0.8289224`

Another interesting variable might be the number of bathrooms in the house. I also considered the total number of rooms (bathrooms + rooms), but that would likely be highly correlated with the total square footage.

In order to correctly count the bathrooms, we also have to add the basement bathrooms, as FullBath only indicates “Full bathrooms above grade”. Half bathrooms are weighted by 0.5.

```
total_clean$Bathrooms <- (total_clean$BsmtHalfBath + total_clean$HalfBath)*0.5 + total_clean$FullBath + total_clean$BsmtFullBath
ggplot(total_clean, aes(Bathrooms, SalePrice)) + geom_jitter(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + scale_x_continuous(breaks=seq(1,6,0.5)) + labs(title = "Houseprice by Amount of bathrooms")
```

`## Warning: Removed 1459 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 1459 rows containing missing values (geom_point).`

`cor(total_clean$Bathrooms, total_clean$SalePrice, use = "complete.obs")`

`## [1] 0.6358963`

Now we will try to identify, for each dataset, which are the most relevant variables. This will enable us to use just certain specific variables for some (not all) ML models, like the linear model. We will undertake two different processes.

- Random Forest: as it is a tree-based algorithm, it works with both categorical and numerical data. It will give us a glimpse of which the important variables are.
- Correlation matrix: as RF does not show collinearity among variables, I will also compute a correlation matrix. By doing so I will pick significant variables that are not highly correlated.

As we have said, we will run a Random Forest to see which are the most important explanatory variables.

```
set.seed(1234)
train_forest <- total_clean %>%
  filter(!is.na(SalePrice))
random_forest <- randomForest(x = train_forest[, -which(names(train_forest) == "SalePrice")], y = train_forest$SalePrice, ntree = 100, importance = TRUE)
random_forest <- importance(random_forest)
random_forest <- data.frame(Variables = row.names(random_forest), MSE = random_forest[, 1])
random_forest <- random_forest[order(random_forest$MSE, decreasing = TRUE), ]
ggplot(random_forest[1:30,], aes(x = reorder(Variables, MSE), y = MSE)) + geom_bar(stat = "identity") + labs(x = "Variable", y = "MSE") + coord_flip() + labs(title = "Variable Importance (MSE explanation)")
```

`ggplot(random_forest[1:30,], aes(x=reorder(Variables, -MSE), y=MSE, group=1)) + geom_point() + geom_line()+ labs(x = "Variable", y= "MSE") + labs(title = "MSE explanation evolution") + theme(axis.text.x = element_text(angle = 45, hjust = 1))`

As we can see, the most important variables are TotalSF, Neighborhood, OverallQual, GrLivArea, BsmtFinSF1, Bathrooms, GarageArea and YearBuilt. However, some of these variables (such as TotalSF and BsmtFinSF1) are surely highly correlated. Thus, we now have to analyze the correlations among variables.

In this case (the raw dataset) we will also run a Random Forest to analyze which are the most important variables.

```
set.seed(1234)
train_forest <- total_noclean %>%
  filter(!is.na(SalePrice))
random_forest <- randomForest(x = train_forest[, -which(names(train_forest) == "SalePrice")], y = train_forest$SalePrice, ntree = 100, importance = TRUE)
random_forest <- importance(random_forest)
random_forest <- data.frame(Variables = row.names(random_forest), MSE = random_forest[, 1])
random_forest <- random_forest[order(random_forest$MSE, decreasing = TRUE), ]
ggplot(random_forest[1:30,], aes(x = reorder(Variables, MSE), y = MSE)) + geom_bar(stat = "identity") + labs(x = "Variable", y = "MSE") + coord_flip() + labs(title = "Variable Importance (MSE explanation)")
```

`ggplot(random_forest[1:30,], aes(x=reorder(Variables, -MSE), y=MSE, group=1)) + geom_point() + geom_line()+ labs(x = "Variable", y= "MSE") + labs(title = "MSE explanation evolution") + theme(axis.text.x = element_text(angle = 45, hjust = 1))`

It is obvious that in this case the relevant variables have changed.

In order to build the correlation matrix, it is good to first consider several things:

- Pearson correlation only works with numerical data. Thus, I will first have to build a df containing only the numeric variables.
- Pearson correlation only measures the strength of **linear correlation** between variables. There might be some other type of (non-linear) correlation that we simply won’t see.

First, we have to subset the data and keep just the numeric variables.

```
# We create a new df for numeric variables
total_clean_numeric_index <- names(which(sapply(total_clean, is.numeric)))
total_clean_numeric <- total_clean[, which(names(total_clean) %in% total_clean_numeric_index)]
#We calculate correlations and filter for just those higher than |0.5| and pick the names
correlations <- as.matrix(x = sort(cor(total_clean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names <- names(which(apply(correlations,1, function(x) abs(x)>0.5)))
#We sort the dataset to just show those variables
total_clean_numeric <- total_clean_numeric[, names]
#We create and represent the correlations matrix
correlations <- cor(total_clean_numeric, use="pairwise.complete.obs")
cor.plot(correlations, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)
```

By analyzing the matrix we note the following:

- TotalSF is highly correlated with GrLivArea (0.86), TotalBsmtSF (0.8) and X1stFlrSF (0.77). We pick TotalSF.
- OverallQual is highly correlated with ExterQual (0.73) and KitchenQual (0.67). We pick OverallQual.
- GarageCars is highly correlated with GarageArea (0.89). We keep GarageCars.
- Bathrooms is highly correlated with FullBath (0.71). We keep Bathrooms.
- TotRmsAbvGrd is highly correlated with GrLivArea. As we have already dropped the latter, we keep TotRmsAbvGrd.
- YearRemodAdd is perfectly correlated with Age. However, it has lower correlations with the other variables (due to rounding, I suppose). Thus, we keep YearRemodAdd.

Finally, we will create a df that keeps only the “important” variables for those models that do not cope well with multicollinearity (such as the linear model).

```
important <- c("SalePrice","TotalSF","OverallQual","Bathrooms","GarageCars","YearBuilt","BsmtQual","GarageFinish","GarageYrBlt","FireplaceQu","YearRemodAdd", "Neighborhood")
total_models_clean <- total_clean[, which(names(total_clean) %in% important)]
```

This time we will undertake the exact same process on the raw dataset.

```
# We create a new df for numeric variables
total_noclean_numeric_index <- names(which(sapply(total_noclean, is.numeric)))
total_noclean_numeric <- total_noclean[, which(names(total_noclean) %in% total_noclean_numeric_index)]
#We calculate correlations and filter for just those higher than |0.5| and pick the names
correlations2 <- as.matrix(x = sort(cor(total_noclean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names2 <- names(which(apply(correlations2,1, function(x) abs(x)>0.5)))
#We sort the dataset to just show those variables
total_noclean_numeric <- total_noclean_numeric[, names2]
#We create and represent the correlations matrix
correlations2 <- cor(total_noclean_numeric, use="pairwise.complete.obs")
cor.plot(correlations2, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)
```

In this case we will get rid of the following variables:

- ExterQual: highly correlates with OverallQual (0.73).
- TotRmsAbvGrd: highly correlates with GrLivArea (0.81).
- GarageArea: highly correlates with GarageCars (0.89).
- X1stFlrSF: highly correlates with TotalBsmtSF (0.8).

Thus we will keep the following variables:

```
important2 <- c("SalePrice","OverallQual","GrLivArea","KitchenQual","GarageCars","TotalBsmtSF","FullBath","BsmtQual","GarageFinish","YearBuilt","FireplaceQu","YearRemodAdd", "Neighborhood")
total_models_noclean <- total_noclean[, which(names(total_noclean) %in% important2)]
```

**Note**: removing skewness is a recommended yet not indispensable process. Thus, **this process will not be undertaken in Dataset 1**, in keeping with its *lazy* treatment. As I said at the beginning, the idea is to show how important it is not just to run the correct model, but also to prepare the data.

Now that we know which variables are the important ones, we will prepare them by finding outliers and removing skewness.

Why is it important to fix the skewness? Some ML algorithms are based on the assumption of homoscedasticity, which means that all errors have the same variance. As explained on StackExchange: “the difference between what you predict Ŷ and the true values Y should be constant. You can ensure that by making sure that Y follows a Gaussian distribution”.

```
numeric <- c("SalePrice","TotalSF","YearBuilt","GarageYrBlt","YearRemodAdd")
total_model_clean_numeric <- total_models_clean[,names(total_models_clean)%in% numeric]
```

Now that we have the numeric variables, we will simply create a for loop that computes the skewness of each one and applies a logarithmic transformation if the skewness is higher than 0.75, which is a moderately high skewness. Explanation here.

```
for (i in 1:ncol(total_model_clean_numeric)) {
  if (abs(skew(total_model_clean_numeric[, i])) > 0.75) {
    total_model_clean_numeric[, i] <- log(total_model_clean_numeric[, i] + 1)
  }
}
```
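The effect of this log transform can be sketched on simulated data (not the housing set), using a simple moment-based skewness function in place of psych::skew:

```
# Illustrative sketch on simulated data: a right-skewed (lognormal)
# variable becomes far less skewed after log(x + 1).
set.seed(1)
x <- exp(rnorm(1000))   # strongly right-skewed sample

# Simple moment-based skewness (stand-in for psych::skew)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)             # well above the 0.75 threshold
skewness(log(x + 1))    # much closer to 0
```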

We will check whether any variable has been modified.

`head(total_model_clean_numeric,10)`

```
## YearBuilt YearRemodAdd GarageYrBlt SalePrice TotalSF
## 1 2003 2003 2003 12.24770 7.873978
## 2 1976 1976 1976 12.10902 7.945555
## 3 2001 2002 2001 12.31717 7.918992
## 4 1915 1970 1998 11.84940 7.930566
## 5 2000 2000 2000 12.42922 8.194229
## 6 1993 1995 1993 11.87061 7.843456
## 7 2004 2005 2004 12.63461 8.214194
## 8 1973 1973 1973 12.20608 8.259717
## 9 1931 1950 1931 11.77453 8.013674
## 10 1939 1950 1939 11.67845 7.636752
```

As you can see, SalePrice and TotalSF have been log-transformed. Thus, we will have to back-transform (exponentiate) the predictions we make. Just to illustrate, let’s check how the skewness of the response variable has changed.

```
qqnorm(total_clean$SalePrice, main = "Skewness of data without transformation (SalePrice)")
qqline(total_clean$SalePrice)
```

```
qqnorm(total_model_clean_numeric$SalePrice, main = "Skewness of data transformed (SalePrice)")
qqline(total_model_clean_numeric$SalePrice)
```

It is important to normalize the data so that the model is not affected by the scales of the variables. You can find further explanation [here](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029).

Obviously, we will only scale numeric data.

```
SalePrice <- total_model_clean_numeric$SalePrice
total_model_clean_numeric <- sapply(total_model_clean_numeric[, -which(names(total_model_clean_numeric) %in% "SalePrice")], scale)
total_model_clean_numeric <- cbind(SalePrice,total_model_clean_numeric)
total_model_clean_numeric <- as.data.frame(total_model_clean_numeric)
```
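For reference, here is what scale() does on a tiny vector: it centers each column to mean 0 and divides by the standard deviation.

```
# Minimal illustration of scale(): subtract the mean, divide by the sd.
v <- c(2, 4, 6)       # mean 4, sd 2
z <- scale(v)         # column matrix of z-scores
as.numeric(z)         # -1 0 1
```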

For those variables that are actually not numeric, we will create dummy variables: one variable for each level of each original variable, with two possible values, 1 (Yes) and 0 (No).
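The mechanism can be seen on a toy factor, using the same model.matrix call that appears below:

```
# Toy example of dummy encoding: one 0/1 column per factor level.
# The "- 1" removes the intercept, so no reference level is dropped.
quality <- factor(c("Low", "High", "Medium", "High"))
dummies <- model.matrix(~ . - 1, data.frame(quality))
dummies   # 3 columns; each row has exactly one 1
```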

To do so, we first have to coerce all categorical variables that are encoded as numeric into factors.

```
total_models_clean$OverallQual <- as.factor(total_models_clean$OverallQual)
total_models_clean$BsmtQual <- as.factor(total_models_clean$BsmtQual)
total_models_clean$FireplaceQu <- as.factor(total_models_clean$FireplaceQu)
total_models_clean$GarageFinish <- as.factor(total_models_clean$GarageFinish)
total_models_clean$GarageCars <- as.factor(total_models_clean$GarageCars)
total_models_clean$Bathrooms <- as.factor(total_models_clean$Bathrooms)
```

Now we will make all categorical variables dummy variables. This will make the dimensions of the dataset grow considerably.

```
categorical <- total_models_clean[!colnames(total_models_clean) %in% numeric]
dummy <- as.data.frame(model.matrix(~.-1, categorical))
total_final_2 <- cbind(SalePrice = total_model_clean_numeric$SalePrice, dummy)
```

Now we will get rid of the dummy variables that appear fewer than 5 times in the train dataset, along with those that do not appear in the test dataset. We do so because fewer than 5 occurrences is not enough data to make a good prediction.

```
names1 <- names(colSums(total_final_2[!is.na(total_final_2$SalePrice),])[colSums(total_final_2[!is.na(total_final_2$SalePrice),])<5])
names2 <- names(colSums(total_final_2[is.na(total_final_2$SalePrice), -which(names(total_final_2) == "SalePrice") ])[colSums(total_final_2[is.na(total_final_2$SalePrice),-which(names(total_final_2) == "SalePrice")])==0])
delete <- c(names1,names2)
```

Now we will drop these columns and bind the rest with the normalized variables.

```
total_final_2 <- total_final_2[, -which(names(total_final_2)%in% c(delete, "SalePrice"))]
total_final_2 <- cbind(total_model_clean_numeric, total_final_2)
```

Finally we will split data into train and test, and the response and explanatory variables.

```
train_2 <- total_final_2[!is.na(total_final_2$SalePrice),]
train_2_explanatory <- train_2[, -which(names(train_2) %in% "SalePrice")]
train_2_response <- train_2[, which(names(train_2) %in% "SalePrice")]
test_2 <- total_final_2[is.na(total_final_2$SalePrice),]
test_2_explanatory <- test_2[, -which(names(test_2) %in% "SalePrice")]
```

Once again, we begin by coercing the categorical variables that are encoded as numeric to factors.

```
total_models_noclean$OverallQual <- as.factor(total_models_noclean$OverallQual)
total_models_noclean$KitchenQual <- as.factor(total_models_noclean$KitchenQual)
total_models_noclean$GarageCars <- as.factor(total_models_noclean$GarageCars)
total_models_noclean$FullBath <- as.factor(total_models_noclean$FullBath)
total_models_noclean$BsmtQual <- as.factor(total_models_noclean$BsmtQual)
total_models_noclean$GarageFinish <- as.factor(total_models_noclean$GarageFinish)
total_models_noclean$FireplaceQu <- as.factor(total_models_noclean$FireplaceQu)
```

Now we create the dummy variables.

```
categorical <- select_if(total_models_noclean, is.factor)
total_model_noclean_numeric <- total_models_noclean[, -which(colnames(total_models_noclean) %in% names(categorical))]
dummy <- as.data.frame(model.matrix(~.-1, categorical))
total_final_1 <- cbind(total_model_noclean_numeric, dummy)
```

Now we collect the column names of the dummy variables that do not appear a significant number of times in train or that do not appear at all in test.

```
names1 <- names(colSums(total_final_1[!is.na(total_final_1$SalePrice),])[colSums(total_final_1[!is.na(total_final_1$SalePrice),])<5])
names2 <- names(colSums(total_final_1[is.na(total_final_1$SalePrice), -which(names(total_final_1) == "SalePrice") ])[colSums(total_final_1[is.na(total_final_1$SalePrice),-which(names(total_final_1) == "SalePrice")])==0])
delete <- c(names1,names2)
```

We delete these variables.

`total_final_1 <- total_final_1[, -which(names(total_final_1)%in% delete)]`

Finally we will split data into train and test, and the response and explanatory variables.

```
train_1 <- total_final_1[!is.na(total_final_1$SalePrice),]
train_1_explanatory <- train_1[, -which(names(train_1) %in% "SalePrice")]
train_1_response <- train_1[, which(names(train_1) %in% "SalePrice")]
test_1 <- total_final_1[is.na(total_final_1$SalePrice),]
test_1_explanatory <- test_1[, -which(names(test_1) %in% "SalePrice")]
```

We begin with a very simple model: linear regression. As you know, it basically finds the line that minimizes the sum of squared errors (and therefore the Root Mean Square Error).

```
model_lm_1 <- lm(SalePrice~., data=train_1)
prediction_1_lm <-predict(model_lm_1,test_1_explanatory)
summary(model_lm_1)
```

```
##
## Call:
## lm(formula = SalePrice ~ ., data = train_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368648 -13028 394 12148 217535
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.686e+05 1.711e+05 -3.907 9.78e-05 ***
## YearBuilt 1.056e+02 6.970e+01 1.515 0.130072
## YearRemodAdd 2.377e+02 5.953e+01 3.993 6.85e-05 ***
## TotalBsmtSF 1.374e+01 2.820e+00 4.873 1.23e-06 ***
## GrLivArea 3.791e+01 2.753e+00 13.771 < 2e-16 ***
## NeighborhoodBlmngtn 6.840e+03 2.343e+04 0.292 0.770326
## NeighborhoodBrDale 6.333e+03 2.357e+04 0.269 0.788217
## NeighborhoodBrkSide 2.621e+04 2.267e+04 1.156 0.247823
## NeighborhoodClearCr 5.341e+04 2.286e+04 2.336 0.019627 *
## NeighborhoodCollgCr 3.471e+04 2.231e+04 1.556 0.119976
## NeighborhoodCrawfor 5.520e+04 2.264e+04 2.437 0.014914 *
## NeighborhoodEdwards 1.188e+04 2.240e+04 0.530 0.596127
## NeighborhoodGilbert 2.706e+04 2.225e+04 1.216 0.224196
## NeighborhoodIDOTRR 9.059e+03 2.293e+04 0.395 0.692883
## NeighborhoodMeadowV -4.516e+02 2.339e+04 -0.019 0.984598
## NeighborhoodMitchel 2.400e+04 2.256e+04 1.064 0.287587
## NeighborhoodNAmes 2.796e+04 2.228e+04 1.255 0.209710
## NeighborhoodNoRidge 8.081e+04 2.278e+04 3.548 0.000401 ***
## NeighborhoodNPkVill 1.765e+04 2.435e+04 0.725 0.468754
## NeighborhoodNridgHt 4.951e+04 2.269e+04 2.182 0.029260 *
## NeighborhoodNWAmes 2.996e+04 2.234e+04 1.341 0.180081
## NeighborhoodOldTown 1.040e+04 2.258e+04 0.461 0.645124
## NeighborhoodSawyer 2.526e+04 2.247e+04 1.124 0.261001
## NeighborhoodSawyerW 3.197e+04 2.245e+04 1.424 0.154722
## NeighborhoodSomerst 3.646e+04 2.248e+04 1.622 0.105108
## NeighborhoodStoneBr 6.706e+04 2.311e+04 2.902 0.003766 **
## NeighborhoodSWISU 1.723e+04 2.323e+04 0.742 0.458427
## NeighborhoodTimber 4.049e+04 2.270e+04 1.784 0.074675 .
## NeighborhoodVeenker 6.239e+04 2.399e+04 2.601 0.009395 **
## OverallQual3 4.872e+03 1.675e+04 0.291 0.771186
## OverallQual4 1.676e+04 1.553e+04 1.079 0.280559
## OverallQual5 1.922e+04 1.562e+04 1.231 0.218577
## OverallQual6 2.439e+04 1.571e+04 1.552 0.120816
## OverallQual7 3.643e+04 1.593e+04 2.287 0.022334 *
## OverallQual8 5.341e+04 1.629e+04 3.278 0.001070 **
## OverallQual9 9.361e+04 1.740e+04 5.380 8.71e-08 ***
## OverallQual10 1.145e+05 1.881e+04 6.086 1.49e-09 ***
## BsmtQual2 4.146e+03 8.112e+03 0.511 0.609383
## BsmtQual3 6.458e+03 6.292e+03 1.026 0.304869
## BsmtQual4 9.471e+03 6.656e+03 1.423 0.154998
## BsmtQual5 2.966e+04 7.898e+03 3.755 0.000181 ***
## FullBath1 -5.428e+03 1.116e+04 -0.486 0.626868
## FullBath2 -8.231e+03 1.121e+04 -0.734 0.462921
## FullBath3 3.167e+04 1.276e+04 2.482 0.013192 *
## KitchenQual3 4.986e+03 5.592e+03 0.892 0.372733
## KitchenQual4 1.068e+04 6.082e+03 1.755 0.079455 .
## KitchenQual5 2.971e+04 7.380e+03 4.027 5.96e-05 ***
## FireplaceQu1 3.442e+03 7.136e+03 0.482 0.629671
## FireplaceQu2 1.049e+04 5.666e+03 1.851 0.064341 .
## FireplaceQu3 9.898e+03 2.599e+03 3.808 0.000146 ***
## FireplaceQu4 9.679e+03 2.486e+03 3.893 0.000104 ***
## FireplaceQu5 1.813e+04 7.079e+03 2.561 0.010532 *
## GarageFinish1 3.588e+04 1.457e+04 2.462 0.013937 *
## GarageFinish2 4.082e+04 1.474e+04 2.769 0.005699 **
## GarageFinish3 4.456e+04 1.470e+04 3.031 0.002479 **
## GarageCars1 -2.714e+04 1.422e+04 -1.909 0.056504 .
## GarageCars2 -2.172e+04 1.414e+04 -1.536 0.124835
## GarageCars3 2.617e+03 1.447e+04 0.181 0.856530
## GarageCars4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30860 on 1402 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.8491
## F-statistic: 145 on 57 and 1402 DF, p-value: < 2.2e-16
```

```
# write.csv(prediction_1_lm,"prediction_1_lm.csv")
# Result: 0.15878
```

```
model_lm_2 <- lm(SalePrice~., data=train_2)
prediction_2_lm <-predict(model_lm_2,test_2_explanatory)
summary(model_lm_2)
```

```
##
## Call:
## lm(formula = SalePrice ~ ., data = train_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81129 -0.06243 0.00746 0.07661 0.42673
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3505161 0.1207881 93.970 < 2e-16 ***
## YearBuilt 0.0297172 0.0092620 3.208 0.001365 **
## YearRemodAdd 0.0477534 0.0051696 9.237 < 2e-16 ***
## GarageYrBlt 0.0007018 0.0070044 0.100 0.920202
## TotalSF 0.1448538 0.0064208 22.560 < 2e-16 ***
## NeighborhoodBlmngtn 0.0657802 0.1004570 0.655 0.512698
## NeighborhoodBrDale -0.0032878 0.1009943 -0.033 0.974035
## NeighborhoodBrkSide 0.1969100 0.0971688 2.026 0.042906 *
## NeighborhoodClearCr 0.2917663 0.0980998 2.974 0.002988 **
## NeighborhoodCollgCr 0.1812137 0.0955757 1.896 0.058163 .
## NeighborhoodCrawfor 0.3280792 0.0969032 3.386 0.000730 ***
## NeighborhoodEdwards 0.1208111 0.0960602 1.258 0.208724
## NeighborhoodGilbert 0.1548761 0.0955336 1.621 0.105207
## NeighborhoodIDOTRR 0.0131414 0.0982517 0.134 0.893618
## NeighborhoodMeadowV -0.0203047 0.1001542 -0.203 0.839372
## NeighborhoodMitchel 0.1391946 0.0968279 1.438 0.150786
## NeighborhoodNAmes 0.1860144 0.0954955 1.948 0.051628 .
## NeighborhoodNoRidge 0.2631891 0.0974714 2.700 0.007014 **
## NeighborhoodNPkVill 0.0919997 0.1042572 0.882 0.377696
## NeighborhoodNridgHt 0.2017102 0.0972665 2.074 0.038281 *
## NeighborhoodNWAmes 0.1622412 0.0958265 1.693 0.090664 .
## NeighborhoodOldTown 0.0889978 0.0968233 0.919 0.358161
## NeighborhoodSawyer 0.1514611 0.0963156 1.573 0.116049
## NeighborhoodSawyerW 0.1525022 0.0961973 1.585 0.113123
## NeighborhoodSomerst 0.1919414 0.0964127 1.991 0.046694 *
## NeighborhoodStoneBr 0.2589273 0.0991541 2.611 0.009115 **
## NeighborhoodSWISU 0.1635224 0.0996216 1.641 0.100932
## NeighborhoodTimber 0.1949059 0.0973337 2.002 0.045429 *
## NeighborhoodVeenker 0.2780633 0.1027745 2.706 0.006902 **
## OverallQual3 0.1184177 0.0685881 1.727 0.084477 .
## OverallQual4 0.2602259 0.0634072 4.104 4.29e-05 ***
## OverallQual5 0.3151594 0.0634709 4.965 7.70e-07 ***
## OverallQual6 0.3526653 0.0640632 5.505 4.39e-08 ***
## OverallQual7 0.4212284 0.0652527 6.455 1.49e-10 ***
## OverallQual8 0.5004081 0.0670072 7.468 1.43e-13 ***
## OverallQual9 0.6373284 0.0717545 8.882 < 2e-16 ***
## OverallQual10 0.7512531 0.0775695 9.685 < 2e-16 ***
## BsmtQual2 -0.1383783 0.0340579 -4.063 5.11e-05 ***
## BsmtQual3 -0.1209535 0.0259766 -4.656 3.53e-06 ***
## BsmtQual4 -0.1163078 0.0276947 -4.200 2.84e-05 ***
## BsmtQual5 -0.0700060 0.0330786 -2.116 0.034491 *
## FireplaceQu1 0.0189192 0.0305241 0.620 0.535482
## FireplaceQu2 0.0319677 0.0243722 1.312 0.189855
## FireplaceQu3 0.0448819 0.0112083 4.004 6.55e-05 ***
## FireplaceQu4 0.0620783 0.0106869 5.809 7.78e-09 ***
## FireplaceQu5 0.0711524 0.0302786 2.350 0.018915 *
## GarageFinish1 0.2047354 0.0638481 3.207 0.001374 **
## GarageFinish2 0.2222887 0.0645712 3.443 0.000593 ***
## GarageFinish3 0.2320994 0.0643741 3.605 0.000323 ***
## GarageCars1 -0.0965111 0.0625314 -1.543 0.122959
## GarageCars2 -0.0633508 0.0622048 -1.018 0.308653
## GarageCars3 0.0052211 0.0636209 0.082 0.934606
## GarageCars4 NA NA NA NA
## Bathrooms1.5 0.0642993 0.0159314 4.036 5.73e-05 ***
## Bathrooms2 0.0424251 0.0126571 3.352 0.000824 ***
## Bathrooms2.5 0.0827956 0.0157176 5.268 1.60e-07 ***
## Bathrooms3 0.0884189 0.0174060 5.080 4.29e-07 ***
## Bathrooms3.5 0.1535546 0.0190988 8.040 1.90e-15 ***
## Bathrooms4 0.2618792 0.0406901 6.436 1.68e-10 ***
## Bathrooms4.5 0.2870511 0.0657238 4.368 1.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.132 on 1399 degrees of freedom
## Multiple R-squared: 0.8953, Adjusted R-squared: 0.891
## F-statistic: 206.3 on 58 and 1399 DF, p-value: < 2.2e-16
```

```
prediction_2_lm <- exp(prediction_2_lm)
write.csv(prediction_2_lm,"prediction_2_lm.csv")
# Result: 0.14711
```

The good thing about Lasso Regression is that it handles variables that add no information to the model by shrinking their coefficients to exactly zero. As we have already undertaken a variable selection, this should not matter much. You can find further explanation on Lasso Regression here.

Thus, we could even pass all the variables to the model, and the accuracy would probably increase.
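To see why Lasso can discard uninformative variables: in the orthonormal-design case, the Lasso estimate is obtained by soft-thresholding the least-squares coefficients. A self-contained base-R sketch (illustrative only, not part of the caret pipeline below):

```
# Soft-thresholding: the Lasso update in the orthonormal-design case.
# Coefficients smaller (in absolute value) than lambda become exactly 0.
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

betas <- c(-2, -0.3, 0.1, 1.5)
soft_threshold(betas, 0.5)   # -1.5  0.0  0.0  1.0
```

This is exactly how Lasso performs variable selection: the two small coefficients are zeroed out, removing those variables from the model.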

```
my_control <-trainControl(method="cv", number=5)
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0,1,by = 0.1))
model_1_glml <- train(
x = train_1[,-which(names(train_1)=="SalePrice")],
y = train_1$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=lassoGrid)
model_1_glml$bestTune
```

```
## alpha lambda
## 11 1 1
```

So we will use lambda = 1 for tuning the model.

```
# prediction_1_glml <- predict(model_1_glml,test_1_explanatory)
# prediction_1_glml <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glml)
# write.csv(prediction_1_glml,"model_1_glml.csv", quote=FALSE,row.names = FALSE)
# Score: 0.16289
```

```
model_2_glml <- train(
x = train_2[,-which(names(train_2)=="SalePrice")],
y = train_2$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=lassoGrid)
```

```
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
```

`model_2_glml$bestTune`

```
## alpha lambda
## 1 1 0
```

```
# prediction_2_glml <- predict(model_2_glml,test_2_explanatory)
# prediction_2_glml <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glml))
# write.csv(prediction_2_glml,"model_2_glml.csv", quote=FALSE,row.names = FALSE)
#Score: 0.15498
```

Ridge Regression is quite similar to Lasso Regression, the main difference being that it does not zero out uninformative variables the way Lasso does. Instead, it tends to give better long-term predictions. For further explanation, watch this video.

Thus, if we see that Lasso Regression outperforms Ridge Regression, maybe we have some non significant variables in the model.

In this case, we have tried several lambda values, and the model picks lambda = 1, which means the coefficients receive the strongest shrinkage among the values tried.

```
my_control <-trainControl(method="cv", number=5)
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0,1,0.1))
model_1_glmr <- train(
x = train_1[,-which(names(train_1)=="SalePrice")],
y = train_1$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=ridgeGrid)
model_1_glmr$bestTune
```

```
## alpha lambda
## 11 0 1
```

```
prediction_1_glmr <- predict(model_1_glmr,test_1_explanatory)
prediction_1_glmr <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glmr)
write.csv(prediction_1_glmr,"model_1_glmr.csv", quote=FALSE,row.names = FALSE)
#Score: 0.16694
```

In this case, lambda = 0, so there is no shrinkage. Thus, **the regression is ordinary least squares**.
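To visualize what the lambda penalty does in Ridge, here is a minimal sketch (on synthetic data) of the coefficient path: every coefficient shrinks smoothly towards zero as lambda grows, but none hits exactly zero.

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- as.vector(x %*% c(2, -1, 0.5, 0, 0) + rnorm(100))

ridge_fit <- glmnet(x, y, alpha = 0)   # alpha = 0 is the Ridge penalty
plot(ridge_fit, xvar = "lambda")       # coefficients shrink as lambda grows

# Coefficients at the largest vs smallest lambda on the path
big   <- coef(ridge_fit, s = max(ridge_fit$lambda))
small <- coef(ridge_fit, s = min(ridge_fit$lambda))
```

At the largest lambda the coefficients are heavily shrunk; at lambda near zero they approach the ordinary least squares solution.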

```
my_control <-trainControl(method="cv", number=10)
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0,1,0.1))
model_2_glmr <- train(
x = train_2[,-which(names(train_2)=="SalePrice")],
y = train_2$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=ridgeGrid)
model_2_glmr$bestTune
```

```
## alpha lambda
## 1 0 0
```

```
prediction_2_glmr <- predict(model_2_glmr,test_2_explanatory)
prediction_2_glmr <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glmr))
write.csv(prediction_2_glmr,"model_2_glmr.csv", quote=FALSE,row.names = FALSE)
#Score: 0.15702
```

Elastic Net Regression is a mix of both Lasso Regression and Ridge Regression. This way, we get both the ability to drop useless variables (Lasso) and better long-term predictions (Ridge).

As you can see, the best tune for this model is alpha = 0 and lambda = 1. With alpha = 0 the penalty is pure Ridge, so in this case the Elastic Net collapses to Ridge Regression.

```
my_control <-trainControl(method="cv", number=10)
elasticGrid <- expand.grid(alpha = seq(0,1,0.1), lambda = seq(0,1,0.1))
model_1_glme <- train(
x = train_1[,-which(names(train_1)=="SalePrice")],
y = train_1$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=elasticGrid)
model_1_glme$bestTune
```

```
## alpha lambda
## 11 0 1
```

```
# prediction_1_glme <- predict(model_1_glme,test_1_explanatory)
# prediction_1_glme <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glme)
# write.csv(prediction_1_glme,"model_1_glme.csv", quote=FALSE,row.names = FALSE)
#Score: 0.16693
```

In this case, the best parameters have changed: alpha = 0.1 and lambda = 0 are the ones that best tune the model.

```
my_control <-trainControl(method="cv", number=10)
elasticGrid <- expand.grid(alpha = seq(0,1,0.1), lambda = seq(0,1,0.1))
model_2_glme <- train(
x = train_2[,-which(names(train_2)=="SalePrice")],
y = train_2$SalePrice, method='glmnet',
trControl= my_control, tuneGrid=elasticGrid)
model_2_glme$bestTune
```

```
## alpha lambda
## 12 0.1 0
```

```
# prediction_2_glme <- predict(model_2_glme,test_2_explanatory)
# prediction_2_glme <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glme))
# write.csv(prediction_2_glme,"model_2_glme.csv", quote=FALSE,row.names = FALSE)
#Score: 0.15064
```

We will run a loop to see how the explanatory capacity of the model changes with different values of the minimum number of observations needed to attempt a split (minsplit). By doing so, we will get a better model.
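Such a loop could be sketched like this, using `rpart` directly and the built-in `mtcars` data as a stand-in for the housing set (purely illustrative):

```r
library(rpart)

# Loop over minsplit values and compare in-sample R^2
minsplits <- c(5, 10, 20, 40)
r2 <- sapply(minsplits, function(ms) {
  fit <- rpart(mpg ~ ., data = mtcars,
               control = rpart.control(minsplit = ms, cp = 0.001))
  pred <- predict(fit, mtcars)
  1 - sum((mtcars$mpg - pred)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
})
names(r2) <- minsplits
r2  # smaller minsplit -> deeper tree -> higher in-sample fit (risking overfit)
```

A smaller minsplit lets the tree keep splitting, which raises the in-sample fit but can overfit, so the choice should be validated out of sample.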

```
set.seed(1234)
my_control <- trainControl(method = "repeatedcv", number=10, repeats=5)
model_dt_1 <- train(SalePrice~., data=train_1, method = "rpart", tuneLength= 7,trControl = my_control)
plot(model_dt_1$finalModel)
text(model_dt_1$finalModel)
```

Now that we know which complexity value performs best, we will use it to fit the decision tree.

```
# prediction_1_dt <-predict(model_dt_1,test_1_explanatory)
# prediction_1_dt <- data.frame(Id=(1461:2919), SalePrice=prediction_1_dt)
# write.csv(prediction_1_dt,"model_dt_1.csv", quote=FALSE,row.names = FALSE)
#Result: 0.26823
```

As you can see, the variables on which the splits are made have changed. In this case, the most important variable is TotalSF.

As the tree is still very short, its predictions will obviously be poor.
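A hedged way to let `rpart` grow a deeper tree is to lower the complexity parameter `cp` (and `minsplit`); illustrated on `mtcars` as a stand-in:

```r
library(rpart)

shallow <- rpart(mpg ~ ., data = mtcars)   # default cp = 0.01, minsplit = 20
deep <- rpart(mpg ~ ., data = mtcars,
              control = rpart.control(cp = 0.0001, minsplit = 2))

nrow(shallow$frame)  # number of nodes in the shallow tree
nrow(deep$frame)     # the deeper tree has more nodes
```

Whether the deeper tree generalizes better is a separate question; `plotcp()` on the fitted tree shows the cross-validated error at each `cp` value.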

```
model_dt_2 <- train(SalePrice~., data=train_2, method = "rpart", tuneLength= 7,trControl = my_control)
plot(model_dt_2$finalModel)
text(model_dt_2$finalModel)
```

```
# prediction_2_dt <-predict(model_dt_2,test_2_explanatory)
# prediction_2_dt <- data.frame(Id=(1461:2919), SalePrice=exp(prediction_2_dt))
# write.csv(prediction_2_dt,"model_dt_2.csv", quote=FALSE,row.names = FALSE)
#Result: 0.23130
```

In this case, we will fit a Random Forest, one of the most versatile ML algorithms out there. However, here we are not taking full advantage of its capabilities: it performs feature selection and handles categorical data natively, so we could have saved some steps along the way.

However, as before, the idea here is not to see how well this ML algorithm can perform, but rather how the accuracy of the model changes with feature engineering and data preprocessing.
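Since Random Forest performs its own implicit feature selection, it can be useful to inspect variable importance; a minimal sketch on synthetic data (the names `signal`, `noise1`, `noise2` are made up for illustration):

```r
library(randomForest)

set.seed(1)
n <- 300
df <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
df$y <- 2 * df$signal + rnorm(n)   # only 'signal' drives the target

rf <- randomForest(y ~ ., data = df, ntree = 200, importance = TRUE)
imp <- importance(rf)[, "%IncMSE"]
sort(imp, decreasing = TRUE)       # 'signal' should rank first
varImpPlot(rf)
```

On the housing data the same two calls would reveal which engineered features the forest actually uses.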

We will try to best tune the RF model.

```
set.seed(1)
bestMtry <- tuneRF(train_1[,-which(names(train_1) == "SalePrice")], train_1$SalePrice, stepFactor = 1.5, improve = 1e-5, ntree = 500)
```

```
## mtry = 19 OOB error = 931888231
## Searching left ...
## mtry = 13 OOB error = 929337844
## 0.002736795 1e-05
## mtry = 9 OOB error = 944050009
## -0.0158308 1e-05
## Searching right ...
## mtry = 28 OOB error = 910028385
## 0.02077765 1e-05
## mtry = 42 OOB error = 966012509
## -0.0615191 1e-05
```

```
# model_rf_1 <- randomForest(SalePrice~., data=train_1,ntree=500, mtry = 28)
# prediction_1_rf <-predict(model_rf_1,test_1_explanatory)
#
# write.csv(prediction_1_rf,"model_rf_1.csv")
# Result: 0.16143
```

```
set.seed(1)
bestMtry <- tuneRF(train_2[,-which(names(train_2) == "SalePrice")], train_2$SalePrice, stepFactor = 1.5, improve = 1e-5, ntree = 500)
```

```
## mtry = 19 OOB error = 0.02047286
## Searching left ...
## mtry = 13 OOB error = 0.02077389
## -0.01470416 1e-05
## Searching right ...
## mtry = 28 OOB error = 0.0205968
## -0.006053951 1e-05
```

```
# model_rf_2 <- randomForest(SalePrice~., data=train_2,ntree=100, mtry=19)
# prediction_2_rf <-predict(model_rf_2,test_2_explanatory)
#
# prediction_2_rf <- exp(prediction_2_rf)
# write.csv(prediction_2_rf,"model_rf_2.csv")
# Result: 0.15164
```

```
# knn_ctrl <- trainControl(method="repeatedcv",repeats = 3)
# prePro_1 <- preProcess(train_1, method = c("center", "scale"))
# norm_1 <- predict(prePro_1, train_1)
#
# model_knn_1 <- train(SalePrice ~ ., data = norm_1, method = "knn", trControl = knn_ctrl, tuneLength = 20)
#
# # Apply the preprocessing fitted on the training data to the test set
# norm_test_1 <- predict(prePro_1, test_1_explanatory)
#
# prediction_1_knn <- predict(model_knn_1, norm_test_1)
# prediction_1_knn <- data.frame(Id=(1461:2919), SalePrice=prediction_1_knn)
# write.csv(prediction_1_knn,"model_knn_1.csv", quote=FALSE,row.names = FALSE)
#
#Results: 0.26823
```

```
# prePro_2 <- preProcess(train_2, method = c("center", "scale"))
# norm_2 <- predict(prePro_2, train_2)
# model_knn_2 <- train(SalePrice ~ ., data = norm_2, method = "knn", trControl = knn_ctrl, tuneLength = 20)
# norm_test_2 <- predict(prePro_2, test_2_explanatory)
# prediction_2_knn <- predict(model_knn_2, norm_test_2)
# prediction_2_knn <- data.frame(Id=(1461:2919), SalePrice=exp(prediction_2_knn))
# write.csv(prediction_2_knn,"model_knn_2.csv", quote=FALSE,row.names = FALSE)
#Results: 0.23130
```

XGBoost is one of the most popular models used in Kaggle. However, it takes quite a lot of time to train. Thus, in this situation, **I have not searched for the best tuning parameters for both models**. Instead, I found the best tuning parameters among the "recommended" standard values for dataset 2 and then applied those same parameters to model 1. Thus, model 1 could still gain a significant improvement.

As said before, in order to save some time, the parameters I have used to train this XGBoost model are the ones that best tune XGBoost for the second dataset. These parameters are:

- **eta = 0.05**: the learning rate, used to prevent overfitting.
- **max_depth = 2**: the maximum depth the trees can have. According to the XGBoost documentation, the deeper the trees, the more likely the model is to overfit.
- **min_child_weight = 4**: as explained on Stack Exchange, *"stop trying to split once you reach a certain degree of purity in a node and your model can fit it"*.

Thus, the model was built like this:

```
# tune_grid <- expand.grid(
# nrounds = 1000,
# eta = 0.05,
# max_depth = 2,
# gamma = 0,
# colsample_bytree=1,
# min_child_weight=4,
# subsample=1
# )
#
# xgb_tune_1 <- caret::train(
# x = train_1[,-which(names(train_1)=="SalePrice")],
# y = train_1$SalePrice,
# tuneGrid = tune_grid,
# method = "xgbTree",
# verbose = TRUE)
```

With the model trained, we made a prediction, which scored an RMSE of 0.17989 on the test dataset (quite high, actually).

```
# prediction_1_xgb <-predict(xgb_tune_1,test_1_explanatory)
# prediction_1_xgb <- data.frame(Id=(1461:2919), SalePrice=prediction_1_xgb)
# write.csv(prediction_1_xgb,"model_xgb_1.csv", quote=FALSE,row.names = FALSE)
# Result: 0.17989
```

As explained before, in this case I did actually tune the parameters. Even so, this is obviously not a thorough implementation of XGBoost; you can see how it is correctly done in this kernel.

```
# tune_grid <- expand.grid(
# nrounds = 1000,
# eta = c(0.1, 0.05, 0.01),
# max_depth = c(2, 3, 4, 5, 6),
# gamma = 0,
# colsample_bytree=1,
# min_child_weight=c(1, 2, 3, 4 ,5),
# subsample=1
# )
#
# xgb_tune_2 <- caret::train(
# x = train_2[,-which(names(train_2)=="SalePrice")],
# y = train_2$SalePrice,
# tuneGrid = tune_grid,
# method = "xgbTree",
# verbose = TRUE)
#
# xgb_tune_2$bestTune
```

When we undertake the predictions we get a much more accurate model, with an RMSE of 0.15447. So far, XGBoost is one of the models that has performed best, even though we haven't undertaken a thorough parameter tuning.

```
# prediction_2_xgb <-predict(xgb_tune_2,test_2_explanatory)
#
# xgb_tune_2$bestTune
#
# prediction_2_xgb <- exp(prediction_2_xgb)
# prediction_2_xgb <- data.frame(Id=(1461:2919), SalePrice=prediction_2_xgb)
# write.csv(prediction_2_xgb,"model_xgb_2.csv", quote=FALSE,row.names = FALSE)
# Result: 0.15447
```
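A hedged way to cut the training time, instead of fixing `nrounds = 1000`, is to let `xgb.cv` stop once the cross-validated error stops improving; a sketch on synthetic data rather than the housing set:

```r
library(xgboost)

set.seed(1)
x <- matrix(rnorm(500 * 10), 500, 10)
y <- 3 * x[, 1] + rnorm(500)

cv <- xgb.cv(
  params = list(eta = 0.05, max_depth = 2, min_child_weight = 4,
                objective = "reg:squarederror"),
  data = xgb.DMatrix(x, label = y),
  nrounds = 1000, nfold = 5,
  early_stopping_rounds = 25,   # stop when test RMSE stops improving
  verbose = 0
)
cv$best_iteration   # usually far fewer than 1000 rounds
```

The best iteration found this way can then be passed as `nrounds` to the final fit, which avoids wasting boosting rounds that no longer help.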

As Indresh Bhattacharyya explained, the main difference between SVM for regression and linear regression is that *in simple regression we try to minimise the error rate, while in SVR we try to fit the error within a certain threshold*.
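To make the "fit the error within a threshold" idea concrete, here is a minimal epsilon-SVR sketch using `e1071::svm` (a different interface from the caret `svmLinear` method used below, so purely illustrative):

```r
library(e1071)

set.seed(1)
df <- data.frame(x = seq(0, 10, length.out = 200))
df$y <- 2 * df$x + rnorm(200)

# eps-regression ignores residuals smaller than epsilon when fitting
fit <- svm(y ~ x, data = df, type = "eps-regression",
           kernel = "linear", epsilon = 0.5)
rmse <- sqrt(mean((df$y - predict(fit, df))^2))
rmse
```

Points inside the epsilon tube contribute no loss, so only observations outside the tube (the support vectors) shape the fitted line.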

Thus, as linear regression did a good job in the prediction, SVM will probably do a great job too. Let's check it out.

As we can see, SVM has actually done a good job at predicting the price. In fact, it is the model that has performed best on the "raw data" dataset.

```
# my_control <- trainControl(method = "cv", number = 10)
# model_svm_1 <- train(SalePrice~., data=train_1, method = "svmLinear",trControl=my_control,tuneLength = 10)
# prediction_1_svm <-predict(model_svm_1,test_1_explanatory)
# prediction_1_svm <- data.frame(Id=(1461:2919), SalePrice=prediction_1_svm)
# write.csv(prediction_1_svm,"model_svm_1.csv", quote=FALSE,row.names = FALSE)
# Result: 0.15255
```

Once again, we can see how, in this case, the SVM has outperformed the rest of the models.

```
# model_svm_2 <- train(SalePrice~., data=train_2, method = "svmLinear",trControl=my_control,tuneLength = 10)
# prediction_2_svm <-predict(model_svm_2,test_2_explanatory)
# prediction_2_svm <- data.frame(Id=(1461:2919), SalePrice=exp(prediction_2_svm))
#
# write.csv(prediction_2_svm,"model_svm_2.csv", quote=FALSE,row.names = FALSE)
# Result: 0.14695
```

Although I have tried several tunings, I have not been able to make a neural network converge. Thus, I will simply include the code I have used (maybe you can help me find better hidden-layer parameters) and drop this model from the comparison.

```
# model_nn_1 <- neuralnet(formula = paste('SalePrice ~ ',paste(colnames(train_1)[1:(ncol(train_1)-1)],collapse=" + ")),
# data = train_1, hidden = c(5), linear.output = TRUE, lifesign = "full",
# stepmax = 1e6, rep = 1, algorithm = "rprop+", threshold = 0.5)
#
# prediction_1_nn <-predict(model_nn_1,test_1_explanatory)
# prediction_1_nn <- data.frame(Id=(1461:2919), SalePrice=prediction_1_nn)
# write.csv(prediction_1_nn,"model_nn_1.csv", quote=FALSE,row.names = FALSE)
```
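One thing that often helps `neuralnet` converge is min-max scaling all columns to [0, 1] before training and unscaling the predictions afterwards; a hedged sketch on synthetic data rather than the housing set:

```r
library(neuralnet)

set.seed(1)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- df$x1 + 0.5 * df$x2 + rnorm(200, sd = 0.05)

# Min-max scale every column to [0, 1]
mins <- sapply(df, min); maxs <- sapply(df, max)
scaled <- as.data.frame(scale(df, center = mins, scale = maxs - mins))

nn <- neuralnet(y ~ x1 + x2, data = scaled, hidden = 3,
                linear.output = TRUE, stepmax = 1e5)

# Unscale predictions back to the original target range
pred_scaled <- as.vector(compute(nn, scaled[, c("x1", "x2")])$net.result)
pred <- pred_scaled * (maxs["y"] - mins["y"]) + mins["y"]
```

Without scaling, the gradient steps are dominated by the largest-magnitude columns, which is a common reason `neuralnet` hits `stepmax` without converging.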

If we analyze the results, one thing is clear: **all models work much better if we undertake good missing value handling, feature generation and feature selection**. This can be clearly seen in the following graph.
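The `results` data frame used in the plot below is assumed to collect the public scores reported throughout the kernel; a minimal reconstruction with some of them (the `Worked_data` label is an assumption, mirroring the `Non_worked_data` label in the plot code):

```r
library(tibble)

# Excerpt of the Kaggle scores reported above, one row per model/dataset pair
results <- tribble(
  ~Model,        ~Data,             ~RMSE,
  "Lasso",       "Non_worked_data", 0.16289,
  "Lasso",       "Worked_data",     0.15498,
  "Ridge",       "Non_worked_data", 0.16694,
  "Ridge",       "Worked_data",     0.15702,
  "Elastic Net", "Non_worked_data", 0.16693,
  "Elastic Net", "Worked_data",     0.15064,
  "XGBoost",     "Non_worked_data", 0.17989,
  "XGBoost",     "Worked_data",     0.15447,
  "SVM",         "Non_worked_data", 0.15255,
  "SVM",         "Worked_data",     0.14695
)
```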

```
results %>%
ggplot() +
geom_path(aes(x= RMSE, y = Model ),
arrow = arrow(length =unit(1.5,"mm"), type = "closed")) +
geom_text(aes(x = RMSE, y = Model, label = round(RMSE,2),
hjust = ifelse(Data == "Non_worked_data",-0.3,1.2)),
size = 3, color = "gray25"
) +
labs(title = "Model performance with processed data vs not processed data ", x= "RMSE on test", y = "") +
theme_minimal()
```

This kernel tries to emphasize the importance of a good data processing, which is often forgotten. That being said, choosing the right ML model (or models) and tuning them correctly are also crucial for obtaining the best results possible.
