Predicting the Price of a House (10 ML Models)

Ander Fernández Jauregui.

Data Scientist and Business Intelligence.

Initial Problem

A lot of attention is often paid to the machine learning itself, but very little to data preparation (removing skewness, imputing missing values, feature engineering).

This project aims to show how much properly preparing the data matters, regardless of the model used.

Topics Involved

  • Machine Learning
    • Black boxes (SVM, NN, XGBoost…)
    • White boxes (Elastic Net, Lasso, decision trees…)
  • Data preparation
    • Feature Engineering

Solution

To that end, based on a Kaggle competition, I have applied 10 different machine learning models (SVM, XGBoost, Lasso, etc.), each on two datasets: one where the data has been properly prepared and one where it has not.

Project Development


Introduction

This project aims to emphasize the importance of good missing value handling, feature generation and feature selection. To do so, I will predict the price of houses in Iowa using several different machine learning algorithms.

The idea is that by the end of this kernel you will have a good grasp of how important good data preprocessing is and how it can affect the outcome of your model.

The models we will compare are the following:

  1. Linear model
  2. Lasso Regression
  3. Ridge Regression
  4. Elastic-Net Regression
  5. Decision Trees
  6. Random Forest
  7. k-NN
  8. XGBoost
  9. Support Vector Machines
  10. Neural Network

As the aim is to show the importance of data preprocessing, we will require two datasets:

  1. Dataset with raw data. In this case we will do minimal processing (such as a basic NA imputation). This will be Dataset 1.
  2. Dataset with cleaned data. In this case, we will undertake a more thorough process: NA imputation, data preprocessing, feature engineering and so on. This will be Dataset 2.

So let’s get started.

Data & Library Import and File Merge

These are the libraries we will use.

library(plyr)          # revalue()
library(dplyr)         # data manipulation
library(randomForest)  # random forest and variable importance
library(ggplot2)       # plots
library(psych)         # skew(), cor.plot()
library(caret)         # model training helpers
library(xgboost)       # gradient boosting
library(rpart)         # decision trees
library(neuralnet)     # neural networks
library(nnet)          # neural networks
library(tidyr)         # data tidying
library(forcats)       # factor helpers
cat("\014")            # clear the console

Now I will read both files.

train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

We will delete the Id variable for both train and test.

index_train <- train$Id
train$Id <- NULL
test$Id <- NULL

In order to merge both datasets, I will first create the SalePrice variable in the test dataset. Then I will merge them.

test$SalePrice <- NA
total <- rbind(train, test)

Missing data

Completeness of the data

Let’s see how complete the data is.

withNA <- which(colSums(is.na(total)) > 0)
sort(colSums(sapply(total[withNA], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

Imputing missing data

If you check the data description provided in the competition, you will see that for many of the variables with lots of NAs, NA actually means that the house does not have that feature. Therefore, to be fair, we will treat those NAs as a “None” category in both datasets.

By reading the data description you will see that the variables whose NAs should be “None” are the following: PoolQC, MiscFeature, Alley, Fence, FireplaceQu, GarageFinish, GarageQual, GarageCond, GarageType, BsmtCond, BsmtExposure, BsmtQual, BsmtFinType1 and BsmtFinType2.

We will coerce these variables’ NAs into “None”.

change <- c("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageFinish", "GarageQual", "GarageCond", "GarageType", "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType1", "BsmtFinType2")

# Replace NAs with "None" in every column listed in `change`
for(col in colnames(total)){
  if(col %in% change){
    total[, col][is.na(total[, col])] <- "None"
  }
}

If we now analyze the number of NAs in each column, we will see that they have significantly decreased.

withNA <- which(colSums(is.na(total)) > 0)
sort(colSums(sapply(total[withNA], is.na)), decreasing = TRUE)
##    SalePrice  LotFrontage  GarageYrBlt   MasVnrType   MasVnrArea 
##         1459          486          159           24           23 
##     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF 
##            1            1            1            1            1 
##  TotalBsmtSF   Electrical  KitchenQual   GarageCars   GarageArea 
##            1            1            1            1            1 
##     SaleType 
##            1

Now we will create two different datasets. In each, we will undertake a different imputation procedure, as stated in the introduction.

Dataset 1. Mode imputation

This is not the way we should work, or at least that is what I argue. I am just doing this to show you how differently models perform when the data is cleaned compared to when it is not. Cleaning is tedious, but it is the way to go.

That being said, we will coerce NAs into the mode of each variable. We use the mode instead of the mean mainly because if we had an outlier, for example, the mean would be affected by that extreme value. The mode, on the other hand, barely changes.

We will exclude “SalePrice” from the imputation for obvious reasons: it is the variable we want to predict.

Besides, as R does not include a function to calculate the mode, we will create it ourselves. In my case, I got it from Stack Overflow. Thanks to Ken Williams for the contribution.

# Creating the Mode function 

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
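# Quick sanity check (not part of the original kernel): Mode returns the most
# frequent value of a vector, e.g. Mode(c("a", "b", "b", "c")) gives "b".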
# Applying it to the variables

total_noclean <- total

# Impute the mode in every column that still has NAs, except SalePrice
for(col in names(withNA)){
  if(col != "SalePrice"){
    total_noclean[, col][is.na(total_noclean[, col])] <- Mode(total_noclean[, col][!is.na(total_noclean[, col])])
  }
}

Now we should have no more variables to impute. We rerun the code just to make sure.

withNA2 <- which(colSums(is.na(total_noclean)) > 0)
sort(colSums(sapply(total_noclean[withNA2], is.na)), decreasing = TRUE)
## SalePrice 
##      1459

As you can see, we are done here. Simple, isn’t it? Of course it is, and that is exactly why most of the time it won’t work. ;)

Dataset 2. Advanced NA imputation

First, we must be aware that in the previous step of coercing NAs into “None” we might have missed some interesting things. Just by looking at the NA counts we obtained in the “Completeness of the data” section, you can spot some odd things:

  • There were 157 NAs in “GarageType” but 159 in the rest of the Garage variables. Maybe there are 2 observations with incorrectly entered data.
  • BsmtFinType2 had 80 NAs, while BsmtQual had 81 and the rest of the Bsmt variables 82.

Besides, we will also have to “predict”, to the extent possible, the value of the missing data in ways other than the mode.
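
The cleaner pipeline works on its own copy of the merged data, which the code below refers to as total_clean. That assignment is implied but not shown in the original kernel, so we add it here:

total_clean <- total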

Garage variables

As mentioned before, GarageType had 2 fewer NAs than the rest of the Garage variables. We will check those two observations:

total_clean[total_clean$GarageFinish == "None" & total_clean$GarageType != "None" ,c ("GarageFinish","GarageQual","GarageCond","GarageType")]
##      GarageFinish GarageQual GarageCond GarageType
## 2127         None       None       None     Detchd
## 2577         None       None       None     Detchd

As we thought, these two observations’ GarageType should also be considered “None”.

total_clean[total_clean$GarageFinish == "None" & total_clean$GarageType != "None" ,"GarageType"] <- "None"

The other variables concerning the garage are GarageCars, GarageArea and GarageYrBlt. We will check whether any observation has values in these while showing “None” in the rest of the Garage variables.

total_clean %>%
  filter(GarageType == "None") %>%
  filter(GarageCars!=0, GarageArea!=0) %>%
  filter(!is.na(GarageCars) |!is.na(GarageArea) | !is.na(GarageYrBlt)) %>%
  select(GarageCars,GarageArea,GarageYrBlt, GarageType,YearBuilt)
##   GarageCars GarageArea GarageYrBlt GarageType YearBuilt
## 1          1        360          NA       None      1910

It is clear that in this case GarageType should not be “None”. We will impute the mode of GarageType instead.

Mode(total_clean$GarageType)
## [1] "Attchd"
total_clean[total_clean$GarageType == "None" & total_clean$GarageCars == 1 & total_clean$GarageArea == 360 & total_clean$YearBuilt == 1910, "GarageType"] <- Mode(total_clean$GarageType) 

Finally, we will have to impute “GarageYrBlt”. We will first check that there is no observation with NA in this variable but actual values in the rest of the Garage variables.

total_clean %>%
  filter(is.na(GarageYrBlt) == TRUE ) %>%
  filter(GarageArea != "None", GarageCars != 0, GarageFinish != "None", GarageQual != "None", GarageArea != 0)
##  [1] MSSubClass    MSZoning      LotFrontage   LotArea       Street       
##  [6] Alley         LotShape      LandContour   Utilities     LotConfig    
## [11] LandSlope     Neighborhood  Condition1    Condition2    BldgType     
## [16] HouseStyle    OverallQual   OverallCond   YearBuilt     YearRemodAdd 
## [21] RoofStyle     RoofMatl      Exterior1st   Exterior2nd   MasVnrType   
## [26] MasVnrArea    ExterQual     ExterCond     Foundation    BsmtQual     
## [31] BsmtCond      BsmtExposure  BsmtFinType1  BsmtFinSF1    BsmtFinType2 
## [36] BsmtFinSF2    BsmtUnfSF     TotalBsmtSF   Heating       HeatingQC    
## [41] CentralAir    Electrical    X1stFlrSF     X2ndFlrSF     LowQualFinSF 
## [46] GrLivArea     BsmtFullBath  BsmtHalfBath  FullBath      HalfBath     
## [51] BedroomAbvGr  KitchenAbvGr  KitchenQual   TotRmsAbvGrd  Functional   
## [56] Fireplaces    FireplaceQu   GarageType    GarageYrBlt   GarageFinish 
## [61] GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive   
## [66] WoodDeckSF    OpenPorchSF   EnclosedPorch X3SsnPorch    ScreenPorch  
## [71] PoolArea      PoolQC        Fence         MiscFeature   MiscVal      
## [76] MoSold        YrSold        SaleType      SaleCondition SalePrice    
## <0 rows> (or 0-length row.names)

Now, what do we do with the Garage Year Built? We will impute the year the house was built, except if the house has been remodeled, in which case we will impute the remodel year.

It makes sense that the garage and the house were both built in the same year and that if you remodel the house it is highly likely you will also remodel the garage.

#Impute GarageYrBlt with the remodel year if available, otherwise the year built
for (i in 1:nrow(total_clean)){
  if(is.na(total_clean$GarageYrBlt[i])){
    total_clean$GarageYrBlt[i] <- ifelse(!is.na(total_clean$YearRemodAdd[i]), total_clean$YearRemodAdd[i], total_clean$YearBuilt[i])
  }
}

Finally, we will have to deal with the remaining Garage values that are still missing. We will check them:

total_clean %>%
  filter(is.na(GarageCars) | is.na(GarageArea)) %>%
  select("GarageCars","GarageArea","GarageFinish","GarageQual","GarageCond","GarageType")
##   GarageCars GarageArea GarageFinish GarageQual GarageCond GarageType
## 1         NA         NA         None       None       None       None

As we can see, both missing values belong to the same observation. As the rest of the variables suggest that this house does not have a garage, we will impute 0 in both.

total_clean$GarageCars[is.na(total_clean$GarageCars)] <- 0
total_clean$GarageArea[is.na(total_clean$GarageArea)] <- 0

Basement Variables

Now we will do the same with the basement variables. In this case, there might be some observations we can recover.

total_clean %>%
  select(BsmtCond,BsmtExposure,BsmtQual,BsmtFinType2,BsmtFinType1) %>%
  filter(BsmtFinType1 != "None") %>%
  filter(BsmtCond == "None" | BsmtExposure == "None" | BsmtQual == "None" | BsmtFinType2 == "None" )
##   BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1
## 1       TA           No       Gd         None          GLQ
## 2       TA         None       Gd          Unf          Unf
## 3       TA         None       Gd          Unf          Unf
## 4     None           Mn       Gd          Rec          GLQ
## 5     None           No       TA          Unf          BLQ
## 6       Fa           No     None          Unf          Unf
## 7       TA           No     None          Unf          Unf
## 8       TA         None       Gd          Unf          Unf
## 9     None           Av       TA          Unf          ALQ

As you can see, there are 9 observations where we have imputed “None” when we actually shouldn’t have. In these cases, we will impute the corresponding mode.

total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtCond == "None", "BsmtCond"] <- Mode(total_clean$BsmtCond)

total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtExposure == "None", "BsmtExposure"] <- Mode(total_clean$BsmtExposure)

total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtQual == "None", "BsmtQual"] <- Mode(total_clean$BsmtQual)

total_clean[total_clean$BsmtFinType1 != "None" & total_clean$BsmtFinType2 == "None", "BsmtFinType2"] <- Mode(total_clean$BsmtFinType2)

Now we will look at the other basement variables that still have missing data.

total_clean[(is.na(total_clean$BsmtFullBath)|is.na(total_clean$BsmtHalfBath)|is.na(total_clean$BsmtFinSF1)|is.na(total_clean$BsmtFinSF2)|is.na(total_clean$BsmtUnfSF)|is.na(total_clean$TotalBsmtSF)), c("BsmtQual", "BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF")]
##      BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121     None           NA           NA         NA         NA        NA
## 2189     None           NA           NA          0          0         0
##      TotalBsmtSF
## 2121          NA
## 2189           0

As you can see, all these variables’ NAs correspond to two observations, and neither of them has a basement. Thus, these values should be 0.

 total_clean[2121, c("BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF")] <- 0
 total_clean[2189, c("BsmtFullBath", "BsmtHalfBath")] <- 0

Masonry

Once again, there are 24 missing values in “MasVnrType”, while there are just 23 in “MasVnrArea”. We will try to figure out that extra missing value.

total_clean[is.na(total_clean$MasVnrType) & !is.na(total_clean$MasVnrArea), c("MasVnrType","MasVnrArea")]
##      MasVnrType MasVnrArea
## 2611       <NA>        198

As we don’t have enough information to impute a precise value, we will just impute the mode (excluding “None”).

total_clean[is.na(total_clean$MasVnrType) & !is.na(total_clean$MasVnrArea), "MasVnrType"] <- Mode(total_clean$MasVnrType[total_clean$MasVnrType!="None"])

Finally, for the remaining variables we will impute “None” in the case of “MasVnrType” and 0 in the case of “MasVnrArea”.

total_clean$MasVnrType[is.na(total_clean$MasVnrType)] <- "None"
total_clean$MasVnrArea[is.na(total_clean$MasVnrArea)] <- 0

Pool

Even though we have already imputed the missing values of the pool-related variables, since there is more than one pool variable we can cross-check the internal coherence of their values.

total_clean[total_clean$PoolQC == "None" & total_clean$PoolArea>0, c("PoolQC","PoolArea","OverallQual")]
##      PoolQC PoolArea OverallQual
## 2421   None      368           4
## 2504   None      444           6
## 2600   None      561           3

As we can see, there are 3 values that have been incorrectly imputed, so we will fix them. We will check whether the overall quality of the house is a good proxy for the pool quality.

table(total_clean$PoolQC,total_clean$OverallQual)
##       
##          1   2   3   4   5   6   7   8   9  10
##   Ex     0   0   0   0   0   0   0   2   0   2
##   Fa     0   0   0   0   0   1   1   0   0   0
##   Gd     0   0   0   0   0   2   1   0   0   1
##   None   4  13  40 226 825 728 598 340 107  28
chisq.test(total_clean$PoolQC , total_clean$OverallQual)
## Warning in chisq.test(total_clean$PoolQC, total_clean$OverallQual): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  total_clean$PoolQC and total_clean$OverallQual
## X-squared = 126.16, df = 27, p-value = 9.196e-15

As we have no better proxy, we will use the overall quality. We have to take into account that overall quality goes from 1 to 10, while PoolQC has just 5 categories. So, we will divide the overall quality by 2 so that both have the same scale, round it up and assign the corresponding category.

total_clean$PoolQC[2421] <- "Fa"
total_clean$PoolQC[2504] <- "TA"
total_clean$PoolQC[2600] <- "Fa"
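
Equivalently, the same rule could be applied programmatically rather than by hand. A small sketch (not part of the original kernel), assuming the PoolQC categories are ordered Po < Fa < TA < Gd < Ex:

# Map ceiling(OverallQual / 2) onto the ordered PoolQC categories
pool_levels <- c("Po", "Fa", "TA", "Gd", "Ex")
fix <- which(total_clean$PoolQC == "None" & total_clean$PoolArea > 0)
total_clean$PoolQC[fix] <- pool_levels[ceiling(total_clean$OverallQual[fix] / 2)]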

Lot

There are three Lot variables: “LotConfig”, “LotShape” and “LotFrontage”. The first two do not include “None” among their categories, so this time we have no straightforward way to estimate LotFrontage from them.

According to the variable descriptions, LotFrontage is the “Linear feet of street connected to property”. Thus, it makes sense that it changes with the neighborhood (more residential neighborhoods should have higher lot frontage). We will check whether this is right with a simple ANOVA.

summary(aov(LotFrontage ~ Neighborhood, total_clean))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Neighborhood   24 354702   14779   36.66 <2e-16 ***
## Residuals    2408 970700     403                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 486 observations deleted due to missingness
ggplot(total_clean, aes(Neighborhood,  LotFrontage)) + geom_bar(stat = "summary", fun.y = "median") + coord_flip() + labs(title="LotFrontage median by neighborhood", x="")
## Warning: Removed 486 rows containing non-finite values (stat_summary).

As there is indeed a significant variance, we will impute the median of the neighborhood to the observations with NA in their LotFrontage.

#Impute the median LotFrontage of the observation's neighborhood
for(i in 1:nrow(total_clean)){
  if(is.na(total_clean$LotFrontage[i])){
    total_clean$LotFrontage[i] <- median(total_clean$LotFrontage[total_clean$Neighborhood == total_clean$Neighborhood[i]], na.rm=TRUE)
  }
}

Zoning

According to the variable description, zoning refers to “the general zoning classification of the sale”. Thus, the zoning is supposed to be highly correlated with the neighborhood. We will check that.

chisq.test(total_clean$Neighborhood,total_clean$MSZoning)
## Warning in chisq.test(total_clean$Neighborhood, total_clean$MSZoning): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  total_clean$Neighborhood and total_clean$MSZoning
## X-squared = 5011.1, df = 96, p-value < 2.2e-16
table(total_clean$Neighborhood,total_clean$MSZoning)
##          
##           C (all)  FV  RH  RL  RM
##   Blmngtn       0   0   0  25   3
##   Blueste       0   0   0   0  10
##   BrDale        0   0   0   0  30
##   BrkSide       0   0   0  43  65
##   ClearCr       0   0   0  44   0
##   CollgCr       0   0   0 253  14
##   Crawfor       0   0   2  91  10
##   Edwards       0   0   2 180  12
##   Gilbert       0   0   0 165   0
##   IDOTRR       22   0   0   0  68
##   MeadowV       0   0   0   0  37
##   Mitchel       0   0   0 104   9
##   NAmes         0   0   7 436   0
##   NoRidge       0   0   0  71   0
##   NPkVill       0   0   0  23   0
##   NridgHt       0   0   0 165   1
##   NWAmes        0   0   0 131   0
##   OldTown       2   0   0  39 198
##   Sawyer        0   0   0 148   3
##   SawyerW       0   0   6 119   0
##   Somerst       0 139   0  43   0
##   StoneBr       0   0   0  51   0
##   SWISU         1   0   9  38   0
##   Timber        0   0   0  72   0
##   Veenker       0   0   0  24   0

As you can see, the assumption was right. We will therefore impute MSZoning according to the neighborhood. To do so, we first check the neighborhoods of the missing observations.

total_clean[is.na(total_clean$MSZoning), c("MSZoning", "Neighborhood")]
##      MSZoning Neighborhood
## 1916     <NA>       IDOTRR
## 2217     <NA>       IDOTRR
## 2251     <NA>       IDOTRR
## 2905     <NA>      Mitchel

Now we simply impute the values just by looking at the table above.

total_clean$MSZoning[c(1916,2217,2251)] <- "RM"
total_clean$MSZoning[2905] <- "RL"

Exterior

There are four variables regarding the exterior: “Exterior1st”, “Exterior2nd”, “ExterCond” and “ExterQual”. The first two refer to the material covering the house and the other two to how well it is preserved. Let’s check those variables to see if we have any clue.

total_clean[is.na(total_clean$Exterior1st)|is.na(total_clean$Exterior2nd), c("Exterior1st","Exterior2nd","ExterCond","ExterQual","Neighborhood")]
##      Exterior1st Exterior2nd ExterCond ExterQual Neighborhood
## 2152        <NA>        <NA>        TA        TA      Edwards

As we can see, both NAs refer to the same observation and both condition variables are TA. I suspect that the condition of the exterior might be correlated with the material, as some materials might be better than others and thus hold up better.

Besides, we will check whether both materials tend to be the same, so that we only have to analyze one variable.

table(total_clean$Exterior1st, total_clean$Exterior2nd)[1:10,1:10]
##          
##           AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc
##   AsbShng      35       0       0       0      0       1       0       0
##   AsphShn       0       2       0       0      0       0       0       0
##   BrkComm       0       0       4       0      0       0       0       0
##   BrkFace       1       0       0      44      0       0       3       0
##   CBlock        0       0       0       0      1       0       0       0
##   CemntBd       0       0       0       0      0     124       0       0
##   HdBoard       0       1       0       1      0       0     383       6
##   ImStucc       0       0       0       0      0       0       0       1
##   MetalSd       0       1       0       0      1       0       3       0
##   Plywood       0       0      18       0      1       0       6       4
##          
##           MetalSd Other
##   AsbShng       0     0
##   AsphShn       0     0
##   BrkComm       0     0
##   BrkFace       3     0
##   CBlock        0     0
##   CemntBd       0     0
##   HdBoard       1     0
##   ImStucc       0     0
##   MetalSd     437     0
##   Plywood       0     0

As we can see, in most cases both materials are the same. Now, let’s guess the material by considering the neighborhood.

table(total_clean$Exterior1st[total_clean$Neighborhood == "Edwards"], total_clean$ExterQual[total_clean$Neighborhood == "Edwards"])
##          
##           Ex Fa Gd TA
##   AsbShng  0  0  0  7
##   AsphShn  0  0  0  1
##   BrkComm  0  0  0  2
##   BrkFace  0  0  0  4
##   CemntBd  3  0  0  0
##   HdBoard  0  0  4 17
##   MetalSd  0  1  0 36
##   Plywood  0  0  1 15
##   Stucco   1  0  0  5
##   VinylSd  0  0  9 29
##   Wd Sdng  0  1  1 43
##   WdShing  0  0  0 13

As we can see, most of the houses in that neighborhood that have a “TA” Exterior Quality have “Wd Sdng” as Exterior material. Thus, we will impute this.

total_clean[is.na(total_clean$Exterior1st)|is.na(total_clean$Exterior2nd), c("Exterior1st","Exterior2nd")] <- "Wd Sdng"

Remaining variables

Utilities Let’s see which is the most common type of utility.

summary(as.factor(total_clean$Utilities))
## AllPub NoSeWa   NA's 
##   2916      1      2

As we can see, virtually all observations have the exact same value. Thus, this variable carries no useful information and we should drop it. As this is something we haven’t done in Dataset 1 (see the other tab), I won’t change it there.

total_clean$Utilities <- NULL

Functional If you read the variable description it says “Home functionality (Assume typical unless deductions are warranted)”. Thus, we will consider the two NAs as typical.

total_clean$Functional[is.na(total_clean$Functional)] <- "Typ"

Electrical It refers to the electrical system of the house. I suspect that all will have similar electrical systems.

summary(as.factor(total_clean$Electrical))
## FuseA FuseF FuseP   Mix SBrkr  NA's 
##   188    50     8     1  2671     1

As the vast majority of houses have the same electrical system, we will impute that value (the mode).

total_clean$Electrical[is.na(total_clean$Electrical)] <- "SBrkr"

Kitchen Quality As we have done before, we will use the overall quality of the house as a proxy variable for the quality of the kitchen.

total_clean[is.na(total_clean$KitchenQual), c("OverallQual")]
## [1] 5
total_clean$KitchenQual[is.na(total_clean$KitchenQual)] <- "TA"

Sale Type As we have no good way to evaluate the type of sale, we will simply impute the mode of the variable.

total_clean$SaleType[is.na(total_clean$SaleType)] <- Mode(total_clean$SaleType)

Finally, we will check that everything is OK.

withNA3 <- which(colSums(is.na(total_clean)) > 0)
sort(colSums(sapply(total_clean[withNA3], is.na)), decreasing = TRUE)
## SalePrice 
##      1459

Data Encoding

Data relabeling

By looking at the variable descriptions you will see that most categorical variables have a very similar encoding: Ex, Gd, TA, Fa, Po and NA (now “None”). As they all have the same levels, I will call them “standardized”. The variables with this encoding are ordinal. However, as most ML algorithms work with numerical variables, we will coerce them into integers (from 0 to 5).

In this sense, we distinguish three types of character variables:

  1. Variables with standardized ordinal values
  2. Non-standardized ordinal variables
  3. Nominal variables

We will split the relabeling into these three categories:

Standardized Ordinal Variables

These refer to the condition of features that not all houses have (such as the pool), as well as features that every house has (such as the kitchen). By checking the variable descriptions you will see that these variables are the following: BsmtQual, BsmtCond, FireplaceQu, GarageQual, GarageCond, PoolQC, ExterQual, ExterCond, HeatingQC, KitchenQual.

For both datasets, we will relabel these variables, considering “None” as 0 and “Ex” as 5. To save some space I will only show the code for one dataset.

relabel <- c("BsmtQual","BsmtCond","FireplaceQu","GarageQual","GarageCond","PoolQC","ExterQual","ExterCond","HeatingQC","KitchenQual")
quality <- c("None" = 0, "Po" = 1, "Fa"= 2, "TA" = 3, "Gd" = 4, "Ex" = 5) 

# Recode every "standardized" ordinal variable to integers from 0 (None) to 5 (Ex)
for(x in relabel){
  total_clean[, x] <- as.integer(revalue(total_clean[, x], quality))
}

We will check whether the process has been carried out correctly by looking at the levels of one of the variables we have changed.

levels(as.factor(total_clean$BsmtQual))
## [1] "0" "2" "3" "4" "5"

Non-standardized ordinal variables

In this case, as they don’t follow the same pattern, we will relabel them one by one. The variables to be revalued can be found by looking at the variable descriptions. Again, we will undertake the process for both datasets, but I will only show one.

#Street
total_noclean$Street <- as.integer(revalue(total_noclean$Street,c("Pave" = 0,"Grvl" = 1)))

#Alley
total_noclean$Alley <- as.integer(revalue(total_noclean$Alley,c("None" = 0,"Pave" = 1,"Grvl" = 2)))

#LotShape
total_noclean$LotShape <- as.integer(revalue(total_noclean$LotShape,c("IR3" = 0,"IR2" = 1,"IR1" = 2,"Reg" = 3)))

#BsmtExposure
total_noclean$BsmtExposure <- as.integer(revalue(total_noclean$BsmtExposure,c("None" = 0, "No" = 1, "Mn" = 2, "Av" = 3, "Gd" = 4)))

#BsmtFinType1
total_noclean$BsmtFinType1 <- as.integer(revalue(total_noclean$BsmtFinType1,c("None" = 0, "Unf" = 1, "LwQ" = 2, "Rec" = 3, "BLQ" = 4, "ALQ" = 5, "GLQ" = 6)))

#BsmtFinType2
total_noclean$BsmtFinType2 <- as.integer(revalue(total_noclean$BsmtFinType2,c("None" = 0, "Unf" = 1, "LwQ" = 2, "Rec" = 3, "BLQ" = 4, "ALQ" = 5, "GLQ" = 6)))

#CentralAir
total_noclean$CentralAir <- as.integer(revalue(total_noclean$CentralAir,c("N" = 0, "Y" = 1)))

#Functional
total_noclean$Functional <- as.integer(revalue(total_noclean$Functional,c("Sal" = 0, "Sev" = 1, "Maj2" = 2, "Maj1" = 3, "Mod" = 4, "Min2" = 5, "Min1" = 6, "Typ" = 7)))
## The following `from` values were not present in `x`: Sal
#GarageFinish
total_noclean$GarageFinish <- as.integer(revalue(total_noclean$GarageFinish,c("None"= 0, "Unf" = 1, "RFn" = 2, "Fin" = 3)))

#PavedDrive
total_noclean$PavedDrive <- as.integer(revalue(total_noclean$PavedDrive,c("N"= 0, "P" = 1, "Y" = 2)))

Nominal variables

As there should no longer be any character variables that are actually ordinal, we will convert all remaining character variables into factors.

#Index of non-numeric (character) variables
character <- which(sapply(total_clean, is.character))

#Convert every remaining character variable into a factor
for(i in character){
  total_clean[, i] <- as.factor(total_clean[, i])
}

total_clean$SalePrice <- as.numeric(total_clean$SalePrice)

To ensure everything is OK, we will check whether any character variable remains.

which(sapply(total_clean, is.character))
## named integer(0)

Data Structure

We can finally analyze the structure of the data.

str(total_clean)
## 'data.frame':    2919 obs. of  79 variables:
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : num  65 80 68 60 84 85 75 80 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alley        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LotShape     : int  3 3 2 2 2 2 3 2 3 3 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : int  4 3 4 3 4 3 4 3 3 3 ...
##  $ ExterCond    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : int  4 4 4 3 4 4 5 4 3 3 ...
##  $ BsmtCond     : int  3 3 3 4 3 3 3 3 3 3 ...
##  $ BsmtExposure : int  1 4 2 1 3 1 3 2 1 1 ...
##  $ BsmtFinType1 : int  6 5 6 5 6 6 6 5 1 6 ...
##  $ BsmtFinSF1   : num  706 978 486 216 655 ...
##  $ BsmtFinType2 : int  1 1 1 1 1 1 1 4 1 1 ...
##  $ BsmtFinSF2   : num  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num  856 1262 920 756 1145 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : int  5 5 5 4 5 5 5 5 4 5 ...
##  $ CentralAir   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : num  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : int  4 3 4 4 4 3 4 3 3 3 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : int  7 7 7 7 7 7 7 7 6 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : int  0 3 3 4 3 0 4 3 3 3 ...
##  $ GarageType   : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : int  2 2 2 1 2 1 2 2 1 2 ...
##  $ GarageCars   : num  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : int  3 3 3 3 3 3 3 3 2 4 ...
##  $ GarageCond   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ PavedDrive   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fence        : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
##  $ MiscFeature  : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : num  208500 181500 223500 140000 250000 ...

As we can see, there are two variables that are really categorical but are stored as integers: the month and year sold. We will convert the month now; the year sold will be converted to a factor after we use it to compute the age of the house in the feature engineering section.

total_clean$MoSold <- as.factor(total_clean$MoSold)
total_noclean$MoSold <- as.factor(total_noclean$MoSold)

Feature Engineering

Note. In order to show the importance of feature engineering I will only undertake feature engineering changes in the 2nd Dataset.

Age of the house

One of the most important things to consider when purchasing a house is its age. Newer houses are usually more expensive, because they are built with newer, unused materials. A remodeled house, on the other hand, is usually not as expensive as a new house, but it is more expensive than an old one, for the same reasons.

Thus, we will calculate the age of the house.

#We compute the age of the house as the years between the last remodel and the sale
total_clean$Age <- total_clean$YrSold - total_clean$YearRemodAdd

#Now that Age has been computed, we convert the year sold to a factor
total_clean$YrSold <- as.factor(total_clean$YrSold)
total_noclean$YrSold <- as.factor(total_noclean$YrSold)

We will check whether there is a negative relationship between the age of the house and its price.

ggplot(total_clean, aes(Age, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "lm", se=FALSE) + labs(title="House price by year", x ="House Age")
## Warning: Removed 1459 rows containing non-finite values (stat_smooth).
## Warning: Removed 1459 rows containing missing values (geom_point).

As you can see, generally the older the house, the lower the sale price.

Total Square feet

It is obvious that the bigger the house, the higher the price. However, we do not have a variable that summarizes the total square footage of the house, so we will create it.

total_clean$TotalSF <- total_clean$GrLivArea + total_clean$TotalBsmtSF + total_clean$OpenPorchSF + total_clean$WoodDeckSF + total_clean$EnclosedPorch + total_clean$X3SsnPorch + total_clean$ScreenPorch + total_clean$PoolArea

ggplot(total_clean, aes(TotalSF, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + labs(title="Houseprice by total square feet")
## Warning: Removed 1459 rows containing non-finite values (stat_smooth).
## Warning: Removed 1459 rows containing missing values (geom_point).

cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )
## [1] 0.7800884

As we can see, despite a couple of observations showing a strange pattern, the two variables do have a strong positive relationship. Thus, we will analyze those two cases and remove them if possible.

total_clean[total_clean$TotalSF>7500, c("GrLivArea","TotalBsmtSF","OpenPorchSF","WoodDeckSF","PoolArea","SalePrice") ]
##      GrLivArea TotalBsmtSF OpenPorchSF WoodDeckSF PoolArea SalePrice
## 524       4676        3138         406        208        0    184750
## 1183      4476        2396          78        171      555    745000
## 1299      5642        6110         292        214      480    160000
## 2550      5095        5095         484        546        0        NA

We see that the “problematic” values are observations 524 and 1299. We will analyze these observations in more detail, together with observation 2550, which we have to predict, just to check whether they have any similarities.

t(total_clean[c(524,1299,2550),])
##               524       1299      2550     
## MSSubClass    "60"      "60"      "20"     
## MSZoning      "RL"      "RL"      "RL"     
## LotFrontage   "130"     "313"     "128"    
## LotArea       "40094"   "63887"   "39290"  
## Street        "0"       "0"       "0"      
## Alley         "0"       "0"       "0"      
## LotShape      "2"       "0"       "2"      
## LandContour   "Bnk"     "Bnk"     "Bnk"    
## LotConfig     "Inside"  "Corner"  "Inside" 
## LandSlope     "Gtl"     "Gtl"     "Gtl"    
## Neighborhood  "Edwards" "Edwards" "Edwards"
## Condition1    "PosN"    "Feedr"   "Norm"   
## Condition2    "PosN"    "Norm"    "Norm"   
## BldgType      "1Fam"    "1Fam"    "1Fam"   
## HouseStyle    "2Story"  "2Story"  "1Story" 
## OverallQual   "10"      "10"      "10"     
## OverallCond   "5"       "5"       "5"      
## YearBuilt     "2007"    "2008"    "2008"   
## YearRemodAdd  "2008"    "2008"    "2009"   
## RoofStyle     "Hip"     "Hip"     "Hip"    
## RoofMatl      "CompShg" "ClyTile" "CompShg"
## Exterior1st   "CemntBd" "Stucco"  "CemntBd"
## Exterior2nd   "CmentBd" "Stucco"  "CmentBd"
## MasVnrType    "Stone"   "Stone"   "Stone"  
## MasVnrArea    " 762"    " 796"    "1224"   
## ExterQual     "5"       "5"       "5"      
## ExterCond     "3"       "3"       "3"      
## Foundation    "PConc"   "PConc"   "PConc"  
## BsmtQual      "5"       "5"       "5"      
## BsmtCond      "3"       "3"       "3"      
## BsmtExposure  "4"       "4"       "4"      
## BsmtFinType1  "6"       "6"       "6"      
## BsmtFinSF1    "2260"    "5644"    "4010"   
## BsmtFinType2  "1"       "1"       "1"      
## BsmtFinSF2    "0"       "0"       "0"      
## BsmtUnfSF     " 878"    " 466"    "1085"   
## TotalBsmtSF   "3138"    "6110"    "5095"   
## Heating       "GasA"    "GasA"    "GasA"   
## HeatingQC     "5"       "5"       "5"      
## CentralAir    "1"       "1"       "1"      
## Electrical    "SBrkr"   "SBrkr"   "SBrkr"  
## X1stFlrSF     "3138"    "4692"    "5095"   
## X2ndFlrSF     "1538"    " 950"    "   0"   
## LowQualFinSF  "0"       "0"       "0"      
## GrLivArea     "4676"    "5642"    "5095"   
## BsmtFullBath  "1"       "2"       "1"      
## BsmtHalfBath  "0"       "0"       "1"      
## FullBath      "3"       "2"       "2"      
## HalfBath      "1"       "1"       "1"      
## BedroomAbvGr  "3"       "3"       "2"      
## KitchenAbvGr  "1"       "1"       "1"      
## KitchenQual   "5"       "5"       "5"      
## TotRmsAbvGrd  "11"      "12"      "15"     
## Functional    "7"       "7"       "7"      
## Fireplaces    "1"       "3"       "2"      
## FireplaceQu   "4"       "4"       "4"      
## GarageType    "BuiltIn" "Attchd"  "Attchd" 
## GarageYrBlt   "2007"    "2008"    "2008"   
## GarageFinish  "3"       "3"       "3"      
## GarageCars    "3"       "2"       "3"      
## GarageArea    " 884"    "1418"    "1154"   
## GarageQual    "3"       "3"       "3"      
## GarageCond    "3"       "3"       "3"      
## PavedDrive    "2"       "2"       "2"      
## WoodDeckSF    "208"     "214"     "546"    
## OpenPorchSF   "406"     "292"     "484"    
## EnclosedPorch "0"       "0"       "0"      
## X3SsnPorch    "0"       "0"       "0"      
## ScreenPorch   "0"       "0"       "0"      
## PoolArea      "  0"     "480"     "  0"    
## PoolQC        "0"       "4"       "0"      
## Fence         "None"    "None"    "None"   
## MiscFeature   "None"    "None"    "None"   
## MiscVal       "    0"   "    0"   "17000"  
## MoSold        "10"      "1"       "10"     
## YrSold        "2007"    "2008"    "2007"   
## SaleType      "New"     "New"     "New"    
## SaleCondition "Partial" "Partial" "Partial"
## SalePrice     "184750"  "160000"  NA       
## Age           "-1"      " 0"      "-2"     
## TotalSF       " 8428"   "12738"   "11220"

If you look closer you will find some things in common:

  • They all share the same Neighborhood.
  • They all were sold before the house was finished (negative age).
  • The SaleCondition was “Partial”.

According to the data description, a Partial sale means that the “Home was not completed when last assessed”. Thus, the SalePrice of these observations may not represent the actual market value of the house, but rather the amount of money already paid in advance.

We should definitely remove these two observations, as they are outliers. However, they will be very useful for predicting the value of house 2550, as it has similar characteristics.

Let’s see how the data looks without these two observations.

total_clean <- total_clean[-c(524,1299),]

ggplot(total_clean, aes(TotalSF, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + labs(title="Houseprice by total square feet (adjusted)")
## Warning: Removed 1459 rows containing non-finite values (stat_smooth).
## Warning: Removed 1459 rows containing missing values (geom_point).

cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )
## [1] 0.8289224

Number of Bathrooms

Another interesting variable might be the number of bathrooms in the house. I have also thought about the total number of rooms (bathrooms + rooms), but that one is likely to be highly correlated with the total square footage.

In order to correctly count the number of bathrooms we have to add the basement bathrooms as well, since FullBath only counts “Full bathrooms above grade”. Half bathrooms count as 0.5.

total_clean$Bathrooms <- (total_clean$BsmtHalfBath + total_clean$HalfBath)*0.5 + total_clean$FullBath + total_clean$BsmtFullBath   

ggplot(total_clean, aes(Bathrooms, SalePrice)) + geom_jitter(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + scale_x_continuous(breaks=seq(1,6,0.5)) + labs(title = "Houseprice by Amount of bathrooms")
## Warning: Removed 1459 rows containing non-finite values (stat_smooth).
## Warning: Removed 1459 rows containing missing values (geom_point).

cor(total_clean$Bathrooms, total_clean$SalePrice, use = "complete.obs")
## [1] 0.6358963

Variable significance and correlations

Now we will try to identify, for each dataset, which the most relevant variables are. This will enable us to use just certain specific variables for some (not all) ML models, such as the linear model. We will undertake two different processes:

  • Random Forest: as it is a tree-based algorithm, it works with both categorical and numerical data. It will give us a glimpse of which the important variables are.
  • Correlation matrix: as the random forest does not show collinearity among variables, I will also build a correlation matrix. From it I will pick significant variables that are not highly correlated.

Variable significance: Random Forest

Dataset 2

As we have said, we will run a Random Forest to see which are the most important explanatory variables.

set.seed(1234)
train_forest <- total_clean %>%
  filter(!is.na(SalePrice))

random_forest <- randomForest(x= train_forest[,-which(names(train_forest) == "SalePrice")], y=train_forest$SalePrice, ntree=100,importance=TRUE)
random_forest <- importance(random_forest)
random_forest <- data.frame(Variables = row.names(random_forest), MSE = random_forest[,1])
random_forest <- random_forest[order(random_forest$MSE, decreasing = TRUE),]

par(mfrow=c(1,2))
ggplot(random_forest[1:30,], aes(x=reorder(Variables, MSE), y=MSE)) + geom_bar(stat = "identity") + labs(x = "Variable", y= "MSE") + coord_flip() + labs(title = "Variables Importance (MSE explanation)")

ggplot(random_forest[1:30,], aes(x=reorder(Variables, -MSE), y=MSE, group=1)) + geom_point() + geom_line()+ labs(x = "Variable", y= "MSE") + labs(title = "MSE explanation evolution") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

As we can see, the most important variables are TotalSF, Neighborhood, OverallQual, GrLivArea, BsmtFinSF1, Bathrooms, GarageArea and YearBuilt. However, some of these variables (such as TotalSF and BsmtFinSF1) are surely highly correlated. Thus, we now have to analyze the correlations among variables.

Dataset 1.

In this case we will also run a Random Forest to analyze which are the most important variables.

set.seed(1234)
train_forest <- total_noclean %>%
  filter(!is.na(SalePrice))

random_forest <- randomForest(x= train_forest[,-which(names(train_forest) == "SalePrice")], y=train_forest$SalePrice, ntree=100,importance=TRUE)
random_forest <- importance(random_forest)
random_forest <- data.frame(Variables = row.names(random_forest), MSE = random_forest[,1])
random_forest <- random_forest[order(random_forest$MSE, decreasing = TRUE),]

par(mfrow=c(1,2))
ggplot(random_forest[1:30,], aes(x=reorder(Variables, MSE), y=MSE)) + geom_bar(stat = "identity") + labs(x = "Variable", y= "MSE") + coord_flip() + labs(title = "Variables Importance (MSE explanation)")

ggplot(random_forest[1:30,], aes(x=reorder(Variables, -MSE), y=MSE, group=1)) + geom_point() + geom_line()+ labs(x = "Variable", y= "MSE") + labs(title = "MSE explanation evolution") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

It is obvious that in this case the relevant variables have changed.

Variable correlations

Dataset 2.

In order to build the correlation matrix, it is good to first consider a couple of things:

  • Pearson correlation only works with numerical data. Thus, I will first have to build a data frame containing only the numeric variables.
  • Pearson correlation only measures the strength of the linear relationship between variables. There might be some other type of relationship (non-linear) that we simply won’t see.

First, we have to split the data and just get numeric variables.

# We create a new df for numeric variables
total_clean_numeric_index <- names(which(sapply(total_clean, is.numeric)))
total_clean_numeric <- total_clean[, which(names(total_clean) %in% total_clean_numeric_index)]

#We calculate correlations and filter for just  those higher than |0.5| and pick the names
correlations <- as.matrix(x = sort(cor(total_clean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names <- names(which(apply(correlations,1, function(x) abs(x)>0.5))) 

#We sort the dataset to just show those variables
total_clean_numeric <- total_clean_numeric[, names]

#We create and represent the correlations matrix
correlations <- cor(total_clean_numeric, use="pairwise.complete.obs")
cor.plot(correlations, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)

By analyzing the matrix we realize the following:

  • TotalSF is highly correlated with GrLivArea (0.86), TotalBsmtSF (0.8) and X1stFlrSF (0.77). We pick TotalSF.
  • OverallQual is highly correlated with ExterQual (0.73) and KitchenQual (0.67). We pick OverallQual.
  • GarageCars is highly correlated with GarageArea (0.89). We keep GarageCars.
  • Bathrooms is highly correlated with FullBath (0.71). We keep Bathrooms.
  • TotRmsAbvGrd is highly correlated with GrLivArea. As we have already dropped that one, we keep TotRmsAbvGrd.
  • YearRemodAdd is almost perfectly (negatively) correlated with Age, by construction. However, it has lower correlations with the other variables, so we keep YearRemodAdd.

Finally, we will create a data frame that keeps only the “important” variables, for those models that do not cope well with multicollinearity (such as the linear model).

important <- c("SalePrice","TotalSF","OverallQual","Bathrooms","GarageCars","YearBuilt","BsmtQual","GarageFinish","GarageYrBlt","FireplaceQu","YearRemodAdd", "Neighborhood")

total_models_clean <- total_clean[, which(names(total_clean) %in% important)]

Dataset 1

This time we will undertake the same exact process.

# We create a new df for numeric variables
total_noclean_numeric_index <- names(which(sapply(total_noclean, is.numeric)))
total_noclean_numeric <- total_noclean[, which(names(total_noclean) %in% total_noclean_numeric_index)]

#We calculate correlations and filter for just  those higher than |0.5| and pick the names
correlations2 <- as.matrix(x = sort(cor(total_noclean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names2 <- names(which(apply(correlations2,1, function(x) abs(x)>0.5))) 

#We sort the dataset to just show those variables
total_noclean_numeric <- total_noclean_numeric[, names2]

#We create and represent the correlations matrix
correlations2 <- cor(total_noclean_numeric, use="pairwise.complete.obs")
cor.plot(correlations2, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)

In this case we will get rid of the following variables:

  • ExterQual: highly correlated with OverallQual (0.73).
  • TotRmsAbvGrd: highly correlated with GrLivArea (0.81).
  • GarageArea: highly correlated with GarageCars (0.89).
  • X1stFlrSF: highly correlated with TotalBsmtSF (0.8).

Thus we will keep the following variables:

important2 <- c("SalePrice","OverallQual","GrLivArea","KitchenQual","GarageCars","TotalBsmtSF","FullBath","BsmtQual","GarageFinish","YearBuilt","FireplaceQu","YearRemodAdd", "Neighborhood")

total_models_noclean <- total_noclean[, which(names(total_noclean) %in% important2)]

Data Preparation

Treating the Skewness

Note: removing skewness is recommended but not indispensable. Thus, this process will not be undertaken in Dataset 1, in keeping with its lazy approach. As I said at the beginning, the idea is to show how important it is not just to run the correct model, but also to prepare the data properly.

Now that we know which variables are the important ones, we will prepare them by finding outliers and removing skewness.

Why is it important to fix the skewness? Well, some ML algorithms are based on the assumption of homoscedasticity, which means that all errors have the same variance. As explained on StackExchange: “the difference between what you predict Ŷ and the true values Y should be constant. You can ensure that by making sure that Y follows a Gaussian distribution”.

numeric <- c("SalePrice","TotalSF","YearBuilt","GarageYrBlt","YearRemodAdd")
total_model_clean_numeric <- total_models_clean[,names(total_models_clean)%in% numeric]

Now that we have the numeric variables, we will simply write a loop that computes the skewness of each one and applies a logarithmic transformation if the skewness is higher than 0.75, which is considered moderately high. Explanation here.

for(i in 1:ncol(total_model_clean_numeric)){
        if (abs(skew(total_model_clean_numeric[,i]))>0.75){
                total_model_clean_numeric[,i] <- log(total_model_clean_numeric[,i] +1)
        }
}

We will check and see if any variable has been modified.

head(total_model_clean_numeric,10)
##    YearBuilt YearRemodAdd GarageYrBlt SalePrice  TotalSF
## 1       2003         2003        2003  12.24770 7.873978
## 2       1976         1976        1976  12.10902 7.945555
## 3       2001         2002        2001  12.31717 7.918992
## 4       1915         1970        1998  11.84940 7.930566
## 5       2000         2000        2000  12.42922 8.194229
## 6       1993         1995        1993  11.87061 7.843456
## 7       2004         2005        2004  12.63461 8.214194
## 8       1973         1973        1973  12.20608 8.259717
## 9       1931         1950        1931  11.77453 8.013674
## 10      1939         1950        1939  11.67845 7.636752

As you can see, SalePrice has been transformed. Thus, we will have to back-transform the predictions we make. Just so you can see it, we will check how the skewness of the target variable has changed.
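
Since the transformation applied was log(SalePrice + 1), any prediction made on that scale has to be converted back to dollars before interpreting or submitting it. A minimal sketch (not part of the original code; pred_log stands in for a hypothetical vector of predictions on the log scale):

pred_log <- rep(12, 3)           # hypothetical predictions on the log scale
pred_price <- exp(pred_log) - 1  # undo log(x + 1) to recover prices in dollars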

qqnorm(total_clean$SalePrice, main = "Skewness of data without transformation (SalePrice)")
qqline(total_clean$SalePrice)

qqnorm(total_model_clean_numeric$SalePrice, main = "Skewness of data transformed (SalePrice)")
qqline(total_model_clean_numeric$SalePrice)

Normalization and standardization

It is important to normalize the data so that the model is not affected by the scales of the variables. You can find further explanation [here](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029).

Obviously, we will only scale numeric data.

SalePrice <- total_model_clean_numeric$SalePrice
total_model_clean_numeric <- sapply(total_model_clean_numeric[, -which(names(total_model_clean_numeric) %in% "SalePrice")], scale)

total_model_clean_numeric <- cbind(SalePrice,total_model_clean_numeric)
total_model_clean_numeric <- as.data.frame(total_model_clean_numeric)

Variable Dummification

Dataset 2

For those variables that are not actually numeric, we will create dummy variables, which basically means one variable for each level of each original variable, with two possible values: 1 (Yes) and 0 (No).

To do so, we first have to coerce all categorical variables that are encoded as numeric into factors.

total_models_clean$OverallQual <- as.factor(total_models_clean$OverallQual)
total_models_clean$BsmtQual <- as.factor(total_models_clean$BsmtQual)
total_models_clean$FireplaceQu <- as.factor(total_models_clean$FireplaceQu)
total_models_clean$GarageFinish <- as.factor(total_models_clean$GarageFinish)
total_models_clean$GarageCars <- as.factor(total_models_clean$GarageCars)
total_models_clean$Bathrooms <- as.factor(total_models_clean$Bathrooms)

Now we will make all categorical variables dummy variables. This will make the dimensions of the dataset grow considerably.

categorical <- total_models_clean[!colnames(total_models_clean) %in% numeric]
dummy <- as.data.frame(model.matrix(~.-1, categorical))
total_final_2 <- cbind(SalePrice = total_model_clean_numeric$SalePrice, dummy)

Now we will get rid of the dummy variables that appear fewer than 5 times in the train dataset, along with those that do not appear at all in the test dataset. We do so because fewer than 5 occurrences of a level is not enough data to make a good prediction.

names1 <- names(colSums(total_final_2[!is.na(total_final_2$SalePrice),])[colSums(total_final_2[!is.na(total_final_2$SalePrice),])<5])

names2 <- names(colSums(total_final_2[is.na(total_final_2$SalePrice), -which(names(total_final_2) == "SalePrice") ])[colSums(total_final_2[is.na(total_final_2$SalePrice),-which(names(total_final_2) == "SalePrice")])==0])

delete <- c(names1,names2)

Now we will drop these columns and bind the remaining ones with the normalized variables.

total_final_2 <- total_final_2[, -which(names(total_final_2)%in% c(delete, "SalePrice"))]
total_final_2 <- cbind(total_model_clean_numeric, total_final_2)

Finally we will split data into train and test, and the response and explanatory variables.

train_2 <- total_final_2[!is.na(total_final_2$SalePrice),]
train_2_explanatory <- train_2[, -which(names(train_2) %in% "SalePrice")]
train_2_response <- train_2[, which(names(train_2) %in% "SalePrice")]

test_2 <- total_final_2[is.na(total_final_2$SalePrice),]
test_2_explanatory <- test_2[, -which(names(test_2) %in% "SalePrice")]

Dataset 1

Once again, we begin by coercing categorical data that are encoded as numeric into factors.

total_models_noclean$OverallQual <- as.factor(total_models_noclean$OverallQual)
total_models_noclean$KitchenQual <- as.factor(total_models_noclean$KitchenQual)
total_models_noclean$GarageCars <- as.factor(total_models_noclean$GarageCars)
total_models_noclean$FullBath <- as.factor(total_models_noclean$FullBath)
total_models_noclean$BsmtQual <- as.factor(total_models_noclean$BsmtQual)
total_models_noclean$GarageFinish <- as.factor(total_models_noclean$GarageFinish)
total_models_noclean$FireplaceQu <- as.factor(total_models_noclean$FireplaceQu)

Now we create the dummy variables.

categorical <- select_if(total_models_noclean, is.factor)
total_model_noclean_numeric <- total_models_noclean[, -which(colnames(total_models_noclean) %in% names(categorical))] 
dummy <- as.data.frame(model.matrix(~.-1, categorical))
total_final_1 <- cbind(total_model_noclean_numeric, dummy)

Now we collect the names of the dummy variables that do not appear a significant amount of times in train or that do not appear at all in test.

names1 <- names(colSums(total_final_1[!is.na(total_final_1$SalePrice),])[colSums(total_final_1[!is.na(total_final_1$SalePrice),])<5])

names2 <- names(colSums(total_final_1[is.na(total_final_1$SalePrice), -which(names(total_final_1) == "SalePrice") ])[colSums(total_final_1[is.na(total_final_1$SalePrice),-which(names(total_final_1) == "SalePrice")])==0])

delete <- c(names1,names2)

We delete these variables.

total_final_1 <- total_final_1[, -which(names(total_final_1)%in% delete)]

Finally we will split data into train and test, and the response and explanatory variables.

train_1 <- total_final_1[!is.na(total_final_1$SalePrice),]
train_1_explanatory <- train_1[, -which(names(train_1) %in% "SalePrice")]
train_1_response <- train_1[, which(names(train_1) %in% "SalePrice")]

test_1 <- total_final_1[is.na(total_final_1$SalePrice),]
test_1_explanatory <- test_1[, -which(names(test_1) %in% "SalePrice")]

10 ML models for predicting House Price

Model 1: Linear Model

We begin with a very simple model: linear regression. As you know, it basically finds the line that minimizes the squared errors (and therefore the Root Mean Square Error).

Model with raw data (Dataset 1)

model_lm_1 <- lm(SalePrice~., data=train_1)
prediction_1_lm <-predict(model_lm_1,test_1_explanatory)
summary(model_lm_1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -368648  -13028     394   12148  217535 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.686e+05  1.711e+05  -3.907 9.78e-05 ***
## YearBuilt            1.056e+02  6.970e+01   1.515 0.130072    
## YearRemodAdd         2.377e+02  5.953e+01   3.993 6.85e-05 ***
## TotalBsmtSF          1.374e+01  2.820e+00   4.873 1.23e-06 ***
## GrLivArea            3.791e+01  2.753e+00  13.771  < 2e-16 ***
## NeighborhoodBlmngtn  6.840e+03  2.343e+04   0.292 0.770326    
## NeighborhoodBrDale   6.333e+03  2.357e+04   0.269 0.788217    
## NeighborhoodBrkSide  2.621e+04  2.267e+04   1.156 0.247823    
## NeighborhoodClearCr  5.341e+04  2.286e+04   2.336 0.019627 *  
## NeighborhoodCollgCr  3.471e+04  2.231e+04   1.556 0.119976    
## NeighborhoodCrawfor  5.520e+04  2.264e+04   2.437 0.014914 *  
## NeighborhoodEdwards  1.188e+04  2.240e+04   0.530 0.596127    
## NeighborhoodGilbert  2.706e+04  2.225e+04   1.216 0.224196    
## NeighborhoodIDOTRR   9.059e+03  2.293e+04   0.395 0.692883    
## NeighborhoodMeadowV -4.516e+02  2.339e+04  -0.019 0.984598    
## NeighborhoodMitchel  2.400e+04  2.256e+04   1.064 0.287587    
## NeighborhoodNAmes    2.796e+04  2.228e+04   1.255 0.209710    
## NeighborhoodNoRidge  8.081e+04  2.278e+04   3.548 0.000401 ***
## NeighborhoodNPkVill  1.765e+04  2.435e+04   0.725 0.468754    
## NeighborhoodNridgHt  4.951e+04  2.269e+04   2.182 0.029260 *  
## NeighborhoodNWAmes   2.996e+04  2.234e+04   1.341 0.180081    
## NeighborhoodOldTown  1.040e+04  2.258e+04   0.461 0.645124    
## NeighborhoodSawyer   2.526e+04  2.247e+04   1.124 0.261001    
## NeighborhoodSawyerW  3.197e+04  2.245e+04   1.424 0.154722    
## NeighborhoodSomerst  3.646e+04  2.248e+04   1.622 0.105108    
## NeighborhoodStoneBr  6.706e+04  2.311e+04   2.902 0.003766 ** 
## NeighborhoodSWISU    1.723e+04  2.323e+04   0.742 0.458427    
## NeighborhoodTimber   4.049e+04  2.270e+04   1.784 0.074675 .  
## NeighborhoodVeenker  6.239e+04  2.399e+04   2.601 0.009395 ** 
## OverallQual3         4.872e+03  1.675e+04   0.291 0.771186    
## OverallQual4         1.676e+04  1.553e+04   1.079 0.280559    
## OverallQual5         1.922e+04  1.562e+04   1.231 0.218577    
## OverallQual6         2.439e+04  1.571e+04   1.552 0.120816    
## OverallQual7         3.643e+04  1.593e+04   2.287 0.022334 *  
## OverallQual8         5.341e+04  1.629e+04   3.278 0.001070 ** 
## OverallQual9         9.361e+04  1.740e+04   5.380 8.71e-08 ***
## OverallQual10        1.145e+05  1.881e+04   6.086 1.49e-09 ***
## BsmtQual2            4.146e+03  8.112e+03   0.511 0.609383    
## BsmtQual3            6.458e+03  6.292e+03   1.026 0.304869    
## BsmtQual4            9.471e+03  6.656e+03   1.423 0.154998    
## BsmtQual5            2.966e+04  7.898e+03   3.755 0.000181 ***
## FullBath1           -5.428e+03  1.116e+04  -0.486 0.626868    
## FullBath2           -8.231e+03  1.121e+04  -0.734 0.462921    
## FullBath3            3.167e+04  1.276e+04   2.482 0.013192 *  
## KitchenQual3         4.986e+03  5.592e+03   0.892 0.372733    
## KitchenQual4         1.068e+04  6.082e+03   1.755 0.079455 .  
## KitchenQual5         2.971e+04  7.380e+03   4.027 5.96e-05 ***
## FireplaceQu1         3.442e+03  7.136e+03   0.482 0.629671    
## FireplaceQu2         1.049e+04  5.666e+03   1.851 0.064341 .  
## FireplaceQu3         9.898e+03  2.599e+03   3.808 0.000146 ***
## FireplaceQu4         9.679e+03  2.486e+03   3.893 0.000104 ***
## FireplaceQu5         1.813e+04  7.079e+03   2.561 0.010532 *  
## GarageFinish1        3.588e+04  1.457e+04   2.462 0.013937 *  
## GarageFinish2        4.082e+04  1.474e+04   2.769 0.005699 ** 
## GarageFinish3        4.456e+04  1.470e+04   3.031 0.002479 ** 
## GarageCars1         -2.714e+04  1.422e+04  -1.909 0.056504 .  
## GarageCars2         -2.172e+04  1.414e+04  -1.536 0.124835    
## GarageCars3          2.617e+03  1.447e+04   0.181 0.856530    
## GarageCars4                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30860 on 1402 degrees of freedom
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.8491 
## F-statistic:   145 on 57 and 1402 DF,  p-value: < 2.2e-16
# write.csv(prediction_1_lm,"prediction_1_lm.csv")
# Result: 0.15878

Model with cleaned Data (Dataset 2)

model_lm_2 <- lm(SalePrice~., data=train_2)
prediction_2_lm <-predict(model_lm_2,test_2_explanatory)
summary(model_lm_2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81129 -0.06243  0.00746  0.07661  0.42673 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.3505161  0.1207881  93.970  < 2e-16 ***
## YearBuilt            0.0297172  0.0092620   3.208 0.001365 ** 
## YearRemodAdd         0.0477534  0.0051696   9.237  < 2e-16 ***
## GarageYrBlt          0.0007018  0.0070044   0.100 0.920202    
## TotalSF              0.1448538  0.0064208  22.560  < 2e-16 ***
## NeighborhoodBlmngtn  0.0657802  0.1004570   0.655 0.512698    
## NeighborhoodBrDale  -0.0032878  0.1009943  -0.033 0.974035    
## NeighborhoodBrkSide  0.1969100  0.0971688   2.026 0.042906 *  
## NeighborhoodClearCr  0.2917663  0.0980998   2.974 0.002988 ** 
## NeighborhoodCollgCr  0.1812137  0.0955757   1.896 0.058163 .  
## NeighborhoodCrawfor  0.3280792  0.0969032   3.386 0.000730 ***
## NeighborhoodEdwards  0.1208111  0.0960602   1.258 0.208724    
## NeighborhoodGilbert  0.1548761  0.0955336   1.621 0.105207    
## NeighborhoodIDOTRR   0.0131414  0.0982517   0.134 0.893618    
## NeighborhoodMeadowV -0.0203047  0.1001542  -0.203 0.839372    
## NeighborhoodMitchel  0.1391946  0.0968279   1.438 0.150786    
## NeighborhoodNAmes    0.1860144  0.0954955   1.948 0.051628 .  
## NeighborhoodNoRidge  0.2631891  0.0974714   2.700 0.007014 ** 
## NeighborhoodNPkVill  0.0919997  0.1042572   0.882 0.377696    
## NeighborhoodNridgHt  0.2017102  0.0972665   2.074 0.038281 *  
## NeighborhoodNWAmes   0.1622412  0.0958265   1.693 0.090664 .  
## NeighborhoodOldTown  0.0889978  0.0968233   0.919 0.358161    
## NeighborhoodSawyer   0.1514611  0.0963156   1.573 0.116049    
## NeighborhoodSawyerW  0.1525022  0.0961973   1.585 0.113123    
## NeighborhoodSomerst  0.1919414  0.0964127   1.991 0.046694 *  
## NeighborhoodStoneBr  0.2589273  0.0991541   2.611 0.009115 ** 
## NeighborhoodSWISU    0.1635224  0.0996216   1.641 0.100932    
## NeighborhoodTimber   0.1949059  0.0973337   2.002 0.045429 *  
## NeighborhoodVeenker  0.2780633  0.1027745   2.706 0.006902 ** 
## OverallQual3         0.1184177  0.0685881   1.727 0.084477 .  
## OverallQual4         0.2602259  0.0634072   4.104 4.29e-05 ***
## OverallQual5         0.3151594  0.0634709   4.965 7.70e-07 ***
## OverallQual6         0.3526653  0.0640632   5.505 4.39e-08 ***
## OverallQual7         0.4212284  0.0652527   6.455 1.49e-10 ***
## OverallQual8         0.5004081  0.0670072   7.468 1.43e-13 ***
## OverallQual9         0.6373284  0.0717545   8.882  < 2e-16 ***
## OverallQual10        0.7512531  0.0775695   9.685  < 2e-16 ***
## BsmtQual2           -0.1383783  0.0340579  -4.063 5.11e-05 ***
## BsmtQual3           -0.1209535  0.0259766  -4.656 3.53e-06 ***
## BsmtQual4           -0.1163078  0.0276947  -4.200 2.84e-05 ***
## BsmtQual5           -0.0700060  0.0330786  -2.116 0.034491 *  
## FireplaceQu1         0.0189192  0.0305241   0.620 0.535482    
## FireplaceQu2         0.0319677  0.0243722   1.312 0.189855    
## FireplaceQu3         0.0448819  0.0112083   4.004 6.55e-05 ***
## FireplaceQu4         0.0620783  0.0106869   5.809 7.78e-09 ***
## FireplaceQu5         0.0711524  0.0302786   2.350 0.018915 *  
## GarageFinish1        0.2047354  0.0638481   3.207 0.001374 ** 
## GarageFinish2        0.2222887  0.0645712   3.443 0.000593 ***
## GarageFinish3        0.2320994  0.0643741   3.605 0.000323 ***
## GarageCars1         -0.0965111  0.0625314  -1.543 0.122959    
## GarageCars2         -0.0633508  0.0622048  -1.018 0.308653    
## GarageCars3          0.0052211  0.0636209   0.082 0.934606    
## GarageCars4                 NA         NA      NA       NA    
## Bathrooms1.5         0.0642993  0.0159314   4.036 5.73e-05 ***
## Bathrooms2           0.0424251  0.0126571   3.352 0.000824 ***
## Bathrooms2.5         0.0827956  0.0157176   5.268 1.60e-07 ***
## Bathrooms3           0.0884189  0.0174060   5.080 4.29e-07 ***
## Bathrooms3.5         0.1535546  0.0190988   8.040 1.90e-15 ***
## Bathrooms4           0.2618792  0.0406901   6.436 1.68e-10 ***
## Bathrooms4.5         0.2870511  0.0657238   4.368 1.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.132 on 1399 degrees of freedom
## Multiple R-squared:  0.8953, Adjusted R-squared:  0.891 
## F-statistic: 206.3 on 58 and 1399 DF,  p-value: < 2.2e-16
prediction_2_lm <- exp(prediction_2_lm)
write.csv(prediction_2_lm,"prediction_2_lm.csv")
# Result: 0.14711

Model 2: Lasso Regression

The good thing about Lasso Regression is that it handles variables that add no information to the model. As we have already undertaken a variable selection, this should not matter much. You can find further explanation on Lasso Regression here.

Thus, we could even have passed all the variables to the model, and the accuracy would probably not have suffered (it might even have improved).

Model with raw data

my_control <-trainControl(method="cv", number=5)
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0,1,by = 0.1))

model_1_glml <- train( 
  x = train_1[,-which(names(train_1)=="SalePrice")],
  y = train_1$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=lassoGrid) 

model_1_glml$bestTune
##    alpha lambda
## 11     1      1

So the cross-validation picks lambda = 1, and that is the value used for this model.

# prediction_1_glml <- predict(model_1_glml,test_1_explanatory)
# prediction_1_glml <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glml)
# write.csv(prediction_1_glml,"model_1_glml.csv", quote=FALSE,row.names = FALSE)

# Score: 0.16289
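One thing worth checking with the lasso is which coefficients it shrinks exactly to zero. A minimal sketch, assuming the fitted model_1_glml object above:

# Sketch: inspect which coefficients the lasso set to zero at the chosen lambda
lasso_coefs <- as.matrix(coef(model_1_glml$finalModel, s = model_1_glml$bestTune$lambda))
sum(lasso_coefs == 0)                    # number of variables dropped
rownames(lasso_coefs)[lasso_coefs == 0]  # their names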

Model with cleaned Data

model_2_glml <- train( 
  x = train_2[,-which(names(train_2)=="SalePrice")],
  y = train_2$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=lassoGrid) 
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
model_2_glml$bestTune
##   alpha lambda
## 1     1      0
# prediction_2_glml <- predict(model_2_glml,test_2_explanatory)
# prediction_2_glml <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glml))
# write.csv(prediction_2_glml,"model_2_glml.csv", quote=FALSE,row.names = FALSE)

#Score: 0.15498

Model 3: Ridge Regression

Ridge Regression is quite similar to Lasso Regression; the main difference is that it does not handle unimportant variables as well as Lasso does (it shrinks their coefficients but never drops them). In exchange, it tends to give good long-term predictions. For further explanation, watch this video.

Thus, if we see that Lasso Regression outperforms Ridge Regression, we probably have some non-significant variables in the model.

Model with raw data

In this case, we have tried several lambda values, and the model picks lambda = 1, which means the coefficients are shrunk by the largest amount in the grid we tried.

my_control <-trainControl(method="cv", number=5)
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0,1,0.1))

model_1_glmr <- train( 
  x = train_1[,-which(names(train_1)=="SalePrice")],
  y = train_1$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=ridgeGrid) 

model_1_glmr$bestTune
##    alpha lambda
## 11     0      1
prediction_1_glmr <- predict(model_1_glmr,test_1_explanatory)
prediction_1_glmr <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glmr)
write.csv(prediction_1_glmr,"model_1_glmr.csv", quote=FALSE,row.names = FALSE)

#Score: 0.16694
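To see the shrinkage in action, the regularization path of the underlying glmnet fit can be plotted. A sketch, assuming the model_1_glmr object above:

# Sketch: coefficient paths as lambda grows (coefficients shrink towards zero but never reach it)
plot(model_1_glmr$finalModel, xvar = "lambda", label = TRUE)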

Model with cleaned Data

In this case, lambda = 0, so there is no shrinkage. Thus, the regression is essentially ordinary least squares.

my_control <-trainControl(method="cv", number=10)
ridgeGrid <- expand.grid(alpha = 0, lambda = seq(0,1,0.1))


model_2_glmr <- train( 
  x = train_2[,-which(names(train_2)=="SalePrice")],
  y = train_2$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=ridgeGrid) 

model_2_glmr$bestTune
##   alpha lambda
## 1     0      0
prediction_2_glmr <- predict(model_2_glmr,test_2_explanatory)
prediction_2_glmr <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glmr))
write.csv(prediction_2_glmr,"model_2_glmr.csv", quote=FALSE,row.names = FALSE)

#Score: 0.15702

Model 4: Elastic-Net Regression

Elastic-Net regression is a mix of both Lasso Regression and Ridge Regression. By combining them, we get the ability to handle useless variables (Lasso) and better long-term predictions (Ridge).
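For reference, this comment sketch shows how glmnet, the engine used by caret below, mixes the two penalties through alpha:

# Sketch: the glmnet penalty mixed by alpha
#   lambda * ( alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2) )
# alpha = 1 is pure Lasso, alpha = 0 is pure Ridge, values in between are Elastic Net.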

Model with raw data

As you can see, the best tune for this model is alpha = 0 and lambda = 1, which actually corresponds to a pure Ridge penalty (no Lasso component), consistent with the Ridge results above.

my_control <-trainControl(method="cv", number=10)
elasticGrid <- expand.grid(alpha = seq(0,1,0.1), lambda = seq(0,1,0.1))

model_1_glme <- train( 
  x = train_1[,-which(names(train_1)=="SalePrice")],
  y = train_1$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=elasticGrid) 

model_1_glme$bestTune
##    alpha lambda
## 11     0      1
# prediction_1_glme <- predict(model_1_glme,test_1_explanatory)
# prediction_1_glme <- data.frame(Id = (1461:2919), SalePrice = prediction_1_glme)
# write.csv(prediction_1_glme,"model_1_glme.csv", quote=FALSE,row.names = FALSE)

#Score: 0.16693

Model with cleaned Data

In this case, the best parameters have changed: alpha = 0.1 and lambda = 0 are the ones that best tune the model.

my_control <-trainControl(method="cv", number=10)
elasticGrid <- expand.grid(alpha = seq(0,1,0.1), lambda = seq(0,1,0.1))

model_2_glme <- train( 
  x = train_2[,-which(names(train_2)=="SalePrice")],
  y = train_2$SalePrice, method='glmnet',
  trControl= my_control, tuneGrid=elasticGrid) 

model_2_glme$bestTune
##    alpha lambda
## 12   0.1      0
# prediction_2_glme <- predict(model_2_glme,test_2_explanatory)
# prediction_2_glme <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_glme))
# write.csv(prediction_2_glme,"model_2_glme.csv", quote=FALSE,row.names = FALSE)

#Score: 0.15064 

Model 5: Decision Trees

Model with raw data

We will use repeated cross-validation (via caret's tuneLength) to see how the explanatory capacity of the model changes with different values of the complexity parameter (cp). By doing so, we will keep the best tree.

set.seed(1234)
my_control <- trainControl(method = "repeatedcv", number=10, repeats=5)
model_dt_1 <- train(SalePrice~., data=train_1, method = "rpart", tuneLength= 7,trControl = my_control)

plot(model_dt_1$finalModel)
text(model_dt_1$finalModel)
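The complexity parameter selected by the cross-validation can be checked directly (a small sketch using the fitted object above):

# Sketch: cp value selected by caret's repeated cross-validation
model_dt_1$bestTune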

Cross-validation tells us which complexity parameter gives the best performance (see the sketch above), so we will use the resulting tree to make the prediction.

# prediction_1_dt <-predict(model_dt_1,test_1_explanatory)
# prediction_1_dt <- data.frame(Id=(1461:2919), SalePrice=prediction_1_dt)
# write.csv(prediction_1_dt,"model_dt_1.csv", quote=FALSE,row.names = FALSE)

#Result: 0.26823

Model with cleaned Data

As you can see, the variables by which the split is made have changed. In this case, the most important variable is TotalSF.

As it is still a very short tree, the predictions it gives are, unsurprisingly, quite poor.

model_dt_2 <- train(SalePrice~., data=train_2, method = "rpart", tuneLength= 7,trControl = my_control)

plot(model_dt_2$finalModel)
text(model_dt_2$finalModel)

# prediction_2_dt <-predict(model_dt_2,test_2_explanatory)
# prediction_2_dt <- data.frame(Id=(1461:2919), SalePrice=exp(prediction_2_dt))
# write.csv(prediction_2_dt,"model_dt_2.csv", quote=FALSE,row.names = FALSE)


#Result: 0.23130

Model 6: Random Forest

In this case, we will fit a Random Forest. Random Forest is one of the most versatile ML algorithms there are; however, here we are not taking full advantage of its capabilities. It performs its own feature selection and handles categorical data natively, so we could have saved some steps along the way.

However, as has been the case throughout, the idea here is not to see how well this ML algorithm can perform, but rather how the accuracy of the model changes with feature engineering and data preprocessing.

Model with raw data

We will tune the mtry parameter to get the best RF model.

set.seed(1)
bestMtry <- tuneRF(train_1[,-which(names(train_1) == "SalePrice")], train_1$SalePrice, stepFactor = 1.5, improve = 1e-5, ntree = 500)
## mtry = 19  OOB error = 931888231 
## Searching left ...
## mtry = 13    OOB error = 929337844 
## 0.002736795 1e-05 
## mtry = 9     OOB error = 944050009 
## -0.0158308 1e-05 
## Searching right ...
## mtry = 28    OOB error = 910028385 
## 0.02077765 1e-05 
## mtry = 42    OOB error = 966012509 
## -0.0615191 1e-05

# model_rf_1 <- randomForest(SalePrice~., data=train_1,ntree=500, mtry = 28)
# prediction_1_rf <-predict(model_rf_1,test_1_explanatory)
# 
# write.csv(prediction_1_rf,"model_rf_1.csv")
# Result: 0.16143
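If the commented model above is actually fitted, a quick way to see which variables drive the forest is the importance plot. A sketch, assuming model_rf_1 exists:

# Sketch: variable importance of the fitted random forest
# varImpPlot(model_rf_1, n.var = 15, main = "Random Forest variable importance")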

Model with cleaned Data

set.seed(1)
bestMtry <- tuneRF(train_2[,-which(names(train_2) == "SalePrice")], train_2$SalePrice, stepFactor = 1.5, improve = 1e-5, ntree = 500)
## mtry = 19  OOB error = 0.02047286 
## Searching left ...
## mtry = 13    OOB error = 0.02077389 
## -0.01470416 1e-05 
## Searching right ...
## mtry = 28    OOB error = 0.0205968 
## -0.006053951 1e-05

# model_rf_2 <- randomForest(SalePrice~., data=train_2,ntree=100, mtry=19)
# prediction_2_rf <-predict(model_rf_2,test_2_explanatory)
# 
# prediction_2_rf <- exp(prediction_2_rf)
# write.csv(prediction_2_rf,"model_rf_2.csv")
# Result: 0.15164

Model 7: k-NN

Model with raw data

# knn_ctrl <- trainControl(method = "repeatedcv", repeats = 3)
#
# # Center and scale the explanatory variables (not SalePrice)
# prePro_1 <- preProcess(train_1_explanatory, method = c("center", "scale"))
# norm_1 <- predict(prePro_1, train_1_explanatory)
# norm_1$SalePrice <- train_1_response
#
# model_knn_1 <- train(SalePrice ~ ., data = norm_1, method = "knn",
#                      trControl = knn_ctrl, tuneLength = 20)
#
# # Apply the same preprocessing (fitted on train) to the test set
# norm_test_1 <- predict(prePro_1, test_1_explanatory)
#
# prediction_1_knn <- predict(model_knn_1, norm_test_1)
# prediction_1_knn <- data.frame(Id = (1461:2919), SalePrice = prediction_1_knn)
# write.csv(prediction_1_knn, "model_knn_1.csv", quote = FALSE, row.names = FALSE)
#
# 
#Results: 0.26823
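If the commented block above is run, the k selected by the repeated cross-validation can be inspected as follows (sketch, assuming model_knn_1 has been fitted):

# Sketch: number of neighbours chosen by cross-validation, and the RMSE profile
# model_knn_1$bestTune
# plot(model_knn_1)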

Model with cleaned Data

# # Center and scale the explanatory variables (not SalePrice)
# prePro_2 <- preProcess(train_2_explanatory, method = c("center", "scale"))
# norm_2 <- predict(prePro_2, train_2_explanatory)
# norm_2$SalePrice <- train_2_response
# model_knn_2 <- train(SalePrice ~ ., data = norm_2, method = "knn", trControl = knn_ctrl, tuneLength = 20)

# # Apply the same preprocessing to test and back-transform the log prices
# norm_test_2 <- predict(prePro_2, test_2_explanatory)
# prediction_2_knn <- predict(model_knn_2, norm_test_2)
# prediction_2_knn <- data.frame(Id = (1461:2919), SalePrice = exp(prediction_2_knn))
# write.csv(prediction_2_knn, "model_knn_2.csv", quote = FALSE, row.names = FALSE)

#Results: 0.23130

Model 8: XGBoost

XGBoost is one of the most popular models used on Kaggle. However, it takes quite a lot of time to train. Thus, in this situation, I have not searched for the best tuning parameters for both models. Instead, I have found the best tuning parameters among the "recommended" standard parameters for Dataset 2. Once I found the best tune for model 2, I applied those same parameters to model 1. Thus, model 1 could still gain significant improvement.

Model with raw data

As said before, in order to save some time, the parameters I have used to train this XGBoost model are the ones that best tune XGBoost for the second dataset. These parameters are:

  • eta = 0.05. It refers to the learning rate and is used to prevent overfitting.
  • max_depth = 2. It refers to the maximum depth the trees can have. According to the XGBoost documentation, the deeper the trees, the more likely the model is to overfit.
  • min_child_weight = 4. As explained on Stack Exchange: “stop trying to split once you reach a certain degree of purity in a node and your model can fit it”.

Thus, the model was built like this:

# tune_grid <- expand.grid(
# nrounds = 1000,
# eta = 0.05,
# max_depth = 2,
# gamma = 0,
# colsample_bytree=1,
# min_child_weight=4,
# subsample=1
# )
# 
# xgb_tune_1 <- caret::train(
#   x = train_1[,-which(names(train_1)=="SalePrice")],
#   y = train_1$SalePrice,
#   tuneGrid = tune_grid,
#   method = "xgbTree",
#   verbose = TRUE)

With the model trained, we made a prediction. This prediction had an RMSE of 0.17989 on the test dataset (quite high, actually).

# prediction_1_xgb <-predict(xgb_tune_1,test_1_explanatory)
# prediction_1_xgb <- data.frame(Id=(1461:2919), SalePrice=prediction_1_xgb)
# write.csv(prediction_1_xgb,"model_xgb_1.csv", quote=FALSE,row.names = FALSE)
# Result: 0.17989

Model with cleaned Data

As explained before, in this case I did actually tune the parameters, although this is obviously still not a thorough implementation of XGBoost. You can see how it is done properly in this kernel.

# tune_grid <- expand.grid(
# nrounds = 1000,
# eta = c(0.1, 0.05, 0.01),
# max_depth = c(2, 3, 4, 5, 6),
# gamma = 0,
# colsample_bytree=1,
# min_child_weight=c(1, 2, 3, 4 ,5),
# subsample=1
# )
# 
# xgb_tune_2 <- caret::train(
#   x = train_2[,-which(names(train_2)=="SalePrice")],
#   y = train_2$SalePrice,
#   tuneGrid = tune_grid,
#   method = "xgbTree",
#   verbose = TRUE)

When we undertake the predictions we get a much more accurate model, with an RMSE of 0.15447. So far, XGBoost is one of the models that has performed best, even though we haven't undertaken a thorough parameter tuning.

# prediction_2_xgb <-predict(xgb_tune_2,test_2_explanatory)
# 
# xgb_tune_2$bestTune
# 
# prediction_2_xgb <- exp(prediction_2_xgb)
# prediction_2_xgb <- data.frame(Id=(1461:2919), SalePrice=prediction_2_xgb)
# write.csv(prediction_2_xgb,"model_xgb_2.csv", quote=FALSE,row.names = FALSE)
# Result: 0.15447

Model 9: Support Vector Machines

As Indresh Bhattacharyya explained, the main difference between SVM for regression and a linear regression is that in simple regression we try to minimise the error directly, while in SVR we try to fit the error within a certain threshold.

Thus, as linear regression did a good job at predicting, SVM will probably do a good job as well. Let's check it out.
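For reference, the error threshold mentioned above is an explicit argument when fitting the SVR directly with kernlab (the engine behind caret's svmLinear). A sketch under that assumption:

# Sketch (assumption): fitting the linear epsilon-SVR directly with kernlab
# library(kernlab)
# model_svr_sketch <- ksvm(SalePrice ~ ., data = train_2,
#                          type = "eps-svr",       # epsilon-insensitive regression
#                          kernel = "vanilladot",  # linear kernel
#                          C = 1, epsilon = 0.1)   # cost and error threshold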

Model with raw data

As we can see, the SVM has actually done a good job at predicting the price. In fact, it is the model that has performed best on the "raw data" dataset.

# my_control <- trainControl(method = "cv", number = 10)
# model_svm_1 <- train(SalePrice~., data=train_1, method = "svmLinear",trControl=my_control,tuneLength = 10)
# prediction_1_svm <-predict(model_svm_1,test_1_explanatory)
# prediction_1_svm <- data.frame(Id=(1461:2919), SalePrice=prediction_1_svm)
# write.csv(prediction_1_svm,"model_svm_1.csv", quote=FALSE,row.names = FALSE)
# Result: 0.15255

Model with cleaned Data

Once again, we can see how, in this case, the SVM has outperformed the rest of the models.

# model_svm_2 <- train(SalePrice~., data=train_2, method = "svmLinear",trControl=my_control,tuneLength = 10)
# prediction_2_svm <-predict(model_svm_2,test_2_explanatory)
# prediction_2_svm <- data.frame(Id=(1461:2919), SalePrice=exp(prediction_2_svm))
# 
# write.csv(prediction_2_svm,"model_svm_2.csv", quote=FALSE,row.names = FALSE)
# Result: 0.14695

Model 10: Neural Network

Although I have tried several tunings, I have not been able to make a neural network converge. Thus, I will simply show the code I have used (maybe you can help me find better hidden-layer parameters) and drop this model from the comparison.

 # # Build the formula excluding SalePrice from the predictors
 # model_nn_1 <- neuralnet(formula = as.formula(paste('SalePrice ~', paste(setdiff(colnames(train_1), "SalePrice"), collapse = " + "))),
 #                         data = train_1, hidden = c(5), linear.output = TRUE, lifesign = "full",
 #                         stepmax = 1e6, rep = 1, algorithm = "rprop+", threshold = 0.5)
 # 
 # prediction_1_nn <-predict(model_nn_1,test_1_explanatory)
 # prediction_1_nn <- data.frame(Id=(1461:2919), SalePrice=prediction_1_nn)
 # write.csv(prediction_1_nn,"model_nn_1.csv", quote=FALSE,row.names = FALSE)
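As an alternative sketch (an assumption on my part, not something I have validated), a simpler single-hidden-layer network with nnet, which is already loaded, sometimes converges more easily when combined with weight decay:

# Sketch (assumption): single hidden layer with nnet and weight decay
# model_nn_2 <- nnet(SalePrice ~ ., data = train_2, size = 5, linout = TRUE,
#                    decay = 0.1, maxit = 1000, MaxNWts = 5000)
# prediction_2_nn <- exp(predict(model_nn_2, test_2_explanatory))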

Results and conclusion

If we analyze the results, one thing becomes clear: every model performs noticeably better when we undertake proper missing-value handling, feature generation and feature selection. This can be clearly seen in the following graph.
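The results data frame used below simply collects the test RMSE scores reported in each section. A sketch of how it could be assembled (the exact construction is not shown in the original code, so the column names follow the ones used in the plot):

# Sketch: assembling the scores reported above into the results data frame
results <- data.frame(
  Model = rep(c("Linear Model", "Lasso", "Ridge", "Elastic-Net", "Decision Tree",
                "Random Forest", "k-NN", "XGBoost", "SVM"), each = 2),
  Data  = rep(c("Non_worked_data", "Worked_data"), times = 9),
  RMSE  = c(0.15878, 0.14711,   # Linear Model
            0.16289, 0.15498,   # Lasso
            0.16694, 0.15702,   # Ridge
            0.16693, 0.15064,   # Elastic-Net
            0.26823, 0.23130,   # Decision Tree
            0.16143, 0.15164,   # Random Forest
            0.26823, 0.23130,   # k-NN
            0.17989, 0.15447,   # XGBoost
            0.15255, 0.14695)   # SVM
)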

results %>%
  ggplot() +
  geom_path(aes(x = RMSE, y = Model),
            arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
  geom_text(aes(x = RMSE, y = Model, label = round(RMSE, 2),
                hjust = ifelse(Data == "Non_worked_data", -0.3, 1.2)),
            size = 3, color = "gray25") +
  labs(title = "Model performance with processed data vs not processed data",
       x = "RMSE on test", y = "") +
  theme_minimal()

This kernel tries to emphasize the importance of a good data processing, which is often forgotten. That being said, choosing the right ML model (or models) and tuning them correctly are also crucial for obtaining the best results possible.