Predicting the price of a house (10 ML models)

Ander Fernández Jauregui

anderfernandezj@gmail.com

Initial Problem.

A lot of attention is often paid to machine learning, but little to data preparation (removing skewness, imputing missing values, feature engineering).

This project aims to show how much properly preparing the data matters, regardless of the model used.

Topics Involved.

Machine Learning

Black boxes (SVM, NN, XGBoost…)

White boxes (Elastic Net, Lasso, decision trees…)

Data preparation

Feature Engineering

Solution.

To that end, building on a Kaggle competition, I have applied 10 different machine learning models (SVM, XGBoost, Lasso, etc.), each to two datasets: one with thorough data preparation and one without.

Project development: Predicting the price of a house (10 ML models)

Predicting the price of a house with 10 Machine Learning models

Introduction

This project aims to emphasize the importance of good missing-value handling, feature generation and feature selection. To do so, I will predict the price of houses in Iowa using several different machine learning algorithms.

The idea is that by the end of this kernel you will have a good grasp of how important good data preprocessing is and how it can affect the outcome of your model.

The models we will compare are the following:

Linear model

Lasso Regression

Ridge Regression

Elastic-Net Regression

Decision Trees

Random Forest

k-NN

XGBoost

Support Vector Machines

Neural Network

As the aim is to show the importance of data preprocessing, we will require two datasets:

Dataset with raw data. In this case we will do minimal processing (such as a basic NA imputation). This will be Dataset 1.

Dataset with cleaned data. In this case, we will undertake a more thorough process: NA imputation, data preprocessing, feature engineering and so on.

If you check the data description provided in the competition, you will see that for many of the variables with a large number of NAs, an NA actually means the house does not have that feature. Therefore, to be fair, we will convert those NAs into a "None" category in both datasets.

By reading the data descriptions you will see that the variables that have NAs that should be "None" are the following: PoolQC, MiscFeature, Alley, Fence, FireplaceQu, GarageFinish, GarageQual, GarageCond, GarageType, BsmtCond, BsmtExposure, BsmtQual, BsmtFinType1 and BsmtFinType2.

We will coerce these variables' NAs into "None".

change <- c("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
            "GarageFinish", "GarageQual", "GarageCond", "GarageType",
            "BsmtCond", "BsmtExposure", "BsmtQual", "BsmtFinType1", "BsmtFinType2")
# Replace NAs with "None" in every listed column
for (col in intersect(change, colnames(total))) {
  total[[col]][is.na(total[[col]])] <- "None"
}

If we now analyze the amount of NA in each column, we will see that these have significantly decreased.
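As a quick sketch of that check (using a tiny hypothetical data frame in place of the real `total`), we can count the NAs per column and keep only the columns that still have any:

```r
# Toy stand-in for the combined train/test data frame (hypothetical values)
total <- data.frame(
  PoolQC      = c("Gd", NA, NA),
  LotFrontage = c(65, NA, 80),
  SalePrice   = c(208500, 181500, NA)
)
# Count missing values per column and keep only the columns with NAs
na_counts <- colSums(is.na(total))
withNA <- sort(na_counts[na_counts > 0], decreasing = TRUE)
withNA  # named vector: PoolQC 2, LotFrontage 1, SalePrice 1
```

On the real data, the same two lines produce the `withNA` vector used later by the mode-imputation loop.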

Now we will create two different datasets. In each we will undertake a different value imputation procedure, as stated in the introduction section.

Dataset 1. Mode imputation

This is not the way we should work, or at least that is what this project argues. I am just doing this to show you how differently models perform when you have clean data compared to when you don't. Cleaning is tedious, but it is the way to go.

That being said, we will coerce the NAs into the mode of each variable. We use the mode instead of the mean mainly because, if we had an outlier, for example, the mean would be affected by that extreme value. The mode, on the other hand, would change little.

We will exclude "SalePrice" from the imputation for obvious reasons: this is the variable we want to predict.

Besides, as R does not include a function to calculate the mode, we will create one ourselves. In my case, I got it from Stack Overflow. Thanks to Ken Williams for the contribution.

# Creating the Mode function
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}

# Applying it to every variable with missing values, except SalePrice
total_noclean <- total
for (col in intersect(names(withNA), colnames(total_noclean))) {
  if (col == "SalePrice") next
  missing <- is.na(total_noclean[[col]])
  total_noclean[[col]][missing] <- Mode(total_noclean[[col]][!missing])
}

Now we should have no more variables to impute. We rerun the code just to make sure.

As you can see, we are done here. Simple, isn't it? Of course it is; that's why most of the time it won't work ;)

Dataset 2. Advanced NA imputation

First, we must be aware that in the previous step of coercing NAs into "None", we might have missed some interesting things. Just by looking at the number of NAs we got when we first checked the "Completeness of data", you can see some odd things:

There were 157 NAs in "GarageType" but 159 in the rest of the Garage variables. Maybe there are 2 observations with wrongly entered data.

BsmtFinType2 had 80 NAs, while BsmtQual had 81 and the rest of the Bsmt variables 82.

Besides, we will also have to "predict", to the extent possible, the value of the missing data by means other than the mode.

Garage variables

As I have mentioned before, GarageType had 2 fewer NAs than the rest of the Garage variables. We will check those two observations:

The other variables concerning the garage are GarageCars, GarageArea and GarageYrBlt. We will check whether any of them lacks a "None" in the rest of the Garage variables.

Finally, we will have to impute "GarageYrBlt". We will first check that there is no observation with an NA in this variable but not in the rest of the Garage variables.

Now, what do we do with the Garage Year Built? We will impute the year the house was built, except if the house has been remodeled, in which case we will impute the remodeling year.

It makes sense that the garage and the house were built in the same year, and if you remodel the house it is highly likely you will also remodel the garage.

for (i in 1:nrow(total_clean)) {
  if (is.na(total_clean$GarageYrBlt[i])) {
    total_clean$GarageYrBlt[i] <- ifelse(!is.na(total_clean$YearRemodAdd[i]),
                                         total_clean$YearRemodAdd[i],
                                         total_clean$YearBuilt[i])
  }
}

Finally, we will have to fix the two observations that still have missing data. We will check their values:

## GarageCars GarageArea GarageFinish GarageQual GarageCond GarageType
## 1 NA NA None None None None

As we can see, both missing values belong to the same observation. As all the other variables suggest that this house does not have a garage, we will impute a 0 in both variables.

## BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121 None NA NA NA NA NA
## 2189 None NA NA 0 0 0
## TotalBsmtSF
## 2121 NA
## 2189 0

As you can see, all these variables' NAs correspond to two observations, and neither of them has a basement. Thus, these values should be 0.
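A minimal sketch of that zero-imputation, on a hypothetical two-row slice (the real kernel operates on the full `total_clean`):

```r
# Hypothetical slice: row 1 has no basement (BsmtQual == "None") and NAs
bsmt_num <- c("BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1",
              "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF")
total_clean <- data.frame(
  BsmtQual = c("None", "Gd"),
  BsmtFullBath = c(NA, 1), BsmtHalfBath = c(NA, 0),
  BsmtFinSF1 = c(NA, 700), BsmtFinSF2 = c(NA, 0),
  BsmtUnfSF = c(NA, 150), TotalBsmtSF = c(NA, 850)
)
no_bsmt <- total_clean$BsmtQual == "None"
for (col in bsmt_num) {
  # A house without a basement has 0 basement square feet and 0 basement baths
  total_clean[[col]][no_bsmt & is.na(total_clean[[col]])] <- 0
}
```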

Even though we have previously imputed the missing values of the Pool-related variables, as there is more than one such variable we can cross-check the internal coherence of their values.

As we can see, there are 3 values that have been incorrectly imputed, so we will fix them. We will check whether the overall quality of the house is a good proxy variable for the pool quality.

As we have no better proxy, we will use the overall quality. We have to take into account that overall quality goes from 1 to 10, while PoolQC has just 5 categories. So, we will divide the overall quality by 2 so that both have the same scale, round it up and assign that value.
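The rescaling described above can be sketched like this (the mapping onto the 5-level scale via `ceiling` is my reading of the text; the three values are hypothetical):

```r
# OverallQual runs 1-10; halving and rounding up gives a 1-5 scale
overall_qual <- c(4, 7, 10)
pool_qc_int <- ceiling(overall_qual / 2)
pool_qc_int  # 2 4 5
```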

There are three Lot variables: "LotConfig", "LotShape" and "LotFrontage". The first two do not have a "None" level among their categories, so this time we have no direct way to estimate LotFrontage.

According to the variable descriptions, LotFrontage is the "Linear feet of street connected to property". Thus, it makes sense that it changes with the Neighborhood (more residential neighborhoods should have a higher lot frontage). We will check whether this is right with a simple ANOVA.
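The check can be sketched with `aov` on simulated data (toy numbers; the real test runs on `total_clean`):

```r
set.seed(1)
# Two hypothetical neighborhoods with clearly different lot frontages
toy <- data.frame(
  Neighborhood = rep(c("NAmes", "StoneBr"), each = 20),
  LotFrontage  = c(rnorm(20, mean = 60, sd = 5), rnorm(20, mean = 90, sd = 5))
)
fit <- aov(LotFrontage ~ Neighborhood, data = toy)
summary(fit)  # a small p-value supports neighborhood-based imputation
```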

As there is indeed a significant variance, we will impute the median of the neighborhood to the observations with NA in their LotFrontage.

for (i in 1:nrow(total_clean)) {
  if (is.na(total_clean$LotFrontage[i])) {
    total_clean$LotFrontage[i] <- median(
      total_clean$LotFrontage[total_clean$Neighborhood == total_clean$Neighborhood[i]],
      na.rm = TRUE)
  }
}

Zoning

According to the variable description, zoning refers to "the general zoning classification of the sale". Thus, the zoning is supposed to be highly correlated with the Neighborhood. We will check that.

As you can see, the supposition was right. We will therefore impute MSZoning according to the neighborhood. To do so, we first check the neighborhoods.
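A sketch of that neighborhood-based imputation, reusing the `Mode()` helper defined earlier (toy data; the real code runs on `total_clean`):

```r
Mode <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }

toy <- data.frame(
  Neighborhood = c("IDOTRR", "IDOTRR", "IDOTRR", "Mitchel", "Mitchel"),
  MSZoning     = c("RM", "RM", NA, "RL", NA),
  stringsAsFactors = FALSE
)
for (i in which(is.na(toy$MSZoning))) {
  # Impute the most common zoning of the observation's neighborhood
  same_hood <- toy$Neighborhood == toy$Neighborhood[i] & !is.na(toy$MSZoning)
  toy$MSZoning[i] <- Mode(toy$MSZoning[same_hood])
}
toy$MSZoning  # "RM" "RM" "RM" "RL" "RL"
```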

There are four variables regarding the exterior: "Exterior1st", "Exterior2nd", "ExterCond" and "ExterQual". The first two refer to the material covering the house and the other two to how well it is preserved. Let's check those variables to see if we have any clue.

## Exterior1st Exterior2nd ExterCond ExterQual Neighborhood
## 2152 <NA> <NA> TA TA Edwards

As we can see, both NAs belong to the same observation and both condition variables are TA. I suspect that the condition of the exterior might be correlated with the material, as some materials might be better than others and thus stay in better condition.

Besides, we will check whether both materials are actually the same, so that we only have to analyze one variable.

As we can see, most of the houses in that neighborhood that have a “TA” Exterior Quality have “Wd Sdng” as Exterior material. Thus, we will impute this.

Utilities

Let's see which is the most common type of utility.

summary(as.factor(total_clean$Utilities))

## AllPub NoSeWa NA's
## 2916 1 2

As we can see, all but one of the observations share the same value. Thus, this variable carries almost no information and we should drop it. As this is something we haven't done in Dataset 1 (see the other tab), I won't change it there.

total_clean$Utilities <- NULL

Functional

If you read the variable description, it says "Home functionality (Assume typical unless deductions are warranted)". Thus, we will consider the two NAs as typical ("Typ").
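That imputation is a one-liner; a sketch on a hypothetical vector:

```r
# Replace missing Functional values with "Typ" (typical), per the description
functional <- c("Typ", NA, "Min2", NA)
functional[is.na(functional)] <- "Typ"
functional  # "Typ" "Typ" "Min2" "Typ"
```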

By looking at the variable descriptions you will see that most categorical variables have a very similar encoding: Ex, Gd, TA, Fa, NA (now None). As they all share the same levels, I will call them "standardized". The variables with this encoding are ordinal variables. However, as most ML algorithms work with numerical variables, we will coerce them into integers (from 0 for "None" to 5 for "Ex").

In this sense, we distinguish three types of character variables:

Variables with standardized ordinal values

Ordinal non-standardized variables

Nominal variables

We will split the relabeling into these three categories:

Standardized Ordinal Variables

These refer to the status of features that not all houses have (such as a pool) as well as features that all houses have (such as a kitchen). By checking the variable descriptions you get that these variables are the following: BsmtQual, BsmtCond, FireplaceQu, GarageQual, GarageCond, PoolQC, ExterQual, ExterCond, HeatingQC, KitchenQual.

For both datasets, we will relabel the variables, considering "None" as 0 and "Ex" as 5. To save some space I will just show you one code block.

relabel <- c("BsmtQual","BsmtCond","FireplaceQu","GarageQual","GarageCond","PoolQC","ExterQual","ExterCond","HeatingQC","KitchenQual")
quality <- c("None" = 0, "Po" = 1, "Fa"= 2, "TA" = 3, "Gd" = 4, "Ex" = 5)
# Map each quality level to its integer and store the result as integer
for (x in intersect(relabel, colnames(total_clean))) {
  total_clean[[x]] <- as.integer(revalue(total_clean[[x]], quality))
}

We will check whether the process has been correctly undertaken or not by analyzing the levels of one of the variables we have changed.

levels(as.factor(total_clean$BsmtQual))

## [1] "0" "2" "3" "4" "5"

Ordinal non-standardized variables

In this case, as they don't follow the same pattern, we will relabel them one by one. The variables to be revalued are found by reading the variable descriptions. Besides, we will undertake the process for both datasets, but I will only show one.

## The following `from` values were not present in `x`: Sal

Nominal variables

As there should no longer be any character variables that are in fact ordinal, we will convert all remaining character variables into factors.

# Index of non-numeric (character) variables
character <- which(sapply(total_clean, is.character))
for (i in character) {
  total_clean[[i]] <- as.factor(total_clean[[i]])
}
total_clean$SalePrice <- as.numeric(total_clean$SalePrice)

To ensure everything is OK, we will check whether any character variable remains.

As we can see, there are two variables that should be factors but are stored as integers: the month and year sold. We will change these two variables.

Note. In order to show the importance of feature engineering I will only undertake feature engineering changes in the 2nd Dataset.

Age of the house

One of the most important things to consider when purchasing a house is its age. Newer houses are usually more expensive, because they use newer, unworn materials and more advanced technology. A remodeled house, on the other hand, is usually not as expensive as a new house, but it is more expensive than an old one for the same reasons.

Thus, we will calculate the age of the house.

#We convert the variable into numeric
total_clean$Age <- total_clean$YrSold - total_clean$YearRemodAdd
#We convert the variable back to factor
total_clean$YrSold <- as.factor(total_clean$YrSold)
total_noclean$YrSold <- as.factor(total_noclean$YrSold)

We will now check the relationship between age and price:

ggplot(total_clean, aes(Age, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "lm", se=FALSE) + labs(title="House price by year", x ="House Age")

As you can see, generally the older the house, the lower the sale price.

Total Square feet

It is obvious that the bigger the house, the higher the price. However, we do not have a variable that summarizes the total square footage of the house, so we will create it.
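One natural way to build it, sketched on hypothetical rows (I am assuming TotalSF is the sum of basement, first-floor and second-floor square feet; the kernel's exact formula may differ):

```r
# Hypothetical example rows
total_clean <- data.frame(
  TotalBsmtSF = c(850, 0),
  X1stFlrSF   = c(856, 1200),
  X2ndFlrSF   = c(854, 0)
)
# Total square feet: basement + first floor + second floor
total_clean$TotalSF <- total_clean$TotalBsmtSF +
  total_clean$X1stFlrSF + total_clean$X2ndFlrSF
total_clean$TotalSF  # 2560 1200
```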

cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )

## [1] 0.7800884

As we can see, despite two observations showing a strange pattern, the two variables have a strong positive relationship. Thus, we will simply analyze, and if appropriate remove, those two cases.

We see that the "problematic" values are observations 524 and 1299. We will analyze these observations in more detail, together with observation 2550, which we have to predict, just to check if they have any similarities.

If you look closer you will find some things in common:

They all share the same Neighborhood.

They all were sold before the house was finished (negative age).

The SaleCondition was “Partial”.

According to the data description, a Partial sale means that the "Home was not completed when last assessed". Thus, the SalePrice of these observations may not represent the actual market value of the house, but rather the amount of money already paid in advance.

We should definitely remove these two outliers. However, they will be very useful to predict the value of house 2550, as it has similar conditions.

Let’s see how the model works without these two observations.

total_clean <- total_clean[-c(524,1299),]
ggplot(total_clean, aes(TotalSF, SalePrice)) + geom_point(alpha=0.4) + geom_smooth(method = "loess", se=FALSE) + labs(title="Houseprice by total square feet (adjusted)")

cor(total_clean$TotalSF,total_clean$SalePrice, use = "complete.obs" )

## [1] 0.8289224

Number of Bathrooms

Another interesting variable might be the number of bathrooms in the house. I also thought about the total number of rooms (bathrooms + rooms), but that is likely to be highly correlated with the total SF.

In order to correctly count the number of bathrooms, we have to add the basement bathrooms as well, since FullBath only indicates "Full bathrooms above grade".
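A sketch of the feature on hypothetical rows (counting half bathrooms as 0.5 is my assumption; the kernel may weight them differently):

```r
total_clean <- data.frame(
  FullBath = c(2, 1), HalfBath = c(1, 0),
  BsmtFullBath = c(1, 0), BsmtHalfBath = c(0, 1)
)
# Above-grade plus basement bathrooms, half baths counted as 0.5
total_clean$Bathrooms <- total_clean$FullBath + 0.5 * total_clean$HalfBath +
  total_clean$BsmtFullBath + 0.5 * total_clean$BsmtHalfBath
total_clean$Bathrooms  # 3.5 1.5
```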

cor(total_clean$Bathrooms, total_clean$SalePrice, use = "complete.obs")

## [1] 0.6358963

Variables significance and correlations

Now we will try to identify, for each dataset, which are the most relevant variables. This process will enable us to use just certain specific variables for some (not all) ML models, like the linear model. We will undertake two different processes.

Random Forest: as it is a tree-based algorithm, it works with both categorical and numerical data. It will give us a glimpse of which variables are important.

Correlation matrix: as RF does not show collinearity among variables, I will build a correlation matrix. By doing so I will pick significant variables that are not highly correlated.

Variables significance: Random Forest

Dataset 2

As we have said, we will run a Random Forest to see which are the most important explanatory variables.
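The importance check can be sketched as follows, using the built-in `mtcars` data as a stand-in for the housing data (the `ntree` value and formula here are illustrative assumptions, not the kernel's settings):

```r
library(randomForest)

set.seed(123)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 200)
# Permutation importance (%IncMSE), sorted from most to least important
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

`varImpPlot(rf)` draws the same ranking as a dot chart.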

As we can see, the most important variables are TotalSF, Neighborhood, OverallQual, GrLivArea, BsmtFinSF1, Bathrooms, GarageArea and YearBuilt. However, some of these variables (such as TotalSF and BsmtFinSF1) are surely highly correlated. Thus, we now have to analyze the correlations among variables.

Dataset 1.

In this case we will also run a Random Forest to analyze which are the most important variables.

It is obvious that in this case the relevant variables have changed.

Variables correlations

Dataset 2.

In order to undertake a correlation matrix, it is good to first consider several things:

Pearson correlation only works with numerical data. Thus, I will first have to build a df with only those variables.

Pearson correlation only measures the strength of the linear correlation between variables. There might be some other (non-linear) type of correlation that we simply won't see.

First, we have to split the data and just get numeric variables.

# We create a new df for numeric variables
total_clean_numeric_index <- names(which(sapply(total_clean, is.numeric)))
total_clean_numeric <- total_clean[, which(names(total_clean) %in% total_clean_numeric_index)]
#We calculate correlations and filter for just those higher than |0.5| and pick the names
correlations <- as.matrix(x = sort(cor(total_clean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names <- names(which(apply(correlations,1, function(x) abs(x)>0.5)))
#We sort the dataset to just show those variables
total_clean_numeric <- total_clean_numeric[, names]
#We create and represent the correlations matrix
correlations <- cor(total_clean_numeric, use="pairwise.complete.obs")
cor.plot(correlations, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)

By analyzing the matrix we notice the following:

TotalSF is highly correlated with GrLivArea (0.86), TotalBsmtSF (0.8) and 1stFlrSF (0.77). We pick TotalSF.

OverallQual is highly correlated with ExterQual (0.73) and KitchenQual (0.67). We pick OverallQual.

GarageCars is highly correlated with GarageArea (0.89). We keep GarageCars.

Bathrooms is highly correlated with FullBath (0.71). We keep Bathrooms.

TotRmsAbvGrd is highly correlated with GrLivArea. As we have already dropped that one, we keep TotRmsAbvGrd.

YearRemodAdd is perfectly correlated with Age. However, it has lower correlations with the other variables (due to rounding, I suppose). Thus, we keep YearRemodAdd.

Finally, we will create a df that keeps just the "important" variables for those models that do not cope with multicollinearity (such as the linear model).

important <- c("SalePrice","TotalSF","OverallQual","Bathrooms","GarageCars","YearBuilt","BsmtQual","GarageFinish","GarageYrBlt","FireplaceQu","YearRemodAdd", "Neighborhood")
total_models_clean <- total_clean[, which(names(total_clean) %in% important)]

Dataset 1

This time we will undertake the exact same process.

# We create a new df for numeric variables
total_noclean_numeric_index <- names(which(sapply(total_noclean, is.numeric)))
total_noclean_numeric <- total_noclean[, which(names(total_noclean) %in% total_noclean_numeric_index)]
#We calculate correlations and filter for just those higher than |0.5| and pick the names
correlations2 <- as.matrix(x = sort(cor(total_noclean_numeric, use="pairwise.complete.obs")[,"SalePrice"], decreasing = TRUE))
names2 <- names(which(apply(correlations2,1, function(x) abs(x)>0.5)))
#We sort the dataset to just show those variables
total_noclean_numeric <- total_noclean_numeric[, names2]
#We create and represent the correlations matrix
correlations2 <- cor(total_noclean_numeric, use="pairwise.complete.obs")
cor.plot(correlations2, numbers=TRUE, xlas = 2, upper= FALSE, main="Correlations among important variables", zlim=c(abs(0.65),abs(1)), colors=FALSE)

In this case we will get rid of the following variables:

ExterQual: highly correlated with OverallQual (0.73).

TotRmsAbvGrd: highly correlates with GrLivArea (0.81).

Garage Area: highly correlates with GarageCars (0.89).

X1stFlrSF highly correlates with TotalBsmtSF (0.8).

Note: removing skewness is a recommended but not indispensable step. Thus, this process will not be undertaken in Dataset 1, in line with its lazy approach. As I said at the beginning, the idea is to show how important it is not just to run the correct model, but also to prepare the data.

Now that we know which variables are the important ones, we will prepare them by finding outliers and removing skewness.

Why is it important to fix the skewness? Well, some ML algorithms are based on the assumption of homoscedasticity, which means that all errors have the same variance. As explained on StackExchange: "the difference between what you predict Ŷ and the true values Y should be constant. You can ensure that by making sure that Y follows a Gaussian distribution".

Now that we have the numeric variables, we will simply write a for loop that computes the skewness and applies a logarithmic transformation if the skewness is higher than 0.75, as that is a moderately high skewness. Explanation here.

for (i in 1:ncol(total_model_clean_numeric)) {
  if (abs(skew(total_model_clean_numeric[,i])) > 0.75) {
    total_model_clean_numeric[,i] <- log(total_model_clean_numeric[,i] + 1)
  }
}

We will check and see if any variable has been modified.

As you can see, SalePrice has been modified. Thus, we will have to back-transform the predictions we make. Just so you can see it, we will check how the skewness of the predicted variable has changed.

qqnorm(total_clean$SalePrice, main = "Skewness of data without transformation (SalePrice)")
qqline(total_clean$SalePrice)

qqnorm(total_model_clean_numeric$SalePrice, main = "Skewness of data transformed (SalePrice)")
qqline(total_model_clean_numeric$SalePrice)

For those variables that are not numeric, we will proceed to create dummy variables, which basically means one variable for each level of each original variable, with two possible values: 1 (yes) and 0 (no).

To do so, we first have to coerce all categorical variables that are encoded as numeric into factors.

Now we will get rid of the dummy variables that appear fewer than 5 times in the train dataset, along with those that do not appear in the test dataset. We do so because fewer than 5 occurrences is not enough data to make a good prediction.

The good thing about Lasso Regression is that it handles variables that add no information to the model. As we have already undertaken a variable selection, this should not matter much. You can find a further explanation of Lasso Regression here.

Thus, we could pass all the variables to the model and probably the accuracy would increase.
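A minimal lasso sketch on synthetic data (using glmnet; the variable names, lambda grid, and use of `cv.glmnet` are assumptions, not the kernel's exact setup):

```r
library(glmnet)

set.seed(42)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- 2 * X[, 1] + rnorm(100)       # only the first predictor is informative
fit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the lasso penalty
# Uninformative predictors are shrunk to (or near) zero
coef(fit, s = "lambda.min")
```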

Ridge Regression is quite similar to Lasso Regression, the main difference being that it does not handle unimportant variables as well as Lasso does. Instead, it tends to give better long-term predictions. For further explanation, watch this video.

Thus, if we see that Lasso Regression outperforms Ridge Regression, maybe we have some non significant variables in the model.

Model with raw data

In this case, we have tried several lambda values, and the model picks lambda = 1, which determines how strongly the coefficients are shrunk.

Elastic Net regression is a mix of Lasso Regression and Ridge Regression. With it, we get the ability to handle useless variables (Lasso) and better long-term predictions (Ridge).

Model with raw data

As you can see, the best tune for this model is Alpha = 0 and Lambda = 1. Since Alpha = 0, the elastic net reduces to pure Ridge Regression here.

We will write a for loop to see how the explanatory capacity of the model changes with different values of the minimum number of observations required to attempt a split (minsplit). By doing so, we will get a better model.
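The sweep could look like this, using rpart on the built-in `mtcars` data as a stand-in for the housing data (the minsplit grid and the in-sample R² metric are illustrative choices):

```r
library(rpart)

# In-sample R^2 of a regression tree for several minsplit values
rsq <- sapply(c(5, 10, 20, 40), function(ms) {
  tree <- rpart(mpg ~ ., data = mtcars,
                control = rpart.control(minsplit = ms))
  pred <- predict(tree, mtcars)
  1 - sum((mtcars$mpg - pred)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
})
rsq
```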

In this case, we will fit a Random Forest, one of the most versatile ML algorithms there are. However, here we are not taking full advantage of its capabilities: it performs feature selection and handles categorical data on its own, so we could have saved some steps along the way.

However, as has been the case throughout, the idea here is not to see how well this ML algorithm can perform, but rather how the model's accuracy changes with feature engineering and data preprocessing.

XGBoost is one of the most popular models used on Kaggle. However, it takes quite a lot of time to train. Thus, in this situation, I have not searched for the best tuning parameters for both models. Instead, I have found the best tuning parameters among the "recommended" standard parameters for Dataset 2. Once I found the best tune for model 2, I applied those same parameters to model 1. Thus, model 1 could still gain a significant improvement.

Model with raw data

As said before, in order to save some time, the parameters I have used to train this XGBoost model are the ones that best tune XGBoost for the second dataset. These parameters are:

eta = 0.05. It refers to the learning rate and it is used to prevent overfitting.

max_depth = 2. It refers to the maximum depth the trees can have. According to the XGBoost documentation, the deeper the trees, the more likely the model is to overfit.

min_child_weight = 4. As explained on Stack Exchange: "stop trying to split once you reach a certain degree of purity in a node and your model can fit it".

As explained before, in this case I did not actually tune the parameters. This is obviously not a good implementation of XGBoost. You can see how it is correctly done in this kernel.
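For reference, this is roughly how those parameters would be passed to xgboost (a sketch on synthetic data; the objective, `nrounds`, and data interface are assumptions, not the kernel's actual call):

```r
library(xgboost)

params <- list(objective = "reg:squarederror",
               eta = 0.05, max_depth = 2, min_child_weight = 4)
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(200, sd = 0.1)
dtrain <- xgb.DMatrix(data = X, label = y)
model <- xgb.train(params = params, data = dtrain, nrounds = 100)
sqrt(mean((predict(model, dtrain) - y)^2))  # in-sample RMSE
```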

When we make the predictions we get a much more accurate model, with an RMSE of 0.15447. So far, XGBoost is one of the best-performing models, even though we haven't undertaken a thorough parameter tuning.

As Indresh Bhattacharyya explained, the main difference between SVM for regression and a linear regression is that in simple regression we try to minimize the error, while in SVR we try to fit the error within a certain threshold.

Thus, as linear regression did a good job in the prediction, SVM will probably also do a great job. Let's check it out.

Model with raw data

As we can see, SVM has actually done a good job at predicting the price. In fact, it is the model that has performed best on the "raw data" dataset.

Despite trying several tunings, I have not been able to make a neural network converge. Thus, I will simply show the code I have used (maybe you can help me find better hidden-layer parameters) and drop this model from the comparison.
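In that spirit, here is the kind of single-hidden-layer network I mean, sketched with nnet on the built-in `mtcars` data (the `size`, `decay`, and scaling choices below are placeholders, not the kernel's actual settings):

```r
library(nnet)

set.seed(7)
X <- scale(mtcars[, -1])             # standardize the inputs
y <- mtcars$mpg / max(mtcars$mpg)    # scale the target into [0, 1]
net <- nnet(X, y, size = 5, decay = 0.01, linout = TRUE,
            maxit = 500, trace = FALSE)
sqrt(mean((predict(net, X) - y)^2))  # in-sample RMSE on the scaled target
```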

If we analyze the results, one thing is clear: all models work much better when we undertake good missing-value handling, feature generation and feature selection. This can be clearly seen in the following graph.

results %>%
ggplot() +
geom_path(aes(x= RMSE, y = Model ),
arrow = arrow(length =unit(1.5,"mm"), type = "closed")) +
geom_text(aes(x = RMSE, y = Model, label = round(RMSE,2),
hjust = ifelse(Data == "Non_worked_data",-0.3,1.2)),
size = 3, color = "gray25"
) +
labs(title = "Model performance with processed data vs not processed data ", x= "RMSE on test", y = "") +
theme_minimal()

This kernel tries to emphasize the importance of good data processing, which is often forgotten. That being said, choosing the right ML model (or models) and tuning them correctly is also crucial for obtaining the best results possible.
