A multivariable linear regression model was used to predict the price of pre-owned cars and to validate the predictions. In this post, you can find a step-by-step model implementation, including data preprocessing, variable selection, model building, checking regression assumptions, taking corrective actions, and testing the final model.
The data was obtained from the course material of Data Mining in Engineering at Northeastern University, where it was used to predict the price of pre-owned Toyota Corollas with a neural network. Here, I use a widely applied predictive model, linear regression, to predict the price of a pre-owned car. The data includes 38 variables and 1436 observations. The description of the variables is provided here.
A multiple linear regression model fits a linear relationship between a quantitative dependent variable (response variable) and multiple predictor variables. The model has the form:
$$ \hat{Y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n $$
$$ \hat{Y} = X\beta \hspace{0.4cm} (\text{matrix notation}) $$
Where $\hat{Y}$ is the predicted value of the response variable, $x_1$, $x_2$, ..., $x_n$ are the predictor variables, $\beta_0$, $\beta_1$, ..., $\beta_n$ are unknown parameters to be estimated, and $n$ is the number of predictor variables. Note that the model is called linear because it is linear in the parameters (the $\beta$'s), not necessarily in the $x$'s. The model parameters (the $\beta$'s) are estimated such that the difference between the actual (observed, $Y$) response values and the predicted values is minimized. Mathematically, minimizing the sum of squared residuals is written as:
$$\min J(\beta) = \min \sum\limits_{i=1}^m (Y_i - \hat{Y}_i)^2 = \min \sum\limits_{i=1}^m \left(Y_i - (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_nx_{in})\right)^2$$
where $J(\beta)$ is the cost function, $Y_i$ is the $i^{th}$ observed response value, and $m$ is the total number of observations.
Optimization Algorithms:
Gradient descent: an iterative method used to find the optimal values of $\beta$. It starts with some initial values of $\beta$ and simultaneously updates all the $\beta$'s using the following equation until it converges to the minimum of $J(\beta)$.
$\beta_j = \beta_j - \alpha \frac{\partial}{\partial\beta_j} J(\beta)$, for all $j$ from 0 to $n$ (the number of predictor variables)
$\hspace{0.85cm} = \beta_j + 2\alpha \sum\limits_{i=1}^m \left(Y_i - (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_nx_{in})\right) x_{ij}$, where $x_{i0} = 1$ for the intercept term
Where $\alpha$ is the learning rate.
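To make the update rule concrete, here is a minimal gradient-descent sketch on simulated data; it is not part of the original analysis (the car-price models later in this post are fitted with lm()), and the simulated variables, the learning rate and the iteration count are arbitrary choices for illustration.
# Gradient-descent sketch for simple linear regression on simulated data (illustrative only)
set.seed(1)
x <- as.vector(scale(rnorm(100)))   # one standardized predictor
Y <- 5 + 2 * x + rnorm(100)         # simulated response
X <- cbind(1, x)                    # design matrix; the column of ones is x_i0
beta <- c(0, 0)                     # initial values of beta
alpha <- 0.001                      # learning rate
for (iter in 1:2000) {
  residuals <- Y - as.vector(X %*% beta)                      # Y_i - Y_hat_i
  beta <- beta + 2 * alpha * as.vector(t(X) %*% residuals)    # simultaneous update of all beta_j
}
beta   # should be close to the OLS estimates: coef(lm(Y ~ x))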
Normal equation: Unlike gradient descent, this method finds the optimum $\beta$ without iteration. $$\beta = (X^T X)^{-1}X^T Y$$
Where: $\beta$ is the vector of $\beta_0$, $\beta_1$, ..., $\beta_n$; $X$ is an $m \times (n+1)$ matrix of the values of the predictor variables (a column of ones is added to handle the intercept term, $\beta_0$); $X^T$ stands for the transpose of $X$; and $Y$ is the vector of the observed response values.
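As an illustration (again on simulated data, not part of the original analysis), the normal equation can be applied directly in R; lm() solves the same problem more robustly via a QR decomposition.
# Normal-equation sketch on simulated data (illustrative only)
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
Y <- 3 + 2 * x1 - 1.5 * x2 + rnorm(100)
X <- cbind(1, x1, x2)                   # the column of ones handles the intercept
beta <- solve(t(X) %*% X, t(X) %*% Y)   # beta = (X'X)^(-1) X'Y
beta                                    # compare with coef(lm(Y ~ x1 + x2))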
First, load all the necessary R packages used to import, explore and analyze the data.
# Important libraries
library(rJava)     # required by the xlsx package
library(xlsxjars)  # required by the xlsx package
library(xlsx)      # to read a file with .xlsx extension
library(car)       # to evaluate regression conditions
library(biglm)     # to fit linear models to large data sets
library(leaps)     # to perform all subsets regression
library(stargazer) # to produce summary statistics tables
The data was provided in an Excel file (xlsx format). It can be imported using the xlsx package as shown below, or converted into a CSV file and then imported into R using read.csv.
Corolla<- read.xlsx('ToyotaCorolla.xlsx', sheetName="data")
#names(Corolla) # Variable names
#head(Corolla, 4)
You can list the variable names using names(mydata), and print the top n observations using head(mydata, n). The summary statistics of car price by each of the categorical variables is provided here. The overall mean and median car price were 10,731 and 9,900, respectively.
#Summary statistics
stargazer(Corolla["Price"], type = "text", title="Summary statistics of Car Price",
digits=0, median=TRUE, iqr=TRUE, min.max=TRUE)
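The by-group summaries mentioned above are not shown in the code; a minimal sketch using the base R function aggregate(), for example by fuel type, could look like this (any of the other categorical variables can be summarized the same way):
# Mean and median price by fuel type (sketch)
aggregate(Price ~ Fuel_Type, data = Corolla,
          FUN = function(p) c(mean = mean(p), median = median(p)))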
Note that some of the categorical variables have more than two categories, and therefore a design matrix with dummy variables was created manually. It can also be done using the R function factor. For example, the variable Fuel_Type has three categories (Diesel, Petrol and CNG), so we can create two dummy variables: the dummy variable "Fuel_Diesel" has a value of 1 if the fuel is diesel and 0 otherwise, and the dummy variable "Fuel_Petrol" is 1 if the fuel is petrol and 0 otherwise. CNG is taken as the reference category.
# Add dummy variables for fuel type Diesel and Petrol by taking CNG as a base
Corolla$Fuel_Diesel= ifelse(Corolla$Fuel_Type=="Diesel", 1, 0)
Corolla$Fuel_Petrol=ifelse(Corolla$Fuel_Type=="Petrol", 1, 0)
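Alternatively (a sketch, not the approach used in this post), the same dummy coding can be generated automatically by letting model.matrix() treat Fuel_Type as a factor; CNG becomes the reference level because the levels are ordered alphabetically.
# Automatic dummy coding for Fuel_Type (sketch); drop the intercept column
fuel_dummies <- model.matrix(~ Fuel_Type, data = Corolla)[, -1]
head(fuel_dummies)  # columns Fuel_TypeDiesel and Fuel_TypePetrol (0/1)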
The data is divided into training and test data sets to fit the regression model and to test the generalizability of the model, respectively. Seventy percent of the observations were randomly picked as training data and the remaining 30% as test data.
# Split the data into training and test data for validation purposes
set.seed(1) # for reproducibility
training_size= round(nrow(Corolla)* 0.7) # the size of the training data set (70%)
test_size= nrow(Corolla)-training_size # the size of the test data set
training_sample= sample(nrow(Corolla), training_size, replace = FALSE) # select the training observations
training_data= Corolla[training_sample, ]
test_data=Corolla[-training_sample, ]
A set of predictor variables was selected from the full list of variables using all subsets regression. I prefer this method over stepwise variable selection because all subsets regression checks every possible combination of variables.
All variables except the model description, manufacturing month and manufacturing year were included in the variable selection. The latter two variables were excluded because they provide the same information as the age of the car, which is already included in the model.
#all subset regression plot
SubReg=regsubsets(Price ~., data=training_data[, -c(1,2,5,6,8,11,15)], nbest=2) # chooses the best two models for each subset size
plot(SubReg, scale = "adjr2")
You can read the above graph as follows: each row corresponds to a model, the shaded rectangles indicate the variables that are included in a given model, the labels on the x-axis are the variable names, and the numbers on the y-axis are adjusted R-squares. You can see from the plot, starting from the bottom, that a model with the intercept and Boardcomputer has an adjusted R-square of 0.35, whereas one with the intercept and age has 0.78.
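If you prefer to read these values numerically rather than off the plot, the summary() method provided by the leaps package returns them directly (a brief sketch; sub_summary is just an illustrative name):
# Adjusted R-square of each candidate model and the variables in the best one
sub_summary <- summary(SubReg)
round(sub_summary$adjr2, 2)                         # adjusted R-square per model
sub_summary$which[which.max(sub_summary$adjr2), ]   # variables in the best model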
The following set of variables contains the dependent variable (Price) and the subset of predictor variables with the highest adjusted R-square from the all subsets regression analysis.
var= c("Price", "Age_08_04", "KM", "HP", "Quarterly_Tax", "Weight", "Automatic_airco"
,"Guarantee_Period" , "Powered_Windows")
The model is fitted with the ordinary least squares method using the selected variables. As the results show, all coefficients are significantly different from zero at the P < 0.0001 level.
model1= lm(Price~., data= training_data[var])
summary(model1)
Model interpretation: the model accounts for 88.15% (see Multiple R-squared) of the variance in car prices. The average error in predicting price using this model is 1226 Euro (see the residual standard error).
#Estimated coefficients and their 95% confidence intervals
round(model1$coefficients, digits=2) # intercept and coefficients
confint(model1) # 95% confidence interval of the intercept and the coefficients
Price = 4157.81 - 111.93*Age_08_04 - 0.02*KM + 21.02*HP + 6.68*Quarterly_Tax + 10.11*Weight + 2716.86*Automatic_airco + 54.88*Guarantee_Period + 506.79*Powered_Windows
The model is interpreted as follows: for each additional month of car age, the selling price decreases by 111.93 Euro, holding the other variables constant, and similarly for the other coefficients.
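As a quick illustration of using the fitted model, the price of a hypothetical car can be predicted with predict(); the values below are made up purely for demonstration.
# Predict the price of a hypothetical car (illustrative values only)
new_car <- data.frame(Age_08_04 = 30, KM = 40000, HP = 110, Quarterly_Tax = 210,
                      Weight = 1165, Automatic_airco = 0, Guarantee_Period = 3,
                      Powered_Windows = 1)
predict(model1, newdata = new_car, interval = "prediction")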
In order to use the fitted regression model, the following conditions have to be met:
a) Normality: in the bivariate relations, for fixed values of the independent variable X, the dependent variable Y is normally distributed. If the dependent variable is normally distributed, the residual values should also be normally distributed.
b) Constant variance (homoscedasticity): in the bivariate relations, for fixed values of the independent variable X, the variance of the dependent variable Y is constant.
c) Linearity: all bivariate relations involved are linear. Thus, there should be no systematic relationship between the residuals and the predicted values.
d) Independence: the cases are independent of each other.
e) Multicollinearity: there is little or no correlation between the predictor variables.
f) Influential observations: there are no observations that unduly influence the parameter estimates.
The statistical assumptions in regression analysis can be evaluated using the plot() function from base R or functions from the car package. Here, I used functions from the car package to evaluate the regression conditions.
The function qqPlot() from the car package is used to check the normality assumption. It plots the studentized residuals against a t-distribution with n (sample size) minus p (number of regression parameters, including the intercept) minus one degrees of freedom. As shown in the Q-Q plot, some values fall outside the confidence envelope and far from the central line.
qqPlot(model1, labels=row.names(training_data), simulate=TRUE, id.method="identify", main="Q-Q Plot")
The histogram and kernel density curve shown below display the distribution of the errors. It seems that the errors are close to the normal curve, with the exception of some outliers.
#Plotting the studentized residuals as a histogram
rs= rstudent(model1)
hist(rs, breaks=20, freq=FALSE, xlab="Studentized Residuals", main="Distribution of Errors")
curve(dnorm(x, mean=mean(rs), sd=sd(rs)), add=TRUE, col= "blue", lwd=2)
lines(density(rs)$x, density(rs)$y, col= "red", lwd=2, lty=2)
legend("topright", legend=c("Normal Curve", "Kernel Density Curve"), lty=1:2, col=c("blue", "red"), cex=.7)
Since the normality assumption is primarily relevant for deriving confidence intervals for predictions, and I have a test data set to validate the model, the model can still be used even if the normality assumption is not satisfied. In general, transformation of the response variable may help when the normality assumption is violated. The function powerTransform() from the car package can be used to estimate the transformation power of the response variable.
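For example, a brief sketch of estimating that transformation for the fitted model (the estimated power would then be rounded to a convenient value such as a square root or log before being applied to Price):
# Estimate the Box-Cox transformation power of the response for model1
summary(powerTransform(model1))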
The hypothesis of constant error variance is tested against the alternative that the error variance changes, using the function ncvTest() from the car package. The significant p-value (8.585757e-19) indicates non-constant error variance.
ncvTest(model1)
To stabilize the non-constant error variance, a power transformation of 0.5213714 (approximated as a square root) is suggested by the spread-level plot.
# creates a scatter plot of the absolute studentized residuals vs. the fitted values
spreadLevelPlot(model1)
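As a sketch of this corrective step (the full revised model is fitted further below), the response can be refitted on the square-root scale and the constant-variance test repeated; model_sqrt is just an illustrative name.
# Refit with sqrt(Price) and re-check the constant variance hypothesis
model_sqrt <- lm(sqrt(Price) ~ ., data = training_data[var])
ncvTest(model_sqrt)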
The linearity assumption is checked using the crPlots() function from the car package. The component-plus-residual plots (partial residual plots) show whether the relationships between the dependent and independent variables are linear. From the plots shown below, it is difficult to assume a linear relationship between some of the independent variables and the dependent variable.
crPlots(model1, smooth=FALSE)
Violation of the linearity assumption can be corrected by power transformation of the predictor variables. The function boxTidwell() from the car package can be used to estimate the powers of the predictor variables. In this demonstration, it doesn't make sense to apply a power transformation to variables that represent categories, such as Automatic_airco. For the remaining variables, a square-root transformation of Age_08_04, a cubic transformation of HP and the inverse of Weight are estimated. Note that the normality and constant variance assumptions have to be met before transforming the predictor variables; thus, the square-root transformation of the response variable is applied as suggested above.
boxTidwell(sqrt(Price) ~ Age_08_04 + KM + HP + Quarterly_Tax + Weight, ~ Guarantee_Period + Automatic_airco + Powered_Windows, data=training_data[var])
Generally, the independence assumption can be assessed from knowledge of how the data were collected, but we can also use the Durbin-Watson test to detect correlated errors. The p-value is non-significant (>0.05), so the null hypothesis of no autocorrelation is not rejected. Thus, the errors are independent.
durbinWatsonTest(model1)
Multicollinearity between the predictor variables is also checked for this model using the Variance Inflation Factor (VIF). I found no multicollinearity problem.
sqrt(vif(model1))>2 # if this is true there is multicollinearity problem
Unusual observations such as outliers (large positive or negative residuals), high-leverage points (unusual combinations of predictor values) and influential observations (observations with a strong influence on parameter estimation) are detected using different methods. I used the function outlierTest() to identify outliers and influencePlot() to visualize outliers, leverage points and influential observations in a single plot.
outlierTest(model1) # list the significance of outliers
# Combine the outlier, leverage and influence information in one plot using the influencePlot() function
influencePlot(model1, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")
Observations with studentized residuals beyond -2 and 2 are outliers. Points to the right of the dotted vertical lines are high-leverage points, and points with large circles are influential observations.
None of the observations is influential, as their Cook's distances are all less than one.
plot(model1, which=4, cook.levels=1)
The regression diagnostics showed which assumptions are satisfied and which are not, and corrective measures have been taken. Combining all of these measures, a new model is proposed:
model2= lm(sqrt(Price) ~ sqrt(Age_08_04) + KM + I(HP^3) + Quarterly_Tax + I(Weight^-1) + Automatic_airco + Powered_Windows, data=training_data)
summary(model2)
This model is tested against all of the assumptions and satisfies them. Despite its added complexity, this model is better than the original model, as it has a smaller Akaike Information Criterion (AIC) value.
AIC(model1, model2)
The generalizability of the model is tested using a new data set, the test data. The model performs well on the new data. One way to check its performance is to calculate the correlation between the predicted and actual values; they are highly correlated (0.959).
pred_test= predict(model2, test_data)
pred_price= pred_test^2 # back-transform the predictions from the square-root scale
cat("Correlation = ", cor(test_data$Price,pred_price))
#Mean Absolute percentage error
cat("\nMean Absolute Percentage Error=", mean(abs(test_data$Price-pred_price)/test_data$Price))
head(data.frame(Observed=test_data$Price, Predicted=pred_price))
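An additional visual check (not in the original workflow) is to plot the observed against the predicted prices on the test data, with the 45-degree line as a reference:
# Observed vs. predicted prices on the test data
plot(test_data$Price, pred_price, xlab = "Observed price", ylab = "Predicted price")
abline(0, 1, col = "red", lwd = 2)   # points near this line indicate accurate predictions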
Multiple linear regression is a commonly used predictive model for a numerical response variable. This post demonstrated how multiple linear regression analysis can be done in R, from data preprocessing all the way to testing the generalizability of the selected model.