This project uses multiple linear regression to analyze asphalt surface free energy in mJ/m^2 (srf.fr.eng) against several predictor variables.

The data has been taken from http://users.stat.ufl.edu/~winner/data/asphalt_binder.csv and the variables are listed below:

% Saturates (saturates)

% Aromatics (aromatics)

% Resins (resins)

% Asphaltenes (asptenes)

% Wax (wax)

% Carbon (carbon)

% Hydrogen (hydrogen)

% Oxygen (oxygen)

% Nitrogen (nitrogen)

% Sulfur (sulfur)

Nickel in ppm (nickel)

Vanadium in ppm (vanadium)

For the sake of the model, I have removed the asp code (the first column of the dataset).

Start by reading the data and doing a preliminary analysis:

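    # Read the asphalt binder data; the first row holds the column names
    dataset4 <- read.csv('http://users.stat.ufl.edu/~winner/data/asphalt_binder.csv', header = TRUE)
    head(dataset4)      # first few rows
    View(dataset4)      # spreadsheet-style viewer (interactive sessions only)
    summary(dataset4)   # per-column summary statistics
    str(dataset4)       # column types and dimensions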

Because the asp code is removed, the code below builds the data frame that will be used for the analysis:

    # Drop column 1 (the asp code) and keep columns 2 to 14
    mydata <- dataset4[2:14]
    head(mydata)
    View(mydata)
    summary(mydata)
    str(mydata)

Next, compute the correlation matrix and draw the correlation plot:

    cor(mydata)     # correlation matrix of all variables
    pairs(mydata)   # scatterplot matrix (the correlation plot)

Below are the results for the correlation matrix

and also the correlation plot.

The correlation matrix and the plot show positive and negative linear relationships among several variables, so I am going to eliminate some of the variables with the largest correlations in order to reduce multicollinearity.

*“Multicolinearity is a term you use if two x variables are highly correlated.*

*Not only is it redundant to include both related variables in the multiple regression model, but it’s also problematic.*

*Basically: If two x variables are significantly correlated, only include one of them in the regression model, not both.*

*If you include both, the computer won’t know what numbers to give as coefficients for each of the two variables, because they share their contribution to determining the value of y.*

*Multicolinearity can really mess up the model-fitting process and give answers that are inconsistent and oftentimes not repeatable in subsequent studies”*

The variables removed are resins, asptenes and sulfur.

Now I am going to fit the model:
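As a rough illustration of how highly correlated pairs could be flagged programmatically rather than read off the matrix by eye, here is a minimal sketch. It runs on a small synthetic data frame (`toy`), not on the asphalt data, and both the helper name `high_cor_pairs` and the 0.8 cutoff are assumptions made for this example.

```r
# Synthetic stand-in data (hypothetical values, NOT the asphalt data)
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly correlated
x3 <- rnorm(n)                  # unrelated to x1 and x2
toy <- data.frame(x1 = x1, x2 = x2, x3 = x3)

# Hypothetical helper: list variable pairs whose |correlation| exceeds a cutoff
high_cor_pairs <- function(df, cutoff = 0.8) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA             # keep each pair only once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)   # NA entries are skipped
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx])
}

pairs_found <- high_cor_pairs(toy, cutoff = 0.8)
print(pairs_found)
```

With the real data, the same helper could be called as `high_cor_pairs(mydata)`; which variable of each flagged pair to drop remains a judgment call.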

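    # Fit the model without resins, asptenes and sulfur
    mydataMLR <- lm(srf.fr.eng ~ saturates + aromatics + wax + carbon + hydrogen + oxygen + nitrogen + nickel + vanadium, data = mydata)
    mydataMLR            # print the fitted coefficients
    summary(mydataMLR)   # coefficient table, R-squared and F statistic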
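One common way to check whether the remaining predictors are still collinear is the variance inflation factor, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors. The post does not use VIFs, so the sketch below is an optional extra; it runs on synthetic data, and the helper `vif_by_hand` is a name invented for this example.

```r
# Synthetic data (hypothetical): c is almost exactly a + b, so it is collinear
set.seed(2)
n <- 40
toy <- data.frame(a = rnorm(n), b = rnorm(n))
toy$c <- toy$a + toy$b + rnorm(n, sd = 0.2)

# Hypothetical helper: VIF for each predictor, computed with base R only
vif_by_hand <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all the other predictors
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(toy, c("a", "b", "c"))
print(round(vifs, 1))
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, although the exact threshold is a convention rather than a hard rule.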

From the results, four variables are not significant, so I remove aromatics, oxygen, nitrogen and nickel and fit a new model:

    # Refit with only the significant predictors
    mydataMLR2 <- lm(srf.fr.eng ~ saturates + wax + carbon + hydrogen + vanadium, data = mydata)
    summary(mydataMLR2)
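Dropping several variables at once can also be checked with a partial F-test that compares the reduced model against the full one. The post does not run this test, so the following is only a sketch of the idea on synthetic data; the data frame `toy` and its columns are made up.

```r
# Synthetic data (hypothetical): y depends on x1 and x2 only
set.seed(3)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 - 3 * toy$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3 + x4, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# Partial F-test: does adding x3 and x4 reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(reduced, full)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

A large p-value here would suggest the dropped variables add little; with the asphalt models the analogous call would be `anova(mydataMLR2, mydataMLR)`.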

We are looking at the marginal contribution of each x variable when all the other variables in the model are held constant.

The Residuals section provides summary statistics for the errors in our predictions. Since a residual is equal to the true value minus the predicted value, the maximum error of 2.6178 suggests that the model under-predicted surface free energy by about 2.62 mJ/m^2 for at least one observation.

On the other hand, 50 percent of the errors fall between the 1Q and 3Q values (the first and third quartiles), so half of the predictions were between 0.6711 over the true value and 0.6587 under the true value.

The model has several significant variables, and they seem to be related to the outcome given their small Pr(>|t|) values.

The Multiple R-squared value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable. It is similar to the correlation coefficient in that the closer the value is to 1.0, the better the model explains the data.

Since the R-squared value is 0.8864, we know that about 88.6% of the variation in the dependent variable is explained by our model.
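To make the residual definition and the quartile reading concrete, here is a small sketch on synthetic data (the data frame `toy` is hypothetical). It recomputes residuals as observed minus fitted values and then checks what share of them lies between the first and third quartiles.

```r
# Synthetic data (hypothetical): a simple linear relationship plus noise
set.seed(4)
toy <- data.frame(x = runif(25, 0, 10))
toy$y <- 3 + 1.5 * toy$x + rnorm(25)
fit <- lm(y ~ x, data = toy)

# A residual is the observed value minus the predicted (fitted) value
res_manual <- toy$y - fitted(fit)
same <- isTRUE(all.equal(unname(res_manual), unname(residuals(fit))))
print(same)

# Five-number summary: minimum, 1Q, median, 3Q, maximum
print(quantile(residuals(fit)))

# Share of residuals lying between the first and third quartiles
q <- quantile(residuals(fit), c(0.25, 0.75))
inside_iqr <- mean(residuals(fit) >= q[1] & residuals(fit) <= q[2])
print(inside_iqr)
```

A positive residual means the model under-predicted that observation, and a negative one means it over-predicted.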
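The R-squared figure can be reproduced by hand as 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares around the mean of the response. The sketch below shows this on synthetic data (the data frame `toy` is hypothetical) and compares the result with the value reported by `summary()`.

```r
# Synthetic data (hypothetical values, NOT the asphalt data)
set.seed(5)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 2 + toy$x1 - 0.5 * toy$x2 + rnorm(30, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = toy)

sse <- sum(residuals(fit)^2)            # variation left unexplained
sst <- sum((toy$y - mean(toy$y))^2)     # total variation in the response
r2_manual <- 1 - sse / sst

r2_lm <- summary(fit)$r.squared         # value printed as "Multiple R-squared"
print(c(manual = r2_manual, lm = r2_lm))
```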

Next, check the residual and normality plots:

    # check the residuals
    par(mfrow = c(1, 2))                       # two plots side by side
    mydataMLR2.stdRes <- rstandard(mydataMLR2)
    plot(mydataMLR2.stdRes, col = "red")
    abline(0, 0)                               # horizontal reference line at zero
    # normality plot
    qqnorm(mydataMLR2.stdRes, ylab = "Standardized Residuals", xlab = "Normal Scores",
           main = "Normality Plot", col = "red")
    qqline(mydataMLR2.stdRes)

About 95 percent of the standardized residuals should fall within two standard deviations of the mean, which in this case is -2 to +2 (by the 68-95-99.7 rule, also called the empirical rule).

We should see more residuals hovering around zero and we should have fewer and fewer of the residuals as they go away from zero.

If the points in the normality plot fall along a straight line, the normality condition is met.

From the plots, it looks like the condition is met.
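The visual checks can be backed up with numbers. The sketch below, on synthetic data (the data frame `toy` is hypothetical), computes the share of standardized residuals inside -2 to +2 and runs a Shapiro-Wilk test. The Shapiro-Wilk test is an addition for illustration and is not part of the original analysis.

```r
# Synthetic data (hypothetical) with normally distributed errors
set.seed(6)
toy <- data.frame(x = rnorm(60))
toy$y <- 1 + 2 * toy$x + rnorm(60)
fit <- lm(y ~ x, data = toy)

std_res <- rstandard(fit)

# Share of standardized residuals within two standard deviations of zero;
# the empirical rule suggests roughly 0.95 if the errors are normal
share_within_2 <- mean(abs(std_res) <= 2)
print(share_within_2)

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw <- shapiro.test(std_res)
print(sw$p.value)
```

For the asphalt model, the same two checks could be applied to `mydataMLR2.stdRes`.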

In conclusion, the model y = -121.30 - 0.39 saturates - 1.70 wax + 1.23 carbon + 4.33 hydrogen - 0.0027 vanadium looks like quite a good model to explain the asphalt surface free energy in mJ/m^2.
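To show how the fitted equation turns a binder composition into a prediction, here is a small worked sketch. The coefficients are the rounded values quoted above; the composition in `binder` is entirely made up for illustration and does not come from the dataset.

```r
# Rounded coefficients as quoted in the conclusion
coefs <- c(intercept = -121.30, saturates = -0.39, wax = -1.70,
           carbon = 1.23, hydrogen = 4.33, vanadium = -0.0027)

# Hypothetical binder composition (invented values, for illustration only)
binder <- c(saturates = 10, wax = 1, carbon = 85, hydrogen = 10, vanadium = 100)

# Prediction = intercept + sum of (coefficient * predictor value)
pred <- coefs["intercept"] + sum(coefs[names(binder)] * binder)
print(unname(pred))   # predicted surface free energy in mJ/m^2
```

Because the coefficients are rounded and carbon is a large percentage, small rounding differences can shift the result noticeably; with the fitted object available, `predict(mydataMLR2, newdata = ...)` would use the full-precision coefficients instead.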