Multiple Linear Regression with R – CA4

Mathematical relationships describe many aspects of everyday life, and linear regression analysis is one way to model them. One of the most common forms is Multiple Linear Regression, which is used to explain or predict the relationship between one continuous dependent variable and two or more independent variables, where the independent variables can be continuous or categorical.

In this post we show how to apply Multiple Linear Regression in R. We have chosen the dataset “Birth Rates and Economic Development”, available for download at: http://users.stat.ufl.edu/~winner/data/birthrate.dat

See below data description:

Note that this dataset doesn’t come with column names, so we have to add them and save the data as a .csv file before loading it into R. This is easily done using Excel.
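
As an alternative to Excel, the column names could also be attached directly in R. The following is only a minimal sketch: the three rows are made-up stand-ins for the raw file, the column order is assumed from the variable list used in this post, and the real file's layout (for example nation names containing spaces) may need a different reader such as read.fwf( ).

```r
# Made-up rows standing in for the raw file; these are NOT the real data.
raw_text <- "
Alpha 30.1 250 0.45 60.2
Beta  18.4 900 0.12 25.7
Gamma 42.3 180 0.60 88.9
"

# Column names used in this post (the order is an assumption).
col_names <- c("Nation", "Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")

# Read whitespace-separated text with no header row and attach the names.
mortality_demo <- read.table(text = raw_text, header = FALSE,
                             col.names = col_names, stringsAsFactors = FALSE)

# Save as .csv and read it back, mirroring the workflow described above.
csv_path <- tempfile(fileext = ".csv")
write.csv(mortality_demo, csv_path, row.names = FALSE)
mortality_check <- read.csv(csv_path, header = TRUE)
str(mortality_check)
```

For the real file, the path or URL would replace the text = argument.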

Loading data into R:

mortality <- read.csv(file.choose(), header = TRUE)

Now, let’s have a look at our dataset details:

> head(mortality)

> str(mortality)

The function head( ) gives us a taste of what the data looks like, and we can also get information about its structure using the str( ) function. As we can see, the dataset has 30 observations and 5 variables: Nation is nominal, while Birth_Rate, PC_Income, Pop_farm and Mort_Rate are numerical.

With this data we are interested in studying the possible factors contributing to the Infant Mortality Rate, so we are going to analyse how it relates to Birth Rate, Per Capita Income and the Proportion of population on farms. In other words, our dependent variable will be Mort_Rate and our independent variables will be Birth_Rate, PC_Income and Pop_farm.
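
Written out, the model we are about to fit can be sketched as follows. The symbols are the usual textbook notation rather than anything specific to this dataset: the betas are the unknown coefficients, epsilon is the error term, and i indexes the 30 nations.

```latex
\[
\mathrm{Mort\_Rate}_i = \beta_0
  + \beta_1\,\mathrm{Birth\_Rate}_i
  + \beta_2\,\mathrm{PC\_Income}_i
  + \beta_3\,\mathrm{Pop\_farm}_i
  + \varepsilon_i,
\qquad i = 1, \dots, 30
\]
```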

Let’s first visualize our data using the function pairs( ) to examine the relationships between the variables:
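
A call along the following lines produces such a scatter plot matrix. This is a sketch only: the data frame is simulated so that the example runs on its own (the values are not the real dataset), and selecting the four numeric columns by name is one possible way to leave Nation out.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

# Keep only the numeric variables; Nation is just a label.
num_vars <- mortality[, c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")]

# One scatter plot for every pair of variables.
pairs(num_vars, main = "Scatter plot matrix (simulated data)")
```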

We should bear in mind that with simple linear regression it is easier to spot a problem with the model by graphing X against Y. With multiple linear regression it is not that simple: there may be some interaction between variables that a simple scatter plot won’t show.

So let’s have a look at the correlation matrix and see if we can get more information:

cor(mortality[c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")])

We can see that Mort_Rate and Pop_farm have a fairly strong positive correlation of 0.69. It is also interesting to note that Mort_Rate and PC_Income show a negative correlation of -0.74, as do Pop_farm and PC_Income with -0.77, suggesting that infant mortality is higher where per capita income is lower, and that per capita income is lower where a larger share of the population lives on farms. Note that Birth_Rate doesn’t seem to be strongly correlated with any of the other variables apart from Mort_Rate.
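
The correlation matrix shows the strength of each relationship but not whether it is statistically significant. One way to check a single pair is cor.test( ), sketched here on simulated data (not the real dataset, so the numbers will differ from those quoted above).

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

# Rounded correlation matrix of the numeric variables.
cor_matrix <- round(cor(mortality[c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")]), 2)
print(cor_matrix)

# Pearson correlation test for one pair: estimate, confidence interval and p-value.
ct <- cor.test(mortality$Mort_Rate, mortality$PC_Income)
print(ct$estimate)
print(ct$p.value)
```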

Let’s go ahead and build our model:

modelmortality <- lm(Mort_Rate ~ . - Nation, data = mortality)

# Note that ~ . regresses our dependent variable Mort_Rate on all the other variables in the data, and - Nation removes the variable Nation, which is a country label rather than a numerical predictor.
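
To make the shorthand concrete, the sketch below fits the model twice, once with ~ . - Nation and once with the predictors written out, and checks that the coefficients agree. The data frame is simulated (not the real dataset) and the object names model_dot and model_explicit are only illustrative.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

# Shorthand: every other column as a predictor, minus Nation.
model_dot <- lm(Mort_Rate ~ . - Nation, data = mortality)

# The same model with the predictors listed explicitly.
model_explicit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)

# Both formulas describe the same regression, so the coefficients match.
print(all.equal(coef(model_dot), coef(model_explicit)))
```

Writing the predictors out is more verbose but protects the model from silently changing if a column is later added to the data.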

When we print the object modelmortality we get the coefficients of the regression function:

> modelmortality

The printed coefficients tell us how the independent variables are related to the dependent variable, but to find out how well the model fits the data we use the function summary( ). This function allows us to evaluate the model’s performance:

> summary(modelmortality)

Let’s analyse the outcomes:

The Residuals section provides summary statistics for the errors in our predictions, some of which are apparently quite substantial.

The stars flag the significance level of each estimated regression coefficient, based on its p-value. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. Here we can see statistically significant results for Birth_Rate and PC_Income, but Pop_farm doesn’t seem to add much to our model.

The Multiple R-squared value (also called the coefficient of determination) measures how well our model as a whole explains the values of the dependent variable. We got a result of 71%, meaning the model explains about 71% of the variation in Mort_Rate, which is a very good result.
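
The three indicators discussed above can also be pulled out of the summary object directly, which is handy when comparing models. This is a sketch on simulated data (not the real dataset, so the values will not match the 71% reported here), and the object names are illustrative.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

fit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)
fit_summary <- summary(fit)

# 1. Residuals: five-number summary of the prediction errors.
print(quantile(residuals(fit)))

# 2. Coefficient table: estimates, standard errors, t values and p-values.
coef_table <- coef(fit_summary)
p_values <- coef_table[, "Pr(>|t|)"]
print(names(p_values)[p_values < 0.05])  # terms significant at the 0.05 level

# 3. Multiple and adjusted R-squared.
print(c(r_squared = fit_summary$r.squared,
        adj_r_squared = fit_summary$adj.r.squared))
```
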
Now let’s build a second model excluding Pop_farm and see if we can improve our model:

model2 <- lm(Mort_Rate ~ Birth_Rate + PC_Income, data = mortality)

> summary(model2)

We can see that the second model has a slightly lower R-squared value of 70%, so dropping Pop_farm does not improve our model in any meaningful way and we will stick to the first model, modelmortality.
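
One caveat when comparing the two models: Multiple R-squared can never increase when a predictor is removed, so Adjusted R-squared or a partial F-test is a fairer basis for the comparison. The sketch below shows both on simulated data (not the real dataset); the names full and reduced are illustrative stand-ins for modelmortality and model2.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

full    <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)
reduced <- lm(Mort_Rate ~ Birth_Rate + PC_Income, data = mortality)

# Adjusted R-squared penalises extra predictors, unlike Multiple R-squared.
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(adj_r2)

# Partial F-test: does adding Pop_farm significantly reduce the residual error?
comparison <- anova(reduced, full)
print(comparison)
```

With a single dropped predictor, the F-test p-value equals the p-value of that predictor's t-test in the full model's summary.
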
Overall, given the preceding three performance indicators, our model is performing fairly well. It shows that Birth_Rate and, most importantly, PC_Income have an important influence on Mort_Rate, while Pop_farm contributes much less than the correlation matrix initially suggested.
Let’s plot the model:
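
Calling plot( ) on a fitted lm object draws four diagnostic plots by default. A sketch of how they could be arranged on one page follows; the data is simulated (not the real dataset) and the 2 x 2 layout is just one convenient choice.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

fit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)

# Arrange the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout
```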

Our first graph (Residuals vs Fitted) shows whether the residuals have non-linear patterns.

The Normal Q-Q plot shows whether the residuals are normally distributed, which appears to be the case here.

The Scale-Location plot shows whether the residuals are spread equally along the range of fitted values, which indicates equal variance. As we can see, we have a horizontal line with equally spread points.

The Residuals vs Leverage plot (with Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). We see that points 7, 9 and 16 have the greatest influence on the model.
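
Because plot( ) labels the most extreme points by default whether or not they are genuinely problematic, it can help to look at the influence measures as numbers. The sketch below uses simulated data (not the real dataset, so points 7, 9 and 16 will not necessarily appear), and the 4/n cut-off is a common rule of thumb assumed here, not something taken from the analysis above.

```r
# Simulated stand-in with the same column names as in this post (NOT the real data).
set.seed(1)
n <- 30
mortality <- data.frame(
  Nation     = paste0("Nation_", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
mortality$Mort_Rate <- 60 + 0.8 * mortality$Birth_Rate -
  0.03 * mortality$PC_Income + rnorm(n, sd = 8)

fit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)

# Cook's distance: how much the fit changes when one observation is left out.
cooks <- cooks.distance(fit)

# Leverage (hat values): how unusual each observation's predictor values are.
lev <- hatvalues(fit)

# The three observations with the largest Cook's distance.
print(head(sort(cooks, decreasing = TRUE), 3))

# Observations above the 4/n rule-of-thumb threshold (an assumed cut-off).
threshold <- 4 / nrow(mortality)
print(which(cooks > threshold))
```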

____________________________________________________________________________

References:

https://www.statisticssolutions.com/what-is-multiple-linear-regression/

Multiple Linear Regression – Course Notes

https://www.investopedia.com/terms/m/mlr.asp

R. Weintraub (1962). “The Birth Rate and Economic Development: An Empirical Study”, Econometrica, Vol. 30, No. 4, pp. 812-817.
