Mathematical relationships describe many aspects of everyday life, and linear regression analysis is one way to model them. One of the most common forms is Multiple Linear Regression, which is used to explain or predict the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.

In this post we show how to apply Multiple Linear Regression in R. We have chosen the dataset “Birth Rates and Economic Development”, available for download at: http://users.stat.ufl.edu/~winner/data/birthrate.dat

See below data description:

Dataset: birthrate.dat

Description: Birth Rates, per capita income, proportion (ratio?) of population in farming, and infant mortality during early 1950s for 30 nations.

Variables/Columns:

Nation 1-20

Birth Rate 22-25 /* 1953-1954 (Units not given) */

Per Capita Income 30-33 /* 1953-1954 in 1948 US $ */

Proportion of population on farms 38-41 /* Circa 1950 */

Infant Mortality Rate 45-49 /* 1953-1954 */

Note that this dataset doesn’t come with column names, so we have to add them and save the file as a .csv before loading the data into R. This is easily done in Excel.
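As an alternative to the Excel step, the column positions in the data description suggest the file could be read directly as fixed-width text. The sketch below is only an illustration under that assumption: it writes three made-up rows (not real values from birthrate.dat) to a temporary file in the described layout and reads them back with *read.fwf( )*. The object name `mortality_fwf` is hypothetical, and the column names are the ones used in this post.

```r
# Hypothetical rows laid out like birthrate.dat. These values are made up
# for illustration and are NOT taken from the real dataset.
sample_lines <- sprintf(
  "%-20s %4.1f    %4d    %4.2f   %5.1f",
  c("Nation A", "Nation B", "Nation C"),
  c(20.5, 35.2, 44.1),      # Birth Rate, columns 22-25
  c(900L, 350L, 120L),      # Per Capita Income, columns 30-33
  c(0.15, 0.40, 0.65),      # Proportion on farms, columns 38-41
  c(30.5, 75.0, 110.2)      # Infant Mortality Rate, columns 45-49
)
tmp <- tempfile(fileext = ".dat")
writeLines(sample_lines, tmp)

# Positive widths are fields to keep, negative widths are gaps to skip.
mortality_fwf <- read.fwf(
  tmp,
  widths = c(20, -1, 4, -4, 4, -4, 4, -3, 5),
  col.names = c("Nation", "Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate"),
  strip.white = TRUE        # drop the padding around the nation names
)
str(mortality_fwf)
```

If the real file follows the documented column positions, replacing `tmp` with the path to the downloaded file should give the same five columns, but that has not been checked here.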

Loading data into R:

*mortality <- read.csv(file.choose(), header = TRUE)*

Now, let’s have a look at our dataset details:

*head(mortality)*

*str(mortality)*

The function *head( )* gives us a taste of what the data looks like, and we can also get information about its structure using the *str( )* function. As we can see, the dataset has 30 observations and 5 variables: **Nation** is nominal, and **Birth_Rate**, **PC_Income**, **Pop_farm** and **Mort_Rate** are numerical.

With this data we are interested in studying the possible factors contributing to Infant Mortality Rate, so we are going to analyse the relationship between Infant Mortality Rate and Birth Rates, Per Capita Income and Proportion of population on farms. In other words, our dependent variable will be **Mort_Rate** and our independent variables will be **Birth_Rate**, **PC_Income** and **Pop_farm**.

Let’s first visualize our data using the function *pairs( )* to examine the relationships between the variables:

We should bear in mind that with simple linear regression it is easy to spot a problem with the model by graphing X against Y. With multiple linear regression it is not that simple: there may be interactions between variables that a simple scatter plot won’t show.

So let’s have a look at the correlation matrix and see if we can get more information:

*cor(mortality[c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")])*


We can see that **Mort_Rate** and **Pop_farm** have a fairly strong positive correlation of **0.69**. It is also interesting to note that **Mort_Rate** and **PC_Income** show a negative correlation of **-0.74**, as do **Pop_farm** and **PC_Income**, with **-0.77**. This suggests that infant mortality is higher where per capita income is lower, and that per capita income is lower where a larger share of the population lives on farms. Note that **Birth_Rate** doesn’t seem to be strongly correlated with the other variables, apart from **Mort_Rate**.
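The correlation matrix reports the size of each correlation but not whether it could plausibly be due to chance. A minimal sketch of how that could be checked for one pair of variables with *cor.test( )* follows. The vectors `income` and `mort` are simulated stand-ins (an assumption for illustration), not the real **PC_Income** and **Mort_Rate** columns.

```r
# Simulated stand-in data, NOT the real birthrate.dat values.
set.seed(42)
n <- 30
income <- runif(n, 100, 1500)
mort <- 120 - 0.05 * income + rnorm(n, sd = 10)

# cor() gives the Pearson correlation coefficient only.
r <- cor(income, mort)

# cor.test() adds a p-value and a confidence interval for the same coefficient.
ct <- cor.test(income, mort)

print(round(r, 2))
print(ct$p.value)
print(ct$conf.int)
```

With the real data, the same call would take two columns of `mortality`, for example `cor.test(mortality$PC_Income, mortality$Mort_Rate)`.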

Let’s go ahead and build our model:

*modelmortality <- lm(Mort_Rate ~ . - Nation, data = mortality)*

*# Note that ~ . regresses the dependent variable Mort_Rate on all other variables in the data. We subtract Nation, which is not relevant to our model because it is a non-numerical label for each observation.*
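To make the formula shorthand concrete, the sketch below fits the same model twice, once with `. - Nation` and once with the predictors written out, and checks that the coefficients agree. The data frame `toy` is simulated purely for illustration (its values and the coefficients used to generate it are assumptions); only the column names match the post.

```r
# Simulated stand-in for the mortality data frame, NOT the real values.
set.seed(1)
n <- 30
toy <- data.frame(
  Nation     = paste("Nation", seq_len(n)),
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
toy$Mort_Rate <- 40 + 1.5 * toy$Birth_Rate - 0.03 * toy$PC_Income +
  rnorm(n, sd = 8)

# Shorthand: every column except the response, minus Nation.
fit_dot <- lm(Mort_Rate ~ . - Nation, data = toy)

# The same model with the predictors spelled out.
fit_explicit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = toy)

# Both formulas describe the same model, so the coefficients match.
print(all.equal(coef(fit_dot), coef(fit_explicit)))
```

The explicit form is longer but makes it obvious which predictors are in the model, which can help when the data frame gains new columns later.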

When we print the fitted model **modelmortality** we get the regression coefficients:

*modelmortality*

For more detail we can use *summary(modelmortality)*. This function allows us to evaluate the model’s performance:

The **Residuals** section provides summary statistics for the errors in our predictions, some of which are apparently quite substantial.

The summary also gives a **p-value** for each estimated regression coefficient. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. Here we can see statistically significant results for **Birth_Rate** and **PC_Income**, but **Pop_farm** doesn’t seem to add much to our model.

The **Multiple R-squared** value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable. We got a result of **71%**, meaning the model accounts for about 71% of the variation in **Mort_Rate**, which is a very good result.
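To show where the Multiple R-squared figure comes from, the sketch below recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value reported by *summary( )*. It also computes the adjusted R-squared, which penalises the number of predictors. The data frame `toy` is simulated for illustration and is not the real dataset, so the numbers will differ from the 71% above.

```r
# Simulated stand-in for the mortality data frame, NOT the real values.
set.seed(1)
n <- 30
toy <- data.frame(
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
toy$Mort_Rate <- 40 + 1.5 * toy$Birth_Rate - 0.03 * toy$PC_Income +
  rnorm(n, sd = 8)

fit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = toy)

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$Mort_Rate - mean(toy$Mort_Rate))^2)
r2_manual <- 1 - ss_res / ss_tot

# Adjusted R-squared corrects for the number of predictors p.
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(manual = r2_manual, summary = summary(fit)$r.squared))
print(c(manual = adj_r2_manual, summary = summary(fit)$adj.r.squared))
```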

Let’s remove **Pop_farm** and see if we can improve our model:

*model2 <- lm(Mort_Rate ~ Birth_Rate + PC_Income, data = mortality)*

The Multiple R-squared of the new model is **70%**, which is not an improvement on the first model, so we will stick with **modelmortality**.
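One caveat when comparing the two figures: Multiple R-squared can never go down when a predictor is added, so the larger model is slightly favoured by construction. Two common complements are the adjusted R-squared and a partial F-test on the nested models via *anova( )*. The sketch below shows both on simulated data; `toy`, `full` and `reduced` are hypothetical names, and the values are not from the real dataset.

```r
# Simulated stand-in for the mortality data frame, NOT the real values.
set.seed(1)
n <- 30
toy <- data.frame(
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
toy$Mort_Rate <- 40 + 1.5 * toy$Birth_Rate - 0.03 * toy$PC_Income +
  rnorm(n, sd = 8)

full    <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = toy)
reduced <- lm(Mort_Rate ~ Birth_Rate + PC_Income, data = toy)

# Multiple R-squared: the full model is never lower than the reduced one.
print(c(full = summary(full)$r.squared,
        reduced = summary(reduced)$r.squared))

# Adjusted R-squared penalises the extra predictor.
print(c(full = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))

# Partial F-test: does adding Pop_farm reduce the residual error
# by more than chance alone would explain?
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of the *anova( )* table would indicate that the extra predictor adds little, which is the same question the coefficient p-value for **Pop_farm** addresses.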

The model points to a relationship between **Mort_Rate** and **Birth_Rate** and, most importantly, **PC_Income**, but not as strong a relationship with **Pop_farm** as the correlation matrix initially suggested.

Our first graph shows whether the residuals have non-linear patterns.

The Normal Q-Q plot shows whether the residuals are normally distributed, which appears to be the case here.
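The two graphs described above correspond to the first two diagnostic panels that *plot( )* draws for an *lm* object. Assuming that is how they were produced, a sketch of the calls follows, with a Shapiro-Wilk test added as a numerical complement to reading the Q-Q plot by eye. The data frame `toy` is simulated for illustration and is not the real dataset.

```r
# Simulated stand-in for the mortality data frame, NOT the real values.
set.seed(1)
n <- 30
toy <- data.frame(
  Birth_Rate = runif(n, 15, 45),
  PC_Income  = runif(n, 100, 1500),
  Pop_farm   = runif(n, 0.05, 0.70)
)
toy$Mort_Rate <- 40 + 1.5 * toy$Birth_Rate - 0.03 * toy$PC_Income +
  rnorm(n, sd = 8)

fit <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = toy)

# Draw the two panels side by side, then restore the plotting settings.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted: look for curved patterns
plot(fit, which = 2)  # Normal Q-Q: points near the line suggest normality
par(op)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(residuals(fit))
print(sw)
```

With only 30 observations, both the plot and the test have limited power, so they are better read together than in isolation.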

____________________________________________________________________________

**References:**

https://www.statisticssolutions.com/what-is-multiple-linear-regression/

Multiple Linear Regression – Course Notes

https://www.investopedia.com/terms/m/mlr.asp

R. Weintraub (1962). “The Birth Rate and Economic Development: An Empirical Study”, Econometrica, Vol. 30, No. 4, pp. 812-817.