Mathematical relationships describe many aspects of everyday life, and one way to model such relationships is linear regression analysis. One of the most common forms is Multiple Linear Regression, which is used to explain or predict the relationship between one continuous dependent variable and two or more independent variables, where the independent variables can be continuous or categorical.
In this post we show how to apply Multiple Linear Regression in R, using the dataset “Birth Rates and Economic Development”, available for download at: http://users.stat.ufl.edu/~winner/data/birthrate.dat
The data description is shown below:
Description: Birth Rates, per capita income, proportion (ratio?) of
population in farming, and infant mortality during early 1950s for
Birth Rate 22-25 /* 1953-1954 (Units not given) */
Per Capita Income 30-33 /* 1953-1954 in 1948 US $ */
Proportion of population on farms 38-41 /* Circa 1950 */
Infant Mortality Rate 45-49 /* 1953-1954 */
Note that this dataset doesn’t come with column names, so we have to add them and save the file as .csv before loading the data into R. This is easily done in Excel.
Loading data into R:
mortality <- read.csv(file.choose(), header = TRUE)  # pick the saved .csv file in the dialog
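As an alternative to the Excel step, the fixed-width file could probably be read directly in R with read.fwf(). The sketch below is only an illustration: the helper name read_birthrate is made up, the numeric column positions are taken from the data description above, and the position of the Nation field (columns 1-20) is an assumption that should be checked against the actual file.

```r
# Hypothetical helper: read the fixed-width file straight into R.
# Columns 22-25, 30-33, 38-41 and 45-49 come from the data description;
# the Nation field in columns 1-20 is an ASSUMPTION, so verify it on the file.
read_birthrate <- function(path) {
  read.fwf(
    path,
    # positive = width of a field to keep, negative = characters to skip
    widths = c(20, -1, 4, -4, 4, -4, 4, -3, 5),
    col.names = c("Nation", "Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate"),
    strip.white = TRUE,        # trim the padding around the nation names
    stringsAsFactors = FALSE   # keep Nation as plain text
  )
}

# Usage, after downloading the file to the working directory:
# mortality <- read_birthrate("birthrate.dat")
```

The column names match the ones used in the rest of this post, so the later steps should work unchanged on the result.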
Now, let’s have a look at our dataset:
The function head() gives us a taste of what the data looks like, and the function str() gives us information about its structure. As we can see, the dataset has 30 observations and 5 variables: Nation is nominal, while Birth_Rate, PC_Income, Pop_farm and Mort_Rate are numerical.
For this data we are interested in studying the possible factors contributing to Infant Mortality Rate, so we are going to analyse the relationship between Infant Mortality Rate and Birth Rate, Per Capita Income and Proportion of population on farms. In other words, our dependent variable will be Mort_Rate and our independent variables will be Birth_Rate, PC_Income and Pop_farm.
Let’s first visualize our data using the function pairs() to examine the relationships between the variables:
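A minimal sketch of the pairs() call is given here, wrapped in a made-up helper (plot_pairs) so that it can be run on any data frame that uses the column names from this post. The plot title is arbitrary.

```r
# Hypothetical helper; `df` is assumed to hold the four numeric columns
# named as in this post.
plot_pairs <- function(df) {
  num_cols <- c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")
  # Nation is left out because it is a label, not a numeric variable
  pairs(df[num_cols], main = "Birth rates and economic development")
  invisible(num_cols)
}

# Usage with the data frame loaded earlier:
# plot_pairs(mortality)
```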
We should bear in mind that in simple linear regression it is easier to spot a problem with the model by graphing X against Y. With multiple linear regression it is not that simple: there may be interactions between variables that a simple scatter plot won’t show.
So let’s have a look at the correlation matrix and see if we can get more information:
cor(mortality[c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")])
We can see that Mort_Rate and Pop_farm have a fairly strong positive correlation of 0.69. It is also interesting to note that Mort_Rate and PC_Income are negatively correlated (-0.74), as are Pop_farm and PC_Income (-0.77), suggesting that infant mortality is higher where per capita income is lower, and that per capita income is lower where a larger share of the population lives on farms. Note that Birth_Rate does not seem to be strongly correlated with the other variables, apart from Mort_Rate.
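The correlation matrix reports the strength of each relationship but not its statistical significance. One way to check significance, not used in the original analysis, is cor.test() from base R. The helper below (cor_with_pvalues, a made-up name) is a sketch that assumes the column names from this post and Pearson correlation.

```r
# Hypothetical helper: correlation and p-value of several columns
# against one target column, using cor.test() from base R.
cor_with_pvalues <- function(df, target = "Mort_Rate",
                             others = c("Birth_Rate", "PC_Income", "Pop_farm")) {
  rows <- lapply(others, function(v) {
    ct <- cor.test(df[[v]], df[[target]])  # Pearson correlation by default
    data.frame(variable = v,
               r = unname(ct$estimate),
               p_value = ct$p.value)
  })
  do.call(rbind, rows)
}

# Usage:
# cor_with_pvalues(mortality)   # each predictor against Mort_Rate
# cor_with_pvalues(mortality, target = "Birth_Rate",
#                  others = c("PC_Income", "Pop_farm", "Mort_Rate"))
```

With only 30 observations, a moderate correlation may fail to reach significance, so the p-values are best read alongside the coefficients rather than instead of them.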
Let’s go ahead and build our model:
modelmortality <- lm(Mort_Rate ~ . - Nation, data = mortality)
# Note: ~ . regresses the dependent variable Mort_Rate on all other variables in the data frame; - Nation then drops Nation, which is only a label identifying each observation and not a predictor.
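For readers who find the dot notation opaque, the same model can be written with the predictors spelled out. The helper name fit_explicit is made up for this illustration and assumes the column names used in this post.

```r
# Hypothetical helper: the same model with the predictors listed explicitly.
# `df` is assumed to have the column names used in this post.
fit_explicit <- function(df) {
  lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = df)
}

# Usage: both calls should give the same coefficients.
# coef(fit_explicit(mortality))
# coef(lm(Mort_Rate ~ . - Nation, data = mortality))
```

The explicit form is longer, but it keeps working unchanged if extra columns are later added to the data frame.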
When we print the model object modelmortality, we get the coefficients of the regression function:
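Printing the model only shows the coefficient estimates. The standard errors, p-values and R-squared are available through summary(), and confidence intervals through confint(). The sketch below collects them in one list; the helper name model_overview is made up, and it should work on any model fitted with lm().

```r
# Hypothetical helper: pull the usual headline numbers out of a fitted lm.
model_overview <- function(model) {
  s <- summary(model)
  list(
    coefficients  = s$coefficients,   # estimate, std. error, t value, p-value
    conf_int      = confint(model),   # 95% confidence intervals by default
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared
  )
}

# Usage:
# model_overview(modelmortality)
```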
Our first graph, Residuals vs Fitted, shows whether the residuals have non-linear patterns.
The Normal Q-Q plot shows whether the residuals are normally distributed, which appears to be the case here.
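The two diagnostic plots can be drawn with plot() on the fitted model. The sketch below also runs a Shapiro-Wilk test on the residuals as a numeric complement to the Q-Q plot. That test is an addition for illustration and is not part of the original analysis, and the helper name check_residuals is made up.

```r
# Hypothetical helper: draw the first two lm diagnostic plots and run a
# Shapiro-Wilk normality test on the residuals.
check_residuals <- function(model) {
  old_par <- par(mfrow = c(1, 2))   # two plots side by side
  on.exit(par(old_par))             # restore the plotting layout afterwards
  plot(model, which = 1)            # Residuals vs Fitted
  plot(model, which = 2)            # Normal Q-Q
  shapiro.test(residuals(model))
}

# Usage:
# check_residuals(modelmortality)
```

A large p-value from the test is consistent with normally distributed residuals but does not prove it, especially with a sample as small as this one.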
R. Weintraub (1962). “The Birth Rate and Economic Development: An Empirical Study”, Econometrica, Vol. 30, No. 4, pp. 812-817.