CA4 Perform Multiple Linear Regression on a chosen data set

Introduction – CA4 Multiple Linear Regression

Regression analysis is a statistical technique used to investigate and model relationships between variables. It is widely used in industry: health insurance is one of the most common applications, and other fields include engineering, economics, and the life and social sciences.

This blog describes an exercise in Multiple Linear Regression (MLR) analysis. MLR can be described as a generalisation of simple linear regression (SLR). SLR models the relationship between a single explanatory variable and a continuous response variable, whereas MLR regresses a continuous response variable onto multiple explanatory variables.

To illustrate MLR analysis, a data set on immigrant workers to the US has been selected from http://users.stat.ufl.edu/~winner/datasets.html. We will perform MLR analysis on the data set, create a model, perform summary analysis of the model and discuss the results. Finally, we will plot the model with its linear regression line and discuss the plot.

Perform MLR analysis on the data set

In our example data set we have recorded (among other variables) the average weekly wage of immigrants to the US, their literacy levels and their ability to speak English. We suspect that the average weekly wage is affected by the immigrants’ literacy levels and their skill at speaking English. Let’s begin by reading in the data using R:
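As a minimal sketch of this step, the snippet below reads a small comma-separated file with read.csv(). The file contents, the object name immigrants and the column names are all hypothetical stand-ins; the real file from the site may use a different layout (for example fixed-width or whitespace-separated), in which case read.table() or read.fwf() may be the better choice.

```r
# Hypothetical stand-in for the downloaded file: a few made-up rows written
# to a temporary CSV so that the example runs on its own.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Group,AvgWeeklyWage,SpeaksEnglish,Literacy,LivinginUS",
  "A,14.4,96.5,98.0,75.1",
  "B,11.6,54.9,78.2,38.0",
  "C,9.7,43.5,50.3,13.9"
), tmp)

# Replace tmp with the path to the real file.
# stringsAsFactors = FALSE keeps text columns as character vectors.
immigrants <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

# Show the structure (column names and types) of what was read.
str(immigrants)
```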

The head() function displays the first 6 rows of the data, and summary() gives summary statistics for each column. The names() function provides the names of the columns in the data set. A sample of the data is provided below.
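The three functions can be tried on any data frame. The sketch below uses simulated placeholder values (not the data set used in this post) and hypothetical column names, so its output will not match the sample shown here.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(1)
immigrants <- data.frame(
  AvgWeeklyWage = round(runif(10, 7, 15), 2),
  SpeaksEnglish = round(runif(10, 30, 100), 1),
  Literacy      = round(runif(10, 40, 100), 1),
  LivinginUS    = round(runif(10, 10, 80), 1)
)

first_rows <- head(immigrants)   # first 6 rows by default
col_names  <- names(immigrants)  # column names

print(first_rows)
print(summary(immigrants))       # min, quartiles, mean and max per column
print(col_names)
```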

Visualise the relationships in the data:
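One common way to do this in base R is a scatterplot matrix drawn with pairs(). The sketch below assumes that approach; the data are simulated placeholders with hypothetical column names, not the data set used in this post.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(1)
n <- 30
immigrants <- data.frame(
  SpeaksEnglish = runif(n, 30, 100),
  Literacy      = runif(n, 40, 100),
  LivinginUS    = runif(n, 10, 80)
)
immigrants$AvgWeeklyWage <- 5 + 0.05 * immigrants$Literacy +
  0.03 * immigrants$LivinginUS + rnorm(n, sd = 0.5)

# One scatterplot for every pair of columns.
pairs(immigrants, main = "Pairwise relationships (placeholder data)")
```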

Next, we isolate the variables we plan to use in our regression analysis.

We can now populate our variables into a data frame to be used for the regression analysis.
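A sketch of these two steps is given below. The names y, x1, x2 and x3 follow the labels used later in this post; the source data frame immigrants, its column names and the object name mlr_data are assumptions, and the values are simulated placeholders.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(1)
n <- 30
immigrants <- data.frame(
  AvgWeeklyWage = runif(n, 7, 15),
  SpeaksEnglish = runif(n, 30, 100),
  Literacy      = runif(n, 40, 100),
  LivinginUS    = runif(n, 10, 80)
)

# Isolate the response and the candidate explanatory variables.
y  <- immigrants$AvgWeeklyWage   # response
x1 <- immigrants$SpeaksEnglish
x2 <- immigrants$Literacy
x3 <- immigrants$LivinginUS

# Collect them in one data frame for the regression analysis.
mlr_data <- data.frame(y, x1, x2, x3)
str(mlr_data)
```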

Check the correlation between the variables:
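A correlation matrix for all columns of a data frame can be obtained with cor(). The sketch below runs on simulated placeholder data, so the numbers it prints will differ from the results quoted in this post; mlr_data is an assumed object name.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(1)
n  <- 30
x1 <- runif(n, 30, 100)
x2 <- runif(n, 40, 100)
x3 <- runif(n, 10, 80)
y  <- 5 + 0.05 * x2 + 0.03 * x3 + rnorm(n, sd = 0.5)
mlr_data <- data.frame(y, x1, x2, x3)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(mlr_data)
print(round(cor_matrix, 2))
```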

The results are:


The correlation between Speaks English (x1) and Average Weekly Wage (y) is 0.81. This indicates a strong positive linear relationship between these two variables.

We also see that the correlation between Literacy (x2) and Average Weekly Wage (y) is 0.82, indicating another strong positive linear relationship between these two variables.

The correlation between Living in US (x3, the number of years) and Average Weekly Wage (y) is 0.73, a moderate to strong positive linear relationship.

A correlation with an absolute value above 0.75 indicates a strong linear relationship. If two x variables are strongly correlated with each other, it is advisable to include only one of them in the model, not both. If both are included, R may not be able to estimate reliable coefficients for the two variables, because they share their contribution to determining the value of y.

We see above that x1 and x2 have a correlation of 0.73, while x1 and x3 have a correlation of 0.93. The correlation between x2 and x3 is 0.64, the lowest of the three pairs, so we will fit the model with variables x2 and x3.
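The selection can be reproduced from the correlations quoted above. The sketch below rebuilds the correlation matrix by hand from those six figures, flags predictor pairs whose correlation exceeds 0.75 in absolute value, and picks the least correlated pair. The object names are illustrative only.

```r
# Correlations as quoted in the text, entered by hand.
vars <- c("y", "x1", "x2", "x3")
r <- matrix(1, 4, 4, dimnames = list(vars, vars))
r["y",  "x1"] <- r["x1", "y"]  <- 0.81
r["y",  "x2"] <- r["x2", "y"]  <- 0.82
r["y",  "x3"] <- r["x3", "y"]  <- 0.73
r["x1", "x2"] <- r["x2", "x1"] <- 0.73
r["x1", "x3"] <- r["x3", "x1"] <- 0.93
r["x2", "x3"] <- r["x3", "x2"] <- 0.64

# Every pair of explanatory variables, one pair per row.
pair_names <- t(combn(c("x1", "x2", "x3"), 2))
pair_r     <- apply(pair_names, 1, function(p) r[p[1], p[2]])

result <- data.frame(
  var1   = pair_names[, 1],
  var2   = pair_names[, 2],
  r      = pair_r,
  strong = abs(pair_r) > 0.75   # threshold used in the text
)
print(result)

# The pair with the weakest correlation is the safest to use together.
best <- result[which.min(abs(result$r)), ]
print(best)
```

Only the x1 and x3 pair is flagged as strongly correlated, and x2 with x3 is the least correlated pair, which matches the choice made here.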

Next we create the model

We proceed to fit the model with x2 and x3:
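In R, a model of this kind is typically fitted with lm(). The sketch below assumes that approach and uses simulated placeholder data, so its coefficients will not match the ones reported in this post; mlr_data and model are assumed object names.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(42)
n  <- 30
x2 <- runif(n, 40, 100)   # stands in for Literacy
x3 <- runif(n, 10, 80)    # stands in for Living in US
y  <- 5 + 0.05 * x2 + 0.03 * x3 + rnorm(n, sd = 0.5)
mlr_data <- data.frame(y, x2, x3)

# Regress y on x2 and x3.
model <- lm(y ~ x2 + x3, data = mlr_data)

print(coef(model))      # intercept and one slope per x variable
print(summary(model))   # standard errors, t-tests, R-squared
```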

The coefficients:

The coefficient of an x variable in an MLR model is the amount by which y changes if that x variable increases by one and the values of all other variables in the model do not change. We are looking at the marginal contribution of each x variable when the other variables in the model are held constant.

Interpreting the coefficients:

First, we specify the units. Let’s say that Literacy is measured in percentage points and Living in the US is measured in years. Average Weekly Wage is measured in dollars.

The coefficient of x2 (Literacy) equals 0.095. So, y (Average Weekly Wage) is predicted to increase by 0.095 dollars when Literacy improves by one percentage point, assuming the other variables remain constant.

Similarly, the coefficient of x3 (‘LivinginUS’) equals 0.029. So, Average Weekly Wage is predicted to increase by 0.029 dollars when ‘LivinginUS’ increases by one year, again assuming the other variables remain constant.
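The same reading extends to larger changes. As a small worked example, the sketch below uses the two coefficients quoted above to compute the predicted change in Average Weekly Wage for an assumed 10-point rise in Literacy with ‘LivinginUS’ held constant. The intercept is not needed, because it cancels out when two predictions are compared.

```r
# Coefficients as quoted in the text.
b <- c(x2 = 0.095, x3 = 0.029)

# Hypothetical scenario: Literacy up 10 points, years in the US unchanged.
change <- c(x2 = 10, x3 = 0)

# Predicted change in y is the sum of coefficient * change in each x.
delta_y <- sum(b * change)
print(delta_y)   # about 0.95 dollars per week
```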

Summary Analysis of the model – discuss the results:

Let’s check the residuals:

Plotted residuals with abline:
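A residual plot of this kind can be drawn with resid(), plot() and abline(). The sketch below is one plausible version, assuming the residuals are plotted against the fitted values; it runs on simulated placeholder data rather than the data set used in this post.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(42)
n  <- 30
x2 <- runif(n, 40, 100)
x3 <- runif(n, 10, 80)
y  <- 5 + 0.05 * x2 + 0.03 * x3 + rnorm(n, sd = 0.5)
model <- lm(y ~ x2 + x3)

# Residuals: observed y minus fitted y.
res <- resid(model)

plot(fitted(model), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted (placeholder data)")
abline(h = 0, lty = 2)   # horizontal reference line at zero
```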

The residual points are:

Most of the standardised residuals (approx. 95%) fall within two standard deviations of the mean, which in this case is -2 to +2. We should see more residuals hovering around zero, and the concentration of residuals should reduce as we move further from zero.
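The share of standardised residuals inside the -2 to +2 band can be checked directly. The sketch below assumes rstandard() is used to standardise the residuals and runs on simulated placeholder data, so the share it prints is not the one for the model in this post.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(42)
n  <- 30
x2 <- runif(n, 40, 100)
x3 <- runif(n, 10, 80)
y  <- 5 + 0.05 * x2 + 0.03 * x3 + rnorm(n, sd = 0.5)
model <- lm(y ~ x2 + x3)

# Standardised residuals have roughly unit variance.
std_res <- rstandard(model)

# Proportion of standardised residuals between -2 and +2.
share_within_2 <- mean(abs(std_res) <= 2)
print(share_within_2)
```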

Plot the model, the linear regression line and discuss the plots

Let’s plot the model in a normality plot and fit the linear regression line:
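A normal probability plot with a reference line can be produced with qqnorm() and qqline(). The sketch below assumes those two functions and uses simulated placeholder data, so the plot will look different from the one discussed in this post.

```r
# Simulated placeholder data, NOT the data set used in this post.
set.seed(42)
n  <- 30
x2 <- runif(n, 40, 100)
x3 <- runif(n, 10, 80)
y  <- 5 + 0.05 * x2 + 0.03 * x3 + rnorm(n, sd = 0.5)
model <- lm(y ~ x2 + x3)

std_res <- rstandard(model)

# Sample quantiles of the residuals against theoretical normal quantiles.
qqnorm(std_res, main = "Normal Q-Q plot of standardised residuals")
qqline(std_res)   # reference line through the first and third quartiles
```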

If the residuals fall along a straight line, the normality condition is met. In the plot below, the negative residuals do not fall along a straight line; in fact, the first few points show a marked departure from the fitted reference line. The normal probability plot shows a reasonably linear pattern in the centre of the data. However, the tails, particularly the lower tail, show departures from the fitted line. A distribution other than the normal distribution may therefore be a better model for these data.