Multiple Linear Regression – CA4 – Student Test Grades

Student Performance Data Set and Applying Multiple Linear Regression

By James Gallagher

For my first blog post I will be carrying out multiple linear regression in R, the statistical programming language. The data set I have chosen is the Student Performance Data Set, an open data set hosted by the University of California, Irvine (UCI) Machine Learning Repository. The data looks at student achievement in two secondary schools in Portugal and was collected using school reports and questionnaires. I downloaded the CSV file from the following URL:

https://archive.ics.uci.edu/ml/machine-learning-databases/00320/

First I installed and loaded some packages in R that may be required later in the analysis, including ‘MASS’ and ‘car’. I then read the file into a new script in RStudio using the ‘read.csv’ function, setting ‘header’ to TRUE and the separator ‘sep’ to ‘;’. I then used the ‘head()’ function to check that the data had been read in properly and the ‘str()’ function to see the structure of the data.
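A minimal sketch of this reading step is shown here. The file name `student-mat.csv` is an assumption about which file in the download was used, and the handful of rows are made up purely so the snippet runs on its own; they are not taken from the real data.

```r
# Made-up rows in the same semicolon-separated layout as the download.
# With the real file you would pass its path instead, for example
# read.csv("student-mat.csv", header = TRUE, sep = ";")  (file name assumed).
sample_text <- "studytime;absences;freetime;goout;G3
2;6;3;4;6
2;4;3;3;6
3;10;3;2;10
1;2;2;2;15"

students <- read.csv(text = sample_text, header = TRUE, sep = ";")

head(students)  # first rows, to confirm the columns were split correctly
str(students)   # column types and dimensions
```

If the separator were left at its default of a comma, every row would be read as a single column, which is why checking with ‘head()’ and ‘str()’ straight away is worthwhile.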

The data structure showed a data frame of 395 observations across 33 variables. After looking at the names of the attributes and reading the supplementary breakdown of the variables in the zipped download file, I knew the response variable I needed was ‘G3’, which is the final grade of the students. I want to see which other variables in the data affect the students' grades the most, if any. To get an insight into which variables may be best for the multiple linear regression, I produced a ‘pairs.panels’ plot and a scatter plot matrix on a subset of the variables I found of interest. See code as follows:
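A rough sketch of this kind of exploratory step follows. It uses base R's `cor()` and `pairs()` as stand-ins for ‘pairs.panels’ (from the ‘psych’ package) and ‘scatterplotMatrix’ (from ‘car’), and it runs on simulated values rather than the real data, so the numbers it prints say nothing about the actual students.

```r
set.seed(1)  # simulated values for illustration only, not the real data
n <- 50
students <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  absences  = rpois(n, 5),
  freetime  = sample(1:5, n, replace = TRUE),
  goout     = sample(1:5, n, replace = TRUE),
  failures  = sample(0:3, n, replace = TRUE)
)
# Simulated grade with arbitrary effects, again for illustration only
students$G3 <- round(10 + students$studytime - 0.5 * students$goout +
                       rnorm(n, sd = 3))

# Correlation of each candidate predictor with the final grade
grade_cor <- cor(students)[, "G3"]
print(round(grade_cor, 2))

# Base-R scatter plot matrix of the same subset of columns
pairs(students)
```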

Also see the output of both ‘pairs.panels’ and ‘scatterplotMatrix’:

Plot 1: pairs

Plot 2: scatterplotMatrix

After analysing the plots I decided that the variables of most interest as predictors of grades were ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’. ‘failures’ was also highly correlated with grades, but this seemed a very obvious and uninteresting predictor. Now that I have my predictor variables, it is time to build the model in R.

R has a very useful ‘lm()’ function, short for ‘Linear Model’. We use the same function for a simple linear model, which has one predictor variable, as we do for a multiple linear model with several. This is one of the many great capabilities of R. To understand the syntax of ‘lm()’ you can either type a question mark in front of the function name or call the ‘help()’ function, like so:

?lm

help(lm)

This will open the documentation page for the function to help you plug in your variables. The ‘lm()’ function has an ‘na.action’ argument (for example ‘na.omit’) if you need to handle any ‘NA’ values in your data; however, the variables I chose were clean and did not have any.
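To make the formula syntax and the missing-value handling concrete, here is a small sketch on simulated values (not the real data), with one `NA` inserted deliberately. The data frame `d` and the effect sizes used to simulate it are arbitrary choices for illustration.

```r
set.seed(42)  # simulated values for illustration only, not the real data
n <- 40
d <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  goout     = sample(1:5, n, replace = TRUE)
)
d$G3 <- 10 + 0.5 * d$studytime - 0.6 * d$goout + rnorm(n, sd = 2)
d$goout[3] <- NA  # deliberately introduce one missing value

# Count missing values per column before modelling
print(colSums(is.na(d)))

# Same function for one predictor or several; only the formula changes
simple_fit   <- lm(G3 ~ studytime, data = d)
multiple_fit <- lm(G3 ~ studytime + goout, data = d, na.action = na.omit)

# Rows with a missing value in a model variable are dropped from the fit
print(nobs(simple_fit))    # all 40 rows: no NA in studytime or G3
print(nobs(multiple_fit))  # 39 rows: the row with the missing goout is dropped
```

Running a check like `colSums(is.na(...))` on the chosen columns is a quick way to confirm whether any missing-value handling is needed at all.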

In the model formula, the dependent variable ‘G3’ comes first, followed by my chosen predictors ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’. I stored this linear model in a variable called ‘model1’ so that I could call it for further analysis with different functions. I then called the ‘summary()’ function on ‘model1’ to see the performance of the model, as follows:

Call:
lm(formula = G3 ~ studytime + absences + freetime + goout)

Residuals:
    Min      1Q  Median      3Q     Max
-12.880  -1.861   0.319   2.997   8.792

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.96372    1.14299   8.717  < 2e-16 ***
studytime    0.55537    0.27553   2.016  0.04452 *
absences     0.02940    0.02869   1.025  0.30612
freetime     0.32700    0.24125   1.355  0.17605
goout       -0.61271    0.21434  -2.859  0.00448 **

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.53 on 390 degrees of freedom
Multiple R-squared: 0.03229, Adjusted R-squared: 0.02236
F-statistic: 3.253 on 4 and 390 DF, p-value: 0.01212

The first number we are interested in is the Multiple R-squared of 0.03229, which shows that approximately 3.2% of the variance in grades is explained by our model. The F-statistic and its p-value test the null hypothesis that all the model coefficients, apart from the intercept, are 0; with a p-value of 0.01212 we can reject that hypothesis at the 5% level, even though the model explains very little of the variance. The residual standard error of 4.53 gives us an idea of how far observed grades typically are from the predicted, or fitted, grades. The intercept estimate is the predicted grade for a student whose ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’ are all 0. We can also see the significance of each predictor: ‘goout’ and ‘studytime’ are the only two of the four that are significant at the 5% level, while ‘absences’ and ‘freetime’ are not.
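The figures in the summary are tied together by standard formulas, which can be checked with a little arithmetic. The sketch below uses only numbers printed in the output above (395 observations, 4 predictors) to recompute the adjusted R-squared, the F-statistic with its p-value, and one t value; any small differences from the printed values come from rounding.

```r
n  <- 395       # observations
p  <- 4         # predictors: studytime, absences, freetime, goout
r2 <- 0.03229   # Multiple R-squared from the summary

df_resid <- n - p - 1   # residual degrees of freedom (390)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic and its p-value
f_stat <- (r2 / p) / ((1 - r2) / df_resid)
p_val  <- pf(f_stat, p, df_resid, lower.tail = FALSE)

# A t value is the estimate divided by its standard error (goout row)
t_goout <- -0.61271 / 0.21434

print(round(c(adj_r2 = adj_r2, f_stat = f_stat,
              p_val = p_val, t_goout = t_goout), 4))
```

The gap between the multiple and adjusted R-squared is small here because 4 predictors is tiny next to 395 observations; it would widen if many more weak predictors were added.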

The model didn’t perform as well as I expected from the high correlation seen in the graphs, but with so many variables to consider, a more comprehensive feature analysis needs to be conducted. The fact that study time positively affected a student's grades and that going out negatively affected them is completely intuitive. I think an interesting direction for further analysis of this data set would be a breakdown by sex and parents' occupation, to see what effect these might have on a student's grade.
