CA5 Use a GLM to model OT as a function of TeamPts and OppPts
Using question 4 from the Oct 2017 sample paper, fit a GLM to ndbaodds201415.csv.
Part a) Train the model using 80% of this data set and suggest an appropriate GLM to model OT as a function of the TeamPts and OppPts variables.
Read in the data from www.stat.ufl.edu.
Read the file into a data set, ensuring that you include “header=TRUE”.
Be sure to look at the data. Note that only 76 games go to overtime out of the 1230 observations in the file, i.e. approximately 1 in 16. Our task is to create a model to predict the games that will go into overtime. Part a) asks us to use a GLM.
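As a sketch, assuming the file has already been downloaded locally as ndbaodds201415.csv (the exact path on www.stat.ufl.edu is not given here), reading and inspecting the data might look like:

```r
# Read the CSV, keeping the column names from the first row
nba <- read.csv("ndbaodds201415.csv", header = TRUE)

# Inspect the data: structure and how many games went to overtime
str(nba)
table(nba$OT)  # expect 76 ones (overtime) out of 1230 rows
```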
In this example, we use the caTools library, as it has a function called sample.split which allows us to split the observations into a training set and a test set for our model.
The subset function separates the data into the training set and the test set below. You may notice that the training set and test set are not split exactly 80/20; this behaviour was observed only when a seed value was set in the code.
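The split described above might look like the following sketch using caTools (the seed value here is an arbitrary choice for reproducibility):

```r
library(caTools)

set.seed(123)  # arbitrary seed so the split is reproducible

# sample.split preserves the ratio of OT outcomes across both sets,
# so the overall split may not be exactly 80/20
split <- sample.split(nba$OT, SplitRatio = 0.8)

train <- subset(nba, split == TRUE)
test  <- subset(nba, split == FALSE)
```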
We notice that OT is a binary variable – 0 or 1. Therefore, we are required to use the binomial family in our GLM function, and we pass in only the training set. Now, we build our generalised linear model. (Note that if this were a count variable, we would have used the Poisson family.)
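A minimal sketch of the model fit, assuming the variable names used in the text and a training set called train:

```r
# Logistic regression: OT is binary, so we use the binomial family
model <- glm(OT ~ TeamPts + OppPts, family = binomial, data = train)

summary(model)  # coefficient estimates, z values and p-values
```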
And the output of the summary of the model:
We can conclude that both independent variables, TeamPts and OppPts, are significant.
Part b) requires that we specify the significant variables on OT at the level of alpha=0.05, and estimate the parameters of the model.
We see that our intercept is -14.46 with a z value of -9.069; looking this up in z tables, it lies well beyond ±3, which gives a p-value far below 0.05 (our alpha). TeamPts has a z value just under 3 (2.977) and also has a p-value below 0.05 (0.00291). OppPts has a z value beyond 3 again (5.55). So all our parameters are statistically significant.
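The p-values that summary() reports can be reproduced from the z values using the standard normal distribution; for example, for TeamPts (z = 2.977):

```r
z <- 2.977
2 * pnorm(-abs(z))  # two-sided p-value, approximately 0.0029
```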
Part c) Predict the test data set using the trained model
We pass the model generated from the training set into the predict function, along with the test set, and specify the type as response. We get 246 predictions.
The model returns our values in ‘res’; a sample is included below:
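The prediction step could be sketched as follows, assuming the model and test objects from the earlier steps:

```r
# type = "response" returns fitted probabilities rather than log-odds
res <- predict(model, newdata = test, type = "response")

length(res)  # one prediction per test observation (246 here)
head(res)    # sample of the predicted probabilities
```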
We create an object called predictedvalues, the same length as the response vector and initialised to 0, and then set the value to 1 wherever the predicted probability is greater than 0.5.
And our predicted data is displayed below. We predicted overtime only twice, when we probably should have predicted it approximately 13-15 times.
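The thresholding step described above could look like this sketch, assuming the res vector from the prediction step:

```r
# Start with a vector of zeros the same length as the predictions
predictedvalues <- rep(0, length(res))

# Classify as overtime when the predicted probability exceeds 0.5
predictedvalues[res > 0.5] <- 1

table(predictedvalues)  # how many games the model flags as overtime
```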
Part d) Produce the confusion matrix and obtain the proportion of correct predictions.
In R we compute the mean of the correct predictions, i.e. the accuracy. Our model predicted overtime correctly in only 2 of the 15 games that went into overtime.
The mean of 0.94 makes our prediction look good, but that is because the classes are imbalanced: most games do not go to overtime, so predicting 0 is almost always right. Having correctly predicted only 2 of 15 overtime games is poor.
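The confusion matrix and accuracy could be computed as follows (a sketch, assuming the test set and predictedvalues from the previous steps):

```r
# Rows: actual OT outcome; columns: predicted outcome
conf <- table(actual = test$OT, predicted = predictedvalues)
conf

# Proportion of correct predictions (accuracy)
mean(test$OT == predictedvalues)
```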