The post CA5 – A brief look at C++ in RStudio appeared first on DBS Student's Data Project Blog.

__CA5 Understanding the basics of C++ with R__

Sometimes R code is not fast enough to process a large data set and extra speed is required. Rcpp allows R programmers to seamlessly integrate C++ code into their R workflow. This post briefly discusses getting the two languages working together, writing some C++ code and integrating it into R packages.

To use C++ in R packages you need Rcpp, and Rcpp requires a C++ compiler on your system or local environment. For Windows, CRAN provides Rtools, a suite of build tools that can be downloaded and installed to supply one.

RStudio can find the Rcpp files without much help from the user. Just go ahead and type require(Rcpp).

From the New File drop-down, RStudio offers the option to create a C++ file containing a basic function.

The generated file shows that you need to include the Rcpp header.

Rcpp has introduced types and objects that are similar to R and ensure that both work seamlessly together. C++ code can then be integrated into a package and this has been simplified by RStudio and development tools.
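As a rough illustration of how the two languages fit together, the sketch below compiles a tiny C++ function from inside R with Rcpp's cppFunction(). It assumes the Rcpp package and a working compiler are installed; the function name sum_squares is made up for this example and is not code from the original post.

```r
# Assumes the Rcpp package and a working C++ compiler are installed.
library(Rcpp)

# Compile a small C++ function and expose it to R.
# sum_squares is a hypothetical example name.
cppFunction('
double sum_squares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];   // accumulate the square of each element
  }
  return total;
}
')

# Call the compiled function like any other R function
sum_squares(c(1, 2, 3))
```

NumericVector is one of the Rcpp types that mirrors an R type, which is why an ordinary R numeric vector can be passed straight in.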

The post CA5 Multi-agent Systems appeared first on DBS Student's Data Project Blog.

__CA5 Multi-agent Systems – Question 5 from the 2017 sample paper__

Using question 5 from the sample paper Oct 2017 use the simulated data set from Question 1 on the same paper to:

__Part a) Adopt a centralised scheme to all agents and sketch the graphical scheme.__

Question 1 described a financial system, with four local agents analysing their own data sets. The graph of the centralised scheme is as follows:

To resolve multi-agent system problems, think of the agents as different samples, each performing an individual t-test. Then, for the multi-agent system, consider how you put those samples together to calculate the averages. In question 1 we have four samples.

__Part b) Compute the normalised weights and find the global arithmetic mean. Please compute the global solution using R.__

We calculate the mean of each of the 4 ‘samples’ or local agents. The normal distribution was specified as N(j,16), so the samples were generated using the rnorm() function with a standard deviation of 4 (the square root of the variance, 16).

We now take the samples and calculate the mean for each data set. We create a new vector, which we called xvect below. It holds the means of the 4 samples.

We calculate our weights for each agent:

Below you can see how we can calculate the global arithmetic mean.

We normalise the weights and then use these values to calculate the arithmetic mean. The final calculation for the arithmetic mean is 2.312.
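The steps above can be sketched roughly as follows. The sample sizes, the seed and the choice to weight each agent by its sample size are assumptions made for illustration, so this will not reproduce the 2.312 figure from the exam answer.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Four local agents; agent j draws from N(j, 16), i.e. sd = 4.
# The sample sizes are assumed for illustration only.
n <- c(50, 60, 70, 80)
samples <- lapply(1:4, function(j) rnorm(n[j], mean = j, sd = 4))

# Local means, one per agent (the post calls this vector xvect)
xvect <- sapply(samples, mean)

# Assumed weighting: each agent is weighted by its sample size
w <- n
w_norm <- w / sum(w)   # normalised weights sum to 1

# Global arithmetic mean as the weighted sum of the local means
global_mean <- sum(w_norm * xvect)
global_mean
```

With sample-size weights the global mean equals the mean of all observations pooled together, which is a handy check on the calculation.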

The post CA5 – GLM to model OT to TeamsPts and OppPts appeared first on DBS Student's Data Project Blog.

__CA5 Use GLM to model OT to TeamsPts and OppPts__

Using question 4 from the sample paper Oct 2017 perform a GLM on ndbaodds201415.csv.

__Part a) Train the model using 80% of this data set and suggest an appropriate GLM to model OT to TeamsPts and OppPts variables__

Read in the data from www.stat.ufl.edu.

Read the file into a data set and ensure that you include “header=TRUE”.

Be sure to look at the data. Note that only 76 games go to overtime from the 1230 observations in the file, approx. 1 in 16. Our task is to create a model to predict the games that will go into overtime. Part a) asks us to use a GLM.

In this example, we use the caTools library as it has a function called sample.split which allows us to split the observations into a training set and a test set for our model.

The subset() function separates the data set into the training set and the test set below. You may notice that the training set and test set are not split exactly 80/20; this behaviour was observed only when I set a seed value in the code.

We notice that OT is a binary variable – 0 or 1. Therefore, we are required to use the binomial family in our GLM function. We pass in only the training set. Now, we build our generalised linear model. (Note that if this were a count variable, we would have used Poisson.)
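A minimal sketch of the split-and-fit step is shown below. It uses simulated data with the same column names rather than the real file, and base R sampling in place of caTools, so the numbers it prints are illustrative only.

```r
set.seed(42)  # arbitrary seed for a repeatable sketch

# Simulated stand-in for the NBA data; column names follow the post,
# the values are made up for illustration.
n <- 1230
games <- data.frame(
  TeamPts = round(rnorm(n, mean = 100, sd = 12)),
  OppPts  = round(rnorm(n, mean = 100, sd = 12))
)
# Overtime made more likely in high-scoring games (an assumption)
p_ot <- plogis(-14 + 0.05 * games$TeamPts + 0.06 * games$OppPts)
games$OT <- rbinom(n, size = 1, prob = p_ot)

# 80/20 split using base R sampling (the post uses caTools instead)
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- games[train_idx, ]
test  <- games[-train_idx, ]

# Binomial family because OT is a 0/1 outcome
model <- glm(OT ~ TeamPts + OppPts, family = binomial, data = train)
summary(model)$coefficients
```

The coefficient table has one row per parameter with its estimate, standard error, z value and p-value, which is what Part b) reads from.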

And the output of the summary of the model:

We can conclude that both independent variables are significant; TeamPts and OppPts.

__Part b) requires that we specify the significant variables on OT at the level of alpha=0.05, and estimate the parameters of the model.__

We see that our Intercept is -14.46 with a z value of -9.069. If we looked up z tables, we would see that this is well outside -3, which gives a p-value of less than 0.05 (the alpha). We can see that TeamPts has a z value just inside 3 (2.977) and also has a p-value of less than 0.05 (0.00291). We see that OppPts has a z value outside 3 again (5.55). Any z value beyond about 1.96 in absolute size is significant at this level, so all our parameters are statistically significant.
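Instead of looking up z tables, the p-values can be recomputed from the z values quoted above. The helper name p_from_z is made up for this sketch.

```r
# Two-sided p-value for a z statistic (hypothetical helper name)
p_from_z <- function(z) 2 * pnorm(-abs(z))

# z values as reported in the model summary
z_values <- c(Intercept = -9.069, TeamPts = 2.977, OppPts = 5.55)
p_values <- p_from_z(z_values)
p_values

# Which parameters are significant at alpha = 0.05?
p_values < 0.05
```

The TeamPts value comes out at roughly 0.0029, in line with the p-value shown in the summary output.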

__Part c) Predict the test data set using the trained model__

We pass the model generated from the training set to the predict() function, along with the test set, and we specify the type as response. We get 246 predictions.

The model returns our values in ‘res’, sample included below:

We create an object called predictedvalues, equal in size to the responses, and then we set its value to 1 where the predicted value is greater than 0.5.

And our predicted data is displayed below. We predicted 2 overtime games, when we probably should have predicted approx. 13-15.

__Part d) predict the confusion matrix and obtain the probability of correctness of predictions.__

In R we measure the probability of correct predictions as the mean of the comparisons between predicted and actual values. In our model we have only predicted 2 out of 15 times when the game is going into overtime.

The mean of 0.94 makes our prediction look good but we have only predicted 2 out of 15 which is poor.
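The gap between a high accuracy and a poor hit rate on overtime games can be seen in a small sketch. The outcomes and predicted probabilities below are made up to stand in for the real predict() output, so only the pattern, not the numbers, carries over.

```r
set.seed(7)  # arbitrary seed for a repeatable sketch

# Made-up test-set outcomes and predicted probabilities, standing in
# for the output of predict(model, newdata = test, type = "response")
actual <- rbinom(246, size = 1, prob = 0.06)
res    <- runif(246, min = 0, max = 0.3) + 0.4 * actual * runif(246)

# Classify as overtime (1) when the predicted probability exceeds 0.5
predictedvalues <- rep(0, length(res))
predictedvalues[res > 0.5] <- 1

# Confusion matrix: rows are actual classes, columns are predictions
conf_matrix <- table(actual    = factor(actual, levels = c(0, 1)),
                     predicted = factor(predictedvalues, levels = c(0, 1)))
conf_matrix

# Overall accuracy, and the share of real overtime games that were caught
accuracy <- mean(predictedvalues == actual)
recall   <- conf_matrix["1", "1"] / sum(conf_matrix["1", ])
c(accuracy = accuracy, recall = recall)
```

Because overtime is rare, predicting "no overtime" almost everywhere already gives a high accuracy, so the share of overtime games caught is the more telling figure.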

The post CA5 – Factor Analysis appeared first on DBS Student's Data Project Blog.

__Introduction – CA5 Factor Analysis__

The link between factor analysis and regression.

When you are given a large data set, it is often beneficial to use factor analysis, with PCA as the technique for applying it. Both regression and factor analysis allow you to connect the dots between the variables in a large data set. Regression allows you to understand the relationships between the different variables and quantify those relationships. Factor analysis allows you to understand the underlying drivers that influence those relationships. It may be necessary to apply some method of transformation, as these drivers are often not immediately observable.

While regression looks at the cause (independent/explanatory variable) of the effect (dependent variable), factor analysis seeks to identify the underlying causes that impact the observable behaviour. Explaining the underlying causes can improve your predictions.

Why use factor analysis?

Use factor analysis when you have n-dimensional data and there are multiple causes for a single effect. In multiple linear regression, you look for a regression plane to find the cause of the effect. You can have many variables in multiple regression and it is possible to use all the variables to explain the effect. However, if you use all variables then you have a problem called multicollinearity.

Multicollinearity

Some of these variables are correlated with each other. They can contain the same information and not independently provide you with new information. You need to identify the underlying causes which are uncorrelated but still affect the observed behaviour and enable you to build a better model. Complexity and computational effort would also be reduced.

This exercise of taking a large number of variables, extracting the underlying causes from those variables and using them to explain an effect is called factor analysis. Regression models where you have such highly correlated variables are weak and not very stable.

__Factor Analysis and PCA__

When you are trying to fit a curve through a set of data points, regression is an appropriate technique to use, but if you first wish to extract the factors that explain the data, then principal component analysis, or PCA for short, is the recommended technique.

Factor Extraction

It’s not unusual to use a rule-based approach where human experts identify the relevant factors, but the alternative is a machine learning approach. PCA extracts the factors using an algorithm, so expert analysis and intuition are not required. PCA can identify latent factors and perform dimensionality reduction.

In PCA you are looking for the ‘best’ direction through the data. When you have 2-dimensional data you may also need the ‘next best’ direction. These directions are the principal components, and they are orthogonal to each other so as to carry the maximum information with the least number of dimensions.

PCA can be used to convert highly correlated variables into a new set of variables. Each of these new variables is orthogonal to, and uncorrelated with, each of the others. These new variables are ordered by variance, highest first.

PCA uses eigendecomposition to find the principal components. Each component has a corresponding eigenvector, which gives its direction, and an eigenvalue, which gives the variance it captures.
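The link between PCA and eigendecomposition can be checked on a small made-up data set. The sketch below uses base R's prcomp() and eigen(); the three variables are simulated purely for illustration.

```r
set.seed(123)  # arbitrary seed for a repeatable sketch

# Three made-up variables; x2 is built to be highly correlated with x1
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)
x3 <- rnorm(200)
dat <- data.frame(x1, x2, x3)

# PCA on standardised variables
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# The component scores are uncorrelated with each other
round(cor(pca$x), 3)

# Eigendecomposition of the correlation matrix gives the same variances,
# already ordered from highest to lowest
eig <- eigen(cor(dat))
rbind(prcomp_variance = pca$sdev^2, eigenvalues = eig$values)
```

The first component soaks up most of the shared variation in x1 and x2, which is the dimensionality reduction described above.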

Information sourced @ Connect the Dots: Factor Analysis

Publisher: Packt Publishing

Release Date: December 2017

ISBN: 9781788997522

The post CA4 Perform Multiple Linear Regression on a chosen data set appeared first on DBS Student's Data Project Blog.

__Introduction – CA4 Multiple Linear Regression__

Regression analysis is a statistical technique used to investigate and model relationships between variables. There are numerous examples of regression analysis in industry, health insurance being one of the most popular but other fields include engineering, economics, life and social science.

This blog describes an exercise in Multiple Linear Regression (MLR) analysis. MLR can be described as a generalisation of simple linear regression (SLR). SLR models the relationship between a single explanatory variable and a continuous response variable. MLR regresses a continuous response variable onto multiple features.

To model an MLR analysis, a data set on immigrant workers to the US has been selected from http://users.stat.ufl.edu/~winner/datasets.html. We will perform MLR analysis on the data set, create a model, perform summary analysis of the model and discuss the results. Finally, we will plot the model and the linear regression line and discuss the plot.

__Perform MLR analysis on the data set__

In our example data set we have recorded (among other variables) the Average weekly wage of immigrants to the US, their Literacy levels and ability to speak English. We suspect that the average weekly wage is impacted by the immigrants’ literacy levels and their skills at speaking English. Let’s begin by reading in the data using R:

The first 6 lines of data are displayed by the head() function, and the summary() function gives summary statistics for each column. The names() function provides the names of the columns in the data set. A sample of the data is provided below.

Visualise the relationships in the data:

Next, we isolate the variables we plan to use in our regression analysis.

We can now populate our variables into a data frame to be used for the regression analysis.

Check the correlation between the variables:

The results are:

The correlation between Speaks English (x1) and Average Weekly Wage (y) is 0.81. This indicates a strong positive linear relationship between these two variables.

We also see that the correlation between Literacy (x2) and Average Weekly Wage (y) is 0.82, indicating another strong positive linear relationship between these two variables.

The correlation between (number of years) Living in US (x3) and Average Weekly Wage (y) is 0.73, also a moderate to strong positive linear relationship.

A correlation above 0.75 in absolute value indicates a strong linear correlation. If two x variables are significantly correlated, it is advised to include only one in the model, not both. If both are included, R may not know which numbers to give as coefficients for each of the two variables, because they share their contribution to determining the value of y.

We see above that x1 and x2 have a correlation of 0.73, while x1 and x3 have a correlation of 0.93. The correlation between x2 and x3 is 0.64, the lowest of the three, therefore we will fit the model with variables x2 and x3.
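The correlation check can be sketched as below. The data are simulated stand-ins with the same variable names (y, x1, x2, x3), not the immigrant wage data, so the printed correlations will differ from those quoted above.

```r
set.seed(10)  # arbitrary seed for a repeatable sketch

# Made-up stand-ins for the post's variables (not the real data)
x3 <- rnorm(35, mean = 50, sd = 15)             # living in the US
x1 <- 0.9 * x3 + rnorm(35, sd = 5)              # speaks English, tied to x3
x2 <- 0.4 * x3 + rnorm(35, mean = 60, sd = 10)  # literacy
y  <- 5 + 0.1 * x2 + 0.03 * x3 + rnorm(35, sd = 1)
wages <- data.frame(y, x1, x2, x3)

# Pairwise correlations between the response and the predictors
cor_mat <- cor(wages)
round(cor_mat, 2)

# Flag predictor pairs whose absolute correlation exceeds 0.75
pred_cor <- cor_mat[c("x1", "x2", "x3"), c("x1", "x2", "x3")]
which(abs(pred_cor) > 0.75 & upper.tri(pred_cor), arr.ind = TRUE)
```

Any pair flagged in the last step is a candidate for keeping only one of the two variables in the model.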

__Next we create the model__

We proceed to fit the model with x2 and x3:

The coefficients:

The coefficient of the x variable in an MLR model is the amount by which y changes if that x variable increases by one and the values of all other variables in the model do not change. We are looking at the marginal contribution of each x variable when you hold the other variables in the model constant.

Interpreting the coefficients:

First, we specify the units. Let’s say that Literacy is measured in percentage points and Living in the US is measured in years. Average Weekly Wage is measured in dollars.

The coefficient of x2 (Literacy) equals 0.095. So, y (Average Weekly Wage) increases by 0.095 dollars when Literacy improves by one percentage point, assuming other variables remain constant.

Similarly, the coefficient of x3 (‘LivinginUS’) equals 0.029. So, Average Weekly Wage increases by 0.029 dollars when ‘LivinginUS’ increases by one year, again assuming other variables remain constant.
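The "hold everything else constant" reading of a coefficient can be verified directly. The sketch below fits a model to made-up data (the true slopes are set to 0.095 and 0.029 only to echo the post) and compares two predictions that differ by one unit of x2.

```r
set.seed(11)  # arbitrary seed for a repeatable sketch

# Made-up data; not the immigrant wage data set
x2 <- rnorm(35, mean = 70, sd = 15)   # literacy
x3 <- rnorm(35, mean = 50, sd = 15)   # living in the US
y  <- 2 + 0.095 * x2 + 0.029 * x3 + rnorm(35, sd = 0.5)
wages <- data.frame(y, x2, x3)

fit <- lm(y ~ x2 + x3, data = wages)
coef(fit)

# Marginal effect: raise x2 by one unit while holding x3 fixed
base   <- data.frame(x2 = 70, x3 = 50)
bumped <- data.frame(x2 = 71, x3 = 50)
effect <- predict(fit, bumped) - predict(fit, base)
effect   # matches the fitted coefficient of x2
```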

__Summary Analysis of the model – discuss the results:__

Let’s check the residuals:

Plotted residuals with abline:

The residual points are:

Most of the standardised residuals (approx. 95%) fall within two standard deviations of the mean, which in this case is -2 to +2. We should see more residuals hovering around zero. Also, the concentration of residuals should reduce as we move further from zero.

__Plot the model, the linear regression line and discuss the plots__

Let’s plot the model in a normality plot and fit the linear regression line:

If the residuals fall in a straight line, that means the normality condition is met. In the plot below, the negative residuals do not fall in a straight line; in fact, the first few points show a marked departure from the fitted reference line. The normal probability plot shows a reasonably linear pattern in the centre of the data. However, the tails, particularly the lower tail, show departures from the fitted line. A distribution other than the normal distribution may be a better model for these data.
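The residual checks described in this post can be sketched as follows, again on simulated data rather than the wage data. The variable names and the seed are assumptions for illustration.

```r
set.seed(12)  # arbitrary seed for a repeatable sketch

# Simulated data with normal errors (illustrative only)
x2 <- rnorm(100)
x3 <- rnorm(100)
y  <- 1 + 0.5 * x2 + 0.3 * x3 + rnorm(100)
fit <- lm(y ~ x2 + x3)

# Standardised residuals and the share that falls inside -2 to +2
std_res <- rstandard(fit)
within_two <- mean(abs(std_res) <= 2)
within_two

# Residual plot with a reference line at zero, then the normality plot
plot(std_res)
abline(h = 0)
qqnorm(std_res)
qqline(std_res)
```

For well-behaved residuals the share inside -2 to +2 should be close to 95%, and the points in the normality plot should hug the reference line.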

The post CA04_Multiple Linear Regression appeared first on DBS Student's Data Project Blog.

This project will analyze, with a multiple linear regression, the asphalt surface free energy in mJ/m^2 (srf.fr.eng) against different variables.

The data has been taken from http://users.stat.ufl.edu/~winner/data/asphalt_binder.csv and the variables are listed below:

% Saturates (saturates)

% Aromatics (aromatics)

% Resins (resins)

% Asphaltenes (asptenes)

% Wax (wax)

% Carbon (carbon)

% Hydrogen (hydrogen)

% Oxygen (oxygen)

% Nitrogen (nitrogen)

% Sulfur (sulfur)

Nickel in ppm (nickel)

Vanadium in ppm (vanadium)

For the sake of the model I have removed the asp code.

Starting with reading the data and doing a preliminary analysis:

dataset4 = read.csv('http://users.stat.ufl.edu/~winner/data/asphalt_binder.csv', header = TRUE)

head(dataset4)

View(dataset4)

summary(dataset4)

str(dataset4)

As the asp code is to be removed, below is the code for the data that will be used for the analysis:

mydata <- dataset4[2:14]

head(mydata)

View(mydata)

summary(mydata)

str(mydata)

Next, the visualization of the correlation matrix and correlation plot:

cor(mydata)

pairs(mydata)

Below are the results for the correlation

and also the correlation plot

The correlation matrix and the plot show positive and negative linear relationships between different variables, therefore I am going to eliminate some of the variables with the biggest correlations in order to remove multicollinearity.

*“Multicollinearity is a term you use if two x variables are highly correlated.*

*Not only is it redundant to include both related variables in the multiple regression model, but it’s also problematic.*

*Basically: If two x variables are significantly correlated, only include one of them in the regression model, not both.*

*If you include both, the computer won’t know what numbers to give as coefficients for each of the two variables, because they share their contribution to determining the value of y.*

*Multicollinearity can really mess up the model-fitting process and give answers that are inconsistent and oftentimes not repeatable in subsequent studies”*

The variables removed are: resins, asptenes and sulfur.

Now I am going to apply the model:

mydataMLR <- lm(srf.fr.eng ~ saturates+aromatics+wax+carbon+hydrogen+oxygen+nitrogen+nickel+vanadium, data=mydata)

mydataMLR

summary(mydataMLR)

From the results, 4 variables are not significant, therefore I am running a new model without the variables aromatics, oxygen, nitrogen and nickel:

mydataMLR2 <- lm(srf.fr.eng ~ saturates+wax+carbon+hydrogen+vanadium, data=mydata)

summary(mydataMLR2)

We are looking at the marginal contribution of each x variable when all the other variables in the model are held constant.

The Residuals section provides summary statistics for the errors in our predictions. Since a residual is equal to the true value minus the predicted value, the maximum error of 2.6178 suggests that the model under-predicted surface free energy by 2.62 mJ/m^2 for at least one observation.

On the other hand, 50 percent of errors fall within the 1Q and 3Q values (the first and third quartile), so half of the predictions were between 0.6711 over the true value and 0.6587 under the true value.

The model has several significant variables, and they seem to be related to the outcome given the significance of Pr(>|t|).

The Multiple R-squared value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable. It is similar to the correlation coefficient in that the closer the value is to 1.0, the better the model explains the data.

Since the R-squared value is 0.8864, we know that 88.64% of the variation in the dependent variable is explained by our model.
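To make the "share of variation explained" reading concrete, R-squared can be rebuilt by hand from the residuals. The sketch below uses simulated data with made-up variable names, not the asphalt data.

```r
set.seed(21)  # arbitrary seed for a repeatable sketch

# Simulated data (illustrative only)
x <- rnorm(40)
z <- rnorm(40)
y <- 3 + 2 * x - z + rnorm(40, sd = 0.8)
fit <- lm(y ~ x + z)

# R-squared = 1 - (unexplained variation / total variation)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

c(manual = r2_manual, summary = summary(fit)$r.squared)
```

The two numbers agree, which shows that the Multiple R-squared reported by summary() is exactly this ratio.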

Then checking the residual and normality plots:

#check the residual

par(mfrow = c(1,2))

mydataMLR2.stdRes = rstandard(mydataMLR2)

plot(mydataMLR2.stdRes, col="red")

abline(0,0)

#normality plot

qqnorm(mydataMLR2.stdRes,ylab="Standardized Residuals",xlab="Normal Scores", main="Normality Plot", col="red")

qqline(mydataMLR2.stdRes)

Most (95 percent) of the standardized residuals fall within two standard deviations of the mean, which in this case is –2 to +2 (via the 68-95-99.7 Rule / Empirical rule).

We should see more residuals hovering around zero and we should have fewer and fewer of the residuals as they go away from zero.

If the residuals fall in a straight line, that means the normality condition is met.

From the plot, it looks like the condition is met.

In conclusion, the model y = -121.30 – 0.39saturates – 1.70wax + 1.23carbon + 4.33hydrogen – 0.0027vanadium looks like quite a good model to explain the asphalt surface free energy in mJ/m^2.

The post Multiple Linear Regression – CA4 – Student Test Grades appeared first on DBS Student's Data Project Blog.

For my first blog post I will be conducting Multiple Linear Regression in the statistical programming software R. The data set I have chosen is an open data set from the University of California called the Student Performance Data Set. The data looks at student achievement in two secondary schools in Portugal. It was collected using school reports and questionnaires. I downloaded the CSV file from the following URL:

https://archive.ics.uci.edu/ml/machine-learning-databases/00320/

First I downloaded some packages in R that may be required further into the analysis, including the ‘MASS’ and ‘car’ packages. I then read the file into a new script in RStudio using the ‘read.csv’ function, setting the header argument to true and the separator to ‘;’. I then used the ‘head()’ function to check that the data had been read in properly and the ‘str()’ function to see the structure of the data.

[sourcecode language="r"]
#install packages for correlation matrix
library(MASS)
install.packages('car')
library(car)
dataset <- read.csv(file.choose(), header = TRUE, sep = ";") # import the data
head(dataset) # quick view of data
str(dataset) # structure of data
names(dataset) # names of variables
attach(dataset) # attach dataset
[/sourcecode]

The data structure showed a data frame of 395 observations across 33 variables. After looking at the names of the attributes and reading the supplementary breakdown of the variables in the zipped download file, I knew the response variable that I required was ‘G3’ which is the final grade of the students. I want to see which other variables from the data affect the grades of the students the most, if any. In order to get an insight into what variables may be best for the multiple linear regression I produced a ‘pairs’ plot and also a scatter plot matrix on a subset of the data I found of interest. See code as follows:

[sourcecode language="r"]
library(psych) # provides pairs.panels(); assumed to be installed
# create a subset of the data
df <- data.frame(age, traveltime, studytime, failures, freetime, goout, health, absences, G3)
summary(df) # no NA's found in subset
pairs.panels(df, col="red")
scatterplotMatrix(df)
[/sourcecode]

Also see the output of both ‘pairs.panels’ and ‘scatterplotMatrix’:

Plot 1: pairs

Plot 2: scatterplotMatrix

After analysis of the plots I decided the variables of most interest as predictors to grades were ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’. Failures was also highly correlated to grades but this seemed a very obvious and uninteresting regression predictor. Now that I have my predictor variables it is time to build our model in R.

R has a very useful ‘lm()’ function or ‘Linear Model’ function. We use the same function to fit a simple linear model, which has one predictor variable, as we do a multiple linear model with multiple predictors. This is one of the many great capabilities of R. To understand the syntax of the function ‘lm()’ you can either type a question mark in front of the function or call the ‘help()’ function like so:

?lm()

help(lm)

This will open the help page to help you plug in your variables. The ‘lm()’ function has an ‘na.action’ argument (for example ‘na.omit’) if you need to handle any ‘NA’s’ in your data; however, for the variables I chose the data was clean and did not have any.

[sourcecode language="r"]
help(lm)
model1 <- lm(G3 ~ studytime + absences + freetime + goout)
summary(model1)
[/sourcecode]

As you can see above, the dependent variable ‘G3’ is plugged in first, followed by my chosen predictors of ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’. I have stored this linear model in a variable called model1 in order to call it for further analysis with different functions. I then called the summary function on ‘model1’ to see the performance of the model, as follows:

Call:

lm(formula = G3 ~ studytime + absences + freetime + goout)

Residuals:

Min 1Q Median 3Q Max

-12.880 -1.861 0.319 2.997 8.792

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 9.96372 1.14299 8.717 < 2e-16 ***

studytime 0.55537 0.27553 2.016 0.04452 *

absences 0.02940 0.02869 1.025 0.30612

freetime 0.32700 0.24125 1.355 0.17605

goout -0.61271 0.21434 -2.859 0.00448 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.53 on 390 degrees of freedom

Multiple R-squared: 0.03229, Adjusted R-squared: 0.02236

F-statistic: 3.253 on 4 and 390 DF, p-value: 0.01212

The first number we are interested in is the Multiple R-squared of 0.03229, which shows us that approximately 3.2% of the variance in grades can be explained by our model. The F-statistic and p-value test the null hypothesis that all the model coefficients other than the intercept are 0. The Residual standard error gives us an idea of how far observed grades are from the predicted or fitted grades. The intercept estimate above shows us what grade would be expected for a student who had a ‘studytime’, ‘absences’, ‘freetime’ and ‘goout’ of 0. We can also see the significance of the predictors as they relate to grades, with ‘goout’ and ‘studytime’ being the only two of the four that are significant at the 0.05 level.
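The figures discussed above can also be pulled out of the summary object programmatically. The sketch below uses simulated data shaped like the student variables (same names, made-up values) and passes the data frame through the data argument instead of attach().

```r
set.seed(31)  # arbitrary seed for a repeatable sketch

# Made-up data shaped like the post's variables (not the real data set)
n <- 395
students <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  absences  = rpois(n, lambda = 5),
  freetime  = sample(1:5, n, replace = TRUE),
  goout     = sample(1:5, n, replace = TRUE)
)
students$G3 <- 10 + 0.5 * students$studytime -
  0.6 * students$goout + rnorm(n, sd = 4.5)

# data = avoids the need for attach()
model1 <- lm(G3 ~ studytime + absences + freetime + goout, data = students)
s <- summary(model1)

s$r.squared                    # Multiple R-squared
s$sigma                        # Residual standard error
s$coefficients[, "Pr(>|t|)"]   # p-value for each coefficient

# Overall F-test p-value, rebuilt from the stored F statistic
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```

With 395 observations and five estimated coefficients, the residual degrees of freedom come out at 390, matching the output shown in the post.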

The model didn’t perform as well as I expected from the high correlation seen in the graphs, but with so many variables to consider, a more comprehensive feature analysis needs to be conducted. The fact that study time positively affected a student’s grades and that going out negatively affected them is completely intuitive. I think an interesting aspect of this data set for further analysis would be a breakdown by sex and parents’ occupation to see what effect this might have on a student’s grade.

The post Multiple Linear Regression with R – CA4 appeared first on DBS Student's Data Project Blog.

Mathematical relationships describe many aspects of everyday life, and to model these relationships we use Linear Regression Analysis. One of the most common forms of linear regression analysis is Multiple Linear Regression, which is used to explain/predict the relationship between one continuous dependent variable and two or more independent variables, where the independent variables can be continuous or categorical.

For this post we are going to show how to apply Multiple Linear Regression in R and we have chosen the dataset “Birth Rates and Economic Development” available for download at: http://users.stat.ufl.edu/~winner/data/birthrate.dat

See below data description:

Dataset: birthrate.dat

Description: Birth Rates, per capita income, proportion (ratio?) of population in farming, and infant mortality during early 1950s for 30 nations.

Variables/Columns:

Nation 1-20

Birth Rate 22-25 /* 1953-1954 (Units not given) */

Per Capita Income 30-33 /* 1953-1954 in 1948 US $ */

Proportion of population on farms 38-41 /* Circa 1950 */

Infant Mortality Rate 45-49 /* 1953-1954 */

Note that this dataset doesn’t come with column names, so we have to add them and save it as a .csv file before loading the data into R, which is easily done using Excel.

Loading data into R:

*mortality <- read.csv(file.choose(), header = T)*

Now, let’s have a look at our dataset details:

The function *head( )* gives us a taste of what the data looks like and we can also get information about the structure of this data using the *str( )* function. As we can see the dataset has 30 observations and 5 variables, with **Nation** being nominal and **Birth_Rate**, **PC_Income**, **Pop_farm** and **Mort_Rate** numerical.

For this data we are interested in studying the possible causes contributing to Infant Mortality Rate, so we are going to analyse the relationships between Infant Mortality Rate and Birth Rates, Per Capita Income and Proportion of population on farms. In other words, our dependent variable will be **Mort_Rate** and our independent variables will be **Birth_Rate**, **PC_Income** and **Pop_farm**.

Let’s first visualize our data using the function *pairs( )* to examine the correlation between them:

We should bear in mind that for a simple linear regression it is easier to see if there is a problem with the model by graphing X and Y. However, with multiple linear regression it is not that simple; there may be some interaction between variables that a simple scatter plot won’t show.

So let’s have a look at the correlation matrix and see if we can get more information:

*cor(mortality[c("Birth_Rate", "PC_Income", "Pop_farm", "Mort_Rate")])*

We can see that **Mort_Rate** and **Pop_farm** seem to have a fairly strong correlation of **0.69**. Also, it is interesting to note that **Mort_Rate** and **PC_Income** show a negative correlation of **-0.74**, as do **Pop_farm** and **PC_Income**, with **-0.77**, suggesting that infant mortality is higher where per capita income is lower, and that per capita income is lower among the farming population. Note that **Birth_Rate** doesn’t seem to show significant correlation with the other variables apart from **Mort_Rate**.

Let’s go ahead and build our model:

*modelmortality <- lm(Mort_Rate ~ .-Nation, mortality)*

*#note that when we use ~ . we regress our dependent variable Mort_Rate on all the other variables, minus the variable Nation, which is not relevant to our model as it is not numerical.*
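To show what the dot in the formula expands to, the sketch below fits the model both ways on a made-up data frame that borrows the post's column names. The values are simulated, so the coefficients are not those of the real birth rate data.

```r
set.seed(41)  # arbitrary seed for a repeatable sketch

# Made-up stand-in with the post's column names (not the real data)
mortality <- data.frame(
  Nation     = paste("Nation", 1:30),
  Birth_Rate = rnorm(30, mean = 25, sd = 8),
  PC_Income  = rnorm(30, mean = 500, sd = 250),
  Pop_farm   = runif(30, min = 0.05, max = 0.7)
)
mortality$Mort_Rate <- 40 + 1.5 * mortality$Birth_Rate -
  0.04 * mortality$PC_Income + rnorm(30, sd = 10)

# "." expands to every other column; "- Nation" drops the label column
model_dot  <- lm(Mort_Rate ~ . - Nation, data = mortality)

# The same model with the predictors written out by hand
model_full <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm,
                 data = mortality)

coef(model_dot)
```

Both calls give identical coefficients; the dot form simply saves typing when there are many predictors.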

When we call the object **modelmortality** we get the regression coefficients:

The parameters of the fitted model tell us how the independent variables are related to the dependent variable, but to find out how well the model fits the data we use the function *summary( )*. This function allows us to evaluate the model’s performance:

Let’s analyse the outcomes:

The **Residuals** section provides summary statistics for the errors in our predictions, some of which are apparently quite substantial.

The stars indicate the predictive power of each feature in the model, based on the **p-value** for each estimated regression coefficient. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. Here we can see statistically significant results for **Birth_Rate** and **PC_Income**, but **Pop_farm** doesn’t seem to add much to our model.

The **Multiple R-squared** value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable and we got a result of **71%** which is a very good result.

Now let’s build a second model excluding **Pop_farm** and see if we can improve our model:

We can see that the second model has a slightly lower R-squared value of **70%** and does not improve on our first model, so we will stick to the first model **modelmortality**.

Overall, given the preceding three performance indicators, our model is performing fairly well, showing that there is an important influence on **Mort_Rate** from **Birth_Rate** and, most importantly, **PC_Income**, but not so much from **Pop_farm** as initially seen in the correlation matrix.
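One point worth keeping in mind when comparing the two models: plain R-squared can never go up when a predictor is removed, so adjusted R-squared or an F-test between the nested models is a fairer comparison. A sketch on made-up data (same column names, simulated values) might look like this:

```r
set.seed(42)  # arbitrary seed for a repeatable sketch

# Made-up stand-in for the birth rate data (illustrative values only)
mortality <- data.frame(
  Birth_Rate = rnorm(30, mean = 25, sd = 8),
  PC_Income  = rnorm(30, mean = 500, sd = 250),
  Pop_farm   = runif(30, min = 0.05, max = 0.7)
)
mortality$Mort_Rate <- 40 + 1.5 * mortality$Birth_Rate -
  0.04 * mortality$PC_Income + rnorm(30, sd = 10)

full    <- lm(Mort_Rate ~ Birth_Rate + PC_Income + Pop_farm, data = mortality)
reduced <- update(full, . ~ . - Pop_farm)   # drop Pop_farm

# Plain R-squared: the reduced model can only be equal or lower
c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared)

# Adjusted R-squared penalises the extra predictor
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)

# F-test for whether Pop_farm adds anything beyond the other two
anova(reduced, full)
```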

Let’s plot the model:

Our first graph shows if residuals have non-linear patterns, in our model’s case the residuals.

The Normal Q-Q shows if residuals are normally distributed which is clearly the case here.

Scale-Location plot shows if residuals are spread equally along the ranges of predictors denoting equal variance. As we can see we have a horizontal line with equally spread points.

Residual vs. Leverage (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). We see that points 7, 9 and 16 have great influence on the model.
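The four diagnostic plots described above come from calling plot() on the fitted model, and the most influential points can also be listed by their Cook's distance. The sketch below uses a small simulated data set with one deliberately unusual observation; it is not the birth rate data.

```r
set.seed(43)  # arbitrary seed for a repeatable sketch

# Simulated data with one unusual point injected at position 30
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
y[30] <- y[30] + 8
fit <- lm(y ~ x)

# The four standard diagnostic plots on one page
par(mfrow = c(2, 2))
plot(fit)

# Cook's distance for every observation, largest three first
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```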

____________________________________________________________________________

**References:**

https://www.statisticssolutions.com/what-is-multiple-linear-regression/

Multiple Linear Regression – Course Notes

https://www.investopedia.com/terms/m/mlr.asp

R. Weintraub (1962). “The Birth Rate and Economic Development: An Empirical Study”, Econometrica, Vol. 30, #4, pp 812-817.

The post Multiple Linear Regression with R – CA4 appeared first on DBS Student's Data Project Blog.

]]>The post Big data appeared first on DBS Student's Data Project Blog.

]]>**What’s Big Data?**

When we speak of Big Data we mean data sets, or combinations of data sets, whose size (volume), complexity (variability) and speed of growth (velocity) make them difficult to capture, manage, process or analyze with conventional technologies and tools, such as relational databases and conventional statistics or visualization packages, within the time needed for them to be useful.

Although the size used to determine if a given dataset is considered Big Data is not firmly defined and continues to change over time, most analysts and professionals currently refer to datasets ranging from 30-50 Terabytes to several Petabytes.

The complex nature of Big Data is mainly due to the unstructured nature of much of the data generated by modern technologies such as web logs, radio frequency identification (RFID), embedded sensors in devices, machinery, vehicles, Internet searches, social networks like Facebook, laptops, smartphones and other mobile phones, GPS devices and call center records.

In most cases, to effectively use Big Data, it must be combined with structured data (usually from a relational database) of a more conventional business application, such as an ERP (Enterprise Resource Planning) or CRM (Customer Relationship Management).

What makes Big Data so useful to many companies is the fact that it provides answers to many questions that companies did not even know they had. In other words, it provides a reference point. With such a large amount of information, data can be molded or tested in any way the company deems appropriate. In doing so, organizations are able to identify problems in a more understandable way.

Collecting large amounts of data and finding trends within the data allows companies to move much more quickly, smoothly and efficiently. It also allows them to eliminate problem areas before the problems damage their profits or reputation.

Big Data analysis helps organizations leverage their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. The most successful companies with Big Data get value in the following ways:

**Cost reduction.** Big Data technologies, such as Hadoop and cloud-based analytics, provide significant cost advantages when it comes to storing large amounts of data, as well as identifying more efficient ways of doing business.

**Faster, better decision-making.** With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new data sources, companies can analyze information immediately and make decisions based on what they have learned.

**New products and services.** With the ability to measure customer needs and satisfaction through analysis comes the power to give customers what they want. With Big Data analytics, more companies are creating new products to meet the needs of customers.

For example:

**Tourism:** Keeping customers happy is key to the tourism industry, but customer satisfaction can be difficult to measure, especially in a timely manner. Resorts and casinos, for example, only have a small chance to turn around a bad customer experience. Big Data analysis gives these companies the ability to collect customer data, apply analysis and immediately identify potential problems before it is too late.

**Health Care:** The healthcare industry generates Big Data in large quantities. Patient records, health plans, insurance information, and other types of information can be difficult to manage, but they are full of key information once analytics are applied. That’s why data analysis technology is so important to health care. By quickly analyzing large amounts of information, both structured and unstructured, diagnoses or treatment options can be provided almost immediately.

**Public administration:** Public bodies face a great challenge: maintaining quality and productivity on tight budgets. This is particularly problematic in the justice system. Technology streamlines operations while giving management a more holistic view of activity.

**Retail:** Customer service has evolved in recent years as smarter buyers expect retailers to understand exactly what they need, when they need it. Big Data helps retailers meet those demands. Armed with endless amounts of customer loyalty program data, purchasing habits and other sources, retailers not only have a deep understanding of their customers, they can also predict trends, recommend new products and increase profitability.

**Manufacturing companies:** These deploy sensors in their products to receive telemetry data. This is sometimes used to provide communications, security and navigation services. This telemetry also reveals usage patterns, failure rates, and other product improvement opportunities that can reduce development and assembly costs.

**Advertising:** The proliferation of smartphones and other GPS devices gives advertisers the opportunity to reach out to consumers when they are near a store, a coffee shop or a restaurant. This opens new revenue streams for service providers and offers many companies the opportunity to win new prospects.

Other examples of the actual use of Big Data exist in the following areas:

Using IT logs to improve IT troubleshooting and the detection of security breaches, and to prevent future incidents more quickly and effectively.

Using the voluminous historical information of a call center quickly, to improve customer interaction and increase customer satisfaction.

Using social media content to understand customer sentiment more quickly and improve products, services and customer interaction.

Detecting and preventing fraud in any industry that processes online financial transactions, such as purchases, banking, investments, insurance and medical care.

Using financial market transaction information to assess risk more quickly and take corrective action.

**Big Data Quality Challenges**

The special features of Big Data mean that data quality faces multiple challenges. These features are known as the 5 Vs: Volume, Velocity, Variety, Veracity and Value, which define the Big Data problem.

These 5 characteristics make it hard for companies to extract real, high-quality data from data sets that are so massive, changing and complicated.

Until the arrival of Big Data, we could use ETL to load the structured information stored in our ERP and CRM systems, for example. But now we can also load additional information from outside the company’s domain: comments or likes on social networks, marketing campaign results, third-party statistical data, etc. All of this helps us know whether our products or services are working well or are having problems.

Some of the challenges Big Data’s data quality faces are:

- Many sources and types of data

With so many sources, data types and complex structures, the difficulty of data integration increases.

The data sources of big data are very broad:

Internet and mobile data.

Internet of Things data.

Sectoral data compiled by specialized companies.

Experimental data.

And the data types are also varied:

Unstructured data types: documents, videos, audios, etc.

Semi-structured data types: software, spreadsheets, reports.

Structured data types.

Only 20% of information is structured and this can lead to many errors if we do not undertake a data quality project.

**Big Data’s 5 V’s**

Big Data is characterized by five dimensions, known as the 5 V’s of Big Data. Let’s see what each of these aspects consists of:

**# 1 Volume**

Traditionally, data was generated manually. Now it comes from machines or devices and is generated automatically, so the volume to analyze is massive. This feature of Big Data refers to the sheer amount of data currently being generated.

The numbers are overwhelming. The data produced in the world in two days is equivalent to all the data generated before the year 2003. These ever-growing volumes of data pose important technical and analytical challenges for the companies that manage them.

**# 2 Velocity**

The data flow is massive and constant. In the Big Data environment, data is generated and stored at an unprecedented rate. This pace causes data to become outdated quickly and to lose its value when new data appears.

Companies, therefore, must react very quickly in order to be able to collect, store and process them. The challenge for the technology area is to store and manage large amounts of data that are generated continuously. All other areas must also work at high speed to convert that data into useful information before it loses its value.

**# 3 Variety**

The origin of the data is highly heterogeneous. It comes from multiple media, tools and platforms: cameras, smartphones, cars, GPS systems, social networks, travel records, bank transactions, etc. This is unlike a few years ago, when stored data was extracted mainly from spreadsheets and databases.

The data collected can be structured (easier to manage) or unstructured (in the form of documents, videos, emails, social networks, etc.). Each type of information is treated differently, using specific tools, but the essence of Big Data lies in then combining data from these different sources. This is why variety increases the complexity of storing and analyzing the data.

**# 4 Veracity**

This feature of Big Data is likely to be the most challenging. The large volume of data generated can make us doubt its veracity, since the great variety of data causes much of it to arrive incomplete or incorrect. This is due to multiple factors, for example data coming from different countries or suppliers using different formats. The data must be cleaned and analyzed, a never-ending activity as new data is continuously generated. Uncertainty about the veracity of the data may raise doubts about its quality and availability in the future.

For this reason, companies must ensure that the data they are collecting are valid, that is, they are adequate for the objectives that are intended to be achieved with them.

**# 5 Value**

This feature represents the most relevant aspect of Big Data: the value generated by the data once it has been converted into information. With this value, companies have the opportunity to make the most of the data to improve their management, define better strategies, gain a clear competitive advantage, make personalized offers to customers, strengthen their relationship with the public, and much more.

The post Big data appeared first on DBS Student's Data Project Blog.

]]>The post Big Data 5v´s appeared first on DBS Student's Data Project Blog.

There are many definitions of Big Data. Tim Kraska considers Big Data to be the class of data for which the technology in use cannot deliver answers with acceptable cost, time and quality. The McKinsey Global Institute refers to Big Data as data sets whose size exceeds the capabilities of database applications to capture, store, manage and analyze them. IDC focuses on obtaining value from data, extending the concept of Big Data to the set of new technologies and architectures designed to extract value from large volumes and a wide variety of data quickly, by facilitating its capture, processing and analysis. Perhaps this last definition is the one that best represents the concept of Big Data, since it brings technologies and data together to obtain value and characterizes the data by the volume, the variety and the velocity of its generation: the three V’s by which Big Data has traditionally been characterized.

The **volume** dimension is perhaps the most characteristic feature of the Big Data concept. Estimates of the data generated indicate unprecedented growth, driven by social networks and by the mobility that wireless networks and mobile telephony provide. This increase will mean a change of scale from terabytes to petabytes and zettabytes of information, making it difficult to store and analyze. However, depending on how it is used, much of this information may have a very short useful life, becoming obsolete very quickly. This aspect is linked to the velocity dimension.

The **velocity** with which data is created has increased considerably, requiring an adequate response in its processing and analysis. This speed of response is needed to cope with data obsolescence caused by rapid generation, which renders obsolete what was valid moments before; hence the distributed and parallel processing of the technologies that support the Big Data concept. The data analyst, in turn, needs to identify for each application which data has a very short life cycle and which has a longer one, a determination that is fundamental to making the data profitable and optimizing its use, increasing the accuracy and quality of the results.

**Variety** in Big Data is based on the diversity of data types and the different sources of data collection. Data can be structured, semi-structured or unstructured, and it comes from text and image files, web data, tweets, sensor data, audio, video, click streams, log files, etc. This is the richness that the Big Data concept entails. However, this potential richness increases the degree of complexity in storage as well as in processing and analysis.

One of the characteristics associated with data quality is the **veracity** of the data. Veracity can be understood as the degree of trust that can be placed in the data to be used. Within the characterization of Big Data, veracity is the fourth dimension, and it is of great importance for a data analyst, since it determines the quality of the results and the confidence in them. A high volume of information, created at great speed, based on structured and unstructured data and coming from a great variety of sources, inevitably raises doubts about its veracity. Depending on the application, veracity may therefore be essential, or may be taken on trust without being vital.

From the point of view of use and exploitation, the **Value** dimension represents the most relevant aspect of Big Data. As the volume and complexity of the data increase, its marginal value decreases considerably, because it becomes harder to exploit. (There should be a graph here which I find impossible to upload).

Marginal value of the data

Facilitating the exploitation of data to obtain value remains the fundamental objective of Business Intelligence and now of Big Data technologies. Increasing the marginal value of data, quickly and precisely and ahead of the competition, is one of the current challenges from the point of view of technology. So the evolution of the dimensions of Big Data runs from an academic interpretation of three dimensions (volume, variety and velocity), through the analyst’s view, where the veracity of the data is a fundamental dimension for the quality of the results, to the manager’s view, where the interpretation of value becomes essential for decision making.

Finally, social networks, coupled with the immediacy of wireless networks and mobile telephony, new cloud storage services, etc., have led to an ever-growing volume of data, generated very quickly, coming from few or many sources of information, whose veracity is difficult to verify and whose period of validity may be short. As the experience of Internet-based companies shows, learning to see these scenarios not as a difficulty but as a competitive advantage is one of the current challenges of implementing the technology associated with the Big Data concept.

The post Big Data 5v´s appeared first on DBS Student's Data Project Blog.

]]>