Oksana’s Blog

Critical analysis of “Why Data Analyst is The Sexiest Job of the 21st Century?”

“The ability to collect and present information in a way that a business can understand, so it can make decisions faster, will be key to keeping competitive” – Lisa Kelly in Computer Weekly

 

Big Data Analyst Profile                                                       

Having chosen this course and now being on the finishing line of it – the Diploma in Big Data – my biggest question, which I wish to carry into a critical analysis of the job’s advantages and disadvantages, is: why are data analyst jobs so attractive?

Before going into the research, it would be good to define who data analysts are and what duties they perform.

Data analysts search through numbers and translate them into plain English. There are many different types of data analysts in the field, including operations analysts, marketing analysts, financial analysts, etc.

Data analysts perform a variety of tasks related to collecting, organising, and interpreting statistical information. In any capacity, though, people with this job look for ways of assigning numerical values to different business functions, and are responsible for identifying efficiencies, problem areas, and possible improvements.

What data do they collect? It depends on the business: it could be sales figures, market research, logistics or transportation costs. Data analysts conduct research to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.

A data analyst’s job is to take that data and use it to help companies make better business decisions. These analytical findings can not only lead to more effective marketing and new revenue opportunities but also provide better customer service, improved operational efficiency, competitive advantages over rival organisations and other business benefits.

 

Growth of Data

Data analysts are in extremely high demand, but the work itself is equally demanding. Data science sits at the intersection of statistics, business intelligence, sociology, computer science, and communication.

Data is the new frontier of the 21st century, ripe for exploration. Data science—obtaining, analyzing, and reporting on data insights ranging from business metrics to user behavior—is an ultra-buzzy field right now.

According to market specialists’ predictions, the volume of both structured and unstructured data is expected to grow over the next five years, which in turn will require more employees to serve data analysis needs. As Dave Lounsbury, chief technology officer at The Open Group, puts it, both economic and social changes are pushing the rise of data analytics along with big data, and organisations have to respond to these changes to be successful. Social networks, mobile devices and real-time information are all producing big data which has to be handled appropriately. It is important for every company to know and understand its competitive data.

If you’ve got what it takes, there are plenty of companies eager to take what you’ve got. The McKinsey Global Institute has predicted that by 2018 the U.S. could face a shortage of between 140,000 and 190,000 people with deep analytical skills, and a shortage of 1.5 million managers and analysts who know how to leverage data analysis to make effective decisions.

In a survey by Robert Half Technology of 1,400 U.S.-based CIOs, 53% of the respondents whose companies are actively gathering data said they lacked sufficient staff to access that data and extract insights from it. Translation: you are sorely needed.

Opportunities

Ireland, and Dublin in particular, offers a lot of jobs in the area, with so many IT, software and pharmaceutical companies having their headquarters in the country. According to research by Forfás and the Expert Group on Future Skills Needs, up to 21,000 data analytics jobs could be generated here between 2014 and 2020. Due to rapid growth in the area, there is currently a worldwide shortage of professionals with the skills required to fill high-end positions in the field of data analysis. But although data analysis is a booming growth area, no one country has taken the lead.

The authors of the Assessing The Demand For Big Data And Analytics Skills report say this presents a significant opportunity for Ireland: 3,630 of the positions would be in deep analytical roles and 17,470 in big data type roles. However, the experts found that in order for Ireland to become a world leader in the area, it must have sufficient skilled professionals available to fill roles in indigenous companies and foreign multinationals working in the sector. In the short term, the best way to bridge the skills gap is through up-skilling and retraining.

There is a big gap between what is on the market right now and the number of people with the skills to take the jobs. What’s clear about big data and analytics is that there is scale right across the board for job-seekers. Companies of all sizes seek their own variant of analyst, with that scale subsequently offering a clear ladder to guide your career. The talent pool is becoming stretched as businesses of all sizes seek to garner more and more information from the data they consume. A report into the Irish software landscape by the Irish software engineering research centre Lero back in June highlighted this area in particular for future growth, and indeed its immediate importance.

“The firms (surveyed) highlighted a number of key technology platforms they believed important for future competitiveness,”

“These included cloud computing, data analytics and cyber-physical systems, all closely related to emerging R&D priority themes in Ireland and around the world.

“I think we’re just actually at the beginning of the curve of a tremendous development,” he said. “I think data and analytics has the potential to change many industries, even society, without wanting to exaggerate.”

Payment / salary

According to research, the median salary for a Data Analyst job would be around €33,731 p.a. (data as of 18 Mar 2016 from http://www.payscale.com), while www.glassdoor.com gives an estimated national average of €31,522, with the remark that salary estimates are based on 23 salaries submitted anonymously to Glassdoor.

American national salary ranges according to DataJobs are the following:

  • Data analyst (entry level): $50,000-$75,000
  • Data analyst (experienced): $65,000-$110,000
  • Data scientist: $85,000-$170,000
  • Database administrator (entry level): $50,000-$70,000
  • Database administrator (experienced): $70,000-$120,000
  • Data engineer (junior/generalist): $70,000-$115,000
  • Data engineer (domain expert): $100,000-$165,000

 

Benefits – in addition to pay, many companies offer a yearly bonus, which might be from 5 to 10% of yearly wages, health and life insurance, gym membership and other benefits.

Education to have:

Research shows that the majority of companies are looking for candidates with an education in fields like maths, statistics, computer science, or something related to these fields.

Skills to have:

General analytical skills – because data analysts work with huge volumes of data, it is necessary to have good analytical skills to see the root causes, to compare and to draw conclusions.

Information Compilation – There are several different strategies people can use to compile data, but there are typically three universal goals. The data must be regulated, normalized, and calibrated such that it could be taken out of context, used alone, or put in conjunction with other figures and still maintain its integrity.

Extrapolation and Interpretation – once the information has been collected, analysts are usually responsible for coming up with conclusions about what it means, as well as for educating business executives on how to use it.

Projections and Advisory Responsibilities – advising project managers and leaders about how certain data points can be changed or improved over time. Analysts are often the ones with the best sense of why the numbers are the way they are, which can make them a good resource when thinking about making changes.

Research and Writing Tasks – some project goals might involve writing tasks like drafting company memoranda, press releases, and formal reports. Analysts also collaborate with database programmers and administrators to write system modification recommendations or in-house instruction and training materials.

System Expertise and Troubleshooting – most of the work analysts do is completed with the help of computers and digitised statistical software, but the job usually also requires program troubleshooting and system security measures, as well as an ability to adapt to changing technology and keep updates current and useful across multiple platforms.

Attention to detail – the job requires spotting even small differences in data and drawing conclusions based on what has been found.

Math Skills – data analysts need math skills to estimate and interpret numerical data.

Critical thinking – based on the variety of data received, data analysts must look at the numbers, trends, and data and come to new conclusions based on the findings.

Good communication skills – the job is not only about working with data itself but about sharing findings with different departments, so it is important to communicate clearly across departments and to show findings in a simple, understandable way.

 

Programs normally required – according to payscale.com, the most popular programs to know in order to fully operate in the job would be: Microsoft Office (Excel, Word, Access), SQL, and familiarity with Hadoop.

Simple research showed this list of well-known companies hiring in the area:

Silicon Republic’s Featured Employers hiring in the area of big data and analytics include:

  • Accenture
  • AOL
  • Aon
  • Bank of America Merrill Lynch
  • Dropbox
  • EMC
  • Fenergo
  • Fidelity Investments
  • FireEye
  • Information Mosaic
  • Pramerica
  • Quantcast
  • TripAdvisor
  • Twitter

Just by doing some basic research on LinkedIn Jobs (as of 20-04-16), there were 681 vacancies for data analyst positions in Ireland.

There are several universities offering courses for data analysts; short courses are offered by the National College of Ireland and Dublin Business School.

As we can summarise from the research, this profession is in demand right now and many companies are looking for already-experienced data analysts. But when asking around, it was discovered that some companies were happy to take on board people who had just recently completed a data analyst course.

According to the surveys and research, it seems demand for data analysts will keep growing over the next few years.

It seems the Diploma in Big Data for Business course has been taken at the right time and should bring good dividends, like a new job, to many of those who complete it.

 

[Image: data analyst jobs variety]

 

Bibliography:

What is MIS?, https://mis.eller.arizona.edu/what-is-mis, [accessed 10-04-16]

Chris Morris, The Sexiest Job of the 21st Century: Data Analyst, http://www.cnbc.com/id/100792215, [accessed 9-04-16]

Data Analyst Salary, http://www.payscale.com/research/IE/Job=Data_Analyst/Salary, [accessed 9-04-16]

Data Analyst Salaries in Dublin, Ireland, https://www.glassdoor.com/Salaries/dublin-data-analyst-salary-SRCH_IL.0,6_IM1052_KO7,19.htm, [accessed 16-04-16]

Big data analytics, http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics, [accessed 10-04-16]

Data Analyst Job Description, http://www.snagajob.com/job-descriptions/data-analyst/, [accessed 11-04-16]

Big data and analytics: a large challenge offering great opportunities, http://www.computerweekly.com/feature/Big-data-and-analytics-a-large-challenge-offering-great-opportunities, [accessed 09-04-16]

What does a Data Analyst do?, http://www.wisegeek.com/what-does-a-data-analyst-do.htm, [accessed 09-04-16]

Alison Staad, Data Analysts: What You’ll Make and Where You’ll Make It, November 26, 2014, http://blog.udacity.com/2014/11/data-analysts-what-youll-make.html, [accessed 22-04-16]

Data analytics ‘could create 21,000 Irish jobs’, http://m.rte.ie/news/business/2014/0507/615831-data-analytics-report/, [accessed 22-04-16]

Top Tech Jobs 2015 – Big data and analytics, https://www.siliconrepublic.com/portfolio/2014/12/04/top-tech-jobs-2015-big-data-and-analytics, [accessed 22-04-16]

Oksana’s Fusion Table

FUSION TABLE BLOG


Oksana Vinkarklina
Student ID 10129501

WHAT IS A FUSION TABLE FOR?

Google Fusion Tables (or simply Fusion Tables) is a web service provided by Google for data management. Fusion tables can be used for gathering, visualising and sharing data tables. Data are stored in multiple tables that Internet users can view and download. The website launched in June 2009, announced by Alon Halevy and Rebecca Shapley. The web service provides means for visualising data with pie charts, bar charts, lineplots, scatterplots, timelines, and geographical maps. Data are exported in a comma-separated values file format. The size of uploaded data sets is currently limited to 250 MB per user.
In the 2011 upgrade of Google Docs, Fusion Tables became a default feature, under the title “Tables (beta)”.


WHY USE FUSION TABLES?

Fusion Tables is an experimental data visualisation web application to gather, visualise, and share data tables.

Find public data

Google Tables helps you search thousands of public Fusion Tables, or millions of public tables from around the web that you can import to Fusion Tables.

Import your own data

Upload data tables from spreadsheets or CSV files, even KML. Developers can use the Fusion Tables API to insert, update, delete and query data programmatically. You can export your data as CSV or KML too.

Visualise it instantly

See the data on a map or as a chart immediately. Filter for more selective visualisations.

Publish your visualisation on other web properties

Now that you’ve got that nice map or chart of your data, you can embed it in a webpage or blog post, or send a link by email or IM. It will always display the latest data values from your table and helps you communicate your story more easily.


COMMENTARY BLOG POST

1) For the fusion table I took data from http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/. Because I was a bit rushed and could not find a downloadable file on the website, I copied the data into Excel and cleaned out all unnecessary rows.

2) Next step was to download KML file from www.independent.ie

3) Google Fusion Map app was downloaded from https://chrome.google.com/webstore/search/fussion%20table

4) Files were uploaded to my Google Drive

5) Both tables were merged – csv and kml

6) I activated the Map of Geometry and could see the population figures

7) I have formatted my map so the variance in population is easily identified. Clicking on the ‘change feature styles’ button opens a tool box for editing.

[Screenshot: Irish population map]

8) I changed the name of the table to “Irish population”

9) Some editing was done using ‘Change feature styles’, especially to highlight the most densely populated areas with bright colours.

10) It was also important to make my fusion map public – to implement this you just need to change the setting from ‘Private’ to ‘Public’ in the top right of the page.

[Screenshot: publishing settings]

11) To obtain the embed code and publish it on the blog, I took it from Tools > Publish

12) My fusion map is now live on my blog.

CONCLUSION:

When all amendments are done and the map is live on the blog, we can see that the most populated areas in Ireland are Dublin with a population count of 1,273,069, followed by Cork (population of 519,032), and the third biggest by population is Galway (250,653).

Areas with the smallest population are highlighted in light blue and can be easily identified on the map.

 

Big data

BIG DATA

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. The challenges include analysis, capture, search, sharing, storage, transfer, visualization, querying and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it.

Big data is arriving from multiple sources at an alarming Velocity, Volume and Variety.

Big data is changing the way people within organizations work together. It is creating a culture in which business and IT leaders must join forces to realize value from all data. Insights from big data can enable all employees to make better decisions: deepening customer engagement, optimizing operations, preventing threats and fraud, and capitalizing on new sources of revenue. But escalating demand for insights requires a fundamentally new approach to architecture, tools and practices.

ADVANTAGES OF BIG DATA IN AN ORGANISATION

  • Competitive advantage

Data is emerging as the world’s newest resource for competitive advantage.

The use of Big Data is becoming a crucial way for leading companies to outperform their peers. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value. Indeed, we found early examples of such use of data in every sector we examined. In healthcare, data pioneers are analysing the health outcomes of pharmaceuticals when they were widely prescribed, and discovering benefits and risks that were not evident during the necessarily more limited clinical trials. Other early adopters of Big Data are using data from sensors embedded in products, from children’s toys to industrial goods, to determine how these products are actually used in the real world. Such knowledge then informs the creation of new service offerings and the design of future products.

Big Data will help to create new growth opportunities and entirely new categories of companies, such as those that aggregate and analyse industry data. Many of these will be companies that sit in the middle of large information flows, where data about products and services, buyers and suppliers, and consumer preferences and intent can be captured and analysed. Forward-thinking leaders across sectors should begin aggressively building their organisations’ Big Data capabilities.

  • Decision making – from the elite few to the empowered many

Understanding the business impact of Big Data and its value will help organisation leaders make better decisions and drive performance.

The Economist Intelligence Unit surveyed over 600 business leaders, across the globe and across industry sectors, about the use of Big Data in their organisations. The research confirms a growing appetite for data and data-driven decisions; those who harness these correctly stay ahead of the game. The report provides insight on their use of Big Data today and in the future, and highlights the advantages seen and the specific challenges Big Data brings to decision making for business leaders:

  • 75% of business owners believe their organisation to be data-driven
  • 9 out of 10 believe decisions made in the past 3 years would have been better if they had had all the relevant information
  • 42% say unstructured content is too difficult to interpret
  • 85% say the issue is not about volume but about the ability to analyse and act on the data in real time
  • Value of data – as the value of data continues to grow, current systems won’t keep pace

THE 3 V’S IN BIG DATA

VOLUME

We currently see exponential growth in data storage, as data is now much more than text: we find data in the format of videos, music and large images on our social media channels. It is very common for enterprises to have terabytes and even petabytes of storage. As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data. This big volume indeed represents Big Data.

VELOCITY

The data growth and social media explosion have changed how we look at data. There was a time when we believed that yesterday’s data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive news. Today, people rely on social media to update them with the latest happenings. On social media, messages that are a few seconds old (a tweet, a status update etc.) are sometimes no longer of interest to users; they often discard old messages and pay attention to recent updates. The data movement is now almost real time, and the update window has reduced to fractions of a second. This high-velocity data represents Big Data.

VARIETY

Data can be stored in multiple formats: for example a database, Excel, CSV, Access or, for that matter, a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF or something we might not have thought about. It is the need of the organisation to arrange it and make it meaningful. This would be easy if we had all data in the same format, but that is not the case most of the time. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data represents Big Data.

CONCLUSION

Big Data is not just about lots of data; it is a concept providing an opportunity to find new insight into your existing data, as well as guidelines to capture and analyse your future data. It makes any business more agile and robust, so it can adapt and overcome business challenges.

Fusion Tables

As part of our assignment on the Irish Census Data for 2011, we were asked to use Google’s Fusion Tables to create a heat map. This was done by cleansing the data file provided and extracting the information required for the heat map. I didn’t have a Gmail account set up, so the first step was to create one. Once set up, I logged into Google Drive and selected Fusion Tables. The first file I loaded was the clean map KML file which was uploaded to Moodle; I repeated the same steps to load the cleansed census data I had manipulated. Once loaded, I selected File, Merge from the dropdown.

I updated the county column on both files to ensure they had the same naming convention; once completed, I used this column to merge the data together.

Next a pop-up appeared to select the columns to be used on the heat map; I deselected males, females & description & hit ‘view table’.

Once I navigated to the map of geometry tab, my first map appeared, as below.

When the feature map was selected, the map appeared as below:

Map 2

Fusion Tables allows you to present your data in a visual, interactive way. This data was collated after importing into Fusion Tables and merged on the County attribute to obtain the total population and male and female population heat maps as required.

Map 3

The map aims to reflect the population data from the 26 counties in ROI.

The benefits of data visualisation to a business or decision maker are to collate and understand the data more easily, and to determine and spot patterns and trends in business operations.

It is quite apparent from the heatmap above that Dublin & Cork are jumping out above the other 24 counties. For a business this information helps decision makers identify abnormalities or highlights immediately. In this case both Dublin & Cork are the most densely populated counties, & visual synopses such as this can prove to be very useful for fast-moving businesses who need to spot trends & risks as quickly as possible.

 

 

 

 

Movie Performance

The aim of this project was to investigate how a movie performs over the first three weeks after release. How a movie performs depends on many factors, including which actors are starring, how popular the franchise is, competition at the time of release, etc. The two main factors investigated in this project are critical reception and movie budget. Critical reception is measured using the score the movie received on RottenTomatoes.com. This website looks at the ratings a film receives from a number of sources and gives the percentage of positive scores, so a film with a score of 73% means that 73% of the critics scored the film positively. Another factor investigated was the budget of the film. Ideally the marketing budget would have been investigated, but this information was not available. For this analysis, the assumption is that the production budget is proportional to the marketing budget.

The sample investigated was the highest grossing movies of 2015 (http://www.the-numbers.com/market/2015/top-grossing-movies). It should be noted that the majority of the data came from the-numbers.com; where the-numbers.com did not have the required data, boxofficemojo.com was used. The data from these two sites was generally consistent; however, there were some differences, especially in the production budget.

Analysis

Budget and Performance

This analysis was carried out using RStudio after the data was collected and stored in an Excel file. A linear model was created for each relationship and the summary of the linear model was observed. The first relationship investigated was production budget and total gross.
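The lm call itself appears in the summary below; as a minimal sketch, the steps leading up to it would look roughly like this (the CSV file name is an assumption; the column names Total.Gross.. and PB..m are taken from the Call shown in the output):

# Load the collected data (exported from Excel as CSV) and fit
# total gross (USD) against production budget (USD millions).
movies <- read.csv("top_grossing_2015.csv")   # hypothetical file name
model_total <- lm(Total.Gross.. ~ PB..m, data = movies)
summary(model_total)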

Call:
lm(formula = Total.Gross.. ~ PB..m, data = movies)

Residuals:
       Min         1Q     Median         3Q        Max
-210670972  -33115380   -5740931   15731709  653073398

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 24737681   15233454   1.624    0.108
PB..m        1286692     164301   7.831 6.12e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103700000 on 97 degrees of freedom
Multiple R-squared:  0.3874,  Adjusted R-squared:  0.381
F-statistic: 61.33 on 1 and 97 DF,  p-value: 6.115e-12

 

This shows a reasonably strong relationship between the budget and success of a movie. A low p-value rejects the null hypothesis and an R-squared value of 0.39 shows a correlation between success and budget.

The opening weekend is generally the most profitable for a big-budget movie and a strong indicator of how a movie will finish. It can be used as a measure of how successful the marketing campaign for a movie was. A linear model of this relationship was created and the following summary report was found:

Residuals:

Min        1Q    Median        3Q       Max

-92762124 -17709634  -4254184  10566517 263492439

 

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)  5966236    6649620   0.897    0.372

PB..m         606987      71720   8.463 2.75e-13 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 45260000 on 97 degrees of freedom

Multiple R-squared:  0.4248,   Adjusted R-squared:  0.4188

F-statistic: 71.63 on 1 and 97 DF,  p-value: 2.754e-13
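The call behind this summary is not shown in the post; a minimal sketch, assuming the opening-week gross is stored in a column named Week1 (a hypothetical name; PB..m is the predictor shown above):

# Fit opening-week gross against production budget in millions.
model_open <- lm(Week1 ~ PB..m, data = movies)   # Week1 is an assumed column name
summary(model_open)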

 

The R-squared value is 0.42; this is a slight increase from 0.39, but may not be a large enough increase to call significant. It indicates that a movie with a strong marketing campaign will get customers excited enough to want to see the film as soon as it is released. However, this model does have a number of flaws. Firstly, it assumes that the production budget is proportional to the marketing budget, which may not be the case. Secondly, an expensive budget may not always be used as effectively as a cheaper one.


Figure 1 – Opening week (USD) vs Budget (USD in millions)

 

Value for Money

A big-budget movie should expect to see large ticket sales, but this does not necessarily mean the best return on investment. For the following analysis, the total gross from “Budget and Performance” was divided by the production budget to give a Total Gross Factor. This indicates the best value for money when judging the performance of a movie.
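A sketch of that calculation, assuming Total.Gross.. is recorded in dollars and PB..m in millions of dollars (hence the scaling by 1e6):

# Total gross factor = total gross / production budget (both in dollars).
movies$Total.Gross.Factor <- movies$Total.Gross.. / (movies$PB..m * 1e6)
model_factor <- lm(Total.Gross.Factor ~ PB..m, data = movies)
summary(model_factor)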

When comparing the total gross factor to the production budget the following summary was observed.

Residuals:
    Min      1Q  Median      3Q     Max
-1.9247 -1.0234 -0.5483  0.0784 16.8228

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.098272   0.379341   5.531 2.69e-07 ***
PB..m       -0.011042   0.004091  -2.699  0.00821 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.582 on 97 degrees of freedom
Multiple R-squared:  0.06984,  Adjusted R-squared:  0.06025
F-statistic: 7.283 on 1 and 97 DF,  p-value: 0.008211

 

This showed a higher p-value, so the null hypothesis could still be rejected, but not as definitively as before. The R-squared value is 0.07, which is not significant. This suggests that spending more money on a movie will not increase the chances of getting a better return on the money.

Critical reception

This portion of the analysis looks at how critical reception influences the performance of a movie. It uses the same logic as before, but the independent variable is now the score found on Rotten Tomatoes.

The relationship between week 1 and the Rotten Tomatoes score was as follows:

Residuals:
      Min        1Q    Median        3Q       Max
-66987362 -28636601 -12231937  10615737 326215017

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17979800   13907457   1.293   0.1991
RT..          507187     220459   2.301   0.0236 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58110000 on 97 degrees of freedom
Multiple R-squared:  0.05174,  Adjusted R-squared:  0.04197
F-statistic: 5.293 on 1 and 97 DF,  p-value: 0.02356

 

Again this shows a higher p-value than before, but it is still below the cut-off point of 0.05. The null hypothesis could be rejected, but not with much confidence. The R-squared of 0.05 shows there is no significant correlation. The same analysis was carried out for weeks 2 and 3, which showed similar figures, concluding that critical reception made no difference to box office performance in the first 3 weeks.

Drop off

Another important measure when projecting the success of a movie is the drop-off rate. Big-budget movies expect large numbers during the opening weekend. These are the customers who were sold by the marketing, but if there are poor reviews a movie may suffer a large drop-off by week 2. While it was previously concluded that critical reception does not affect performance, it may still contribute to the decline in weeks 2 and 3. For this analysis, the gross for the week in question was found as a percentage of the gross for week 1. This percentage was then compared to the Rotten Tomatoes score.
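A sketch of the week-2 calculation, using hypothetical column names Week1 and Week2 for the weekly grosses (RT.. is the Rotten Tomatoes score that appears in the summaries below):

# Week-2 gross as a proportion of week-1 gross, regressed on the RT score.
movies$Drop.Wk2 <- movies$Week2 / movies$Week1   # Week1 and Week2 are assumed names
summary(lm(Drop.Wk2 ~ RT.., data = movies))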

When comparing critical reception to the drop off for week 2 the following summary was found:

Residuals:
    Min      1Q  Median      3Q     Max
-0.6549 -0.3367 -0.1862  0.0597  3.8212

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.242155   0.177543   1.364  0.17575
RT..        0.008711   0.002814   3.095  0.00257 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7418 on 97 degrees of freedom
Multiple R-squared:  0.0899,   Adjusted R-squared:  0.08051
F-statistic: 9.581 on 1 and 97 DF,  p-value: 0.00257

 

This shows a similar result to the relationship between critical reception and performance. The p-value is close to the cut-off and the R-squared is too low to be significant. When comparing week 2 to week 3 and week 1 to week 3, the results were similar. Another analysis carried out was to determine whether critical reception has any effect on how the film finishes compared to how it performed in week 1. Again, no significant correlation was found.

Budget and Reception

Finally, the last relationship examined was whether the budget has any relationship with how the movie was received. The summary was as follows:

Residuals:

Min      1Q  Median      3Q     Max

-51.568 -22.971   4.497  18.736  44.763

 

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 52.97059    3.88668   13.63   <2e-16 ***

PB.m         0.06331    0.04192    1.51    0.134

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 26.45 on 97 degrees of freedom

Multiple R-squared:  0.02298,  Adjusted R-squared:  0.0129

F-statistic: 2.281 on 1 and 97 DF,  p-value: 0.1342

 

This was the weakest correlation of all the tests and the only one with a p-value above the cut-off of 0.05.

Predicting Opening Week Performance

As stated earlier, the success of a movie depends on many factors, including the popularity of the franchise, actors, directors and competition at the time of release. For this example I have chosen to predict the domestic (U.S.) opening-week gross for a film being released in late May 2016, “X-Men: Apocalypse”. It is difficult to find a reliable production budget for a movie that has not yet been released, but one site I found stated $234 million (https://ihumanmedia.com/2015/12/13/234-million-budget-xmen-apocalypse-review/ – accessed 16th April 2016).

The linear model gives an equation of 606,986.9 × production budget in millions + 5,966,236.1, which predicts an opening week of $148,001,170. A 95% confidence interval predicts an opening week between $122,657,183 and $173,345,156.
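This prediction can be reproduced with predict(); a minimal sketch, assuming the opening-week model fitted earlier is stored as model_open (a hypothetical name):

# Predict the opening week for a $234m budget, with a 95% confidence
# interval for the mean response.
newfilm <- data.frame(PB..m = 234)
predict(model_open, newdata = newfilm, interval = "confidence", level = 0.95)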

The other factors (franchise popularity, actors etc.) contributing to the opening numbers of a movie are difficult to quantify but may be possible to capture by looking at previous movies in the franchise. By looking at how these movies performed these others factors may be accounted for.

I have put together a table that looks at the previous movies in the franchise and compares how they actually performed against the model. When calculating this I also took inflation into account.

Inflation to 2015 | Release Date | Movie | Budget ($m) | Budget adj. for inflation | Model | Model adj. for inflation | Actual | Actual adj. for inflation | Error
1.376 | Jul 14, 2000 | X-Men | 75 | 103.2 | 51,490,254 | 70,850,589 | 75,850,059 | 104,369,681 | 47.3%
1.288 | May 2, 2003 | X2 | 125 | 161 | 81,839,599 | 105,409,403 | 107,645,000 | 138,646,760 | 31.5%
1.176 | May 26, 2006 | X-Men: The Last Stand | 210 | 246.96 | 133,433,485 | 156,917,778 | 141,331,162 | 166,205,447 | 5.9%
1.105 | May 1, 2009 | X-Men Origins: Wolverine | 150 | 165.75 | 97,014,271 | 107,200,770 | 102,642,147 | 113,419,572 | 5.8%
1.054 | Jun 3, 2011 | X-Men: First Class | 160 | 168.64 | 103,084,140 | 108,650,684 | 73,894,349 | 77,884,644 | -28.3%
1.017 | Jul 26, 2013 | The Wolverine | 115 | 116.955 | 75,769,730 | 77,057,815 | 73,313,850 | 74,560,185 | -3.2%
1.001 | May 23, 2014 | X-Men: Days of Future Past | 200 | 200.2 | 127,363,616 | 127,490,980 | 129,469,103 | 129,598,572 | 1.7%
0.995 | May 27, 2016 | X-Men: Apocalypse | 234 | – | 148,001,171 | – | – | – | –

Table 1 – X-Men Franchise – Performance and Model

From this table it can be found that the average error of the model is +8.7%; that is, on average the model under-projects by 8.7%. However, the first two movies of the franchise were among the releases that kicked off the current comic book movie boom. Before this point, studios may not have been as confident and budgeted less for these movies to minimise the risk. Since then, comic book movies have proven to be good investments, so studios feel more confident investing in them.

A reasonably big underperformer according to the model is “X-Men: First Class”. One reason it underperformed may be the competition at the time of release. Comparing the competition faced by “X-Men: Days of Future Past”, it appears that First Class had significantly tougher competition. Looking at day 7 for each film, the three next-highest-grossing movies facing First Class (including Pirates of the Caribbean, one of the most popular franchises of all time) went on to gross over $500 million domestically, while the competition for Days of Future Past only grossed $162 million. For this reason, I treated the first two movies and “First Class” as outliers and found the average without them. Excluding these outliers, the model under-projects by 2.5% for the X-Men franchise. Taking this into consideration, the model now predicts a gross of $150,991,210 for week 1. This assumes average competition and that the budget is actually $234 million. However, another movie being released on the same day is “Alice Through the Looking Glass”, the sequel to a very successful movie, which will likely hurt performance. It is also 3 weeks after the release of “Captain America: Civil War”, a successful franchise in the same genre with a strong fan following, and 1 week after the release of “Angry Birds”, a somewhat wild card at this stage that could take the family audience.

When taking this into consideration, the model could easily over-project. My best guess would be $146,461,474 for week 1, which is 7% lower than the previous number given (the model with previous franchise performance taken into consideration). Although I have given my prediction, I do not feel confident in it, as it is difficult to assess how the competition will perform, and I believe that will be the deciding factor.

Conclusion

The results indicate that the budget of a film correlates with how it will perform, especially in the first week. This does not mean that putting more money into a movie will increase the chances of success, but it may be a measure of how confident studios are in a film. Critical reception, however, seems to have no impact on the performance of a movie.

However, it should be noted that the sample analysed was the 100 top performers of 2015 (excluding a few outliers like re-releases). This is a relatively small sample taken from only one year, and it may be a biased selection: there were nearly 800 movies in the full list, so this sample represents roughly the top 12%. The data for all movies from 2015 could not be retrieved in its entirety because of the time needed to cleanse and sort the data. If all movies were taken into account, a stronger correlation between performance and critical reception might be found. The top performers will generally be large-budget movies; smaller-budget movies would be expected lower in the list and may behave differently. The correlation for lower-budget films may depend more on critical reception, as many small independent movies become popular through word of mouth. As far as the higher performing movies go, critical reception does not seem to affect the overall performance of a movie, nor does a big budget influence the critical response.

If looking to improve the model, accuracy might be improved by measuring the popularity of a movie franchise (from previous performance) and measuring the competition (looking at the budgets and franchise popularity of competing releases). These factors could be included in a multiple linear regression model, as sketched below.
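Such an extension might look like the following, where Franchise.Prev and Competition are hypothetical columns that would first have to be constructed from previous-performance and competing-release data:

# Multiple linear regression combining budget with the two proposed factors.
model_multi <- lm(Week1 ~ PB..m + Franchise.Prev + Competition, data = movies)
summary(model_multi)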

Time Series Sunspot Analysis, and Russia 1812

Introduction:

By Colm Dougan

CA4 for Darren Redmond

Student Id 10205174

Our Sun is a somewhat mysterious object.

A glowing ball of gas with arcing loops of magnetic fields which cool the surface and generate sunspots, which come and go like the ebb and flow of the tides. Using the right equipment these sunspots can be seen, counted and recorded. Fortunately, astronomers have been doing this for hundreds of years, and a record of sunspot activity from 1700 to 1988 is available in R, which contains some very useful functions and packages for analysing data that is a function of time.

In this short article we will analyse the sunspot data from the year 1700 to the year 1988 with a number of functions and packages in R, and plot the results.

Figure 1: Sunspots on the Sun

Figure 2: Close-up of sunspots on the Sun

Loading the Data into R

The first thing we have to do is load the data into R.

In this case it is quite easy to do, since the data is already built into R. We just have to tell R where the data is and load it into a vector with the command:

> sun1 = sunspot.year

 

To store the data as a time series object in R, we use the command:

 

> sun1timeseries = ts(sun1)

 

We now have a vector of 289 objects, which we can see in the top right-hand “Global Environment” window.

 

Basic Analysis

We can now plot the Data using the time series command:

 

> plot.ts(sun1timeseries, ylab="number of sunspots", xlab="Year", col="red")

 

And we get the following plot:


 

 

Figure 3 Basic Plot of the Sunspot Data

We can immediately see a regular repeating cycle of approximately eleven years.

But notice that the sunspot cycle dipped from around 1800 to about 1825. Let’s see if we can analyse this trend.

First we will use a log function to spot any trends, using the commands:

> sun1Log = log(sun1)

> plot.ts(sun1Log, ylab="Log(number of sunspots)", xlab="Year", col="red")


 

Figure 4 Log – Linear Plot of the Sunspot Data

We immediately notice another cyclic trend, of about 110 years duration. If we did a Fourier transform on the data we would see a peak at eleven years and a peak at a hundred and ten years.
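This claim can be checked quickly with the periodogram in base R; a sketch (the smoothing spans here are chosen by eye and are an assumption, not part of the original analysis):

> s <- spectrum(sun1timeseries, spans = c(3, 3), log = "no")

> 1 / s$freq[which.max(s$spec)]   # period (in years) of the strongest peak; expect a value near eleven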

Looking closer at the data

We can do a simple moving average on the data to take out the eleven-year variation and look at the overall trend, using the “TTR” package.

We load the package into R using the command:

> library("TTR")

Then we perform a Simple Moving Average on the data using the command:

> sun1SMA = SMA(sun1, n=8)

Note: n was found by experiment.

Then we plot our results using:

> plot.ts(sun1SMA, ylab="number of sunspots", xlab="Year", col="red")


Figure5: Simple Moving Average of our Sunspot Data.

Notice the large dip in our data around 1812, the year of Napoleon’s retreat from Moscow!

Records indicate it was very cold then; I wonder if the cold climate is related to the lack of sunspots on the Sun.

Figure 6: Napoleon’s Retreat from Moscow

Figure 7: Napoleon’s Retreat from Moscow; notice how cold it is

Figure 8: Famous diagram of Napoleon’s retreat from Moscow

Conclusion

R is indeed powerful for analysing Time Series Data.

By using simple moving averages on sunspot data recorded between 1700 and 1988, we were able to show a big dip in sunspot activity around the year 1812, when Napoleon was invading Russia. The climate was recorded as being very cold at this time. It seems General Winter saved Russia in 1812, as it did once again in 1941.

Cluster Analysis of Finnish Cell Phone Data.


CA3 For Darren Redmond

By Colm Dougan

Student ID: 10205174

Introduction:

We were given an assignment to perform two methods of clustering analysis in the R language:

1:            K-Means Clustering Analysis

2:            K-Nearest Neighbour Analysis

K-Means Clustering in R

For the first part, K-Means clustering, I obtained a dataset from the website: http://cs.joensuv.fi/sipu/datasets

The data consists of 6014 observations of cell-phone calls in Finland.

Only the latitude and longitude of each of the 6014 phone calls were recorded, in a text file, over a period of time. It is important to stress that no identifying features of the phone calls were kept; only the rough geographical location of each call was recorded.

The dataset was downloaded from the website, imported into Excel 2010 and then cleaned. The data was then saved as a CSV file and imported into R. For comparison, the data was also imported into Tableau 9.1 and RapidMiner.

The data was visualised in Tableau. Most of the calls were found to occur in southern Finland, around the lakes area. See the screenshot below:

Figure 1: Importation of the dataset into Tableau 9.1. Some clustering is observed in the middle of the map of Finland.

Next the data was imported into RapidMiner. A model was built and run. Four clusters were found; see the next two figures.

Figure 2: The RapidMiner model.

Note the pink box in the middle, which was used to convert the data into a form RapidMiner could accept. The model itself is the green box on the right, and the grey box on the left imports the data into RapidMiner.

Figure 3: The RapidMiner results.

Note that RapidMiner found four clusters in the data; this is in agreement with R, as we shall see later.

Figure 4: The RapidMiner cluster model. Note the finding of four clusters.

K-Means Clustering Analysis in R

The Cell phone dataset was imported into R and given the name maplocationsfullset.

It was then analysed using the “kmeans” command, as follows:

> kmeans_model <- kmeans(x= maplocationsfullset, centers=4)

>

> kmeans_model

K-means clustering with 4 clusters of sizes 275, 193, 745, 4800

 

Cluster means:

X62.59809 X29.74448

1 62.25052 28.24391

2 63.20879 29.33286

3 63.26212 30.08932

4 62.59993 29.76301

 

Clustering vector:

[1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 4 4 4 4 4

[29] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[57] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3

[85] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[113] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[141] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[169] 4 4 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[197] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[225] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[253] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[281] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4

[309] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[337] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[365] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[393] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[421] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[449] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[477] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[505] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[533] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[561] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[589] 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4

[617] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[645] 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[673] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[701] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[729] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[757] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[785] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[813] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 4

[841] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1

[869] 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[897] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[925] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[953] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[981] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1009] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1037] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1065] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1093] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1121] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 2 4 4 4

[1149] 3 3 3 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1177] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 1 1 1 4 4 4 4 4 3 3

[1205] 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1233] 3 3 4 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1261] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1289] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1317] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[1345] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3

[1373] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1401] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1429] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1457] 3 3 3 3 3 3 3 4 3 4 4 4 4 4 4 3 3 3 4 4 3 3 3 3 3 3 3 3

[1485] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 3 3 3 4 4 3 3 3 4 4 4

[1513] 4 3 3 3 3 3 3 3 3 2 2 3 3 3 4 4 4 4 3 3 3 4 4 4 3 3 3 3

[1541] 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[1569] 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 2 2 3 4 3 3 3 3 3 3 3

[1597] 3 4 4 4 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3

[1625] 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 4 4 4 2 3 3 4 4 2 3

[1653] 3 3 4 4 3 3 3 3 3 3 3 4 4 4 4 3 3 3 3 3 3 3 4 3 3 3 3 3

[1681] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 3 4 3 3 3 3 3

[1709] 3 4 4 4 4 4 4 4 4 4 3 4 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 3

[1737] 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 1 1 1 1

[1765] 1 1 1 3 3 3 3 4 4 4 4 4 3 3 3 3 3 3 3 3 3 4 3 3 3 4 4 4

[1793] 3 3 3 3 3 3 3 4 4 4 4 4 3 3 3 3 3 3 4 4 4 4 4 3 3 3 4 4

[1821] 4 4 3 3 3 4 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 4 4 4 3 3

[1849] 3 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 4 4 3 3 3 4 4 4 4 4 4 4

[1877] 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 1 4

[1905] 3 4 4 4 4 4 4 3 4 3 3 1 1 3 3 3 3 3 3 3 3 3 4 4 4 4 4 3

[1933] 3 4 4 4 4 4 4 4 4 3 3 3 3 3 3 2 3 2 3 3 4 4 4 3 4 4 4 4

[1961] 4 4 4 4 4 3 3 3 3 3 3 3 3 4 4 4 2 4 4 4 2 2 2 2 4 4 4 3

[1989] 3 3 4 3 4 3 4 4 3 3 1 1 1 3 3 4 4 3 3 3 3 3 3 3 3 3 4 4

[2017] 3 3 3 4 3 3 3 3 2 2 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[2045] 3 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 3 3 3 3 4 4 4 3 3

[2073] 3 3 4 4 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 4 4 4 3 4 4 4 3 4

[2101] 3 3 3 3 3 4 4 4 4 4 4 4 4 3 3 4 4 4 3 4 4 3 3 4 4 4 4 4

[2129] 3 3 3 3 3 3 3 3 3 3 3 3 2 1 4 4 4 4 3 3 4 3 3 3 4 4 4 4

[2157] 4 3 3 3 3 3 3 3 3 4 3 3 3 3 4 4 3 3 3 3 4 1 1 3 4 3 3 3

[2185] 3 3 3 3 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 4

[2213] 4 2 3 3 3 3 3 4 4 4 4 4 4 3 3 3 3 3 4 4 3 3 3 3 3 3 3 3

[2241] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4

[2269] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2297] 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4

[2325] 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2353] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2381] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 4 4 4 4 4 4 3 3 2 2

[2409] 2 2 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4

[2437] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2465] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2493] 4 4 4 4 4 4 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2521] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2549] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2577] 4 4 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2605] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2633] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2661] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2689] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2717] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2745] 4 4 4 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1

[2773] 1 1 1 4 4 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2801] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2829] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2857] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2885] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2913] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2941] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2969] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[2997] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3025] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3053] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3081] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3109] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3137] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3165] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3193] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3221] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3249] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3277] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3305] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3333] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3361] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3389] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[3417] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[… clustering vector truncated: the remaining lines of the printout assign each observation to one of the four clusters …]

[5993] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

 

Within cluster sum of squares by cluster:

[1] 213.41023 39.16224 39.38587 69.13552

(between_SS / total_SS = 75.7 %)

 

Available components:

 

[1] "cluster"      "centers"      "totss"        "withinss"

[5] "tot.withinss" "betweenss"    "size"         "iter"

[9] "ifault"

 

As can be seen from the output, four clusters were found, and the clustering explains 75.7 % of the total variance (between_SS / total_SS).
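For context, output of this shape comes from R's kmeans function. A minimal sketch of such a run is below (the object name "mydata" is a placeholder, not the name used in the analysis):

km <- kmeans(scale(mydata), centers = 4)  # 'mydata' is a placeholder for the data used above

km$size       # number of points in each cluster

km$withinss   # within-cluster sums of squares, as printed above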

K-Nearest Neighbour in R

The famous Iris dataset was used in the analysis. It is already built into R.

The following R packages were used: "class" and "gmodels".

The dataset was normalised and divided up into a Training Set and a Test Set with the commands:

> ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))

 

> iris.training <- iris[ind==1, 1:4]

 

> iris.test <- iris[ind==2, 1:4]
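Note: the normalisation step itself is not shown above. A min-max normalisation along the following lines is one common approach (the helper name "normalize" is illustrative, not taken from the original commands):

normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # rescale each column to [0, 1]

iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

The training and test sets above could then be taken from iris_norm instead of iris.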

 

Training Labels and test labels were assigned with the following commands:

 

> iris.trainLabels <- iris[ind==1, 5]

> iris.testLabels <- iris[ind==2, 5]

The Model was run with the following command:

iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)

The dataset has three species, and k, the number of nearest neighbours consulted for each prediction, is set to 3 in the above command. The model's predictions for each test observation are as follows:

iris_pred

 [1] setosa     setosa     setosa     setosa     setosa
 [6] setosa     setosa     setosa     setosa     setosa
[11] setosa     setosa     setosa     versicolor versicolor
[16] versicolor versicolor versicolor versicolor versicolor
[21] versicolor versicolor versicolor versicolor versicolor
[26] versicolor versicolor versicolor virginica  virginica
[31] virginica  virginica  virginica  virginica  virginica
[36] virginica  virginica  virginica  virginica  virginica
[41] virginica
Levels: setosa versicolor virginica

We can evaluate the correctness of the model's predictions using the library "gmodels". The command we use is as follows:

CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)

The output we get shows the model is accurate: every one of the 41 test observations is classified correctly.

iris.testLabels |     setosa | versicolor |  virginica | Row Total |
----------------|------------|------------|------------|-----------|
         setosa |         13 |          0 |          0 |        13 |
                |      1.000 |      0.000 |      0.000 |     0.317 |
                |      1.000 |      0.000 |      0.000 |           |
                |      0.317 |      0.000 |      0.000 |           |
----------------|------------|------------|------------|-----------|
     versicolor |          0 |         15 |          0 |        15 |
                |      0.000 |      1.000 |      0.000 |     0.366 |
                |      0.000 |      1.000 |      0.000 |           |
                |      0.000 |      0.366 |      0.000 |           |
----------------|------------|------------|------------|-----------|
      virginica |          0 |          0 |         13 |        13 |
                |      0.000 |      0.000 |      1.000 |     0.317 |
                |      0.000 |      0.000 |      1.000 |           |
                |      0.000 |      0.000 |      0.317 |           |
----------------|------------|------------|------------|-----------|
   Column Total |         13 |         15 |         13 |        41 |
                |      0.317 |      0.366 |      0.317 |           |
----------------|------------|------------|------------|-----------|
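The same conclusion can be confirmed numerically in one line (a quick sketch, not part of the original write-up):

mean(iris_pred == iris.testLabels)   # returns 1 here: all 41 test rows classified correctly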

Conclusion:

R is a very powerful tool for detecting the presence of clusters in a Dataset.

Caution:

As a note of caution, the results of the analysis should be cross-checked with tools such as Excel, Tableau and RapidMiner to have confidence in the Results.

Cleaning and Scrubbing of Air-BnB Kaggle Competition Data

CA2 Airbnb Data Analysis

 

Lecturer:   Darren Redmond

Class:     Advanced Data Analytics

Student:   Colm Dougan

Student Id 10205174

Abstract

 

This Report is about the analysis of the Airbnb Dataset. The dataset comes from a Kaggle Competition supported by Airbnb. I first did some comprehensive analysis on the Dataset, explored most features and analysed the features I thought were most useful using the R-Studio Software Package.

Note: The R Package “ggplot2” was used in the production of some of the Graphics.

Introduction

 

In the last seven years Airbnb has become one of the most aggressive and fastest-growing companies in the marketplace.

More than 25 million people in 190-plus countries have used its service, helping it reach a huge thirteen billion dollar valuation.[1]

One of the biggest reasons for its rapid growth is its strong data science technology. By collecting and analyzing the huge treasure trove of Data, Airbnb is able to use predictive analytics to match customers and renters together.

At a Wednesday meeting, Mike Curtis, Airbnb’s VP of engineering, outlined some of the things his team does in data analysis. Here’s what he highlighted:

  • A/B Website Testing: A/B testing is a common method used by marketing to fine-tune a website or service. It tests many configurations or designs of a product or website to figure out how people respond to certain products or promotions. At Airbnb, users are exposed to different ranking or recommendation algorithms, and their behavior is tied back to the actual reviews or star rankings they leave, to test the effectiveness of a change to a Website.
  • Natural Language Processing: Airbnb has deployed natural language processing technology to review the text on the message threads or review boards on its websites and to lift sentiment out of it as well. Customers tend to over-enthuse about the host’s place of stay so as not to offend the host; Airbnb takes this into account when processing user sentiment.

Kaggle Competition

 

On Wednesday 25 November 2015 a new competition appeared on the Website "Kaggle". The competition involved analyzing a set of records of customer data. The Question posed by Airbnb was this: Where will new customers book their first travel experience? In this assignment I propose to try to answer this Question and hopefully mine interesting information from the Data Set using the "R" Statistical Package.

Preliminary Dataset Examination:

 

1:            The data file “ train_users_2.csv “ was downloaded from the Kaggle competition website and loaded into R using the Command:

Airbnb <- read.csv("train_users_2.csv", header = FALSE)

2:            The file was viewed with the command:

View(Airbnb)

Comment:          213,452 Observations of 16 Variables were found in the Dataset.

3:            Column Names were then assigned to the Dataset using the Command:

colnames(Airbnb) <- c('id', 'date_account_created', 'timestamp_first_active', 'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 'country_destination')
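Note: reading with header = FALSE pulls the file's own header row in as a data row, which is why the summary later shows stray one-off levels such as "gender : 1" and "signup_method : 1". An alternative (assumed) approach is to read the header directly, which makes the manual colnames step unnecessary:

Airbnb <- read.csv("train_users_2.csv", header = TRUE)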

4:            This assignment was viewed and checked using the Command:

head(Airbnb)

5:            The Structure of the Data-Set was checked using the Command:

str(Airbnb)

6:            A Summary of the Data-set was then produced using the Command:

summary(Airbnb)

Comment:          The Output was as follows:

—————————————————————————————————————————-

id         date_account_created timestamp_first_active

00023iyk9l:     1   13/05/2014:   674   2.01406E+13: 15741

0005ytdols:     1   24/06/2014:   670   2.01405E+13: 14888

000guo2307:     1   25/06/2014:   636   2.01404E+13: 12685

000wc9mlv3:     1   20/05/2014:   632   2.01403E+13: 12051

0012yo8hu2:     1   14/05/2014:   622   2.01401E+13: 11104

001357912w:     1   03/06/2014:   602   2.01402E+13: 9961

(Other)   :213446   (Other)   :209616   (Other)   :137022

date_first_booking     gender          age

:114452   FEMALE :63041         : 80888

NULL     : 10091   gender :   1   NULL   : 7102

22/05/2014:   248   MALE   :54440   30     : 6124

11/06/2014:   231   OTHER : 282   31     : 6016

24/06/2014:   226   UNKNOWN:95688   29     : 5963

21/05/2014:   225                   28     : 5939

(Other)   : 87979                   (Other):101420

signup_method     signup_flow       language

basic       :152897   0     :164739   en     :206314

facebook     : 60008   25     : 14659   zh     : 1632

google       :   546   12     : 9329   fr     : 1172

signup_method:     1   3     : 8822   es     :   915

2     : 6881   ko     :   747

24     : 4328   de     :   732

(Other): 4694   (Other): 1940

affiliate_channel   affiliate_provider first_affiliate_tracked

direct       :137727   direct   :137426   untracked   :109232

sem-brand   : 26045   google   : 51693   linked       : 46287

sem-non-brand: 18844   other     : 12549   omg         : 43982

other       : 8961   craigslist: 3471   tracked-other: 6156

seo         : 8663   bing     : 2328   NULL        : 4301

api         : 8167   facebook : 2273               : 1764

(Other)     : 5045   (Other)   : 3712   (Other)     : 1730

signup_app           first_device_type       first_browser

Android   : 5454   Mac Desktop   :89600   Chrome       :63845

iOS       : 19019   Windows Desktop:72716   Safari       :45169

Moweb     : 6261   iPhone         :20759   Firefox     :33655

signup_app:     1   iPad           :14339   UNKNOWN     :27266

Web       :182717   OtherorUnknown :10667   IE           :21068

Android Phone : 2803   Mobile Safari:19274

(Other)       : 2568   (Other)     : 3175

country_destination

NDF   :124543

US     : 62376

other : 10094

FR     : 5023

IT     : 2835

GB     : 2324

(Other): 6257

—————————————————————————————————————————–

From the above it was noticed that the Dataset contained a lot of Superfluous data that was not really relevant to the Analysis. So it was decided to clean the Dataset up before the Analysis was done.

Cleaning and Scrubbing of the Data

 

From the below Summary, it was found the Age Range varied between a Min of 1.00 and a Max of 2014.00, with the Median being 34.00 Years.

summary(Airbnb$age)

   Min. 1st Qu. Median   Mean 3rd Qu.   Max.   NA’s

   1.00   28.00   34.00   49.67   43.00 2014.00   87990

 

This is incorrect, so a working copy of the dataset was made.

 

AirData = Airbnb

 

Any age value cells below 18 years or above 70 years were filled with NA, R's marker for a missing ("Not Available") value.

 

AirData$age[AirData$age < 18]<-NA

AirData$age[AirData$age > 70]<-NA
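The same replacement can also be written in a single step (an equivalent sketch):

AirData$age[AirData$age < 18 | AirData$age > 70] <- NA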

 

The New Age Distribution is as Follows:

 

summary(AirData$age)

   Min. 1st Qu. Median   Mean 3rd Qu.   Max.   NA’s

18.00   28.00   33.00   36.07   42.00   70.00   91940

NA’s were removed from the Dataset with the Following Command:

 

AirData = na.omit(AirData)

 

The Cells Marked "-unknown-" were also removed from the Gender Column in the Dataset with the Following Commands:

Before:

summary(AirData$gender)

-unknown-   FEMALE     MALE     OTHER

   15928     56038     49324       221

Command:

> AirData <- subset(AirData, gender != "-unknown-")

After:

summary(AirData$gender)

-unknown-   FEMALE     MALE     OTHER

       0     56038     49324       221

 

Next the empty cells were removed from the Date of First Booking Column.

AirData <- subset(AirData, date_first_booking != "")

Then "other" was removed from the Gender Column.

AirData <- subset(AirData, gender != "other")

Next, Cells with "-unknown-" were removed from the first_browser Column.

AirData <- subset(AirData, first_browser != "-unknown-")

Then Cells marked "other" were removed from the country_destination Column.

AirData <- subset(AirData, country_destination != "other")
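For reference, the five cleaning filters above can be combined into a single subset call (an equivalent sketch):

AirData <- subset(AirData,
                  gender != "-unknown-" &
                  date_first_booking != "" &
                  gender != "other" &
                  first_browser != "-unknown-" &
                  country_destination != "other")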

This reduced the Dataset down from 213,452 Observations of 16 Variables to 45,437 Observations of 16 Variables.

First Analysis, The Marketing Analysis:

 

A Summary of the Cleaned Data Set was produced with the Command

>summary(AirData)

id       date_account_created timestamp_first_active

000wc9mlv3:   1   2014-05-20: 129     Min.   :2.009e+13

001357912w:   1   2014-06-24: 129     1st Qu.:2.012e+13

001xf4efvm:   1   2014-05-13: 123     Median :2.013e+13

001y3jr7xc:   1   2014-06-16: 118     Mean   :2.013e+13

002qnbzfs5:   1   2014-06-11: 117     3rd Qu.:2.014e+13

0043i3w366:   1   2014-05-28: 116     Max.   :2.014e+13

(Other)   :45431   (Other)   :44705

date_first_booking       gender           age         signup_method

2014-06-05: 120   -unknown-:   0   Min.   :18.00   basic   :27030

2014-05-14: 117   FEMALE   :24692   1st Qu.:28.00   facebook:18403

2014-06-11: 116   MALE     :20616   Median :33.00   google :   4

2014-04-29: 113   OTHER   : 129   Mean   :35.79

2014-05-22: 113                     3rd Qu.:41.00

2014-06-10: 112                     Max.   :70.00

(Other)   :44746

signup_flow       language         affiliate_channel affiliate_provider

Min.   : 0.000   en     :44221   direct       :30200   direct   :30146

1st Qu.: 0.000   zh     : 261   sem-brand   : 6021   google   :10931

Median : 0.000   fr     : 240   sem-non-brand: 3499   other     : 2105

Mean   : 1.274   es     : 153   seo         : 2174   craigslist: 830

3rd Qu.: 0.000   de     : 145   other       : 2021   facebook : 492

Max.   :25.000   ko     : 101   api         : 1050   bing     : 397

(Other): 316   (Other)     : 472   (Other)   : 536

first_affiliate_tracked   signup_app         first_device_type

untracked   :23424     Android: 107   Mac Desktop   :24848

linked       :11405     iOS   : 1224   Windows Desktop:15842

omg         : 8881     Moweb : 888   iPad           : 2706

tracked-other: 1407     Web   :43218   iPhone         : 1296

product     : 270                    Desktop (Other): 300

marketing   :   45                     Android Phone : 215

(Other)     :   5                     (Other)       : 230

first_browser   country_destination

Chrome       :17543   US     :36010

Safari       :11203   FR     : 2829

Firefox     : 8602   IT     : 1523

Mobile Safari: 3872   GB     : 1347

IE           : 3676   ES     : 1318

Chrome Mobile: 208   CA     : 821

(Other)     : 333   (Other): 1589

Analysis from the above output.

 

Gender was unevenly split between Males and Females: 24,692 Females vs 20,616 Males.

Age was between 18 and 70 Years of Age, as expected for the adjusted Age Column.

It is proposed to simplify the website with a single Button to Press below the Question:

The Question being: "I confirm that I am over Eighteen Years of Age"

Sign-up Method

 

The Signup Method was:- "Basic" and "Facebook" at 27,030 and 18,403 respectively.

From the above, it is proposed a marketing campaign be targeted through Facebook.


Figure 1: Signup Method.

—————————————————————————————————————————–

Sign-up Flow

The Signup Flow was:- Customers used a mean of 1.274 Steps to Sign-up.

Figure 2: Signup Flow

From the above, it is proposed to streamline the Sign-Up Method and reduce the number of steps required to sign up.

—————————————————————————————————————————–

Language Used

 

The Language used by Customers was mainly English at 44,221 Users.

Language by Destination:

> t = table(AirData$language, AirData$country_destination)

> d = as.data.frame(t / rowSums(t))

> names(d) = c("language", "destination_country", "Freq")

> print(ggplot(d, aes(language, Freq, fill=destination_country)) + geom_bar(stat="identity"))


Figure 3: Language Used vs Destination

Also:

The Affiliate Channel used was:- Direct at 30,200 Customers.

The Affiliate Provider used was:- Direct at 30,146, Google at 10,931.

From this:- it is proposed that a Marketing Campaign be run on Google using Google Ads.

—————————————————————————————————————————–

The Sign-Up App.

 

The Sign-up App used was:- Web at 43,218, iOS at 1,224 and Android at 107.

Commands Used:

>t = table(AirData$signup_app, AirData$country_destination)

>print(signif(t / rowSums(t) * 100, digits=2))

 

AU     CA     DE     ES     FR     GB     IT   NDF     NL

Android 0.000 0.930 2.800 0.930 0.930 0.930 3.700 0.000 1.900

iOS     0.900 1.800 0.490 2.100 4.200 1.700 2.300 0.000 0.570

Moweb   0.450 2.100 0.680 1.800 3.400 1.700 1.500 0.000 0.790

Web     0.770 1.800 1.500 3.000 6.400 3.000 3.400 0.000 1.000

From this:- it is proposed that software development work should concentrate on Web Application Development.

Signup App by Country:

> d = as.data.frame(t / rowSums(t))

> names(d) = c("signup_app", "destination_country", "Freq")

> print(ggplot(d, aes(signup_app, Freq, fill=destination_country)) + geom_bar(stat="identity"))


Figure 4: Signup App. Used by Country.

—————————————————————————————————————————

First Device Type:

 

The First Device type used was:- Mac Desktop at 24,848, Windows Desktop at 15,842, iPhone at 1,296 and iPad at 2,706.

From this:- it is proposed that Web Application Development split its time between Windows Desktop, Apple Desktop and iOS Webpage Development.


Figure 5: First Device Type:

—————————————————————————————————————————–

First Browser Used:

 

First Browser used was:- Chrome at 17,543, Safari at 11,203, Firefox at 8,602.

From this:- it is proposed that Web Apps be tuned for Chrome, Safari and Firefox.


Figure 6: First Browser Used.

—————————————————————————————————————————-

Destination Country Chosen:

 

The destination chosen by AirBnB US Customers was:- USA 36,010, France 2,829, Italy 1,523 and GB 1,347.

The Percentage of users that go to each country is shown as follows:

> t = table(AirData$country_destination)

> print(signif(t / sum(t) * 100), digits = 2)

 

AU   CA   DE   ES   FR   GB   IT   NDF   NL other   PT   US

0.76 1.81 1.47 2.90 6.23 2.96 3.35 0.00 0.99 0.00 0.27 79.25


Figure 7: Country of Holiday Destination

From this:- It is concluded that US Customers holiday mostly in the US using Airbnb and a marketing campaign for the US Market should concentrate mainly on US Locations.

—————————————————————————————————————————–

ANOVA Analysis

 

Part 1: 2-Way ANOVA Analysis, no interaction assumed between Age and Gender

The Commands Used:

> anova2 <- aov(as.numeric(country_destination) ~ age + gender , data = AirData)

>anova2

Output:

Call:

   aov(formula = as.numeric(country_destination) ~ age + gender,

   data = AirData)

Terms:

                     age   gender Residuals

Sum of Squares     301.2     85.5 424329.9

Deg. of Freedom       1       2     55134

 

Residual standard error: 2.774228

Estimated effects may be unbalanced

 

> summary(anova2)

               Df Sum Sq Mean Sq F value   Pr(>F)  

age             1   301 301.23 39.139 3.98e-10 ***

gender         2     85   42.75   5.554 0.00387 **

Residuals   55134 424330   7.70                    

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Part 2: 2-Way ANOVA Analysis, interaction assumed between Age and Gender

Commands:

anova3 = aov(as.numeric(country_destination) ~ age * gender,

+     data = AirData)

 

> anova3

Call:

   aov(formula = as.numeric(country_destination) ~ age * gender,

   data = AirData)

 

Terms:

                     age   gender age:gender Residuals

Sum of Squares     301.2     85.5     123.0 424206.9

Deg. of Freedom       1       2         2     55132

 

Residual standard error: 2.773876

Estimated effects may be unbalanced

 

> summary(anova3)

               Df Sum Sq Mean Sq F value   Pr(>F)  

age             1   301 301.23 39.149 3.96e-10 ***

gender         2     85   42.75   5.556 0.003868 **

age:gender     2   123   61.50   7.992 0.000338 ***

Residuals   55132 424207   7.69                    

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
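A possible follow-up to the significant age:gender interaction, not performed in the original analysis, would be a post-hoc Tukey comparison on the gender factor (a sketch):

TukeyHSD(anova3, which = "gender")   # pairwise differences between the gender levels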

Conclusion

 

The R-Studio Package is a very powerful package for analysing data.

From the above analysis, the most likely destination for Airbnb's first-time US customers is the US itself.

References:

[1] Tim Bradshaw, Financial Times, Thursday 23 October 2014

Data Quality, A Quick Essay.

Data Quality

by Colm Dougan

Student Id: 10205174

Introduction

In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. Mail sent out is incorrectly addressed[1]. One reason for this is the address information held in databases becomes out of date very quickly – more than 45 million Americans change their address every year.[2]

In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing  and business intelligence to customer relationship management and  supply chain management. One industry study estimated the total cost to the U.S. economy of data quality problems at over U.S. $600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can originate from different sources – through data entry, or data migration or conversion projects.[3]

Resolving data quality problems is often the biggest effort in a data mining project.

Types of Data Errors

  • Changes in data layout / data types.
    • Integer becomes a String, String becomes an Integer, etc.
  • Changes in scale / format
    • Dollars vs. Euros, Binary vs. Hex, 20/07/2016 vs. 07/20/2016
  • Temporary reversion to default setup.
    • Failure of a Processing Step, Missing a Processing Step
  • Missing and default values problems.
    • Application programs that do not handle NULL or Missing values well.
  • Gaps in time series
    • Especially when records represent incremental changes.

Definition of Data Quality

  • Accuracy
    • The data was recorded correctly.
  • Completeness
    • All relevant data was recorded.
  • Uniqueness
    • All Entities are recorded once.
  • Timeliness
    • The data is kept up to date.
  • Consistency
    • The data agrees with itself.

 

The Data Quality Environment

• Data and information are not static; they flow in a data collection and usage process:

1:      Data Types

 

  • There are many types of data, which have different uses and typical quality problems
    • Government data
    • High dimensional data
    • Descriptive data
    • Longitudinal data
    • Streaming data
    • Web (scraped) data
    • Numeric vs. categorical vs. text data

 

2:      Data gathering

 

  • How does the data enter the system?
  • Sources of problems:
    • Manual entry (Typing Errors)
    • No uniform standards for content and formats
    • Parallel data entry (duplicates in data)
    • Approximations, Rule of Thumb.
    • Measurement errors. SW/HW constraints
  • Potential Solutions:
    • Preemptive:
      • Process architecture (build in integrity checks; see the sketch after this list)
      • Process management (reward accurate data entry, data sharing, data stewards)
    • Retrospective:
      • Cleaning focus  (duplicate removal, merge/purge, name & address matching, field value standardization)
      • Diagnostic focus (automated detection of problems).
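As a tiny illustration of the "build in integrity checks" idea, a preemptive check in R might look like this (the table and column names are hypothetical):

stopifnot(all(!is.na(orders$customer_id)),   # no missing customer ids
          all(orders$quantity > 0))          # no zero or negative quantities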

 

3:      Data delivery

  • Destroying or damaging information by inappropriate pre-processing
    • Inappropriate aggregation of Fields
    • Nulls converted to inappropriate default values
  • Loss of data:
    • Buffer overflows (Video)
    • Transmission problems (Wi-Fi, Ethernet)
    • No Checks and Balances

Potential Solutions

  • Build reliable transmission protocols
    • Use a proxy server
  • Verification
    • Checksums, verification parser (see the sketch after this list).
    • Do the uploaded files fit an expected form?
  • Relationships
    • Are there dependencies between data streams and processing steps
  • Interface agreements

Data quality commitments from the data stream supplier.
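As a concrete illustration of the checksum idea, a delivered file can be verified against the supplier's published MD5 sum (a sketch; the file name and expected value are hypothetical):

received <- unname(tools::md5sum("supplier_feed.csv"))   # hypothetical file name

expected <- "9e107d9d372bb6826bd81d3542a419d6"           # hypothetical value from the interface agreement

if (!identical(received, expected)) stop("Checksum mismatch: file damaged in transit")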

 

4:      Data storage

  • You get a data set. What do you do with it?
  • Problems in physical storage
    • Can be an issue, but terabytes and memory are cheap.
  • Problems in logical storage (ER → data relations)
    • Poor metadata.
      • Data feeds are often derived from application programs or legacy data sources.
    • Inappropriate data models.
      • Missing timestamps, incorrect normalization, etc.
    • Ad-hoc, On the Fly
      • Structure the data to fit the GUI, Application.
    • Hardware / software constraints.
      • Data transmission via Excel Spreadsheets, SQL
  • Potential Solutions
    • Metadata
      • Document and publish data specifications.
    • Planning
      • Assume that everything bad will happen. (Murphy’s Law)
      • Can be very difficult to anticipate problems.
    • Data exploration
      • Use data browsing and data mining tools to examine the data.
        • Does it meet the specifications you assumed?
        • Has something changed in the Data ?
        • Document Changes
        • Version Control

5:        Data integration

  • Combine data sets ( company acquisitions, integration across departments).
  • Common source of problems
    • Heterogeneous data : no common key, different field formats, Excel – SQL
      • Approximate matching (sometimes only option)
    • Different definitions of Data
      • What is a customer, an account, an individual, a family …
    • Time synchronization
      • Do the data relate to the same time periods? Are the time windows compatible?
    • Legacy data
      • IBM, VAX, spreadsheets, ad-hoc structures
    • Sociological factors
      • Reluctance to share – loss of power, loss of control, resistance to change.
  • Potential Solutions
    • Commercial Tools
      • Significant body of research in data integration
      • Many tools for address matching , schema mapping are available.
    • Data browsing and exploration
      • Many hidden problems and meanings: must extract metadata.
      • View before and after results : did the integration go the way you thought?
      • Brown Bag Meetings
      • Reports and Updates
      • Keep Everyone in the Loop

6:      Data retrieval

 

  • Exported data sets are often a snapshot of the actual data.
  • Problems occur because:
    • Source data not properly understood.
    • Need for derived data not properly understood.
    • Just plain mistakes.
      • Inner join vs. outer join in SQL (illustrated in the sketch after this list)
      • Not Understanding NULL values
    • Computational constraints
      • e.g., too expensive to give a full analysis, we’ll supply a snapshot.
    • Incompatibility
      • Access vs. SQL , Apple vs.  PC
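The inner vs. outer join pitfall mentioned above can be reproduced in R itself (a sketch with hypothetical tables):

customers <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cal"))

orders    <- data.frame(id = c(1, 1, 3), amount = c(10, 20, 5))

merge(customers, orders)                 # inner join: Bob (no orders) silently disappears

merge(customers, orders, all.x = TRUE)   # left outer join: Bob kept, with amount = NA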

7:        Data mining/analysis

 

  • Problems in the analysis.
    • Scale and performance
    • Confidence bounds?
    • Black boxes and dart boards
    • Attachment to models
    • Insufficient domain expertise
    • Blinkered Vision
  • Potential Solutions
  • Data exploration
    • Determine which models and techniques are appropriate, find data bugs, develop domain expertise.
  • Continuous analysis
    • Are the results stable? How do they change?
  • Accountability
    • Make the analysis part of the feedback loop.

8:      Data Publishing

 

  • Make the contents of a database available in a readily accessible and digestible way
    • Web interface (universal client).
    • Data Squashing:
    • Publish aggregates, cubes, samples, parametric representations.
    • Publish the metadata.
  • Close feedback loops by getting a lot of people to look at the data.
  • Surprisingly difficult sometimes.
    • Organizational boundaries, loss of control interpreted as loss of power, desire to hide problems.
  • Remove Sensitive Data before Publishing

 

9: Metadata

  • Data about the data
  • Data types, domains, and constraints.
  • Interpretation of values
    • Scale, units of measurement, meaning of labels
  • Interpretation of tables
    • Frequency of table refresh, associations, view definitions
  • Most work done for scientific databases
    • Metadata can include programs for interpreting the data set.

 

 

Conclusion

Data Quality is becoming ever more important in today’s Business Climate and Poor Data Quality can cost a company a great deal of Money. The role of the Data Steward and The Chief Data Officer is becoming increasingly important.

 

—————————————————————————————————————————-

References

[1]  http://www.directionsmag.com/article.php?article_id=509

[2]  http://ribbs.usps.gov/move_update/documents/tech_guides/PUB363.pdf

[3] http://www.information-management.com/issues/20060801/1060128-1.html

 

—————————————————————————————————————————–

Glossary

Glossary of data quality terms published by IAIDQ

 

Be careful of Anscombe’s Quartet!

Introduction:

CA1 for Darren Redmond

by Colm Dougan

Student Id: 10205174

In the World of Data Analytics it is a mistake to assume that statistical analysis of a Dataset can be relied upon to predict the shape of the Dataset. To be completely accurate you must graph the dataset and do the Statistical Analysis. The Graphs below show a small Dataset of X and Y values.

Note: The Statistical Package R was used throughout this Blog for the Calculations and Graphs.

Statistical Analysis:

Let’s say you start your Analysis by drawing out the regression line. You will get a line equal to Y = 3.00 + 0.500X. The Correlation between X and Y (the Pearson Correlation Coefficient) is 0.816. This is fairly close to the value one (1), so you might think the data is tightly bunched together around the regression line. Then you might find the mean for x and y, which is 9 and 7.5 respectively. Next you might get the Variance of the X values, which is 11, so the standard deviation (the typical distance of the X Data Points from the mean) is 3.3 (the square root of 11). After that you get the Variance of the Y Values, which is 4.12, giving a standard deviation of 2.03 (the square root of 4.12). You might put all these figures together in a Table, which may make you pretty confident of the shape of the Graph that you are going to Sketch Out.

Unfortunately, you would have sketched the wrong Graph for the Data Points!

Analysis:

The Errors in these thought processes are demonstrated by the Four Graphs reproduced here, called Anscombe’s Quartet, which were created by F. J. Anscombe for his classic 1973 paper, Graphs in Statistical Analysis. All four graphs have identical (to two decimal places) statistical coefficients. However, as these graphs demonstrate (and here is the big takeaway), summary statistics don’t tell us everything about a Dataset.

To really understand the Dataset you must obtain the summary statistics and you must Graph the relationships in the dataset!

————————————————————————————————

The Graphs of the Datasets: Datasets 1 to 4

Note: In each Dataset the Red line is the Regression Line.

[Figures: scatter plots of Dataset 1, Dataset 2, Dataset 3 and Dataset 4, each with its regression line]
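A sketch of how a panel like Dataset 1 can be drawn in R, using the vectors listed in the next section:

plot(x1, y1, main = "Dataset 1", pch = 19)   # scatter plot of the points

abline(lm(y1 ~ x1), col = "red")             # fitted regression line in red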

Now let’s look at the Dataset’s:

  • Dataset I: consists of a set of points that appear to follow a somewhat linear relationship which includes some variance.
  • Dataset II: seems to fit a quadratic curve, and doesn’t follow a linear relationship.
  • Dataset III: looks like a good linear relationship between x and y, except for one large outlier.
  • Dataset IV: looks like x remains constant, except for one outlier.

——————————————————————————————————-

The Values of Each Dataset: Datasets 1 to 4

Dataset1

x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

 

Dataset2

x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)

 

Dataset3

x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)

 

Dataset4

x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)

y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)

 

The Correlations of Datasets 1 to 4 (Pearson’s Correlation Coefficient):

Correlation of Dataset1
> cor(x1,y1)

[1] 0.8164205

Correlation of Dataset2

>cor(x2,y2)

[1] 0.8162365

 

Correlation of Dataset3
> cor(x3,y3)

[1] 0.8162867

Correlation of Dataset4

> cor(x4,y4)

[1] 0.8165214

Regression Line for each Data Set.

Calculation of the regression line is straightforward. The best-fit straight line has the form y = bx + a, where b is the slope and a is the y-intercept of the line. The slope and intercept are given by:

1.1       a = ȳ − b·x̄

1.2       b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

where x̄ and ȳ are the means of the x and y values.

The regression Line for each of the Four Data sets is: y = 3.000 + 0.500x

 

————————————————————————————————

In each Dataset 1 to 4 the following Statistical Values hold true:

Mean of X-axis data points: x1, x2, x3, x4 = 9

Variance of X-Axis data points: x1, x2, x3, x4 = 11

Mean of Y-axis data points:    y1, y2, y3, y4 = 7.50

Variance of Y-Axis data points: y1, y2, y3, y4 = 4.12
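These shared values can be verified directly in R with the vectors above (a quick sketch):

sapply(list(x1, x2, x3, x4), mean)   # all 9

sapply(list(x1, x2, x3, x4), var)    # all 11

sapply(list(y1, y2, y3, y4), mean)   # all approximately 7.50

sapply(list(y1, y2, y3, y4), var)    # all approximately 4.12

coef(lm(y1 ~ x1))                    # intercept ~3.00, slope ~0.500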

And yet, each Graph is different when we plot the Datasets. OMG!

—————————————————————————————————–

Conclusion:

From the above it should be clear: don’t assume that your summary statistics reflect your Graph correctly.

A Motto to live by: Graph your Dataset to be sure!