An Introduction to Data Analytics

Data Analytics Essay

Colm Dougan

Introduction:

Today Data Analytics is a fast-growing industry. When we look at the news on TV, or read a newspaper, a magazine or an article on the Web, we see data being presented in more and more complex ways. A casual look at the graphics for the Nepal earthquake on CNN or Sky News, or even at the weather report, will show just how true this is. Data is everywhere and is increasing in volume year by year. Rapid advances in technology, starting with the Space Race of the 1960s, drove the evolution of semiconductors, whose exponential progression has followed Moore’s Law. Matching advances in storage technology mean that there are now vast amounts of stored data to be analysed. Modern smartphones are more powerful than the navigation computer built into the Lunar Module that landed on the Moon in 1969. We have gone from the four-bit Intel 4004 CPU, released in 1971 on a ten micrometre process, to modern processors with over a billion transistors each on a fourteen nanometre process, which are being built in Leixlip, Co. Kildare using the P1272 process. We have advanced from the five to ten megabyte hard disk in the IBM XT, introduced in 1983, to the huge server farm that Apple is going to build in Athenry, Co. Galway, which will store petabytes of data.

Examples of Data Analytics

Retail:

In the Retail Sector a Data Revolution has taken place since the 1970s. Point of Sale (POS) systems have been introduced at the till. Bar code scanners now “beep” in our goods. RFID tags are attached to our clothes and plastic cards pay for our goods. Our smartphones carry NFC chips that will soon allow us to pay for our goods by just holding the phone to the reader on the checkout desk and typing in a PIN.

Google Glass:

Google Glass will allow us to look at an item in a store and receive information on it, such as its style, colours and make. Similarly, we will be able to scan a restaurant menu and receive calorie information about the ingredients. Or, if we are walking down a street, we will be able to ask our smartphones for recommendations for, and directions to, a good restaurant.

Automotive:

Large changes are occurring here too. The systems technology of the Apollo rockets has now been transferred to our cars. Our engines are run by computers and sensors. We can adjust the profile of our engines for urban and rural settings at the press of a button. Our cars now talk to us and give us directions by receiving signals from satellites. The car keeps a log of its performance which can be downloaded to a laptop. Soon jet fighter technology such as Head-Up Displays (HUDs) and Forward Looking Infra-Red (FLIR) displays will be built into modern cars, allowing us to see clearly when driving at night.

Medical:

Revolutionary advances in the Medical Field have also occurred since the nineteen sixties. Magnetic Resonance Imaging (MRI) has allowed us to see inside the body in unprecedented detail. This requires enormous processing power, which has only recently become available. Medical records are now being computerised and stored on a server, to be retrieved by the doctor while sitting at his or her desk. It will soon be possible to Skype a doctor from home if one is not feeling well. Related to this will be the possibility of buying remote monitoring equipment that you can attach to your body and hook up to an Ethernet router. The doctor will then be able to access this equipment remotely and run a range of diagnostic tests on the patient while talking to them on Skype. In the future it may be possible to analyse ECG traces and MRI images with a computer and produce a 3D image of the beating heart. Chaos Theory may be applied to tell if the heart is beating arrhythmically. It may be possible to investigate the electrical patterns sweeping over the heart and deduce which area of the heart needs medical attention.

Genetics:

Processing genetic information requires a vast amount of storage and computing power. DNA Sequencing has come a long way since the nineteen seventies. The four letters of DNA (G, A, T, C) form the long sequences that encode the proteins and enzymes of cells and viruses. The Human Genome Project set out to map all of the genes of human DNA and was declared complete in 2003. It is the world’s largest collaborative biological project. In the future it may be possible to tell a person’s ethnicity, eye colour, age, hair colour and skin colour by analysing the DNA from a small sample of human material left behind at a crime scene.

Army:

An Information Revolution is occurring in the modern Army. The modern battlefield commander, sitting in a large green tent, has a vast wealth of information coming in to him or her at HQ. Looking like a modern call centre, with a vast array of computer screens sitting on desks, the modern HQ is ground zero for real-time battlefield information. This information can be streaming in from individual tanks, jet fighters and Predator drones. Headquarters can know where a jet or tank is, where it is going, how much fuel it has left, how much ammunition it has left and what type of ammunition it is. Using digital maps on the computer screen, the commander can zoom in and communicate with individual tanks or field commanders, then arrange for a tank running low on fuel to withdraw and refuel while another tank takes its place. All this information is continuously updated, filtered and fed into the main server by the military personnel sitting at the desks.

It is even possible for individual soldiers to carry helmet-mounted cameras and headsets, so they can show HQ what they are seeing, talk to the commander and receive tactical advice. The soldiers can even update the computer screens at HQ by using a small handheld computer, entering information such as strongpoints, waypoints, civilian areas and ammunition dumps.

It is no longer necessary for headquarters to be near the battlefield, as satellite links allow HQ to stay in the home country. Airstrikes and battlefield surveillance can also be conducted from thousands of miles away, using satellite links and Predator drones. In the future robot torpedo drones may be able to patrol the seas for months or years, waiting for orders from satellites and powered by the radioactive decay of thorium batteries.

CERN:

It was by data mining the vast amounts of data that came from the ATLAS detector at CERN over many months that the Higgs boson was discovered. No longer do we look at particle tracks on black and white photographs, as was done in the early days of particle physics. Instead, powerful computers look for the signatures of new particles among the thousands of particle tracks generated when the two proton beams of the Large Hadron Collider collide inside the ATLAS detector.

Techniques of Data Mining

Data Mining is generally divided into two major areas.

A: Predictive Mining

B: Descriptive Mining

Predictive Tasks: Here we try to predict the value of one particular variable from the other known variables in a model. An Example would be Scenario Manager in Excel.

Descriptive Tasks: Here we try to derive patterns that summarise the underlying relationships in the Data. An Example would be trying to understand the underlying mechanisms of Cancer.

These areas are further divided into four main methods of modelling the Data.

1: Cluster Analysis

2: Association Analysis

3: Anomaly Detection

4: Predictive Modelling

Models can be built for a given data set in each of these main areas and analysed.

LIFT Analysis can be used within Association Analysis to judge which of the association rules we have found are genuinely interesting, that is, which items occur together more often than chance would predict. However, we must be careful, as LIFT Analysis breaks down and gives false positives when the data set contains a lot of null transactions (records that contain none of the items being examined). A better method is to use Kulczynski Analysis, which is null-invariant.
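The sensitivity of Lift to empty records can be seen in a small sketch. The counts below are made up for illustration, and the function names are hypothetical, not taken from any library. Lift is computed as P(A and B) / (P(A) P(B)), and the Kulczynski measure as the average of P(A|B) and P(B|A).

```python
def lift(n_ab, n_a, n_b, n_total):
    """Lift of items A and B: P(A and B) / (P(A) * P(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


def kulczynski(n_ab, n_a, n_b):
    """Kulczynski measure: the average of P(B|A) and P(A|B)."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)


# Made-up counts: A appears in 1,000 baskets, B in 1,000, both in 100.
n_ab, n_a, n_b = 100, 1_000, 1_000

# The only thing that changes is the number of baskets containing
# neither A nor B (the null transactions).
lift_few_nulls = lift(n_ab, n_a, n_b, n_total=2_000)
lift_many_nulls = lift(n_ab, n_a, n_b, n_total=200_000)
kulc = kulczynski(n_ab, n_a, n_b)  # does not use n_total at all

print(lift_few_nulls, lift_many_nulls, kulc)
```

With few null transactions Lift is below 1 (A and B look negatively related), while with many it is far above 1 (they look strongly related), even though the baskets that actually contain A or B are unchanged. The Kulczynski value never looks at the total, so it stays the same.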

Cluster Analysis:

This is used to find Groups or Sub Sets of closely related Data in the Main Data Set. An Example would be analyzing clusters of Cancer Cases around a Chemical Plant.
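As a rough sketch of how a clustering method finds such groups, the following is a minimal one-dimensional k-means. K-means is only one of many clustering algorithms and is chosen here as an assumption for illustration; the numbers and the starting centres are made up.

```python
def k_means_1d(values, centres, iterations=10):
    """Very small k-means for one-dimensional data.

    Repeatedly assigns each value to its nearest centre, then moves
    each centre to the mean of the values assigned to it.
    """
    groups = [[] for _ in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # An empty group keeps its old centre.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Made-up measurements that visibly fall into two groups.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centres, groups = k_means_1d(values, centres=[0.0, 5.0])
print(centres)
print(groups)
```

The result depends on the starting centres and on the number of clusters asked for, which is one reason real analyses usually try several settings.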

Association Analysis:

This is used to discover features that describe strongly linked parts of the Data Set. An Example would be to understand the relationships between different parts of the Earth’s Climate System.

Anomaly Detection:

This is used to detect Outliers or Anomalies in the Data Set. An Example would be Defect Analysis or Statistical Process Control in Manufacturing.
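A simple form of this, in the spirit of Statistical Process Control, is to set control limits at three standard deviations either side of the mean of a known-good baseline and flag anything outside them. The sketch below assumes that rule and uses made-up readings; real control charts add further rules.

```python
from statistics import mean, stdev

# Made-up baseline readings taken while the process was running normally.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]

centre = mean(baseline)
sigma = stdev(baseline)  # sample standard deviation
lower, upper = centre - 3 * sigma, centre + 3 * sigma

# New readings are anomalies if they fall outside the control limits.
new_readings = [10.0, 10.3, 11.5, 9.9]
anomalies = [x for x in new_readings if not lower <= x <= upper]

print(f"limits: {lower:.2f} to {upper:.2f}")
print("anomalies:", anomalies)
```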

Predictive Modelling:

There are two subsets here:

A: Classification: Used for Data Sets containing Discrete Variable Types, e.g. types of cars

B: Regression: Used for Data Sets containing Continuous Variable Types, e.g. time, temperature

Predictive Modelling is used to determine the value of a target variable from its relationship to other variables. It is also sometimes used in Marketing to try to predict, say, next year’s fashion trends so that factories and retailers can prepare in advance.
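For the Regression case, a minimal sketch is a straight line fitted by ordinary least squares and then used to predict a value that was not observed. The data are made up and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up readings: temperature (continuous target) against time in hours.
hours = [0, 1, 2, 3, 4]
temps = [20.0, 22.0, 24.0, 26.0, 28.0]

slope, intercept = fit_line(hours, temps)
predicted = slope * 5 + intercept  # predict the temperature at hour 5
print(slope, intercept, predicted)
```

A Classification model follows the same pattern of fitting on known records and predicting for new ones, but its target is a label such as a type of car rather than a number.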

Types of Data Sets

The Data Set:

A Data Set is commonly defined as a Collection or Set of Data Objects.

Data Objects:

Common Names used for a data object are Record, Point, Vector, Pattern, Event, Case, Sample and Entity.

Data Attributes:

Data Objects are described by a Set or Class, of Characteristics or Attributes, which define that Object.

Common names for Attributes of an Object are Variable, Characteristic, Field, Feature or Dimension.

The Data Matrix:

If the Data Objects in a data set all have the same fixed set of numeric Attributes, then the data set can be represented by an m by n matrix. The m Rows represent the Data Objects and the n Columns represent the Data Attributes. Each Row is also known as a Record. Matrix Algebra can be used to process these arrays.
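As a small illustration, here is a made-up Data Matrix held as a list of rows, with one row per Data Object and one column per Attribute. The attribute names and values are invented for the example.

```python
# Column names: one per Data Attribute (invented for this example).
attributes = ["height_cm", "weight_kg", "age_years"]

# One row per Data Object; every row has the same number of columns.
data_matrix = [
    [170.0, 65.0, 30.0],
    [182.0, 80.0, 45.0],
    [158.0, 52.0, 27.0],
    [175.0, 72.0, 38.0],
]

m = len(data_matrix)     # number of Data Objects (rows)
n = len(data_matrix[0])  # number of Attributes (columns)

# A simple column-wise operation: the mean of each Attribute.
column_means = [sum(row[j] for row in data_matrix) / m for j in range(n)]

print(m, n)
print(dict(zip(attributes, column_means)))
```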

Characteristics of Data Sets

Dimensionality:

This is the number of Attributes that the Objects in the data set possess. Data Sets with High Dimensionality are generally harder to analyse, so a process known as Dimensionality Reduction is often undertaken to make the Data Set more manageable.

Sparsity:

For some data sets most attributes will have a value of Zero or NULL. Sparsity is an indication of how “empty” a data set is.  Some Algorithms only work well when working with sparse data sets.
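One simple way to put a number on this, assumed here for illustration, is the fraction of cells in the data set that are Zero or NULL. The function name and the data are made up.

```python
def sparsity(matrix):
    """Fraction of cells that are zero or missing (None)."""
    cells = [value for row in matrix for value in row]
    empty = sum(1 for value in cells if value is None or value == 0)
    return empty / len(cells)


# Made-up data set: only 2 of the 12 cells hold a non-zero value.
matrix = [
    [0, 0, 3, 0],
    [0, None, 0, 0],
    [1, 0, 0, 0],
]

print(f"sparsity = {sparsity(matrix):.2f}")
```

A sparse data set like this can be stored far more compactly by recording only the non-zero cells and their positions.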

Resolution:

Resolution is the level of detail at which the data is recorded. It applies to Data Sets containing things such as Mapping Data and Digital Signal Processing Data, and it can be important when looking for fine detail in the Data.

Transaction or Market Basket Data:

This applies to Data Sets containing Retail Sales Data. Each record contains Fields such as a list of items, whether each item was purchased or not, the price it was purchased for, the time it was purchased, the till number, etc.
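Market Basket Data is often turned into a table of ones and zeros, with one row per transaction and one column per item, before it is analysed. The sketch below shows that conversion on made-up baskets and counts how many baskets each item appears in.

```python
# Made-up transactions: each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "nappies", "beer"},
    {"milk", "nappies", "beer"},
    {"bread", "milk", "nappies"},
]

# One column per distinct item, in alphabetical order.
items = sorted(set().union(*transactions))

# One row per basket: 1 if the item was purchased, 0 if it was not.
binary_rows = [[1 if item in basket else 0 for item in items]
               for basket in transactions]

# Support count: the number of baskets that contain each item.
support_count = {item: sum(row[j] for row in binary_rows)
                 for j, item in enumerate(items)}

print(items)
for row in binary_rows:
    print(row)
print(support_count)
```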

Graph based Data Sets:

A Graph can be used to visualize relationships between Data Objects. Examples are the chemical structure of Ethanol, where the atoms are the nodes and the bonds are the links between them, and a Binary Decision Tree.
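The Ethanol example can be written down as a graph in a few lines. The sketch below stores the molecule (CH3-CH2-OH) as a list of bonds and builds an adjacency list from it; the atom labels such as "C1" and "H4" are invented for the example.

```python
# Each bond in Ethanol (CH3-CH2-OH) is an edge between two atoms.
bonds = [
    ("C1", "H1"), ("C1", "H2"), ("C1", "H3"), ("C1", "C2"),
    ("C2", "H4"), ("C2", "H5"), ("C2", "O"),
    ("O", "H6"),
]

# Adjacency list: for each atom, the set of atoms it is bonded to.
adjacency = {}
for a, b in bonds:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

# The degree of a node is its number of bonds.
degree = {atom: len(neighbours) for atom, neighbours in adjacency.items()}

print(len(adjacency), "atoms,", len(bonds), "bonds")
print(degree)
```

The degrees recover the familiar chemistry: each carbon has four bonds, the oxygen two and each hydrogen one.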

 

Sequence Data Sets:

This consists of a data set made up of an ordered sequence of entities. An Example would be the DNA sequence of a Genome or the first thousand digits of the mathematical constants π or e.

Time Series Data Sets:

This is a type of sequence data where each record has a time stamp. An example would be an Engineering Data Logger used to monitor process equipment in a Factory.

Spatial Data Sets:

This is where data objects have spatial attributes such as Latitude and Longitude or Vector Fields. An example would be the GPS Navigator in your car or the model of the Air Flow through a Jet Engine.

Conclusion

Data Analytics is a merging of many disciplines such as Statistics, Machine Learning, Pattern Recognition, Database Technology and Parallel Computing. It is a Young Field with plenty of opportunities for growth.
