Introduction – CA5 Factor Analysis
The link between factor analysis and regression.
When you are given a large data set, it is often beneficial to use factor analysis, and principal component analysis (PCA) is one technique for carrying it out. Both regression and factor analysis allow you to connect the dots between the variables in a large data set. Regression allows you to understand the relationships between the different variables and to quantify those relationships. Factor analysis allows you to understand the underlying drivers that influence those relationships. Because these drivers are often not immediately observable, it may be necessary to transform the data to reveal them.
While regression looks at the cause (independent/explanatory variable) of the effect (dependent variable), factor analysis seeks to identify the underlying causes that influence the observable behaviour. Explaining the underlying causes can improve your predictions.
Why use factor analysis?
Factor analysis is useful when you have n-dimensional data and there are multiple causes for a single effect. In multiple linear regression, you fit a regression plane to relate the causes to the effect. You can have many variables in multiple regression, and it is possible to use all of them to explain the effect. However, if you use all the variables, you can run into a problem called multicollinearity.
Multicollinearity
Some of these variables are correlated with each other. They can contain the same information, so they do not each provide you with new information. You need to identify underlying causes that are uncorrelated but still affect the observed behaviour. These enable you to build a better model and reduce complexity and computational effort.
Regression models built on such highly correlated variables are weak and not very stable. The exercise of taking a large number of variables, extracting the underlying causes from those variables and using them to explain an effect is called factor analysis.
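One simple way to spot multicollinearity before fitting a regression is to look at the pairwise correlations between the predictors. The following is a minimal sketch using only the Python standard library. The predictor names (x1, x2, x3), their values and the 0.9 cut-off are all hypothetical, made up for illustration and not taken from any real data set.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical predictor columns (illustrative values only).
predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],  # roughly 2 * x1, so nearly redundant
    "x3": [5, 1, 4, 2, 3],            # only weakly related to x1 and x2
}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

# Collect every pair of predictors whose correlation exceeds the cut-off.
flagged = []
for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
    r = pearson(col_a, col_b)
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b))

print("Highly correlated pairs:", flagged)
```

A flagged pair suggests that the two variables carry largely the same information, which is the situation in which extracting uncorrelated underlying factors becomes attractive.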
Factor Analysis and PCA
When you are trying to fit a curve through a set of data points, regression is an appropriate technique to use. If you first wish to extract the factors that explain the data, then principal component analysis, or PCA for short, is the recommended technique.
It’s not unusual to use a rule-based approach, where human experts identify the relevant factors. The alternative is a machine learning approach: PCA extracts the factors using an algorithm, without relying on expert analysis or intuition. PCA can be used to identify latent factors and to reduce dimensionality.
In PCA you are looking for the ‘best’ direction through the data, meaning the direction along which the data varies most. When you have 2-dimensional data you may also need the ‘next best’ direction. These directions are the principal components. They are orthogonal to each other, so together they carry the maximum information with the least number of dimensions.
PCA can be used to convert highly correlated variables into a new set of variables. Each of these new variables is orthogonal to, and uncorrelated with, each of the others. The new variables are ordered by variance, highest first.
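The conversion can be illustrated for the 2-dimensional case with a minimal sketch in plain Python. It assumes two hypothetical, highly correlated variables x and y (illustrative values only) and uses the closed-form rotation angle that applies to two variables. A real analysis would normally rely on a numerical library rather than this hand-written version.

```python
import math


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def pca_2d(x, y):
    """Return the scores of 2-D data on its first and second principal directions."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [v - mean_x for v in x]  # centre each variable on its mean
    yc = [v - mean_y for v in y]
    sxx, syy, sxy = covariance(x, x), covariance(y, y), covariance(x, y)
    # Angle of the direction of greatest variance (closed form, 2-D only).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(xc, yc)]   # 'best' direction
    pc2 = [-s * a + c * b for a, b in zip(xc, yc)]  # orthogonal 'next best'
    return pc1, pc2


# Hypothetical correlated variables (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

pc1, pc2 = pca_2d(x, y)
var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)

print("covariance of original variables:", covariance(x, y))
print("covariance of new variables:", covariance(pc1, pc2))
print("share of variance in first component:", var1 / (var1 + var2))
```

The new variables have a covariance of zero up to rounding error, the first carries at least as much variance as the second, and the total variance is unchanged by the rotation.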
PCA uses eigendecomposition of the covariance matrix of the data to find the principal components. Each component corresponds to an eigenvector, which gives the direction of the component, and an eigenvalue, which gives the variance along that direction.
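To make the role of eigenvectors and eigenvalues concrete, here is a minimal sketch that decomposes a symmetric 2x2 matrix in closed form using only the standard library. The matrix [[2, 1], [1, 2]] is a hypothetical covariance matrix chosen for easy arithmetic, and the function name is an assumption for illustration. For larger matrices a routine such as NumPy's numpy.linalg.eigh would be the usual choice.

```python
import math


def eigen_symmetric_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, c]], largest eigenvalue first."""
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    values = [mid + half_gap, mid - half_gap]
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
        return values, vectors
    vectors = []
    for lam in values:
        # (b, lam - a) solves (A - lam * I) v = 0 when b != 0.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        vectors.append((vx / norm, vy / norm))
    return values, vectors


# Hypothetical covariance matrix [[2, 1], [1, 2]] (illustrative values only).
a, b, c = 2.0, 1.0, 2.0
values, vectors = eigen_symmetric_2x2(a, b, c)

for lam, (vx, vy) in zip(values, vectors):
    # Check the defining property A v = lam v for each pair.
    av = (a * vx + b * vy, b * vx + c * vy)
    print("eigenvalue:", lam, "eigenvector:", (vx, vy), "A v:", av)

# Share of the total variance captured by the first principal component.
explained = values[0] / sum(values)
print("explained variance ratio of first component:", explained)
```

The eigenvector paired with the largest eigenvalue is the first principal direction, and each eigenvalue divided by the sum of all eigenvalues gives the share of variance that its component explains.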
Information sourced @ Connect the Dots: Factor Analysis