Multicollinearity in Big Data

This posting reviews approaches to improving the effectiveness of regression in big-data analytics.  Snee (2015) describes a challenging problem that arises when big data are used to estimate relationships and make predictions: the data are likely to be observational, and multicollinearity may exist among the predictor variables (Clark 2016).  When predictor variables are highly correlated, the estimates of the regression coefficients have large variances and covariances (Montgomery, Peck and Vining 2006, page 327).  This means that repeated samples, even ones in which the predictor variables take the same values, could yield markedly different estimates of the model parameters.  That increases the likelihood of settling on inappropriate parameter values.  Least squares also tends to produce coefficient estimates that are too large in absolute value.  Multicollinearity affects estimates and predictions from both regression and artificial neural network models (Lin and Lee 1996).
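The variance inflation described above can be seen in a small simulation.  The sketch below is only an illustration under assumed settings that do not come from the cited sources: two predictors, 50 observations, true coefficients of 1, unit noise variance, and a hypothetical helper named slope_spread.  It holds the predictor values fixed, redraws the noise many times, and reports how much the least squares estimate of the first coefficient varies.

```python
import random
import statistics


def center(values):
    """Subtract the mean so the fitted model needs no intercept."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def ols_two_predictors(x1, x2, y):
    """Least squares slopes for y = b1*x1 + b2*x2 with centered predictors."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def slope_spread(rho, n=50, reps=500, seed=1):
    """Standard deviation of the b1 estimate over repeated noise draws.

    The predictor values are generated once and then held fixed; only the
    noise in y changes between replications (all settings are assumptions).
    """
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = center(z1)
    # x2 has population correlation rho with x1
    x2 = center([rho * a + (1 - rho ** 2) ** 0.5 * b for a, b in zip(z1, z2)])
    estimates = []
    for _ in range(reps):
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        b1, _b2 = ols_two_predictors(x1, x2, y)
        estimates.append(b1)
    return statistics.stdev(estimates)


low_corr_spread = slope_spread(rho=0.0)
high_corr_spread = slope_spread(rho=0.99)
print(f"spread of b1, uncorrelated predictors:     {low_corr_spread:.3f}")
print(f"spread of b1, highly correlated predictors: {high_corr_spread:.3f}")
```

With two predictors the variance of each coefficient is inflated by a factor of 1/(1 - r²), where r is the sample correlation between them, so the second figure should be several times the first.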

Montgomery, Peck and Vining (2006) outline several approaches for dealing with multicollinearity.  One is to collect additional data to reduce the multicollinearity, although this is not always possible.  Another is to respecify the model, for example by defining a new set of predictors that are functions of the original predictors and are less strongly correlated, or by eliminating a predictor that contributes little explanatory capability.  A further approach is to accept estimates that are slightly biased but have smaller variances.  Three methods of this kind are:

  • Ridge Regression. The ridge estimator is a linear transformation of the least squares estimator, controlled by a biasing parameter.
  • Principal Component (PC) Regression. Define a new set of orthogonal predictors, called principal components, that are linear functions of the original predictors.  Then drop the components that account for the least of the overall variance of the predictors.
  • Partial Least Squares (PLS) Regression (Abdi 2010). PC regression finds a set of orthogonal predictors that summarize only the original predictors.  PLS regression finds a set of orthogonal predictors, called latent vectors, that explain the covariance between the predictors and the dependent variables.
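To make the three estimators concrete, here is a minimal sketch for the simplest case: two predictors and one dependent variable.  It assumes the predictors are centered and scaled to unit length, so that X'X is the 2x2 correlation matrix with off-diagonal r, and that y is centered.  The simulated data, the biasing parameter k = 0.1, and all function names are illustrative assumptions rather than anything taken from the cited references.

```python
import math
import random


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def unit_length(v):
    """Center and scale to unit length, so dot products are correlations."""
    c = center(v)
    norm = math.sqrt(sum(a * a for a in c))
    return [a / norm for a in c]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ridge(r, c1, c2, k):
    """Solve (X'X + kI) b = X'y with X'X = [[1, r], [r, 1]].

    k is the biasing parameter; k = 0 gives ordinary least squares.
    """
    det = (1 + k) ** 2 - r ** 2
    return (((1 + k) * c1 - r * c2) / det, ((1 + k) * c2 - r * c1) / det)


def pc_regression(r, c1, c2):
    """Keep only the principal component with the larger eigenvalue.

    Assumes r > 0, so that component has eigenvalue 1 + r and direction
    (1, 1)/sqrt(2); the component with eigenvalue 1 - r is dropped.
    """
    b = (c1 + c2) / (2 * (1 + r))
    return (b, b)


def pls_one_component(r, c1, c2):
    """One latent vector: weights proportional to X'y, then regress y on it."""
    norm = math.hypot(c1, c2)
    w1, w2 = c1 / norm, c2 / norm
    tt = 1 + 2 * r * w1 * w2  # t't for the score t = Xw
    q = norm / tt             # t'y / t't
    return (q * w1, q * w2)


# Simulated, highly correlated predictors (illustrative values only)
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(40)]
x1_raw = [a + 0.1 * rng.gauss(0, 1) for a in z]
x2_raw = [a + 0.1 * rng.gauss(0, 1) for a in z]
y_raw = [a + b + rng.gauss(0, 1) for a, b in zip(x1_raw, x2_raw)]

x1, x2, y = unit_length(x1_raw), unit_length(x2_raw), center(y_raw)
r, c1, c2 = dot(x1, x2), dot(x1, y), dot(x2, y)

ols = ridge(r, c1, c2, k=0.0)
rdg = ridge(r, c1, c2, k=0.1)
pcr = pc_regression(r, c1, c2)
pls = pls_one_component(r, c1, c2)

print(f"correlation between predictors: {r:.3f}")
for name, b in [("least squares", ols), ("ridge, k=0.1", rdg),
                ("PC regression", pcr), ("PLS, 1 vector", pls)]:
    print(f"{name:14s} b1 = {b[0]:8.3f}   b2 = {b[1]:8.3f}")
```

In this setup the true standardized coefficients are nearly equal, so the least squares pair can differ noticeably from each other while the three biased estimators pull the two coefficients toward a common value.  In practice the biasing parameter and the number of retained components or latent vectors would be chosen from the data, for example by cross-validation.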

The next posting will review studies that compare the performance of the above three approaches to dealing with multicollinearity.


  1. Abdi, Herve (2010). “Partial Least Squares Regression and Projection on Latent Structure Regression (PLS Regression).” Wiley Interdisciplinary Reviews: Computational Statistics 2(1): 97-106.
  2. Clark, Gordon (2016). “Quality Improvement Using Big-Data Analytics.” ASQ Statistics Division Digest 35(1): 25-29.
  3. Snee, R. D. (2015). “A Practical Approach to Data Mining: I Have All These Data; Now What Should I Do?” Quality Engineering 27(4): 477-487.
  4. Montgomery, D. C., E. A. Peck and G. G. Vining (2006). Introduction to Linear Regression Analysis, Fourth Edition. John Wiley & Sons, New York.
