
Effectiveness of Approaches to Reduce Effect of Multicollinearity

The previous posting describes multicollinearity, a data limitation, and several methods for alleviating it when using existing data or observational data to construct a model for estimating relationships and making predictions.  This posting reviews the results of three experiments, described by Clark (2016), performed to compare the effectiveness of these approaches to alleviate the effects of multicollinearity.

Yeniay and Goktas (2002) used a real data set to predict gross domestic product per capita (GDPPC) in Turkey. They compared the performance of partial least squares (PLS) regression, ridge regression (RR), principal component regression (PCR) and ordinary least squares (OLS). The previous posting gives a brief description of these regression methods. The data consisted of 80 observations, one for each of the 80 provinces in Turkey. The data set included 26 predictor variables, 22 of which were highly correlated. The authors estimated the predictive capability of the models using the leave-one-out approach to estimate the variability of predictions. PLS regression had the smallest prediction variability, but it was only slightly better than PCR, and the statistical significance of that difference was not examined. However, the prediction variability of PLS regression and PCR was much smaller than that of RR and OLS.
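The leave-one-out idea can be sketched in a few lines of Python. This is only an illustration of the mechanism, not the authors' procedure: it assumes a single predictor fitted by ordinary least squares, the data values are made up, and the function names `fit_line` and `leave_one_out_rmse` are hypothetical. Each observation is held out in turn, the model is refitted on the remaining observations, and the held-out value is predicted.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leave_one_out_rmse(xs, ys, fit=fit_line):
    """Root mean squared error of predictions for held-out observations."""
    squared_errors = []
    for i in range(len(xs)):
        # Refit the model without observation i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit(train_x, train_y)
        # Predict the observation that was left out.
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up illustrative data, not the Turkish provincial data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
print(leave_one_out_rmse(xs, ys))
```

To compare methods in this way, the same loop would be run with a different `fit` function for each method, and the resulting error measures compared.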

Dumancas and Bello (2015) compared the performance of 12 predictive approaches using machine learning. The objective was to use lipid profile data to predict 5-year mortality after adjusting for confounding demographic variables. The approaches included PLS discriminant analysis, artificial neural network, ridge regression and logistic regression. PLS discriminant analysis (PLS-DA) is used when Y is a categorical variable, for example Y = 1 when a person dies within 5 years. The data set consisted of 726 individuals, of whom 121 died in the 5-year period. The total data set was divided into a training set (483 individuals) and a test set (243 individuals). The results ranked PLS-DA first among the 12 approaches for predictive accuracy. However, the differences between PLS-DA and both the artificial neural network and logistic regression were not statistically significant.
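The split sizes and the mortality rate quoted above can be checked with a short Python sketch. The paper's summary here does not say how the split was drawn, so the seeded random shuffle below is an assumption, and the function names `split_indices` and `accuracy` are hypothetical.

```python
import random

def split_indices(n_total, n_train, seed=0):
    """Randomly split row indices into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    indices = list(range(n_total))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Sizes taken from the text: 726 individuals, 483 used for training.
train_idx, test_idx = split_indices(726, 483)
print(len(train_idx), len(test_idx))

# Overall 5-year mortality rate in the data set (121 deaths out of 726).
death_rate = 121 / 726
print(round(death_rate, 3))
```

Because only about one individual in six died, predictive accuracy for any of the approaches is best read against that base rate.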

Dormann, Elith et al. (2013) evaluated methods for dealing with multicollinearity using simulation experiments. They created training and test data sets that had 1000 cases and 21 predictors. The condition number is a measure of the degree of collinearity: it is the square root of the ratio of the largest to the smallest eigenvalue of X'X, the cross-product matrix of the standardized predictors X. A condition number of 10 is approximately equivalent to |r| = 0.7. On page 30, the authors state that several of the latent variable methods were only marginally better than multiple linear regression (MLR), delaying the degeneration of model performance from a condition number of 10 to 30. MLR fits a linear equation with two or more explanatory variables, whereas PLS regression uses latent variables. The paper's abstract, however, states that latent variable methods did not outperform MLR. That conclusion did not apply when the condition number was less than 30.
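As a rough illustration of the condition number, the following Python sketch computes it from a predictor correlation matrix. Several things here are assumptions rather than the paper's method: the eigenvalues are taken from the correlation matrix of the predictors, a basic Jacobi rotation routine is used only to keep the example free of external libraries, and the names `symmetric_eigenvalues`, `condition_number` and `equicorrelation` are hypothetical. The toy matrix has three predictors that all share the same pairwise correlation; because the condition number depends on the whole correlation structure, it is not expected to reproduce the correspondence between 10 and |r| = 0.7 reported for the simulated data.

```python
import math

def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break  # matrix is numerically diagonal
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def condition_number(corr):
    """Square root of the largest-to-smallest eigenvalue ratio."""
    eigenvalues = symmetric_eigenvalues(corr)
    if eigenvalues[0] <= 0.0:
        return math.inf  # perfectly collinear predictors
    return math.sqrt(eigenvalues[-1] / eigenvalues[0])

def equicorrelation(p, r):
    """Toy p-by-p correlation matrix with every pairwise correlation r."""
    return [[1.0 if i == j else r for j in range(p)] for i in range(p)]

# Illustrative matrix only; these values do not come from the paper.
print(condition_number(equicorrelation(3, 0.7)))
```

Uncorrelated predictors give a condition number of 1, and the value grows without bound as the smallest eigenvalue approaches zero.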

Our conclusion is that evidence exists that PLS regression can outperform MLR in many situations with multicollinearity. However, severe multicollinearity can degrade PLS regression prediction performance.


    1. Clark, G. (2016). “Quality Improvement Using Big-Data Analytics.” ASQ Statistics Division Digest 35(1): 25-29.
    2. Dormann, C. F., J. Elith, et al. (2013). “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography 36(1): 27-46.
    3. Dumancas, G. G. and G. A. Bello (2015). “Comparison of Machine-learning Techniques for Handling Multicollinearity in Big Data Analytics and High-performance Data Mining.” Supercomputing 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis.
    4. Yeniay, O. and A. Goktas (2002). “A Comparison of Partial Least Squares Regression with Other Prediction Methods.” Hacettepe Journal of Mathematics and Statistics 31: 99-111.


Using Big-Data Analytics to Improve Quality

This post addresses the expanding use of Big-Data Analytics to improve quality in many organizations. Davenport (2013) defines Big Data as data that is either too unstructured, too voluminous or from too many different sources to be analyzed by traditional approaches. The label analytics describes the use of big data to drive decisions and actions.