Detection and elimination of multicollinearity in regression analysis

Abstract

Multicollinearity occurs when the independent variables of a regression model are highly correlated with one another. This correlation is a problem because the independent variables are assumed to be independent of each other. The higher the degree of correlation, the harder it becomes to fit the model and interpret the results. In this paper, we address the problem of multicollinearity on the basis of hat values: variables with higher hat values are removed from the data before the model is fitted. We compare the results achieved by the proposed technique with those of state-of-the-art methods.

Keywords

Regression model, multicollinearity, correlation

1. Introduction

Regression analysis is a statistical process for analyzing the relationship between two or more variables, one of which depends on the others [3]. In other words, regression analysis makes it possible to understand how the independent variables affect a variable that depends on them. Because regression analysis makes it easier to estimate a future value of a variable, it has many day-to-day applications at the business, personal and social levels. For example, it can be used to evaluate the risk of accidents on a certain stretch of road with respect to its geography, or to check the effectiveness of a change made in a commercial or academic project based on the results obtained after introducing it. Regression analysis is widely used in the corporate world. From its results, companies can better understand which elements have the greatest impact on outcomes, which affect other elements of the company, and which can be ignored. In this way, companies obtain information that they can quickly apply to improve their efficiency.

To apply such an analysis, we will necessarily use two types of variables.

•
Dependent variables: the variables that we seek to study through statistical regression, in order to understand how they respond when the independent variables change.
•
Independent variables: these are the factors that we consider to influence and directly affect the dependent variables that are under study.

We can distinguish three regression models, depending on the number of variables and the way they interact [5]:

•
Simple linear regression model
•
Multiple linear regression model
•
Nonlinear regression model

The choice between these models depends on the number of variables that we need to include and on how they interact.

2. Simple linear regression model

Simple linear regression is the simplest and most widely used model. It studies the effect of one independent variable on a single dependent variable, or on a variable that we consider dependent at least at a theoretical level. Using the simple linear regression equation, an estimate can be made from the data obtained.

Simple linear regression formula

$\displaystyle y=B_{0}+B_{1}x+\varepsilon$

where $y$ is the dependent variable, $x$ is the independent variable, $B_{0}$ is the intercept (the value of $y$ when $x=0$), $B_{1}$ is the slope (the change in $y$ per unit change in $x$), and $\varepsilon$ represents the residual or error. The role of $\varepsilon$ is to account for the variability in the data that cannot be explained by the linear relationship in the formula.
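As an illustration of how the intercept and slope of the simple linear regression formula can be estimated by least squares, the following is a minimal sketch. The function name `fit_simple_ols` and the data are hypothetical and are not taken from the paper; the data are generated without an error term so that the fitted values can be checked by eye.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sample covariance of x and y divided by the sample variance of x
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Illustrative data generated exactly from y = 2 + 3x (no error term)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
b0, b1 = fit_simple_ols(x, y)
residuals = y - (b0 + b1 * x)   # plays the role of the error term in the formula
```

With real data the residuals would not vanish; they would carry the variability that the straight line does not explain.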

3. Multiple linear regression model

In multiple linear regression, the model simply has more than one independent variable. We apply this model when we have reason to believe that more than one factor affects the study variable.

Multiple linear regression formula

$\displaystyle Y=B_{0}+B_{1}X_{1}+B_{2}X_{2}+\ldots+B_{n}X_{n}+\varepsilon$

Again, $Y$ represents the dependent variable being studied; $X_{1}$, $X_{2}$, ..., $X_{n}$ are the independent variables that can affect the value of $Y$; $B_{0}$ is the intercept; and $B_{1}$, $B_{2}$, ..., $B_{n}$ are the corresponding regression coefficients. As before, $\varepsilon$ represents the error.
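The multiple linear regression formula can be estimated in the same way as the simple one. The sketch below is only an illustration under stated assumptions: the helper `fit_multiple_ols` is hypothetical, the data are synthetic, and no error term is added so that the recovered coefficients are easy to verify.

```python
import numpy as np

def fit_multiple_ols(X, y):
    """Return [B0, B1, ..., Bn] for the least-squares fit of y on the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                  # two synthetic independent variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]       # generated without an error term
coef = fit_multiple_ols(X, y)                 # approximately [1.0, 2.0, -0.5]
```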

4. Nonlinear regression model

There are occasions in which the relationship between the independent variables and the dependent variable is not linear but follows, for example, exponential growth. In those cases, the nonlinear regression model comes into play and allows us to approximate the values of the dependent variable in a nonlinear setting. Note that nonlinear regression is more complex, since the number of parameters may not coincide with the number of independent variables.
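To make the exponential case concrete, the sketch below fits a curve of the assumed form y = a·exp(b·x) by Gauss-Newton least squares. This particular model, the function name `fit_exponential`, and the data are illustrative assumptions rather than anything used in the paper. It also shows one independent variable paired with two parameters.

```python
import numpy as np

def fit_exponential(x, y, iterations=50):
    """Fit y = a * exp(b * x) by Gauss-Newton; y must be positive for the starting values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Starting values from a straight-line fit to log(y)
    b, log_a = np.polyfit(x, np.log(y), 1)
    a = np.exp(log_a)
    for _ in range(iterations):
        pred = a * np.exp(b * x)
        residual = y - pred
        # Jacobian of the prediction with respect to the parameters (a, b)
        J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
        step, *_ = np.linalg.lstsq(J, residual, rcond=None)
        a, b = a + step[0], b + step[1]
        if np.max(np.abs(step)) < 1e-12:
            break
    return a, b

x = np.linspace(0.0, 2.0, 20)
y = 1.5 * np.exp(0.8 * x)        # synthetic data without noise
a, b = fit_exponential(x, y)     # close to (1.5, 0.8)
```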

Something to keep in mind is that these predictive models are not exact. With them it is possible to mistake a correlation between two variables for causality. If there is no logical reason for the variables to be related, we can reach wrong conclusions by analyzing data that are not actually related.

As we can see, there are many different ways of applying a regression analysis, each one adapted to the particular needs of each case study. And although it is necessary to exercise caution and choose the variables to be studied well, regression allows us to obtain valuable data that we can use to our advantage.

5. Multicollinearity

A topic of growing interest is the study and modeling of the association between variables. In different areas of knowledge, such as biology, ecology, medicine, psychology and the human sciences in general, situations arise in which the researcher is interested in fitting a model that includes each of the observed regression variables. Often when modeling this type of information, the assumptions necessary to fit a multiple linear regression are not satisfied; in particular, there is the problem of collinearity between the regression variables. The specialized literature offers alternative methods for modeling in the presence of multicollinearity between predictors. Multicollinearity implies the existence of a linear dependence between regression variables, which leads to non-unique estimation of the parameters and therefore to a false relationship between the explanatory variables and the response variable.

The term collinearity (or multicollinearity) in econometrics refers to a situation in which two or more explanatory variables are very similar and it is therefore difficult to measure their individual effects on the explained variable [7]. This phenomenon occurs frequently in the context of time series, particularly with macroeconomic series. For example, population and GDP generally tend to be highly correlated.

We can find:

•
Exact multicollinearity: it occurs when the values of an explanatory variable are obtained as an exact linear combination of others.
•
Degree multicollinearity: it occurs when the values of different variables are so correlated that it is almost impossible to accurately estimate the individual effects of each of them.
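The difference between the two cases can be seen numerically in the design matrix. In the sketch below (synthetic data; all variable names are hypothetical), exact multicollinearity makes the matrix rank deficient, whereas degree multicollinearity leaves it full rank but badly conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact multicollinearity: the third regressor is an exact linear combination
X_exact = np.column_stack([np.ones(n), x1, x2, 2.0 * x1 - x2])
# Degree multicollinearity: the same combination plus a small perturbation
X_degree = np.column_stack([np.ones(n), x1, x2,
                            2.0 * x1 - x2 + 0.01 * rng.normal(size=n)])
# Reference design with an unrelated third regressor
X_clean = np.column_stack([np.ones(n), x1, x2, rng.normal(size=n)])

rank_exact = np.linalg.matrix_rank(X_exact)     # one less than the number of columns
rank_degree = np.linalg.matrix_rank(X_degree)   # full rank
cond_degree = np.linalg.cond(X_degree)          # large condition number
cond_clean = np.linalg.cond(X_clean)            # moderate condition number
```

In the exact case the least-squares coefficients are not unique; in the degree case they are unique but estimated very imprecisely.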

6. Problems of multicollinearity

To decide whether degree collinearity is a problem, we must take into account the objectives of our specific analysis. For example, collinearity does not worry us much if our goal is to predict, but it is a very serious problem if the analysis focuses on interpreting the parameter estimates. A first approach to diagnosing it is to obtain the simple sample correlation coefficients for each pair of explanatory variables and check whether the degree of correlation between them is high. However, there may be an almost perfect linear relationship among three or more variables even though no simple correlation between pairs of variables exceeds 0.5. Another method is to regress each explanatory variable on the rest: one regression is performed per explanatory variable, and the coefficients of determination are obtained. If any of them is high, we can suspect the existence of collinearity. The collinearity problem boils down to the fact that the sample does not contain enough information to estimate all the parameters. Therefore, solving the problem requires adding new information, whether sample or extra-sample, or changing the specification [6]. Some possible solutions along these lines are:

•
Add new observations

If it really is a sample problem, one possibility is to change the sample, because with new data the problem may be resolved, although this does not always happen. The idea is to obtain data that are less correlated than the previous ones, either by replacing the whole sample or simply by adding more data to the initial sample. It is not always easy to obtain better data, so most likely we must live with the problem, being careful with the inference made and the conclusions drawn from it.
•
Restrict parameters

If economic theory or experience suggests some restrictions on the parameters most affected by collinearity, imposing them will reduce the problem. Obviously, we run the risk of imposing restrictions that are not true.
•
Delete variables

If variables that are correlated with others are removed, the loss of explanatory power will be small and the collinearity will be reduced. This measure can cause other problems: if the variable that we eliminate from the model is really significant, we will be omitting a relevant variable, which will bias the estimators of the model coefficients and of their variance, so that the inference made would not be valid.
•
Transform the variables of the model

If the collinearity arises from relating time series with a trend, it may be convenient to transform the variables to eliminate this trend.
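The two diagnostics described before this list of remedies, pairwise correlations and regressions of each explanatory variable on the rest, can be compared on synthetic data. In the sketch below, the helper `auxiliary_r2` and the data are hypothetical: one variable is built as the sum of eight independent ones plus a small perturbation, so that no pairwise correlation is high while every auxiliary coefficient of determination is.

```python
import numpy as np

def auxiliary_r2(X):
    """R^2 from regressing each column of X on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    r2 = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2[j] = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return r2

rng = np.random.default_rng(2)
n = 2000
base = rng.normal(size=(n, 8))                        # eight independent variables
total = base.sum(axis=1) + 0.1 * rng.normal(size=n)   # nearly their exact sum
X = np.column_stack([base, total])

pairwise = np.corrcoef(X, rowvar=False)
max_abs_pairwise = np.max(np.abs(pairwise - np.eye(X.shape[1])))  # off-diagonal maximum
r2 = auxiliary_r2(X)                                  # one R^2 per explanatory variable
```

Here the largest pairwise correlation stays moderate even though each variable is almost perfectly explained by the others, which is why the auxiliary regressions are the more reliable diagnostic.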

Imperfect multicollinearity is a very common problem in econometric models that violates the basic hypothesis of independence between the explanatory variables.

If a model exhibits multicollinearity, the ordinary least squares estimators are still the best estimators that can be obtained and they fulfill many of the desired properties for an estimator, such as unbiasedness and efficiency.

The following consequences are derived from imperfect multicollinearity:

1.
High standard errors of estimation or large variances in the estimators.
2.
Instability of the estimators under small changes in the sample. This problem is a direct consequence of the previous one: if the variances of the estimators are large, the estimators are more unstable.
3.
Difficulty interpreting the coefficients and therefore their estimates. The regression coefficients ($\beta_{i}$) are interpreted as the change that occurs in the dependent variable ($y$) when the independent variable ($x_{i}$) varies by one unit, provided that the rest of the explanatory variables remain constant. When there is imperfect multicollinearity, it is impossible to assume that the rest of the variables remain constant when one changes, since if they are highly related, changes in one will imply changes in the rest. For this reason the parameters lose their meaning.
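The first consequence can be illustrated with the standard result that the sampling variances of the ordinary least squares coefficients are the diagonal of $\sigma^{2}(X^{\prime}X)^{-1}$. The sketch below compares an almost collinear design with an uncorrelated one; the data are synthetic and the helper `coef_variances` is a hypothetical name.

```python
import numpy as np

def coef_variances(X, sigma2=1.0):
    """Diagonal of sigma^2 (X'X)^{-1}: sampling variances of the OLS coefficients."""
    design = np.column_stack([np.ones(len(X)), X])
    return sigma2 * np.diag(np.linalg.inv(design.T @ design))

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
z = rng.normal(size=n)

x2_independent = z                 # unrelated to x1
x2_collinear = x1 + 0.1 * z        # almost a copy of x1

var_independent = coef_variances(np.column_stack([x1, x2_independent]))
var_collinear = coef_variances(np.column_stack([x1, x2_collinear]))
# Index 1 is the coefficient of x1; the ratio shows how much its variance is inflated
ratio = var_collinear[1] / var_independent[1]
```

The error variance is the same in both designs, so the whole increase in variance comes from the correlation between the regressors.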

7. Literature review

Since multicollinearity is a sampling problem, in many cases it can be solved simply by expanding the sample. However, if we had access to more information, we would have used it from the beginning. A possible solution for imperfect multicollinearity is to eliminate some of the variables that cause it. In this case, a specification error due to the omission of a relevant variable may be incurred [8].

It must be taken into account that the more information the variables share, that is, the greater the degree of multicollinearity, the lower the risk of making an omission error when eliminating one of the variables that generate it. If the objective of the model is mainly predictive, we can consider eliminating variables to solve a multicollinearity problem, but if the aim is to find the factors that affect a variable, we should not eliminate any factor.

Another practical solution that is frequently used is to transform the variables included in the model, in an attempt to make the transformed variables show lower linear correlations [4]. The most commonly used transformations are taking the increments of the variables (first differences, if they are time series) or expressing them relative to a common variable (for example, putting them in per capita terms).
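The effect of taking increments can be sketched with two synthetic time series that share only a linear trend. The series below are hypothetical and are not related to the data analyzed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(200, dtype=float)

# Two hypothetical series that have nothing in common except an upward trend
series_a = 0.5 * t + rng.normal(scale=2.0, size=t.size)
series_b = 0.3 * t + rng.normal(scale=2.0, size=t.size)

# In levels, the shared trend produces a correlation close to one
corr_levels = np.corrcoef(series_a, series_b)[0, 1]

# First differences (the increments) remove the common linear trend
corr_diffs = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]
```

After differencing, the correlation falls toward zero because only the unrelated noise remains.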

7.1 Proposed exercise

Estimate the new model with the per capita variables and study its multicollinearity.

Another solution would be to use other estimation methods. However, if the aims pursued with the construction of the model are predictive, the problem of multicollinearity is not so relevant, since it does not affect the joint explanatory capacity of the variables and, therefore, their predictive capacity.

In the application of statistical models in clinical research, multicollinearity in regression models applied to observational studies is a frequent problem. The authors of the paper review the concept, origin and implications of multicollinearity. Some procedures to detect it and the methods that are frequently used to correct it are presented. The use of cluster analysis is proposed as an alternative strategy in problems involving highly correlated covariates [1].

This work addresses the problem of multicollinearity between regression variables in the multiple linear regression model. There are diverse settings in the agricultural sciences where this difficulty can arise. In order to create a situation of multicollinearity in the data, as required for the study, explanatory variables were generated so that two of them had a certain degree of dependence, that is, an “almost linear combination”. Once these conditions are created, the multicollinearity analysis proceeds through symptoms, diagnosis and treatment. For the symptoms, the pairwise correlations of the regression variables, the partial and total F tests on the regression coefficients, the standard error of each estimator and the coefficient of determination, among other aspects, are analyzed. For the diagnosis, the diagonalization of the correlation matrix and the examination of its last eigenvalues were used, which provides precise information. For the treatment, ridge regression and regression on principal components are considered, which are effective for describing the estimators of the multiple linear regression model with accuracy and precision [2].
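The diagnosis and treatment mentioned in this review can be sketched on synthetic data as follows. This is an illustration of the general techniques under stated assumptions, not the procedure of [2]: the helper `ridge_coefficients`, the penalty value and the data are all hypothetical. The smallest eigenvalue of the correlation matrix approaches zero when two regressors are almost linearly dependent, and the ridge penalty shrinks the coefficient vector relative to ordinary least squares.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge estimate on standardized regressors: (Z'Z + lam*I)^{-1} Z'(y - mean(y))."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y_centered)

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # almost a linear combination of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

# Diagnosis: eigenvalues of the correlation matrix, in ascending order
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# Treatment: lam = 0 reproduces least squares, lam > 0 applies the ridge penalty
beta_ols = ridge_coefficients(X, y, lam=0.0)
beta_ridge = ridge_coefficients(X, y, lam=10.0)
```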

Multidimensional scaling analysis has been presented as an alternative strategy for treating the multicollinearity problem in multiple regression analysis [9], when the regression variables are qualitative, quantitative or mixed (quantitative and qualitative) and the response variable is continuous. The purpose is to obtain the matrix of principal coordinates, using the Gower distance as the metric when the predictor variables are mixed, or otherwise an appropriate Euclidean-type distance selected by the researcher, and to estimate the regression model with this matrix. To assess the goodness of the proposed method, two simulation cases are carried out: the first without multicollinearity and the second with multicollinearity. Two application cases analyzed by Draper and Smith (2014) using multiple regression are also presented. The R statistical package was used in both the simulations and the applications. The results are compared with classical multiple regression and with regression based on principal components. The proposed analysis is a modeling alternative that corrects collinearity and allows the predictor variables to be used without loss of information. In addition, because the technique transforms the original variables into coordinates, the effect of the observed variables is hidden in the model, so that the results are not manipulated.

8. Methodology and results

We detect South Africa, Luxembourg, India, the United States and Russia as atypical observations. An observation is said to have high leverage if its hat value is greater than twice the average leverage of the observations considered. Our model has 43 observations, so we flag the points whose leverage exceeds twice the average leverage value. In our case, the average leverage is:

$\displaystyle K=4,\quad N=43$

$\displaystyle\frac{K}{N}=\frac{4}{43}\approx 0.0930$

Twice the average leverage is therefore approximately 0.1860.
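The leverage criterion can be sketched as follows. The hat values are the diagonal of the hat matrix $H=X(X^{\prime}X)^{-1}X^{\prime}$, and their average equals $K/N$. The sizes $N=43$ and $K=4$ follow the text, but the data below are synthetic, the helper `hat_values` is a hypothetical name, and $K$ is assumed here to count the intercept column.

```python
import numpy as np

def hat_values(X):
    """Diagonal of H = X (X'X)^{-1} X' for a design matrix X that includes the intercept."""
    X = np.asarray(X, dtype=float)
    Q, _ = np.linalg.qr(X)            # thin QR decomposition, so that H = Q Q'
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(6)
N, K = 43, 4                          # sizes from the text; the data are synthetic
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
X[0, 1:] = [8.0, -8.0, 8.0]           # one deliberately extreme observation

h = hat_values(X)
threshold = 2.0 * K / N               # twice the average leverage, 2 * 4/43
high_leverage = np.flatnonzero(h > threshold)   # indices of flagged observations
```

Observations whose index appears in `high_leverage` would be the candidates for closer inspection under this rule.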

We computed the correlation matrix of the regressors:

	Lifeexp	GDPpc	InfantMort	Healthexp
Lifeexp	1.000000	0.634274	-0.830252	0.622426
GDPpc	0.634274	1.000000	-0.605972	0.854895
InfantMort	-0.830252	-0.605972	1.000000	-0.539039
Healthexp	0.622426	0.854895	-0.539039	1.000000

We note that GDPpc and HealthExp have a correlation coefficient $>$ 0.8.

This high simple correlation coefficient between the two regressors indicates the presence of multicollinearity in our model. Another suitable procedure for evaluating the possible presence of multicollinearity is the variance inflation factor (VIF) of each regressor. The VIF is a statistic that determines whether the variance of an estimator is inflated by the presence of collinearity in the model. Multicollinearity has a great influence of smaller inlier error than its predecessors, and overall the contained datasets are computationally expensive.

	VIF factor	Features
1	3.737415	Lifeexp
2	4.176182	GDPpc
3	3.397794	InfantMort
4	3.946752	Healthexp
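One common way to obtain variance inflation factors is from the diagonal of the inverse correlation matrix of the regressors, which equals $1/(1-R_{j}^{2})$ for each regressor $j$. The sketch below uses synthetic data and hypothetical variable names; it is not necessarily the computation behind the table above and is not expected to reproduce its values.

```python
import numpy as np

def vif_from_data(X):
    """VIF of each column of X: diagonal of the inverse of the correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(7)
n = 300
income = rng.normal(size=n)                          # hypothetical regressor
spending = 0.9 * income + 0.3 * rng.normal(size=n)   # strongly tied to income
other = rng.normal(size=n)                           # unrelated regressor
X = np.column_stack([income, spending, other])

vif = vif_from_data(X)   # large for the two related columns, near 1 for the third
```

A VIF close to 1 indicates that a regressor is nearly uncorrelated with the others, and larger values indicate a stronger inflation of the variance of its coefficient.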

As expected, GDPpc and HealthExp are the two regressors with the highest VIF, because they explain much of the same variance in the dataset. When there is multicollinearity, the tests of individual significance of the model parameters are affected. Since there is notable multicollinearity between these two regressors, if we eliminate the GDPpc regressor (the regressor with the highest $P$-value, and also the highest VIF among the model regressors), we can build a new regression model with only HealthExp and InfantMort as regressors. If we compare the tests of individual significance in the two models, we observe very notable differences. (In addition, the residual standard error is slightly smaller and we gain one degree of freedom.) HealthExp becomes a statistically significant regressor.
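The comparison between the full and the reduced model can be imitated on synthetic data. In the sketch below, the regressors `x_a`, `x_b` and `x_c` and the helper `ols_summary` are hypothetical and do not correspond to the paper's variables or results; the point is only that dropping one of two nearly collinear regressors sharply reduces the standard error of the one that remains.

```python
import numpy as np

def ols_summary(X, y):
    """Return (coefficients, standard errors, t statistics) for OLS with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return coef, se, coef / se

rng = np.random.default_rng(8)
n = 43
x_a = rng.normal(size=n)                      # regressor that will be dropped
x_b = x_a + 0.1 * rng.normal(size=n)          # nearly collinear with x_a
x_c = rng.normal(size=n)
y = 1.0 + x_a + x_b + 0.5 * x_c + rng.normal(size=n)

# Full model: positions are intercept, x_a, x_b, x_c
_, se_full, t_full = ols_summary(np.column_stack([x_a, x_b, x_c]), y)
# Reduced model: positions are intercept, x_b, x_c
_, se_reduced, t_reduced = ols_summary(np.column_stack([x_b, x_c]), y)
```

Comparing `t_full[2]` with `t_reduced[1]` shows how the individual significance of `x_b` changes once `x_a` is removed.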

In short, the tests of individual significance are affected by the presence of multicollinearity, which has a destabilizing effect on the individual $t$-tests. In the presence of multicollinearity, one may conclude that an explanatory variable is irrelevant (as with HealthExp in the first model) when in reality it is significant, as we have shown with the second model without multicollinearity. Note that, when making predictions, a model with some degree of collinearity does not necessarily fail to achieve a good predictive fit, since the correlation between the explanatory variables will also be present in the prediction data. Therefore, nothing prevents the predictions of our first model from being reliable.

9. Conclusion

In this paper, we reviewed the problem of multicollinearity between independent variables and its main consequences for the “explanatory” use of linear models. We described some procedures used to detect it, as well as different traditional strategies to correct it. We also presented an alternative method which, through an integral interpretation of the variables, allows us to answer the explanatory problem of the regression. The presence of multicollinearity prevents an adequate assessment of the individual influence of each independent variable on the dependent one. The solution proposed here, when the aim is to examine the relationship of the dependent variable with the independent ones, is to remove the regressor with the highest $P$-value.

References

[1] Asar. Some new methods to solve multicollinearity in logistic regression. Communications in Statistics – Simulation and Computation. 2015; 46(4): 2576-2586.

[2] Bager, Roman, Algedih, et al. Addressing multicollinearity in regression models: a ridge regression application, 2017.

[3] Chatterjee, Hadi. Regression Analysis by Example. John Wiley & Sons, 2015.

[4] Duzan. Solution to the Multicollinearity Problem by Adding some Constant to the Diagonal. Journal of Modern Applied Statistical Methods. 2016; 15(1).

[5] Gogtay, Deshpande, Thatte. Principles of Regression Analysis. Journal of The Association of Physicians of India. 2017; 65: 48-52.

[6] Kalnins. Multicollinearity: How common factors cause Type 1 errors in multivariate regression. Strategic Management Journal. 2018; 39(8): 2362-2385.

[7] Katrutsa, Strijova. Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria. Expert Systems with Applications. 2017; 76: 1-11.

[8] Rodríguez-Sánchez, RS-G, García-García. Diagnosis and quantification of the non-essential collinearity. Computational Statistics. 2019; 35: 647-666.

[9] Saeed, Haewoon Nam. A Survey on Multidimensional Scaling. ACM Computing Surveys. 2018 May; 51(3).