Abstract
Multicollinearity occurs when the independent variables are highly correlated with one another. This correlation is a problem because the independent variables are supposed to be independent. The higher the degree of correlation, the harder it becomes to fit the model and interpret the results. In this paper, we address the problem of multicollinearity on the basis of hat values: the variables with higher hat values are removed from the data before the model is fitted. We compare the results achieved by the proposed technique with those of state-of-the-art methods.
Introduction
Regression analysis is a statistical process that allows us to analyze the relationship between two or more variables, one of which depends on the others included in the calculation [3]. In other words, regression analysis makes it possible to understand how the independent variables affect another variable that depends on them. Because regression analysis makes it easier to estimate a future value of a variable, it has many day-to-day applications, at the business level as well as at the personal or social level. For example, it can be used to evaluate the risk of accidents on a certain stretch of road with respect to its geography, or to check the effectiveness of a change made in a commercial or academic project based on the results obtained after introducing it. Regression analysis is widely used in the corporate world. Thanks to the results it produces, companies can better understand which elements have the greatest impact on results, which affect other elements of the company, and which can be ignored. In this way, companies obtain important information that they can quickly apply to improve their efficiency.
To apply such an analysis, we will necessarily use two types of variables.
Dependent variables: those that we seek to study through statistical regression, in order to understand how they respond when the independent variables change.
Independent variables: the factors that we consider to influence and directly affect the dependent variables under study.
We can fit three different types of model, depending on the number of variables and the way they interact [5]:
Simple linear regression model
Multiple linear regression model
Nonlinear regression model
The choice between one model and another depends on the number of variables that we need to include.
Simple linear regression analysis is the simplest and most widely used of these. It studies the effect of one independent variable on a single variable that depends on it, or that we have at least considered to be dependent at a theoretical level. Using the simple linear regression equation, an estimate can be made based on the data obtained.
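$$y = \beta_0 + \beta_1 x + \varepsilon$$
where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope coefficient and $\varepsilon$ is the random error term.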
Multiple linear regression model
In the case of multiple linear regression, we find a model that simply has more than one independent variable. We will apply this model when we have reasons to believe that there is more than one factor that affects the study variable.
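$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
Again, $y$ is the dependent variable, $\beta_0$ is the intercept and $\varepsilon$ is the random error term, while $x_1, \dots, x_k$ are the independent variables and $\beta_1, \dots, \beta_k$ are their coefficients.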
Nonlinear regression model
There are occasions in which the relationship between the independent variables and the dependent variable is not linear but instead shows, for example, exponential growth. In those cases, the nonlinear regression model comes into play and allows us to approximate the values of the dependent variable in a nonlinear setting. Bear in mind that nonlinear regression is more complex, since the number of parameters may not coincide with the number of independent variables.
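As a rough illustration of the exponential case mentioned above, one common approach (an assumption here, not a method taken from the text) is to linearize $y = a e^{bx}$ by taking logarithms and then fit a simple linear regression of $\ln y$ on $x$. The function names and the data are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by regressing ln(y) on x (requires y > 0)."""
    log_a, b = fit_line(x, [math.log(yi) for yi in y])
    return math.exp(log_a), b

# Hypothetical, noise-free data generated from y = 2 * exp(0.5 * x)
x = [0, 1, 2, 3, 4, 5]
y = [2 * math.exp(0.5 * xi) for xi in x]
a, b = fit_exponential(x, y)
print(round(a, 6), round(b, 6))
```

Fitting on the log scale minimizes the errors in $\ln y$ rather than in $y$, so with noisy data the estimates can differ from those of a direct nonlinear least-squares fit.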
Something to keep in mind is that these predictive models are not exact. It is also possible to confuse the correlation between two variables with causality. If there is no logical reason for the variables to be related, we can reach wrong conclusions when analyzing data that are not related in reality.
As we can see, there are many different ways of applying a regression analysis, each one adapted to the particular needs of each case study. And although it is necessary to exercise caution and choose the variables to be studied well, regression allows us to obtain valuable data that we can use to our advantage.
Multicollinearity
A topic of growing interest is the study and modeling of the association between variables. In different areas of knowledge such as biology, ecology, medicine, psychology and the human sciences in general, among others, situations arise where the researcher is interested in fitting a model that includes each of the observed regressor variables. Often when modeling this type of information, the assumptions necessary to fit a multiple linear regression are not satisfied; in particular, the problem of collinearity between the regressor variables arises. The specialized literature offers alternative methods for modeling in the presence of multicollinearity between predictors. Multicollinearity implies the existence of a linear dependence between regressor variables, which brings problems of non-unique estimation of the parameters and therefore a false relationship between the explanatory variables and the response variable.
The term collinearity (or multicollinearity) in econometrics refers to a situation in which two or more explanatory variables are very similar and, therefore, it is difficult to measure their individual effects on the explained variable [7]. This phenomenon occurs frequently in the context of time series, and with macroeconomic series in particular. For example, population and GDP generally tend to be highly correlated.
We can find:
Exact multicollinearity: It occurs when the values of an explanatory variable are obtained as an exact linear combination of others.
Degree multicollinearity: It occurs when the values of different variables are so correlated that it is almost impossible to accurately estimate the individual effects of each of them.
To decide whether degree collinearity is a problem, we must take into account the objectives of our particular analysis. For example, collinearity is not much of a concern if our goal is to predict, but it is a very serious problem if the analysis focuses on interpreting the parameter estimates. A first approach to diagnosing it consists of obtaining the simple sample correlation coefficients for each pair of explanatory variables and checking whether the degree of correlation between these variables is high. However, there may be an almost perfect linear relationship among three or more variables even though none of the simple correlations between pairs of variables is greater than 0.5. Another method is to regress each explanatory variable on the rest. One regression is performed for each explanatory variable, and the coefficients of determination are obtained. If any of them is high, we can suspect the existence of collinearity. The collinearity problem boils down to the fact that the sample does not contain enough information to estimate all the parameters. Therefore, solving the problem requires adding new information, whether sample or extra-sample information, or changing the specification [6]. Some possible solutions along these lines are:
Add new observations: If it really is a sample problem, one possibility is to change the sample, because the problem may be resolved with new data, although this does not always happen. The idea is to obtain data that are less correlated than the previous ones, either by changing the whole sample or simply by adding more data to the initial sample. It is not always easy to obtain better data, so most likely we must live with the problem, being careful with the inference made and the conclusions drawn from it.
Restrict parameters: If economic theory or experience suggests some restrictions on the parameters most affected by collinearity, imposing them will reduce the problem. Obviously, we run the risk of imposing restrictions that are not true.
Delete variables: If variables that are correlated with others are removed, the loss of explanatory power will be small and the collinearity will be reduced. This measure can cause other problems: if the variable that we eliminate from the model is really significant, we will be omitting a relevant variable, which will bias the estimators of the model coefficients and of their variance, so that the inference made would not be valid.
Transform the variables of the model: If the collinearity arises from relating time series that share a trend, it may be convenient to transform the variables to eliminate this trend.
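The caveat about pairwise correlations raised before the list of remedies can be checked with a small constructed example. The three hypothetical indicator variables below are an assumption made for illustration: they always sum to one, so together with the intercept they are exactly collinear, yet no pairwise correlation exceeds 0.5 in absolute value.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical indicators: each observation belongs to exactly one of three groups
x1 = [1, 0, 0] * 4
x2 = [0, 1, 0] * 4
x3 = [0, 0, 1] * 4

pairwise = [pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)]
# Exact linear dependence: x3 can be written as 1 - x1 - x2 for every row
exact = all(c == 1 - a - b for a, b, c in zip(x1, x2, x3))
print(pairwise, exact)
```

In a case like this, inspecting the correlation matrix alone would not reveal the problem, whereas regressing any one of the three variables on the other two would fit it perfectly.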
Imperfect multicollinearity is a very common problem in econometric models that violates the basic hypothesis of independence between the explanatory variables.
If a model exhibits multicollinearity, the ordinary least squares estimators are still the best estimators that can be obtained and they fulfill many of the desired properties for an estimator, such as unbiasedness and efficiency.
The following consequences are derived from imperfect multicollinearity:
High standard errors of estimation, or large variances of the estimators.
Instability of the estimators under small variations in the sample. This problem is a direct consequence of the previous one: if the variances of the estimators are large, the estimators are more unstable.
Difficulty in interpreting the coefficients and therefore their estimates. The regression coefficients (
Since multicollinearity is a sampling problem, in many cases it can be solved simply by expanding the sample. However, if we had access to more information, we should have used it from the beginning. A possible solution for imperfect multicollinearity is to eliminate some of the variables that cause it. In this case, we may incur a specification error by omitting a relevant variable [8].
It must be taken into account that the greater the information shared by the variables, that is, the greater the degree of multicollinearity, the lower the risk of making an omission error when eliminating one of the variables that generate it. If the objective of the model is mainly predictive, we can consider eliminating variables to solve a multicollinearity problem, but if the aim is to find the factors that affect a variable, we should not eliminate any factor.
Another practical solution that is frequently used is to transform the variables included in the model, in an attempt to make the transformed variables show lower linear correlations [4]. The most commonly used transformations are taking the increments of the variables (if they are time series) or expressing them relative to a common variable (for example, putting them in per capita terms).
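A minimal sketch of the first transformation mentioned above (increments of a time series), using two hypothetical trending series built for illustration. The levels are almost perfectly correlated because both series share a trend, while their first differences are not.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_differences(series):
    """Increments x[t] - x[t-1] of a time series."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical series: linear trends plus small out-of-phase cycles
t = range(21)
x = [ti + [0, 1, 0, -1][ti % 4] for ti in t]
y = [2 * ti + [1, 0, -1, 0][ti % 4] for ti in t]

r_levels = pearson(x, y)
r_diffs = pearson(first_differences(x), first_differences(y))
print(round(r_levels, 3), round(r_diffs, 3))
```

Differencing changes what the coefficients mean (they then relate changes rather than levels), so the transformed model answers a slightly different question from the original one.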
Proposed exercise
Estimate the new model with the per capita variables and study its multicollinearity.
Another solution would be to use other estimation methods. However, if the aims pursued with the construction of the model are predictive, the problem of multicollinearity is not so relevant, since it does not affect the joint explanatory capacity of the variables and, therefore, their predictive capacity.
In the application of statistical models in clinical research, multicollinearity in regression models applied to observational studies is a frequent problem. The authors of the paper review the concept, origin and implications of multicollinearity. Some procedures to detect it and the methods that are frequently used to correct it are presented. The use of cluster analysis is proposed as an alternative strategy in problems involving highly correlated covariates [1].
This work addresses the problem of multicollinearity between regressor variables in the multiple linear regression model. There are diverse settings in the agricultural sciences where this difficulty can arise. In order to create the multicollinearity in the data needed for the study, explanatory variables were generated so that two of them had a certain degree of dependence, that is, were “almost a linear combination”. Once these conditions are created, the multicollinearity analysis proceeds through symptoms, diagnosis and treatment. For the symptoms, the pairwise correlations between regressor variables, the partial and total F tests on the regression coefficients, the standard error of each estimator and the coefficient of determination, among other aspects, are analyzed. For the diagnosis, the diagonalization of the correlation matrix and the examination of the last eigenvalues were used, which provide precise information. For the treatment, ridge regression and principal components regression are considered, which are effective in describing the estimators in the multiple linear regression model with accuracy and precision [2].
A strategy based on multidimensional scaling analysis has been presented to treat the multicollinearity problem in multiple regression analysis [9], when the regressor variables are qualitative, quantitative or mixed (quantitative and qualitative) and the response variable is continuous. The purpose is to obtain the matrix of principal coordinates, using the Gower distance as the metric when the predictor variables are mixed, or otherwise an appropriate Euclidean-type distance selected by the researcher, and to estimate the regression model with this matrix. To assess the benefits of the proposed method, two simulation cases are carried out: the first without multicollinearity and the second with multicollinearity. Two application cases analyzed by Draper and Smith (2014) using multiple regression are also presented. The R statistical package was used both in the simulations and in the applications. The results of the simulations and applications are compared with classical multiple regression and with regression based on principal components. The proposed analysis is a modeling alternative that corrects collinearity and allows the predictor variables to be used without loss of information. Additionally, because the technique transforms the original variables into coordinates, the modeling hides the effect of the observed variables, so that the results are not manipulated.
Methodology and results
We detect South Africa, Luxembourg, India, the United States and Russia as atypical observations. An observation is said to have high leverage if its hat value is greater than twice the average leverage of the observations considered. Our model has 43 observations, so we apply this criterion and flag the points whose leverage is greater than twice the average leverage value. In our case, the average leverage is 0.0930232558, and twice this value is 0.1860465116.
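The leverage rule described above can be sketched as follows. For brevity the example assumes a regression with an intercept and a single regressor, where the hat value has the closed form $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$; the data and function names are hypothetical and are not the paper's dataset. The last lines reproduce the quoted threshold under the assumption that the fitted model has p = 4 coefficients, which is consistent with 43 observations and a cutoff of 0.1860465116 but is not stated in the text.

```python
def hat_values_simple(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / Sxx for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def high_leverage(h, n_params):
    """Indices whose leverage exceeds twice the average leverage p / n."""
    cutoff = 2 * n_params / len(h)
    return [i for i, hi in enumerate(h) if hi > cutoff]

# Hypothetical regressor with one extreme value at the end
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = hat_values_simple(x)
flagged = high_leverage(h, n_params=2)  # intercept + slope
# The hat values always sum to the number of estimated parameters
print(flagged, round(sum(h), 6))

# Cutoff for n = 43 observations, ASSUMING p = 4 estimated coefficients
cutoff_43 = 2 * 4 / 43
print(cutoff_43)
```

With several regressors the hat values are the diagonal of $X (X^\top X)^{-1} X^\top$, and the same cutoff of twice $p / n$ applies.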
We computed the correlation matrix of the regressors:
We note that GDPpc and HealthExp have a correlation coefficient
This high simple correlation coefficient between the two regressors indicates the presence of multicollinearity in our model. Another suitable procedure for evaluating the possible presence of multicollinearity is the variance inflation factor (VIF) of each regressor. The VIF is a statistic that allows us to determine whether the variance of an estimator is inflated by the presence of collinearity in the model. Multicollinearity have a great influence of smaller inlier error then its predecessors and overall contain dataset are computationally expensive.
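The variance inflation factor described above can be sketched for the special case of a model with only two regressors, where the auxiliary regression of one on the other has $R^2 = r^2$ and both regressors share the same value $1 / (1 - r^2)$. With more regressors, each $R_j^2$ would come from regressing regressor $j$ on all the others. The numbers below are hypothetical and are not the paper's GDPpc and HealthExp data.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - R^2); with two regressors R^2 equals r squared."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, strongly related regressors (illustrative values only)
gdp_pc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
health_exp = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]

vif = vif_two_regressors(gdp_pc, health_exp)
print(round(vif, 1))
```

A common rule of thumb treats values above 10 (sometimes 5) as a sign of problematic collinearity.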
As expected, GDPpc and HealthExp are the two regressors with inflated variance, because they explain the same variation in the dataset. When there is multicollinearity, the tests of individual significance of the model parameters are affected. Since there is marked multicollinearity between these two regressors, if we eliminate the GDP regressor (the regressor that has the highest
In short, the tests of individual significance are affected by the presence of multicollinearity, since it has a destabilizing effect on the individual t tests. In the presence of multicollinearity, one may conclude that an explanatory variable is irrelevant (as in the first model with HealthExp) when in reality it is significant, as we have shown in the second model without multicollinearity. Note that, for making predictions, a degree of collinearity in a model does not necessarily prevent it from achieving a good predictive fit, since the correlation between the explanatory variables will also be present in the prediction. Therefore, nothing prevents the predictions from our first model from being reliable.
Conclusion
In this document, the multicollinearity problem between independent variables and its main consequences in the “explanatory” use of linear models were reviewed. Some procedures that are carried out to detect it were proposed, as well as different traditional strategies to correct it. On the other hand, an alternative method is presented in this paper, which, through an integral interpretation of the variables, allows us to answer the explanatory problem of the regression. The presence of multicollinearity prevents an adequate assessment of the individual influence of the independent variables on the dependent one. The solution proposed here, when it comes to seeing the relationship of the dependent with the independent ones, is to remove the regressor having higher
