2022-11-18

Multicollinearity

What is multicollinearity

When two explanatory variables are strongly correlated with each other, they are said to be collinear. For example, when height and weight are both included as explanatory variables in a model, the model exhibits collinearity because height and weight are correlated.

Multicollinearity refers to a state in which multiple collinear relationships occur in a multivariate analysis such as multiple regression analysis. In other words, the model contains multiple highly correlated combinations of explanatory variables.

Problems with multicollinearity

Multicollinearity must be taken into account when analyzing data; failing to do so can lead to erroneous conclusions.

The problem with multicollinearity is that it can lead to β error, which makes it easy to miss a variable that significantly affects the target variable.

For example, suppose that running speed is the objective variable and that height and weight are explanatory variables. Suppose further that height genuinely determines running speed, while weight has no direct effect. Because height and weight are correlated, however, weight also appears to explain running speed, and the model cannot tell which of the two correlated factors is the true determinant. This ambiguity inflates the standard errors of both coefficients, and the larger the standard error, the harder it is to obtain a significant relationship. As a result, both height and weight may be judged not to be factors of running speed.

In other words, the problem with multicollinearity is that the standard errors of the collinear explanatory variables become abnormally large, making it impossible to obtain significance for explanatory variables that should be significant.
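To make the inflation concrete, here is a minimal simulation sketch in Python (assuming numpy and statsmodels; the data and coefficients are synthetic, not taken from the running speed example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Two strongly correlated predictors (think height and weight)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1

# Only x1 truly drives the objective variable
y = 0.5 * x1 + rng.normal(size=n)

# Fit with both correlated predictors: the standard errors blow up
X_both = sm.add_constant(np.column_stack([x1, x2]))
print(sm.OLS(y, X_both).fit().bse)

# Fit with x1 alone: the standard error is an order of magnitude smaller
X_one = sm.add_constant(x1)
print(sm.OLS(y, X_one).fit().bse)
```

Even though x1 has a real effect, its coefficient is clearly significant on its own yet fails to reach significance once the collinear x2 is included, because its standard error is inflated roughly tenfold.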

VIF, the criterion for multicollinearity

The presence or absence of multicollinearity can be judged using the VIF (Variance Inflation Factor), a value calculated for each explanatory variable with the following formula.

VIF_i = \frac{1}{1 - R^2_i}

Here, R^2_i is the coefficient of determination of the auxiliary regression in which x_i, the explanatory variable whose VIF is being calculated, is treated as the objective variable and all the other explanatory variables as predictors.

While opinions differ on the exact cutoff, VIF < 10 is often used as a minimum criterion; that is, a VIF greater than 10 indicates that multicollinearity is present. However, since multivariate analysis inherently assumes no correlation between the explanatory variables, the model results arguably begin to be distorted once the VIF exceeds about 3.
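As a sketch of computing VIF in practice, the following assumes pandas and statsmodels, whose variance_inflation_factor implements the formula above; the height/weight/age data are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, size=n)
weight = 0.9 * height + rng.normal(0, 2, size=n)  # strongly tied to height
age = rng.normal(40, 10, size=n)                  # unrelated predictor

# Include an intercept so each auxiliary regression has a constant term
X = sm.add_constant(pd.DataFrame({"height": height, "weight": weight, "age": age}))

# VIF_i = 1 / (1 - R^2_i), computed for each explanatory variable
for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, round(variance_inflation_factor(X.values, i), 2))
```

Height and weight should come out with VIFs well above 10, while age stays close to 1, matching the rule of thumb above.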

Why the correlation coefficient is not sufficient to detect multicollinearity

The correlation coefficient is not sufficient to detect multicollinearity because it expresses only the relationship between two variables. For example, no single pair of variables may be strongly correlated even though three or more variables are nearly linearly dependent on each other. In such a case, pairwise correlation coefficients cannot reveal the relationship among the three variables.
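A small sketch of that situation (synthetic data, assuming numpy, pandas, and statsmodels): x4 is an almost exact linear combination of x1, x2, and x3, yet no pairwise correlation exceeds about 0.6, so only the VIF exposes the problem.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
x4 = x1 + x2 + x3 + rng.normal(scale=0.05, size=n)  # near-exact combination

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})
print(df.corr().round(2))  # largest pairwise correlation is only ~0.58

X = sm.add_constant(df)
for i, col in enumerate(df.columns, start=1):
    print(col, round(variance_inflation_factor(X.values, i), 1))  # all VIFs are huge
```

Every explanatory variable gets an enormous VIF here, even though the pairwise correlation matrix would suggest nothing worse than moderate correlation.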

How to eliminate multicollinearity

There are two major methods to avoid multicollinearity:

  • Remove the relevant explanatory variables
    In the running speed example above, multicollinearity can be avoided by removing weight from the explanatory variables.
  • Dimension reduction by PCA
    The synthetic variables called principal components generated by PCA (Principal Component Analysis) are uncorrelated with each other, eliminating the concern of multicollinearity (see the sketch after this list).
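As a minimal sketch of the PCA option (assuming scikit-learn; the height/weight data are synthetic), the principal component scores are uncorrelated by construction and can therefore be used as collinearity-free predictors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, size=n)
weight = 0.9 * height + rng.normal(0, 2, size=n)  # collinear pair
X = np.column_stack([height, weight])

# Standardize, then project onto the principal components
X_scaled = StandardScaler().fit_transform(X)
pcs = PCA(n_components=2).fit_transform(X_scaled)

# Off-diagonal correlations of the scores are (numerically) zero
print(np.corrcoef(pcs, rowvar=False).round(6))
```

The trade-off is interpretability: each principal component is a mixture of the original variables, so a coefficient on a component no longer refers directly to height or weight alone.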
