
Perfect Multicollinearity in Exploratory Regression, FAST response needed

01-02-2021 10:33 PM
autumn
by
New Contributor

Thank you to everyone reading. I would love a response ASAP! Licensing, tech, travel, and snow issues have massively delayed my project, and my presentation is 2.5 days from now 🙃

I am trying to run a regression to explore the relationship between human community statistics (demography, rurality, water use, etc.) and the total amount of ecosystem service provision area in Iowa, by county. To prepare my data for the regression, I did the following:

I joined a layer of Iowa county polygons to a table of candidate variables (note: for all joins, features were exported to a new layer to avoid issues). For the dependent variable, I merged polygons into one layer per type of ecosystem service provision area in Iowa: Quality areas (merged polygons for protected lands), Supply areas (merged polygons for all public areas of natural landscape or greenspace), and Impaired areas (a merged layer of points for spill incidents, emission points, wastewater outfalls, etc.). I then spatially joined each provision-area layer to the county polygon layer, creating a dataset with 99 records (one per Iowa county) containing the total provision area within the county and all associated county demographic stats.

I used the shape area field of the county provision-area polygons as the dependent variable for the exploratory regression. I chose my candidate variables, left what I believe were all the default options (not sure), and ran the regression. Result: no significance, and warning 001304 for the vast majority of possible models (WARNING 001304: Unable to estimate the following models due to severe multicollinearity (data redundancy).) Some models could be estimated, but the majority had warning 001176: Perfect Multicollinearity. A relatively brief sample of the output is at the end of the post. The full details/running script for the process is attached as a PDF.

I chose a variety of candidate variables, some of which could not possibly be perfectly multicollinear with each other. All the help I could find recommends removing redundant variables until the issue is resolved (until VIF decreases sufficiently), but the multicollinearity shows up in almost every possible combination of variables.
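For anyone wanting to check the VIF advice outside the tool, here is a minimal sketch in Python with NumPy. It is not the Exploratory Regression implementation, and the data are synthetic stand-ins for 99 counties (the variable names, the 2.4 persons-per-household ratio, and the 4% vacancy share are all made up for illustration). It computes a VIF for each column by regressing it on the others, and compares raw counts against the same information expressed as per-capita rates.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = records)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

# Synthetic data only: NOT the Iowa table from this post.
rng = np.random.default_rng(0)
population = rng.lognormal(mean=10, sigma=1, size=99)
households = population / 2.4 * rng.normal(1.0, 0.02, size=99)  # tracks population closely
vacant = population * 0.04 * rng.normal(1.0, 0.3, size=99)       # also scales with population

# Raw counts all grow with county size, so they carry nearly the same information.
counts = np.column_stack([population, households, vacant])
# Dividing by population removes the shared "size of county" signal.
rates = np.column_stack([population, households / population, vacant / population])

vif_counts = vif(counts)
vif_rates = vif(rates)
print("VIF, raw counts:", np.round(vif_counts, 1))
print("VIF, per-capita rates:", np.round(vif_rates, 2))
```

In this made-up example the raw counts inflate each other heavily while the rates do not, which is one reason count variables such as total population, total households, and population by race can look redundant in almost every combination.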

I feel like it might be because of how I did my joins, or maybe that I can't use shape area as a dependent variable, but I have very little idea what is wrong. I would love help on the exploratory regression, but I am also open to other data analysis ideas. If you have any questions I would be happy to answer.
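Since the joins are one of the suspects, a quick sanity check on the joined table may help rule them in or out. The sketch below is only an illustration under stated assumptions: the table is a hypothetical four-row example, and STATE_FIPS and POP_COPY are invented field names. It flags columns that are constant (a constant column is an exact multiple of the intercept) and columns that are exact copies of each other, both of which produce perfect multicollinearity.

```python
import numpy as np

def suspicious_columns(table):
    """Return (constant, duplicates) for a dict of field name -> 1-D numeric values."""
    arrays = {name: np.asarray(values, dtype=float) for name, values in table.items()}
    names = list(arrays)
    # A column with zero spread is collinear with the regression intercept.
    constant = [n for n in names if arrays[n].max() - arrays[n].min() == 0]
    # Two identical columns are collinear with each other.
    duplicates = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if np.array_equal(arrays[a], arrays[b])
    ]
    return constant, duplicates

# Hypothetical joined table; the values and the last two field names are made up.
joined = {
    "TOTAL_POPULATION": [100, 250, 90, 400],
    "TOTAL_HOUSEHOLDS": [40, 110, 35, 170],
    "STATE_FIPS": [19, 19, 19, 19],       # same value in every record
    "POP_COPY": [100, 250, 90, 400],      # accidental duplicate of TOTAL_POPULATION
}

constant, duplicates = suspicious_columns(joined)
print("constant fields:", constant)
print("duplicate field pairs:", duplicates)
```

If a join repeated the same value into every county, or duplicated a field, this kind of check would show it. If nothing is flagged, the joins are less likely to be the cause and the overlap among the variables themselves is the more plausible explanation.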

  • Choose 3 of 12 Summary

    Highest Adjusted R-Squared Results

    AdjR2    AICc   JB K(BP)  VIF   SA   Model
     0.31 3725.42 0.00  0.99 7.57 0.01  -HOUSEHOLDS_WITH_INDIVIDUALS_65_YEARS_AND_OVER***  +VACANT_HOUSING_UNITS***  +IRGCROPSCONSUMPTIVEUSE4CROPSFRESH_MGAL_D**
     0.31 3726.49 0.00  0.96 7.93 0.01  -TOTAL_HOUSEHOLDS***  +VACANT_HOUSING_UNITS***  +IRGCROPSCONSUMPTIVEUSE4CROPSFRESH_MGAL_D**
     0.30 3727.41 0.00  0.93 7.55 0.01  -TOTAL_POPULATION***  +VACANT_HOUSING_UNITS***  +IRGCROPSCONSUMPTIVEUSE4CROPSFRESH_MGAL_D**

           Passing Models      

    AdjR2 AICc JB K(BP) VIF SA   Model

     WARNING 001304: Unable to estimate the following models due to severe multicollinearity (data redundancy).

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + MEDIAN_AGE__YEARS_ + TOTAL_HOUSEHOLDS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + WHITE + BLACK_OR_AFRICAN_AMERICAN.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + WHITE + TOTAL_HOUSEHOLDS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + WHITE + RENTER_OCCUPIED_HOUSING_UNITS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + BLACK_OR_AFRICAN_AMERICAN + TOTAL_HOUSEHOLDS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + HOUSEHOLDS_WITH_INDIVIDUALS_UNDER_18_YEARS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + HOUSEHOLDS_WITH_INDIVIDUALS_65_YEARS_AND_OVER.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + VACANT_HOUSING_UNITS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + RENTER_OCCUPIED_HOUSING_UNITS.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + RURALITY_INDEX.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + PSPERCAPITAUSE_GAL_PERSON_D.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + TOTAL_POPULATION + TOTAL_HOUSEHOLDS + IRGCROPSCONSUMPTIVEUSE4CROPSFRESH_MGAL_D.

     WARNING 001176: Perfect Multicollinearity in: SHAPE_AREA ~ Intercept + WHITE + BLACK_OR_AFRICAN_AMERICAN + TOTAL_HOUSEHOLDS.

3 Replies
DanPatterson
MVP Esteemed Contributor

If the population density is roughly the same across your area, then it would make sense that any model relating area to total population or total households would produce the 001176 warning.

As for IRGCROPSCONSUMPTIVEUSE4CROPSFRESH_MGAL_D, I'm still trying to figure out what that is 😉


... sort of retired...
autumn
by
New Contributor

That would be irrigation water use in the county, so it especially should not exhibit multicollinearity with population, because it depends on agricultural land. There are also other water use variables included, as well as rurality (which should exhibit the opposite of multicollinearity with total households).

It is for this reason that I think there must be some error in the setup or design; the current results make no sense.

DanPatterson
MVP Esteemed Contributor

Did you dump Total Population and reassess with it removed?
I don't think multicollinearity is directional

Multicollinearity refers to a situation in which more than two explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation between two independent variables is equal to 1 or −1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables.

https://en.wikipedia.org/wiki/Multicollinearity#:~:text=Multicollinearity%20refers%20to%20a%20situat... 
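The quoted definition describes the two-variable case (correlation of exactly 1 or −1). More generally, perfect multicollinearity means one explanatory column is an exact linear combination of the others, which shows up as a design matrix whose rank is lower than its number of columns. A minimal sketch with NumPy, using synthetic numbers and variable names that only loosely mirror the fields in this thread (they are assumptions, not the actual data):

```python
import numpy as np

# Synthetic data only; the names mimic census-style fields for illustration.
rng = np.random.default_rng(1)
n = 99
white = rng.integers(1_000, 50_000, size=n).astype(float)
other = rng.integers(0, 5_000, size=n).astype(float)
total = white + other  # exact linear combination of the two columns above

# Perfect case: intercept, total, and both of its components in one model.
X_perfect = np.column_stack([np.ones(n), total, white, other])
rank_perfect = np.linalg.matrix_rank(X_perfect)

# Near-perfect case: households roughly proportional to population, plus noise.
households = total / 2.4 + rng.normal(0, 50, size=n)
X_near = np.column_stack([np.ones(n), total, households])
rank_near = np.linalg.matrix_rank(X_near)
corr = np.corrcoef(total, households)[0, 1]

print("columns:", X_perfect.shape[1], "rank:", rank_perfect)
print("columns:", X_near.shape[1], "rank:", rank_near, "correlation:", round(corr, 4))
```

In the first matrix no unique set of coefficients exists, because any change in the coefficient on total can be offset by the coefficients on its parts. In the second the model is technically estimable but the two variables are almost interchangeable. Whether a given tool reports that near-exact case as "perfect" depends on its numerical tolerance, which is not confirmed anywhere in this thread.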

