Log transformation of variable "income" causes GWR to crap out?

06-24-2011 10:55 PM
AndrewTimleck
New Contributor
While specifying my GWR model and doing my OLS work I found, not surprisingly, that "income" and other variables like "population density" and "call_rate" needed transformation. So I used a normal log transformation, computed new field values for each, and created "log_income", "log_pop_density" and "log_call_rate". Before, when running OLS, everything seemed fine. But now, when running GWR, the analysis craps out with the ole...

    ERROR 040038: Results cannot be computed because of severe model design problems.
    Failed to execute (GeographicallyWeightedRegression).

So I narrowed it down to the "log_income" and "log_pop_density" variables. Whenever "log_income" is included, either as an independent or a dependent variable, the model completely tanks. For "log_pop_density", the model tanks only when it's the dependent variable. Looking at the distribution graphically (attached), income seems OK. Perhaps this isn't the correct (best?) transformation though?

Methodologically I'm working on a citywide grid, measuring about 2000 x 2000 polygons high and wide, each measuring 250' x 250', where almost every one has data within it. I've set bandwidths to extremes, used minimum # of neighbors, AICc, etc. (every iteration possible) and can't seem to get it to fly. Clearly my "transformation" is the problem, but I'm not sure what that problem is.

Much thanks for anyone's eyes on this.
4 Replies
LaurenRosenshein
New Contributor III
Hey Andrew,

I'm sorry you're having trouble getting GWR to run for your analysis.  Severe model design errors often indicate a problem with global or local multicollinearity. To determine where the problem is, run the model using OLS and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from solving. More likely, however, local multicollinearity is the problem. Try creating a thematic map for each explanatory variable. If the map reveals spatial clustering of identical values, consider combining those variables with other explanatory variables to increase value variation. If, for example, you are modeling home values and have variables for both bedrooms and bathrooms, you may want to combine these to increase value variation or represent them as bathroom/bedroom square footage.
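
For anyone who wants to double-check the VIF numbers outside the OLS tool, here is a minimal NumPy sketch. It assumes the explanatory variables have been exported to a plain array; the `vif` helper and the synthetic columns `a`, `b`, `c` are made up for illustration and are not part of ArcGIS. Each variable is regressed on the others, and VIF is computed as 1 / (1 - R^2).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_vars).

    Hypothetical helper: regress each column on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (assumption): c is nearly a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and c should come out far above 7.5, b close to 1
```

Note that this only checks global multicollinearity; it says nothing about the local multicollinearity within each GWR neighborhood.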

Another option is to try transforming it (although not in the traditional sense of logs or powers): create a new field, then calculate the values to be the value (in this case the log) minus the mean for all values in that field.  This doesn't actually change anything (the impact on results), but for some reason we've found that GWR likes variables in that form... and this transformation will often "fix" the problem.
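
The centering step can be sketched in NumPy as below. The income values are synthetic and the variable names are made up for illustration. The thread does not say why GWR prefers centered variables; the condition-number comparison here is only one plausible contributing factor (a log variable with a large mean and small spread is nearly collinear with the intercept), offered as an assumption rather than a confirmed explanation.

```python
import numpy as np

# Synthetic incomes (assumption): log-normal, so the logs cluster near 10.5.
rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.5, sigma=0.4, size=200)

log_income = np.log(income)
# The suggested transformation: the log value minus the mean of all log values.
log_income_centered = log_income - log_income.mean()

def design_condition(x):
    """Condition number of the design matrix [1, x] (hypothetical helper)."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.cond(A)

cond_raw = design_condition(log_income)
cond_centered = design_condition(log_income_centered)

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    A = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)
    return intercept, slope

# A made-up response, used only to show that the slope survives centering.
y = 2.0 + 0.5 * log_income + rng.normal(scale=0.1, size=200)
_, slope_raw = fit_line(log_income, y)
_, slope_centered = fit_line(log_income_centered, y)

print(cond_raw, cond_centered)    # centered design is much better conditioned
print(slope_raw, slope_centered)  # same slope; only the intercept shifts
```

In this sketch the slope is unchanged by centering, which matches the point that the transformation does not affect the results, while the design matrix becomes far easier to solve numerically.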

Also, just a reminder to make sure that you find a properly specified OLS model before moving on to GWR.  There is some great documentation about this, including this recent ArcUser article on Finding a Meaningful Model and the training seminar on Regression Analysis Basics.
AndrewTimleck
New Contributor
Hi Ms. Rosenshein,

Thanks for the response, much appreciate the time looking at my problem. You wrote...

Hey Andrew,
To determine where the problem is, run the model using OLS and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from solving.


Yup. Did this first. Nothing above 1.99 for VIF. I had already dropped a number of variables because of high VIFs. But "Income" stands alone, as does "population density" - two of the variables I'm having the most issues with.


More likely, however, local multicollinearity is the problem. Try creating a thematic map for each explanatory variable. If the map reveals spatial clustering of identical values, consider combining those variables with other explanatory variables to increase value variation.


Yes, did this too. And ran LISA and Moran's I measures for all variables to detect spatial autocorrelation. I see the idea behind combining some variables, but then the model loses specificity in explanatory power (and given the VIFs aren't indicating a model issue, I'm somewhat stumped). For example, in Baltimore City, the area in question, there is high local correlation of "percent black" and "percent living in poverty". However, clearly the two variables are separate "entities" and collapsing them loses far too much explanatory utility. Again, I suspect it's not these variables that are causing the model to fail structurally...


Another option is to try transforming it (although not in the traditional sense of logs or powers): create a new field, then calculate the values to be the value (in this case the log) minus the mean for all values in that field.  This doesn't actually change anything (the impact on results), but for some reason we've found that GWR likes variables in that form... and this transformation will often "fix" the problem.


I'm thinking this might be the way to go. There is something in the numerical distribution of the variable values that seems to be problematic and causes the equations to falter at the immediate, local rendering of the GWR. Log transformations seem problematic when the untransformed values work and the transformed ones don't. I'll check that the transformation follows the assumptions of the variables in question, though I thought transforming income to a log value was pretty standard? So it still makes me ask: what about the combination of equation values could be causing it to fail? (In particular, it's the (supposedly) open-ended distributions of "Income" and "population density", transformed to log values, that appear the most problematic in the local GWR executions.)


Also, just a reminder to make sure that you find a properly specified OLS model before moving on to GWR.  There is some great documentation about this, including this recent ArcUser article on Finding a Meaningful Model and the training seminar on Regression Analysis Basics.


Thanks for posting this stuff - I'd read the model specification stuff earlier, to be sure I was still on the right track, but the "Finding a Meaningful Model" piece will be good for checking my model against core assumptions again (BTW, it's a nice, simple, elegant piece - well done and thanks for posting/sharing it). In my model estimations I've gone so far as to use the spatial modeling tools for checking sills, nuggets and so on to get a properly determined spatial model, banding distances etc. And earlier I also ran a global analysis of Pearson correlations, running one variable against the next spatially, to determine if there are spurious relationships (mediating and moderating factors) that could be skewing variable values as well.

Thanks for the input and suggestions - will see where they take me. Clearly spatial analysis remains in the realm of "sausage making statistics" - not pretty to know what goes in but looks good on the other side, lol.

Best
Andrew
StephenPeplow
New Contributor
I had the same problem: very nice OLS model, with five independent variables. VIF all below 2. Converted to logs and it wouldn't run in ArcGIS as a GWR. I centred the variables (as advised above) and it works very well. Thanks for the advice. I would never have got there on my own! Stephen
AnnaFischer1
New Contributor II
Hey Andrew,

Another option is to try transforming it (although not in the traditional sense of logs or powers): create a new field, then calculate the values to be the value (in this case the log) minus the mean for all values in that field.  This doesn't actually change anything (the impact on results), but for some reason we've found that GWR likes variables in that form... and this transformation will often "fix" the problem.



This worked wonderfully! Thanks so much!!