Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) woes

BrianWallace · ‎05-07-2010

Hello forum community,

I am a graduate student whose thesis project is to explore whether archaeological "site" (or, more accurately, artifact) locations can be predicted from an existing archaeological site database. The task is to look for correlations between existing site location, the dependent variable, and environmental independent variables. I have looked into various logistic and multivariate regression models, but have been reading up on the relatively recently released OLS and GWR tools in the Spatial Statistics toolbox. I have begun working with the data using these regression tools, but I am quickly feeling overwhelmed and fear spending a significant amount of time on a goal that may not even be achievable with this method.

I have watched the free web seminar a couple of times and have read:

The ESRI Guide to GIS Analysis, Volume 2
Mitchell, Andy. ESRI Press, 2005.
Geographically Weighted Regression: the analysis of spatially varying relationships
Fotheringham, Stewart A., Chris Brunsdon, and Martin Charlton. John Wiley & Sons, 2002.

I thought I was making progress; however, I am getting very low R-squared ("goodness of fit") values in every analysis I attempt and feel as though I am going in circles. This forum was suggested to me by the regional ESRI tech support representative, who admitted that she too knows very little about the application. I have looked into attending the ESRI class "Performing Analysis with ArcGIS Desktop", which was suggested after the webinar, but before pulling the trigger on spending that kind of scratch I am hoping someone can shed some light on whether I am barking up the wrong tree.

Could OLS and GWR successfully analyze the relationship between environmental (and perhaps other explanatory systemic behavioral) variables and archaeological artifact locations? I vaguely recall the webinar, or perhaps Mitchell's chapter on this material, suggesting that this could work, but after my early attempts I want to make sure before continuing the pursuit. I love the theory behind using local statistics to assess relationships with the dependent variable, and the ability to formulate a model that can later be used to help make predictions for unknown areas, but with my own study I have hit the proverbial wall.

Can anyone help?

Thanks- aspiring graduate school graduate

LaurenRosenshein · ‎05-14-2010

Hi Aspiring Graduate School Graduate!

I want to start by saying that you are on the right track! I'm glad you were able to watch the free web seminar on regression and hope it was helpful. The books you mention are also great resources for getting started with this kind of analysis.

Speaking from my own Master's thesis experience, the analysis process is rarely as easy as it seems in books and tutorials. The real world is complex, and the relationships that we see in our data are sometimes pretty unexpected. That being said, there are a number of techniques that we can use to help us try to find the best model possible. Before you give up on using OLS and GWR, make sure that you ask yourself a few questions. The answers to these questions could help you find a good model using OLS, or they may help you realize why OLS is not the right method to use. Either way, you will be one step closer to finding solutions to the task at hand.

I will post these strategies as 4 separate entries due to length constraints.

Lauren Rosenshein
Geoprocessing Product Engineer
ESRI | Redlands, CA

JeffreyEvans · ‎05-11-2010

It is not clear what your question is. However, beyond exploring data, ArcGIS is a poor choice for this type of analysis. The choice of kernel function can yield very different results. You can define an n-th degree local polynomial as your kernel, which allows considerable flexibility. The only options in ArcMap are Gaussian or adaptive. My personal feeling on GWR is that the estimate is biased towards the 2nd order variation and does not account for global trend. This in turn over-disperses the estimate. Unless you expect a high level of nonstationarity, this is quite undesirable. There is also the possibility of losing variation within a given local neighborhood (identical value problem).

It sounds like you are having some model specification issues. You would have a better idea of your statistical relationships if you performed some basic exploratory analysis. GeoDa is nice freeware for exploratory spatial data analysis (http://geodacenter.asu.edu/). I would highly recommend seeking help from somebody well versed in spatial statistics. If your data are autocorrelated, the assumptions of Ordinary Least Squares (OLS) regression will be violated, producing invalid results. Models such as Mixed Effects Models, Spatial Regression, and Conditional/Spatial Autoregressive Models (CAR/SAR) are frequentist approaches specifically designed to address these types of spatial questions. An alternative to spatial statistics is to use nonparametric models such as Random Forests or Spline regression. The preferred software for this type of analysis is R (or S+). There are several spatial statistical libraries available, and R is free and open source (S+ is not). It has a steep learning curve, but statistics is not something that you learn in a weekend.
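To make the autocorrelation point concrete, here is a minimal sketch of a global Moran's I computed on regression residuals with NumPy. The function name `morans_i`, the row-standardized inverse-distance weights, and the toy grid data are all assumptions made for illustration; this is not how ArcGIS, GeoDa, or any R package computes the statistic internally.

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with row-standardized inverse-distance weights
    (an illustrative weighting choice, not a recommendation)."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)
    # Pairwise Euclidean distances between all locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Inverse-distance weights; the diagonal stays at zero.
    w = np.zeros_like(dist)
    off_diag = dist > 0
    w[off_diag] = 1.0 / dist[off_diag]
    w /= w.sum(axis=1, keepdims=True)
    z = values - values.mean()
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Hypothetical 10 x 10 grid of locations.
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])

rng = np.random.default_rng(42)
# Residuals that still contain an east-west trend (a missing spatial variable).
trend_resid = coords[:, 0] + rng.normal(0, 0.5, len(coords))
# Residuals with no spatial structure.
random_resid = rng.normal(0, 1.0, len(coords))

i_trend = morans_i(trend_resid, coords)
i_random = morans_i(random_resid, coords)
print("Moran's I, trended residuals:", round(i_trend, 3))
print("Moran's I, random residuals: ", round(i_random, 3))
```

A clearly positive value for the trended residuals, next to a value near zero for the random ones, is the pattern that would signal a problem with an OLS model.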

Here is a link to the R website: http://cran.r-project.org/

And the standard text on spatial analysis in R is: Applied Spatial Data Analysis with R by R.S. Bivand, E.J. Pebesma, and V. Gómez-Rubio. Springer Press.

Feel free to contact me off-list if you have any questions.

LaurenRosenshein · ‎05-14-2010

Do I have all of the variables that I need?

Often times we go into an analysis with a hypothesis about what variables are going to be important. Maybe we think there are 5 variables that are definitely good predictors of what we are trying to model, or maybe there are 10 that we think could be related. While it is important to approach a regression analysis with a hypothesis, it is also important to allow your creativity and insight to dig a little deeper. What I mean is that you shouldn't limit yourself to those 10 variables you think might be important. Really consider all of the variables that you might include. Consult literature and theory. Use common sense. Dig deep and include as many potential explanatory variables as you can! You will only increase your chances of finding a good model.

It is also important to think about the spatial variables that could be important to your analysis. These are variables like distance from the urban center, proximity to major highways, or distance to large bodies of water. This kind of variable can be very important in an analysis where you believe geographic processes are playing a major role in the relationships in your data. Whenever your dependent variable has a spatial structure to it (very common with spatial data), you will need to find explanatory variables that capture that spatial structure or else you will continue to see spatially autocorrelated residuals. These spatial variables not only have the potential to improve your model, they can also help you understand the phenomenon that you are modeling in a new, innovative way!

Another method is to try including some dummy variables in your analysis. A classic example of a dummy variable would be an urban/rural variable. By assigning all urban features a value of 0 and all rural features a value of 1 you are able to capture a spatial relationship in the landscape that could be influencing your model. They are easy to create, and might make a big difference.

[Note: while these spatial regime dummy variables are great to include in your OLS model, you will want to remove them when you run GWR. Since GWR takes these spatial relationships into consideration in its mathematics, these dummy variables aren't needed and, in fact, they create local redundancy between the dummy variable and the intercept, often making it impossible for GWR to solve.]

Another strategy for helping you identify the key explanatory variables that might be missing from your model is to examine the residual map and the coefficient surfaces (which are an optional output from the GWR tool!) to see if these give you any clues about what's missing. A strong east to west trend in the residuals or a sharp boundary in a coefficient surface may suggest a missing spatial variable. Examining the coefficient surfaces also gives you a feel for how your variables are changing across space.

[ATTACH]946[/ATTACH]

You may find, for example, that a certain variable is really important in one part of your study area, and not important at all in another part. In some cases you might see that the coefficients for a variable switch signs completely. This is important to notice because this kind of non-stationarity violates underlying assumptions of OLS. OLS is a global model and is expecting those relationships to be consistent (stationary) across the study area.

Consider, for example, the relationship between childhood obesity and access to healthy food options. It may be that in low income areas with poor access to cars, being far away from a supermarket is a real barrier to buying healthy food. In high income areas, however, having a supermarket within walking distance might actually be undesirable, and with potentially better access to vehicles the distance to the supermarket might not act as a barrier to buying healthy foods at all. In this example, the relationship between the distance to healthy food and childhood obesity rates has the potential to vary dramatically in different parts of the study area.

OLS will likely indicate the relationship is weak (possibly insignificant) for variables like this with a strongly inconsistent (non-stationary) relationship to the dependent variable. In some extreme cases the same variable could have a positive coefficient in some areas and a negative coefficient in other areas, effectively canceling each other out. Think of it as -1 + 1 = 0. If you notice that this is the case, and you believe strongly that the variable is an important predictor of what you are modeling, you should leave the variable in your OLS model, knowing that it will be effective when you move to GWR.
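The "-1 + 1 = 0" cancellation is easy to reproduce with simulated data. The sketch below is only an illustration under made-up assumptions: two hypothetical regions ("west" and "east") with slopes of +1 and -1, intercepts chosen so both regions have similar average values, and a helper named `ols_slope` that is not part of any ArcGIS tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Slope b1 of a simple least-squares fit y = b0 + b1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

n = 200
x_west = rng.uniform(0, 10, n)
x_east = rng.uniform(0, 10, n)
# Hypothetical regions with opposite relationships to the dependent variable.
y_west = 5.0 + 1.0 * x_west + rng.normal(0, 0.5, n)    # slope +1
y_east = 15.0 - 1.0 * x_east + rng.normal(0, 0.5, n)   # slope -1

slope_west = ols_slope(x_west, y_west)
slope_east = ols_slope(x_east, y_east)
# One global model fitted to both regions at once.
slope_global = ols_slope(np.concatenate([x_west, x_east]),
                         np.concatenate([y_west, y_east]))

print("west slope:  ", round(slope_west, 2))
print("east slope:  ", round(slope_east, 2))
print("global slope:", round(slope_global, 2))
```

Each regional fit recovers a strong slope, while the pooled fit reports a slope near zero even though the variable matters everywhere.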

Lauren Rosenshein
Geoprocessing Product Engineer
ESRI | Redlands, CA

LaurenRosenshein · ‎05-14-2010

Are the relationships between my variables linear?

This may seem like a tricky question to answer, but it is actually very simple! You can use the Scatterplot Matrix to evaluate all of the relationships between the variables in your data. Linear relationships would look like diagonal lines in the scatterplot matrix. Non-linear relationships could look more like curved lines, or take some other shape.

[ATTACH]947[/ATTACH]

If you see that the variable you are trying to model (your dependent variable) has a non-linear relationship with one of your explanatory variables then you have some work to do! OLS is a linear regression model that assumes that the relationships between your variables are linear. If they aren't linear, you can try to transform your variables so that the relationships become linear. Common transformations include Log and Exponential transformations.
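As a rough illustration of how a transformation can straighten out a relationship, the sketch below simulates a hypothetical exponential relationship and compares the linear correlation before and after taking the log of the dependent variable. The data, coefficients, and noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(1, 10, 300)
# Hypothetical exponential relationship with multiplicative noise.
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0.0, 0.2, 300)

# Pearson correlation measures how close the relationship is to a straight line.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

print("correlation of x with y:     ", round(r_raw, 3))
print("correlation of x with log(y):", round(r_log, 3))
```

The log-transformed variable is much closer to a straight-line relationship with x, which is what OLS assumes. Which transformation helps (if any) depends on the shape you see in the scatterplot matrix.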

Another useful output of the scatterplot matrix is the histogram that is created for each of the variables. You can use these histograms to figure out if your data is normally distributed, or if it is skewed or has outliers. Skewness and outliers can cause problems in many types of statistics, including regression. You can use the same kinds of transformations that I just mentioned to help you mitigate the impact of outliers and skewness. This image shows the way that different types of transformations can help you get your data into its most useful form.

[ATTACH]948[/ATTACH]

Lauren Rosenshein
Geoprocessing Product Engineer
ESRI | Redlands, CA

LaurenRosenshein · ‎05-14-2010

Are there multiple models to consider?

One of the main reasons that we get so excited about GWR is the fact that it allows the relationships in our model to vary across space. While it is a very powerful tool that can help us understand our data and improve our models, we cannot stress enough the importance of first finding a good model using OLS. One of the main reasons that this is important is the fact that OLS gives us all sorts of great diagnostics to help us figure out if our explanatory variables are significant, if our residuals are normally distributed, and ultimately if we have a good model.

Sometimes, however, it's difficult to find a good (properly specified) OLS model because a single global model doesn't fit the entire study area. It may be that one set of variables provides a great model in one part of your study area, and another set of variables provides a great model in another part of your study area. To see if this is the case, you can pick several small sample areas within your broader study area and then see if the explanatory variables for each subarea change. Pick sample areas where you think the processes associated with what you are trying to model might be different (high vs. low income areas, old vs. new housing, etc.). Alternatively, map the Local R2 results from GWR and select areas where GWR performed well and where it performed badly. These might be areas with different sets of explanatory variables. It can be very useful to look at these areas individually using OLS.

If you do find good OLS models in several small study areas, then you can conclude that you've found the proper variables (overall), but just aren't getting a good global model because of regional variation. You can move to GWR with the FULL set of variables from the combined smaller study area analyses, because GWR will adjust the coefficients to reflect that regional variation. If you don't get a good OLS model in any of the smaller areas, it may be that the key explanatory variables are too complex to represent as a simple series of numeric measurements, and you will need to look for other analytical methods.
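For readers who want to see what "adjusting the coefficients to reflect regional variation" can look like, here is a bare-bones sketch of the idea behind GWR: a separate weighted least-squares fit at every location, with nearer observations weighted more heavily. It assumes a fixed Gaussian kernel with a hand-picked bandwidth, and the function `gwr_coefficients` and the simulated data are hypothetical. It is not the ArcGIS GWR tool and omits bandwidth selection and all diagnostics.

```python
import numpy as np

def gwr_coefficients(coords, x, y, bandwidth):
    """Local weighted least squares at each observation, using a fixed
    Gaussian kernel (illustrative only)."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])       # intercept + explanatory variable(s)
    betas = np.empty((len(y), X.shape[1]))
    for i, c in enumerate(coords):
        d = np.linalg.norm(coords - c, axis=1)      # distance to every observation
        w = np.exp(-0.5 * (d / bandwidth) ** 2)     # Gaussian distance-decay weights
        sw = np.sqrt(w)
        # Weighted least squares: scale rows by sqrt(weight), then solve.
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        betas[i] = beta
    return betas

rng = np.random.default_rng(3)
n = 400
coords = rng.uniform(0, 10, size=(n, 2))
x = rng.uniform(0, 10, n)
# Hypothetical process: slope is -1 in the west half and +1 in the east half.
true_slope = np.where(coords[:, 0] < 5, -1.0, 1.0)
y = true_slope * x + rng.normal(0, 0.5, n)

betas = gwr_coefficients(coords, x, y, bandwidth=1.0)
west_mean = betas[coords[:, 0] < 2, 1].mean()
east_mean = betas[coords[:, 0] > 8, 1].mean()
print("mean local slope, far west:", round(west_mean, 2))
print("mean local slope, far east:", round(east_mean, 2))
```

Mapping the second column of `betas` would give a simple coefficient surface. The bandwidth controls how local each fit is, and results can be quite sensitive to it.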

Okay, so all of this is a bit of work, I know, but it really is a great exercise in exploratory data analysis, and will help you understand your data better, find new variables to use, and could even help you find a great model!

Lauren Rosenshein
Geoprocessing Product Engineer
ESRI | Redlands, CA

LaurenRosenshein · ‎05-14-2010

And don't forget...

The most important thing to remember when you are going through these steps is that the goal of your analysis is to understand your data and use that understanding to contribute to solving problems and answering questions. I am not going to lie... you may try a number of models, with and without transformed variables, explore several small study areas, analyze your coefficient surfaces, and still not find a good OLS model (again, I speak from experience!). But, and this is a big but, you will still be contributing to the body of knowledge on the phenomenon that you are modeling! If the model that you hypothesized would be a great predictor is not significant at all, that is incredibly important information. If one of the variables that you thought would be important has a positive relationship in some areas and a negative relationship in other areas, that is important information. It is true that you may ultimately want to move on to additional tools that will do other types of regression analysis (perhaps one that does not involve linear models, perhaps a spatial autoregressive model, the possibilities are endless). The work that you do here, however, trying to find a good model using OLS, and using GWR to understand your data and improve your model, is valuable. And speaking from experience... just writing up all of the work that you do and the valuable information that you have uncovered along the way will be more than enough to fill a thesis with meaningful content!

All the best,

Lauren Rosenshein
Geoprocessing Product Engineer
ESRI | Redlands, CA

BrianWallace · ‎05-18-2010

Thank you so much Ms./Mrs. Rosenshein and Dr. Evans,

Your input is most appreciated in helping me through the most difficult portion of this task: exploring my data and the shortcomings of my approach.

I will continue to work within the OLS and GWR framework, if for no other reason than, as you mentioned, Ms./Mrs. Rosenshein, to diagnose whether my hypothesis can be addressed with these methods. I'll let you know what I come up with.

Again, thanks to you both for your invaluable input and for taking the time to give my inquiry a meaningful reply. The aim was to use these new tools to explain what kind of correlation one might find between my dependent variable, archaeological artifact/site location, and my environmental variables. I will continue to test and retest my data to ensure it is properly applied in the model, and observe the results, which will tell a story one way or another.

All part of the thesis experience I suppose ;-). Again thanks and take care-

Sincerely-

Brian Wallace

MichaelMcManus · ‎03-09-2011

How does OLS in ArcMap handle dummy variables? If I have a factor that has three levels, 1, 2, and 3, does OLS in ArcMap automatically create the corresponding dummy variables? Or do I need to create the 2 dummy columns that correspond to the levels of the factor before I run the OLS regression? Such as:
Factor    d1    d2
1           0      0
2           1      0
3           0      1

Thanks,
Mike

LaurenRosenshein · ‎03-14-2011

Hi Mike,

That's a GREAT question, and one that we get often! The simple answer is that you are right, and dummy variables should be represented as individual fields for each category. That being said, separate fields are only necessary when you have more than 2 "dummy categories". With just two categories (urban and rural, say), it would be fine to have just one field representing one of those categories, with zeros and ones for that field, since the categories are mutually exclusive. It gets trickier when you have more than 2 categories, as in your example, and when that's the case the layout you describe works well.

In the example below, there are 3 dummy categories (think urban, suburban, rural for instance). There are actually two ways to do this. One way is to just create 2 fields, one for d1 and one for d2. While you aren't creating a field for d3, it is still represented, because those features that don't fall into d1 or d2 will have zeros for both and the calculations will reflect that. The other way is to create a field for each category. This is especially useful if you are doing an exploratory regression analysis or a stepwise regression analysis and you want to find out which variables are most important when explaining your dependent variable. In this case, having the 3rd field to represent the 3rd dummy category would be important. Just avoid entering all three fields in the same model, because together they duplicate the intercept.

Factor    d1    d2    d3
1         0     0     1
2         1     0     0
3         0     1     0
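As a small sketch of the coding in the table above, the fields can be derived from the factor column with NumPy. The sample `factor` values are made up. The rank check at the end is an addition for illustration: it shows that d1, d2, and d3 together add up to the intercept column, so a single model with an intercept can only use two of them at a time.

```python
import numpy as np

# Hypothetical factor column with three levels.
factor = np.array([1, 2, 3, 2, 1, 3])

# Same coding as the table: level 2 -> d1, level 3 -> d2, level 1 -> d3.
d1 = (factor == 2).astype(int)
d2 = (factor == 3).astype(int)
d3 = (factor == 1).astype(int)

intercept = np.ones(len(factor))
X_two = np.column_stack([intercept, d1, d2])        # level 1 acts as the baseline
X_three = np.column_stack([intercept, d1, d2, d3])  # all three fields plus intercept

rank_two = np.linalg.matrix_rank(X_two)
rank_three = np.linalg.matrix_rank(X_three)

print("columns / rank with two dummies:  ", X_two.shape[1], "/", rank_two)
print("columns / rank with three dummies:", X_three.shape[1], "/", rank_three)
```

With two dummies the design matrix has full column rank. With three, the rank stays the same although a column was added, which is the redundancy a regression solver cannot resolve.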

Hope this helps!

Lauren Rosenshein
Geoprocessing Product Engineer