How do I interpret a hot spot analysis of OLS residuals?

11-05-2020 11:45 PM
JustinLee
New Contributor

Does this mean that there are missing explanatory variables in these specific polygons that explain why the residuals are so high or low?

2 Replies
AlbertoNieto1
Esri Contributor (accepted solution)

Hi Justin, thank you for your question! I may repeat some things you might already know, so please bear with me, but I find it helps to cover the whole goal of the analysis and then answer your question: 

To begin, let’s recall the goal of using OLS: to create a linear formula representing the relationship between a dependent variable and one or more explanatory variables.

Most things we want to predict do not have an exact linear relationship with their explanatory variables, so any linear formula we fit will deviate from the observations. The differences between the values the formula predicts and the values we actually observe are the residuals.

This is not necessarily bad. A model should generalize: it should capture the overall trend, or “signal,” in the data rather than reproduce every single observation exactly. And because a linear formula corresponds to a straight line (or a plane, with more than one explanatory variable), it will inevitably be off somewhere.
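If it helps to see this in code, here is a minimal sketch (not from the original post) using Python’s statsmodels library and made-up home-value data; all variable names and numbers are placeholders:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200

# Made-up explanatory variables for n homes
sqft = rng.uniform(800, 3500, n)    # living area in square feet
age = rng.uniform(0, 60, n)         # age of the home in years

# Made-up "true" process generating home values, plus noise
value = 50_000 + 120 * sqft - 800 * age + rng.normal(0, 25_000, n)

# Fit the linear formula: value ~ intercept + sqft + age
X = sm.add_constant(np.column_stack([sqft, age]))
model = sm.OLS(value, X).fit()

# Residual = observed value - value predicted by the linear formula
residuals = model.resid
print(model.params)     # intercept and coefficients
print(residuals[:5])    # first few residuals
```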

What matters, then, is to try to make these residuals random; in other words, we want the error to look and feel as random as possible so that our linear formula isn’t biased towards a specific type of error. A rough example would be a home-value model that is biased against smaller homes: the fitted line systematically misses them, so their errors all fall on the same side of the line.
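Continuing the made-up example above (again, just an illustrative sketch, not your data): suppose smaller homes carry a premium the model knows nothing about, for instance because they sit in a dense, expensive urban core. The residuals then stop looking random and line up with an unmodeled characteristic:

```python
# Continues the sketch above (sqft, age, value, rng, n already defined).
# Give small homes an unmodeled +60k premium, then fit on sqft alone.
small = sqft < 1200
value_biased = value + np.where(small, 60_000, 0)

fit = sm.OLS(value_biased, sm.add_constant(sqft)).fit()
resid = fit.resid

# The errors are no longer random: small homes are systematically
# under-predicted (positive residuals on average), the rest slightly
# over-predicted, because the straight line cannot absorb the premium.
print("mean residual, small homes:", resid[small].mean())
print("mean residual, other homes:", resid[~small].mean())
```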

When we think of residuals in geography, the same idea applies: we want the residuals to be randomly distributed across your study area. If you see a clustering of low or high residuals somewhere in your study area, that suggests your model is missing an important characteristic of that area and is over-predicting (clusters of low, i.e. negative, residuals) or under-predicting (clusters of high, i.e. positive, residuals) there.

When you run a hot spot analysis of OLS residuals, you are testing the null hypothesis that residual values are randomly distributed across your study area, i.e. that there is no underlying process driving the clustering of significantly high or significantly low residual values. When the hot spot analysis identifies hot spots (clusters of high residual values), it suggests that the model under-predicted in that neighborhood and that the neighborhood’s average residuals are significantly different from those of the study area as a whole. The same applies to cold spots, except that the model is over-predicting in those areas.
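For what it’s worth, here is a rough sketch of the same check outside ArcGIS, using the open-source PySAL stack (libpysal and esda); this is an assumed, illustrative workflow that continues the made-up example above. In ArcGIS you would instead point the Hot Spot Analysis (Getis-Ord Gi*) tool at the residual field of your OLS output:

```python
# Continues the sketch above (residuals, rng, n already defined).
from libpysal.weights import KNN
from esda.getisord import G_Local

# Made-up home locations; with real data these come from your features
coords = rng.uniform(0, 10, size=(n, 2))
w = KNN.from_array(coords, k=8)     # spatial weights: 8 nearest neighbors

# Getis-Ord Gi* on the residuals, with a permutation-based pseudo p-value
gi_star = G_Local(residuals, w, transform="R", star=True, permutations=999)

# Significantly positive z-scores = hot spots (clustered under-prediction),
# significantly negative z-scores = cold spots (clustered over-prediction)
hot = (gi_star.Zs > 1.96) & (gi_star.p_sim < 0.05)
cold = (gi_star.Zs < -1.96) & (gi_star.p_sim < 0.05)
print("hot spots:", hot.sum(), "cold spots:", cold.sum())
```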

There are several potential causes, and it’s a bit tricky to cover them all in a short conversation, but a common one is that your model is missing some key characteristic of that neighborhood. For example: if our home-value model has clustered high residuals (i.e. hot spots of OLS residuals) for homes near the beach, then we likely did not include enough information about expensive homes near the beach in our training dataset. We could test whether adding distance to the beach as an explanatory variable, along with a wider sample of homes with varying prices, makes the residuals more spatially random. This would give your model a better chance of capturing those patterns and hopefully lead to better predictions for those homes as well.
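One way to check a candidate variable like that (again, an assumed sketch continuing the example above, with a pretend distance-to-beach measure; on real data you would compute an actual distance for each home) is to compare how spatially clustered the residuals are before and after adding it, for example with Moran’s I:

```python
# Continues the sketch above (sqft, age, value, coords, w already defined).
from esda.moran import Moran

# Pretend the coastline runs along x = 0, so x is "distance to the beach";
# with real data you would compute an actual distance for each home.
dist_beach = coords[:, 0]

X_before = sm.add_constant(np.column_stack([sqft, age]))
X_after = sm.add_constant(np.column_stack([sqft, age, dist_beach]))

resid_before = sm.OLS(value, X_before).fit().resid
resid_after = sm.OLS(value, X_after).fit().resid

# Moran's I near zero (with a high p-value) suggests spatially random
# residuals; on real data you would hope "after" moves closer to zero.
for label, r in [("before", resid_before), ("after", resid_after)]:
    mi = Moran(r, w, permutations=999)
    print(label, "Moran's I =", round(mi.I, 3), "p =", mi.p_sim)
```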

A useful article about this can be found here, where the fourth check “Is my model biased?” covers a few additional considerations.

Please let us know if this helps, and feel free to post screenshots or data for your model so we can help if you have any other questions!

JustinLee
New Contributor
Thanks a lot Alberto, that was very clear. So the hot spots of OLS residuals are areas where the model is under-predicting, as opposed to over-predicting? Essentially, given the chosen explanatory variables and how it predicted the rest of the areas, the dependent variable in those hot spots SHOULD have been predicted to be higher? And cold spots are areas where the model SHOULD have predicted the dependent variable to be lower?