Does this mean that there are missing explanatory variables in these specific polygons for why the residuals are so high or low?
Hi Justin, thank you for your question! I may repeat some things you might already know, so please bear with me, but I find it helps to cover the whole goal of the analysis and then answer your question:
To begin, let’s recall the goal of using OLS: to create a linear formula representing the relationships between a dependent variable and one or more explanatory variables.
Most things we want to predict do not have an exact linear relationship, so any linear formula we fit will deviate from the observations. The difference between an observed value and the formula's prediction is that observation's residual.
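As a concrete sketch with made-up numbers (not your data), here is a one-variable OLS fit in plain Python; the residuals are simply observed minus predicted, and with an intercept in the model they always sum to zero:

```python
def ols_fit(x, y):
    """Slope and intercept of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Made-up observations: roughly linear, but not exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

a, b = ols_fit(x, y)
# Residual = observed value minus the formula's prediction.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

No single point sits exactly on the line, but the over- and under-predictions balance out, which is the behavior we want a healthy model to show everywhere in the study area.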
This is not necessarily bad. Most models should generalize: they should capture the overall trend, or "signal", in the data without forcing the formula through every single observation. In fact, because a linear formula corresponds to a straight line, it is practically guaranteed to be off somewhere.
What matters, then, is trying to make these residuals random; in other words, we want the error to look and feel as random as possible so that our linear formula isn't biased toward a specific type of error. A rough example of a biased model would be a home-value formula that fits larger homes well but systematically misses smaller ones.
When we think of residuals in geography, the same idea applies: we want the residuals to be randomly distributed across your study area. If you see a clustering of low or high residuals somewhere in your study area, that would suggest that your model is missing an important characteristic of that area and the model is over-predicting (for low residuals) or under-predicting (for high residuals) for that area.
When you run a hot spot analysis of OLS residuals, you are testing the null hypothesis that residual values are randomly distributed across your study area; that is, that there is no underlying process driving the clustering of significantly high or significantly low residual values. When the analysis finds hot spots (clusters of high residual values), it suggests that the model under-predicted in that neighborhood and that the neighborhood's average residual is significantly different from that of the entire study area. The same applies to cold spots, except there your model is over-predicting.
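Conceptually, the Getis-Ord Gi* statistic behind Hot Spot Analysis compares each polygon's neighborhood sum of residuals to what spatial randomness would produce. Here is a toy pure-Python sketch of that idea, using made-up residuals and a hypothetical binary neighbor list; it is an illustration of the concept, not ArcGIS's actual implementation:

```python
import math

def local_z(residuals, neighbors, i):
    """Z-score of the neighborhood sum of residuals around polygon i
    (Gi*-style, binary weights, the polygon itself included)."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / n
    hood = neighbors[i] + [i]                 # neighborhood includes polygon i
    k = len(hood)
    local_sum = sum(residuals[j] for j in hood)
    expected = mean * k                       # expected sum under randomness
    sd = math.sqrt(var * k * (n - k) / (n - 1))
    return (local_sum - expected) / sd

# Made-up residuals: polygons 0-2 form a high-residual cluster.
residuals = [2.0, 2.0, 2.0, -1.0, -1.0, -1.0, -1.0, -1.0]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4],
             4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}

z_hot = local_z(residuals, neighbors, 0)   # strongly positive: under-prediction cluster
z_cold = local_z(residuals, neighbors, 5)  # negative: over-prediction neighborhood
```

A strongly positive z-score flags a neighborhood where residuals cluster high (a hot spot, i.e. under-prediction), and a strongly negative one flags a cold spot (over-prediction).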
There are several potential causes, and it's tricky to cover them all in a short conversation, but a common one is that your model is missing some key characteristic of that neighborhood. For example: if our home-value linear model has clustered high residuals (i.e., hot spots of OLS residuals) for homes near the beach, then we likely did not include enough information about expensive beach homes in our training dataset. We could test whether adding distance to the beach as an explanatory variable, and sampling a wider range of home prices, makes the residuals more spatially random. That would give the model a better chance of capturing those patterns and should lead to better predictions for those homes as well.
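To make the beach example concrete, here is a hypothetical pure-Python sketch (made-up sizes, prices, and coefficients, not real data): home values are generated with a beach premium, so a model that omits the beach variable leaves large residuals on beach homes, while adding it drives those residuals to roughly zero:

```python
def solve(A, b):
    """Solve A·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)c = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Made-up data: value = 50 + 2*size + 30*near_beach, exactly.
sizes = [100, 150, 200, 120, 180, 90]
beach = [0, 0, 0, 1, 1, 1]
values = [50 + 2 * s + 30 * b for s, b in zip(sizes, beach)]

# Reduced model: intercept + size only (misses the beach premium).
c1 = ols([[1.0, s] for s in sizes], values)
res1 = [v - (c1[0] + c1[1] * s) for v, s in zip(values, sizes)]

# Full model: intercept + size + near_beach.
c2 = ols([[1.0, s, b] for s, b in zip(sizes, beach)], values)
res2 = [v - (c2[0] + c2[1] * s + c2[2] * b)
        for v, s, b in zip(values, sizes, beach)]
```

In the reduced model the residuals are large and line up with the beach group, which is exactly the kind of spatial clustering a hot spot analysis would flag; adding the missing variable removes it.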
A useful article about this can be found here, where the fourth check “Is my model biased?” covers a few additional considerations.
Please let us know if this helps, and feel free to post screenshots or data for your model so we can help if you have any other questions!