Does spatial autocorrelation/hot spot analysis have an upper limit on sample size?

08-19-2010 03:37 PM
MinjieXu
New Contributor
I tried to use the Spatial Autocorrelation and Hot Spot Analysis tools to find clusters in the change in appraised values of single-family housing from 2004 to 2008 in Dallas County, but failed. The progress bar has been stuck at 0% and has not moved in 5 hours. The sample size is around 500,000 parcels. But if I run the commercial and industrial parcels instead, it works: it takes about 40 minutes to run 27,000 parcels and get results. I am wondering whether there is an upper limit for spatial autocorrelation/hot spot analysis, meaning that if the sample size is too large the tools cannot work? Or have I configured something wrong, or do I need to change a setting? If anyone knows the solution, please help me. Thanks a million.
3 Replies
MinjieXu
New Contributor
Thanks a lot, but I know nothing about C or R. Could you please give me some steps or hints so I can try to learn by myself? If you could provide a short snippet of plug-in code, I would really appreciate it.
JeffreyEvans
Occasional Contributor III
GeoDa has a GUI and is specifically designed for this type of question (http://geodacenter.asu.edu/software/downloads). There is no software out there that can calculate a spatial weights (neighbor) matrix faster. If you are running XP, I would install the older version (GeoDa 0.9.5-i); Vista and Windows 7 users must download OpenGeoDa. There is a lapse-rate (bivariate) LISA available, originally developed for looking at temporal change in a spatial process.

However, take note that 500,000 observations will yield a HUGE neighbor matrix and will take time to process. I do not think you will be able to address anything past first-order neighbors. The spdep package in R has a 64-bit build and can address much larger problems if you have a 64-bit OS and the RAM to support it.
MarkJanikas
New Contributor III
Hi kaerben,

In order for me to understand why your process is stalling at 0%, I would need to know what you have defined as your neighborhood.  Can you respond with the "Conceptualization of Spatial Relationships" you are using... and if it is distance-based... what is the cutoff?  Are you using Polygon Contiguity? Have you attempted to create a Spatial Weights Matrix ahead of time?  That tool prints out some information about the sparseness of your weights matrix... which would shed significant light on the problem.  Also, what version of the software are you using?

As far as limitations are concerned... I believe that the upper limit on most machines for Spatial Autocorrelation and General G is roughly 30 million non-zero linkages in a Spatial Weights Matrix.  So, if you construct a Spatial Weights Matrix using k-nearest neighbors with 60 as the value, you would be right at that threshold (60 * 500,000 = 30 million).  Most applications require far fewer than 60 neighbors... so you may have some wiggle room.  The local stats (Hot-Spot and Local Moran) are far more robust and should solve with more non-zero linkages... but they take a bit longer to calculate because they need to write features using an UpdateCursor (which is slow... but being sped up greatly in 10.1).  Right off the bat, I would suggest running the tools with a Spatial Weights Matrix using 8 nearest neighbors... just to get an idea of the time required... even if that is not the concept you are interested in.
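
If it helps, here is roughly what that 8-nearest-neighbor test looks like scripted in the Python window (a rough sketch only... the paths, the PARCEL_ID unique ID field, and the VALUE_CHANGE analysis field are placeholders for your own data, and you may need to double-check the keyword strings against the tool reference for your version):

import arcpy

arcpy.env.overwriteOutput = True

in_fc  = r"C:\data\dallas.gdb\sf_parcels"            # ~500,000 single-family parcels (placeholder)
swm    = r"C:\data\sf_parcels_knn8.swm"              # output spatial weights matrix file (placeholder)
out_fc = r"C:\data\dallas.gdb\sf_parcels_hotspots"   # output feature class (placeholder)

# 1) Build a spatial weights matrix with 8 nearest neighbors.
#    500,000 features * 8 neighbors = 4 million non-zero linkages,
#    well under the ~30 million ceiling mentioned above.
arcpy.GenerateSpatialWeightsMatrix_stats(
    in_fc, "PARCEL_ID", swm,
    "K_NEAREST_NEIGHBORS", "EUCLIDEAN", 1, "#", 8)

# 2) Global Moran's I (Spatial Autocorrelation) using the pre-built *.swm file.
arcpy.SpatialAutocorrelation_stats(
    in_fc, "VALUE_CHANGE", "GENERATE_REPORT",
    "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE", "NONE",
    "#", swm)

# 3) Hot Spot Analysis (Getis-Ord Gi*) using the same *.swm file.
arcpy.HotSpots_stats(
    in_fc, "VALUE_CHANGE", out_fc,
    "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE", "NONE",
    "#", "#", swm)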

Just to give a (very) loose comparison for you... for a Random Set of 500,000 point features it took me:

1 minute 59 seconds to construct a Spatial Weights Matrix (*.swm file) with 8 Nearest Neighbors

4 minutes 20 seconds to run the Spatial Autocorrelation tool with given *.swm file

16 minutes 6 seconds to run Hot Spot Analysis (most of the extra time due to the previously mentioned UpdateCursor)

All of our algorithms for the Global and Local Stats in 10.0 are based on sparse math (you also get this math by using a Spatial Weights Matrix in 9.3)... and if you request a solution where the spatial weights are dense (i.e. all features have lots of neighbors)... then the tool can break down when the dataset is large.  I am thinking this is what is happening with your data.  Perhaps you took the defaults, and the distance cutoff is giving some features thousands of neighbors.  In 9.3 and beyond, the default cutoff is the distance that assures that every feature has at least one neighbor.  When the polygons are irregular in size, you often get a vastly uneven number of neighbors.
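
A quick way to see how large that default cutoff will be for your parcels is the Calculate Distance Band from Neighbor Count utility (again just a sketch... it assumes that utility is available in your version, and the path is a placeholder). The maximum nearest-neighbor distance it reports is essentially the default cutoff, and a very large value there is a warning that some features will pick up thousands of neighbors under the defaults:

import arcpy

in_fc = r"C:\data\dallas.gdb\sf_parcels"   # placeholder path

# Minimum, average, and maximum distance to each feature's single nearest neighbor.
result = arcpy.CalculateDistanceBand_stats(in_fc, 1, "EUCLIDEAN_DISTANCE")
min_d, avg_d, max_d = [result.getOutput(i) for i in range(3)]
print("Nearest-neighbor distance -- min: %s  avg: %s  max: %s" % (min_d, avg_d, max_d))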

Moving along... if you are using the Spatial Weights Matrix tool, all of the algorithms for finding neighbors EXCEPT Polygon Contiguity are extremely fast... quite competitive with any other package.  This speed has been added to the tools directly in 10.0... so if you are using 10.0, you could possibly skip the Spatial Weights Matrix construction... but of course, you won't get k-nearest neighbors or Delaunay triangulation.  We are currently working on a much faster implementation of Polygon Contiguity (scheduled for 10.1)... which is definitely needed as an improvement.
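
For example, calling Hot Spot Analysis directly with a fixed distance band skips the *.swm step entirely (a sketch under the same caveats as above... the 1000-unit band and the field/path names are placeholders you would tune to your data):

import arcpy

in_fc  = r"C:\data\dallas.gdb\sf_parcels"                # placeholder path
out_fc = r"C:\data\dallas.gdb\sf_parcels_hotspots_fdb"   # placeholder path

# Hot Spot Analysis (Getis-Ord Gi*) with a fixed distance band... no *.swm file needed.
arcpy.HotSpots_stats(
    in_fc, "VALUE_CHANGE", out_fc,
    "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE",
    "1000", "#", "#")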

Lastly, I am quite willing to help you out directly if you want to send me your data.  If any of the attributes are sensitive... feel free to delete those fields before forwarding.  Due to the size of your dataset... you would need to FTP it or use something like Dropbox to get it to me... the latter has a free 2 GB limit... I have not used it yet... but we could try it out together if you are so inclined.  As the programmer, I am at your service: mjanikas@esri.com

Best wishes,

MJ