Average Nearest Neighbor tool not detecting randomness? (ArcGIS 10.0 bug?)

09-28-2011 10:25 AM
chrisgregory
New Contributor
Hello,

I have started to play with the Spatial Statistics toolbox using three shape files:

1) Evenly-spaced point data (generated from Create Fishnet tool)
2) Random point data (generated from Create Random Points tool)
3) Real-world point data (results of research in Australia)

The outputs of (1) and (2) are rectangular in extent, constrained by the extent of my Australia (3) shapefile. I ran the Average Nearest Neighbor tool on all three datasets and received the expected results (based on visual examination):

1) Dispersed (with associated p-values and z-scores)
2) Random (ditto)
3) Clustered (ditto)

Next, I clipped (1) and (2) by the actual outline of Australia (3) and re-ran the first two analyses. This is where I noticed an interesting - but troubling - problem:

1) Dispersed (no problem)
2) Clustered (?!?)

The clipped dataset contains ~80% of the original points, each still in its original, randomly generated location (on terrestrial Australia). Thus, the result should still be Random, but the tool reports Clustered.

I have replicated this problem with multiple shapefiles of random points. In each case, the Average Nearest Neighbor tool reports the full rectangle of random points as Random, but the subset clipped out of the middle of the rectangle as Clustered. I have also created shapefiles of random points using the Constraining Feature Class option (rather than Constraining Extent) and obtained the same result (Clustered).
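For reference, here is a rough sketch of a scripted version of my workflow (paths and layer names are placeholders; tool and parameter names follow the ArcGIS 10 arcpy documentation as best I know, so treat this as an untested sketch rather than exactly what I ran):

import arcpy

arcpy.env.workspace = "C:/data"  # placeholder workspace

# 1) Random points constrained by the rectangular extent of the Australia layer
#    (an extent parameter; here a dataset is passed so its envelope is used)
arcpy.CreateRandomPoints_management("C:/data", "random_rect.shp",
                                    constraining_extent="australia.shp",
                                    number_of_points_or_field=1000)

# 2) Clip the random points to the actual outline of Australia (~80% survive)
arcpy.Clip_analysis("random_rect.shp", "australia.shp", "random_clipped.shp")

# 3) Run Average Nearest Neighbor on both; only the clipped set reports Clustered
for fc in ("random_rect.shp", "random_clipped.shp"):
    result = arcpy.AverageNearestNeighbor_stats(fc, "EUCLIDEAN_DISTANCE", "NO_REPORT")
    # derived outputs per the tool documentation: ratio first, then z-score
    print("{0}: ratio={1}, z={2}".format(fc, result.getOutput(0), result.getOutput(1)))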

I'm hoping to understand why this would happen in order to understand the results of more advanced analyses (such as Ripley's K and Moran's I). Anyone have any ideas about this problem? Cheers,

Chris
3 Replies
LaurenRosenshein
New Contributor III
Hi Chris,

This is a good question, and my initial reaction is that this does not sound like a bug (although I do understand your concern). The first part of your question involves clipping a dataset that was originally created with the Create Random Points tool. Once you clip those random points, then depending on the polygon you use to clip them, you may be imposing a structure on what were once randomly distributed points, which could lead to a clustered distribution.

In terms of the new dataset that you created using the Australia boundary, what does that constraining dataset look like? Is it one polygon representing Australia, or does it have multiple polygons? If it has multiple polygons (regions, counties, etc.), then Create Random Points actually generates the user-specified number of random points in each one of those polygons. That means that if the constraining dataset contains both smaller and larger polygons, there will be 100 points (for example) in each of the smaller polygons and 100 points in each of the larger polygons. Within each individual polygon the features will be "random", but across the entire study area you will have imposed definite clustering in the smaller polygons. So that's one thing to think about.
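To make that concrete, here is a small sketch outside of ArcGIS (Python with numpy/scipy; the box sizes and point counts are invented for illustration, and the expected-distance formula is the standard one, also given in JeffreyEvans's reply below):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)

# 100 random points each in a 1x1 "small polygon" and a 10x10 "large polygon"
small = rng.uniform(0, 1, size=(100, 2))
large = rng.uniform(0, 10, size=(100, 2)) + np.array([5.0, 0.0])  # shifted apart
pts = np.vstack([small, large])

# Observed mean nearest-neighbor distance (k=2: each point's closest point is itself)
d, _ = cKDTree(pts).query(pts, k=2)
observed = d[:, 1].mean()

# Expected mean distance under complete randomness over the combined area
n, area = len(pts), 1.0 + 100.0
expected = 0.5 / np.sqrt(n / area)

print(observed / expected)  # below 1 -> reported as clustered

Each box is internally random, but the densities differ by a factor of 100, so the combined study area contains a dense patch and the ratio drops well below 1.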

The other thing to think about, which is touched on briefly in the documentation for Average Nearest Neighbor, is how sensitive the Average Nearest Neighbor (ANN) tool is to the study area or extent of your analysis. Essentially, ANN looks at the average distance between each feature and its closest feature in relation to the area of the analysis, and compares that to the distances between randomly placed features in a circle of the same area. So the exact same distribution of points could be considered random or clustered depending on the extent/bounding geometry used for the analysis. For this reason, one of the ways we recommend using ANN is for making comparisons between multiple distributions within the same study area. For instance, if you had points representing the locations of various types of trees in Australia, you could use ANN to compare those distributions, because the point locations would change while the bounding geometry stayed the same. That isn't to say you cannot use ANN for your purposes; it is just important to remember the impact that your bounding geometry has on your output.
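And to see the extent sensitivity numerically, the same kind of sketch with one point set judged against two different study areas (again, the numbers are arbitrary):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(500, 2))  # random points in a 1x1 box

d, _ = cKDTree(pts).query(pts, k=2)
observed = d[:, 1].mean()

# Same points, two candidate study areas: the true 1x1 box vs an inflated 2x2 box
for area in (1.0, 4.0):
    expected = 0.5 / np.sqrt(len(pts) / area)
    print("area={0}: ratio={1:.2f}".format(area, observed / expected))
# ~1.0 for the true area ("random"), ~0.5 for the inflated area ("clustered")

Same points, different verdict: only the area used in the expected-distance calculation changed.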
JoelThomas
New Contributor II
A question about the explanation above: why, if you repeat the same analysis over and over again, would you get the same numbers? Isn't part of the equation a random distribution of the same number of points, which would seemingly change the output numbers?
JeffreyEvans
Occasional Contributor III
I can't speak to how ESRI has implemented the Nearest Neighbor Index (NNI), but the standard convention for the statistic does not create a set of random points to test against. Like many point pattern statistics, the NNI uses an expected value as the null random process, based on the number of observations per unit area of the extent; hence the sensitivity to changing the extent. The NNI can also be biased if the point process is inhomogeneous or Poisson distributed. Needless to say, from a spatial process standpoint there are obvious limitations to the NNI, but it is useful in some cases as an exploratory statistic, although I would be cautious drawing inference from it.

The NNI, without boundary correction, is calculated as follows:
D = sum( nndist(x) ) / N
Expected = 0.5 / SQRT(N / A)
NNI = D / Expected

where: N = number of observations, nndist = vector of per-point nearest-neighbor distances, D = average nearest-neighbor distance, A = area of the analysis extent, and Expected = the expected average distance under the null (random) point process.
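As a sketch, the same calculation in Python (scipy's cKDTree for the nearest-neighbor distances; no boundary correction, exactly as written above):

import numpy as np
from scipy.spatial import cKDTree

def nni(points, area):
    """Nearest Neighbor Index, no boundary correction."""
    n = len(points)
    # k=2 because each point's nearest neighbor at distance 0 is itself
    dist, _ = cKDTree(points).query(points, k=2)
    d = dist[:, 1].mean()               # D = sum( nndist(x) ) / N
    expected = 0.5 / np.sqrt(n / area)  # Expected = 0.5 / SQRT(N / A)
    return d / expected                 # NNI = D / Expected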

I have no idea how or if the ESRI implementation performs boundary correction.

So the answer to your question: "Why, if you repeat the same analysis over and over again, would you get the same numbers? Isn't part of the equation a random distribution of the same number of points, which would seemingly change the output numbers?"

There is no stochasticity introduced into the statistic through a randomization process, so if everything stays fixed you should get exactly the same answer every time.
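This is easy to confirm with the nni() sketch above, since fixed inputs give identical output on every call:

pts = np.random.default_rng(1).uniform(0, 100, size=(300, 2))
print(nni(pts, area=100 * 100))  # points random in a 100x100 box -> ratio near 1
print(nni(pts, area=100 * 100))  # identical value: nothing stochastic in the statistic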