Spatial statistics -- how best to define the study area, and use of Getis-Ord Gi*

02-01-2015 11:08 AM
Occasional Contributor II

I have data from a village along the Amazon River, where homes were visited and individuals were tested for having had a particular disease.  In most homes, everyone tested was negative, but in some homes one or more of the people tested were positive.

Determination of study area questions?

I want to see whether those who have had the disease are clustered, so I am trying to use the Average Nearest Neighbor tool.  I have a question regarding how the area is calculated for this tool.  Most of the village is on one side of some fields, around the corner of an L-shaped road, but another group of about 10 houses lies across the fields, more than 150 m from any house in the rest of the village.  In addition, one side of the village is bounded by the river.  I am trying to find the most statistically valid way to define the study area, which has raised several questions.

     1.  Since no houses will ever be in the river, should the shoreline be the boundary on that side?  If I use a convex hull to define the boundary, part of the study area would fall in the river because the shoreline is irregular.

     2.  Should I draw one boundary around the main part of the village and a second boundary around the separate group of 10 houses, and add the areas together, or should I draw a single boundary around everything?

     3.  Because of the arrangement of the village around an L-shaped corner, using a convex hull to define the boundary includes a lot of space with no houses between the arms of the L.  Instead of using a convex hull, would it be appropriate to use a more L-shaped boundary to match the shape of the village, or is there a reason to use the convex hull?
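
To make question 3 concrete, here is a minimal sketch of how much extra area a convex hull can add around an L-shaped settlement. The outline coordinates are hypothetical (metres, not the real village), and the hull is computed with a standard monotone-chain routine rather than any particular GIS tool.

```python
def shoelace_area(polygon):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# Hypothetical L-shaped village outline (metres), not real survey data.
l_shape = [(0, 0), (300, 0), (300, 100), (100, 100), (100, 300), (0, 300)]

l_area = shoelace_area(l_shape)                  # area of the L itself
hull_area = shoelace_area(convex_hull(l_shape))  # area of its convex hull
print(l_area, hull_area, hull_area / l_area)
```

For this made-up outline the hull is 40% larger than the L-shaped boundary, and all of the extra area is empty space between the arms of the L.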

Getis-Ord Gi* questions

If Average Nearest Neighbor does not indicate clustering, is there any reason to use Getis-Ord Gi*?

The frequency of individuals that have had the disease is about 1/4 of all those tested.  In more than half the homes, no one has had the disease, and no home has had more than two people with the disease.  Consequently, the frequency of homes with 0, 1, or 2 individuals who have had the disease is skewed towards homes with 0, which is definitely not a normal distribution.  Is this appropriate data to use with Getis-Ord Gi*?  If not, can you suggest a statistical test that would be more appropriate for analyzing clustering in this data?



4 Replies
MVP Esteemed Contributor

You will get advice here...but I usually direct style questions...pruned down to geometry and

Cross Validated, a home to lots of people with statistical expertise.


  • alternatives to the convex hull are the concave hull of which there are several incarnations available
  • is dwelling location purely random or is there some underlying association
  • disease spread need not be associated with dwelling in the first place but controlled by interactions elsewhere
  • the river is nodata and not part of the study area
  • does any of your data conform to the requirements of the proposed methods
  • ... and as you can see the list goes on before a stats test should even be considered...
Occasional Contributor

Hi Cheryl,

If my study area wasn't well defined, I probably wouldn't use Average Nearest Neighbor unless I just wanted to casually compare the average nearest neighbor distances (and wasn't interested in determining statistical significance).  And that may be exactly what you want to do!    I say this because, unfortunately, the Average Nearest Neighbor statistic z-score and p-value calculation is very sensitive to study area size. 
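
The sensitivity to study area size can be seen in a small sketch. It assumes the formulas given in the Average Nearest Neighbor tool documentation (expected mean distance 0.5 / sqrt(n / A), standard error 0.26136 / sqrt(n² / A)); the point layout and the two area values are hypothetical.

```python
import math


def average_nearest_neighbor(points, area):
    """Return (nearest neighbor ratio, z-score) for points in a study area."""
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point.
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)
    std_err = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_err


# Hypothetical houses on a 5 x 5 grid, 20 m apart (not real survey data).
houses = [(20.0 * col, 20.0 * row) for row in range(5) for col in range(5)]

# Same points, two different study area definitions (square metres).
ratio_tight, z_tight = average_nearest_neighbor(houses, area=80.0 * 80.0)
ratio_wide, z_wide = average_nearest_neighbor(houses, area=400.0 * 400.0)
print(ratio_tight, z_tight)
print(ratio_wide, z_wide)
```

With the tight boundary the ratio is above 1 (reads as dispersed); with the generous boundary the very same points give a ratio below 1 (reads as clustered). Only the area changed.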

Average Nearest Neighbor is a global statistic that tells you about overall clustering.  Getis-Ord Gi* is a local statistic that shows you where clusters are.  It would not be unusual for a global statistic to say there is no clustering but for a local statistic to find local statistically significant clusters.  So even if a global statistic tells you there is no clustering, that doesn't mean you shouldn't bother with a local statistic.

But I'm not sure Getis-Ord Gi* will actually be very helpful to you.  You indicate the positive cases are very rare... if you map the positive cases, can you tell just by looking at the map if there is clustering?  I say that because for Getis-Ord Gi*, you really want a good range of values... if almost all of your households have zero cases and a couple have one or two cases, that really isn't enough variation in the analysis value to be appropriate for this tool.  If there are at least 30 neighborhoods in the village, aggregating the counts and creating ratios (positive to negative) for each of those neighborhoods might provide enough variation.

Another issue that will be problematic, though, is appropriately modeling spatial interaction among the households.  You will want a model that reflects that there is no (or less) interaction across the river than on the same side of the river.

Finally, keep in mind what this analysis is saying... the expectation (null hypothesis) is that every positive case could be dropped hither and thither onto the households in a random manner.  Consequently, finding statistically significant hot or cold spots suggests other processes (beyond random chance) may be at work... Still, all we can really do in this case is reject that null hypothesis.  Would rejecting the null hypothesis be useful (would someone get excited to learn that who gets the disease is not completely random)?  If we believe getting the disease is entirely the result of bad luck (no contagious property, no spreading vector, no genetic components at all), then yes, finding statistically significant clustering might be interesting.  Similarly, if you have no idea what the factors promoting the disease are, then WHERE the clusters occur might show you where to start looking for answers.

But keep in mind that Getis-Ord works by comparing the local mean (average positive cases for a household and its neighbors) to the global mean (average cases for all households).  The tool then determines if the local mean is significantly different from the global mean.  If you have a sea of households with zeros, any deviation from zero will be statistically significant, and just mapping the positive cases will probably show you where to look for factors that may be promoting the disease.
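
The local-versus-global comparison can be sketched as follows. This is a simplified illustration, assuming binary fixed-distance-band weights and the standard Gi* z-score formula; it is not the tool's actual implementation, and the household locations and counts are hypothetical.

```python
import math


def getis_ord_gi_star(points, values, distance_band):
    """Gi* z-score per feature, using binary weights within distance_band."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z_scores = []
    for xi, yi in points:
        # Weight 1 for every feature within the band, including the feature itself.
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= distance_band else 0.0
             for xj, yj in points]
        sum_w = sum(w)
        sum_w2 = sum(wj * wj for wj in w)
        local_sum = sum(wj * vj for wj, vj in zip(w, values))
        denom = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        # Undefined when values do not vary or every feature is a neighbor.
        z_scores.append((local_sum - mean * sum_w) / denom if denom else float("nan"))
    return z_scores


# Hypothetical row of 10 households, 10 m apart, mostly zero positive cases.
houses = [(10.0 * i, 0.0) for i in range(10)]
positives = [0, 0, 0, 0, 0, 0, 0, 1, 2, 1]

z = getis_ord_gi_star(houses, positives, distance_band=10.0)
hottest = z.index(max(z))
print(hottest, [round(v, 2) for v in z])
```

In this toy example the highest z-score lands on the household with two cases and positive neighbors, which is also where a plain map of the counts would already draw the eye.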

I guess this is what I would do:

1) If mapping the positive cases shows clear clustering, I would go with my map.  Done.

2) If I had a good range of values for the household ratio of people with and without the disease, and if I had more than 30 households on each side of the river with positive cases, and if it was tricky to see if the positive cases were clustered or not... I would run hot spot analysis on the ratios for each side of the river separately using the exact same distance band for both analyses... Then I could make comments about the clustering overall and also could compare the clustering on each side of the river.

I hope this helps a little bit!

Very best wishes,

Lauren Scott


Occasional Contributor II

Thank you Lauren, for your detailed reply.  It really helped me.

It turns out that, due to the small sample size, I am unable to use Getis-Ord Gi*: I have 24 positives out of 120 tested, scattered across 40 households (apparently I need at least 60).  There appears to be some tendency toward clustering, but I cannot tell visually whether it differs from the clustering of the individuals tested.  I used Kernel Density on all individuals tested and on pos/neg individuals.  The outputs are very similar, and changing the distance or using None doesn't make much difference except in the overall size of the "hotspot", unless of course the distance is so small that each house is a hotspot.  From this I am concluding that there is no evidence for clustering.

As an additional test, I conducted a Poisson goodness-of-fit test for the number of individuals in each household that were positive, trying to determine if the presence of one individual in a household that was positive made it more likely that a second individual would be positive.  This came up "Not significant".  I am not a statistician, so I am not certain that this is a valid test.  Is the Poisson an appropriate test?
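
For readers who want to see what such a test computes, here is a minimal sketch of a chi-square goodness-of-fit test against a Poisson distribution. The household counts are made up for illustration (they are not the survey data), the categories 0, 1, and 2-or-more are an assumption, and the sketch does not settle whether Poisson is the right reference distribution when household sizes differ.

```python
import math


def poisson_gof(household_counts):
    """Chi-square goodness of fit to Poisson, categories 0, 1, and 2+.

    Returns (lambda estimate, chi-square statistic, p-value).
    With 3 categories and 1 estimated parameter there is 1 degree of freedom,
    so the p-value is erfc(sqrt(chi2 / 2)).
    """
    n = len(household_counts)
    lam = sum(household_counts) / n
    if lam == 0:
        raise ValueError("no positive cases; the test is undefined")
    observed = [
        sum(1 for c in household_counts if c == 0),
        sum(1 for c in household_counts if c == 1),
        sum(1 for c in household_counts if c >= 2),
    ]
    p0 = math.exp(-lam)
    p1 = lam * p0
    expected = [n * p0, n * p1, n * (1.0 - p0 - p1)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return lam, chi2, p_value


# Hypothetical counts of positives per household (illustrative only).
counts = [0] * 22 + [1] * 12 + [2] * 6

lam, chi2, p_value = poisson_gof(counts)
print(lam, chi2, p_value)
```

A large p-value here only says the counts are compatible with a Poisson pattern; it says nothing about where in the village the positive households are located.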

I have another similarly sized dataset that does appear to have clustering of positive individuals that differs from the clustering of individuals in general (i.e., the hotspots are not in the same location).  From what you have already indicated, I think you are saying, "Let the maps tell the story."  I cannot say anything about the significance of this, but I can say the maps suggest there might be something going on.  The Poisson test was not significant in this case either.

We have 8 more villages to look at, ultimately.  Unfortunately, all of them have small sample sizes, but perhaps looking across the 10 different communities, we can visually get some idea if anything is going on.

Are there any significance tests that I could use with these small sample sizes?  The data was collected several years ago for a different study, and spatial analysis was not being considered at the time.  Consequently, no provisions for spatial testing were made.  We are just trying to use what we have available to see if a spatial component can be detected.

Occasional Contributor

Hi Cheryl,

Hmmm... small sample size, no obvious clustering when the data is mapped... There may be better solutions, but here is one idea: if you have samples from at least 30 households in a village and have results (positive or negative) for all occupants within each household surveyed, you can calculate a ratio for each household: number of positive cases (positive only) divided by total number of cases (positive + negative) to get the percent positive in each household.  If Euclidean Distance (as the crow flies) is a reasonable way to think about the relationships among the households in the village (not reasonable if there are barriers like rivers), you can run Optimized Hot Spot Analysis on the household points using the ratio as your analysis field and see if there is any clustering (that tool only requires 30 points if you have an Analysis Field).  If a network, however, is a better representation of the relationships among households (and this could also work if there are bridges from one side of the river to the other connecting households in a village), and you have data for the village transportation network, you can use the Generate Network Spatial Weights tool to create a network representation of the spatial relationships among the households.  You would then use the Hot Spot Analysis tool (rather than Optimized Hot Spot Analysis) and the network spatial weights file (set the Conceptualization of Spatial Relationships parameter to: Get Weights From File) to test for clustering.
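
The household ratio described above is simple to compute from per-individual results. The sketch below uses hypothetical records and a hypothetical record layout (household id, positive flag); the 30-point threshold is the one mentioned above for Optimized Hot Spot Analysis with an Analysis Field.

```python
from collections import defaultdict


def household_ratios(records):
    """Map household id -> positives / total tested in that household.

    records: iterable of (household_id, is_positive), one per tested person.
    """
    tested = defaultdict(int)
    positive = defaultdict(int)
    for household_id, is_positive in records:
        tested[household_id] += 1
        if is_positive:
            positive[household_id] += 1
    return {hid: positive[hid] / tested[hid] for hid in tested}


# Hypothetical test results (illustrative only).
records = [
    ("H1", True), ("H1", False),
    ("H2", False), ("H2", False), ("H2", False),
    ("H3", True),
]

ratios = household_ratios(records)
enough_points = len(ratios) >= 30  # minimum number of household points
print(ratios, enough_points)
```

The resulting ratio would be joined to the household points and used as the analysis field.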

I will also ask a colleague if he has other ideas or suggestions, and I will get back to you if he does (or ask him to please reply directly).

Very best wishes, Cheryl!

Lauren Scott

