I have data from a village along the Amazon River, where homes were visited and individuals were tested for having had a particular disease. In most homes, everyone tested was negative, but in some homes one or more of the people tested were positive.
Study area questions
I want to see whether those who have had the disease are clustered, so I am trying to use the Average Nearest Neighbor tool. I have a question about how to calculate the area for this tool. Most of the village is on one side of some fields, around the corner of an L-shaped road, but another group of about 10 houses lies across the fields, more than 150 m from any house in the rest of the village. In addition, one side of the village is bounded by the river. I am trying to find the most statistically valid way to define the study area, which has raised several questions.
1. Since no houses will ever be in the river, should the shoreline be the boundary on that side? If I use a convex hull to define the boundary, part of the study area would fall in the river, because the shoreline is irregular.
2. Should I draw one boundary around the main part of the village and a second boundary around the separate group of 10 houses, and add the areas together, or should I draw a single boundary around everything?
3. Because of the arrangement of the village around an L-shaped corner, using a convex hull to define the boundary includes a lot of space with no houses between the arms of the L. Instead of using a convex hull, would it be appropriate to use a more L-shaped boundary to match the shape of the village, or is there a reason to use the convex hull?
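To show why the choice of area matters, here is a minimal sketch of how the nearest neighbor ratio responds to the study area. It assumes the tool uses the standard Clark-Evans form of the statistic (expected mean distance 0.5 / sqrt(n / A), standard error 0.26136 / sqrt(n^2 / A)). The house coordinates, the two area values, and the function name are all made up for illustration and are not from the village data.

```python
import math

def average_nearest_neighbor(points, area):
    """Return (ratio, z) for the Clark-Evans nearest neighbor statistic.

    points: list of (x, y) tuples in metres (hypothetical coordinates)
    area:   study area in square metres (the quantity in question)
    """
    n = len(points)
    # Observed mean distance from each point to its nearest neighbor.
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    observed = sum(nearest) / n
    # Expected mean distance under complete spatial randomness.
    expected = 0.5 / math.sqrt(n / area)
    # Standard error of the mean nearest neighbor distance.
    std_error = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_error

# Made-up houses spaced 20 m apart along the two arms of an L.
houses = [(x, 0.0) for x in range(0, 201, 20)] + \
         [(0.0, y) for y in range(20, 201, 20)]

# Assumed areas: the convex hull of these points is a triangle of
# 0.5 * 200 * 200 = 20000 m^2; 8000 m^2 stands in for a tighter
# L-shaped boundary (an arbitrary illustrative value).
hull_area = 20000.0
l_shape_area = 8000.0

ratio_hull, z_hull = average_nearest_neighbor(houses, hull_area)
ratio_l, z_l = average_nearest_neighbor(houses, l_shape_area)

print("convex hull:", ratio_hull, z_hull)
print("L-shaped   :", ratio_l, z_l)
```

The observed mean distance is the same in both runs; only the expected distance changes, and it scales with the square root of the area. A larger boundary therefore pushes the ratio toward "clustered" without any change in the points themselves.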
Getis-Ord Gi* questions
If Average Nearest Neighbor does not indicate clustering, is there any reason to use Getis-Ord Gi*?
About 1/4 of all those tested have had the disease. More than half the homes have no one who has had the disease, and no home has more than two people who have had it. Consequently, the distribution of homes with 0, 1, or 2 individuals who have had the disease is skewed towards 0 and is definitely not normal. Is this appropriate data to use with Getis-Ord Gi*? If not, can you suggest a statistical test that would be more appropriate for analyzing clustering in this data?