What spatial analysis can I do with presence/absence point data from a cross-sectional survey?

AnthoniaOnyeahialam · ‎01-13-2014

I have gathered point data in a cross-sectional survey of households. The data record the presence or absence of disease in each household, without recording how many disease cases occur. The idea is to measure households with a burden, which does not necessarily translate to disease counts. I want to explore the spatial patterns of these burdens, but from what I see of Moran's I, Getis-Ord, and the other spatial statistics tools in ArcGIS, they only deal with continuous data and are not suited to binary presence/absence data.

The question I want to answer is "Where are there clusters of disease burdened households?"

Is this possible with a sample of 435 households in a cross-sectional survey?

Could you please advise me whether I am missing something in ArcGIS that I could use with my data, or whether other software or another method can handle this type of data?

Thanks in advance

JeffreyEvans · ‎01-13-2014

You are fairly limited here. The two common pattern/autocorrelation statistics appropriate for binary data are the "General Cross-Product Statistic" and the "Join-Count Statistic". Unfortunately, these are not available in ArcGIS. Here is a nice introduction to these statistics: http://www.stat.ncsu.edu/people/fuentes/courses/madrid/lectures/areal2.pdf
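
For readers who want to try this outside ArcGIS, here is a minimal Python sketch of a join-count test for point data. It is an illustration under assumptions that are not from this thread: neighbours are defined by k nearest neighbours (k=5 is an arbitrary choice), only positive-positive joins are counted, and significance comes from randomly relabelling households. The function names are hypothetical.

```python
import numpy as np

def knn_pairs(xy, k=5):
    """Undirected neighbour pairs (i, j), i < j, built from k nearest neighbours."""
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]
    pairs = {(min(i, int(j)), max(i, int(j)))
             for i in range(len(xy)) for j in nbrs[i]}
    return np.array(sorted(pairs))

def join_count_11(y, pairs):
    """Number of joins whose two ends are both positive (1-1 joins)."""
    return int(np.sum(y[pairs[:, 0]] * y[pairs[:, 1]]))

def join_count_test(xy, y, k=5, n_perm=999, seed=0):
    """Observed 1-1 join count and a one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    pairs = knn_pairs(xy, k)
    observed = join_count_11(y, pairs)
    # Null hypothesis: disease status is assigned to households at random.
    sims = np.array([join_count_11(rng.permutation(y), pairs)
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(sims >= observed)) / (n_perm + 1)
    return observed, p_value
```

Here `xy` is an (n, 2) array of household coordinates and `y` is a 0/1 integer array. A small p-value suggests that positive households neighbour each other more often than random labelling would produce. This is a global test, so it says whether clustering exists, not where it is.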

There is a Bernoulli specification of the SaTScan statistic but it can be a bit tricky to implement correctly.
http://www.satscan.org/papers/k-cstm1997.pdf
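
To make the Bernoulli model concrete, here is a rough Python sketch of the log-likelihood ratio that a Bernoulli spatial scan statistic evaluates for one candidate window, plus a naive search over circular windows centred on each household. It is not SaTScan's implementation: the function names, the 50% cap on window size, and the brute-force search are assumptions for illustration, and the Monte Carlo step needed for a p-value is left out.

```python
import math
import numpy as np

def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_llr(c, n, C, N):
    """LLR for a window with c positives among n points, given C positives among N overall."""
    if n == 0 or n == N:
        return 0.0
    p_in, p_out = c / n, (C - c) / (N - n)
    if p_in <= p_out:                      # only elevated-risk windows are of interest
        return 0.0
    ll_alt = (_xlogy(c, p_in) + _xlogy(n - c, 1 - p_in)
              + _xlogy(C - c, p_out) + _xlogy((N - n) - (C - c), 1 - p_out))
    ll_null = _xlogy(C, C / N) + _xlogy(N - C, 1 - C / N)
    return ll_alt - ll_null

def most_likely_cluster(xy, y, max_frac=0.5):
    """Return (llr, centre index, radius) of the highest-LLR circular window."""
    N, C = len(y), int(y.sum())
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    best = (0.0, None, None)
    for i in range(N):
        order = np.argsort(d[i])           # points sorted by distance from centre i
        cum_pos = np.cumsum(y[order])
        for n in range(1, int(max_frac * N) + 1):
            llr = bernoulli_llr(int(cum_pos[n - 1]), n, C, N)
            if llr > best[0]:
                best = (llr, i, float(d[i, order[n - 1]]))
    return best
```

In practice the observed maximum would be compared with maxima from many random relabellings of the 0/1 outcomes to judge significance. Windows are grown one point at a time, so points at tied distances are not handled specially in this sketch.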

AnthoniaOnyeahialam · ‎01-13-2014

Thank you, Jeffrey, for suggesting an alternative. ArcGIS never tells us what the underlying assumptions of these methods are. I will explore SaTScan.

Best

LaurenScott · ‎01-15-2014

Hi Anthonia,
Here are some suggestions that may help you answer your question. There are actually several tools that could help you analyze your presence/absence data. You could run Average Nearest Neighbor on all your surveyed households and compare that to Average Nearest Neighbor for just the positive households. This would help you answer questions about overall clustering of the positives. You could create a composite line graph of K Function results for all households surveyed and also for just the positive households, to understand whether there are any particular distances where clustering is statistically significant. If you have ArcGIS 10.2 or later, however, I recommend that you use the Optimized Hot Spot Analysis tool. This will give you information about where the disease-impacted households are clustered. Because you have both presence AND absence information, I would like to recommend the following:
1) Run Optimized Hot Spot Analysis (OHSA) on all households surveyed. Do not use an Analysis Field. Select to aggregate to a fishnet grid (COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS). This will create a fishnet polygon grid that overlays the households that you surveyed and for now, that's what you want! Let's call this fishnet polygon grid, TheStudyArea. Because you have done the analysis on ALL households surveyed you will get coverage even in locations with no PRESENCE... and that's what you want.
2) Now run Optimized Hot Spot Analysis again only on the points (households) that were positive. This time select to aggregate incidents to aggregation polygons (COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS). Use TheStudyArea for your aggregation polygons (Polygons For Aggregating Incidents Into Counts). This analysis answers the question: where are there lots of positive households?... This might be important if you want to make sure you get resources to the places where they are most needed. Note, however, that this analysis does not control for the number of households surveyed... we are only answering where do we have LOTS of households that are disease impacted. We will control for household density next.
3) Now do a spatial join so that each polygon grid cell in TheStudyArea gets a count of the number of households that were surveyed (positive or negative) and also a count of the number of positive households. Add a new field to TheStudyArea and calculate it as the ratio: positives divided by all households. Now run OHSA on this ratio. This will answer the question: where do we have unexpected clustering of disease-impacted households, given the spatial distribution and density of the households surveyed?
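
The ratio field from step 3 can be illustrated outside ArcGIS with a short Python sketch. This is not the OHSA tool and does not run any hot spot statistic. It only shows, for a simple square grid, how the per-cell counts and the positives-to-surveyed ratio are formed. The function name and the `cell_size` parameter are hypothetical, and OHSA chooses its own fishnet cell size.

```python
import numpy as np

def grid_positive_ratio(xy, y, cell_size):
    """Per-cell counts of surveyed and positive households, and their ratio."""
    origin = xy.min(axis=0)
    # Integer (column, row) index of the grid cell containing each household.
    idx = np.floor((xy - origin) / cell_size).astype(int)
    shape = tuple(idx.max(axis=0) + 1)
    total = np.zeros(shape, dtype=int)
    positive = np.zeros(shape, dtype=int)
    np.add.at(total, (idx[:, 0], idx[:, 1]), 1)      # all surveyed households
    np.add.at(positive, (idx[:, 0], idx[:, 1]), y)   # positive households only
    ratio = np.full(shape, np.nan)                   # NaN where nothing was surveyed
    surveyed = total > 0
    ratio[surveyed] = positive[surveyed] / total[surveyed]
    return total, positive, ratio
```

Cells with no surveyed households are left as NaN rather than 0, so that "no data" is not confused with "no disease". Cells with very few households give unstable ratios, which is worth keeping in mind when interpreting any hot spot result built on this field.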

I hope this information is helpful. If you have questions about the spatial statistics tools, I hope you will contact me (LScott@esri.com). If the documentation is not clear or is not complete, please point that out to me so that I may correct the oversight as quickly as possible.

Very best wishes with your research!
Lauren

Lauren M. Scott, PhD
Esri
Geoprocessing, Spatial Analysis, Spatial Statistics

JeffreyEvans · ‎01-15-2014

Lauren,
These are very clever suggestions and really leverage available ArcGIS tools. However, I would respectfully disagree with the statistical tractability of your recommendation.

1) By aggregating the binary data to counts, represented by each fishnet cell, you have effectively invalidated the null spatial distribution that would indicate clustering. Additionally, since these units are arbitrary, you are adding an additional error component associated with the Modifiable Areal Unit Problem (MAUP). In effect, you are no longer representing the underlying Bernoulli spatial process but rather an arbitrary inhomogeneous intensity process. The assumption of a homogeneous random field does not hold for the LISA statistic in the same way that it does for PPA statistics, but you are recommending changing the underlying spatial process in a way that could profoundly change inference.

2) By partitioning the data this way, you are breaking the Bernoulli distribution and making it impossible to draw inference around the occurrence of the process.

If I had my druthers, I would use a Poisson point process model. In this way you could test competing hypotheses of ancillary generating processes (covariates) rather than just treating it as a pure spatial process with no deterministic characteristics.

MAUP References

Cressie, N.A. (1996) Change of support and the modifiable areal unit problem. Geographical Systems 3(2–3):159–180.

Holt, D., D. Steel, M. Tranmer, N. Wrigley (1996) Aggregation and ecological effects in geographically based data. Geographical Analysis 28(3):244–261.

Openshaw, S. (1983) The modifiable areal unit problem. Norwich: Geo Books. ISBN 0860941345.

Wrigley, N. (1995) Revisiting the modifiable areal unit problem and the ecological fallacy. In Cliff, A.D. Diffusing geography: essays for Peter Haggett. The Institute of British Geographers special publications series 31. pp. 123–181.

DanRoberts1 · ‎11-17-2021

I have exactly the same problem, only with discards of marine fish vs. fish retained. I initiated the OHS but left out the overall distribution of all data points (0,1). Thank you for your excellent support on this issue.