Handling Missing Data in a Hotspot Analysis

GISNewbie1122 · 12-21-2021

Hello, I am using Getis-Ord Gi* to conduct a hotspot analysis of census tracts in ArcGIS Pro (using a US Census Bureau shapefile for the census tracts). I have prevalence rates, expressed as percentages, for a certain disease.

Because some census tracts have no cases in my data, or have a denominator below a cutoff we've specified (or just a denominator of 0 in some instances), some tracts are missing when I map the disease prevalence rates. How can missing data like this be handled when conducting a hotspot analysis?

I should add that tracts are missing for one of two reasons. Either we do not have any of our health plan members living in the tract (a 'denominator' of zero; some tracts also have extremely small denominators, like one or two people, hence the denominator cutoff I mentioned), or we do not have any cases there (a 'numerator' of zero).

So the missing values aren't always due simply to missing cases; sometimes the tract's 'denominator' is missing as well.

This ESRI article says the following about missing data and hotspot analysis:

https://www.esri.com/about/newsroom/arcuser/dealing-with-missing-data/

The impacts of filling in missing values on statistical analyses are more difficult to determine, particularly if the analyses involve calculating local statistics. For example, hot spot analysis (Getis-Ord GI*) compares a local statistic to the global average. Filling in missing values can skew the distribution so that the mean of the dataset will be different once missing values are filled in. Since the impact of filling in missing values is difficult to predict, a best practice is to perform the statistical analysis before and after filling in missing values to compare the results.

So is the only recourse simply removing the missing census tracts? How does ArcGIS Pro in particular treat null/missing tracts? Should I just set my "rate" variable to 'null', or should it be set to zero?

Or should I just remove the missing tract shapes/features completely from the underlying census tracts shapefile when joining my prevalence rate data with the census tracts shapefile?

Any insights and especially literature sources would be much appreciated!

DanPatterson · 12-21-2021

For an opinion, either remove the CTs with no data or set them to null.

Do not set them to 0, since 0 is a valid observation (i.e., we looked but didn't find anything, which is different from saying there is nothing for other reasons).
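To make the null-versus-zero distinction concrete, here is a minimal Python sketch of how the rate field might be populated before joining it to the tract shapefile. The cutoff value `MIN_MEMBERS`, the function name, and the sample counts are all made up for illustration; nothing here is ArcGIS-specific.

```python
from typing import Optional

MIN_MEMBERS = 20  # hypothetical denominator cutoff; use whatever threshold you chose

def prevalence_pct(cases: int, members: int,
                   min_members: int = MIN_MEMBERS) -> Optional[float]:
    """Return prevalence as a percentage, or None when the rate is unknown."""
    if members <= 0 or members < min_members:
        # No members, or too few to trust the rate: unknown, NOT zero.
        return None
    # Enough members were observed, so 0 cases is a genuine rate of 0.0.
    return 100.0 * cases / members

# Illustrative (cases, members) counts per tract.
tracts = {
    "A": (5, 200),  # ordinary tract
    "B": (0, 150),  # looked, found nothing -> valid zero
    "C": (1, 2),    # below the cutoff -> null
    "D": (0, 0),    # no members at all -> null
}
rates = {tract_id: prevalence_pct(c, m) for tract_id, (c, m) in tracts.items()}
print(rates)
```

Tracts whose rate comes back as `None` are the ones to leave null or drop; tracts with a real 0.0 stay in the analysis.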

... sort of retired...

LaurenGriffin · 12-21-2021

I think you have a couple of options. The first thing you’ll want to do is clarify the question you’re trying to answer. Is it this: which census tracts, with at least N of “our health plan members”, are part of a statistically significant cluster of high prevalence (a hot spot)? If that’s your question, you’ll want to begin by removing all census tracts with fewer than N members from your analysis. If the tract denominator (the number of members) is at least your threshold (N), and your numerator (the number of cases) is 0, your prevalence rate is zero, which is accurate and valid.

Keep in mind that hot spot analysis looks at each tract within the context of neighboring tracts. If many tracts won’t have neighbors because they don’t have at least N members, then ask yourself whether you’re really looking for clusters of high prevalence after all. Are you trying to determine which tracts have higher-than-expected prevalence (if so, just map prevalence, but also see the disparity index suggestion below), or do you want to know where tracts with high prevalence cluster spatially (and where that clustering is statistically significant)? Hot spot analysis will show you statistically significant regions of high prevalence.
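To illustrate what "within the context of neighboring tracts" means, here is a rough Python sketch of the Gi* z-score in its commonly published form, using binary (neighbor / not neighbor) weights. It is not the ArcGIS Pro implementation: the neighbor lists, the sample rates, and the function name are assumptions for illustration, and the real tool offers other ways of defining neighbors plus significance corrections that are left out here.

```python
import math

def gi_star(values, neighbors):
    """Gi* z-scores with binary weights; each feature counts as its own neighbor.

    values    : list of rates, one per tract (no nulls)
    neighbors : list of lists of neighbor indices, one list per tract
    """
    n = len(values)
    mean = sum(values) / n  # global mean of all tracts
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)  # global std dev
    scores = []
    for i in range(n):
        group = set(neighbors[i]) | {i}  # Gi* includes the tract itself
        w_sum = len(group)               # sum of binary weights (also sum of squares)
        local_sum = sum(values[j] for j in group)
        denom = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        # Undefined when all values are equal or every tract is a neighbor.
        scores.append((local_sum - mean * w_sum) / denom if denom else float("nan"))
    return scores

# Six hypothetical tracts in a row: high rates on the left, low on the right.
rates = [10.0, 12.0, 11.0, 2.0, 1.0, 3.0]
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
scores = gi_star(rates, chain)
print([round(z, 2) for z in scores])
```

Because every score depends on the global mean and standard deviation, setting unknown tracts to 0 or dropping them changes the result for all tracts, not just the missing ones. Running a sketch like this both ways on a small sample is one way to see that effect.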

I’m thinking that mixing tracts WITH members, and tracts with NO members will complicate your analysis… you wouldn’t know for sure if a cold spot was cold because of clustering of low prevalence or because of clustering of low membership (a cluster of zeros because there aren’t N members), for example. You could, however, aggregate census tracts so that all your polygons (tracts or groups of tracts) have at least N members. Here is a case study that provides a workflow that might help you do that: https://desktop.arcgis.com/en/analytics/case-studies/linguistic-diversity-1-intro.htm

I’m thinking the best solution, however, might be to compute disparity indices. The disparity indices would identify where the disease was not distributed “fairly”/evenly based on health plan membership. You could then run hot spot analysis on the disparity indices if you choose to. Computing the disparity indices addresses two problems with rates: division by zero, and the small numbers problem (a tract has 2 people, one gets the disease, so the rate is 50%, yikes!). Running hot spot analysis on the disparity indices addresses a third problem: the artificial nature of tract boundaries in relation to disease cases. [These three issues with rates are discussed here: https://desktop.arcgis.com/en/analytics/case-studies/locating-a-new-retirement-community.htm. That case study refers to the disparity index as “Level of Service”, but it’s the same thing. In this other Learn lesson, the disparity index is used to see how equitably trees are distributed across race/ethnicity and susceptible populations: https://learn.arcgis.com/en/projects/shade-equity-determine-tree-planting-locations-with-suitability... ]

Oh, and if you decide to go the disparity index route, you can use all your tracts, even those with 0 or only a few members.

Basically, the disparity index expects a census tract with 2% of all your health plan members to be associated with 2% of all the cases. The formula is this:

For each tract compute: (Ci / All Cases) - (Mi / All Members)

Where Ci is the number of cases in the tract, and Mi is the number of members in the tract.

All Cases is the sum of cases for all tracts. All Members is the sum of members for all tracts.

A positive result means the proportion of cases is higher than the proportion of members (so a higher-than-expected rate/prevalence). A negative result is a lower-than-expected proportion of cases. When the case proportion matches the member proportion (the expectation), the result is zero.
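As a quick sketch of the formula above, the following Python computes the index for a handful of made-up tracts, including one with only 2 members and one with none. The function name and the counts are hypothetical; in practice you would calculate a field on the tract attribute table instead.

```python
def disparity_indices(cases, members):
    """Return (Ci / All Cases) - (Mi / All Members) for each tract."""
    all_cases = sum(cases)
    all_members = sum(members)
    if all_cases == 0 or all_members == 0:
        raise ValueError("total cases and total members must both be non-zero")
    return [c / all_cases - m / all_members for c, m in zip(cases, members)]

# Illustrative counts per tract (not real data).
cases   = [10,   0, 1, 0,   9]
members = [200, 150, 2, 0, 648]

indices = disparity_indices(cases, members)
for c, m, d in zip(cases, members, indices):
    print(f"cases={c:3d} members={m:4d} index={d:+.3f}")
```

Only the totals appear in a denominator, so the tract with 0 members gets an index of 0 instead of a division-by-zero error. The 1-case, 2-member tract gets a small positive index instead of a 50% rate. The indices sum to zero across all tracts (up to floating-point rounding).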

When you run hot spot analysis on the indices, you’ll see hot spots in locations where positive indices cluster and cold spots where negative indices cluster.

I hope this helps, or at least gives you some ideas for other options.

Best wishes!

Lauren