I've been trying to perform a cluster analysis on soil/till samples that were collected for mineral exploration. Basically, I have data that spans Canada. I have a field in the data that indicates the total number of indicator grains (called 'Total_Positives') found in each sample. My intention is to find statistically significant clusters. I've read through the help documentation and watched some of the training videos.

Here is what I've done:

(1) I ran the Moran's I Spatial Autocorrelation tool at different distance bands in order to find the optimum distance band of clustering, looking for the distance with the highest Z-score. I used the 'Total_Positives' field as my Input field. As suggested in the help documentation, I've used "Fixed Distance Band" as the conceptualization.

(2) Using the distance band corresponding to the highest Z-score value, I ran the Getis-Ord Gi* Hot Spot Analysis tool.

The problems I have encountered are:

(a) When I run Moran's I Spatial Autocorrelation at various distance bands, I get a message "xx features had no neighbors which generally invalidates the statistical properties of a test." The number of features with no neighbors decreases with increasing distance bands. Does this invalidate the calculated Z-score values, and can I still use them to determine the optimum distance band for my clustering?

(b) The sampling density is quite variable all over Canada. Does this variation have any effect on my clustering calculation? For example, the sample spacing varies between project areas. As we haven't sampled in detail all over Canada, the actual sample locations cause some clustering that is not the clustering of high values we are searching for. Assuming variable sampling density does affect my cluster analysis, how can I remove its effect?

(c) The Z-score values by distance are highly variable and do not show a gradual increase to a maximum followed by a gradual decline. However, all Z-score values are very high and higher than 2.85. Many of them are over 80. With some datasets, I was getting Z-score values over 250. Are these values meaningful? In all cases, the p-value is shown as 0.

Using Global Moran's I to find the peak Z in order to help you decide the appropriate scale of your analysis (fixed distance band) is an excellent strategy. [Just by the way, if you are using ArcGIS 10.0, we have a sample script tool called Incremental Spatial Autocorrelation that automates this process and can save you some time. If you are interested in getting this tool, please go to our resources page, www.esriurl.com/spatialstats, and look for "Supplementary Spatial Statistics"]. The message about some features not having any neighbors is an indication that your distances are not large enough to ensure that every feature has at least one neighbor. To get a quick description of neighbor distances, you can run the Calculate Distance Band From Neighbor Count tool (in the Spatial Statistics toolbox, Utilities toolset). It returns the minimum nearest neighbor distance (this is the distance between the two features that are closest together in your dataset), the average nearest neighbor distance (this is the average distance each feature is away from its nearest neighbor), and the maximum nearest neighbor distance (this is the smallest distance that will ensure that EVERY feature in your dataset has at least one neighbor). Sometimes when you have a couple of outlier features, the distance to ensure every feature has at least one neighbor gets rather large... possibly larger than is effective for your analysis. You can check this by using the measurement tool to "draw" the maximum nearest neighbor distance.
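To make the three distances concrete, here is a minimal pure-Python sketch of the same summary. It is not the tool's implementation; the function names and the sample coordinates are made up for illustration, and the coordinates are assumed to be in a projected system so that straight-line distances are meaningful.

```python
import math

def nearest_neighbor_distances(points):
    """Return, for each point, the distance to its nearest other point."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        distances.append(nearest)
    return distances

def summarize(points):
    """Minimum, average and maximum nearest neighbor distance."""
    d = nearest_neighbor_distances(points)
    return {"minimum": min(d), "average": sum(d) / len(d), "maximum": max(d)}

# Hypothetical projected coordinates; the last sample is an isolated outlier.
samples = [(0, 0), (3, 4), (6, 8), (100, 0)]
summary = summarize(samples)
print(summary)
```

In this made-up example the single outlier pushes the maximum nearest neighbor distance far above the spacing of the other samples, which is the situation described above.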

When outliers are forcing you to use distances that are too large (too large to effectively capture the spatial processes you believe are at work in your data), we recommend:

1) Select all but the outlier features.

2) Run Incremental Spatial Autocorrelation (or Global Moran's I for increasing distances) on the selection set. [If you don't enter a distance for the beginning distance, btw, it will use the distance that ensures every feature has at least one neighbor]. This analysis will give you the peak distance for the majority of your features (unbiased by the handful of outliers). You still want to include the outliers in your final analysis, however, so:

3) Run the Generate Spatial Weights Matrix tool, selecting Fixed Distance, and using the distance you found in (2). Also, check ON Row Standardization (more about that below), and put 2 for the Number of Neighbors parameter (the 2 will force each feature, even the outliers, to have at least 2 neighbors, and you won't get the message about invalid results). With this parameter, the tool will look farther than the distance provided, only if required and only for the outliers, in order to ensure every feature gets at least 2 neighbors.

4) Run your hot spot analysis and/or your Cluster and Outlier Analysis using the .swm file created in (3). You do this by selecting "Get Weights From File" for the Conceptualization of Spatial Relationships parameter and then specifying the path to the .swm file for the Spatial Weights Matrix File parameter.
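The neighbor rule in step (3) can be sketched in a few lines of plain Python. This only illustrates the idea (a fixed distance band, extended just for features that would otherwise have fewer than the minimum number of neighbors); it is not how the Generate Spatial Weights Matrix tool is implemented, and the function names, band distance and coordinates are hypothetical.

```python
import math

def fixed_band_neighbors(points, band, min_neighbors=2):
    """Neighbors within `band`; look farther only when a feature has too few."""
    neighbors = {}
    for i, (xi, yi) in enumerate(points):
        # Every other feature, ranked from nearest to farthest.
        ranked = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        within = [j for dist, j in ranked if dist <= band]
        if len(within) < min_neighbors:
            # Outlier: fall back to its nearest features, ignoring the band.
            within = [j for _, j in ranked[:min_neighbors]]
        neighbors[i] = within
    return neighbors

def row_standardize(neighbors):
    """Give each feature's neighbors equal weights that sum to 1."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}

# Hypothetical samples: four close together plus one distant outlier.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 40)]
neighbors = fixed_band_neighbors(points, band=2.0, min_neighbors=2)
weights = row_standardize(neighbors)
print(neighbors)
```

Here the four clustered samples keep the neighbors found inside the band, while the outlier is assigned its two nearest samples even though they lie well beyond it.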

About your concerns with regard to the different sampling densities...

On the one hand, any bias in your sampling scheme will be reflected in your results... what that means:

• The Hot Spot Analysis (Gi* statistic) works by looking at each feature (each sample) within the context of neighboring features. It compares the local mean of the total positives to the global mean of the total positives and decides if the difference is statistically significant or not.

• If you were to ONLY sample in the high total positive areas (for whatever reason, as one example of bias), the global mean would be higher than if you had a truly random sample of the entire study area. Because the global mean is higher, you will see fewer hot spots than you might with a truly random sample.

On the other hand, if your samples are truly representative of the broader population (what you would see if you could sample all across Canada using a random sampling design), then the fact that you have more samples in some areas and fewer in others is not so much of a concern. The reason is that the Gi* statistic is conceptually looking at the local mean in relation to the global mean and so the number of features isn't so important. The number of features is still considered in determining the z-score, however, but only by the fact that more features means more information. So when a feature has very few neighbors, Gi* still does the very best it can, but it has less information to come up with a result. When a feature has LOTs of neighbors, the Gi* statistic has more information to compute a result. There is not an overcount or undercount bias... instead, there are just differences in how confident you can be about the results in places with few samples. Does that make sense? If not, please ask and I can try again :)
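A small numeric sketch may help with the "more features means more information" point. It uses the commonly published z-score form of Getis-Ord Gi* with binary weights, and it assumes the feature counts as its own neighbor; treat it as an illustration rather than the exact computation inside the tool. The values and neighbor lists are invented.

```python
import math

def gi_star(values, neighbor_ids, i):
    """Gi* z-score for feature i using binary weights (feature i included)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    members = set(neighbor_ids) | {i}
    w_sum = len(members)      # sum of the binary weights
    w_sq_sum = len(members)   # sum of the squared binary weights
    local_sum = sum(values[j] for j in members)
    numerator = local_sum - mean * w_sum
    denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Hypothetical Total_Positives values: three high samples, seven low ones.
values = [9, 9, 9, 1, 1, 1, 1, 1, 1, 1]

# Same local mean (9) in both cases, but a different number of neighbors.
z_few = gi_star(values, neighbor_ids=[1], i=0)
z_many = gi_star(values, neighbor_ids=[1, 2], i=0)
print(round(z_few, 3), round(z_many, 3))
```

Both neighborhoods have the same local mean, yet the one with more members gets the larger z-score: the extra sample does not bias the result, it only adds evidence.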

Regarding the extreme Z scores for Global Moran's I, 2 things:

1) When you run Global Moran's I with increasing distances, or when you run the Incremental Spatial Autocorrelation sample script tool, be sure to check ON for Row Standardization (this doesn't make a difference for Hot Spot Analysis, but it is important for the Global tools). If your points were very reflective of the spatial distribution of what you are sampling, you would not check ON for Row Standardization (example: when we have point data reflecting ALL Crimes, then we see lots of points where there are LOTs of crimes and few points where there are few crimes, and that difference in point densities is reflective of crime patterns in our study area). In your case, you have dense samples in some areas, and less dense samples in others and it probably has more to do with where you decided to sample rather than the underlying distribution of the total positives data (I hope that makes sense). To compensate for any bias in your sampling scheme (the idea that some places happened to get lots of samples and others happened to get very few samples), check ON for Row Standardization.

2) If the Total Positives data is skewed (my guess is that it may be; you can check this by creating a histogram of the Total Positives data values... if it deviates from a bell curve, your data is skewed), you want to make sure that on average each feature has 8-ish or more neighbors. When you use the Generate Spatial Weights Matrix tool to create a file that represents the Conceptualization of Spatial Relationships among your features (as recommended above), you automatically get a summary of the number of neighbors. [For ArcGIS 10.0: You can access this information from the Results window... or you will automatically see this information if you disable Background processing. If you need additional information about this, please don't hesitate to ask :)]. Because you put 2 for the Number of Neighbors parameter, the minimum number of neighbors a feature will have is going to be 2. You want the average to be 8 or more (but more than like 100 is starting to get silly). The Gi* statistic is asymptotically normal: as long as you ensure every feature has at least a few neighbors and none of the features have everyone as a neighbor, you can trust your z-score results. With skewed data and too few neighbors (zero neighbors) or too many neighbors (everyone is a neighbor), the skewness in the data analyzed spills over into the z-score results and you have less confidence in those values.
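Both points can be tried out on toy data. The sketch below computes the Global Moran's I index (the index only, not its z-score) with binary and with row-standardized weights, plus a simple moment-based skewness and the average neighbor count. All names and numbers are hypothetical, and this is plain Python for illustration, not what the ArcGIS tools run internally.

```python
def binary_weights(neighbors):
    """Every neighbor gets weight 1."""
    return {i: {j: 1.0 for j in js} for i, js in neighbors.items()}

def row_standardize(neighbors):
    """Each feature's neighbor weights are scaled to sum to 1."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}

def morans_i(values, weights):
    """Global Moran's I index for the given weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(
        w * dev[i] * dev[j]
        for i, row in weights.items()
        for j, w in row.items()
    )
    return (n / s0) * cross / sum(d * d for d in dev)

def skewness(values):
    """Moment-based skewness; near 0 for symmetric data (fails if all values are equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def average_neighbors(neighbors):
    return sum(len(js) for js in neighbors.values()) / len(neighbors)

# Hypothetical layout: features 0-3 are densely sampled, 4-5 are an isolated pair.
neighbors = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2], 4: [5], 5: [4]}
values = [5, 6, 5, 6, 1, 9]

i_binary = morans_i(values, binary_weights(neighbors))
i_row = morans_i(values, row_standardize(neighbors))

grain_counts = [0, 0, 0, 1, 1, 2, 3, 40]   # made-up, strongly right-skewed
skew = skewness(grain_counts)
avg_neighbors = average_neighbors(neighbors)
print(i_binary, i_row, skew, avg_neighbors)
```

With binary weights the densely sampled features carry most of the total weight simply because they have more neighbors; with row standardization every feature contributes the same total weight, so the index changes. The toy layout averages far fewer than 8 neighbors, so with data this skewed it is the kind of case where you would want a larger distance band.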

Okay, lots of information. I hope this is helpful!

Best wishes,

Lauren

Lauren M Scott, PhD

Esri

Geoprocessing, Spatial Statistics