
Problem replicating Grouping Analysis 'grouping effectiveness' statistic results

Question asked by lowerre on Nov 4, 2015
Latest reply on Nov 6, 2015 by lowerre

Dear All,

 

I have been using the Grouping Analysis tool in ArcGIS v10.2.1 to generate a range of clusterings from my data. I wrote a Python script to run Grouping Analysis a large number of times (with different clustering algorithms and spatial constraints, and with the number of clusters, k, ranging from 3 to 250) and report the 'grouping effectiveness' statistic (the Calinski-Harabasz pseudo-F statistic, or CH Index for short: see ArcGIS Help (10.2, 10.2.1, and 10.2.2)), so I can try to identify the 'best' number(s) of clusters for my data. This has all been working fine.
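For reference, the statistic itself is just a function of the total (SST) and within-group (SSE) sums of squared differences, the number of features n and the number of groups nc, as set out in the Help topic. A minimal sketch (the function name pseudo_f is my own label, not anything from the tool):

```python
def pseudo_f(sst, sse, n, nc):
    """Calinski-Harabasz pseudo-F ('grouping effectiveness') statistic.

    sst -- total sum of squared differences from the overall means
    sse -- within-group sum of squared differences from the group means
    n   -- number of features
    nc  -- number of groups
    """
    r2 = (sst - sse) / sst  # proportion of variance explained by the grouping
    return (r2 / (nc - 1)) / ((1 - r2) / (n - nc))
```

So, for example, pseudo_f(100.0, 20.0, 50, 5) gives 45.0.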

 

To compare these results with some from additional clustering methods, I have also run another application (REDCAP: Regionalization with Constrained Clustering and Partitioning | Spatial Data Mining and Visual Analytics Lab) on my data. Fantastic though the software is, REDCAP does not report the CH Index to help evaluate the 'best' number of clusters in the data. I would like to compare my results from ArcGIS's Grouping Analysis with those from REDCAP using the same 'grouping effectiveness' statistic, so I have written another Python script to calculate the CH Index 'by hand', as it were, and this is where I've run into a problem.

 

I wrote my script to replicate the approach set out in the Help topic 'How Grouping Analysis works' (ArcGIS Help (10.2, 10.2.1, and 10.2.2)) and in the Python script the tool calls (Partition.py). I tested my script on some of my output feature classes from Grouping Analysis, to check that it was producing the same results. Here (finally) is the problem: when the Grouping Analysis was based on a single variable, my script gives exactly the same CH Index as Esri's own tool; when the Grouping Analysis was based on more than one variable, my script gives a different (higher) CH Index. I am stumped as to why this is happening.

 

This is my code:

# Get and print the starting time for running the script
import datetime
start_time = datetime.datetime.now()
print("Script started at " + str(start_time))

import arcpy
import numpy as np

arcpy.env.workspace = r"E:\Pr6653_GIS_Data_Maps\SKATER_ClusterOutputs_1021FGDB.gdb"
# The input feature class
inFC = "Clstrs_DstNcl_All_DSHC_RC_K9"
# The fields used to generate the clusters using Grouping Analysis
inFields = ["DstNcl_All_SCL01INV", "DSInterp_SCL01INV", "HCInterp_SCL01INV"]
# The field holding the values identifying the clusters
clstField = "SS_GROUP"
# Create a new list from inFields and add the cluster field, to be used when
#     creating the structured Numpy array from the input FC table
arrFields = list(inFields)
arrFields.append(clstField)

# Convert the attribute table of the input FC to a NumPy array
baseArr = arcpy.da.TableToNumPyArray(inFC, arrFields)
n = len(baseArr) # This is the number of features in the input FC

# Get values based on the clustering field
clstArr = np.unique(baseArr[clstField])
nc = len(clstArr) # This is the number of groups
nc_min = clstArr.min() # This is the minimum value for cluster ID
nc_max = clstArr.max() # This is the maximum value for cluster ID

SST = 0.0 # This is the total sum of squared differences
for fld in inFields: # For each field/variable...
    varMean = baseArr[fld].mean() # Get the overall mean for the variable
    varSS = ((baseArr[fld] - varMean)**2.0).sum() # Calculate the sum of squared differences from the variable mean for each value
    SST += varSS # Add that sum to the total sum of squared differences

SSE = 0.0 # This is the sum of squared differences within each group
for group in clstArr: # For each group (iterate the actual IDs, in case they are not contiguous)...
    grpArr = baseArr[baseArr[clstField] == group] # Create a new array just for that group
    for fld in inFields: # For each field/variable...
        grpMean = grpArr[fld].mean() # Get the mean for the variable in that group
        grpSS = ((grpArr[fld] - grpMean)**2.0).sum() # Calculate the sum of squared differences from the group mean for each value in the group
        SSE += grpSS # Add that sum to the within-group sum of squared differences

# Calculate the pseudo-F statistic
R2 = (SST - SSE) / SST
fStat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))

# Print the results
print("For input " + inFC + ":")
print("Number of features = " + str(n))
print("Number of clusters = " + str(nc))
print("SST = " + str(SST))
print("SSE = " + str(SSE))
print("R2 = " + str(R2))
print("CH Index = " + str(fStat))
# Clean up
del baseArr, grpArr, clstArr
del n, nc, nc_max, nc_min
del SST, SSE, R2, fStat
del inFields, arrFields

print("Script complete. Elapsed time: " + str(datetime.datetime.now() - start_time))
# End of script
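To show the arithmetic in isolation, here is the same SST/SSE/pseudo-F calculation run on a tiny made-up dataset (two variables, two groups) in plain Python, with no arcpy dependency. The data values are invented purely for illustration:

```python
# Two variables observed on four features, assigned to two groups of two.
data = {
    "var1": [1.0, 2.0, 9.0, 10.0],
    "var2": [0.0, 1.0, 5.0, 6.0],
}
groups = [1, 1, 2, 2]

n = len(groups)               # number of features
ids = sorted(set(groups))     # the distinct group IDs
nc = len(ids)                 # number of groups

def mean(vals):
    return sum(vals) / len(vals)

# Total sum of squared differences from the overall mean, over all variables
SST = 0.0
for vals in data.values():
    m = mean(vals)
    SST += sum((v - m) ** 2 for v in vals)

# Within-group sum of squared differences from each group mean
SSE = 0.0
for g in ids:
    for vals in data.values():
        grp = [v for v, gid in zip(vals, groups) if gid == g]
        m = mean(grp)
        SSE += sum((v - m) ** 2 for v in grp)

R2 = (SST - SSE) / SST
fStat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))
print("SST =", SST, "SSE =", SSE, "CH Index =", fStat)
```

For this toy data, SST = 91.0, SSE = 2.0 and the CH Index works out to 89.0; since this is pure Python, it rules out arcpy as the source of any multivariate discrepancy.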

 

If it would help, I'd be happy to post some example data as well.

 

If anyone has any ideas, I'd be very grateful. Like I said, this code seems to work fine in a univariate situation, but not in a multivariate situation. I should also point out that I don't have a strong background in either statistics or programming, so I may just be missing something blindly obvious!

 

Many thanks in advance,

 

Andrew

 

Message was edited by: Andrew Lowerre Just cleaning up some copy-and-paste issues in the code I posted. Should now all be correct!
