<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results in Python Questions</title>
    <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530595#M41514</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;So basically you are saying that you now account for the use of samples in the &lt;A href="http://desktop.arcgis.com/en/desktop/latest/tools/spatial-statistics-toolbox/how-grouping-analysis-works.htm"&gt;CH statistics formula&lt;/A&gt;​ and it is now worse? or have you double corrected...just checking&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Wed, 04 Nov 2015 12:01:02 GMT</pubDate>
    <dc:creator>DanPatterson_Retired</dc:creator>
    <dc:date>2015-11-04T12:01:02Z</dc:date>
    <item>
      <title>Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530592#M41511</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Dear All,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have been using the Grouping Analysis tool to generate a range of clusters from my data. I'm using ArcGIS v10.2.1. I wrote a Python script to run Grouping Analysis a large number of times (with different clustering algorithms and spatial constraints and with the number of clusters (k) ranging from 3 to 250) and report the 'grouping effectiveness' statistic (the Calinski-Harabasz pseudo-F statistic: see &lt;A href="http://resources.arcgis.com/EN/HELP/MAIN/10.2/index.html#/How_Grouping_Analysis_works/005p0000004w000000/" title="http://resources.arcgis.com/EN/HELP/MAIN/10.2/index.html#/How_Grouping_Analysis_works/005p0000004w000000/" rel="nofollow noopener noreferrer" target="_blank"&gt;ArcGIS Help (10.2, 10.2.1, and 10.2.2)&lt;/A&gt;, or CH Index for short ) so I can try to identify the 'best' number(s) of clusters for my data. This has all been working fine.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To compare these results with some from additional clustering methods, I have also used another application (&lt;A href="http://www.spatialdatamining.org/software/redcap" title="http://www.spatialdatamining.org/software/redcap" rel="nofollow noopener noreferrer" target="_blank"&gt;REDCAP: Regionalization with Constrained Clustering and Partitioning | Spatial Data Mining and Visual Analytics Lab&lt;/A&gt; ) on my data as well. Fantastic though the software is, REDCAP does not report the CH Index to help evaluate the 'best' number of clusters in the data. I would like to be able to compare/evaluate my results from ArcGIS's Grouping Analysis with those from REDCAP using the same 'grouping effectiveness' statistic, so I have written another Python script to try to calculate the CH Index 'by hand', as it were, and this is where I've run into a problem.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I wrote my script to try to replicate the approach set out in the Help file for 'How Grouping Analysis works' (&lt;A href="http://resources.arcgis.com/EN/HELP/MAIN/10.2/index.html#/How_Grouping_Analysis_works/005p0000004w000000/" title="http://resources.arcgis.com/EN/HELP/MAIN/10.2/index.html#/How_Grouping_Analysis_works/005p0000004w000000/" rel="nofollow noopener noreferrer" target="_blank"&gt;ArcGIS Help (10.2, 10.2.1, and 10.2.2)&lt;/A&gt;) and in the Python script the tool calls (Partition.py). I tried out my script on some of my output feature classes from Grouping Analysis, to check that my script was producing the same results. Here (finally) is the problem: when the Grouping Analysis was based on a single variable, I get exactly the same CH Index result from my script as from Esri's own tool; when the Grouping Analysis was based on more than one variable, I get a different (higher) CH Index result from my script. I am stumped as to why this is happening.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is my code:&lt;/P&gt;&lt;PRE class="lia-code-sample line-numbers language-none"&gt;# Get and print the starting time for running the script
import datetime
start_time = datetime.datetime.now()
print("Script started at " +str(start_time))

import arcpy
import numpy as np

arcpy.env.workspace = r"E:\Pr6653_GIS_Data_Maps\SKATER_ClusterOutputs_1021FGDB.gdb"
# The input feature class
inFC = "Clstrs_DstNcl_All_DSHC_RC_K9"
# The fields used to generate the clusters using Grouping Analysis
inFields = ["DstNcl_All_SCL01INV", "DSInterp_SCL01INV", "HCInterp_SCL01INV"]
# The field holding the values identifying the clusters
clstField = "SS_GROUP"
# Create a new list from inFields and add the cluster field, to be used when
#&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; creating the structured Numpy array from the input FC table
arrFields = list(inFields)
arrFields.append(clstField)

# Convert the attribute table of the input FC to a NumPy array
baseArr = arcpy.da.TableToNumPyArray(inFC, arrFields)
n = len(baseArr) # This is the number of features in the input FC

# Get values based on the clustering field
clstArr = np.unique(baseArr[clstField])
nc = len(clstArr) # This is the number of groups
nc_min = clstArr.min() # This is the minimum value for cluster ID
nc_max = clstArr.max() # This is the maximum value for cluster ID

SST = 0.0 # This is the total sum of squared differences
for fld in inFields: # For each field/variable...
&amp;nbsp;&amp;nbsp;&amp;nbsp; varMean = baseArr[fld].mean() # Get the overall mean for the variable
&amp;nbsp;&amp;nbsp;&amp;nbsp; varSS = ((baseArr[fld] - varMean)**2.0).sum() # Calculate the sum of squared differences from the variable mean for each value
&amp;nbsp;&amp;nbsp;&amp;nbsp; SST += varSS # Add that sum to the total sum of squared differences

SSE = 0.0 # This is the sum of squared differences within each group
for group in range(nc_min, nc_max + 1): # For each group...
&amp;nbsp;&amp;nbsp;&amp;nbsp; grpArr = baseArr[baseArr[clstField] == group] # Create a new array just for that group
&amp;nbsp;&amp;nbsp;&amp;nbsp; for fld in inFields: # For each field/variable...
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; grpMean = grpArr[fld].mean() # Get the mean for the variable in that group
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; grpSS = ((grpArr[fld] - grpMean)**2.0).sum() # Calculate the sum of squared differences from the group mean for each value in the group
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; SSE += grpSS # Add that sum to the within-group sum of squared differences

# Calculate the pseudo-F statistic
R2 = (SST - SSE) / SST
fStat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))

# Print the results
print("For input " + inFC + ":")
print("Number of features = " + str(n))
print("Number of clusters = " + str(nc))
print("SST = " + str(SST))
print("SSE = " + str(SSE))
print("R2 = " + str(R2))
print("CH Index = " + str(fStat))
# Clean up
del baseArr, grpArr, clstArr
del n, nc, nc_max, nc_min
del SST, SSE, R2, fStat
del inFields, arrFields

print("Script complete. Elapsed time: " + str(datetime.datetime.now() - start_time))
# End of script&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If it would help, I'd be happy to post some example data as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If anyone has any ideas, I'd be very grateful. Like I said, this code seems to work fine in a univariate situation, but not in a multivariate situation. I should also point out that I don't have a strong background in either statistics or programming, so I may just be missing something blindly obvious!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Many thanks in advance,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Andrew&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Message was edited by: Andrew Lowerre

Just cleaning up some copy-and-paste issues in the code I posted. Should now all be correct!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 11 Dec 2021 23:06:03 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530592#M41511</guid>
      <dc:creator>AndrewLowerre</dc:creator>
      <dc:date>2021-12-11T23:06:03Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530593#M41512</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Two thoughts come to mind since you don't report the difference in statistics values&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;whenever statistics test from other sources report a value that contains the std deviation or variance, I always check to see whether they are using population or sample representations (ie division by N or N-1 in the reducing term)&lt;/LI&gt;&lt;LI&gt;whenever division is employed, I always ensure that at least one term in the calculation is floating point to ensure floating point division rather than integer division (being a function of python version etc)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;So if you can report your difference values in the context above, it may help to further delineate the issue&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 11:18:35 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530593#M41512</guid>
      <dc:creator>DanPatterson_Retired</dc:creator>
      <dc:date>2015-11-04T11:18:35Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530594#M41513</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Dan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Many thanks for the quick reply.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;OK, for the dataset referenced in my code above, the CH Index as reported by Esri Grouping Analysis is 1911.75679228486. The CH Index value reported by my script is 2448.47996865.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For a different dataset (same input fields as in my code, but with 31 clusters), the results are as follows:&lt;/P&gt;&lt;P&gt;CH Index from Esri tool: 1069.1115254454&lt;/P&gt;&lt;P&gt;CH Index from my script: 1114.27024452&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I just tried addressing the possible integer division problem you mention, so I changed line 23 to&lt;/P&gt;&lt;PRE __default_attr="python" __jive_macro_name="code" class="jive_macro_code _jivemacro_uid_14466371628562622 jive_text_macro" data-renderedposition="260_8_912_16" jivemacro_uid="_14466371628562622"&gt;&lt;P&gt;n = len(baseArr) * 1.0&lt;/P&gt;&lt;/PRE&gt;&lt;P&gt;and line 27 to&lt;/P&gt;&lt;PRE __default_attr="python" __jive_macro_name="code" class="jive_macro_code jive_text_macro _jivemacro_uid_14466371998202103" data-renderedposition="297_8_912_16" jivemacro_uid="_14466371998202103" modifiedtitle="true"&gt;&lt;P&gt;nc = len(clstArr) * 1.0&lt;/P&gt;&lt;/PRE&gt;&lt;P&gt;and it made no difference in my results&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I also tried changing the reducing term to N - 1, so changing line 47 in my code to&lt;/P&gt;&lt;PRE __default_attr="python" __jive_macro_name="code" class="jive_macro_code jive_text_macro _jivemacro_uid_14466373672117270" data-renderedposition="376_8_912_16" jivemacro_uid="_14466373672117270"&gt;&lt;P&gt;fStat = (R2 / (nc - 1.0)) / ((1.0 - R2) / ((n - 1.0) - nc)) )&lt;/P&gt;&lt;/PRE&gt;&lt;P&gt;This also has a negligible effect. The CH Index for the dataset referenced in the original code I posted is now 2448.40544257.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So, still stumped.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 11:49:30 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530594#M41513</guid>
      <dc:creator>AndrewLowerre</dc:creator>
      <dc:date>2015-11-04T11:49:30Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530595#M41514</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;So basically you are saying that you now account for the use of samples in the &lt;A href="http://desktop.arcgis.com/en/desktop/latest/tools/spatial-statistics-toolbox/how-grouping-analysis-works.htm"&gt;CH statistics formula&lt;/A&gt;​ and it is now worse? or have you double corrected...just checking&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 12:01:02 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530595#M41514</guid>
      <dc:creator>DanPatterson_Retired</dc:creator>
      <dc:date>2015-11-04T12:01:02Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530596#M41515</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Assuming I understand you correctly, I've accounted for the use of samples and the result is now very, very slightly better, but only &lt;STRONG&gt;very &lt;/STRONG&gt;slightly:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;CH Index from Esri tool: 1911.75679228486&lt;/P&gt;&lt;P&gt;CH Index from my script (not accounting for using samples): 2448.47996865&lt;/P&gt;&lt;P&gt;CH Index from my script (accounting for using samples): 2448.40544257&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Not much difference in my results, and both are a long way off from what Esri's tool reports.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 12:10:22 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530596#M41515</guid>
      <dc:creator>AndrewLowerre</dc:creator>
      <dc:date>2015-11-04T12:10:22Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530597#M41516</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I guess I am missing whether R2 is similar or way off since fstat and CH depend upon it.&amp;nbsp; Assuming options are consistent between the two and you are working with projected data since this will have an impact on the grouping methods that employ distance or enclosure. And I also assume that your SSE and SST calculation methods are correct.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 12:20:16 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530597#M41516</guid>
      <dc:creator>DanPatterson_Retired</dc:creator>
      <dc:date>2015-11-04T12:20:16Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530598#M41517</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi Dan,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I suspect the issue probably is that the way I've calculated SSE and SST are not correct (or at least not correct for a multivariate situation), but I can't figure out how or why.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The issue won't be with projected vs unprojected input data. To be clear, the dataset I'm using as an input to my script is one that was output from the Grouping Analysis tool. And I was definitely using projected data as input to the Grouping Analysis tool.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When I use a different dataset as an input to my script (again, a dataset that was output from the Grouping Analysis tool), where I created the clusters using only one input field, the results are as follows:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;CH Index from Esri tool: 6349.09734505356&lt;/P&gt;&lt;P&gt;CH Index from my script (not accounting for using samples): 6349.09734505&lt;/P&gt;&lt;P&gt;CH Index from my script (accounting for using samples): 6348.90412258&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So when I have a univariate input, my script (not accounting for using samples) replicates exactly the results from Esri's tool (well, to 8 decimal places, which is good enough for me). And when I have a univariate input, accounting for using samples gives a result that is different from Esri's tool.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I'm trying to do is check that I can replicate the CH Index value that the Grouping Analysis tool produces using my own script, so that I can be sure that I've got the calculations correct. As far as I can tell, my calculations &lt;STRONG&gt;are&lt;/STRONG&gt; correct when I'm calculating SSE and SST for a single input variable, but they are&lt;STRONG&gt; not&lt;/STRONG&gt; correct when I calculate SSE and SST for more than one input variable. That's where I'm stumped.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 04 Nov 2015 12:56:35 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530598#M41517</guid>
      <dc:creator>AndrewLowerre</dc:creator>
      <dc:date>2015-11-04T12:56:35Z</dc:date>
    </item>
    <item>
      <title>Re: Problem replicating Grouping Analysis 'grouping effectiveness' statistic results</title>
      <link>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530599#M41518</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;OK, I figured out the problem: I needed to standardise the values in each of the input fields to z-scores (subtract the mean and divide by the standard deviation).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My new code is as follows:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE class="lia-code-sample line-numbers language-none"&gt;# Get and print the starting time for running the script
import datetime
start_time = datetime.datetime.now()
print("Script started at " +str(start_time))

import arcpy
import numpy as np

def zScore(inTable, inFld, outFld):
&amp;nbsp;&amp;nbsp;&amp;nbsp; """Calculate z-scores for values in one field and write them to another field"""
&amp;nbsp;&amp;nbsp;&amp;nbsp; fldArr = arcpy.da.TableToNumPyArray(inTable, inFld)
&amp;nbsp;&amp;nbsp;&amp;nbsp; fldMean = fldArr[inFld].mean()
&amp;nbsp;&amp;nbsp;&amp;nbsp; fldStdDev = fldArr[inFld].std()

&amp;nbsp;&amp;nbsp;&amp;nbsp; with arcpy.da.UpdateCursor(inTable, [inFld, outFld]) as cur:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for x, y in cur:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; y = (float(x) - fldMean)/fldStdDev
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; cur.updateRow([x,y])
&amp;nbsp;&amp;nbsp;&amp;nbsp; del fldArr
&amp;nbsp;&amp;nbsp;&amp;nbsp; del fldMean, fldStdDev

arcpy.env.workspace = r"E:\Pr6653_GIS_Data_Maps\SKATER_ClusterOutputs_1021FGDB.gdb"

inFC = "Clstrs_DstNcl_All_DSHC_RC_K9"
# inFC = "Clstrs_DstABCDE_DSHC_QC_K36"
inFields = ["DstNcl_All_SCL01INV", "DSInterp_SCL01INV", "HCInterp_SCL01INV"]
# inFields = ["DstNcl_A_SCL01INV", "DstNcl_B_SCL01INV", "DstNcl_C_SCL01INV", "DstNcl_D_SCL01INV", "DstNcl_E_SCL01INV", "DSInterp_SCL01INV", "HCInterp_SCL01INV"]

arrFields = []

# Copy input dataset into memory
print("Copying source features into memory")
arcpy.CopyFeatures_management(inFC, "in_memory\\inMemSrcCopy")

# Add and populate fields for z-Scores
print("Adding and populating z-score fields for in-memory FC")
for fld in inFields:
&amp;nbsp;&amp;nbsp;&amp;nbsp; arcpy.AddField_management("in_memory\\inMemSrcCopy", fld + "_Z", "DOUBLE")
&amp;nbsp;&amp;nbsp;&amp;nbsp; zScore("in_memory\\inMemSrcCopy", fld, fld + "_Z")
&amp;nbsp;&amp;nbsp;&amp;nbsp; arrFields.append(fld + "_Z")

clstField = "SS_GROUP"
arrFields.append(clstField)

# Convert the attribute table of the in-memory input FC to a Numpy array
baseArr = arcpy.da.TableToNumPyArray("in_memory\\inMemSrcCopy", arrFields)
n = len(baseArr) # This is the number of features in the input FC

# Get values based on the clustering field
clstArr = np.unique(baseArr[clstField])
nc = len(clstArr) # This is the number of groups
nc_min = clstArr.min() # This is the minimum value for cluster ID
nc_max = clstArr.max() # This is the maximum value for cluster ID


SST = 0.0 # This is the total sum of squared differences
print("Calculating the total sum of squared differences")

for group in range(nc_min, nc_max + 1): # For each group...
&amp;nbsp;&amp;nbsp;&amp;nbsp; grpArr = baseArr[baseArr[clstField] == group] # Create a new array just for that group
&amp;nbsp;&amp;nbsp;&amp;nbsp; for fld in inFields: # For each field/variable...
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fld = fld + "_Z"
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; varMean = baseArr[fld].mean() # Get the overall mean for the variable
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; varSS = ((grpArr[fld] - varMean)**2.0).sum() # Calculate the sum of squared differences from the variable mean for each value in the group
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; SST += varSS # Add that sum to the total sum of squared differences

SSE = 0.0 # This is the sum of squared differences within each group
print("Calculating the sum of squared differences within each group")
##lstSSE = []
for group in range(nc_min, nc_max + 1): # For each group...
# for group in range(nc): # For each group...
&amp;nbsp;&amp;nbsp;&amp;nbsp; grpArr = baseArr[baseArr[clstField] == group] # Create a new array just for that group
&amp;nbsp;&amp;nbsp;&amp;nbsp; for fld in inFields: # For each field/variable...
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; fld = fld + "_Z"
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; grpMean = (grpArr[fld]).mean() # Get the mean for the variable in that group
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; grpSS = ((grpArr[fld] - grpMean)**2.0).sum() # Calculate the sum of squared differences from the group mean for each value in the group
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; SSE += grpSS # Add that sum to the within-group sum of squared differences

R2 = (SST - SSE) / SST
fStat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))

print("For input " + inFC + ":")
print("\tNumber of features = " + str(n))
print("\tNumber of clusters = " + str(nc))
print("\tSST = " + str(SST))
print("\tSSE = " + str(SSE))
print("\tR2 = " + str(R2))
print("\tCH Index = " + str(fStat))

del baseArr, grpArr, clstArr
del n, nc, nc_max, nc_min
del SST, SSE, R2, fStat
del inFields, arrFields
arcpy.Delete_management("in_memory")

print("Script complete. Elapsed time: " + str(datetime.datetime.now() - start_time))
# End of script&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This revised code produces results identical to those reported by Esri's tool.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Problem solved.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 11 Dec 2021 23:06:05 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/problem-replicating-grouping-analysis-grouping/m-p/530599#M41518</guid>
      <dc:creator>AndrewLowerre</dc:creator>
      <dc:date>2021-12-11T23:06:05Z</dc:date>
    </item>
  </channel>
</rss>

