GroupBy 2 columns and keep all fields

JamesCrandall · ‎08-03-2017

Edit: here's the current code I've got so far:

itsct_nparr = arcpy.da.FeatureClassToNumPyArray(prItsct, ['FID_preproc','NAME','Shape_Area'])

#create a pandas DataFrame objects from the NumPy arrays
itsct_df = DataFrame(itsct_nparr, columns=['FID_preproc','NAME','Shape_Area'])
maxarea = itsct_df.groupby(['FID_preproc','NAME'], as_index=False).max()
maxarea.to_csv(csvout)
del itsct_nparr‍‍‍‍‍‍‍

I need to come up with a solution that allows me to summarize an input table, performing a GroupBy on 2 columns ("FID_preproc" and "Shape_Area") and keep all of the fields in the original table in the output/result. From there I'll be joining that result with another table and calculating a field using the NAME column.

From the example, I need to maintain the FID_preproc, and NAME fields (see the desired result). GroupBy on FID_preproc and MAX(Shape_Area).

Any method is acceptable (numpy, pandas, summarize table, da.SearchCursor, etc.) -- whatever you have I can use!

Input:
FID_preproc	NAME	Shape_Area
1340	A	25952.35775
1341	A	118099.5219
1341	B	305220.1244
1342	A	12053.13585

Desired Result:
FID_preproc	NAME	Shape_Area
1340	A	25952.35775
1341	B	305220.1244
1342	A	12053.13585

JamesCrandall · ‎08-04-2017

I was able to get your suggestion to run. However, it only outputs 2 columns and not what the desired result in my OP.

JoshuaBixby · ‎08-04-2017

Can you post the code back that you are using? When I tested the code on my end, I was able to output all columns in the original table, including the original ObjectID. I am not sure why you would only get 2 columns.

JamesCrandall · ‎08-04-2017

Ah -- that was my goof. Yes, you are correct it generates the desired output.

Double award points for questioning my obvious flawed assessment!

JoshuaBixby · ‎08-04-2017

As powerful as pandas is, I find getting data into and out of it cumbersome at times and unnecessary at other times. If the data is already stored where SQL can be used, like a geodatabase, I like pairing ArcPy Data Access cursors, itertools, and SQL to keep the data in place and do my analysis. My code from Deleting records from tabular intersect table based on 'PERCENTAGE' field can be easily modified to work here:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> 
>>> fc = # path to feature class or table
>>> case_fields = ["FID_preproc"]
>>> max_field = "Shape_Area"
>>> sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)
>>> with arcpy.da.UpdateCursor(fc, "*", sql_clause=(None, sql_orderby)) as cur:
...     case_func = itemgetter(
...         *(cur.fields.index(fld) for fld in case_fields)
...     )
...     for key, group in groupby(cur, case_func):
...         next(group)
...         for extra in group:
...             cur.deleteRow()
...             
>>> del cur
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

NOTE: the above code modifies the table in-place so make sure to test it on a copy.

JamesCrandall · ‎08-04-2017

I will be processing this feature class in the in_memory workspace -- not sure if that sql will work there. Your pandas solution above will work great.

This solution looks good too! But not sure if that ORDER BY clause will work outside of RDMBS sources.

JoshuaBixby · ‎08-04-2017

I did a quick check using an in-memory table, and the code works the same as having the table in a file geodatabase. As I mentioned before, I am running 10.5.1, which shouldn't make a difference in terms of SQL support with in-memory workspaces, but one never knows. I know in-memory workspaces don't support as much SQL as file geodatabases, but ORDER BY appears to work fine.

JamesCrandall · ‎08-04-2017

Excellent! I'll give it a go as well.

JoshuaBixby · ‎08-04-2017

If you want to create a table view of the table and select records, the following works without modifying the original table:

from itertools import groupby
from operator import itemgetter

tbl = # path to table
case_fields = ["FID_preproc"]
max_field = "Shape_Area"
sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)

vw = arcpy.MakeTableView_management(tbl, "tbl_vw").getOutput(0)
oidField = arcpy.Describe(vw).OIDFieldName
oids = []

with arcpy.da.SearchCursor(vw, "*", sql_clause=(None, sql_orderby)) as cur:
    oidField = cur.fields.index(oidField)
    case_func = itemgetter(
        *(cur.fields.index(fld) for fld in case_fields)
    )
    for key, group in groupby(cur, case_func):
        oids.append(next(group)[oidField])
        continue

vw.setSelectionSet("NEW_SELECTION", oids)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

JamesCrandall · ‎08-04-2017

I'll be using the grouped table to join back in with the feature class and calculating a field.