GroupBy 2 columns and keep all fields

48292
18
Jump to solution
08-03-2017 02:04 PM
JamesCrandall
MVP Frequent Contributor

Edit: here's the current code I've got so far:

itsct_nparr = arcpy.da.FeatureClassToNumPyArray(prItsct, ['FID_preproc','NAME','Shape_Area'])

#create a pandas DataFrame objects from the NumPy arrays
itsct_df = DataFrame(itsct_nparr, columns=['FID_preproc','NAME','Shape_Area'])
maxarea = itsct_df.groupby(['FID_preproc','NAME'], as_index=False).max()
maxarea.to_csv(csvout)
del itsct_nparr

I need to come up with a solution that allows me to summarize an input table, performing a GroupBy on 2 columns ("FID_preproc" and "Shape_Area") and keep all of the fields in the original table in the output/result.  From there I'll be joining that result with another table and calculating a field using the NAME column.

From the example, I need to maintain the FID_preproc, and NAME fields (see the desired result).  GroupBy on FID_preproc and MAX(Shape_Area).

Any method is acceptable (numpy, pandas, summarize table, da.SearchCursor, etc.) -- whatever you have I can use!

Input:
FID_preprocNAMEShape_Area
1340A25952.35775
1341A118099.5219
1341B305220.1244
1342A12053.13585
Desired Result:
FID_preprocNAMEShape_Area
1340A25952.35775
1341B305220.1244
1342A12053.13585
0 Kudos
18 Replies
JamesCrandall
MVP Frequent Contributor

I was able to get your suggestion to run.  However, it only outputs 2 columns and not what the desired result in my OP.

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

Can you post the code back that you are using?  When I tested the code on my end, I was able to output all columns in the original table, including the original ObjectID.  I am not sure why you would only get 2 columns.

0 Kudos
JamesCrandall
MVP Frequent Contributor

Ah -- that was my goof.  Yes, you are correct it generates the desired output.

Double award points for questioning my obvious flawed assessment!

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

As powerful as pandas is, I find getting data into and out of it cumbersome at times and unnecessary at other times.  If the data is already stored where SQL can be used, like a geodatabase, I like pairing ArcPy Data Access cursors, itertools, and SQL to keep the data in place and do my analysis.  My code from Deleting records from tabular intersect table based on 'PERCENTAGE' field can be easily modified to work here:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> 
>>> fc = # path to feature class or table
>>> case_fields = ["FID_preproc"]
>>> max_field = "Shape_Area"
>>> sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)
>>> with arcpy.da.UpdateCursor(fc, "*", sql_clause=(None, sql_orderby)) as cur:
...     case_func = itemgetter(
...         *(cur.fields.index(fld) for fld in case_fields)
...     )
...     for key, group in groupby(cur, case_func):
...         next(group)
...         for extra in group:
...             cur.deleteRow()
...             
>>> del cur
>>> 

NOTE: the above code modifies the table in-place so make sure to test it on a copy.

0 Kudos
JamesCrandall
MVP Frequent Contributor

I will be processing this feature class in the in_memory workspace -- not sure if that sql will work there.  Your pandas solution above will work great.

This solution looks good too!  But not sure if that ORDER BY clause will work outside of RDMBS sources.

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

I did a quick check using an in-memory table, and the code works the same as having the table in a file geodatabase.  As I mentioned before, I am running 10.5.1, which shouldn't make a difference in terms of SQL support with in-memory workspaces, but one never knows.  I know in-memory workspaces don't support as much SQL as file geodatabases, but ORDER BY appears to work fine.

JamesCrandall
MVP Frequent Contributor

Excellent!  I'll give it a go as well.

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

If you want to create a table view of the table and select records, the following works without modifying the original table:

from itertools import groupby
from operator import itemgetter

tbl = # path to table
case_fields = ["FID_preproc"]
max_field = "Shape_Area"
sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)

vw = arcpy.MakeTableView_management(tbl, "tbl_vw").getOutput(0)
oidField = arcpy.Describe(vw).OIDFieldName
oids = []

with arcpy.da.SearchCursor(vw, "*", sql_clause=(None, sql_orderby)) as cur:
    oidField = cur.fields.index(oidField)
    case_func = itemgetter(
        *(cur.fields.index(fld) for fld in case_fields)
    )
    for key, group in groupby(cur, case_func):
        oids.append(next(group)[oidField])
        continue

vw.setSelectionSet("NEW_SELECTION", oids)
JamesCrandall
MVP Frequent Contributor

I'll be using the grouped table to join back in with the feature class and calculating a field.

0 Kudos