GroupBy 2 columns and keep all fields

48242
18
Jump to solution
08-03-2017 02:04 PM
JamesCrandall
MVP Frequent Contributor

Edit: here's the current code I've got so far:

itsct_nparr = arcpy.da.FeatureClassToNumPyArray(prItsct, ['FID_preproc','NAME','Shape_Area'])

#create a pandas DataFrame objects from the NumPy arrays
itsct_df = DataFrame(itsct_nparr, columns=['FID_preproc','NAME','Shape_Area'])
maxarea = itsct_df.groupby(['FID_preproc','NAME'], as_index=False).max()
maxarea.to_csv(csvout)
del itsct_nparr

I need to come up with a solution that allows me to summarize an input table, performing a GroupBy on 2 columns ("FID_preproc" and "Shape_Area") and keep all of the fields in the original table in the output/result.  From there I'll be joining that result with another table and calculating a field using the NAME column.

From the example, I need to maintain the FID_preproc, and NAME fields (see the desired result).  GroupBy on FID_preproc and MAX(Shape_Area).

Any method is acceptable (numpy, pandas, summarize table, da.SearchCursor, etc.) -- whatever you have I can use!

Input:
FID_preprocNAMEShape_Area
1340A25952.35775
1341A118099.5219
1341B305220.1244
1342A12053.13585
Desired Result:
FID_preprocNAMEShape_Area
1340A25952.35775
1341B305220.1244
1342A12053.13585
0 Kudos
1 Solution

Accepted Solutions
JoshuaBixby
MVP Esteemed Contributor

In terms of semantics, I think most people working with data think of "group by" from a SQL perspective, even if they aren't working with SQL directly.  From a SQL perspective, this case isn't grouping by 2 columns but grouping by 1 column and selecting based on an aggregate function of another column, e.g., SELECT FID_preproc, MAX(Shape_Area) FROM table GROUP BY FID_preproc.  I mention this because pandas also views this as grouping by 1 column like SQL.

I think the following pandas code will work for you:

import pandas

tbl = # path to table
tbl_out = # path to output table

narr = arcpy.da.TableToNumPyArray(tbl,"*")
df = pandas.DataFrame.from_records(narr)
idxs = df.groupby(by="FID_preproc", as_index=True)["Shape_Area"].idxmax()
recs = df.loc[idxs].to_records(index=False).astype(narr.dtype)
arcpy.da.NumPyArrayToTable(recs, tbl_out)

The code above is similar to using an SQL query to select the records and then joining that result set back to the original data set to get all the columns for the selection. 

View solution in original post

18 Replies
RebeccaStrauch__GISP
MVP Emeritus

You may want to look at some of the suggestions by Joshua Bixby‌  in my post How to convert a python "collection" to a gdb table? (1-to-many) 

He pointed me to the rollup tool using pandas.  You may need to run it twice and then join the outputs, but that was a clean solution (if you have 10.4+ or pandas).

oh...and I would use either summary or frequency to sum the areas.

JamesCrandall
MVP Frequent Contributor

I appreciate it Rebecca.  I'm still not sure that will work for this implementation (or I'm just missing the obvious).  At this point I'd be happy if someone could show me how to accomplish this manually with default gp tools in ArcGIS Desktop!

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

In terms of semantics, I think most people working with data think of "group by" from a SQL perspective, even if they aren't working with SQL directly.  From a SQL perspective, this case isn't grouping by 2 columns but grouping by 1 column and selecting based on an aggregate function of another column, e.g., SELECT FID_preproc, MAX(Shape_Area) FROM table GROUP BY FID_preproc.  I mention this because pandas also views this as grouping by 1 column like SQL.

I think the following pandas code will work for you:

import pandas

tbl = # path to table
tbl_out = # path to output table

narr = arcpy.da.TableToNumPyArray(tbl,"*")
df = pandas.DataFrame.from_records(narr)
idxs = df.groupby(by="FID_preproc", as_index=True)["Shape_Area"].idxmax()
recs = df.loc[idxs].to_records(index=False).astype(narr.dtype)
arcpy.da.NumPyArrayToTable(recs, tbl_out)

The code above is similar to using an SQL query to select the records and then joining that result set back to the original data set to get all the columns for the selection. 

JamesCrandall
MVP Frequent Contributor

Thank you Joshua!  However, I'm getting:

AttributeError: 'DataFrame' object has no attribute 'loc'

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

I am about to post an itertools.groupby solution that I like a bit more, but I did test the pandas code before posting it.  What version of ArcGIS are you running and what version of pandas is it?

0 Kudos
JamesCrandall
MVP Frequent Contributor

ArcGIS 10.3

Not sure what version of Pandas it is.

Also: I updated main post with current code.  I thought it was important to show how I'm getting the numpy array and pandas dataframe setup.

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

Seeing most of the SciPy stack, including pandas, wasn't bundled with ArcGIS until 10.4; I can't infer what version of pandas you installed.  What does the following code say the pandas version is on your machine:

>>> import pandas
>>> pandas.__version__
u'0.18.1'
>>> 

I am running ArcGIS 10.5.1 that comes bundled with pandas 0.18.1, but dataframe.loc has been around for several versions of pandas.

JamesCrandall
MVP Frequent Contributor

0.10.0

To exacerbate the issue, user's are on a mix of Citrix and local installs of ArcGIS desktop! 

I've got a 10.4 install that has pandas 0.16.1 that will run the .loc and I'll see if your example works.

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

Pandas 0.10.0 is over 4 years old, a lifetime for the SciPy stack.  Pandas 0.15.1 was released a month before ArcGIS 10.3, which raises the question of why your Citrix environments are running such an old version of pandas.