GroupBy 2 columns and keep all fields

JamesCrandall · ‎08-03-2017

Edit: here's the current code I've got so far:

itsct_nparr = arcpy.da.FeatureClassToNumPyArray(prItsct, ['FID_preproc','NAME','Shape_Area'])

#create a pandas DataFrame objects from the NumPy arrays
itsct_df = DataFrame(itsct_nparr, columns=['FID_preproc','NAME','Shape_Area'])
maxarea = itsct_df.groupby(['FID_preproc','NAME'], as_index=False).max()
maxarea.to_csv(csvout)
del itsct_nparr‍‍‍‍‍‍‍

I need to come up with a solution that allows me to summarize an input table, performing a GroupBy on 2 columns ("FID_preproc" and "Shape_Area") and keep all of the fields in the original table in the output/result. From there I'll be joining that result with another table and calculating a field using the NAME column.

From the example, I need to maintain the FID_preproc, and NAME fields (see the desired result). GroupBy on FID_preproc and MAX(Shape_Area).

Any method is acceptable (numpy, pandas, summarize table, da.SearchCursor, etc.) -- whatever you have I can use!

Input:
FID_preproc	NAME	Shape_Area
1340	A	25952.35775
1341	A	118099.5219
1341	B	305220.1244
1342	A	12053.13585

Desired Result:
FID_preproc	NAME	Shape_Area
1340	A	25952.35775
1341	B	305220.1244
1342	A	12053.13585

JoshuaBixby · ‎08-04-2017

In terms of semantics, I think most people working with data think of "group by" from a SQL perspective, even if they aren't working with SQL directly. From a SQL perspective, this case isn't grouping by 2 columns but grouping by 1 column and selecting based on an aggregate function of another column, e.g., SELECT FID_preproc, MAX(Shape_Area) FROM table GROUP BY FID_preproc. I mention this because pandas also views this as grouping by 1 column like SQL.

I think the following pandas code will work for you:

import pandas

tbl = # path to table
tbl_out = # path to output table

narr = arcpy.da.TableToNumPyArray(tbl,"*")
df = pandas.DataFrame.from_records(narr)
idxs = df.groupby(by="FID_preproc", as_index=True)["Shape_Area"].idxmax()
recs = df.loc[idxs].to_records(index=False).astype(narr.dtype)
arcpy.da.NumPyArrayToTable(recs, tbl_out)
‍‍‍‍‍‍‍‍‍‍

The code above is similar to using an SQL query to select the records and then joining that result set back to the original data set to get all the columns for the selection.

View solution in original post

RebeccaStrauch__GISP · ‎08-03-2017

You may want to look at some of the suggestions by Joshua Bixby‌ in my post How to convert a python "collection" to a gdb table? (1-to-many)

He pointed me to the rollup tool using pandas. You may need to run it twice and then join the outputs, but that was a clean solution (if you have 10.4+ or pandas).

oh...and I would use either summary or frequency to sum the areas.

JamesCrandall · ‎08-04-2017

I appreciate it Rebecca. I'm still not sure that will work for this implementation (or I'm just missing the obvious). At this point I'd be happy if someone could show me how to accomplish this manually with default gp tools in ArcGIS Desktop!

JoshuaBixby · ‎08-04-2017

In terms of semantics, I think most people working with data think of "group by" from a SQL perspective, even if they aren't working with SQL directly. From a SQL perspective, this case isn't grouping by 2 columns but grouping by 1 column and selecting based on an aggregate function of another column, e.g., SELECT FID_preproc, MAX(Shape_Area) FROM table GROUP BY FID_preproc. I mention this because pandas also views this as grouping by 1 column like SQL.

I think the following pandas code will work for you:

import pandas

tbl = # path to table
tbl_out = # path to output table

narr = arcpy.da.TableToNumPyArray(tbl,"*")
df = pandas.DataFrame.from_records(narr)
idxs = df.groupby(by="FID_preproc", as_index=True)["Shape_Area"].idxmax()
recs = df.loc[idxs].to_records(index=False).astype(narr.dtype)
arcpy.da.NumPyArrayToTable(recs, tbl_out)
‍‍‍‍‍‍‍‍‍‍

The code above is similar to using an SQL query to select the records and then joining that result set back to the original data set to get all the columns for the selection.

JamesCrandall · ‎08-04-2017

Thank you Joshua! However, I'm getting:

AttributeError: 'DataFrame' object has no attribute 'loc'

JoshuaBixby · ‎08-04-2017

I am about to post an itertools.groupby solution that I like a bit more, but I did test the pandas code before posting it. What version of ArcGIS are you running and what version of pandas is it?

JamesCrandall · ‎08-04-2017

ArcGIS 10.3

Not sure what version of Pandas it is.

Also: I updated main post with current code. I thought it was important to show how I'm getting the numpy array and pandas dataframe setup.

JoshuaBixby · ‎08-04-2017

Seeing most of the SciPy stack, including pandas, wasn't bundled with ArcGIS until 10.4; I can't infer what version of pandas you installed. What does the following code say the pandas version is on your machine:

>>> import pandas
>>> pandas.__version__
u'0.18.1'
>>> ‍‍‍‍

I am running ArcGIS 10.5.1 that comes bundled with pandas 0.18.1, but dataframe.loc has been around for several versions of pandas.

JamesCrandall · ‎08-04-2017

0.10.0

To exacerbate the issue, user's are on a mix of Citrix and local installs of ArcGIS desktop!

I've got a 10.4 install that has pandas 0.16.1 that will run the .loc and I'll see if your example works.

JoshuaBixby · ‎08-04-2017

Pandas 0.10.0 is over 4 years old, a lifetime for the SciPy stack. Pandas 0.15.1 was released a month before ArcGIS 10.3, which raises the question of why your Citrix environments are running such an old version of pandas.