Deleting records from tabular intersect table based on 'PERCENTAGE' field

MollyMoore · ‎05-02-2017

I have an output table from a tabular intersect that looks like this:

For any set of records that has the same 'Planning Polygon ID', 'ELCODE', and 'SeasonCode', I would like to delete the record with the lower 'PERCENTAGE', while retaining the 'OccProb' attribute. For example, in the above table, records 12 and 13 have matching 'Planning Polygon ID', 'ELCODE', and 'SeasonCode' fields, but their 'OccProb' is different. I would like to retain the 'OccProb' of the record that has the highest 'PERCENTAGE'. In this case, it would be Medium.

Summary Statistics gets me close by preserving the record with the max 'PERCENTAGE' value, but it does not retain the 'OccProb' field. Is there another way to do this? I will eventually be converting this into Python code, so Arcpy solutions are welcome.

JoshuaBixby · ‎05-02-2017

Although my preference would be to use SQL, like I demonstrated in my previous comment, the lack of correlated subquery support in file geodatabases complicates trying to do so. Fortunately, it can be done in Python without too much extra work.

The itertools.groupby() function was made exactly for situations like this one. The following code works by modifying an existing table to remove the extra rows, so test it out on a copy first.

from itertools import groupby
from operator import itemgetter

fc = # path to feature class
case_fields = ["Planning Polygon ID", "ELCODE", "SeasonCode"]
max_field = "PERCENTAGE"

sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)

with arcpy.da.UpdateCursor(fc, "*", sql_clause=(None, sql_orderby)) as cur:
    case_func = itemgetter(
        *(cur.fields.index(fld) for fld in case_fields)
    )
    for key, group in groupby(cur, case_func):
        next(group)
        for extra in group:
            cur.deleteRow()
            ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The format of the code above is flexible enough that it can accommodate a single case field or multiple case fields. Without too much extra work, this code could be expanded to create a handy function where the user passes a few arguments and the table or feature class is collapse/reduced.

View solution in original post

JakeSkinner · ‎05-02-2017

Hi Molly,

In this scenario can the OccProb be Low but still have a higher PERCENTAGE than a duplicate value that has an OccProb value of Medium?

MollyMoore · ‎05-02-2017

Yes, the OccProb can be Low and still have a higher PERCENTAGE than a duplicate value that has an OccProb value of Medium. The PERCENTAGE field was populated based on the proportion of overlap as result of the tabulate intersection tool.

JakeSkinner · ‎05-02-2017

Hi Molly,

Try the following:

import arcpy
from arcpy import env
env.workspace = r"D:\temp\python\test.gdb"
env.overwriteOutput = 1

table = 'Species'

# Create empty list for elcode values
elcodeList = []

with arcpy.da.UpdateCursor(table, ["ELCODE"]) as cursor:
    for row in cursor:
        # Append elcode values to list
        elcodeList.append(row[0])

del cursor

# Remove duplicates from list
elcodeList = dict.fromkeys(elcodeList)
elcodeList = elcodeList.keys()

arcpy.MakeTableView_management(table, "tableView")

# Create empty list for OBJECTIDs
oidList = []

# Iterate through each elcode and check for duplicate rows
for code in elcodeList:
    expression = "ELCODE = '{0}'".format(code)
    arcpy.SelectLayerByAttribute_management("tableView", "NEW_SELECTION", expression)
    result = arcpy.GetCount_management("tableView")
    count = int(result.getOutput(0))
    # if more than 1 row returned
    if count > 1:
        # Run Summary Stats to get MAX Percentage
        arcpy.Statistics_analysis("tableView", r"IN_MEMORY\Species_percent", [["PERCENTAGE", "MAX"]])
        with arcpy.da.UpdateCursor(r"IN_MEMORY\Species_percent", ["MAX_PERCENTAGE"]) as cursor:
            for row in cursor:
                maxPercentage = row[0]
        del cursor
        # Select OBJECTID from view that had the max percent
        with arcpy.da.SearchCursor("tableView", ["OBJECTID"], "PERCENTAGE = " + str(maxPercentage)) as cursor:
            for row in cursor:
                # Append OBJECTID to list
                oidList.append(row[0])
        del cursor

# Convert list to string
oid = ''.join(str(str(e) +",") for e in oidList)
oidExpression = "OBJECTID IN (" + oid[:-1] + ")"

# Select all records based on OBJECTIDs and copy
arcpy.SelectLayerByAttribute_management("tableView", "NEW_SELECTION", oidExpression)
arcpy.CopyRows_management("tableView", "Species_Percentage")

print("Finished")



‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

JoshuaBixby · ‎05-02-2017

I would ditch the spaces in column names, it is only going to create headache for you at some point, especially if scripting.

I assume you are primarily working with file geodatabases? This is one of those situations where there is a very straightforward, even simple, ANSI SQL-compatible approach that works in all geodatabases except file geodatabases because file geodatabases don't support correlated subqueries. The following even works in SQLite databases.

>>> # Assume field names don't have spaces so we don't have
>>> #   to deal with various field delimiters
>>> 
>>> gdb =  # path to geodatabse containing table
>>> fc_name = # name of feature class or table
>>> vw_name = # name of output view_out
>>> 
>>> arcpy.env.workspace = gdb
>>> sql = """NOT EXISTS 
...        (SELECT 1 FROM {0} t1
...         WHERE t1.PlanningPolygonID = {0}.PlanningPolygonID
...           AND t1.ELCODE = {0}.ELCODE
...           AND t1.SeasonCode = {0}.SeasonCode
...           AND t1.PERCENTAGE > {0}.PERCENTAGE)""".format(fc_name)
...           
>>> arcpy.MakeTableView_management(fc_name, vw_name, sql)
<Result 'view_out'>
>>> arcpy.GetCount_management(vw_name)
<Result '11'>
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

lshipman-esristaff‌, I know you are always looking for business cases for adding correlated subqueries to file geodatabases, here is another one. I can understand shape files not supporting correlated subqueries, but it seems like a big oversight with file geodatabases.

JoshuaBixby · ‎05-02-2017

Although my preference would be to use SQL, like I demonstrated in my previous comment, the lack of correlated subquery support in file geodatabases complicates trying to do so. Fortunately, it can be done in Python without too much extra work.

The itertools.groupby() function was made exactly for situations like this one. The following code works by modifying an existing table to remove the extra rows, so test it out on a copy first.

from itertools import groupby
from operator import itemgetter

fc = # path to feature class
case_fields = ["Planning Polygon ID", "ELCODE", "SeasonCode"]
max_field = "PERCENTAGE"

sql_orderby = "ORDER BY {}, {} DESC".format(", ".join(case_fields), max_field)

with arcpy.da.UpdateCursor(fc, "*", sql_clause=(None, sql_orderby)) as cur:
    case_func = itemgetter(
        *(cur.fields.index(fld) for fld in case_fields)
    )
    for key, group in groupby(cur, case_func):
        next(group)
        for extra in group:
            cur.deleteRow()
            ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The format of the code above is flexible enough that it can accommodate a single case field or multiple case fields. Without too much extra work, this code could be expanded to create a handy function where the user passes a few arguments and the table or feature class is collapse/reduced.

MollyMoore · ‎05-03-2017

This works perfectly and is succinct. I am relatively new to Python, so I learned a lot after looking up these tools running this bit of code. Thank you!

JoshuaBixby · ‎05-03-2017

I am glad it worked out for you, and it is especially good you took the time to dissect the code to understand how it works. Python is a great language to learn these days.

I started a blog series a while back titled, "The Iterable Cursor:..," that explores how to look at ArcPy cursors as "iterables" in the Python world. Once the connection is made, it helps one apply all of the various tools that apply to iterables.

This question and code have motivated me to look into a second "Built-ins & Itertools" post that focuses not just on accessing and analyzing data but managing and editing it as well. I still have ideas for other posts looking at how cursors work with the SciPy functionality that is now supported.