Arcpy - search duplicate attributes, then update values

timdunlevie · ‎01-29-2015

Hi all,

Could someone point me in the right direction.

I have limited python scripting experience but know a enough (to be dangerous).

This is what I want to do.

I have a shapefile which has duplicate objects based on an "ID".

Object1: ID1 + Owner1

Object2: ID1 + Owner2

Object3: ID1 + Owner3

in this example there are 3 objects with identical ID but with 3 different owners for the same object.

What I want to do is search through the shapefiles for all identical "ID", then update a new shapefile with :

Object1: ID1 + Owner1 + Owner2 + Owner3

The new shapefile would have 3 extra columns added for the Owner field to be updated to.

so I want 1 object only but keep the owner information in the row.

thanks,

Tim

XanderBakker · ‎02-03-2015

but you are right, it is possible to fill a dictionary with the features from the first search cursor and use the dictionary during the insert cursor. This might become a problem for large datasets...

XanderBakker · ‎02-03-2015

And this could be the code with a cursor less and an additional dictionary (I think it is a little faster, but I didn't test for speed):

def main():
    import arcpy
    import os

    fc_in = r"D:\Xander\GeoNet\RemoveDuplicates\test.gdb\Parcels"
    fc_out = r"D:\Xander\GeoNet\RemoveDuplicates\test.gdb\Parcels_cleaned"
    fld_id = "ID"
    fld_owner = "OWNER"

    sr = arcpy.Describe(fc_in).spatialReference
    fld_oid = arcpy.Describe(fc_in).OIDFieldName

    # create empty output fc
    fc_ws, fc_name = os.path.split(fc_out)
    arcpy.CreateFeatureclass_management(fc_ws, fc_name, template=fc_in, spatial_reference=sr)

    # create dictionary with ID's and related owners and dictionary with rows
    flds = correctFieldList(arcpy.ListFields(fc_in))
    dct_owner = {}
    dct_ids2use = {}
    with arcpy.da.SearchCursor(fc_in, flds) as curs_in:
        for row_in in curs_in:
            p_id = row_in[flds.index(fld_id)]
            p_owner = row_in[flds.index(fld_owner)]
            if p_id in dct_owner:
                dct_owner[p_id] += ", {0}".format(p_owner)
            else:
                dct_owner[p_id] = p_owner
                dct_ids2use[p_id] = list(row_in) # just register the first encounter of the id
    del row_in, curs_in

    # now do the update:
    with arcpy.da.InsertCursor(fc_out, flds) as curs_out:
        for p_id, lst_row in dct_ids2use.items():
                lst_row[flds.index(fld_owner)] = dct_owner[p_id]
                curs_out.insertRow(tuple(lst_row))

def correctFieldList(flds):
    flds_use = ['Shape@']
    fldtypes_not = ['Geometry', 'Guid', 'OID']
    for fld in flds:
        if not fld.type in fldtypes_not:
            flds_use.append(fld.name)
    return flds_use

if __name__ == '__main__':
    main()

BlakeTerhune · ‎02-03-2015

Here is what I was thinking. Where you do the update in the second part of your original script, maybe do it this way.

I have not tested this though, code may have bugs/errors.

# now do the update:
flds = correctFieldList(arcpy.ListFields(fc_out))
with arcpy.da.InsertCursor(fc_out, flds) as curs_out:
    for p_id, p_owner in dct.items():
        where_clause = '''"{0}" = {1}'''.format(fld_id, p_id)
        with arcpy.da.SearchCursor(fc_in, flds, where_clause) as curs_in:
            lst_row = list(curs_in.next())
            lst_row[flds.index(fld_owner)] = p_owner
            curs_out.insertRow(tuple(lst_row))

Not sure if recreating the search cursor every time is going to slow it down though (compared to writing to a dictionary or reading all rows).

EDIT:

The code syntax highlight keeps adding two extra single quote characters at the beginning of the where clause assignment.

EDIT 2:

I just noticed the InsertCursor documentation says the insertRow can be a list or tuple of values. Maybe it's not necessary to convert the lst_row back to a tuple on the last line?

Sorry if I'm beating a dead horse, I'm still learning.

XanderBakker · ‎02-03-2015

As far as I know it works with tuples (a row is a normally a tuple when using the da cursors), but you are right the documentation states list or tuple...

It will slow down for every time it needs to search for a specific feature (row). If you set attribute indexes to the fields used, it will become less slow, but recreating the cursor will definitely be slower.

But to be sure, you should test and register the time for each process.

KatieBoston · ‎02-03-2023

Hi Xander,

I am working on a very similar script and your example has been super helpful. I am trying my best to interpret some of the language you've used here (new python user here!) to see if there is a way to update not just one field of data but update 2(or more) fields of data if there is a duplicate ID. I'm working with points that fall on top of mining permits. What I am hoping to update are permit types AND commodities. So if ID01 exists 2 times I envision my output to have updated ID01 with both permit types as well as both commodities. In your example above you have p_id and p_owner- would I need to repeat that block of code (the for loop under the first search cursor) with a separate variable (like p_commodity) for the additional field of data I want to update? Any guidance in the right direction would be greatly appreciated!