Appending- how to avoid duplicates?

03-27-2017 06:50 AM
TheoFaull
Occasional Contributor III

I have two point shapefiles, both with exactly the same fields. There are some new records AND some duplicate records when comparing the two datasets.

I want to use the Append tool because I don't want to create a new dataset; I just want to add data to the existing original shapefile.

However, when I append the two shapefiles, matching records are appended, thus leaving lots of duplicate records. How can I tell my script to only append new records and ignore duplicates?
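
One general way to do this is to collect the key values already in the original shapefile, select only the source records whose key isn't there yet, and append just those. A rough sketch, assuming both shapefiles share a unique key field (the USER_Id name and the paths below are only placeholders):

# Minimal sketch: append only the records whose key value is not already in the target.
# The paths and the "USER_Id" field name are hypothetical placeholders.
import arcpy

target = r"C:\data\original.shp"   # existing shapefile to keep adding to
source = r"C:\data\updates.shp"    # shapefile with new + duplicate records
key = "USER_Id"                    # assumed unique key present in both datasets

# Key values already present in the target
existing = {row[0] for row in arcpy.da.SearchCursor(target, [key])}

# ObjectIDs of the source records that are genuinely new
with arcpy.da.SearchCursor(source, ["OID@", key]) as cur:
    new_oids = [oid for oid, val in cur if val not in existing]

if new_oids:
    oid_field = arcpy.Describe(source).OIDFieldName
    where = "{} IN ({})".format(oid_field, ",".join(str(o) for o in new_oids))
    lyr = arcpy.management.MakeFeatureLayer(source, "new_records", where)
    arcpy.management.Append(lyr, target, "NO_TEST")

That way only rows whose key isn't already present get added to the original dataset.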

29 Replies
TheoFaull
Occasional Contributor III

FGDB. Thanks for the help Neil. I've gone for the method posted earlier, where I create a new dataset with a date tag added to the end of the feature class name. I just have to remember to re-source the MXDs to this new layer after every update.

EinarDørheim1
New Contributor III

I've been using this script, posted here, to do basically what you asked for in the first place. It needs one field to check for duplicate values and a date field to sort your data. Before this code block runs, I append my new/updated data to my master feature class, then run the code to check for duplicates; if there are any, it keeps only the newest entry.

# Script loops through a feature class and deletes duplicate records, keeping the newest entry
import arcpy

# Set parameters
# Path to the feature class:
fc = r"C:\Users\Documents\ArcGIS\Projects\kun.gdb\kun_A"
# Unique field to search for duplicates
id = "USER_Id"
# Field to sort by: D = descending, A = ascending; choose D to keep the newest entry
sort_field = "date D"

# print("Looking for duplicates")
cursor = arcpy.UpdateCursor(fc, "", "", id, sort_field)
keepList = list()
for row in cursor:
    row_val = row.getValue(id)
    if row_val not in keepList:
        # First time this value is seen: keep the row
        keepList.append(row_val)
    else:
        # Value seen before: this row is an older duplicate, so delete it
        cursor.deleteRow(row)

# print("Finito!")

The downside of this code, as I see it, is that it's slow. For my use it has to loop through 4.5M records each time, taking 15-20 minutes to complete. (So if anyone has tips for faster methods, I'm all ears.)

NeilAyres
MVP Alum

Use the da module.

as in arcpy.da.UpdateCursor(....)

MichaelVolz
Esteemed Contributor

Use a truncate-and-append process. I've been using this type of process for years on file gdbs that are tied to AGS map services and it works well.

You just need to test things when there is a schema change, as that would require stopping the service but hopefully this would be a very infrequent operation.

You might also need to compact the file gdb on some interval (quarterly, semi-annually) as performance can degrade over time as the OBJECTIDs keep getting larger in the file gdb.

If performance gets really bad, you could also create a new file gdb and just import the feature classes in from the existing renamed file gdb (This maintenance would require stopping the service).  This would be even less frequent, maybe once a year or every other year.
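
For illustration, a rough sketch of that kind of truncate-and-append refresh, including the occasional compact (both paths below are hypothetical placeholders):

# Sketch of a truncate-and-append refresh; both paths are hypothetical placeholders.
import arcpy

target = r"C:\data\production.gdb\points"   # feature class behind the map service
source = r"C:\data\staging.gdb\points"      # freshly prepared data with the same schema

arcpy.management.TruncateTable(target)             # empty the target but keep its schema
arcpy.management.Append(source, target, "TEST")    # reload it; schemas match, so TEST is fine

# Occasional maintenance (e.g. quarterly): compact the file gdb to keep performance up
arcpy.management.Compact(r"C:\data\production.gdb")

Because the target feature class itself never changes, the map service and any MXDs pointing at it keep working without re-sourcing.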

EinarDørheim1
New Contributor III

I tried modifying the script to use the da module, but it only slightly improved performance. The previous script with arcpy.UpdateCursor took 1070 seconds, while the new script with arcpy.da.UpdateCursor took 1032 seconds.

Here's the code I used this time: 

import arcpy

fc = r"C:\Users\Documents\ArcGIS\Projects\kun.gdb\kun_A"
fieldss = ["USER_Id", "date"]
keepList = []
with arcpy.da.UpdateCursor(fc, fieldss, sql_clause=(None, "ORDER BY date DESC")) as cursor:
    for row in cursor:
        row_val = [0]
        if row_val not in keepList:
            keepList.append(row_val)
        else:
            cursor.deleteRow()

Am I using the da module correctly? After reading about the performance difference between arcpy.da cursors and classic arcpy cursors, I would think the improvement should be larger.

NeilAyres
MVP Alum

In the past I have seen speed improvements of about 10x, so yes, I would expect this to be quicker.

How many records are being deleted?

EinarDørheim1
New Contributor III

In this test I had 424,857 records, of which 19 were duplicates, so after the script runs I'm left with 424,838 records.

NeilAyres
MVP Alum

Your code doesn't look quite right.

It should look a bit like this:

with arcpy.da.UpdateCursor(fc, fieldss, sql_clause=(None, "ORDER BY date DESC")) as cursor:
    for row in cursor:
        row_val = row[0]
        if row_val not in keepList:
            keepList.append(row_val)
        else:
            cursor.deleteRow()

So there are very few (a tiny %) that are actually deleted.

EinarDørheim1
New Contributor III

Sorry, that was a typo on my part; the code does state row_val = row[0], as you pointed out.

Indeed, only a small % gets deleted each time; it will seldom be more than 500 rows that are updated/added each run.

JasonRogers1
New Contributor III

This worked PERFECTLY for me!  Thank you so much for this post!!!
