How does "Delete Identical" choose which duplicate to keep?

6626
8
Jump to solution
02-19-2018 02:00 PM
ArianneFord
New Contributor III

I've run Find Identical to flag duplicates in my dataset, that works fine. I've created a new field for duplicates, and now need to purge entries, but I'm not quite clear on how the Delete Identical tool decides which of the duplicates to delete?

I highlight an example in the following image:

duplicate example

Based on a number of attributes, the highlighted samples have been flagged as duplicates. However while I consider them to be duplicates, you can see they do not in fact contain identical values across all fields. 

This is ok as I know why it occurs, but when you set the field(s) to check on in Delete Identical how does the tool decide which one to keep and which one(s) to delete? Does it keep the first unique object it finds and delete any duplicates thereafter? For example, in the table snapshot above, would it keep the first highlighted object and delete the 2 beneath it? Does it iterate through the file based on FID/OID (so whichever had the lowest FID/OID value would be kept)?

I just want to clarify which order it selects the objects to delete?

Thanks in advance

1 Solution

Accepted Solutions
DanPatterson_Retired
MVP Emeritus

I would do a try with Find Identical first using the option to only keep the duplicates.  Then do a Delete Identical on that tableit.  That will reveal whether the first in the sequence is kept and the remainder deleted.  In this manner, your original data are not touched should the outcome not be what you want.  If all is good, you can run a delete duplicate on the full set after having archived the original in case some one questions the process.

View solution in original post

8 Replies
DanPatterson_Retired
MVP Emeritus

Find Identical http://pro.arcgis.com/en/pro-app/tool-reference/data-management/find-identical.htm

You might want to consider expanding your definition of 'identical' if you have concerns.  Critical fields should go into the definition then it wouldn't matter which field was kept.  In your first two rows, the file_id has different values, so it shouldn't matter which will be deleted, if it does, then that field should be included in the fields definition

ArianneFord
New Contributor III

It matters in so far as I have to document it. It doesn't matter for practical purposes, it unfortunately matters for reporting purposes.

0 Kudos
DanPatterson_Retired
MVP Emeritus

I would do a try with Find Identical first using the option to only keep the duplicates.  Then do a Delete Identical on that tableit.  That will reveal whether the first in the sequence is kept and the remainder deleted.  In this manner, your original data are not touched should the outcome not be what you want.  If all is good, you can run a delete duplicate on the full set after having archived the original in case some one questions the process.

ArianneFord
New Contributor III

Thanks. That answered the question. Be good if ESRI could formally document it so that I can reference it though 

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

Esri doesn't say, which is why no one said here how the tool decides which record to keep.  What Esri's documentation does say, and Dan alludes to in his response, is that the Find Identical and Delete Identical tools use the same algorithm, so you can run the Find Identical tool first to see what records would be deleted using the Delete Identical tool.

DanPatterson_Retired
MVP Emeritus

Well put... trust nothing unless you confirm it yourself

ArianneFord
New Contributor III

It retained the first object it found. I might be mistaken on the specific details, but it looked like it stored each unique object from the original file in a temp file, compared the next object in the original file to the objects in the temp file, if the object existed in the temp file it wouldn't add it to the temp file as a new unique object. If it was a new unique object, it got added to the temp file. It then saved the temp file with "no duplicates" at the end.

0 Kudos

here is a roundabout way to delete the identical records keeping the one with the lowest (minimum) object ID, demonstrated with my sloppy python copied from the results tab.  

arcpy.Merge_management(inputs="CPNewMeasEndEventsY;CPNewMeasEndEventsX;CPNewMeasEventsY;CPNewMeasEventsX", output="C:/temp/routeparts.gdb/CalibrationPoints_MultipartLocals")


arcpy.FindIdentical_management(in_dataset="CalibrationPoints_MultipartLocals", out_dataset="C:/temp/routeparts.gdb/cp_identicals", fields="RID;MEAS;POINT_X;POINT_Y;Measure", xy_tolerance="", z_tolerance="0", output_record_option="ONLY_DUPLICATES")


arcpy.Statistics_analysis(in_table="cp_identicals", out_table="C:/temp/routeparts.gdb/cp_identicals_to_delete", statistics_fields="IN_FID MIN", case_field="FEAT_SEQ")


arcpy.AddJoin_management(in_layer_or_view="CalibrationPoints_MultipartLocals", in_field="OBJECTID", join_table="cp_identicals", join_field="IN_FID", join_type="KEEP_COMMON")


arcpy.SelectLayerByAttribute_management(in_layer_or_view="CalibrationPoints_MultipartLocals", selection_type="NEW_SELECTION", where_clause="1=1")


arcpy.AddJoin_management(in_layer_or_view="CalibrationPoints_MultipartLocals", in_field="cp_identicals.FEAT_SEQ", join_table="cp_identicals_to_delete", join_field="FEAT_SEQ", join_type="KEEP_COMMON")


arcpy.SelectLayerByAttribute_management(in_layer_or_view="CalibrationPoints_MultipartLocals", selection_type="REMOVE_FROM_SELECTION", where_clause="cp_identicals_to_delete.MIN_IN_FID = cp_identicals.IN_FID")


arcpy.RemoveJoin_management(in_layer_or_view="CalibrationPoints_MultipartLocals", join_name="cp_identicals_to_delete")


arcpy.RemoveJoin_management(in_layer_or_view="CalibrationPoints_MultipartLocals", join_name="cp_identicals")


arcpy.DeleteFeatures_management(in_features="CalibrationPoints_MultipartLocals")

0 Kudos