My script is sometimes removing data from my map. I developed a script to check a key field in several feature classes for duplicates. The script works sometimes, but other times it throws error 000732. This mainly happens with two feature classes: one where duplicates are found, and one where I need to test two different fields. Before running the script I verified the layer was in the TOC and on the map, but after the error it was gone from the TOC. The strange thing is that FindIdentical runs fine inside the loop; the error comes at the AddJoin call. The same thing happens with the layer where I need to check two fields: FindIdentical on the first field runs fine, but FindIdentical on the second field can return the error. Any insights would be appreciated.
My code is:
F=("Sanitary Sewer Manholes","Water Network Structures","Water Fittings","Water System Valves","Water Service Connections","Water Hydrants","Sewer Gravity Mains","Sewer Pressurized Mains","Storm Drain Pipes", "Water Lateral Lines", "Water Mains")
b =0
for e in F:
a=F[b]
s=str(a).replace(" ", "_")
o= "C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\" +s+"_findidencal"+str(9)
m="C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\MBID_duplicate" + str(1)
arcpy.management.FindIdentical(
in_dataset= a,
out_dataset=o,
fields="FACILITYID",
xy_tolerance=None,
z_tolerance=0,
output_record_option="ONLY_DUPLICATES")
count_result = arcpy.management.GetCount(o)
count = int(count_result[0])
if count > 0:
print("Duplicate entries in "+str (a) )
print(f"Number of duplicates found: {count}")
arcpy.management.AddJoin(
in_layer_or_view=a,
in_field="OBJECTID",
join_table=o,
join_field="IN_FID",
join_type="KEEP_ALL",
index_join_fields="NO_INDEX_JOIN_FIELDS",
rebuild_index="NO_REBUILD_INDEX",
join_operation="JOIN_ONE_TO_FIRST"
)
arcpy.management.SelectLayerByAttribute(
in_layer_or_view= a,
selection_type="NEW_SELECTION",
where_clause="IN_FID IS NOT NULL",
invert_where_clause=None
)
else:
print("no Duplicate in " + str(a) )
b=b+1
arcpy.management.SelectLayerByAttribute(
in_layer_or_view="Water Service Connections",
selection_type="NEW_SELECTION",
where_clause="MBID IS NOT NULL",
invert_where_clause=None
)
arcpy.management.FindIdentical(
in_dataset="Water Service Connections",
out_dataset=m,
fields="MBID",
xy_tolerance=None,
z_tolerance=0,
output_record_option="ONLY_DUPLICATES"
)
if count > 0:
print("Duplicate MBID in Meters " )
print(f"Number of duplicates found: {count}")
arcpy.management.AddJoin(
in_layer_or_view="Water Service Connections",
in_field="OBJECTID",
join_table=m,
join_field="IN_FID",
join_type="KEEP_ALL",
index_join_fields="NO_INDEX_JOIN_FIELDS",
rebuild_index="NO_REBUILD_INDEX",
join_operation="JOIN_ONE_TO_FIRST"
)
arcpy.management.SelectLayerByAttribute(
in_layer_or_view= "Water Service Connections",
selection_type="NEW_SELECTION",
where_clause="IN_FID IS NOT NULL",
invert_where_clause=None
)
else:
print("no Duplicate MBIDs ")
ArcGIS Pro 3.5.1
Data source: enterprise geodatabase
Your setup isn't removing data; it's overwriting layers. The layer's data source is still there, but because each run creates a new layer with the same name, the original appears to vanish.
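You can confirm that by pointing arcpy at the source feature class directly; the connection file and name below are just placeholders for your enterprise geodatabase:
import arcpy

# Placeholder path: substitute your own .sde connection file and feature class name.
src = r"C:\connections\my_connection.sde\WaterServiceConnections"
print(arcpy.Exists(src))                  # True: the underlying data is still there
print(arcpy.management.GetCount(src)[0])  # and the row count is unchanged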
What I did here is ignore the layers entirely and work directly on the source data: first load all the features into memory and build a mapping (dict) that gathers the features from every input source in one place, then find all records that collide across the datasets (len(mapping[fid]) > 1).
I only avoided the call to FindIdentical because I already know what I need, and a hash map handles that perfectly. If you need the side effects of running that tool, you should absolutely use it, but since all we need is a sequence of feature class names and object IDs, we can drop that complexity.
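A minimal sketch of that approach could look like this; the key field and feature class names are just borrowed from your script, so swap in whatever applies to your data:
import arcpy

# Assumed inputs: feature classes and key field taken from the question.
feature_classes = ("Sanitary Sewer Manholes", "Water Service Connections", "Water Hydrants")
key_field = "FACILITYID"

# Map each key value to every (feature class, OBJECTID) that carries it.
mapping = {}
for fc in feature_classes:
    with arcpy.da.SearchCursor(fc, ["OID@", key_field]) as cursor:
        for oid, key in cursor:
            if key is not None:
                mapping.setdefault(key, []).append((fc, oid))

# Any key held by more than one record is a duplicate/collision.
duplicates = {key: recs for key, recs in mapping.items() if len(recs) > 1}
for key, recs in duplicates.items():
    print(f"{key}: {recs}")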
If you do want to keep using the layer operations and higher-level tools, you can. I'd just make sure you put the intermediate outputs in a memory layer instead of writing them directly to a feature class on disk.
import arcpy
fc = 'my_fc'
join = arcpy.management.AddJoin(f'memory/{fc}', ...).getOutput(0)
# Other operations on join
...
final = arcpy.management.DeleteIdentical(join, ...).getOutput(0)
# Here we finally output a layer
arcpy.management.CopyFeatures(final, f'<Full_Path>/{fc}')
In that snippet, the join is an intermediate step and we don't output it to the map; that only happens at the end, when CopyFeatures is called. Creating intermediate layers while the environment has addOutputsToMap enabled can be a pain. You can also use EnvManager to keep layers from being overwritten and added to the map:
import arcpy
from arcpy import EnvManager
fc = 'my_fc'
with EnvManager(addOutputsToMap=False, overwriteOutput=False):
    join = arcpy.management.AddJoin(f'{fc}', ...).getOutput(0)
In the end there are a ton of different ways to do things with arcpy. I tend to stick to the purest Python I can, since the APIs for some of the tool functions can be difficult to work with or oddly documented. A cursor is just a quick way to turn your source data into Python objects, which means you now have a data science problem on your hands instead of a workflow problem.