topic Re: my python script to check for duplicates is sometimes removing data. in Python Questions

my python script to check for duplicates is sometimes removing data.

Laura_m_Conner — Wed, 11 Jun 2025 13:50:33 GMT

My script is sometimes removing data from my map. I have developed a script to check a key field in several feature classes for duplicates. The script does work sometimes, but other times, it throws an error, error 000732. This mainly happens to 2 feature classes, one in which duplicates are found, and the other is the one that needs to test 2 different fields. Before running the script, I verified the layer was in the TOC and on the map. However, it was gone when I checked the TOC after the error. The strange thing is it can run the find identical in the loop fine without error, but it throws the error when it comes to the add join function. This also happens with the layer in which I need to check two fields. The find identical for the first field runs fine, but the find identical on the 2nd field can return the error. Any insights would be appreciated.

my code is:

F=("Sanitary Sewer Manholes","Water Network Structures","Water Fittings","Water System Valves","Water Service Connections","Water Hydrants","Sewer Gravity Mains","Sewer Pressurized Mains","Storm Drain Pipes", "Water Lateral Lines", "Water Mains") b =0 for e in F: a=F[b] s=str(a).replace(" ", "_") o= "C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\" +s+"_findidencal"+str(9) m="C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\MBID_duplicate" + str(1) arcpy.management.FindIdentical( in_dataset= a, out_dataset=o, fields="FACILITYID", xy_tolerance=None, z_tolerance=0, output_record_option="ONLY_DUPLICATES") count_result = arcpy.management.GetCount(o) count = int(count_result[0]) if count > 0: print("Duplicate entries in "+str (a) ) print(f"Number of duplicates found: {count}") arcpy.management.AddJoin( in_layer_or_view=a, in_field="OBJECTID", join_table=o, join_field="IN_FID", join_type="KEEP_ALL", index_join_fields="NO_INDEX_JOIN_FIELDS", rebuild_index="NO_REBUILD_INDEX", join_operation="JOIN_ONE_TO_FIRST" ) arcpy.management.SelectLayerByAttribute( in_layer_or_view= a, selection_type="NEW_SELECTION", where_clause="IN_FID IS NOT NULL", invert_where_clause=None ) else: print("no Duplicate in " + str(a) ) b=b+1 arcpy.management.SelectLayerByAttribute( in_layer_or_view="Water Service Connections", selection_type="NEW_SELECTION", where_clause="MBID IS NOT NULL", invert_where_clause=None ) arcpy.management.FindIdentical( in_dataset="Water Service Connections", out_dataset=m, fields="MBID", xy_tolerance=None, z_tolerance=0, output_record_option="ONLY_DUPLICATES" ) if count > 0: print("Duplicate MBID in Meters " ) print(f"Number of duplicates found: {count}") arcpy.management.AddJoin( in_layer_or_view="Water Service Connections", in_field="OBJECTID", join_table=m, join_field="IN_FID", join_type="KEEP_ALL", index_join_fields="NO_INDEX_JOIN_FIELDS", rebuild_index="NO_REBUILD_INDEX", join_operation="JOIN_ONE_TO_FIRST" ) arcpy.management.SelectLayerByAttribute( in_layer_or_view= "Water Service Connections", selection_type="NEW_SELECTION", where_clause="IN_FID IS NOT NULL", invert_where_clause=None ) else: print("no Duplicate MBIDs ")

Pro 3.5.1

data source: enterprise geodata base

Re: my python script to check for duplicates is sometimes removing data.

TonyAlmeida — Wed, 11 Jun 2025 14:55:54 GMT

I think one of your issues is with the AddJoin and SelectLayerByAttribute, it can create the appears of deleted features.

join_type="KEEP_ALL" keeps all records from the left table (a) and includes only matching rows from the join table (o) where OBJECTID = IN_FID.

Filtering with "IN_FID IS NOT NULL" isolates features identified as duplicates.

Unmatched features aren’t deleted they're just unselected or excluded.

Always work with feature layers not feature classes, and clear the selection after your done.

clear join after SelectLayerByAttribute

Re: my python script to check for duplicates is sometimes removing data.

HaydenWelch — Wed, 11 Jun 2025 15:24:28 GMT

I'm noticing that you have a primary loop that loops over features followed by the same thing flattened. Is that intentional?

You're also not changing the name of the out `MBID_duplicate` layer. it's always evaluating to "...MBID_duplicate1". It can also help to use verbose variable names. Here's a modified version of your code that tries to simplify the main loop so you can identify your error:

import arcpy from pathlib import Path from functools import partial features=( "Sanitary Sewer Manholes", "Water Network Structures", "Water Fittings", "Water System Valves", "Water Service Connections", "Water Hydrants", "Sewer Gravity Mains", "Sewer Pressurized Mains", "Storm Drain Pipes", "Water Lateral Lines", "Water Mains", ) gdb = Path(r"C:\Users\lconner\Documents\ArcGIS\Projects\edit_map6\edit_map6.gdb") p_find = partial( arcpy.management.FindIdentical( fields="FACILITYID", z_tolerance=0, output_record_option="ONLY_DUPLICATES", ) ) p_join = partial( arcpy.management.AddJoin( in_field="OBJECTID", join_field="IN_FID", join_type="KEEP_ALL", index_join_fields="NO_INDEX_JOIN_FIELDS", rebuild_index="NO_REBUILD_INDEX", join_operation="JOIN_ONE_TO_FIRST", ) ) p_select = partial( arcpy.management.SelectLayerByAttribute( selection_type="NEW_SELECTION", where_clause="IN_FID IS NOT NULL", ) ) for feature in features: feature=feature.replace(" ", "_") # This was originally writing all mbid_duplicate tables to the # same features `MBID_duplicate1` out_feature = gdb / f"FindIdentical_{feature}" mbid_feature = gdb / f"MBID_duplicate_{feature}" # Originally you were counting the input features, not the identical table identicals, _ = p_find(in_dataset=feature, out_dataset=out_feature) count = sum(1 for _ in arcpy.da.SearchCursor(identicals, ['OID@'])) # Continue the loop if no identicals found if count == 0: print(f"No duplicate in {feature}" ) continue # Log identical counts and add join, then select the features that don't have a null IN_FID print(f"{count} duplicate entries in {feature}") p_join(in_layer_or_view=feature, join_table=out_feature) p_select(in_layer_or_view=feature)

You can ignore or remove the `partial` blocks, I just did that to pre-fill the constant arguments in the tool calls and bubble them up to the top of the script in case you want to change them later.

Re: my python script to check for duplicates is sometimes removing data.

Laura_m_Conner — Wed, 11 Jun 2025 17:16:50 GMT

@ TonyAlmeida, The problem is not that features are disappearing from the feature layer; the problem is that the feature layer is being removed from my map in the middle of running the script. I know that when I open the attribute table, I just see the selected duplicates. Would the add join or SelectLayerByAttribute cause the feature layer to be removed from the map?

@HaydenWelch, Yes, the loop and then rechecking the one file are intentional. All feature layers need the FID checked for duplicates. However, only the meters have the MBID field, and thus, they are the only feature layer where that needs to be checked, too.

As of now, the names of the outputs are not causing a problem. I thought that might be the issue. I added the + str(1) on the end as a crude way to get a unique output name, as I could change the number in the (). It would be best to generate a unique output name every time. It would be appreciated if you could point me to some article or explanation on how to do that, but I don't think that is causing the problem.

This is the first I have heard of partial functions, so I will need to read up on them before I use them. Do you have any good resources for that? Would the absence of the partial function cause it to remove a feature layer from the map in the middle of a script? Right now, I am focused on finding what is causing the problem. I am not sure how simplifying the loop will help with that.

Here is the printout and error message I am getting:

the script returns:

no Duplicate in Sanitary Sewer Manholes
no Duplicate in Water Network Structures
no Duplicate in Water Fittings
no Duplicate in Water System Valves
no Duplicate in Water Service Connections
no Duplicate in Water Hydrants
no Duplicate in Sewer Gravity Mains
no Duplicate in Sewer Pressurized Mains
no Duplicate in Storm Drain Pipes
Duplicate entries in Water Lateral Lines
Number of duplicates found: 6

The error in full i am getting is:

ExecuteError                              Traceback (most recent call last)
Cell In[5], line 25
     23     print("Duplicate entries in "+str (a) )
     24     print(f"Number of duplicates found: {count}")
---> 25     arcpy.management.AddJoin(
     26         in_layer_or_view=a,
     27         in_field="OBJECTID",
     28         join_table=o,
     29         join_field="IN_FID",
     30         join_type="KEEP_ALL",
     31         index_join_fields="NO_INDEX_JOIN_FIELDS",
     32         rebuild_index="NO_REBUILD_INDEX",
     33         join_operation="JOIN_ONE_TO_FIRST"
     34     )
     37     arcpy.management.SelectLayerByAttribute(
     38         in_layer_or_view= a,
     39         selection_type="NEW_SELECTION",
     40         where_clause="IN_FID IS NOT NULL",
     41         invert_where_clause=None
     42         )
     45 else:

File C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py:13600, in AddJoin(in_layer_or_view, in_field, join_table, join_field, join_type, index_join_fields, rebuild_index, join_operation)
  13598     return retval
  13599 except Exception as e:
> 13600     raise e

File C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py:13582, in AddJoin(in_layer_or_view, in_field, join_table, join_field, join_type, index_join_fields, rebuild_index, join_operation)
  13578 from arcpy.arcobjects.arcobjectconversion import convertArcObjectToPythonObject
  13580 try:
  13581     retval = convertArcObjectToPythonObject(
> 13582         gp.AddJoin_management(
  13583             *gp_fixargs(
  13584                 (
  13585                     in_layer_or_view,
  13586                     in_field,
  13587                     join_table,
  13588                     join_field,
  13589                     join_type,
  13590                     index_join_fields,
  13591                     rebuild_index,
  13592                     join_operation,
  13593                 ),
  13594                 True,
  13595             )
  13596         )
  13597     )
  13598     return retval
  13599 except Exception as e:

File C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py:533, in Geoprocessor.__getattr__.<locals>.<lambda>(*args)
    531 val = getattr(self._gp, attr)
    532 if callable(val):
--> 533     return lambda *args: val(*gp_fixargs(args, True))
    534 else:
    535     return convertArcObjectToPythonObject(val)

ExecuteError: Failed to execute. Parameters are not valid.
ERROR 000732: Layer Name or Table View: Dataset Water Lateral Lines does not exist or is not supported
Failed to execute (AddJoin).

P.S. is there any thing i could have added to my original post to help clarify it?

Thanks

Laura

Re: my python script to check for duplicates is sometimes removing data.

HaydenWelch — Wed, 11 Jun 2025 17:48:31 GMT

Simplifying the loop just makes it easier to see exactly where the code is failing. I believe your issue is still in the fact that you are looping over the features and essentially doing nothing. Each loop overwrites the output of the previous. I fixed that and commented it in my version. When you go to join, you're joining the last table in the loop.

You are also not using the updated name (something I also fixed in my version) and are calling the space delimited version of the feature name which seems to not exist.

I only did the partials and the loop simplification to make it clear that your core logic was incorrect. Sometimes cleaning things up like that can make the issue super obvious.

Re: my python script to check for duplicates is sometimes removing data.

TonyAlmeida — Wed, 11 Jun 2025 17:50:10 GMT

I also have some scripts that remove the layer for what ever reason, annoying ESRI bug is my guess. the issue is the Water Later lines, AddJoin requires the input to be a layer in the current map, not just a feature class name. So create feature layres, something like this,

This how I would write the code, untested but hopefully it helps.

F = ("Sanitary Sewer Manholes", "Water Network Structures", "Water Fittings", "Water System Valves", "Water Service Connections", "Water Hydrants", "Sewer Gravity Mains", "Sewer Pressurized Mains", "Storm Drain Pipes", "Water Lateral Lines", "Water Mains") for a in F: # No need for counter 'b', just iterate directly s = str(a).replace(" ", "_") o = "C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\" + s + "_findidencal" + str(9) # First create a feature layer from the feature class layer_name = a + "_layer" arcpy.MakeFeatureLayer_management(a, layer_name) arcpy.management.FindIdentical( in_dataset=a, out_dataset=o, fields="FACILITYID", xy_tolerance=None, z_tolerance=0, output_record_option="ONLY_DUPLICATES") count_result = arcpy.management.GetCount(o) count = int(count_result[0]) if count > 0: print("Duplicate entries in " + str(a)) print(f"Number of duplicates found: {count}") # Use the layer instead of the feature class name arcpy.management.AddJoin( in_layer_or_view=layer_name, in_field="OBJECTID", join_table=o, join_field="IN_FID", join_type="KEEP_ALL", index_join_fields="NO_INDEX_JOIN_FIELDS", rebuild_index="NO_REBUILD_INDEX", join_operation="JOIN_ONE_TO_FIRST" ) arcpy.management.SelectLayerByAttribute( in_layer_or_view=layer_name, selection_type="NEW_SELECTION", where_clause="IN_FID IS NOT NULL", invert_where_clause=None ) else: print("no Duplicate in " + str(a)) # For the MBID check, you should similarly create a layer first arcpy.MakeFeatureLayer_management("Water Service Connections", "Water_Service_Connections_layer") m = "C:\\Users\\lconner\\Documents\\ArcGIS\\Projects\\edit_map6\\edit_map6.gdb\\MBID_duplicate" + str(1) arcpy.management.SelectLayerByAttribute( in_layer_or_view="Water_Service_Connections_layer", selection_type="NEW_SELECTION", where_clause="MBID IS NOT NULL", invert_where_clause=None ) arcpy.management.FindIdentical( in_dataset="Water_Service_Connections_layer", out_dataset=m, fields="MBID", xy_tolerance=None, z_tolerance=0, output_record_option="ONLY_DUPLICATES" ) count_result = arcpy.management.GetCount(m) # You need to get the count here count = int(count_result[0]) if count > 0: print("Duplicate MBID in Meters") print(f"Number of duplicates found: {count}") arcpy.management.AddJoin( in_layer_or_view="Water_Service_Connections_layer", in_field="OBJECTID", join_table=m, join_field="IN_FID", join_type="KEEP_ALL", index_join_fields="NO_INDEX_JOIN_FIELDS", rebuild_index="NO_REBUILD_INDEX", join_operation="JOIN_ONE_TO_FIRST" ) arcpy.management.SelectLayerByAttribute( in_layer_or_view="Water_Service_Connections_layer", selection_type="NEW_SELECTION", where_clause="IN_FID IS NOT NULL", invert_where_clause=None ) else: print("no Duplicate MBIDs")

Re: my python script to check for duplicates is sometimes removing data.

Laura_m_Conner — Wed, 11 Jun 2025 18:56:01 GMT

I am taking that there is no way to mitigate the problem of the script removing the layer. Is there some where I can give the details to Esri so they can work on a fix for the problem.

Re: my python script to check for duplicates is sometimes removing data.

HaydenWelch — Wed, 11 Jun 2025 19:15:19 GMT

No, you're just using a constant string value for the output layer and so its being overwritten. If you want the layer to persist, you need the name to change for each feature. I Did that in the first few lines of my simplified loop by adding the feature name to the end of the layer name.

to clarify, `str(1)` is not unique. It's the equvalent of just putting a "1" in there. You need that value to change.

Re: my python script to check for duplicates is sometimes removing data.

HaydenWelch — Wed, 11 Jun 2025 20:44:21 GMT

Okay, I think I was initially confused by your question. I now see that you're trying to check for duplicate IDs in all feature classes. I'm gonna revise my initial response with one that actually tries to solve your problem.

I did manage to get it all done using SearchCursors and Python's built in mapping. All you need to do is create a dictionary of FACILITYID keys with a sequence of something that contains the name of the feature class and the OBJECTID for the duplicate feature (for selection later).

After filtering out the facility records that only have one hit, you can then re-map those duplicates to a new mapping of Feature Class to Object IDs (since the original check was against all Feature Classes). From there we can print out the feature classes that have no duplicates as well as the count of duplicates in each feature class that does contain collisions:

from pathlib import Path from arcpy.da import SearchCursor from arcpy.mp import Layer from arcpy.management import ( SelectLayerByAttribute, CopyFeatures, ) feature_classes=( "Sanitary Sewer Manholes", "Water Network Structures", "Water Fittings", "Water System Valves", "Water Service Connections", "Water Hydrants", "Sewer Gravity Mains", "Sewer Pressurized Mains", "Storm Drain Pipes", "Water Lateral Lines", "Water Mains", ) gdb = Path(r"C:\Users\lconner\Documents\ArcGIS\Projects\edit_map6\edit_map6.gdb") # Initialize a dictionary that will contain duplicate entries # in the form: {<facility>: [(<feature_class>, <oid>), ...]} facility_mapping: dict[int|str, list[tuple[str, int]]] = {} # Build the mapping of facilities for feature_class in feature_classes: # Sanitize name feature_class = feature_class.replace(' ', '_') # The `feature_class` in this call might need to be interpolated to see your features, # Possibly with the gdb global above if that's where they live for facility_id, object_id in SearchCursor(feature_class, ['FACILITYID', 'OID@']): facility_mapping.setdefault(facility_id, []) facility_mapping[facility_id].append((feature_class, object_id)) # Filter the mapping to only have facilities with duplicates facility_mapping = { facility: duplicates for facility, duplicates in facility_mapping.items() if len(duplicates) > 1 } # Clear current selection SelectLayerByAttribute(selection_type='CLEAR_SELECTION') # Build a selection map for selecting duplicates in all FCs selections: dict[str, list[int]] = {} for _, record in facility_mapping.items(): # Unpack the record fc, oid = record # Add the OID to the selection mapping selections.setdefault(fc, []) selections[fc].append(oid) # Final loop for printing and selecting for fc in feature_classes: # Make sure this points to your features! fc = fc.replace(' ', '_') if fc not in selections: print(f"No duplicates found in {fc.replace('_', ' ')}") continue print(f"Found {len(selections[fc])} duplicates in {fc.replace('_', ' ')}") SelectLayerByAttribute( in_layer_or_view=fc, selection_type='ADD_TO_SELECTION', where_clause=f"OBJECTID IN ({','.join(selections[fc])})", ) # Uncomment to create a new layer with duplicates # CopyFeatures(fc, f"{fc}_duplicates")

Re: my python script to check for duplicates is sometimes removing data.

Laura_m_Conner — Thu, 12 Jun 2025 16:50:59 GMT

ok, I have a few questions.

1. would this approach keep arc pro from sometimes removing Water Lateral Lines from the project while in the middle of executing the script.

2 Is there a way or approach that generally stops arc pro form arbitrarily removing data from the project in the middle of running the script?

3. Is there a reason you are not using the built in find identical tool instead of using long hand to specify the operation?

I am trying tying to see if i can get your script to work for me. this is my 1st time with the Search Cursor function so it might take some time to figure it out.

Re: my python script to check for duplicates is sometimes removing data.

HaydenWelch — Fri, 13 Jun 2025 16:22:11 GMT

Your setup isn't removing data, it's overwriting layers. The layer data source is still there, but since you're creating a new layer with the same name, it appears to vanish.

What I did here is just ignore the layers entirely. I'm working directly on the source data. First I load all the features into memory, build a mapping (dict) that organizes the features from all input sources into a single location. Then I find all records that have collisions between the datasets (len (mapping[fid]) > 1).

I only chose to avoid the function call to FindIdentical, because I already know what I need and a hash map can accommodate that perfectly. If you needed the side effects of running that function, you should absolutely use it, but since all we need is a sequence of feature names and object IDs, we can drop that complexity.

If you did want to continue using the layer operations and higher level functions, you can. I'd just make sure that you are putting them in a memory layer instead of writing them directly to a featureclass on disk.

import arcpy fc = 'my_fc' join = arcpy.management.AddJoin(f'memory/{fc}', ...).getOutput(0) # Other operations on join ... final = arcpy.management.DeleteIdentical(join, ...).getOutput(0) # Here we finally output a layer arcpy.management.CopyFeatures(final, f'<Full_Path>/{fc}')

In that snippet, the join is an intermediate step and we don't output it to the map. We only do that at the end when CopyFeatures is called. Creating intermediate layers while the environment is set to addOutputsToMap can be a pain. You can also use EnvManager to prevent layers from being overwritten and added:

import arcpy from arcpy import EnvManager fc = 'my_fc' with EnvManager(addOutputsToMap=False, overwriteOutput=False): join = arcpy.management.AddJoin(f'{fc}', ...).getOutput(0)

In the end there's a ton of different ways to do things with arcpy. I tend to stick to the most pure python I can since the APIs for some of the tool functions can be difficult to work with or oddly documented. A Cursor is just a quick way to turn your source data into Python objects which means now you just have a data science problem on your hands instead of a workflow problem.