
Find Identical tool Replacement

10-30-2013 05:28 AM
ClintonCooper1
Deactivated User
I am trying to create a Python script to substitute for the Find Identical tool. I ran the tool last night on a large dataset, and it is taking 10+ hours. I believe a search cursor plus an update cursor will be light years ahead in performance. So far, I have gotten this far with my script:

import arcpy
from arcpy import env
env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"

table = "WAYNE"
list = []

with arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"]) as cursor:
    for row in cursor:
        list.append(row[0])
del row, cursor

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for updateRow in updateRows:
        nameValue = updateRow[0]
        if nameValue in list:
            updateRow[1] = lutDict[nameValue]
            updateRows.updateRow(updateRow)
del updateRow, updateRows


To be specific for what I am doing, I need to search through a field (that had duplicate values) and return a new value that is a unique number for all the different set of duplicates.  For example:

search ID   new Unique ID
aaa        1
aaa        1
bbb        2
ccc        3
ccc        3
aaa        1
ddd        4

So the new ID would increment only when a search ID value appears for the first time, and every successive row with the same search ID would get the same number.

Any thoughts on how to accomplish this?  Thanks in advance!!
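Stripped of the cursors, the ID assignment being described is just a first-seen-order lookup table. A minimal pure-Python sketch (using the sample values from the table above, no arcpy involved):

```python
def assign_seq_ids(values):
    """Map each distinct value to a sequential ID, in first-seen order."""
    seq_ids = {}
    result = []
    for v in values:
        if v not in seq_ids:
            seq_ids[v] = len(seq_ids) + 1  # next unused ID
        result.append(seq_ids[v])
    return result

assign_seq_ids(["aaa", "aaa", "bbb", "ccc", "ccc", "aaa", "ddd"])
# -> [1, 1, 2, 3, 3, 1, 4]
```

The same dictionary would then feed the update cursor: look up each row's field value and write the stored ID back.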
17 Replies
DouglasSands
Deactivated User
I'm not sure. Although now that I think about it, I have had trouble in the past assigning values to a row directly from a dictionary. Try this?

import arcpy

from arcpy import env
env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"

table = "WAYNE"

uniqueValues = {}
values = []
newID = 0

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        values.append(nameValue)
        if nameValue in uniqueValues:
            row[1] = uniqueValues[nameValue]
        else:
            newID += 1
            uniqueValues[nameValue] = newID
            row[1] = newID
        updateRows.updateRow(row)
            
del row
del updateRows

uniqueCount = {}
for val in uniqueValues:
    uniqueCount[val] = values.count(val)

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME", "FREQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        count = uniqueCount[nameValue]
        row[1] = count
        updateRows.updateRow(row)
        
del row
del updateRows



If that doesn't work, try inserting a print statement and printing values after the first cursor to see if it gets populated.

For resources: as someone who is also self-taught, I've found that the ESRI online help for each tool has all of the syntax information you need and always includes examples. That's the first place I go when I can't figure something out. Quickly googling "arcpy 10.1 <tool name>" usually gets good results, and so does searching the web in general; lots of people have asked lots of questions that have been answered all over the place.

Good luck!
ClintonCooper1
Deactivated User
I added this line:

values = [row[0] for row in arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"])]


and it worked with this code:

import arcpy

from arcpy import env
env.workspace = r"C:\Users\ccooper\Desktop\DATA.gdb\WAYNE"

table = "WAYNE"

uniqueValues = {}
values = [row[0] for row in arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"])]
newID = 0

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        if nameValue in uniqueValues:
            row[1] = uniqueValues[nameValue]
        else:
            newID += 1
            uniqueValues[nameValue] = newID
            row[1] = newID
        updateRows.updateRow(row)

            
del row, updateRows

uniqueCount = {}
for val in uniqueValues:
    uniqueCount[val] = values.count(val)

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME", "FREQ_NAME"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        row[1] = uniqueCount[nameValue]
        updateRows.updateRow(row)
        
del row, updateRows


on my test data (1/100th the size of my main file), it ran for 85 seconds, which is much slower than my other method. Do you see any inefficiencies in my code that I could change to improve performance?
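One likely culprit is the counting loop: `values.count(val)` rescans the entire list once per unique value, which is quadratic in the table size. `collections.Counter` builds the same frequency table in a single pass. A quick sketch with dummy values to show the two approaches agree:

```python
from collections import Counter

values = ["aaa", "aaa", "bbb", "ccc", "ccc", "aaa", "ddd"]

# Quadratic: one full scan of `values` per unique value
slow_counts = {v: values.count(v) for v in set(values)}

# Linear: a single pass builds the whole frequency table
fast_counts = Counter(values)

assert slow_counts == dict(fast_counts)  # same result, one pass instead of many
```

Swapping the `uniqueCount` loop for a `Counter` built from `values` should remove that bottleneck without changing the output.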
ChrisSnyder
Honored Contributor
Summary statistics is written in C++. It will be significantly faster than any Python solution.


Not completely true... The Summary Stats tool (and the Frequency tool) seems to run a pre-sort on the dataset as step 1 (at least that's what the tool status says it's doing), which doesn't seem to be necessary.

The out of the box Summary Statistics tool takes 13 seconds to get the maximum OBJECTID value for each case field value in my table that has 81k records (5855 unique case field values).

By comparison, this Python code run on the same dataset takes 1.6 seconds to generate the same information:

import arcpy, time
myFC = r"C:\my_fgdb.gdb\my_fc"
statDict = {}
statField = "OID@"
caseField = "ELEV"
time1 = time.clock()
searchRows = arcpy.da.SearchCursor(myFC, [statField,caseField])
for searchRow in searchRows:
   statValue, caseValue = searchRow
   if caseValue in statDict:
      statDict[caseValue].append(statValue)
   else:
      statDict[caseValue] = [statValue]
sumDict = {}
for caseValue in statDict:
   sumDict[caseValue] = len(statDict[caseValue]), max(statDict[caseValue])
time2 = time.clock()
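The same group-then-aggregate pattern, minus arcpy, run on made-up (stat value, case value) pairs instead of cursor rows; the data here is hypothetical, just to show the shape of the approach:

```python
# Made-up (stat_value, case_value) rows standing in for the cursor output
rows = [(10, "a"), (25, "b"), (17, "a"), (3, "b"), (8, "c")]

# Group stat values under their case value, as the cursor loop above does
stat_dict = {}
for stat_value, case_value in rows:
    stat_dict.setdefault(case_value, []).append(stat_value)

# Per case value: (record count, maximum stat value)
sum_dict = {case: (len(vals), max(vals)) for case, vals in stat_dict.items()}
# -> {"a": (2, 17), "b": (2, 25), "c": (1, 8)}
```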


In addition, the ESRI Summary Statistics tool (and the Frequency tool) gives incorrect results in the output table when the case field values are either NULL or 0. This was a bug that was fixed a long time ago, but it seems to be back (at least in v10.1 SP1).

As a solution to little issues like this, I too have a little collection of Python-based code/tools I have written over the years that are either bug workarounds or major performance enhancements for some of the out-of-the-box geoprocessing tools.
ClintonCooper1
Deactivated User
In addition, the ESRI Summary Statistics tool (and the Frequency tool) gives incorrect results in the output table when the case field values are either NULL or 0. This was a bug that was fixed a long time ago, but it seems to be back (at least in v10.1 SP1).

As a solution to little issues like this, I too have a little collection of Python-based code/tools I have written over the years that are either bug workarounds or major performance enhancements for some of the out-of-the-box geoprocessing tools.


Based on your solution, what can I do to fix my code? The times are way too high given what you have found.
ChrisSnyder
Honored Contributor
So per your original post where I assume you had some data like:

ORIG_ID
aaa
aaa
bbb
ccc
ccc
aaa
ddd

And you want it to come out as this:

ORIG_ID SEQ_ID
aaa 1
aaa 1
bbb 2
ccc 3
ccc 3
aaa 1
ddd 4

This code should work:

import arcpy
myFC = r"C:\my_fgdb.gdb\my_fc"
valueSet = set([r[0] for r in arcpy.da.SearchCursor(myFC, ["ORIG_ID"])])
valueList = sorted(valueSet)
arcpy.AddField_management(myFC, "SEQ_ID", "LONG")
updateRows = arcpy.da.UpdateCursor(myFC, ["ORIG_ID","SEQ_ID"])
for updateRow in updateRows:
    updateRow[1] = valueList.index(updateRow[0]) + 1
    updateRows.updateRow(updateRow)
del updateRow, updateRows
ChrisSnyder
Honored Contributor
Okay, sorry, I missed the part about you wanting a COUNT field as well... So how about something like:

import arcpy, collections
myFC = r"C:\my_fgdb.gdb\my_fc"
valueList = [r[0] for r in arcpy.da.SearchCursor(myFC, ["ORIG_ID"])]
valueDict = collections.Counter(valueList)
uniqueList = sorted(valueDict)  # if you want SEQ_ID sorted numerically or alphabetically
arcpy.AddField_management(myFC, "SEQ_ID", "LONG")
arcpy.AddField_management(myFC, "COUNT", "LONG")
updateRows = arcpy.da.UpdateCursor(myFC, ["ORIG_ID","SEQ_ID","COUNT"])
for updateRow in updateRows:
    updateRow[1] = uniqueList.index(updateRow[0]) + 1
    updateRow[2] = valueDict[updateRow[0]]
    updateRows.updateRow(updateRow)
del updateRow, updateRows
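One caveat on scaling: `uniqueList.index(...)` is itself a linear scan per row, so on very large tables it can pay to precompute a value-to-ID dictionary for constant-time lookups. A sketch of that variant on dummy values (no arcpy, just the lookup logic):

```python
from collections import Counter

# Stand-in for the ORIG_ID column read by the search cursor
orig_ids = ["aaa", "aaa", "bbb", "ccc", "ccc", "aaa", "ddd"]

counts = Counter(orig_ids)
# One dict lookup per row instead of list.index()'s linear scan
seq_lookup = {v: i + 1 for i, v in enumerate(sorted(counts))}

# Each row becomes (ORIG_ID, SEQ_ID, COUNT)
rows = [(v, seq_lookup[v], counts[v]) for v in orig_ids]
# e.g. rows[0] -> ("aaa", 1, 3)
```

In the update cursor loop above, that would mean replacing `uniqueList.index(updateRow[0]) + 1` with `seq_lookup[updateRow[0]]`.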
ClintonCooper1
Deactivated User
This worked great, Thank you!!
ChristalHigdon
Occasional Contributor
I stumbled upon this after a couple of days of digging into my favorite error 999999, which I hit when running the Find Identical tool on a very large dataset in ArcGIS Desktop 10.2 (from a Python script, but I also tried it through the toolbox). I noticed it was listed as a fixed bug in 10.1, but I'm guessing the datasets I work with are larger than what was tested as a 'large' dataset, or else the bug is back in 10.2. Anyway, I was able to code the above solution directly into my Python script, and it ran WAY faster than the other method (we're talking close to a 14-hour processing time before). Thank you so much for posting this!! It helped me tremendously.