Find Identical tool Replacement

ClintonCooper1 · ‎10-30-2013

I am trying to l create a python script that will be a substitute to the find identical tool. I ran it last night on a large dataset, and it is taking 10+ hours to run. I believe I can run a search cursor and update cursor that will be light years ahead in performance. So far, I have gotten this far with my script:

import arcpy  from arcpy import env env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"  table = "WAYNE"  list = []  with arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"]) as cursor:     for row in cursor:         list.append(row[0])  del row, cursor  with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:     for updateRow in updateRows:         nameValue = updateRow[0]         if nameValue in list:             updateRow[1] = lutDict[nameValue]             updateRows.updateRow(updateRow) del updateRow, updateRows

To be specific for what I am doing, I need to search through a field (that had duplicate values) and return a new value that is a unique number for all the different set of duplicates. For example:

search ID   new Unique ID
aaa        1
aaa        1
bbb        2
ccc        3
ccc        3
aaa        1
ddd        4

So there would be an increment, but based if the number on the field of search ID is unique, and each successive value that is the same search ID would have the same value.

Any thoughts on how to accomplish this? Thanks in advance!!

DouglasSands · ‎10-30-2013

import arcpy  from arcpy import env env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"  table = "WAYNE"  list = []  with arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"]) as cursor:     for row in cursor:         list.append(row[0])  del row, cursor  with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:     for updateRow in updateRows:         nameValue = updateRow[0]         if nameValue in list:             updateRow[1] = lutDict[nameValue]             updateRows.updateRow(updateRow) del updateRow, updateRows

This is a good start. A few things though. For one, I would avoid naming any variables you define things like list, dict etc that might get confused with list(), dict() etc. Also, you will probably throw errors because you are calling the row variable updateRow but also calling the method updateRow. This is generally a bad idea, you should keep names unique so the code is clear.

But your general idea is almost right on. My thought would be that you only need the update cursor, you can do without the first cursor if you use a dictionary instead of a list to store the values. Try this (granted that I can't test it for myself):

import arcpy  from arcpy import env env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"  table = "WAYNE"  uniqueValues = {} newID = 1  with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:     for row in updateRows:         nameValue = row[0]         if nameValue in uniqueVals:             row[1] = uniqueValues[nameValue]         else:             newID += 1             uniqueValues[nameValue] = newID             row[1] = newID         updateRows.updateRow(row)               del row, updateRows

And you are right about the cursors - in general they tend to be much faster.

View solution in original post

DouglasSands · ‎10-30-2013

import arcpy  from arcpy import env env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"  table = "WAYNE"  list = []  with arcpy.da.SearchCursor(table, ["FULL_ADDRESS_NAME"]) as cursor:     for row in cursor:         list.append(row[0])  del row, cursor  with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:     for updateRow in updateRows:         nameValue = updateRow[0]         if nameValue in list:             updateRow[1] = lutDict[nameValue]             updateRows.updateRow(updateRow) del updateRow, updateRows

This is a good start. A few things though. For one, I would avoid naming any variables you define things like list, dict etc that might get confused with list(), dict() etc. Also, you will probably throw errors because you are calling the row variable updateRow but also calling the method updateRow. This is generally a bad idea, you should keep names unique so the code is clear.

But your general idea is almost right on. My thought would be that you only need the update cursor, you can do without the first cursor if you use a dictionary instead of a list to store the values. Try this (granted that I can't test it for myself):

import arcpy  from arcpy import env env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"  table = "WAYNE"  uniqueValues = {} newID = 1  with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:     for row in updateRows:         nameValue = row[0]         if nameValue in uniqueVals:             row[1] = uniqueValues[nameValue]         else:             newID += 1             uniqueValues[nameValue] = newID             row[1] = newID         updateRows.updateRow(row)               del row, updateRows

And you are right about the cursors - in general they tend to be much faster.

ClintonCooper1 · ‎10-30-2013

This worked great! Thank you!

Now if I wanted to find all the duplicates and sum them up, and print that value for each record (basically a like the Frequency tool) what would I need to change to get that to work? Thanks again!

Clinton

DouglasSands · ‎10-30-2013

The Summary Statistics tool works exactly like frequency if you specify the inputs correctly, and it only requires ArcGIS Basic (ArcEditor). The python syntax is available at the link.

ClintonCooper1 · ‎10-30-2013

The Summary Statistics tool works exactly like frequency if you specify the inputs correctly, and it only requires ArcGIS Basic (ArcEditor). The python syntax is available at the link.

well I am looking for a python data cursor solution for better performance.

JasonScheirer · ‎10-30-2013

Summary statistics is written in C++. It will be significantly faster than any Python solution.

ClintonCooper1 · ‎10-30-2013

Summary statistics is written in C++. It will be significantly faster than any Python solution.

even with having to add/join the data back into the original table? What I have been doing, is running the Frequency Tool (which creates a new table) and then using a search and update cursor to get the data back into the original file. I guess I am wondering if it would be faster to run a cursor in the original file that calculates the "frequency" and updates the data directly into the original table?

DouglasSands · ‎10-30-2013

If you really want to do it all in python, as a first step, add a field called 'FREQ' to the table. Then run this modified version of the above code:

import arcpy

from arcpy import env
env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"

table = "WAYNE"

uniqueValues = {}
values = []
newID = 1

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        values.append(nameValue)
        if nameValue in uniqueVals:
            row[1] = uniqueValues[[nameValue]]
        else:
            newID += 1
            uniqueValues[nameValue] = [newID]
            row[1] = newID
        updateRows.updateRow(row)
            
del row, updateRows

uniqueCount = {}
for val in uniqueValues:
    uniqueCount[val] = values.count(val)

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME", "FREQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        row[1] = uniqueCount[nameValue]
        updateRows.updateRow(row)
        
del row, updateRows

The arcpy.da.UpdateCursor is definitely faster than the field calculator tool, at least in my testing (if you were going to summarize -> join -> calculatefield -> remove join), can't say that they will be than the summary stats alone.

ClintonCooper1 · ‎10-30-2013

If you really want to do it all in python, as a first step, add a field called 'FREQ' to the table. Then run this modified version of the above code:

import arcpy

from arcpy import env
env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"

table = "WAYNE"

uniqueValues = {}
values = []
newID = 1

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        values.append(nameValue)
        if nameValue in uniqueVals:
            row[1] = uniqueValues[[nameValue]]
        else:
            newID += 1
            uniqueValues[nameValue] = [newID]
            row[1] = newID
        updateRows.updateRow(row)
            
del row, updateRows

uniqueCount = {}
for val in uniqueValues:
    uniqueCount[val] = values.count(val)

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME", "FREQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        row[1] = uniqueCount[nameValue]
        updateRows.updateRow(row)
        
del row, updateRows

The arcpy.da.UpdateCursor is definitely faster than the field calculator tool, at least in my testing (if you were going to summarize -> join -> calculatefield -> remove join), can't say that they will be than the summary stats alone.

Thanks Doug for the help! On my main file it takes about 10 minutes to run the summarize tool, and another 24 to update it, taking a total time of 34 minutes. I will run this script and report back if the script is faster. Again, thanks for the help. What good resources would you suggest to help with my learning curve in python for acrgis? I do have the python scripting for ArcGIS book, but it only goes over the very basics, and I am now starting to get into deeper code writing. Thanks again!!

ClintonCooper1 · ‎10-30-2013

import arcpy

from arcpy import env
env.workspace = r"C:\Users\cc1\Desktop\NEW.gdb\WAYNE"

table = "WAYNE"

uniqueValues = {}
values = []
newID = 1

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME","FEAT_SEQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        values.append(nameValue)
        if nameValue in uniqueVals:
            row[1] = uniqueValues[[nameValue]]
        else:
            newID += 1
            uniqueValues[nameValue] = [newID]
            row[1] = newID
        updateRows.updateRow(row)
            
del row, updateRows

uniqueCount = {}
for val in uniqueValues:
    uniqueCount[val] = values.count(val)

with arcpy.da.UpdateCursor(table, ["FULL_ADDRESS_NAME", "FREQ"]) as updateRows:
    for row in updateRows:
        nameValue = row[0]
        row[1] = uniqueCount[nameValue]
        updateRows.updateRow(row)
        
del row, updateRows

Ran the code, and it is throwing a zero into the FREQ field for all values?