Find Duplicates and increment

04-10-2011 08:25 PM
DavidBrett
New Contributor III
Hello,

I'm attempting to find the duplicates in a field and then number the results sequentially within each duplicate set.  I've managed to increment the duplicates, but not to restart the count when the next duplicate set is hit.

At 9.3, I was able to use the field_Mark_Duplicates_2.cal from Easy Calculate 5.0 (http://www.ian-ko.com/free/free_arcgis.htm).  Now at 10, I need to do it in Python, and I'm having little success.

example of desired result:
a==>1
b==>1
c==>1
b==>2
b==>3
a==>2


Here is a snippet of code that increments all the duplicates in the table:


    rows = arcpy.UpdateCursor("Junk1.dbf")
    fields = arcpy.ListFields("Junk1.dbf" )
    
    #Create an empty list
    myList = []
       
    i = 0
    for row in rows:
        for field in fields:            
            if field.name == 'ID':
                value = row.getValue(field.name)                
                if value in myList:       
                    i += 1
                    row.TEST = i
                    rows.updateRow(row)
                if value not in myList:
                    myList.append(value)


Any help would be greatly appreciated!!
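For reference, the restart-per-value behavior described above can be sketched with a plain dictionary (shown here without arcpy, on a hypothetical list of values): each value keeps its own running count, so the sequence resets naturally for every new ID instead of sharing the single `i` counter.

```python
# Per-value counter: each distinct value gets its own running sequence number.
values = ["a", "b", "c", "b", "b", "a"]

counts = {}     # value -> number of times seen so far
sequence = []
for v in values:
    counts[v] = counts.get(v, 0) + 1   # first occurrence -> 1, next -> 2, ...
    sequence.append((v, counts[v]))

print(sequence)
# [('a', 1), ('b', 1), ('c', 1), ('b', 2), ('b', 3), ('a', 2)]
```

Inside the UpdateCursor loop the same idea would be `row.TEST = counts[value]` followed by `rows.updateRow(row)`, replacing the single shared `i`.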
10 Replies
RichardFairhurst
MVP Honored Contributor
This has been extremely helpful.  Quick question, as I am new to arcpy: if I want to mark ALL repeated IDs with a flag, not just the second, third, and fourth occurrences, how might I accomplish this?  For instance, if I had the following list of values:

a
b
c
d
e
e

With the above scripts, I would get

a-0
b-0
c-0
d-0
e-0
e-1

What I want is

a-0
b-0
c-0
d-0
e-1
e-1 or e-2 (it doesn't matter, as long as I can see that this value is not equal to zero).

My goal is to flag the values that are duplicates, identify them, and then assign new values to those records.  If I split a polygon, for instance, I don't want both parts to keep the old ID; I want to seek out the duplicated value and replace it with two new unique IDs.

Thanks in advance.


It makes no sense to use alphabetic unique IDs for an incremented value.  I won't write code to support it, since alphabetic IDs tend to carry meaning, and meaningful values violate the concept of a unique ID.  If you are using an alpha ID, switch to a numeric unique ID, which is the only useful type for an autoincremented value.

Personally, I just use the Summary Statistics tool to get a count of the field I want to check for duplicates.  I do a join, select for count > 1, break the join, and then run whatever update process I want.  Summary Statistics can also give me the highest value of that field, which identifies the next available value.  It does not take much to do this manually.
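The logic behind that Summary Statistics workflow can be illustrated without ArcGIS (hypothetical ID values; `collections.Counter` stands in for the COUNT statistic and `max()` for the MAX statistic):

```python
from collections import Counter

ids = [3, 7, 7, 9, 12, 12, 12]   # hypothetical ID field values

counts = Counter(ids)                                        # per-ID count, like COUNT
duplicates = sorted(k for k, c in counts.items() if c > 1)   # the "count > 1" selection
highest = max(ids)                                           # like MAX: next new ID is highest + 1

print(duplicates, highest)
# [7, 12] 12
```

In ArcGIS the same numbers come from Summary Statistics with the ID field as the case field, followed by a join and a selection on the count.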

With a numeric ID, after getting the duplicate selection I would use the autoincrement calculation to begin with the next number and increment it.  Here is the autoincrement calculation:

Parser: Python

Show Codeblock: Checked

Pre-Logic Script Code:
rec=0
def autoIncrement():
  global rec
  pStart = 1 #adjust start value to be the next number in the ID series
  pInterval = 1 #adjust interval value, if req'd
  if (rec == 0): 
    rec = pStart 
  else: 
    rec = rec + pInterval 
  return rec


Expression:  autoIncrement()
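Run outside the Field Calculator, the same Pre-Logic function produces the sequence the calculation writes row by row (with pStart and pInterval both 1):

```python
rec = 0
def autoIncrement():
    global rec
    pStart = 1     # first value in the series
    pInterval = 1  # step between values
    if rec == 0:
        rec = pStart
    else:
        rec = rec + pInterval
    return rec

print([autoIncrement() for _ in range(3)])
# [1, 2, 3]
```

Setting pStart to one more than the current highest ID makes the calculation continue an existing series instead of starting at 1.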

However, with a numeric unique ID this whole process can be done as a standalone Python script.  The script does not sort the data, since the IDs are meaningless and I want the full speed of the da cursor, which will whip the pants off any code based on the old cursor model.  It processes all records twice, but only updates records where duplicates occur.  It also always overwrites Null values with a sequence number higher than any previously assigned number, whether or not more than one record is Null.  (Deleted records at the end of the sequential ID series could cause previously assigned and removed IDs to be reused.  To correct for that, set the "highest" variable to a value that ensures no previously assigned numbers are ever reused by the script.)

# Import the arcpy module
import arcpy

pStart = 1 # adjust start value, if required 

pInterval = 1 # adjust interval value, if required

# Initialize the sequence number dictionary and the ID variables
seqDict = {}

ID = -1
highest = 0
changed = False

# Assign data and field list variables.
# Customize these variable inputs for your specific data
myData = r"C:\MyPath\MyData.shp"

fields = ["ID"]

# Step 1 - Use an search cursor to get counts of the unique ID numbers
rows = arcpy.da.SearchCursor(myData, fields)
for row in rows:
    if row[0] is None:
        ID = "Null"
    else:
        ID = row[0]
    if ID in seqDict:
        seqDict[ID] = seqDict[ID] + pInterval
    elif ID == "Null":
        seqDict[ID] = 2
    else:
        seqDict[ID] = pStart
        if highest < ID:
            highest = ID
del rows
    
# Step 2 - Use an update cursor to update all records with a count greater than 1
rows = arcpy.da.UpdateCursor(myData, fields)
for row in rows:
    if row[0] is None:
        ID = "Null"
    else:
        ID = row[0]
    if seqDict[ID] > 1:
        highest = highest + pInterval
        row[0] = highest
        changed = True
    if changed:
        rows.updateRow(row)
        changed = False
del rows
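The two-pass logic above can be traced in plain Python on a hypothetical ID list (None standing in for a database Null): pass 1 counts each ID and records the highest numeric value, and pass 2 reassigns every duplicated or Null ID to the next number above that highest value.

```python
pStart, pInterval = 1, 1
ids = [5, 2, 5, None, 9]   # hypothetical ID field values; None = Null

# Pass 1: count occurrences of each ID and track the highest numeric ID
seqDict, highest = {}, 0
for v in ids:
    ID = "Null" if v is None else v
    if ID in seqDict:
        seqDict[ID] += pInterval
    elif ID == "Null":
        seqDict[ID] = 2            # force Nulls to be renumbered even when unique
    else:
        seqDict[ID] = pStart
        if highest < ID:
            highest = ID

# Pass 2: renumber every ID whose count exceeds 1
result = []
for v in ids:
    ID = "Null" if v is None else v
    if seqDict[ID] > 1:
        highest += pInterval
        result.append(highest)
    else:
        result.append(v)

print(result)
# [10, 2, 11, 12, 9]
```

The unique IDs 2 and 9 are untouched; both copies of the duplicated 5 and the Null each receive a fresh number above the old maximum of 9.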


If you really have to have an alphabetic unique ID, what increment style do you propose to use?  Whatever the style, it involves much more code than it is worth, which is why I won't write it for you.  In that case I would just use the Summary Statistics approach I originally suggested: after selecting the records whose summary count is greater than 1 or whose ID is Null, I would update the alphabetic field manually to fit my style needs rather than using the autoincrement calculation.