Find duplicates, update only one record

2Quiker · ‎08-13-2020

I would like to only populate one of the duplicate record but with number of duplicates like below.

How can do this or what is the best way to achieve this with python?

well_id      Dup

D36528   3

D36528

D36528

D36532   2

D36532

D36521

D36522

D36525   2

D36525‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

forestknutsen1 · ‎08-17-2020

You could use a search cursor to read in the ids and then an update cursor the write the count out the feature class.

import arcpy

fc_path = r'C:\my_temp\junk3\geo\fgdb.gdb\wells'

ids = []

with arcpy.da.SearchCursor(fc_path, 'ID') as cursor:
    for row in cursor:
        ids.append(row[0])

# make empty dict
ids_dict = {}

# get a list of unique ids, e.g. remove duplicates, set does this 
# then just convert it back to a list
uids = list(set(ids))

# for each uid make a dict item with the key the uid and an empty list as the value 
for uid in uids:
    ids_dict[uid] = []

# this is what sets you up for counting. it finds the matching well id keys 
# and adds the key to the list value so you get in the dict {'w1': [w1, w1, w1]}
# then all you need to do count the length of the value list.
for id in ids:
    ids_dict[id] = ids_dict[id] + [id]

with arcpy.da.UpdateCursor(fc_path, ['ID', 'DUP']) as cursor:
    for row in cursor:
        for k, v in ids_dict.items():
            if row[0] == k and v is not None:
                row[1] = len(v)
                ids_dict[k] = None
                cursor.updateRow(row)
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

View solution in original post

forestknutsen1 · ‎08-13-2020

You could use dictionaries to count the number of repeating ids. Maybe something like this...

ids = ['D36528',
       'D36528',
       'D36528',
       'D36532',
       'D36532',
       'D36521',
       'D36522',
       'D36525',
       'D36525', ]

ids_dict = {}

uids = list(set(ids))

for uid in uids:
    ids_dict[uid] = []

for id in ids:
    ids_dict[id] = ids_dict[id] + [id]

for k, v in ids_dict.items():
    print '{k}, {n}'.format(k=k, n=len(v))

Output:

D36532, 2
D36528, 3
D36521, 1
D36522, 1
D36525, 2

2Quiker · ‎08-14-2020

How do reading the valuse from a dictionary use dict for with arcpy.da.UpdateCursor?

forestknutsen1 · ‎08-14-2020

I am not sure I understand your question. Maybe it would be helpful if we had a little more info...

what format is your original data is in? e.g. csv, feature class in a fgdb, database... etc.
what is your target format? e.g. feature class in a fgdb
can you give us an overview of what you are trying to do?

I would use an update cursor to write the info from the dictionary back to a feature class in a fgdb for sure.

2Quiker · ‎08-17-2020

Data is in a feature class in a fgdb.

I want to be able to see how many times an attribute is duplicated in a the field well_Id but I don't need all of them to have a number in the Dup filed but if that is the only way then that is ok. How do I take k & n and write to the feature class thought update cursor?

forestknutsen1 · ‎08-17-2020

You could use a search cursor to read in the ids and then an update cursor the write the count out the feature class.

import arcpy

fc_path = r'C:\my_temp\junk3\geo\fgdb.gdb\wells'

ids = []

with arcpy.da.SearchCursor(fc_path, 'ID') as cursor:
    for row in cursor:
        ids.append(row[0])

# make empty dict
ids_dict = {}

# get a list of unique ids, e.g. remove duplicates, set does this 
# then just convert it back to a list
uids = list(set(ids))

# for each uid make a dict item with the key the uid and an empty list as the value 
for uid in uids:
    ids_dict[uid] = []

# this is what sets you up for counting. it finds the matching well id keys 
# and adds the key to the list value so you get in the dict {'w1': [w1, w1, w1]}
# then all you need to do count the length of the value list.
for id in ids:
    ids_dict[id] = ids_dict[id] + [id]

with arcpy.da.UpdateCursor(fc_path, ['ID', 'DUP']) as cursor:
    for row in cursor:
        for k, v in ids_dict.items():
            if row[0] == k and v is not None:
                row[1] = len(v)
                ids_dict[k] = None
                cursor.updateRow(row)
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

2Quiker · ‎08-17-2020

Thank for sharing.

Can you explain what lines 11-19 are doing please?

forestknutsen1 · ‎08-17-2020

I added comments to the code above. I would get it into a python IDE and step through it with a debugger. Doing this when I am taking on something new speeds up my learning... it takes away the "back box" effect.

DanPatterson · ‎08-13-2020

Or for something completely different... out to *.dbf or *.csv

import numpy as np
from arcpy.da import NumPyArrayToTable
#
ids = ['D36528', 'D36528', 'D36528','D36532', 'D36532',
       'D36521', 'D36522','D36525', 'D36525', ]
# ---- numpy magic
u, c = np.unique(ids, return_counts=True)
dt = np.dtype([("Class", "<U7"), ("Count", "i4")])
uc = np.asarray(list(zip(u, c)), dtype=dt)
# ---- output to dbf and csv
NumPyArrayToTable(uc, "c:/temp/out.dbf")
np.savetxt("c:/temp/out.csv", uniq_counts, fmt="%7s, %3s", 
           comments="", header="Class, Counts")‍‍‍‍‍‍‍‍‍‍‍‍‍

The csv as an example

Class, Counts
 D36521,   1
 D36522,   1
 D36525,   2
 D36528,   3
 D36532,   2‍‍‍‍‍‍

You can add the *.csv or *.dbf to arcmap or Pro, or join to an existing table

... sort of retired...

JoshuaBixby · ‎08-17-2020

In terms of semantics, you are calling the field you want to populate "Dup", but you are putting the total count of records in that field unless there is only 1 record. As Dan and Forest have already implied, it is probably better to go with a field name like "Count" and then put the total count of records for all records. If you want to know which records have a duplicate, triplicate, etc...; you simply query on Count being greater than 1.

Along the lines of Forest's answer, but using slightly more compact code without having to create two cursors:

from arcpy.da import UpdateCursor
from collections import Counter

fc = # path to feature class

with UpdateCursor(fc, ["well_id", "Dup"]) as cur:
    cnt = Counter(row[0] for row in cur)
    cur.reset()
    
    for row in cur:
        cur.updateRow([row[0], cnt.pop(row[0], None)])
    

 ‍‍‍‍‍‍‍‍‍‍‍‍‍‍