Remove duplicate text within an attribute of a feature class

TinaKwitkoski

Hi

I have an feature class with a field of concatenated species_codes. I would like to remove the duplicate values.

I would like to have just one label for each species. ie: AO, BB, BMC, CC, CH....

is there a way to do this?

Any help appreciated

Tina

MErikReedAugusta

Piece of cake; I just did this operation on some data this morning.

Here's some code for Field Calculator:

A. Function Call

RemoveDupes(!AC!)

B. Code Block

def RemoveDupes(old):
    # Watch out for NULL values, because they'll crash the rest of the code
    if old is None:
        return None

    # If you're still here, assume it's a comma-delimited string, and break
    #   it up into parts.
    oldEntries = old.split(',')
    newEntries = []

    # Loop through the entries
    for entry in oldEntries:
        # Only save the entry if you haven't seen it already
        if entry not in newEntries:
            newEntries.append(entry)

    # Stitch the entries back to a string & write it to the field
    return ','.join(newEntries)

Results for the first row:

ANODON,BB,BMC,C,CAS,CC,CH,DC,DV,KO,L,LDC,LNC,LSU,MARFAL,MW,NSC,PK,RB,RSC,SP,SU,WF

This could also be done via an UpdateCursor if you're doing other things, but I gave you Field Calculator first, for the simpler option.

Also, for those familiar with python: I could have used a set for newEntries, and it would've taken care of skipping the duplicates automatically. But sets in python are unordered, which means the order would scramble when you ran this. The longer approach of the code above arbitrarily maintains whatever order the entries occur in the field.

------------------------------
M Reed
"The pessimist may be right oftener than the optimist, but the optimist has more fun, and neither can stop the march of events anyhow." — Lazarus Long, in Time Enough for Love, by Robert A. Heinlein

View solution in original post

DanPatterson

related, this code block may work

Splitting numbers and removing duplicates in a text string

Or for a numpy def

import numpy as np
def dedup(vals):
    """remove duplicate entries"""
    u = np.unique([i.strip() for i in vals.split(",")])
    return ", ".join([i for i in u])

... sort of retired...

TinaKwitkoski

Thank you 🙂

MErikReedAugusta

Very efficient; love it!

------------------------------
M Reed
"The pessimist may be right oftener than the optimist, but the optimist has more fun, and neither can stop the march of events anyhow." — Lazarus Long, in Time Enough for Love, by Robert A. Heinlein

DanPatterson

import numpy as np
def dedup(vals):
    """remove duplicate entries"""
    u, cnts = np.unique([i.strip() for i in vals.split(",")], return_counts=True)
    v = np.vstack((u, cnts)).T
    return ", ".join(["{}: {}".format(i[0], i[1]) for i in v])
    
b ='AO, BB, BMC, CC, CH, AO, BB, BMC, CC, CH'
dedup(b)
'AO: 2, BB: 2, BMC: 2, CC: 2, CH: 2'

This can be simplified, but verbose demonstrates the principle

... sort of retired...

MErikReedAugusta

Piece of cake; I just did this operation on some data this morning.

Here's some code for Field Calculator:

A. Function Call

RemoveDupes(!AC!)

B. Code Block

def RemoveDupes(old):
    # Watch out for NULL values, because they'll crash the rest of the code
    if old is None:
        return None

    # If you're still here, assume it's a comma-delimited string, and break
    #   it up into parts.
    oldEntries = old.split(',')
    newEntries = []

    # Loop through the entries
    for entry in oldEntries:
        # Only save the entry if you haven't seen it already
        if entry not in newEntries:
            newEntries.append(entry)

    # Stitch the entries back to a string & write it to the field
    return ','.join(newEntries)

Results for the first row:

ANODON,BB,BMC,C,CAS,CC,CH,DC,DV,KO,L,LDC,LNC,LSU,MARFAL,MW,NSC,PK,RB,RSC,SP,SU,WF

This could also be done via an UpdateCursor if you're doing other things, but I gave you Field Calculator first, for the simpler option.

Also, for those familiar with python: I could have used a set for newEntries, and it would've taken care of skipping the duplicates automatically. But sets in python are unordered, which means the order would scramble when you ran this. The longer approach of the code above arbitrarily maintains whatever order the entries occur in the field.

------------------------------
M Reed
"The pessimist may be right oftener than the optimist, but the optimist has more fun, and neither can stop the march of events anyhow." — Lazarus Long, in Time Enough for Love, by Robert A. Heinlein

MErikReedAugusta

Bonus solution for anyone with a similar case that wants to keep/display the duplicate entries, but wants a slightly easier-to-read value in the attribute table:

def CountDupes(old):
    if old is None:
        return None

    oldEntries = old.split(',')
    newEntries = {}

    for entry in oldEntries:
        if entry not in newEntries:
            newEntries[entry] = 0
        newEntries[entry] += 1

    return ','.join([f'{entry}:{count}' for entry, count in newEntries.items()]

Run on the sample data from the original post, this would be the result in the first row:

ANODON:1,BB:3,BMC:2,C:1,CAS:1,CC:3,CCG:1,CH:2,DC:2,DV:1,KO:2,L:1,LDC:1,LNC:1,LSU:1,MARFAL:1,MW:2,NSC:2,PK:1,RB:1,RSC:3,SP:1,SU:2,WF:1

Even cleaner, would be displaying it without the singles:

def CountDupes(old):
    if old is None:
        return None

    oldEntries = old.split(',')
    newEntries = {}
    strEntries = []

    for entry in oldEntries:
        if entry not in newEntries:
            newEntries[entry] = 0
        newEntries[entry] += 1

    for entry, count in newEntries.items():
        if count == 1:
            strEntries.append(entry)
        else:
            strEntries.append(f'{entry}:{count}')

And results:

ANODON,BB:3,BMC:2,C,CAS,CC:3,CCG,CH:2,DC:2,DV,KO:2,L,LDC,LNC,LSU,MARFAL,MW:2,NSC:2,PK,RB,RSC:3,SP,SU:2,WF

------------------------------
M Reed
"The pessimist may be right oftener than the optimist, but the optimist has more fun, and neither can stop the march of events anyhow." — Lazarus Long, in Time Enough for Love, by Robert A. Heinlein

TinaKwitkoski

thank you 🙂

JoshuaBixby

Since counting the number of species codes wasn't stated as a requirement, this can be done much simpler without having to use a code block. For the expression:

",".join(sorted(set(!field!.split(',')))) if !field! is not None else None

MErikReedAugusta

Worth noting: That one-liner presumes that alphanumeric sorting is the desired order.

The example data is all alphabetical, but there's insufficient information to predict whether that's presumable or just coincidental.

By avoiding casting to a set, whatever order each value appears in the field is maintained, sans duplicate entries.

(Also, the version that counts instances was just a fun 2-second self-challenge, more for the benefit of people landing on this page from a search in the future.)

------------------------------
M Reed
"The pessimist may be right oftener than the optimist, but the optimist has more fun, and neither can stop the march of events anyhow." — Lazarus Long, in Time Enough for Love, by Robert A. Heinlein