How do I select the LAST of any number of duplicate records?

Dean_Heather · ‎05-23-2018

I have a table with parcel numbers in one field and soil types in another. There are several soil types for each parcel, and therefore several records for each parcel.

I am looking for a script (Python, SQL, VB, Arcade) that will iterate through each record grouped by APN, selecting (or outputting) the LAST record for each APN.

Example input table:

APN	SOIL TYPE
1	A
2	A
2	A, B
3	A
3	A, E
3	A, E, G

Desired result:

APN	SOIL TYPE
1	A
2	A, B
3	A, E, G

It's a simple task that seems like it should be easy, but I can't seem to find a solution. I've been working in Excel, ModelBuilder inside of ArcGIS Pro, and Python in Jupyter Notebook, but I'm not very well versed in any one program or language; maybe there's a better route of attack.

Thank you for any suggestions you might be able to provide!

DanPatterson_Retired · ‎05-23-2018

Haven't checked all scenarios, but try this to get the index value for the elements of both

a = [1, 2, 2, 3, 3, 3]

b = []
def dif(a):
    """differencing"""
    for i in range(1, len(a)):
        if a[i-1] != a[i]:
            b.append(i-1)
    b.append(i)
    return b


dif(a)

[0, 2, 5]  # zero based indexing‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

XanderBakker · ‎05-24-2018

If you create a dictionary and replace the value for the same key, you will end up with the last value for each key.

JoshuaBixby · ‎05-24-2018

If they are duplicates, how do you differentiate the first from last record, i.e., why not grab the first record? It looks like you are trying to select the record with the most number of soil types per APN, not the last record of duplicates. Depending on your answer, there may be a different approach to this.

JoeBorgione · ‎05-24-2018

I've been wondering the same thing: first, last.... Based on what?

That should just about do it....

DanPatterson_Retired · ‎05-24-2018

If python 3.6+ is being used, then dictionary order is retained...

If python 2.7 is being used, the dictionary approach will fail.

In that simple example, there is only one set of sequences and the ID number is given by the APN field, which is assumed to be sequential. If it were not an id field and the pattern represented classes, there could be duplicates in the sequence and the dictionary approach would fail again.

A blog is coming on sequences, patterns and replicants which will show how sequences and patterns of sequences and their representation and slicing (eg, start of sequence, sequence length and end of sequence) can all be obtained relatively simply and produce a summary table in ArcGIS PRO.

I will post back here hopefully within a day

XanderBakker · ‎05-24-2018

IMHO a dictionary will work just fine with sequential access to the table rows. See snippet below:

table = [[1, "A"], [2, "A"], [2, "A, B"], [3, "A"], [3, "A, E"], [3, "A, E, G"]]
dct = {}

for row in table:
    apn = row[0]
    soil_type = row[1]
    dct[apn] = soil_type

for apn, soil_type in sorted(dct.items()):
    print("{} - {}".format(apn, soil_type))
‍‍‍‍‍‍‍‍‍‍

This will yield:

1 - A
2 - A, B
3 - A, E, G‍‍‍

DanPatterson_Retired · ‎05-24-2018

If they are completely sequential... not just consisting of sequences... that is not fully clear from the question posed, it might be one of those expanding questions.

upcoming details soon

JoeBorgione · ‎05-24-2018

Xander- I've been messing with this on and off all day. I created a file gdb table just like the one in the o.p. and with a da.SearchCursor created a list. That list was in the form of tuples so somewhat different than yours:

myList = [(1, "A"), (2, "A"), (2, "A, B"), (3, "A"), (3, "A, E"), (3, "A, E, G")]‍‍

I then created a dictionary from that list and since the keys are not unique, the last one wins and that produces the desired result as shown in the o.p. But that's where things fall apart for me. If I pop() the last element and then insert() it back into the list at the first location, the resulting dictionary is quite different. 3:'A,E' wins.....

Would line 9 in your example take care of that?

Xander Bakker‌

That should just about do it....

XanderBakker · ‎05-24-2018

That's exactly what "should" happen, since the last element for APN 3 will be "A, E".

Or that exactly what I understand as objective from the question that the OP asked. It's best if the OP clarifies if this is the objective or that the LAST element should be interpreted differently (like last when sorted alphabetically or the one with the most "complex" soil type). Point is that with the solution I suggested the order of the data will influence the result and this is often not what one would like to happen.