Code to select one record from a number of duplicates

ShannonBlackadder · ‎12-05-2013

I???m trying to write code that will identify site records to use in a habitat species model.

There are 4000+ site records of presence and absence of a specific species. Some of these site records are duplicates (same location sampled multiple times over a number of years), some sites were only sampled once. At the sites that were sampled multiple times, in some cases an organism was present, and in other cases it was absent. For habitat model we only want to use each site location once, and if the organism ever visited the site we want to use one of the 'present' records.

I can write code that identifies duplicate sites and code that identifies whether the site was occupied, but I somehow need to combine these based on the following:
-If no duplicate sites: return 1 (whether occupied or not)
-if duplicate sites: check all duplicate site records
[INDENT]-if all duplicate site records have occurrences ??? assign one of the records =1, all others 0
-If all duplicate site records were absences ??? assign one of the records= 1, all others 0
-if there is a combination of occurrences and non-occurrences ??? assign one of the occurrence records = 1, all others 0[/INDENT]

My data is in the following format where the Modelled column is the column I am trying to populate:
[ATTACH=CONFIG]29622[/ATTACH]

Does anyone have any thoughts as to how to approach this problem? I've been trying to write nested loops but I can't seem to figure out how to get the logic to work.

RichardFairhurst · ‎12-05-2013

I�??m trying to write code that will identify site records to use in a habitat species model.

There are 4000+ site records of presence and absence of a specific species. Some of these site records are duplicates (same location sampled multiple times over a number of years), some sites were only sampled once. At the sites that were sampled multiple times, in some cases an organism was present, and in other cases it was absent. For habitat model we only want to use each site location once, and if the organism ever visited the site we want to use one of the 'present' records.

I can write code that identifies duplicate sites and code that identifies whether the site was occupied, but I somehow need to combine these based on the following:
-If no duplicate sites: return 1 (whether occupied or not)
-if duplicate sites: check all duplicate site records
[INDENT]-if all duplicate site records have occurrences �?? assign one of the records =1, all others 0
-If all duplicate site records were absences �?? assign one of the records= 1, all others 0
-if there is a combination of occurrences and non-occurrences �?? assign one of the occurrence records = 1, all others 0[/INDENT]

My data is in the following format where the Modelled column is the column I am trying to populate:
[ATTACH=CONFIG]29622[/ATTACH]

Does anyone have any thoughts as to how to approach this problem? I've been trying to write nested loops but I can't seem to figure out how to get the logic to work.

I would create this in model builder using the Summary Statistics tool and Add Join/Field Calculatior\Remove Joins tools and then export that model to a python script. The model would be as follows (I am listing every tool step needed individually so it looks like a lot of steps, but it is not difficult to set up):

1. Use the Make Table View on your original table. (Optionally use the Add Attribute Index tool on the Site field to speed up joins)

2. Summarize the Table View of the original table with the unique Case fields being the Site and the Occupied Value with the summary value and obtain the Min summary of the Field_Session field (this field needs an underscore between the words to work correctly). I'll call this output Summary1.

3. Use the Make Table View tool on the Summary1 result. (Optionally use the Add Attribute Index tool on the Site field to speed up joins)

4. Summarize the Summary1 result with the Site field being the only case field and get the Max Summary of the Occupied field. I will call this output Summary2. (Optionally use the Add Attribute Index tool on the Site field to speed up joins)

5. Join the Summary2 table to the Summary1 table on the common Site fields and select where Summary1.Occupied = Summary2.Max_Occupied.

6. Remove the Join.

7. Use the Table to Table tool on the Table View selection of Summary1 to create a table of just the selected records from Summary1. I will call this result TableToTable1. (Optionally use the Add Attribute Index tool on the Site field to speed up joins)

8. Join the TableToTable result to the Table View of your original table on the common Site fields.

9. Use the Select Layer By Attributes tool on the joined table view of your original table and select with the NEW_SELECTION option using the SQL where Original.Field_Session = TableToTable1.Min_Field_Session.

10. Calculate the Modelled field of the selected records to be 1.

11. Remove the Join.

12. Use the Select Layer By Attributes tool with the SWITCH_SELECTION option and no selection SQL to reverse the selection of the Table View of your original table and calculate the Modelled field to be 0.

All records are now calculated the way you want. The model can be rerun with each new sample period to update the model so that it will use first time occurrences in the current sample period.

If you go the cursor route would run a select cursor through once with logic to populate a dictionary of Site values with the optimum Occupied value and matching Field_Session value that met your criteria. Then run an update cursor to update your records with a 1 or a 0 depending on whether the Field_Session value was retained in the dictionary of not for the site.

T__WayneWhitley · ‎12-05-2013

Maybe I misunderstood your logic or I'm oversimplifying, but this gave me the results you displayed - I think it meets your criteria:

import arcpy
tbl = r'C:\Documents and Settings\whitley-wayne\Desktop\fieldSessions.dbf'
rows = arcpy.UpdateCursor(tbl, '', '', '', 'Site A; Occupied D; FldSession A')
 
initSite = 0
for row in rows:
    if row.Site != initSite:
        row.Modelled = 1
        initSite = row.Site
        rows.updateRow(row)
 
del rows

I duplicated your little table to a dbf (called fieldSessions.dbf), made short integer fields to match yours, with the exception of the field session field (which I shortened to FldSession)...

Hope that's on track,
Wayne

AdamCox1 · ‎12-11-2013

Hi there, this is a little old but perhaps you're still working on it. My understanding is that there are potentially multiple rows per site, with multiple occurrences of 1 in the modelled field, and you want every set of site rows to have exactly one row where modelled = 1. Here are my thoughts (if this is a feature class, just substitute MakeFeatureLayer for MakeTableView):


table = r"path\to\table"

#create unique list of all sites
all_sites = []
rows = arcpy.SearchCursor(table)
for row in rows:
    site = row.getValue("Site")
    if not site in all_sites:
        all_sites.append(site)
del rows, row

#use list to make a series of cursors and then update rows
for s in all_sites:
    #set first row to 1, and all others to 0
    first = True
    rows = arcpy.UpdateCursor(table,'"Site" = ' + str(s))
    for row in rows:
        if first == True:
            row.setValue("Modelled", 1)
            first = False
        else:
            row.setValue("Modelled", 0)
        rows.updateRow(row)
    del rows, row

You may need to alter the query a little, maybe change the field delimiters, or how you deal with the site number (I'm not sure if you have it stored as an integer or string, here I'm thinking integer).

Hopefully this will help out.