Checking if field is unique value field

JohnDye · ‎10-16-2017

I have a parameter in a python toolbox which allows a user to select a field from a dataset given in a previous parameter.

# Feature Class to absorb geometry from
param2 = arcpy.Parameter(
    displayName="Geometry Feature Class",
    name="in_geoFC",
    datatype="GPFeatureLayer",
    parameterType="Required",
    direction="Input")

# Table ID field
param3 = arcpy.Parameter(
    displayName="Table Geometry ID Field",
    name="table_geoIDField",
    datatype="GPString",
    parameterType="Required",
    direction="Input",
    enabled=False)
param3.filter.type = "ValueList"
param3.filter.list = []

The updateParameters section contains logic to update Parameter 3 with the field names for all of the fields in the dataset provided in parameter 2 which have a datatype of string:

def updateParameters(self, parameters):
    """Modify the values and properties of parameters before internal
       validation is performed.  This method is called whenever a parameter
       has been changed."""
    
    # if 'in_geoFC' is populated with a value
    if parameters[2].value:
        # if 'in_geoFC' does not have an error set
        if not parameters[2].hasError():
            #  Create a list of all of the fields in the 'in_geoFC'
            # which have a datatype of 'String'
            fc_geoIDFields = [field.name for field in arcpy.Describe(
                              parameters[2].valueAsText).fields
                              if field.type == 'String']
            # Enable the parameter
            parameters[3].enabled = True
            # Populate the parameter with the list of text fields in the
            # table
            parameters[3].filter.list = fc_geoIDFields

This all works just fine...

There's one more thing I need to do though. Whichever field the user selects for parameter 3, I need to ensure that the values contained in this field are unique - every record must have a unique value.

I know of a pretty easy and elegant way to get the number of unique values in that field:

len(set(r[0] for r in arcpy.da.SearchCursor(parameters[2].valueAsText,
                                            parameters[3].valueAsText)))

What I don't know is the quickest and most elegant way to get the total number of features in that dataset so that I can compare it to the number of unique values in that field and thus, determine if all of the values in that field are unique.
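A separate record count may not be needed at all: a single pass over the field can track both the total and the distinct count. Here is a minimal sketch, assuming the values are hashable; the function names are hypothetical, and the cursor call is left as a comment because it needs arcpy.

```python
def count_total_and_distinct(values):
    """Return (total, distinct) for an iterable of hashable values, in one pass."""
    seen = set()
    total = 0
    for value in values:
        total += 1
        seen.add(value)
    return total, len(seen)


def all_unique(values):
    """True when every value occurs exactly once (an empty input counts as unique)."""
    total, distinct = count_total_and_distinct(values)
    return total == distinct


# Hypothetical usage inside the toolbox (needs arcpy, so it is left as a comment):
# with arcpy.da.SearchCursor(fc_path, [field_name]) as cursor:
#     is_unique = all_unique(row[0] for row in cursor)
```

The cursor is read only once, so the cost is roughly that of the unique-value expression above.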

Keep in mind that this would be occurring in the updateMessages function, so it needs to be a fairly quick process.
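For the updateMessages case, one option is to stop reading at the first repeated value and set an error on the field parameter. This is a rough sketch, not a tested toolbox: `flag_if_not_unique` is a hypothetical helper, and it only assumes the parameter object has a `setErrorMessage` method, as `arcpy.Parameter` does.

```python
def flag_if_not_unique(field_param, values):
    """Set an error on field_param at the first repeated value.

    field_param: any object with a setErrorMessage(str) method.
    values:      iterable of hashable field values.
    Returns True when no duplicate was found, otherwise False.
    """
    seen = set()
    for value in values:
        if value in seen:
            field_param.setErrorMessage(
                "Field values must be unique; {0!r} occurs more than once."
                .format(value))
            return False
        seen.add(value)
    return True


# Hypothetical wiring in the tool class (needs arcpy, so left as a comment):
# def updateMessages(self, parameters):
#     if parameters[2].value and parameters[3].value:
#         with arcpy.da.SearchCursor(parameters[2].valueAsText,
#                                    [parameters[3].valueAsText]) as cursor:
#             flag_if_not_unique(parameters[3], (row[0] for row in cursor))
```

When the field really is unique, every row still has to be read, so the worst case is no faster than a full scan.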

Any thoughts on how to get a record count super fast?

JoshuaBixby · ‎10-16-2017

Have you done any benchmarking? Get Count easily beats other methods for getting the total records in a data set, especially for larger data sets. If you are already running a cursor against the data set, then using Get Count might be unnecessary, but any extra time added to the script would be from an unnecessary call and not a slow function.
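For reference, a call to Get Count from Python might look like the sketch below. `get_record_count` is a hypothetical wrapper name; the import sits inside the function only so the snippet can be loaded on a machine without ArcGIS.

```python
def get_record_count(dataset):
    """Return the number of rows in a table or feature class as an int."""
    import arcpy  # imported here so the sketch loads without ArcGIS installed
    result = arcpy.GetCount_management(dataset)
    # Get Count returns a Result object; output 0 is the count as a string.
    return int(result.getOutput(0))
```

The returned total can then be compared with the number of distinct values in the field.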

In terms of balancing simplicity and performance, I am a big fan of Counter, as I stated above. It is trivially slower than proposed set operations, and it gives you much richer information in case that might have value at some point.

import collections

import arcpy

def unique_check_cnt(iterable):
    counter = collections.Counter(iterable)
    # most_common(1) gives [(value, count)] for the most frequent value;
    # an empty input has no duplicates, so it is treated as unique
    return not counter or counter.most_common(1)[0][1] == 1

fc = "<path to feature class>"
fld = "<field name to check for uniqueness>"

with arcpy.da.SearchCursor(fc, fld) as cur:
    print(unique_check_cnt(i for i, in cur))
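As one illustration of the richer information Counter gives, the same counts can report which values are duplicated and how often. This is a small sketch; `duplicate_values` is a hypothetical helper name.

```python
import collections


def duplicate_values(iterable):
    """Return [(value, count), ...] for values seen more than once,
    most frequent first."""
    counter = collections.Counter(iterable)
    return [(value, count)
            for value, count in counter.most_common()
            if count > 1]
```

An empty list means the field is unique; otherwise the pairs could be used to tell the user which values repeat.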

If the data sets are large and there is a reasonable chance there will be duplicates, then it might pay off to implement a method that can stop as soon as a duplicate is found.

import collections

import arcpy

def unique_check_defdict(iterable):
    d = collections.defaultdict(int)
    for i in iterable:
        # stop at the first value that has already been seen
        if d[i] > 0:
            return False
        d[i] += 1
    # loop finished without finding a repeat
    return True

fc = "<path to feature class>"
fld = "<field name to check for uniqueness>"

with arcpy.da.SearchCursor(fc, fld) as cur:
    print(unique_check_defdict(i for i, in cur))


JoshuaBixby · ‎10-16-2017

Personally, I like Counter.

DanPatterson_Retired · ‎10-16-2017

Some may offer numpy and pandas solutions, but consider the initial load time:

import random
%timeit len(set(random.randint(0,10) for i in range(1000000)))
1.5 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy as np
%timeit len(np.unique(np.random.randint(0, 10, size=1000000, dtype='l')))
43.4 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# ---- is a coffee sip worth the difference? ----

JohnDye · ‎10-16-2017

numpy or pandas is likely going to be faster. I always forget about Pandas being available now. I wouldn't hold the import times against them with a Python toolbox. You pay those costs upfront at toolbox initialization unless you're importing inside of your tool classes.

DanPatterson_Retired · ‎10-16-2017

_as_narray ... I have used it recently in some of my blogs.

import numpy as np

a = arcpy.da.SearchCursor(in_fc, 'OID@', explode_to_points=True)._as_narray()

# in_fc is a featureclass,  explode to points not needed, but I wanted a bigger file

len(a) == len(np.unique(a))
True

Now, if you are playing Code Golf, you can put all of that in two lines.

JamesCrandall · ‎10-16-2017

import numpy as np

nparr = arcpy.da.FeatureClassToNumPyArray(TheInputFeatureClass, ['TheFieldToEvaluate'])
uniqueValueCount = len(np.unique(nparr))
print(uniqueValueCount)

DanPatterson_Retired · ‎10-16-2017

Converting to a numpy array adds a bit of overhead with either approach, but it is insignificant: for about 25,000 unique points, the whole check is still within a coffee sip and an import.

%timeit len(np.unique(arcpy.da.FeatureClassToNumPyArray(in_fc, 'OID@')))
107 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

JamesCrandall · ‎10-16-2017

Yep. Great point considering the OP needs this to perform during an onLoad event.

JohnDye · ‎10-16-2017

Look at you rockstars! Lightning quick to respond.

I actually came up with this a few minutes later and benchmarked it against 465,051 records. I'd like to get it down to 3 seconds or less.

import time

import arcpy

def calculateRunTime(function, *args):
    # Return (elapsed seconds, function result)
    startTime = time.time()
    result = function(*args)
    return time.time() - startTime, result

def isUniqueValueField(dataset, field):
    idList = []
    with arcpy.da.SearchCursor(dataset, field) as cursor:
        for row in cursor:
            idList.append(row[0])
    # Unique only if no values collapse when placed in a set
    return len(idList) == len(set(idList))

>>> calculateRunTime(isUniqueValueField,
...                  parameters[2].valueAsText,
...                  parameters[3].valueAsText)
(8.847000122070312, True)

JamesCrandall · ‎10-16-2017

Considering Dan's valid point about numpy setup time, what about simply wrapping the SearchCursor in a set?

print(len(set(arcpy.da.SearchCursor(TheInputFeatureClass, 'TheFieldToEvaluate'))))