Checking if a field is a unique value field

10-16-2017 01:27 PM
JohnDye
Occasional Contributor III

I have a parameter in a python toolbox which allows a user to select a field from a dataset given in a previous parameter.

# Feature Class to absorb geometry from
param2 = arcpy.Parameter(
    displayName="Geometry Feature Class",
    name="in_geoFC",
    datatype="GPFeatureLayer",
    parameterType="Required",
    direction="Input")‍‍‍‍‍‍‍

# Table ID field
param3 = arcpy.Parameter(
    displayName="Table Geometry ID Field",
    name="table_geoIDField",
    datatype="GPString",
    parameterType="Required",
    direction="Input",
    enabled=False)
param3.filter.type = "ValueList"
param3.filter.list = []

The updateParameters method contains logic to update parameter 3 with the names of all fields in the dataset provided in parameter 2 that have a datatype of String:

def updateParameters(self, parameters):
    """Modify the values and properties of parameters before internal
       validation is performed.  This method is called whenever a parameter
       has been changed."""
    
    # if 'in_geoFC' is populated with a value
    if parameters[2].value:
        # if 'in_geoFC' does not have an error set
        if not parameters[2].hasError():
            #  Create a list of all of the fields in the 'in_geoFC'
            # which have a datatype of 'String'
            fc_geoIDFields = [field.name for field in arcpy.Describe(
                              parameters[2].valueAsText).fields
                              if field.type == 'String']
            # Enable the parameter
            parameters[3].enabled = True
            # Populate the parameter with the list of text fields in the
            # table
            parameters[3].filter.list = fc_geoIDFields

This all works just fine...

There's one more thing I need to do though. Whichever field the user selects for parameter 3, I need to ensure that the values contained in this field are unique - every record must have a unique value.

I know of a pretty easy and elegant way to get the number of unique values in that field:

len(set(r[0] for r in arcpy.da.SearchCursor(parameters[2].valueAsText
                                             , parameters[3].valueAsText)))

What I don't know is the quickest and most elegant way to get the total number of features in that dataset so that I can compare it to the number of unique values in that field and thus determine whether all of the values in that field are unique.

Keep in mind that this would be occurring in the updateMessages function, so it needs to be a fairly quick process.
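
For context, here's roughly where the check would live; the total record count is the placeholder piece I'm missing, and setErrorMessage is the standard way to flag the parameter during validation:

def updateMessages(self, parameters):
    """Modify the messages created by internal validation for each tool
       parameter. This method is called after internal validation."""
    # Only check once the field parameter has a value and no other errors
    if parameters[3].value and not parameters[3].hasError():
        fc = parameters[2].valueAsText
        fld = parameters[3].valueAsText
        # Number of distinct values in the chosen field
        unique_count = len(set(r[0] for r in arcpy.da.SearchCursor(fc, fld)))
        total_count = None  # <-- the fast record count I'm looking for
        if unique_count != total_count:
            parameters[3].setErrorMessage(
                "Every record must have a unique value in '{0}'.".format(fld))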

Any thoughts on how to get a record count super fast?

22 Replies
JohnDye
Occasional Contributor III

I am doing that to determine the unique values. The catch is determining whether that number of unique values matches the length of the recordset, which would tell me every value in the given field is unique.

JamesCrandall
MVP Frequent Contributor
if int(arcpy.GetCount_management(TheInputFeatureClass).getOutput(0)) != len(set(arcpy.da.SearchCursor(TheInputFeatureClass, 'TheFieldToEvaluate'))):
    print 'length did not match'
JohnDye
Occasional Contributor III

GetCount_management is a no-go because it has to instantiate the geoprocessor. Big performance hit. Not a good thing to do in Python Toolbox parameter validation unless you like making your users watch the cursor spin.
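
If I end up needing an explicit count, the cheapest route I can think of is an OID-only da cursor rather than the geoprocessing tool (a sketch, untested at scale):

# Count records with a bare da.SearchCursor on the OID token instead of
# GetCount_management, so no geoprocessing tool is invoked.
def quick_count(fc):
    with arcpy.da.SearchCursor(fc, 'OID@') as cur:
        return sum(1 for row in cur)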

JamesCrandall
MVP Frequent Contributor

Aren't you already doing that?  I see you are setting up arcpy.da.SearchCursor().

JoshuaBixby
MVP Esteemed Contributor

Have you done any benchmarking?  Get Count easily beats other methods for getting the total records in a data set, especially for larger data sets.  If you are already running a cursor against the data set, then using Get Count might be unnecessary, but any extra time added to the script would be from an unnecessary call and not a slow function.

In terms of balancing simplicity and performance, I am a big fan of Counter, as I stated above.  It is trivially slower than the proposed set operations, and it gives you much richer information in case that might have value at some point.

import collections

def unique_check_cnt(iterable):
    counter = collections.Counter(iterable)
    return counter.most_common(1)[0][1] == 1
    
fc = # path to feature class
fld = # field name to check for uniqueness

with arcpy.da.SearchCursor(fc, fld) as cur:
    print unique_check_cnt(i for i, in cur)

If the data sets are large and there is a reasonable chance there will be duplicates, then it might pay off to implement a method that can stop as soon as a duplicate is found.

import collections

def unique_check_defdict(iterable):
    d = collections.defaultdict(int)
    for i in iterable:
        if d[i] > 0:
            return False
        d[i] += 1
    else:
        return True

fc = # path to feature class
fld = # field name to check for uniqueness

with arcpy.da.SearchCursor(fc, fld) as cur:
    print unique_check_defdict(i for i, in cur)
JohnDye
Occasional Contributor III

I benchmarked GetCount_management vs. the function I defined above. Results below.
My custom function was just a little faster, but it wouldn't surprise me if GetCount_management came out ahead against a much larger dataset. I'm currently testing against a dataset with a record count of 465,051, as that's the biggest one I have on my system.

>>> calculateRunTime(arcpy.GetCount_management
                    , parameters[2].valueAsText)
(7.444000005722046, <Result '465051'>)
>>> calculateRunTime(isUniqueValueField
                    , parameters[2].valueAsText
                    , parameters[3].valueAsText)
(6.894999980926514, True)

However, my custom function also does more than just count the rows. It compares the record count to the number of unique values in the given field, returning True or False to indicate whether the number of unique values matches the total record count (the calculateRunTime and isUniqueValueField helpers are in my previous post; a simplified sketch is below). That's really what I'm after: figuring out whether the values in the given field are unique for every record.
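
For anyone who doesn't want to dig up the earlier post, the shape of those two helpers is roughly this (a simplified sketch, not the exact code):

import time

def calculateRunTime(function, *args):
    # Time one call and return (seconds, result)
    start = time.clock()
    result = function(*args)
    return time.clock() - start, result

def isUniqueValueField(fc, field):
    # True if the number of distinct values equals the number of records
    values = [r[0] for r in arcpy.da.SearchCursor(fc, field)]
    return len(set(values)) == len(values)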

I also tried the logic you outlined above, which is much more elegant than mine.

def unique_check_defdict(iterable):
    d = collections.defaultdict(int)
    for i in iterable:
        if d[i] > 0:
            return False
        d[i] += 1
    else:
        return True

def isUniqueValueField_byBixby(fc, field):
    with arcpy.da.SearchCursor(fc, field) as cur:
        print unique_check_defdict(i for i, in cur)

>>> calculateRunTime(isUniqueValueField_byBixby
                    , parameters[2].valueAsText
                    , parameters[3].valueAsText)
True
(7.770999908447266, None)

Since this function is going to execute in a Python toolbox tool under the updateMessages function (essentially validation), I need it to be as fast as I can possibly get it. GetCount_management might be fast at scale, but if a user throws a smaller dataset at it, waiting almost 8 seconds for validation is probably going to be suboptimal.

I could take this out of the updateMessages function, put it into the runtime execution logic, and throw a runtime error if the values aren't unique, but I'm trying to avoid doing that.

I'm going to play with numpy today and see if that can get me toward my goal of getting this down to 3 seconds or less against approx. half a million records. 

FYI - I really appreciate the help from everyone. If you can think of other ways to optimize or make it go zoom, zoom, please share!

JoshuaBixby
MVP Esteemed Contributor

Have you tried running your code several times by just looping over the cursor and doing nothing?  Setting up the cursor and iterating through it are likely the most expensive part of the function, possibly by orders of magnitude.  In most of the examples I tested, the slowest code was < 200 ms for an iterable with 1 million items.  If runtimes are several seconds long, it makes me think generating the iterable is taking most of the time.

In production, setting up a cursor for the same data set and just iterating over it will have variability in timing due to resource contention and scheduling.  Although it never hurts to optimize code, there is a point of diminishing returns, and squeezing a 15% improvement out of code that only accounts for 15% of the total runtime might not be worth the cost.
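
As a quick sanity check, timing a bare pass over the cursor would show how much of the total is just cursor setup and iteration (fc and fld are placeholders, as above):

import time

# Baseline: set up the cursor and walk it without doing any work on the rows
start = time.clock()
with arcpy.da.SearchCursor(fc, fld) as cur:
    for row in cur:
        pass
print time.clock() - start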

JohnDye
Occasional Contributor III

Good points. The slowest part is most likely setting up the cursor, and there's not going to be a way around that. The law of diminishing returns is certainly something to think about here. If I can't get it shrunk down by lunch, I'll probably just take the most efficient code outlined above, use it at runtime, and throw an error if the field is not unique.

DanPatterson_Retired
MVP Emeritus

The numpy option with the search cursor hasn't been tried... I use it all the time. I always have it imported, but importing numpy takes way less time than a base import of arcpy, so you probably wouldn't notice the difference.  The 'arcgis' module uses it and Pandas extensively, which is part of why that module is slow.
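
One way to wire that up (a sketch; fc and fld are placeholders, and null values may need skip_nulls or a null_value):

import numpy

# Pull the single field into a structured array, then compare the number of
# distinct values to the number of records.
arr = arcpy.da.TableToNumPyArray(fc, [fld])
print numpy.unique(arr[fld]).size == arr.size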

JohnDye
Occasional Contributor III

It would be nice if there were a Describe property for record count; that would be fast. It seems strange to me that there isn't one. FIDSet on a layer will return the selected records, but you have to set up a cursor and iterate the dataset just to get the full count.
