I benchmarked GetCount_management against the function I defined above. Results are below.
My custom function was slightly faster, but it wouldn't surprise me if GetCount_management pulled ahead on a much larger dataset. I'm currently testing against a dataset with 465,051 records, as that's the biggest one I have on my system.
>>> calculateRunTime(arcpy.GetCount_management, parameters[2].valueAsText)
(7.444000005722046, <Result '465051'>)
>>> calculateRunTime(isUniqueValueField, parameters[2].valueAsText, parameters[3].valueAsText)
(6.894999980926514, True)
However, my custom function does more than just count the rows: it also compares the record count to the number of unique values in the given field, returning True or False to indicate whether the number of unique values matches the total record count (see the logic for the calculateRunTime and isUniqueValueField functions in my previous post). That's really what I'm after: figuring out whether the values in the given field are unique for every record.
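For anyone who hasn't seen the previous post, the two helpers are along these lines (a rough sketch of the idea rather than the exact code from that post):

import time
import arcpy

def calculateRunTime(func, *args):
    # Time a single call and return (elapsed seconds, return value).
    start = time.time()
    result = func(*args)
    return time.time() - start, result

def isUniqueValueField(fc, field):
    # Count rows and collect distinct field values in one cursor pass;
    # True means every record holds a unique value in the field.
    values = set()
    count = 0
    with arcpy.da.SearchCursor(fc, field) as cur:
        for row in cur:
            values.add(row[0])
            count += 1
    return len(values) == count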
I also tried the logic you outlined above, which is much more elegant than mine.
import collections
import arcpy

def unique_check_defdict(iterable):
    d = collections.defaultdict(int)
    for i in iterable:
        if d[i] > 0:
            # Seen this value before, so the field is not unique.
            return False
        d[i] += 1
    else:
        # The loop finished without finding a duplicate.
        return True

def isUniqueValueField_byBixby(fc, field):
    with arcpy.da.SearchCursor(fc, field) as cur:
        print unique_check_defdict(i for i, in cur)
>>> calculateRunTime(isUniqueValueField_byBixby, parameters[2].valueAsText, parameters[3].valueAsText)
True
(7.770999908447266, None)
Since this function is going to execute in a Python toolbox tool under the tool's updateMessages method (essentially validation), I need it to be as fast as I can possibly get it. GetCount_management might be fast at scale, but if a user throws a smaller dataset at it, waiting almost 8 seconds for validation is probably going to be suboptimal.
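For context, the wiring inside the toolbox would look roughly like this (just a sketch; the parameter indices and message text here are placeholders rather than the actual tool code):

def updateMessages(self, parameters):
    # Sketch only: parameters[2] is the input dataset and parameters[3]
    # is the field whose values must be unique for every record.
    if parameters[2].value and parameters[3].value:
        if not isUniqueValueField(parameters[2].valueAsText,
                                  parameters[3].valueAsText):
            parameters[3].setErrorMessage(
                "Values in this field must be unique for every record.")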
I could take this check out of updateMessages and put it into the tool's execution logic, raising a runtime error if the values aren't unique, but I'm trying to avoid doing that.
I'm going to play with numpy today and see if that can get me to my goal of 3 seconds or less against roughly half a million records.
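If anyone is curious, the direction I'm planning to try looks roughly like this (an untested sketch; I'm assuming arcpy.da.TableToNumPyArray will accept the input, and I'm ignoring null handling for now):

import numpy
import arcpy

def isUniqueValueField_numpy(fc, field):
    # Pull just the one field into a numpy structured array in a single
    # call, then compare the count of distinct values to the row count.
    arr = arcpy.da.TableToNumPyArray(fc, [field])
    return numpy.unique(arr[field]).size == arr.size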
FYI - I really appreciate the help from everyone. If you can think of other ways to optimize or make it go zoom, zoom, please share!