Using Collections with da Search Cursor for validation

ZacharyHart · ‎10-05-2016

I have a tool which creates a number of points grouped together in 'clusters'. Inevitably, the tool will generate points that fall outside the project boundary. Removing these points is a simple task using select by location. However, I also want to remove any cluster that has fewer points than a required threshold value. I'm having some success using 'collections' in much the same way that Summary Stats would work. However, I'm running into difficulty using an IF statement with the collection: it prints every cluster (regardless of its count) when the test is 'collection > X', prints nothing when the test is 'collection < X', and fails to work when I try 'collection = X'.

So it appears the collection is a 'Counter'. Am I on the wrong track here? I couldn't get 'GetCount' to summarize by each value in a field.

EDIT: it looks like 'sorted' may be something I can use here, but having a hard time finding examples of this.

I want to remove all points in a cluster if that cluster contains fewer points than a threshold value.

Here is my code so far:

import collections

import arcpy

def CleanSamplePlots (inPoints, standLayer, outPoints):
    #inPoints are the raw output of the desired point generation process
    #standLayer is the project stand dataset
    #outPoints are the cleaned, validated points
    arcpy.MakeFeatureLayer_management(inPoints, "ptLayer")
    arcpy.SelectLayerByLocation_management ("ptLayer", "intersect", standLayer,"", "", 'INVERT')
    if int(arcpy.GetCount_management("ptLayer").getOutput(0)) > 0:
        arcpy.DeleteFeatures_management("ptLayer")
    with arcpy.da.SearchCursor("ptLayer", "CLUSTER_ID") as cursor:
        count_of_items = collections.Counter(row[0] for row in cursor)
        for item in sorted(count_of_items.items(), key=lambda x:x[1]):
            if count_of_items < 8:
                print "Cluster:{0} Points{1}".format(item[0], item[1])

Cluster:4 Points4
Cluster:14 Points4
Cluster:6 Points7
Cluster:7 Points8
Cluster:12 Points8
Cluster:21 Points9
Cluster:1 Points9
Cluster:2 Points9
Cluster:20 Points10....etc

DanPatterson_Retired · ‎10-05-2016

There is another way. I have taken some liberties with the output formatting. I am on an iThing so I can't produce a search cursor, so we will pretend that I have one (src_cur)

create an enumerate object ... enumerate(src_cur) ... that you can cycle through
establish your threshold value from row[0] .... cleverly masked as rho_zero
the enumeration object produces a pair for each 'rho_zero'... an ID number and the corresponding value from the cursor
only keep those that are above a certain threshold... in this case, 5
the enum_ result is the rows that meet the threshold conditions, with each pair representing (OBJECTID, Value)
do what you need with the result

>>> import numpy as np
>>> src_cur = np.random.randint(0, 10, size=50).tolist() # emulate reading a cursor
>>>
>>> enum_ = [rho_zero for rho_zero in enumerate(src_cur) if rho_zero[1] > 5]
>>> 
>>> src_cur  # a 'searchcursor' returning 50 values... liberties taken with formatting
[4, 4, 6, 7, 1, 8, 1, 2, 3, 2,
 2, 5, 6, 5, 9, 8, 0, 0, 8, 4,
 5, 7, 3, 4, 8, 0, 0, 2, 7, 2,
 1, 3, 6, 6, 6, 8, 9, 7, 4, 5,
 4, 2, 2, 9, 3, 1, 0, 7, 3, 6]
>>>
>>> enum_ # again with the formatting (x,y) represents (OBJECTID, value)
[(2, 6), (3, 7), (5, 8), (12, 6), (14, 9),
 (15, 8), (18, 8), (21, 7), (24, 8), (28, 7),
 (32, 6), (33, 6), (34, 6), (35, 8), (36, 9),
 (37, 7), (43, 9), (47, 7), (49, 6)]
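
A minimal sketch of how the same list-comprehension filter might be combined with a Counter for the cluster problem. It assumes the cursor returns (OBJECTID, CLUSTER_ID) pairs; the rows and the threshold below are made-up stand-ins, not data from the original tool.

```python
from collections import Counter

# Hypothetical (OBJECTID, CLUSTER_ID) rows, standing in for what
# arcpy.da.SearchCursor(layer, ["OID@", "CLUSTER_ID"]) would yield.
rows = [(1, 10), (2, 10), (3, 10), (4, 20), (5, 20), (6, 30)]
threshold = 3  # assumed minimum number of points per cluster

# Count how many points fall in each cluster.
counts = Counter(cluster for _, cluster in rows)

# Keep the OBJECTIDs of points whose cluster is below the threshold.
to_delete = [oid for oid, cluster in rows if counts[cluster] < threshold]
print(to_delete)
```

With arcpy, the resulting IDs could then drive an arcpy.da.UpdateCursor and its deleteRow() method, or a selection query. That step is left out here because it depends on the data.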


DanPatterson_Retired · ‎10-05-2016

class Counter(builtins.dict)
| Dict subclass for counting hashable items. Sometimes called a bag
| or multiset. Elements are stored as dictionary keys and their counts
| are stored as dictionary values.

>>> from collections import Counter as c
>>> dir(c)
['__add__', .. snip... '_keep_positive', 'clear', 'copy', 'elements', 'fromkeys', 
'get', 'items', 'keys', 'most_common', 'pop', 'popitem', 'setdefault', 'subtract', 
'update', 'values']
>>> a = [1, 2, 3, 2, 2, 2, 3, 3, 3, 1]
>>> c(a)
Counter({2: 4, 3: 4, 1: 2})
>>> c.most_common(c(a))
[(2, 4), (3, 4), (1, 2)]

Notice that it returns a Counter object, which can be used with its special methods.
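
To tie this back to the loop in the original post, here is a minimal sketch, assuming the goal is to flag clusters with fewer than 8 points. The Counter itself is a dict, so comparing it to an integer never compares the counts; the per-cluster count is the value stored under each key, and that value is a plain int. The cluster IDs below are made up for illustration.

```python
from collections import Counter

# Hypothetical CLUSTER_ID values, standing in for rows read from a cursor.
cluster_ids = [4] * 4 + [6] * 7 + [7] * 8
threshold = 8  # assumed minimum number of points a cluster must have

counts = Counter(cluster_ids)

# Compare each count (an int) to the threshold, not the Counter itself.
small_clusters = sorted(cid for cid, n in counts.items() if n < threshold)

for cid in small_clusters:
    print("Cluster:{0} Points:{1}".format(cid, counts[cid]))
```

In the original loop, the equivalent change would be to test item[1] rather than count_of_items.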

ZacharyHart · ‎10-05-2016

I see what you mean. So is there a way to get at those summarized/sorted values and evaluate them as a basic integer value?

FYI, I'm normally hesitant to muddy the waters with the background purpose, but felt it was appropriate here.

DanPatterson_Retired · ‎10-05-2016

At least you can explore the functionality of the collections module since it does offer certain container types and conditions that pure lists, tuples and dictionaries don't. I find it one of the most underused modules out there but widely used by other higher level packages... much like itertools.

ZacharyHart · ‎10-05-2016

Yeah...this could be quite a side venture. It's late for me here...my mind can't see a way forward just yet. I would have thought it would be easier to create some kind of validation like this...maybe I can create something a bit more..uh...crummy? Like looking for a string value to be true...if a certain point ID doesn't exist (the one coinciding with the threshold value) then delete all points in the 'cluster' (with the same cluster ID value). Seems...crummy.

Anyway, per my original post, could you perhaps help me develop a search string I can use to look for something similar on other forums? My grasp of proper terminology is admittedly weak here (but at perseverance I'm an ace). 'Get sorted count' comes to mind...'evaluate sorted count', 'evaluate sorted field'....'greater than sorted value'...bah...

ZacharyHart · ‎10-06-2016

Dan, I want to thank you for your (as always) high-level view on things here. I've actually had some good success with itertools. I did play around for a while this AM with what you provided. I think that I'm just too much of a novice to put collections to work here. I ended up leveraging summary stats with a da SearchCursor, which works pretty slick (albeit far less elegant than what you propose). I'd post it up when it's cleaned up, but I'm not sure it will add much to the conversation here.

Don't count me out for collections just yet though...I'm going to keep working on the basics.