SelectLayerByLocation Alternatives?

JamesCrandall · ‎09-29-2017

Being a bit greedy here, I know but has anyone come up with more efficient/faster alternatives to arcpy.SelectLayerByLocation_management?

I'd say for the vast majority (ie. 99%) implementing this is entirely acceptable. However, I've run into a couple of instances lately that require several of these calls to be made in order to build a list/set of results that ultimately return an expected response.

...or in another instance, there are a large amount of point features (85 million) that need to be identified as COMPLETELY_WITHIN a simple single polygon feature. In most cases it's sub 30 seconds to return the expected result.

However if these processes are to be published as synchronous GP services, the processing times can be excessive in that context.

numpy? dictionaries? pandas? Anything at all that you may have tried to exceed the performance of arcpy.SelectLayerByLocation_management???

Case / scenario: select by location on points within a simple polygon feature and return a yes or no if the count is > 0

JamesCrandall · ‎10-03-2017

Edit: added clip_analysis to generate the final polygons to use in SelectLayerByLocation process.

I think the best solution I've tested so far is to break up the polygon into individual polygons and then iterating over each polygon to perform the SelectbyLocation on the point features, checking to see if any features fall inside, then break out of that loop if that's the case.

Pretty straight forward:

1. CreateFishnet_management of the polygon feature.

2. FeatureToPolygon_management of the output Fishnet

3. Search cursor loop over each polygon in the fishnet

4. SelectLayerByLocation and perform test to count result.

5. If count > 0 then break and return with Boolean TRUE

What's interesting and the challenge is that the combination of rows/cols for the fishnet vs. the overall size of the input polygon geometry will determine the efficiency of the overall process. That is, if it's a large input polygon and too many rows/cols are created for the fishnet, then it will take longer to loop, but it also takes less time to SelectLayerByLocation.

There's a direct correlation between the content of the fishnet vs. the processing time of the SelectLayerByLocation. That is, if the faster it finds points within one of the fishnet polygons, the faster the overall process.

This beats any straight-up SelectLayerByLocation call against the full extent of the original point / polygon feature classes!

        arcpy.MakeFeatureLayer_management(pnt_fc, "PointFeatures")
        extent = polygon.extent

        originXY = extent.lowerLeft
        yaxisXY = extent.upperLeft
        opp_corner =extent.upperRight

        print "origin: " + str(originXY)
        print "yaxis: " + str(yaxisXY)
        print "opposite corner: " + str(opp_corner)

        #keep the cell width/height set to zero as it will use the opp_corner to figure it out
        cellWidth = 0
        cellHeight = 0

        #plug in the #rows/#cols you want the output grids to contain
        nrows = 4
        ncols = 4
        outFC = r'in_memory\outsections_line'
        outFishnet = r'in_memory\outsections' 
        outClip = r'in_memory\clipfeatures'
        arcpy.CreateFishnet_management(outFC, str(originXY), str(yaxisXY), cellWidth, cellHeight, nrows, ncols, str(opp_corner))
        arcpy.FeatureToPolygon_management(outFC, outFishnet)
        arcpy.Clip_analysis(outFishnet, polygon, outClip)

        with arcpy.da.SearchCursor(outClip, ['SHAPE@']) as outcur:
            for outrow in outcur:

                tshape = outrow[0]

                arcpy.SelectLayerByLocation_management("PointFeatures", "INTERSECT", tshape)
                rowcounter = int(arcpy.GetCount_management("PointFeatures").getOutput(0))

                if rowcounter > 0:
                    print "rowcounter: " + str(rowcounter)
                    return True
                    break

                else:
                    print "rowcounter: " + str(rowcounter)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

View solution in original post

JoshuaBixby · ‎09-29-2017

Even for 85 million points, 30 seconds seems long to select points that intersect a simply polygon feature. Can you elaborate a bit more on how these points are stored, file geodatabase, enterprise geodatabase??? Also, you say "simple polygon feature," how simple is it? Are we talking a square, or at least something convex? Is the polygon an ArcPy Polygon or a polygon in a feature class?

JamesCrandall · ‎09-29-2017

File Geodatabase (10.3)

On a network share

Spatial index applied

Registered with ArcGIS Server (10.4) site

Polygon: is quite simple maybe 25 vertices, an in_memory feature class that gets created from string input of rings, but clearly the SelectbyLocation is the culprit (timing each component):

jsonFrmt = geo_convert(feature_info)
jsonData = json.loads(jsonFrmt)
inSr = arcpy.SpatialReference(3857)

for key,value in jsonData.iteritems():
    if value == 'FeatureCollection':
        pass
    else:
        for i in value:
            for k,v in i.iteritems():
                if k == 'geometry':
                    features = []
                    features2 = []
                    feature = v['coordinates']
                    polygon = arcpy.Polygon(arcpy.Array([arcpy.Point(*coords) for coords in feature]), inSr)
                    features.append(polygon)
                else:
                    pass

fc2 = arcpy.CreateFeatureclass_management('in_memory', 'tmpFC', "POLYGON")
with arcpy.da.InsertCursor(fc2, ['SHAPE@']) as cur:
    cur.insertRow([polygon])

pnt_fc = r'\\pathtonetworkfolder\myPointFeatureClass'
fl = "PointFeatures"
arcpy.MakeFeatureLayer_management(pnt_fc, "PointFeatures")
arcpy.SelectLayerByLocation_management(fl, "COMPLETELY_WITHIN", fc2)

rowcount = int(arcpy.GetCount_management(fl).getOutput(0))
if rowcount>0:
    data.update({'HasPoints': 'yes'})
else:
    data.update({'HasPoints': 'no'})‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

DanPatterson_Retired · ‎09-29-2017

nothing is clearly the culprit until your profile your code. You can do this with actual code profiling tools or even simple decorators (I will look for my blog post on this)

JamesCrandall · ‎09-29-2017

I timed the:

arcpy.CreateFeatureclass_management (.02 secs)

arcpy.MakeFeatureLayer_management (2 secs)

arcpy.SelectLayerByLocation_management (30 secs)

I'm implementing a class:

class Timer:
    def __enter__(self):
        self.start = time.clock()
        return self
    def __exit__(self, *args):
        self.end = time.clock()
        self.interval = self.end - self.start‍‍‍‍‍‍‍

Then just messaging the print statement to see results for each arcpy method as such:

t1 = Timer()
with t1:
   fc2 = arcpy.CreateFeatureclass_management('in_memory', 'tmpFC', "POLYGON")
   with arcpy.da.InsertCursor(fc2, ['SHAPE@']) as cur:
      cur.insertRow([polygon])

msgt1 = "%s...CreateFeatureClass time %.02f seconds" % (bndy,t1.interval)‍‍‍‍‍‍‍

JoshuaBixby · ‎09-29-2017

We have had some situations where network-share file geodatabases perform terribly with selections or geoprocessing operations. Just to test the impact of storing the data on a network share, can you copy the data local and see if that speeds it up much? Also, what about a different spatial operators, some perform better than others. If you can live with "WITHIN" instead of "COMPLETELY_WITHIN", then it opens the door to try "INTERSECT."

JamesCrandall · ‎09-29-2017

You made an interesting comment about just using the polygon feature in the SelectbyLocation rather than creating the in_memory feature class. This works too and I can take out that extra step of CreateFeatureClass_management and shave processing time right there.

Thanks!

Still interested in what might be wrong with the SelectbyLocation

mikeking1 · ‎10-06-2017

Hi James

85 million points is a lot! The key to these kind of issues is breaking down the problem and making sure you have the shortest path to a solution because slow response is often a death by a thousand cuts. Whilst some approaches in ArcGIS are definitely quicker than others I've found for large datasets the less data you send to particular query the better. Is there a way you can pre-filter your result set by using environments or processing masks?

This approach prevents the entire dataset being used to exclude data points that will never be in your results. At least that is my conclusion after finding it massively improves performance in raster and vector processing exercises. Others might have a more in depth knowledge of how ArcGIS does its thing under the hood.

Good luck!

JamesCrandall · ‎10-06-2017

Hi Mike,

See my last post on this thread. I show how I broke it all down using arcpy methods.

DanPatterson_Retired · ‎09-29-2017

testing time....

"""
Script:   point_in_rect
Author:   Dan.Patterson@carleton.ca
Purpose:
  To determine whether points are within the extents of polygons.
"""
# ----10| ------20| ------30| ------40| ------50| ------60| ------70| ------80|
import numpy as np
np.set_printoptions(edgeitems=5,linewidth=80,precision=4,
                    suppress=True,threshold=20)


def time_deco(func):  # timing originally
    """timing decorator function
    :print("\n  print results inside wrapper or use <return> ... ")
    """
    import time
    from functools import wraps

    @wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()        # start time
        result = func(*args, **kwargs)  # ... run the function ...
        t1 = time.perf_counter()        # end time
        dt = t1 - t0
        print("\nTiming function for... {}".format(func.__name__))
        print("  Time: {: <8.2e}s for {:,} objects".format(dt, len(result)))
        return result                   # return the result of the function
        return dt                       # return delta time
    return wrapper


@time_deco
def contains(pnts, ext):
    """Performs a logical_and to find points within a box/ext
    comp = np.logical_and( great-eq LB, less RT) # logic
    inside = np.where(np.prod(comp, axis=1) == 1) # logic
    """
    L, B, R, T = ext.flatten()
    comp = np.logical_and(([L, B] <= pnts), (pnts <= [R, T]))  # < or <=
    idx = np.where(comp[:, 0] * comp[:, 1] == 1)[0]   # faster incarnation
    inside = pnts[idx]
    return inside

if __name__=="__main__":
    """make some points for testing, create and extent,
    time each option and optionally graph"""
    N = 10000000
    pnts = np.random.random_sample((N, 2)) * 5
    ext = np.array([[2., 2.],[3., 3.]])
    contains(pnts, ext)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

N = 10,000,000 points and the number of points returned within the extent.

Timing function for... contains
Time: 2.44e-01s for 399,628 objects

N = 100,000,000

Timing function for... contains
Time: 3.02e+00s for 4,001,834 objects

N = 1, 000, 000, 000

damn... ran out of memory on my toy computer

So even though this is a point in simple rectangle test, you could use it to prune out the candidates before passing on to the SelectByLocation. Some fodder for thought