Some basic exploration into the generic point in... style questions.
A recent comment about Select By Location taking a long time when point sets were large got me thinking. The background on this topic generally leave one staring a reams of code with little to no explanation on the logic. I will continue with that trend .
First the code. I use python and numpy to do a lot of the work and we will use arcpy to communicate with ArcMap or ArcGIS Pro.
I created a featureclass in a file geodatabase of points, and a polygon representing a selection polygon. The following script was used to generate the data prior to testing the code. The sample was small for now, 10,000 points and one polygon (square). More on scaling at a later time.
Explore.
# -*- coding: UTF-8 -*-
"""
:Script: point_in_rect
:Author: Dan.Patterson@carleton.ca
:Purpose:
: To determine whether points are within the extents of polygons.
:
:References:
: http://stackoverflow.com/questions/30481577/
: assign-numpy-array-of-points-to-a-2d-square-grid
: http://stackoverflow.com/questions/33051244/
: numpy-filter-points-within-bounding-box
: https://stackoverflow.com/questions/33051244/
: numpy-filter-points-within-bounding-box/33051576#33051576
:
"""
# ----10| ------20| ------30| ------40| ------50| ------60| ------70| ------80|
import numpy as np
import arcpy
np.set_printoptions(edgeitems=5, linewidth=80, precision=4,
suppress=True, threshold=20)
def contains(pnts, ext, in_out=True):
"""Performs a logical_and to find points within a box/ext
:Requires:
:--------
: pnts - an array of points
: ext - the extent of the rectangle being tested as an array of the
: left bottom (LB) and upper right (RT) coordinates
: in_out - boolean, whether to return inside and outside points
:
:Notes:
:-----
: comp - np.logical_and( great-eq LB, less RT) # logic
: inside - np.where(np.prod(comp, axis=1) == 1) # logic
: case - comp returns [True, False] so you take the product
: idx_in - indices derived using where since case will be 0 or 1
: inside - slice the pnts using idx_in
"""
outside = None
LB, RT = ext
comp = np.logical_and((LB <= pnts), (pnts <= RT))
case = comp[:, 0] * comp[:, 1]
idx_in = np.where(case)[0]
inside = pnts[idx_in]
if in_out:
idx_out = np.where(~case)[0] # invert case
outside = pnts[idx_out]
return inside, outside
if __name__ == "__main__":
"""make some points for testing, create and extent,
"""
ext = np.array([[342000, 5022000], [343000, 5023000]])
in_fc = r'C:\Git_Dan\a_Data\testdata.gdb\xy_10k'
SR = arcpy.SpatialReference(2951)
a = arcpy.da.FeatureClassToNumPyArray(in_fc,
['SHAPE@X', 'SHAPE@Y'],
spatial_reference=SR)
a0 = a.view(dtype=np.float).reshape(len(a), 2)
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Git_Dan\a_Data\testdata.gdb'
arcpy.MakeFeatureLayer_management('xy_10k', 'xy_lyr')
#arcpy.MakeFeatureLayer_management('subpoly', 'subpoly_lyr')
arcpy.management.SelectLayerByLocation("xy_lyr", "INTERSECT", "subpoly",
None, "NEW_SELECTION", "NOT_INVERT")
matchcount = int(arcpy.GetCount_management('xy_lyr')[0])
The results are shown in the figure below.
In this example, the script was run as it was, then the IPython magic %timeit was used to time just the portions that didn't require data preparation.
In both cases, 401 points were returned from a featureclass of 10,000 points. The timing results are below.
Timing Results
%timeit contains(a0, ext, in_out=False)
195 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit arcpy.management.SelectLayerByLocation("xy_lyr", "INTERSECT", "subpoly", None, "NEW_SELECTION", "NOT_INVERT")
308 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# imports as in previous script
def src_geom(pnt_fc, poly_fc):
""" a search cursor approach
"""
pnts = [p[0] for p in arcpy.da.SearchCursor(pnt_fc, "SHAPE@")]
polys = [pl[0] for pl in arcpy.da.SearchCursor(poly_fc, "SHAPE@")]
poly = polys[0]
is_in = [pnt.within(poly) for pnt in pnts]
return pnts, poly, is_in
if __name__ == "__main__":
"""searchcursor, arcpy point within polygon approach
"""
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'C:\Git_Dan\a_Data\testdata.gdb'
pnt_fc = r'C:\Git_Dan\a_Data\testdata.gdb\xy_10k' # the point layer as before
poly_fc = r'C:\Git_Dan\a_Data\testdata.gdb\subpoly' # a single polygon
# ----
pnts, poly, is_in = src_geom(pnt_fc, poly_fc)
pnts_in = sum(is_in)
# ----
The result yielded the same number of points ( ie pnts_in = 401)
%timeit src_geom(pnt_fc, poly_fc)
472 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(4) MatPlotLib and Path.contains_point
MatPlotLib's 'path' module can be used to implement point in polygon testing. Since it is available within the Anaconda suite, then it is included here for comparison purposes.
from matplotlib.path import Path
pip = Path.contains_point # ---- short form of method ----
def _extent(ext):
"""construct the extent rectangle from the extent points which are the
: lower left and upper right points [LB, RT]
"""
LB, RT = ext
L, B = LB
R, T = RT
box = [LB, [L, T], RT, [R, B], LB]
ext_rect = np.array(box)
return ext_rect
def containsmpl(pnts, ext):
"""matplotlib incarnation
"""
poly = _extent(ext)
poly = matplotlib.path.Path(poly)
is_in = [pip(poly, p) for p in pnts]
return np.array(is_in)
Using the numpy array objects for the points and the polygon, as inputs, the results are as follows:
%timeit containsmpl(a, ext)
24.7 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think it uses the 'crossing number' method of testing which would be better for general polygons, something which has to be added to the pure numpy approach, after the selection of the points in the extent has been completed.
To come...
Results so far... In all cases, the result was 401 points within the one polygon.
Numpy 0.195 ms (195 microseconds)
MatPlotLib 24.7 ms
SelectByLocation 308 ms
arcpy 472 ms