Are searchcursors brutally slow? They need not be

Background...

Exploring the ArcGIS module and looking at how it worked which lead me to SpatialDataFrames, the cloaked version of a Pandas df with geometry stuff, which lead me to some of its inner workings etc etc.

I use Numpy and lot and the FeatureClassToNumPy array with some fancy array splitting generally satisfies all my needs to get the geometry yanked out of a featureclass... do the magic... then send it back to a featureclass.

I tend not to use searchcursors, but the ArcGIS code uses it in several places, so, I thought... give it a shot.

Helper functions ...

Featureclass information

Some imports... some needed now, some later.

import sys
from textwrap import dedent
import numpy as np

import arcpy
from arcgis.geometry import _types
from arcgis.features._data.geodataset import SpatialDataFrame as SDF
import pandas as pd

def fc_info(in_fc):
    """Return basic featureclass information, including...
    : SR - spatial reference object (use SR.name to get the name)
    : shp_fld - field name which contains the geometry object
    : oid_fld - the object index/id field name
    : - others: 'areaFieldName', 'baseName', 'catalogPath','featureType',
    :   'fields', 'hasOID', 'hasM', 'hasZ', 'path'
    : - all_flds =[i.name for i in desc['fields']]
    """
    desc = arcpy.da.Describe(in_fc)
    args = ['shapeFieldName', 'OIDFieldName', 'spatialReference', 'shapeType']
    shp_fld, oid_fld, SR, shp_type = [desc[i] for i in args]
    return shp_fld, oid_fld, SR, shp_type‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

@time_deco
def to_arr(in_fc, use_geo=False):
    """Convert a featureclass to a structured or recarray using a searchcursor.
    :
    :Requires: import arcpy, numpy as np
    :--------
    : in_fc - featureclass
    : use_geo - True .__geo_interface__
    :         - list comprehension
    :get the row information
    : cycle through all geometries and get xy pairs
    :
    :References:
    :----------
    : - see the polygon, polyline etc classes in
    :   C:\ArcPro\Resources\ArcPy\arcpy\arcobjects\geometry.py
    """
    shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)  # get the base information
    flds = arcpy.ListFields(in_fc)
    fnames = [f.name for f in flds if f.type not in ['OID', 'Geometry']]
    geom_flds = ['SHAPE@', oid_fld] + fnames
    flds = [shp_fld, oid_fld] + fnames
    vals = []
    geoms = []
    coords = []
    idx = flds.index(shp_fld)
    with arcpy.da.SearchCursor(in_fc,
                               field_names=geom_flds, where_clause=None,
                               spatial_reference=SR,  explode_to_points=False,
                               sql_clause=(None, None)) as rows:
        for row in rows:
            row = list(row)
            geom = row.pop(idx)
            vals.append(row)
            geoms.append(geom)  # pop the geometry out
            if use_geo:
                xy = geom.__geo_interface__['coordinates']
            else:
                xy = [np.array([(pt.X, pt.Y) for pt in arr if pt])
                      for arr in geom]  # if pt else None
            coords.append(np.asarray(xy))  # maybe dump the last as np.asarray
            del row, geom, xy
        del rows
    return vals, coords, geoms‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Now on to the next '2'

#@time_deco
def two_arrays(in_fc, keep_geom=True, dtype0=True, as_recarray=False):
    """Send to a numpy structured/array and split it into a geometry array
    :  and an attribute array.
    :
    :Requires:
    :--------
    : fc_info(in_fc) - function needed to return fc properties
    : keep_geom
    :    - True, split the points into their geometry groups as an object array
    :    - False, a sequential array of shape = (N,)
    : dtype0
    :    - True  dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
    :    - False dt = [('Idx', '<i4'), ('XY', '<f8', (2,))]
    """
    shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
    all_flds = arcpy.ListFields(in_fc)
    b_flds = [i.name for i in all_flds]
    a = arcpy.da.FeatureClassToNumPyArray(in_fc,
                                          field_names=[oid_fld, shp_fld],
                                          explode_to_points=True,
                                          spatial_reference=SR)
    if dtype0:
        dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]  # option 1
        dtb = [('Idx', '<i4'), ('Xc', '<f8'), ('Yc', '<f8')]
    else:
        dt = [('Idx', '<i4'), ('XY', '<f8', (2,))]
        dtb = [('Idx', '<i4'), ('XYc', '<f8', (2,))]
    a.dtype = dt
    oid_fld = 'Idx'
    if keep_geom:
        a = np.split(a, np.where(np.diff(a[oid_fld]))[0] + 1)  # oid_fld
        a = np.r_[a]
    if as_recarray:
        a = a.view(object, np.recarray)
    #
    #
    b = arcpy.da.FeatureClassToNumPyArray(in_fc, field_names=b_flds,
                                          explode_to_points=False,
                                          spatial_reference=SR)
    dtb.extend(b.dtype.descr[2:])
    b.dtype = dtb
    return a, b‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Now testing on trivial samples is trivial. Before I moved on to big samples (geometries with millions of points), I tried it on a suitable one that was readily available... Although it consists of 3 polygons, one is a single part and the others have 1 or two other parts. There are about 1.8 million points in total.

# ---- A multipart polygon file, with 3 parts and 0, 1 or 2 subparts
a.size  # = 3
a.shape # = (3,)
sum([i.size for i in a]) # = 1,814,562 points‍‍‍‍‍

# ---- Big snips because of array print options -----
array([ array([(1, -65.55527496337874, 49.25722122192411),
       (1, -65.55305480957014, 49.25694274902389), ...,      # big snip
       (1, -55.902732849121094, 51.63003158569353),
       (1, -55.90134429931635, 51.63085556030313)], 
      dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]),
       array([(2, -81.8187484741211, 68.89791870117205),
       (2, -81.81718444824207, 68.89531707763712), ...,      # other snip
       (2, -99.49025726318348, 51.53623199462936),
       (2, -99.49021148681629, 51.536178588867415)], 
      dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]),
       array([(3, -114.51197814941384, 73.37343597412138),
       (3, -114.5083312988279, 73.37291717529314), ...,      # and another
       (3, -69.89323425292969, 83.10884857177757),
       (3, -69.89251708984364, 83.10861206054705)], 
      dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')])], dtype=object)

# ---- again, 1.8 ish million points ----‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Timing results

Timing function for... to_arr
Time: 4.63e+01s for 3 objects # 46 seconds

_ = two_arrays(in_fc)

Timing function for... two_arrays
Time: 4.45e+00s for 2 objects # 4.5 seconds

I kept the scripts intact for people to try out on their data. I even went as far as to strip out anything unnecessary in other versions but nothing reduced the time of the searchcursor approach. I am trying to get a handle on 'why'. I would have guessed that a similar base code would have been recycled in the searchcursor and the FeatureClassToNumPyArray functionality. I guess I am wrong OR there is some fundamental flaw in my implementation of 'to_arr' vs 'two_arrays'... so if you have a moment or two... have a look-see

Dan

Update

Seems that when they say 'SHAPE@' is an 'expensive' operation in the help files, they are being generous/cautious. Not quite sure what the difference the addition of 2 NAN's to the point has, but it is painful with some data operations. SHAPE@XY or its partial kindred @X and @Y performed better as did @JSON if memory serves.

I fiddled some, and produced two variants of a function to just get the id and geometry data using 'OID@', 'SHAPE@X' and 'SHAPE@Y' to get the geometry. The first uses FeatureClassToNumPyArray and the second uses arcpy.da.SearchCursor

#@time_deco
def to_arr0(in_fc):
    """Just get the geometry and id field
    """
    shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
    flds = ['OID@', 'SHAPE@X', 'SHAPE@Y']
    a = arcpy.da.FeatureClassToNumPyArray(in_fc,
                                          field_names=flds,
                                          spatial_reference=SR,
                                          explode_to_points=True)
    dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
    a.dtype = dt
    return a‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

And now for the searchcursor approach

#@time_deco
def to_arr1(in_fc, split_geom=True):
    """Pulls out the geometry and index values from a featureclass.  Optionally
    :  the full array can be split into an object array containing an array
    :  of individual geometries.  Note, this is only useful for poly* objects
    :
    :Requires: numpy, arcpy and fc_info
    :--------
    :  in_fc - featureclass
    :  split-geom - True, separate arrays are created for each geometry object
    :
    :Notes:
    : - arcpy.da.SearchCursor(
    :     in_fc, field_names, where_clause, spatial_reference,
    :     explode_to_points, sql_clause=(None, None))
    :-------------------------------------------------------------------------
    """
    shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
#    flds = arcpy.ListFields(in_fc)
#    fnames = [f.name for f in flds if f.type not in ['OID', 'Geometry']]
#    flds = [oid_fld, shp_fld]
    g_flds = ['OID@', 'SHAPE@X', 'SHAPE@Y']  # 'SHAPE@', SHAPE@Z etc
    vals = []
    with arcpy.da.SearchCursor(in_fc, g_flds, None, SR, True) as rows:
        for row in rows:
            vals.append(row)
    del row, rows
    # ---- construct the array ----
    dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
    a = np.array(vals, dtype=dt)
    # ---- split out into an object array containing arrays of dt ----
    if split_geom:
        a = np.split(a, np.where(np.diff(a['Idx']))[0] + 1)  # oid_fld
        a = np.r_[a]
    return a‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

You can ignore the commented out lines and the split_geom section. In theory, the functions do the same if split_geom=False

The results are comparable within a coffee sip for a medium sized featureclass (too big to attach)

a = to_arr0(in_fc)

Timing function for... to_arr0
  Time: 1.75e+00s for 1,814,562 objects

a1 = to_arr1(in_fc, split_geom=False)

Timing function for... to_arr1
  Time: 2.18e+00s for 1,814,562 objects

# ---- comparison of sequential Xs and Ys ----
np.allclose(a0['Xs'], a0['Xs'])
True

np.allclose(a0['Ys'], a1['Ys'])
True‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The original two_arrays function can be modified to use either of the above, but to_arr1 allows the user to keep the geometries separate with a known dtype. This could be reshaped to comply with a SpatialDataFrame if desired. Sadly, Pandas is used as the middle glue.

I will leave this open to see if anyone has any useful information to contribute.

Dan

Much Later

I should re-read my notes more often, 'subclassing'? ... _as_narray

def _shps_recarray(in_fc, to_pnts=True):
    """Convert shapes to a recarray quickly
    """
    cur = arcpy.da.SearchCursor(in_fc, "*", explode_to_points=to_pnts)
    flds = cur.fields
    dt = cur._dtype
    a = cur._as_narray()
    return a, flds, dt‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Tests ....

A Canada level 0 gdb tile 4ish million points with a bunch of attributes

a, flds, dt = _shps_recarray(in_fc)

Timing function for... _shps_recarray
  Time: 8.99e+00s for 3 objects

a.shape(3890048,)

flds
('OBJECTID', 'Shape', 'Shape_Length', 'Shape_Area', 'AREA_GEO', 'PERIM_GEO', 'PNT_COUNT')

dt
dtype([('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Shape_Length', '<f8'),
        ('Shape_Area', '<f8'), ('AREA_GEO', '<f8'), ('PERIM_GEO', '<f8'),
        ('PNT_COUNT', '<f8')])‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Certainly cuts out all the

with arcpy.da.SearchCursor(in_fc, "*") as cursor:‍‍‍

stuff and you can then pull out what you either before or after.

I knew there had to be direct support of numpy arrays beyond SpatialDataFrames.

The observational gallery to this thread has grown but no comments, I guess I will turn into a document in case anyone has any thoughts.

References (so I don't forget again)

UpdateCursor link

Joshua Bixby's ... comment ... and my supplemental

This document was generated from the following discussion: Are searchcursors brutally slow? They need not be