Background...
Exploring the ArcGIS module and looking at how it worked which lead me to SpatialDataFrames, the cloaked version of a Pandas df with geometry stuff, which lead me to some of its inner workings etc etc.
I use Numpy and lot and the FeatureClassToNumPy array with some fancy array splitting generally satisfies all my needs to get the geometry yanked out of a featureclass... do the magic... then send it back to a featureclass.
I tend not to use searchcursors, but the ArcGIS code uses it in several places, so, I thought... give it a shot.
Helper functions ...
Featureclass information
Some imports... some needed now, some later.
import sys
from textwrap import dedent
import numpy as np
import arcpy
from arcgis.geometry import _types
from arcgis.features._data.geodataset import SpatialDataFrame as SDF
import pandas as pd
def fc_info(in_fc):
"""Return basic featureclass information, including...
: SR - spatial reference object (use SR.name to get the name)
: shp_fld - field name which contains the geometry object
: oid_fld - the object index/id field name
: - others: 'areaFieldName', 'baseName', 'catalogPath','featureType',
: 'fields', 'hasOID', 'hasM', 'hasZ', 'path'
: - all_flds =[i.name for i in desc['fields']]
"""
desc = arcpy.da.Describe(in_fc)
args = ['shapeFieldName', 'OIDFieldName', 'spatialReference', 'shapeType']
shp_fld, oid_fld, SR, shp_type = [desc[i] for i in args]
return shp_fld, oid_fld, SR, shp_type
@time_deco
def to_arr(in_fc, use_geo=False):
"""Convert a featureclass to a structured or recarray using a searchcursor.
:
:Requires: import arcpy, numpy as np
:--------
: in_fc - featureclass
: use_geo - True .__geo_interface__
: - list comprehension
:get the row information
: cycle through all geometries and get xy pairs
:
:References:
:----------
: - see the polygon, polyline etc classes in
: C:\ArcPro\Resources\ArcPy\arcpy\arcobjects\geometry.py
"""
shp_fld, oid_fld, SR, shp_type = fc_info(in_fc) # get the base information
flds = arcpy.ListFields(in_fc)
fnames = [f.name for f in flds if f.type not in ['OID', 'Geometry']]
geom_flds = ['SHAPE@', oid_fld] + fnames
flds = [shp_fld, oid_fld] + fnames
vals = []
geoms = []
coords = []
idx = flds.index(shp_fld)
with arcpy.da.SearchCursor(in_fc,
field_names=geom_flds, where_clause=None,
spatial_reference=SR, explode_to_points=False,
sql_clause=(None, None)) as rows:
for row in rows:
row = list(row)
geom = row.pop(idx)
vals.append(row)
geoms.append(geom) # pop the geometry out
if use_geo:
xy = geom.__geo_interface__['coordinates']
else:
xy = [np.array([(pt.X, pt.Y) for pt in arr if pt])
for arr in geom] # if pt else None
coords.append(np.asarray(xy)) # maybe dump the last as np.asarray
del row, geom, xy
del rows
return vals, coords, geoms
Now on to the next '2'
#@time_deco
def two_arrays(in_fc, keep_geom=True, dtype0=True, as_recarray=False):
"""Send to a numpy structured/array and split it into a geometry array
: and an attribute array.
:
:Requires:
:--------
: fc_info(in_fc) - function needed to return fc properties
: keep_geom
: - True, split the points into their geometry groups as an object array
: - False, a sequential array of shape = (N,)
: dtype0
: - True dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
: - False dt = [('Idx', '<i4'), ('XY', '<f8', (2,))]
"""
shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
all_flds = arcpy.ListFields(in_fc)
b_flds = [i.name for i in all_flds]
a = arcpy.da.FeatureClassToNumPyArray(in_fc,
field_names=[oid_fld, shp_fld],
explode_to_points=True,
spatial_reference=SR)
if dtype0:
dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')] # option 1
dtb = [('Idx', '<i4'), ('Xc', '<f8'), ('Yc', '<f8')]
else:
dt = [('Idx', '<i4'), ('XY', '<f8', (2,))]
dtb = [('Idx', '<i4'), ('XYc', '<f8', (2,))]
a.dtype = dt
oid_fld = 'Idx'
if keep_geom:
a = np.split(a, np.where(np.diff(a[oid_fld]))[0] + 1) # oid_fld
a = np.r_[a]
if as_recarray:
a = a.view(object, np.recarray)
#
#
b = arcpy.da.FeatureClassToNumPyArray(in_fc, field_names=b_flds,
explode_to_points=False,
spatial_reference=SR)
dtb.extend(b.dtype.descr[2:])
b.dtype = dtb
return a, b
Now testing on trivial samples is trivial. Before I moved on to big samples (geometries with millions of points), I tried it on a suitable one that was readily available... Although it consists of 3 polygons, one is a single part and the others have 1 or two other parts. There are about 1.8 million points in total.
# ---- A multipart polygon file, with 3 parts and 0, 1 or 2 subparts
a.size # = 3
a.shape # = (3,)
sum([i.size for i in a]) # = 1,814,562 points
# ---- Big snips because of array print options -----
array([ array([(1, -65.55527496337874, 49.25722122192411),
(1, -65.55305480957014, 49.25694274902389), ..., # big snip
(1, -55.902732849121094, 51.63003158569353),
(1, -55.90134429931635, 51.63085556030313)],
dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]),
array([(2, -81.8187484741211, 68.89791870117205),
(2, -81.81718444824207, 68.89531707763712), ..., # other snip
(2, -99.49025726318348, 51.53623199462936),
(2, -99.49021148681629, 51.536178588867415)],
dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]),
array([(3, -114.51197814941384, 73.37343597412138),
(3, -114.5083312988279, 73.37291717529314), ..., # and another
(3, -69.89323425292969, 83.10884857177757),
(3, -69.89251708984364, 83.10861206054705)],
dtype=[('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')])], dtype=object)
# ---- again, 1.8 ish million points ----
Timing results
Update
Seems that when they say 'SHAPE@' is an 'expensive' operation in the help files, they are being generous/cautious. Not quite sure what the difference the addition of 2 NAN's to the point has, but it is painful with some data operations. SHAPE@XY or its partial kindred @X and @Y performed better as did @JSON if memory serves.
I fiddled some, and produced two variants of a function to just get the id and geometry data using 'OID@', 'SHAPE@X' and 'SHAPE@Y' to get the geometry. The first uses FeatureClassToNumPyArray and the second uses arcpy.da.SearchCursor
#@time_deco
def to_arr0(in_fc):
"""Just get the geometry and id field
"""
shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
flds = ['OID@', 'SHAPE@X', 'SHAPE@Y']
a = arcpy.da.FeatureClassToNumPyArray(in_fc,
field_names=flds,
spatial_reference=SR,
explode_to_points=True)
dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
a.dtype = dt
return a
And now for the searchcursor approach
#@time_deco
def to_arr1(in_fc, split_geom=True):
"""Pulls out the geometry and index values from a featureclass. Optionally
: the full array can be split into an object array containing an array
: of individual geometries. Note, this is only useful for poly* objects
:
:Requires: numpy, arcpy and fc_info
:--------
: in_fc - featureclass
: split-geom - True, separate arrays are created for each geometry object
:
:Notes:
: - arcpy.da.SearchCursor(
: in_fc, field_names, where_clause, spatial_reference,
: explode_to_points, sql_clause=(None, None))
:-------------------------------------------------------------------------
"""
shp_fld, oid_fld, SR, shp_type = fc_info(in_fc)
# flds = arcpy.ListFields(in_fc)
# fnames = [f.name for f in flds if f.type not in ['OID', 'Geometry']]
# flds = [oid_fld, shp_fld]
g_flds = ['OID@', 'SHAPE@X', 'SHAPE@Y'] # 'SHAPE@', SHAPE@Z etc
vals = []
with arcpy.da.SearchCursor(in_fc, g_flds, None, SR, True) as rows:
for row in rows:
vals.append(row)
del row, rows
# ---- construct the array ----
dt = [('Idx', '<i4'), ('Xs', '<f8'), ('Ys', '<f8')]
a = np.array(vals, dtype=dt)
# ---- split out into an object array containing arrays of dt ----
if split_geom:
a = np.split(a, np.where(np.diff(a['Idx']))[0] + 1) # oid_fld
a = np.r_[a]
return a
You can ignore the commented out lines and the split_geom section. In theory, the functions do the same if split_geom=False
The results are comparable within a coffee sip for a medium sized featureclass (too big to attach)
a = to_arr0(in_fc)
Timing function for... to_arr0
Time: 1.75e+00s for 1,814,562 objects
a1 = to_arr1(in_fc, split_geom=False)
Timing function for... to_arr1
Time: 2.18e+00s for 1,814,562 objects
# ---- comparison of sequential Xs and Ys ----
np.allclose(a0['Xs'], a0['Xs'])
True
np.allclose(a0['Ys'], a1['Ys'])
True
The original two_arrays function can be modified to use either of the above, but to_arr1 allows the user to keep the geometries separate with a known dtype. This could be reshaped to comply with a SpatialDataFrame if desired. Sadly, Pandas is used as the middle glue.
I will leave this open to see if anyone has any useful information to contribute.
Dan
Much Later
I should re-read my notes more often, 'subclassing'? ... _as_narray
def _shps_recarray(in_fc, to_pnts=True):
"""Convert shapes to a recarray quickly
"""
cur = arcpy.da.SearchCursor(in_fc, "*", explode_to_points=to_pnts)
flds = cur.fields
dt = cur._dtype
a = cur._as_narray()
return a, flds, dt
Tests ....
A Canada level 0 gdb tile 4ish million points with a bunch of attributes
a, flds, dt = _shps_recarray(in_fc)
Timing function for... _shps_recarray
Time: 8.99e+00s for 3 objects
a.shape(3890048,)
flds
('OBJECTID', 'Shape', 'Shape_Length', 'Shape_Area', 'AREA_GEO', 'PERIM_GEO', 'PNT_COUNT')
dt
dtype([('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Shape_Length', '<f8'),
('Shape_Area', '<f8'), ('AREA_GEO', '<f8'), ('PERIM_GEO', '<f8'),
('PNT_COUNT', '<f8')])
Certainly cuts out all the
with arcpy.da.SearchCursor(in_fc, "*") as cursor:
stuff and you can then pull out what you either before or after.
I knew there had to be direct support of numpy arrays beyond SpatialDataFrames.
The observational gallery to this thread has grown but no comments, I guess I will turn into a document in case anyone has any thoughts.
References (so I don't forget again)
Joshua Bixby's ... comment ... and my supplemental
This document was generated from the following discussion: Are searchcursors brutally slow? They need not be