Dan_Patterson

Cursors by any other name are...

Blog Post created by Dan_Patterson Champion on Mar 28, 2019

Arrays

The case against <null> 

 

The scenario...

Some data.  Simple, but a combination of singlepart, multipart a couple of holes and tabular data with nulls.

Time for the SearchCursor

Easy couple of steps, when you know what you want.  We will just blow up the polygons to points and get all the attributes in the table.

in_fc = r"C:\My_spaceless_path_to_my\file_geodatabase.gdb\Polygons"
flds = arcpy.ListFields(in_fc, "*")
desc = arcpy.da.Describe(in_fc)          # create the describe object
SR = desc['spatialReference']            # spatial reference always needed
flds = [i.name for i in desc['fields']]  # field names are good

# time to make some points from the polygons
cur = arcpy.da.SearchCursor(in_fc, flds, spatial_reference=SR,
                            explode_to_points=True)

Any questions?

Pretty straightforward.

 

But cursors have some interesting properties.

cur.fields
  ('OBJECTID', 'Shape', 'Id', 'Long_1', 'Short_1', 'Float_1', 'Double_1', 'Text_1',
   'Shape_Length', 'Shape_Area', 'Date_time', 'DT_str')

cur._dtype

dtype([('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
       ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
       ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('Date_time', '<M8[us]'), 'DT_str', '<U20')])

Wow!  Almost looks numpy array-like.  There is even an _as_narray ... I wonder.

a = cur._as_narray()  # ---- give it whirl and see what happens

Traceback (most recent call last):
  File "<ipython-input-174-371bed78ca5c>", line 1, in <module>

    a = cur._as_narray()

TypeError: int() argument must be a string, a bytes-like object or a number,
                 not 'NoneType'

# ---- dismal failure... translation?

 

Well that didn't go well!  Translation? looks like an attempt was made to convert some bit of data to an integer and it couldn't be because of those pesky <nulls> ...aka... None in the table!

Yes those things that you have to exclude from calculations, you have to query for, you have to remember it isn't good to see if something is equal to None but whether something IS None.

Hence, Nulls are evil.  They are a lazy way out of not recording a value in a table that represents the fact that you have no observation for that field.  

 

You have to get rid of nulls, so if you are too lazy to do it at the beginning during data preparation and Q/A work, then there is a way out.  So if you ignore the verbosity in the code below, there are a few tips.

  • For a particular data type, pick an appropriate null
    • for floats/doubles, python and numpy have NaN  (not a number, which is a double not a number... with me?)
    • text/string, easy ... choose an appropriate null value, anything that is in text form. Some suggestions:
      • "NoneType", "None_the_string", "Nadda", "Missed_that_one", "not_mine" ... 
    • Integers.... sadly there is no Nint (Not an integer), so you have to get creative, like old school, -9, -99, -999 etc or chose specify a minimum or maximum based on the integer dtype (see line 14-16 for examples)
    • Time... NaT... Not a Time... got one, use it instead of <null>

 

def _make_nulls_(in_flds, int_null=-999):
    """Return null values for a list of fields objects, excluding objectid
    and geometry related fields.  Throw in whatever else you want.

    Parameters
    ----------
    in_flds : list of field objects
        arcpy field objects, use arcpy.ListFields to get a list of featureclass
        fields.
    int_null : integer
        A default to use for integer nulls since there is no ``nan`` equivalent
        Other options include

    >>> np.iinfo(np.int32).min # -2147483648
    >>> np.iinfo(np.int16).min # -32768
    >>> np.iinfo(np.int8).min  # -128
    """
    if not isinstance(in_flds, (list, tuple)):
        in_flds = [in_flds]
    nulls = {'Double': np.nan, 'Single': np.nan, 'Float': np.nan,
             'Short': int_null, 'SmallInteger': int_null, 'Long': int_null,
             'Integer': int_null, 'String':str(None), 'Text':str(None),
             'Date': np.datetime64('NaT')}
    bad =  ['Guid', 'FID', 'OID', 'OBJECTID', 'Geometry', 'Shape',
            'Shape_Length', 'Shape_Area',   'Raster', 'Blob']
    good = [f for f in in_flds if f.editable and f.type not in bad]
    fld_dict = {f.name: f.type for f in good}
    fld_names = list(fld_dict.keys())
    null_dict = {f: nulls[fld_dict[f]] for f in fld_names}
    return null_dict, fld_names

 

So what happens when you try to use the shortcut _as_narray?

Row values in a list format...
with arcpy.da.SearchCursor(in_fc, '*', None, SR) as cursor:
    a = [row for row in cursor]
   
a
[(1, (300015.0, 5000005.0), 1, 1, 4, 1.0, 100.0, 'A 10 chars', 40.0, 100.0, datetime.datetime(2019, 3, 28, 0, 0), '2019/03/28 00:00:00'),
(2, (300005.0, 5000015.0), 2, None, None, None, None, None, 64.0, 64.0, None, None),
(3, (300010.4945054945, 5000010.593406593), 3, 3, 6, 3.0, 300.0, 'C not null', 99.41640786499875, 182.0, datetime.datetime(2019, 3, 30, 0, 0), '2019/03/30 00:00:00')]

 

Notice the integer columns have nulls in them as do the other columns.  Columns should have one data type.  We get too complacent because of spreadsheets.

 

To the rescue...

b = arcpy.da.FeatureClassToNumPyArray(
        in_fc, "*", "", spatial_reference=SR, explode_to_points=False,
        skip_nulls=False, null_value=null_dict)

b
array([(1, [ 300015.   , 5000005.   ], 1,    1,    4,  1., 100., 'A 10 chars', 40.   , 100., '2019-03-28T00:00:00.000000', '2019/03/28 00:00:00'),
       (2, [ 300005.   , 5000015.   ], 2, -999, -999, nan,  nan, 'None', 64.   ,  64., '2019-03-28T00:00:00.000000', 'None'),
       (3, [ 300010.495, 5000010.593], 3,    3,    6,  3., 300., 'C not null', 99.416, 182., '2019-03-30T00:00:00.000000', '2019/03/30 00:00:00')],
dtype=[('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
       ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
       ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('Date_time', '<M8[us]'), ('DT_str', '<U20')])

Do the scrolly thing and you will notice that the null_dict that I created previously now converts integer <null> values to -999, Floating point values are tru NaN's and text is 'None'.

 

Lesson... don't use <null>, define a true null value that represents what is meant by it.  There can be more than one type of value to represent conditions like, no data collected, versus no data available, versus no observation possible... don't lump them into one category at the beginning.  You can aggregate AFTER data collection, you shouldn't do it before

Outcomes