
The case against <null> 

 

The scenario...

Some data.  Simple, but a combination of singlepart and multipart polygons, a couple of holes, and tabular data with nulls.

Time for the SearchCursor

Easy couple of steps, when you know what you want.  We will just blow up the polygons to points and get all the attributes in the table.

in_fc = r"C:\My_spaceless_path_to_my\file_geodatabase.gdb\Polygons"
flds = arcpy.ListFields(in_fc, "*")
desc = arcpy.da.Describe(in_fc)          # create the describe object
SR = desc['spatialReference']            # spatial reference always needed
flds = [i.name for i in desc['fields']]  # field names are good

# time to make some points from the polygons
cur = arcpy.da.SearchCursor(in_fc, flds, spatial_reference=SR,
                            explode_to_points=True)

Any questions?

Pretty straightforward.

 

But cursors have some interesting properties.

cur.fields
  ('OBJECTID', 'Shape', 'Id', 'Long_1', 'Short_1', 'Float_1', 'Double_1', 'Text_1',
   'Shape_Length', 'Shape_Area', 'Date_time', 'DT_str')

cur._dtype

dtype([('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
       ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
       ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('Date_time', '<M8[us]'), ('DT_str', '<U20')])

Wow!  Almost looks numpy array-like.  There is even an _as_narray ... I wonder.

a = cur._as_narray()  # ---- give it a whirl and see what happens

Traceback (most recent call last):
  File "<ipython-input-174-371bed78ca5c>", line 1, in <module>
    a = cur._as_narray()
TypeError: int() argument must be a string, a bytes-like object or a number,
                 not 'NoneType'

# ---- dismal failure... translation?

 

Well, that didn't go well!  Translation?  It looks like an attempt was made to convert some bit of data to an integer, and it couldn't be done because of those pesky <null>s ... aka ... None in the table!

Yes, those things that you have to exclude from calculations, that you have to query for, and that you have to remember not to test with equality (== None) but with identity... whether something IS None.
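A quick aside to illustrate, using a made-up (hypothetical) row with nulls in it:

row = (2, None, None, None)                # hypothetical row from a table
good = [v for v in row if v is not None]   # identity check... the right way
good                                       # [2]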

Hence, nulls are evil.  They are a lazy way out: a non-value recorded in a table to represent the fact that you have no observation for that field.

 

You have to get rid of nulls.  If you were too lazy to do it at the beginning, during data preparation and Q/A work, there is still a way out.  Ignore the verbosity in the code below; there are a few tips.

  • For a particular data type, pick an appropriate null
    • for floats/doubles, python and numpy have NaN  (Not a Number... which is a double that is not a number... with me?)
    • text/string, easy ... choose an appropriate null value, anything that is in text form. Some suggestions:
      • "NoneType", "None_the_string", "Nadda", "Missed_that_one", "not_mine" ... 
    • integers.... sadly there is no Nint (Not an integer), so you have to get creative: old school values like -9, -99, -999 etc., or specify a minimum or maximum based on the integer dtype (see the np.iinfo examples in the function below, and the sketch after this list)
    • time... NaT... Not a Time... got one, use it instead of <null>
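A minimal sketch of those stand-ins in numpy terms:

import numpy as np

np.nan == np.nan          # False... NaN isn't even equal to itself, use np.isnan
np.datetime64('NaT')      # NaT... the time equivalent
np.iinfo(np.int32).min    # -2147483648 ... an 'old school' style integer null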

 

import numpy as np

def _make_nulls_(in_flds, int_null=-999):
    """Return null values for a list of field objects, excluding objectid
    and geometry related fields.  Throw in whatever else you want.

    Parameters
    ----------
    in_flds : list of field objects
        arcpy field objects; use arcpy.ListFields to get a list of
        featureclass fields.
    int_null : integer
        A default to use for integer nulls since there is no ``nan``
        equivalent.  Other options include

    >>> np.iinfo(np.int32).min # -2147483648
    >>> np.iinfo(np.int16).min # -32768
    >>> np.iinfo(np.int8).min  # -128
    """
    if not isinstance(in_flds, (list, tuple)):
        in_flds = [in_flds]
    nulls = {'Double': np.nan, 'Single': np.nan, 'Float': np.nan,
             'Short': int_null, 'SmallInteger': int_null, 'Long': int_null,
             'Integer': int_null, 'String': str(None), 'Text': str(None),
             'Date': np.datetime64('NaT')}
    bad = ['Guid', 'FID', 'OID', 'OBJECTID', 'Geometry', 'Shape',
           'Shape_Length', 'Shape_Area', 'Raster', 'Blob']
    good = [f for f in in_flds if f.editable and f.type not in bad]
    fld_dict = {f.name: f.type for f in good}
    fld_names = list(fld_dict.keys())
    null_dict = {f: nulls[fld_dict[f]] for f in fld_names}
    return null_dict, fld_names
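Put it to work on the field objects from the featureclass.  For the table above, the result looks roughly like this (a sketch... your fields will differ):

null_dict, fld_names = _make_nulls_(arcpy.ListFields(in_fc), int_null=-999)

null_dict  # ---- roughly...
{'Id': -999, 'Long_1': -999, 'Short_1': -999, 'Float_1': nan,
 'Double_1': nan, 'Text_1': 'None', 'Date_time': numpy.datetime64('NaT'),
 'DT_str': 'None'}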

 

So why did the shortcut _as_narray fail?  Pull the rows with a plain cursor and have a look.

Row values in a list format...
with arcpy.da.SearchCursor(in_fc, '*', None, SR) as cursor:
    a = [row for row in cursor]

a
[(1, (300015.0, 5000005.0), 1, 1, 4, 1.0, 100.0, 'A 10 chars', 40.0, 100.0, datetime.datetime(2019, 3, 28, 0, 0), '2019/03/28 00:00:00'),
 (2, (300005.0, 5000015.0), 2, None, None, None, None, None, 64.0, 64.0, None, None),
 (3, (300010.4945054945, 5000010.593406593), 3, 3, 6, 3.0, 300.0, 'C not null', 99.41640786499875, 182.0, datetime.datetime(2019, 3, 30, 0, 0), '2019/03/30 00:00:00')]

 

Notice that the integer columns have nulls in them, as do the other columns.  Columns should have one data type.  We have grown too complacent because of spreadsheets.
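The failure is easy to reproduce outside of arcpy... try to force a None into an integer array:

import numpy as np

np.array([1, None, 3], dtype='int32')
# TypeError: int() argument must be a string, a bytes-like object or a number,
#            not 'NoneType'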

 

To the rescue...

b = arcpy.da.FeatureClassToNumPyArray(
        in_fc, "*", "", spatial_reference=SR, explode_to_points=False,
        skip_nulls=False, null_value=null_dict)

b
array([(1, [ 300015.   , 5000005.   ], 1,    1,    4,  1., 100., 'A 10 chars', 40.   , 100., '2019-03-28T00:00:00.000000', '2019/03/28 00:00:00'),
       (2, [ 300005.   , 5000015.   ], 2, -999, -999, nan,  nan, 'None', 64.   ,  64., '2019-03-28T00:00:00.000000', 'None'),
       (3, [ 300010.495, 5000010.593], 3,    3,    6,  3., 300., 'C not null', 99.416, 182., '2019-03-30T00:00:00.000000', '2019/03/30 00:00:00')],
dtype=[('OBJECTID', '<i4'), ('Shape', '<f8', (2,)), ('Id', '<i4'), ('Long_1', '<i4'),
       ('Short_1', '<i4'), ('Float_1', '<f4'), ('Double_1', '<f8'), ('Text_1', '<U10'),
       ('Shape_Length', '<f8'), ('Shape_Area', '<f8'), ('Date_time', '<M8[us]'), ('DT_str', '<U20')])

Do the scrolly thing and you will notice that the null_dict that I created previously now converts integer <null> values to -999, floating point values to true NaNs, and text to 'None'.

 

Lesson... don't use <null>; define a true null value that represents what is meant by it.  There can be more than one value to represent conditions like no data collected, versus no data available, versus no observation possible... don't lump them into one category at the beginning.  You can aggregate AFTER data collection; you shouldn't do it before.

Short one

 

Came up in a question.  I sadly suggested a spreadsheet.  To correct this, here is the numpy solution.

Normalizing data...

Here are the input and output tables.

import numpy as np
import arcpy

def normalize(a):
    # a is an (n x dimension) np.array
    tmp = a - np.min(a, axis=0)
    out = tmp / np.ptp(tmp, axis=0)  # ---- np.ptp... the range
    return out

# ---- read, normalize and write back
names = ['a', 'b', 'c', 'd']
a = arcpy.da.TableToNumPyArray(out_tbl, names)
a0 = a.view('f8').reshape(a.shape[0], len(names))
dt = [('a1', 'f8'), ('b1', 'f8'), ('c1', 'f8'), ('d1', 'f8')]
n = normalize(a0)
new_names = ['a1', 'b1', 'c1', 'd1']
out = np.zeros((n.shape[0],), dtype=dt)
for i, name in enumerate(new_names):
    out[name] = n[:, i]
arcpy.da.NumPyArrayToTable(out, out_tbl + "norm")

 

 

The first two lines of the main block read the table from ArcGIS Pro.

The third 'views' the array as floating point numbers.

The fourth creates an output data type for sending the result back.

The fifth normalizes the data.

The remaining lines bundle it up and send it back to Pro as a table.

 

Normalize... hope I got it right... take the array, subtract the min, then divide by the range.  np.ptp is the 'peak-to-peak' function, which returns the range (max - min).
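A tiny worked example of that equation:

import numpy as np

a = np.array([2., 4., 6.])
tmp = a - np.min(a)    # array([0., 2., 4.])
np.ptp(tmp)            # 4.0 ... the range
tmp / np.ptp(tmp)      # array([0. , 0.5, 1. ])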

 

Normalize by row, column or overall

 

Now, let's assume that an input dataset could be arranged by row, by column, or overall, as in a raster...  We need to change the normalize function just a bit to handle this.

# ---- Adding an axis parameter ----

def normalize(a, axis=None):
    # a is an (n x m) np.array
    # axis=0... by column, axis=1... by row, axis=None... overall
    # keepdims keeps the reduced axis, so row-wise broadcasting works too
    tmp = a - np.min(a, axis=axis, keepdims=True)
    out = tmp / np.ptp(tmp, axis=axis, keepdims=True)
    return out

a = np.arange(25).reshape(5,5)   # ---- some data

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

normalize(a, axis=0)     # ---- normalize by column

array([[0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.25, 0.25, 0.25, 0.25, 0.25],
       [0.5 , 0.5 , 0.5 , 0.5 , 0.5 ],
       [0.75, 0.75, 0.75, 0.75, 0.75],
       [1.  , 1.  , 1.  , 1.  , 1.  ]])

normalize(a, axis=1)     # ---- normalize by row

array([[0.  , 0.25, 0.5 , 0.75, 1.  ],
       [0.  , 0.25, 0.5 , 0.75, 1.  ],
       [0.  , 0.25, 0.5 , 0.75, 1.  ],
       [0.  , 0.25, 0.5 , 0.75, 1.  ],
       [0.  , 0.25, 0.5 , 0.75, 1.  ]])

normalize(a, axis=None)  # ---- normalize overall

array([[0.  , 0.04, 0.08, 0.12, 0.17],
       [0.21, 0.25, 0.29, 0.33, 0.38],
       [0.42, 0.46, 0.5 , 0.54, 0.58],
       [0.62, 0.67, 0.71, 0.75, 0.79],
       [0.83, 0.88, 0.92, 0.96, 1.  ]])

 

Lots of stuff you can do

Geometry in NumPy... # 1

Posted by Dan_Patterson, Mar 16, 2019


 

By example

 

Start with a few points.  The table below shows what the arrangement is.

 

How was the table generated from the coordinates?

Every sequence needs an Id field, some x and y coordinates.  The rest of the columns were calculated.

I will exclude the id column from the example, since all the other fields are decimal numbers.

 

  1. Make an empty array to hold the results
  2. Add the x and y coordinates (xs, ys)
  3. Calculate the sequential differences in the coordinates (dx, dy)
  4. Determine the segment lengths between the points (leng)
  5. Calculate the cumulative length of the path formed by the points
  6.  Magic happens....

 

Sample Data Points...

 id  xs     ys     dx      dy     leng   cumleng   steps   deltaX   deltaY 
----------------------------------------------------------------------------
 000   0.00   0.00    0.00   0.00   0.00      0.00    0.00     0.00     0.00
 001   0.00  10.00    0.00  10.00  10.00     10.00    2.50     0.00     4.00
 002  10.00  10.00   10.00   0.00  10.00     20.00    2.50     4.00     0.00
 003  10.00   8.00    0.00  -2.00   2.00     22.00    0.50     0.00    -4.00
 004   2.00   8.00   -8.00   0.00   8.00     30.00    2.00    -4.00     0.00
 005   2.00   2.00    0.00  -6.00   6.00     36.00    1.50     0.00    -4.00
 006  10.00   2.00    8.00   0.00   8.00     44.00    2.00     4.00     0.00
 007  10.00   0.00    0.00  -2.00   2.00     46.00    0.50     0.00    -4.00
 008   0.00   0.00  -10.00   0.00  10.00     56.00    2.50    -4.00     0.00
import numpy as np

# ---- Procedure... start with an array 'a'... the x, y values from the table
a = np.array([[ 0.,  0.], [ 0., 10.], [10., 10.], [10.,  8.], [ 2.,  8.],
              [ 2.,  2.], [10.,  2.], [10.,  0.], [ 0.,  0.]])

a.shape  # (9, 2) the input array, it has 9 rows and 2 columns (ie x, y)

z = np.zeros((a.shape[0], 9))                # make the empty array
z[:, :2] = a                                 # xys : x, y coordinates
z[1:, 2:4] = dxdy = z[1:, :2] - z[:-1, :2]   # sequential differences
z[1:, 4] = np.sqrt(np.einsum('ij,ij->i', dxdy, dxdy))  # segment lengths
z[:, 5] = np.cumsum(z[:, 4])                 # cumulative distance

 

The code so far

  1. Start with a basic input of 9 points consisting of x and y values... it has a 'shape' property reflecting this.
  2. Make a 'container' to hold our results... we will call it 'z'.  To facilitate this, I made a numpy array filled with zeros, with 9 rows ('a' is the points, and there are 9 of them) and 9 columns.
  3. The next line fills columns 0 and 1 with the values in 'a'... in other words, the x and y coordinates.
  4. Stick with me... the following line fills rows 1 to the end with two columns of data representing the sequential differences in the coordinates.  z[1: ... means from the first row on, and ... 2:4] means fill the 2nd and 3rd columns (aka, the 'from, up to but not including' syntax).
  5. Calculate the segment distances/lengths.  I use einsum (Einstein summation).  It's a long story, I have blogged about it elsewhere.  Trust me, it is fast, works in multiple dimensions, and offers way more than distance calculations (see the check after this list).
  6. The last line does a cumulative summation of the distances.
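For the curious, the einsum call is just a compact (and fast) dot product of each difference row with itself... a quick check against the long way:

d1 = np.sqrt(np.einsum('ij,ij->i', dxdy, dxdy))
d2 = np.sqrt((dxdy**2).sum(axis=1))   # the long way
np.allclose(d1, d2)                   # True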

 

 

What to do now?

Lots of things. 

 

Basic geometric operations

shift, rotate, scale, skew

Derived geometric properties

interpoint distance (done), angles/directions/bearing, bounding boxes and way more

More later

 

But today, let's

 

Densify the points

To add to the basic code, we are going to take a leap.

The 7th column in the table above was labelled 'steps'.  That column was derived because I wanted to 'densify' the number of points between the input points.  In this example, we will use a 'dist' of 4 units.  This is the first line in the code below.

 

The 8th and 9th columns need some explaining.  I slipped that calculation into the initial code... dxdy, the sequential differences.

Between the first 2 points, (0, 0) and (0, 10), there is a 10 unit distance.  Divide that by 4 and the first densification will have 2.5 steps in distance terms.  In coordinate terms, we have a vertical line (right? check! so you understand).  Our increment is obviously not going to be the same in the X and Y direction, is it?  The X coordinate will increment by 0 and the Y coordinate will increment by 4 (look at the 2nd row deltaX, deltaY columns).  This calculation is done in the second line of the code below.
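To make that arithmetic concrete for the first segment:

length = 10.0                 # distance from (0, 0) to (0, 10)
dist = 4.0                    # the densification distance
steps = length / dist         # 2.5
np.array([0., 10.]) / steps   # array([0., 4.]) ... deltaX, deltaY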

 

Densify by distance

dist = 4.0                                   # ---- densify every 4 units
z[1:, 6] = steps = z[1:, 4] / dist           # steps to create
z[1:, 7:] = deltas = dxdy / (steps.reshape(-1, 1))
#
# ---- above, calculating the steps and densification
# ---- below, for later... calculating the new points
#
N = len(a) - 1                  # number of segments
pnts = np.empty((N,), dtype='O')
for i in range(N):              # cycle through the segments and make
    num = np.arange(steps[i])   # the new points
    pnts[i] = np.array((num, num)).T * deltas[i] + a[i]
a0 = a[-1].reshape(1, -1)       # tack on the final point
final = np.concatenate((*pnts, a0), axis=0)

 

So what does this produce?  Our focus now turns to the second half of the code above, from N = len(a) - 1 onward.

Sadly, we will be constructing arrays with 'jagged' shapes: object arrays, like geometry objects that don't have the same number of shapes, points or parts (more in another blog).  Basically, pnts = np.empty((N,), dtype='O') creates the container to save the results.  Each 'row' can contain an indeterminate number of objects, and each row's objects need not be the same length.
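A small sketch of such an object ('O' dtype) container:

jagged = np.empty((3,), dtype='O')   # 3 slots, each can hold any object
jagged[0] = np.zeros((3, 2))         # 3 points
jagged[1] = np.zeros((1, 2))         # 1 point... different lengths are fine
jagged[2] = np.zeros((2, 2))         # 2 points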

 

Inside the loop, np.arange(steps[i]) produces the steps from our previous calculation.  The np.arange function will truncate the decimal portion, resulting in whole-number steps (ie steps = 2.5 results in

 

np.arange(2.5)  # ---- array([0., 1., 2.])

 

The next line in the loop produces the points.  Essentially it is nothing more than the equation to calculate points along a line, given a starting point and the line's slope (or its equivalent).

 

The zero in our arange will be used to reproduce the first point, the 1 will be used to determine the position of the point 4 units up the Y axis, and the 2 will be for the point up 8 units.  That is.... (0, 0), (0, 4), and (0, 8).  The last point between (0, 0) and (0, 10) will be created during the next sequence.
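Putting those pieces together for the first segment:

num = np.arange(2.5)                          # array([0., 1., 2.])
np.array((num, num)).T * [0., 4.] + [0., 0.]  # deltas, then add the start point
# array([[0., 0.],
#        [0., 4.],
#        [0., 8.]])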

 

The points look like the following.

 

First sequence, 2nd, 3rd, 4th ... and so on

step (0) ...points
[[0. 0.]
 [0. 4.]
 [0. 8.]]

step (1) ...points
[[ 0. 10.]
 [ 4. 10.]
 [ 8. 10.]]

step (2) ...points
[[10. 10.]]

step (3) ...points
[[10.  8.]
 [ 6.  8.]]

 

The first 4 sequences show how the points are produced. 

We have already established that the first sequence should produce 3 points: the start point and two intermediate points, but excluding the last.

The 2nd sequence is similar, and the end point of the first sequence is also the first point of the second.

The 3rd sequence... it is too short for any intervening points (have a look, it has a length/distance of 2), so only the first point is produced.
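In arange terms, a length of 2 divided by a dist of 4 gives 0.5 steps:

np.arange(2.0 / 4.0)   # ---- array([0.]) ... only the first point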

... carry on if you need it using the following.

Produce your own results
for i, pnt in enumerate(pnts):
    print("\nstep ({}) ...points\n{}".format(i, pnt))

 

The final result is a sequence of points which represent the densified input pattern.

Points ... transposed for viewing.
final.T
array([[ 0.,  0.,  0.,  0.,  4.,  8., 10., 10.,  6.,  2.,  2.,  2.,  6., 10., 10.,  6.,  2.,  0.],
       [ 0.,  4.,  8., 10., 10., 10., 10.,  8.,  8.,  8.,  4.,  2.,  2.,  2.,  0.,  0.,  0.,  0.]])

 

Pretty much it.

Next steps... just produce the points, produce a polyline using the points, subdivide the points into segments of fixed and/or variable lengths, add a Z dimension to the results... whatever...
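As a teaser for the polyline step, a minimal sketch, assuming the SR spatial reference from the earlier section:

# ---- a sketch... build a polyline from the densified points (SR assumed)
pl = arcpy.Polyline(
    arcpy.Array([arcpy.Point(*xy) for xy in final]), SR)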